Cluster-based near-duplicate document detection
US-2021360001-A1 · Nov 18, 2021 · US
US11810052B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11810052-B2 |
| Application number | US-202117390162-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jul 30, 2021 |
| Priority date | Jul 30, 2021 |
| Publication date | Nov 7, 2023 |
| Grant date | Nov 7, 2023 |
Official abstract text for this publication.
A computer system, and a method at a computer system, the method including applying a mapping function to a received message to create a characteristic value, wherein the mapping function is adapted to map similar messages to similar characteristic values; comparing the characteristic value to a value associated with each of a plurality of message extractors; determining that the characteristic value does not match any value associated with the plurality of message extractors; identifying at least one message extractor from the plurality of message extractors, the identifying determining that the value associated with the message extractor and the characteristic value from the received message, when compared, satisfy a similarity criterion; and using the identified at least one message extractor to extract information from the received message.
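The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not name a mapping function, so a SimHash-style fingerprint stands in for it, Hamming distance stands in for the similarity criterion, and the names `fingerprint`, `MessageExtractor`, `select_extractor`, `extract_order_id`, and the `max_distance` default are all hypothetical.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

BITS = 64  # fingerprint width; an arbitrary choice for this sketch


def fingerprint(message: str) -> int:
    """SimHash-style mapping (assumed stand-in for the patent's mapping function).

    Messages that share most of their tokens tend to get fingerprints
    that differ in only a few bits.
    """
    weights = [0] * BITS
    for token in re.findall(r"\w+", message.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(BITS):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


@dataclass
class MessageExtractor:
    name: str
    value: int  # characteristic value the extractor was built for
    extract: Callable[[str], Dict[str, str]]


def select_extractor(
    message: str,
    extractors: List[MessageExtractor],
    max_distance: int = 3,  # hypothetical threshold
) -> Optional[MessageExtractor]:
    """Return an exact match if one exists, else the nearest extractor within the threshold."""
    characteristic_value = fingerprint(message)
    for extractor in extractors:
        if extractor.value == characteristic_value:
            return extractor
    if not extractors:
        return None
    # No exact match: fall back to the similarity criterion.
    nearest = min(
        extractors, key=lambda e: hamming_distance(e.value, characteristic_value)
    )
    if hamming_distance(nearest.value, characteristic_value) <= max_distance:
        return nearest
    return None


def extract_order_id(message: str) -> Dict[str, str]:
    """Toy extractor: pulls an order identifier such as 'Order #A123'."""
    match = re.search(r"Order\s+#(\w+)", message)
    return {"order_id": match.group(1)} if match else {}
```

A caller would pass the received message and the list of known extractors to `select_extractor`, then call `extract` on the result if it is not `None`. The threshold trades coverage against the risk of applying an extractor to a message it was not built for.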
Opening claim text (preview).
The invention claimed is:
1. A method at a computer system, the method comprising: applying a mapping function to a received message to create a characteristic value, wherein the mapping function is adapted to map similar messages to similar characteristic values; comparing the characteristic value to a value associated with each of a plurality of message extractors; determining that the characteristic value does not match any value associated with the plurality of message extractors; identifying at least one message extractor from the plurality of message extractors, the identifying determining that the value associated with the message extractor and the characteristic value from the received message, when compared, satisfy a similarity criterion, wherein the similarity criterion includes a similarity index being within a threshold distance from the characteristic value; and using the identified at least one message extractor to extract information from the received message.
2. The method of claim 1, further comprising checking information extracted against known information field values to verify the identified at least one message extractor is correctly extracting information.
3. The method of claim 1, wherein the received message is an email message and wherein the elements include Hypertext Markup Language (HTML) elements from the email message extracted using XPaths.
4. The method of claim 1, wherein the characteristic value comprises a fixed length array, and the similarity index is created by correlating elements from the fixed length array with a fixed length array associated with each of the plurality of message extractors.
5. The method of claim 1, wherein the identifying further comprises using information from within the received message to identify at least one message extractor.
6. The method of claim 5, wherein the information comprises at least one of a merchant name, a sender address, a product name, a shipper, or an identifier format in the received message.
7. The method of claim 1, wherein the identifying further uses a volume increase or decrease of messages associated with a characteristic value to identify the at least one message extractor.
8. The method of claim 1, wherein the using further comprises: creating a quality score for information extracted from the received message; and extracting information when the quality score exceeds a quality threshold.
9. The method of claim 1, wherein the using further comprises: creating a quality score for information extracted from the received message; and referring the message to one of an operator of a commerce platform or a receiving entity to verify information within the received message.
10. A computer system comprising: a processor; and a communications subsystem, wherein the computer system is configured to: apply a mapping function to a received message received through the communications subsystem to create a characteristic value, wherein the mapping function is adapted to map similar messages to similar characteristic values; compare the characteristic value to a value associated with each of a plurality of message extractors; determine that the characteristic value does not match any value associated with the plurality of message extractors; identify at least one message extractor from the plurality of message extractors, the identifying determining that the value associated with the message extractor and the characteristic value from the received message, when compared, satisfy a similarity criterion, wherein the similarity criterion includes a similarity index being within a threshold distance from the characteristic value; and use the identified at least one message extractor to extract information from the received message.
11. The computer system of claim 10, wherein the computer system is further configured to check information extracted against known information field values to verify the identified at least one message extractor is correctly extracting information.
12. The computer system of claim 10, wherein the received message is an email message and wherein the elements include Hypertext Markup Language (HTML) elements from the email message extracted using XPaths.
13. The computer system of claim 10, wherein the characteristic value comprises a fixed length array, and the similarity index is created by correlating elements from the fixed length array with a fixed length array associated with each of the plurality of message extractors.
14. The computer system of claim 10, wherein the computer system is further configured to identify by using information from within the received message to identify at least one message extractor.
15. The computer system of claim 14, wherein the information comprises at least one of a merchant name, a sender address, a product name, a shipper, or an identifier format in the received message.
16. The computer system of claim 10, wherein the computer system is further configured to identify by using a volume increase or decrease of messages associated with a characteristic value to identify the at least one message extractor.
17. The computer system of claim 10, wherein the computer system is further configured to use the identified at least one message extractor by: creating a quality score for information extracted from the received message; and extracting information when the quality score exceeds a quality threshold.
18. The computer system of claim 10, wherein the computer system is further configured to use the identified at least one message extractor by: creating a quality score for information extracted from the received message; and referring the message to one of an operator of a commerce platform or a receiving entity to verify information within the received message.
19. A non-transitory computer readable medium for storing instruction code, which, when executed by a processor of a computer system cause the computer system to: apply a mapping function to a received message received through the communications subsystem to create a characteristic value, wherein the mapping function is adapted to map similar messages to similar characteristic values; compare the characteristic value to a value associated with each of a plurality of message extractors; determine that the characteristic value does not match any value associated with the plurality of message extractors; identify at least one message extractor from the plurality of message extractors, the identifying determining that the value associated with the message extractor and the characteristic value from the received message, when compared, satisfy a similarity criterion, wherein the similarity criterion includes a similarity index being within a threshold distance from the characteristic value; and use the identified at least one message extractor to extract information from the received message.
20. The non-transitory computer readable medium of claim 19, wherein the instruction code further cause the computer system to: check information extracted against known information field values to verify the identified at least one message extractor is correctly extracting information.
21. The non-transitory computer readable medium of claim 19, wherein the received message is an email message and wherein the elements include Hypertext Markup Language (HTML) elements from the email message extracted using Xpaths.
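Two of the dependent claims lend themselves to a short illustration: the fixed-length-array correlation of claims 4 and 13, and the quality-score gate of claims 8, 9, 17, and 18. The sketch below is one possible reading under explicit assumptions. The claims do not specify a correlation measure or a scoring rule, so Pearson correlation and a "fraction of required fields found" score are stand-ins, and the names `correlation`, `quality_score`, `accept_or_refer`, and the 0.8 threshold are hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence, Tuple


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length arrays (assumed similarity index).

    Returns 0.0 when either array is constant, since the correlation
    is undefined in that case.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("arrays must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return covariance / sqrt(var_a * var_b)


def quality_score(extracted: Dict[str, str], required_fields: Sequence[str]) -> float:
    """Fraction of required fields present with a non-empty value (assumed scoring rule)."""
    if not required_fields:
        return 0.0
    found = sum(1 for field in required_fields if extracted.get(field))
    return found / len(required_fields)


def accept_or_refer(
    extracted: Dict[str, str],
    required_fields: Sequence[str],
    quality_threshold: float = 0.8,  # hypothetical threshold
) -> Tuple[str, float]:
    """Accept the extraction when the score exceeds the threshold, else refer it for review."""
    score = quality_score(extracted, required_fields)
    decision = "accept" if score > quality_threshold else "refer"
    return decision, score
```

Treating referral as the fallback when the threshold is not met is an interpretive choice; the claims recite scoring and referral without stating that one follows from the other.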
Tracking · CPC title
Indexing; Data structures therefor; Storage structures · CPC title
Query processing · CPC title
Computer-aided management of electronic mailing [e-mailing] · CPC title
Mailbox-related aspects, e.g. synchronisation of mailboxes · CPC title