Information storage device and method for deduplication
US-2017300235-A1 · Oct 19, 2017 · US
US10885121B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10885121-B2 |
| Application number | US-201715840670-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 13, 2017 |
| Priority date | Dec 13, 2017 |
| Publication date | Jan 5, 2021 |
| Grant date | Jan 5, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Methods and systems for searching for similar documents include comparing an input index of a requested document to one or more stored indices for respective stored documents to produce a similarity score for each of the stored documents. Each index indicates which of a plurality of queries matched a respective document. The stored documents are filtered to remove dissimilar documents based on a comparison of each respective similarity score to a threshold. A list of any stored documents that remain after said filtering is output.
Opening claim text (preview).
What is claimed is: 1. A method for searching for similar documents, comprising: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 2. The method of claim 1 , further comprising generating a respective stored index for each of the stored documents based on the plurality of queries. 3. The method of claim 1 , wherein comparing the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 4. The method of claim 3 , wherein comparing the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 5. The method of claim 4 , wherein filtering the stored documents comprises removing documents having a similarity score above the threshold. 6. The method of claim 1 , wherein the queries comprise text strings selected from a predetermined text string collection. 7. A non-transitory computer readable storage medium comprising a computer readable program for searching for similar documents, wherein the computer readable program when executed on a computer causes the computer to perform the steps of: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 8. A system for searching for similar documents, comprising: a user interface configured to receive a requested document and to display similar documents; an index module configured to generate an input index for the requested document based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; a database configured to store a plurality of documents and a respective index for each stored document, each index indicating which of a plurality of queries matched a respective document; a similarity module comprising a processor configured to compare an input index of the requested document to one or more stored indices for respective stored documents; and a search module configured to filter the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold and to output the list of any stored documents that remain after said filtering for display as the similar documents. 9. The system of claim 8 , wherein the index module is further configured to generate a respective stored index for each of the stored documents based on the plurality of queries. 10. The system of claim 8 , wherein the similarity module is further configured to compare the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 11. The system of claim 10 , wherein the similarity module is further configured to compare the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 12. The system of claim 11 , wherein the search module is further configured to remove documents having a similarity score above the threshold. 13. The system of claim 8 , wherein the queries comprise text strings selected from a predetermined text string collection.
Indexing; Data structures therefor; Storage structures · CPC title
Document management systems · CPC title
Presentation of query results · CPC title
Vectors, bitmaps or matrices · CPC title
using ranking · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.