Fast filtering for similarity searches on indexed data

US10885121B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10885121-B2
Application numberUS-201715840670-A
CountryUS
Kind codeB2
Filing dateDec 13, 2017
Priority dateDec 13, 2017
Publication dateJan 5, 2021
Grant dateJan 5, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for searching for similar documents include comparing an input index of a requested document to one or more stored indices for respective stored documents to produce a similarity score for each of the stored documents. Each index indicates which of a plurality of queries matched a respective document. The stored documents are filtered to remove dissimilar documents based on a comparison of each respective similarity score to a threshold. A list of any stored documents that remain after said filtering is output.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for searching for similar documents, comprising: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 2. The method of claim 1 , further comprising generating a respective stored index for each of the stored documents based on the plurality of queries. 3. The method of claim 1 , wherein comparing the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 4. The method of claim 3 , wherein comparing the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 5. The method of claim 4 , wherein filtering the stored documents comprises removing documents having a similarity score above the threshold. 6. The method of claim 1 , wherein the queries comprise text strings selected from a predetermined text string collection. 7. A non-transitory computer readable storage medium comprising a computer readable program for searching for similar documents, wherein the computer readable program when executed on a computer causes the computer to perform the steps of: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 8. A system for searching for similar documents, comprising: a user interface configured to receive a requested document and to display similar documents; an index module configured to generate an input index for the requested document based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; a database configured to store a plurality of documents and a respective index for each stored document, each index indicating which of a plurality of queries matched a respective document; a similarity module comprising a processor configured to compare an input index of the requested document to one or more stored indices for respective stored documents; and a search module configured to filter the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold and to output the list of any stored documents that remain after said filtering for display as the similar documents. 9. The system of claim 8 , wherein the index module is further configured to generate a respective stored index for each of the stored documents based on the plurality of queries. 10. The system of claim 8 , wherein the similarity module is further configured to compare the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 11. The system of claim 10 , wherein the similarity module is further configured to compare the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 12. The system of claim 11 , wherein the search module is further configured to remove documents having a similarity score above the threshold. 13. The system of claim 8 , wherein the queries comprise text strings selected from a predetermined text string collection.

Assignees

Inventors

Classifications

  • Indexing; Data structures therefor; Storage structures · CPC title

  • G06F16/93Primary

    Document management systems · CPC title

  • Presentation of query results · CPC title

  • Vectors, bitmaps or matrices · CPC title

  • using ranking · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10885121B2 cover?
Methods and systems for searching for similar documents include comparing an input index of a requested document to one or more stored indices for respective stored documents to produce a similarity score for each of the stored documents. Each index indicates which of a plurality of queries matched a respective document. The stored documents are filtered to remove dissimilar documents based on …
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 05 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).