What technology area does this patent fall under?

Primary CPC classification G06F16/93. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Jan 05 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Fast filtering for similarity searches on indexed data

US10885121B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10885121-B2
Application number	US-201715840670-A
Country	US
Kind code	B2
Filing date	Dec 13, 2017
Priority date	Dec 13, 2017
Publication date	Jan 5, 2021
Grant date	Jan 5, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and systems for searching for similar documents include comparing an input index of a requested document to one or more stored indices for respective stored documents to produce a similarity score for each of the stored documents. Each index indicates which of a plurality of queries matched a respective document. The stored documents are filtered to remove dissimilar documents based on a comparison of each respective similarity score to a threshold. A list of any stored documents that remain after said filtering is output.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for searching for similar documents, comprising: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 2. The method of claim 1 , further comprising generating a respective stored index for each of the stored documents based on the plurality of queries. 3. The method of claim 1 , wherein comparing the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 4. The method of claim 3 , wherein comparing the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 5. The method of claim 4 , wherein filtering the stored documents comprises removing documents having a similarity score above the threshold. 6. The method of claim 1 , wherein the queries comprise text strings selected from a predetermined text string collection. 7. A non-transitory computer readable storage medium comprising a computer readable program for searching for similar documents, wherein the computer readable program when executed on a computer causes the computer to perform the steps of: generating an input index for a requested document, based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; comparing the input index of the requested document to one or more stored indices for respective stored documents, to produce a similarity score for each of the stored documents; filtering the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold; and outputting the list of any stored documents that remain after said filtering. 8. A system for searching for similar documents, comprising: a user interface configured to receive a requested document and to display similar documents; an index module configured to generate an input index for the requested document based on a plurality of queries, each query including one or more text strings, wherein the input index records which queries have at least one text string from the one or more text strings that is present in the input document, wherein the one or more text strings of each of the plurality of queries have a summed probability of matching 50% or fewer of the stored documents; a database configured to store a plurality of documents and a respective index for each stored document, each index indicating which of a plurality of queries matched a respective document; a similarity module comprising a processor configured to compare an input index of the requested document to one or more stored indices for respective stored documents; and a search module configured to filter the stored documents to remove dissimilar documents from a list of the stored documents based on a comparison of each respective similarity score to a threshold and to output the list of any stored documents that remain after said filtering for display as the similar documents. 9. The system of claim 8 , wherein the index module is further configured to generate a respective stored index for each of the stored documents based on the plurality of queries. 10. The system of claim 8 , wherein the similarity module is further configured to compare the input index to the one or more stored indices comprises performing an XOR operation between the input index and each stored index. 11. The system of claim 10 , wherein the similarity module is further configured to compare the input index to the one or more stored indices further comprises counting a number of ‘1’ bits in an output of each XOR operation to produce the similarity score for each of the stored documents. 12. The system of claim 11 , wherein the search module is further configured to remove documents having a similarity score above the threshold. 13. The system of claim 8 , wherein the queries comprise text strings selected from a predetermined text string collection.

Assignees

Inventors

Takuma Daisuke

Classifications

G06F16/41
Indexing; Data structures therefor; Storage structures · CPC title
G06F16/93Primary
Document management systems · CPC title
G06F16/248
Presentation of query results · CPC title
G06F16/2237
Vectors, bitmaps or matrices · CPC title
G06F16/24578
using ranking · CPC title

Patent family

Related publications grouped by family.

View patent family 66696162

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10885121B2 cover?: Methods and systems for searching for similar documents include comparing an input index of a requested document to one or more stored indices for respective stored documents to produce a similarity score for each of the stored documents. Each index indicates which of a plurality of queries matched a respective document. The stored documents are filtered to remove dissimilar documents based on …
Who is the assignee on this patent?: IBM
What technology area does this patent fall under?: Primary CPC classification G06F16/93. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Jan 05 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).

How to read this patent

Abstract

First claim

Assignees

Inventors

Classifications

Patent family

External sources

Related patents

Information storage device and method for deduplication

Machine learned query generation on inverted indices

Range searches for database systems

Controlling access based on recognition of a user

Frequently asked questions