Incremental maintenance of inverted indexes for approximate string matching

US9514172B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9514172-B2
Application numberUS-201213595270-A
CountryUS
Kind codeB2
Filing dateAug 27, 2012
Priority dateJun 10, 2009
Publication dateDec 6, 2016
Grant dateDec 6, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding an index, and a query may be answered with assurances that the results are accurate or within a threshold of accuracy.

First claim

Opening claim text (preview).

We claim: 1. A method for limiting updates of inverted indexes in a database, the method comprising: calculating a first inverse document frequency for each n-gram in a plurality of strings; calculating a first length for each string in the plurality of strings; for each n-gram, creating an inverted list comprising strings containing the n-gram; sorting each inverted list by lengths of the strings in the inverted list; receiving an update to the plurality of strings; calculating a second inverse document frequency of n-grams affected by the update; for each respective n-gram affected by the update: calculating an error for the respective n-gram based on the second inverse document frequency; and in response to determining that the error for the respective n-gram satisfies a predefined error threshold for the respective n-gram, wherein the predefined error threshold for the respective n-gram is determined based on a frequency of the respective n-gram in the plurality of strings: calculating a second length for each string affected by the update, and re-sorting each inverted list affected by the update based on the second length. 2. The method of claim 1 , further comprising: updating each inverted list affected by the update to generate a respective updated affected inverted list. 3. The method of claim 2 , further comprising: answering a query based on the respective updated affected inverted list. 4. The method of claim 1 , wherein the update is an addition of a new string. 5. The method of claim 1 , wherein the update is a deletion of a portion of an existing string. 6. The method of claim 1 , wherein the update is a modification of a portion of a string. 7. The method of claim 1 , wherein each update is buffered and propagated in a batch at regular time intervals. 8. An apparatus comprising: a processor; and a memory to store computer program instructions, the computer program instructions when executed on the processor, cause the processor to perform operations comprising: calculating a first inverse document frequency for each n-gram in a plurality of strings; calculating a first length for each string in the plurality of strings; for each n-gram, creating an inverted list comprising strings containing the n-gram; sorting each inverted list by lengths of the strings in the inverted list; receiving an update to the plurality of strings; calculating a second inverse document frequency of n-grams affected by the update; for each respective n-gram affected by the update: calculating an error for the respective n-gram based on the second inverse document frequency; and in response to determining that the error for the respective n-gram satisfies a predefined error threshold for the respective n-gram, wherein the predefined error threshold for the respective n-gram is determined based on a frequency of the respective n-gram in the plurality of strings: calculating a second length for each string affected by the update, and re-sorting each inverted list affected by the update based on the second length. 9. The apparatus of claim 8 , wherein the update is an addition of a new string. 10. The apparatus of claim 8 , wherein the update is a deletion of a portion of a string. 11. The apparatus of claim 8 , wherein the update is a modification of a portion of a string. 12. The apparatus of claim 8 , wherein each update is buffered and propagated in a batch at regular time intervals. 13. A non-transitory computer readable storage device storing computer program instructions, which, when executed on a processor, cause the processor to perform operations comprising: calculating a first inverse document frequency for each n-gram in a plurality of strings; calculating a first length for each string in the plurality of strings; for each n-gram, creating an inverted list comprising strings containing the n-gram; sorting each inverted list by lengths of the strings in the inverted list; receiving an update to the plurality of strings; calculating a second inverse document frequency of n-grams affected by the update; for each respective n-gram affected by the update: calculating an error for the respective n-gram based on the second inverse document frequency; and in response to determining that the error for the respective n-gram satisfies a predefined error threshold for the respective n-gram, wherein the predefined error threshold for the respective n-gram is determined based on a frequency of the respective n-gram in the plurality of strings: calculating a second length for each string affected by the update, and re-sorting each inverted list affected by the update based on the second length. 14. The non-transitory computer readable storage device of claim 13 , wherein the update is an addition of a new string. 15. The non-transitory computer readable storage device of claim 13 , wherein the update is a deletion of a portion of a string. 16. The non-transitory computer readable storage device of claim 13 , wherein the update is a modification of a portion of a string. 17. The non-transitory computer readable storage device of claim 13 , wherein each update is buffered and propagated in a batch at regular time intervals.

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9514172B2 cover?
In embodiments of the disclosed technology, indexes, such as inverted indexes, are updated only as necessary to guarantee answer precision within predefined thresholds which are determined with little cost in comparison to the updates of the indexes themselves. With the present technology, a batch of daily updates can be processed in a matter of minutes, rather than a few hours for rebuilding a…
Who is the assignee on this patent?
Hadjieleftheriou Marios, Koudas Nick, Srivastava Divesh, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F17/30336. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Dec 06 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).