Data storage and retrieval system using online supervised hashing

US10990626B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10990626-B2
Application numberUS-201615754904-A
CountryUS
Kind codeB2
Filing dateSep 23, 2016
Priority dateSep 24, 2015
Publication dateApr 27, 2021
Grant dateApr 27, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A data storage and retrieval system employs online supervised hashing for indexing a data set and retrieving data items therefrom. A hash-based mapping is used to generate hash codes for indexing content items. Data items may be retrieved based on either/both a query label (using corresponding codewords) and the content item itself (using the hash codes). The hash-based mapping is updated using an objective function of distance between the hash codes and respective codewords for labels of labelled content items, preserving semantic similarities of content items. The codewords may be error-correcting codes. Techniques for efficiently updating the index include (1) cycle-based updating and ternary codewords, and (2) reservoir sample-based method of determining when to trigger an update.

First claim

Opening claim text (preview).

What is claimed is: 1. A data storage and retrieval system, comprising: a storage subsystem for storing a dataset of content items, at least some of the content items of the dataset being labelled content items having respective labels describing respective aspects of the content items; and a processing subsystem coupled to the storage subsystem and having a communications interface coupled to receive additions to the dataset and queries of the dataset, the processing subsystem being configured and operative to: maintain an index of the content items using hash codes and predetermined codewords as index values, the hash codes being dynamically generated by an online-learned hash-based mapping, the codewords representing respective labels of the labelled content items of the dataset, the codewords being error-correcting codes protecting against errors in the hash-based mapping and promoting selectivity of the hash codes produced by the hash-based mapping; maintain and dynamically update the hash-based mapping, and use the hash-based mapping to generate the hash codes as content items of the dataset are indexed, the hash-based mapping preserving semantic similarities of content items, the hash-based mapping being updated using an objective function of distance between the hash codes produced by the hash-based mapping and respective codewords for the labelled content items; and search the index to retrieve stored content items in response to the queries received via the communications interface, including (i) using the hash-based mapping to calculate a query hash code from a content item of the query, and (ii) using the query hash code to retrieve stored content items indexed by similar codewords or hash codes; wherein the processing subsystem performs a randomized reservoir sampling method that (i) maintains a reservoir sampling set of randomly selected content items and associated labels as they are added to the dataset during operation, (ii) uses the reservoir sampling set to determine when to update the index, and (iii) updates the index at determined times. 2. The data storage and retrieval system of claim 1 , wherein to maintain the hash-based mapping the processing subsystem performs a calculation minimizing an upper bound on a Hamming loss between the hash codes and the codewords. 3. The data storage and retrieval system of claim 1 , wherein to determine when to update the index, the processing subsystem performs a calculation comparing a measure of an amount of change of the hash codes of the reservoir to a predetermined threshold value. 4. The data storage and retrieval system of claim 3 , wherein the threshold is a dynamic threshold adapted based on the number of labels observed in preceding operation. 5. The data storage and retrieval system of claim 1 , wherein to maintain the hash-based mapping the processing subsystem evaluates a regularizer term in addition to the objective function of distance, the regularizer term encoding a measure of divergence between a current hash-based mapping and a candidate new hash-based mapping, to promote selection of a new hash-based mapping requiring less updating of the index than other candidate new hash-based mappings. 6. The data storage and retrieval system of claim 5 , wherein the processing subsystem uses the reservoir sampling set to evaluate the regularizer term. 7. The data storage and retrieval system of claim 6 , wherein the processing subsystem also uses the reservoir sampling set to determine when to update the index, and updates the index at determined times. 8. The data storage and retrieval system of claim 7 , wherein to determine when to update the index, the processing subsystem performs a calculation comparing a measure of an amount of change of the hash codes of the reservoir to a predetermined threshold value. 9. The data storage and retrieval system of claim 8 , wherein the threshold is a dynamic threshold adapted based on the number of labels observed in preceding operation. 10. The data storage and retrieval system of claim 1 , wherein to maintain the index and maintain the hash-based mapping, the processing subsystem performs ongoing processes divided into cycles during each of which a number of new labels are learned and corresponding content items are stored, both the hash-based mapping and the index being divided into sections corresponding to the cycles during which respective hash functions and codewords of the sections were first created and stored therein, the processes including, at a time of updating the hash-based mapping and the index based on adding a new labelled content item to the dataset: identifying respective sections of the hash-based mapping and index structure corresponding to a codeword for a label of the labelled content item; and updating the identified sections. 11. The data storage and retrieval system of claim 10 , wherein the codewords comprise ternary symbols including an inactive value, and wherein, to update the identified sections, the processing subsystem also updates other sections of the index structure by appending a series of symbols having the inactive value to existing symbols of existing hash codes. 12. A method of operating a data storage and retrieval system storing a dataset of content items, at least some of the content items of the dataset being labelled content items having respective labels describing respective aspects of the content items, the data storage and retrieval system having a communications interface for receiving additions to the dataset and queries of the dataset, comprising: maintaining an index of the content items using hash codes and predetermined codewords as index values, the hash codes being dynamically generated by an online-learned hash-based mapping, the codewords representing respective labels of the labelled content items of the dataset, the codewords being error-correcting codes protecting against errors in the hash-based mapping and promoting selectivity of the hash codes produced by the hash-based mapping; maintaining and dynamically updating the hash-based mapping, and using the hash-based mapping to generate the hash codes as content items of the dataset are indexed, the hash-based mapping preserving semantic similarities of content items, the hash-based mapping being updated using an objective function of distance between the hash codes produced by the hash-based mapping and respective codewords for the labelled content items; searching the index to retrieve stored content items in response to the queries received via the communications interface, including (i) using the hash-based mapping to calculate a query hash code from a content item of the query, and (ii) using the query hash code to retrieve stored content items indexed by similar codewords or hash codes; and randomized reservoir sampling that includes (i) maintains a reservoir sampling set of randomly s elected content items and associated labels as they are added to the dataset during operation, (ii) uses the reservoir sampling set to determine when to update the index, and (iii) updates the index at determined times. 13. The method of claim 12 , wherein maintaining the hash-based mapping includes minimizing an upper bound on a Hamming loss between the hash codes and the codewords. 14. The method of claim 12 , further including, to determine when to update the index, performing a calculation comparing a measure of an amount of change of the hash codes of the reservoir to a predetermined threshold value. 15. The method of claim 14 , wherein the threshold is a dynamic threshold adapted based on the number of labe

Assignees

Inventors

Classifications

  • Error detection codes other than CRC and single parity bit codes · CPC title

  • Protection against loss of memory contents {(contains no material, see G06F11/00)} · CPC title

  • Methods or arrangements for processing data by operating upon the order or content of the data handled (logic circuits H03K19/00) · CPC title

  • Query processing · CPC title

  • hash tables · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10990626B2 cover?
A data storage and retrieval system employs online supervised hashing for indexing a data set and retrieving data items therefrom. A hash-based mapping is used to generate hash codes for indexing content items. Data items may be retrieved based on either/both a query label (using corresponding codewords) and the content item itself (using the hash codes). The hash-based mapping is updated using…
Who is the assignee on this patent?
Univ Boston
What technology area does this patent fall under?
Primary CPC classification G06F16/9014. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 27 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).