Random draw forest index structure for searching large scale unstructured data

US10949467B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10949467-B2
Application numberUS-201816044286-A
CountryUS
Kind codeB2
Filing dateJul 24, 2018
Priority dateMar 1, 2018
Publication dateMar 16, 2021
Grant dateMar 16, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

System and method of generating an index structure for indexing a plurality of unstructured data objects, including: generating a set of compact feature vectors, the set including a compact feature vector for each of the data objects, the compact feature vector for each data object including a sequence of hashed values that represent the data object; generating a plurality of twisted compact feature vector sets for each of set of compact feature vectors, each of the twisted compact feature vector sets being generated by applying a respective random shuffling permutation to the set of compact feature vectors; and for each twisted compact feature vector set, generating an index for the data objects in which the data objects are slotted based on sequences of hashed values in the twisted compact feature vector set.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method of generating a searchable index structure that indexes a plurality of data objects, comprising: for each data object: generating a compact feature vector for the data object, the compact feature vector including a sequence of hash values that represent the data object; shuffling the sequence of hash values included in the compact feature vector using a plurality of shuffling permutations to generate a plurality of shuffled sequences for the data object, each shuffled sequence for the data object including the hash values of the compact feature vector shuffled according to a respective one of the shuffling permutations; and indexing, based on the shuffled sequences, the data object in a plurality of index tables that each correspond to the respective one of the shuffling permutations; and storing the plurality of index tables as the searchable index structure which is searchable for candidate data objects that are similar to a query object using a shuffled query sequence of the query object generated using the shuffling permutations. 2. The method of claim 1 wherein each of the shuffling permutations is a random shuffling permutation that specifies a random order for the hash values of a respective shuffled sequence. 3. The method of claim 2 wherein the hash values are binary values, and each shuffling permutation includes a randomly generated sequence of shuffling values that each specify a sequence location for the hash values in the respective shuffled sequence. 4. The method of claim 1 further comprising: for each data object, performing feature extraction on the data object to generate a raw feature vector including a plurality of feature values of the data object; and wherein for each data object, generating the compact feature vector comprises hashing the raw feature vector for the data object, to generate the sequence of hash values that represent the data object. 5. The method of claim 4 wherein the hashing is a locality sensitive hashing (LSH) using approximate nearest neighbour (ANN) hashing functions. 6. The method of claim 1 wherein: the index table corresponding to each shuffling permutation is a tree structure comprising d-nodes and k-nodes; each d-node includes an array of slots each having a respective slot ID, at least some of the slots occupied with a pointer for either a k-node associated with the slot or a next level d-node; and each k-node includes a pointer for a corresponding one of the data objects, at least some of the k-nodes also including a pointer for a further k-node. 7. The method of claim 6 wherein, for each index table, each k-node is associated with a slot of a root d-node based on a first subsequence of the shuffled sequence for the k-node's corresponding data object generated using the shuffling permutation that the index table corresponds to. 8. The method of claim 7 wherein, for each index table, when a number of k-nodes associated with a slot of the root d-node exceeds a threshold, a next level d-node is added in the index table and associated with the slot of the root d-node, and each k-node associated with the slot of the root d-node is then associated with a slot of the next level d-node based on a second subsequence of the shuffled sequence for the k-node's corresponding data object generated using the shuffling permutation that the index table corresponds to. 9. The method of claim 1 further comprising performing a search of the plurality of data objects by: generating a compact query feature vector for a query object, the compact query feature vector including a sequence of hash values that represent the query object; shuffling the sequence of hash values using the plurality of shuffling permutations to generate a plurality of shuffled query sequences for the query object; and searching each index table based on the shuffled query sequence generated using the shuffling permutation that corresponds to the index table to identify candidate data objects that are similar to the query object. 10. A system for generating searchable index structure that indexes a plurality of data objects, comprising: one or more processing units; a system storage device coupled to each of the one or more processing units, the system storage device tangibly storing thereon executable instructions that, when executed by the one or more processing units, cause the one or more processing units to: generate a plurality of shuffling permutations that are each associated with a respective index table; for each data object in the plurality of data objects: generate a compact feature vector for the data object, the compact feature vector including a sequence of hash values that represent the data object, shuffle the sequence of hash values included in the compact feature vector using a plurality of shuffling permutations to generate a plurality of shuffled sequences for the data object, each shuffled sequence for the data object including the hash values of the compact feature vector shuffled according to a respective one of the shuffling permutations, and index, based on the shuffled sequences, the data object in a plurality of index tables that each correspond to a respective one of the shuffling permutations; and store, in the system storage device, the plurality of index tables as the searchable index structure which is searchable for candidate data objects that are similar to a query object using a shuffled query sequence of the query object generated using the shuffling permutations. 11. The system of claim 10 wherein each of the shuffling permutations is a random shuffling permutation that specifies a random order for the hash values of a respective shuffled sequence. 12. The system of claim 11 wherein the hash values are binary values, and each shuffling permutation includes a randomly generated sequence of shuffling values that each specify a sequence location for the hash values in the respective shuffled sequence. 13. The system claim 10 wherein the executable instructions, when executed by the one or more processing units, further cause the one or more processing units to: for each data object, perform feature extraction on the data object to generate a raw feature vector including a plurality of feature values of the data obiect; and wherein for each data object, the compact feature vector for the data object is generated by hashing the raw feature vector for the data object to generate the sequence of hash values. 14. The system of claim 13 wherein the hashing is a locality sensitive hashing (LSH) using approximate nearest neighbour (ANN) hashing functions. 15. The system of claim 10 wherein: each index table is a tree structure comprising d-nodes and k-nodes; each d-node includes an array of slots each having a respective slot ID, at least some of the slots occupied with a pointer for either a k-node associated with the slot or a next level d-node; and each k-node includes a pointer for a corresponding one of the data objects, at least some of the k-nodes also including a pointer for a further k-node. 16. The system of claim 15 wherein, for each index table, each k-node is associated with a slot of a root d-node based on a first subsequence of the shuffled sequence for the k-node's corresponding data object generated using the shuffling permutation associated with the index table. 17. The system of claim 16 wherein the executable instructions, when executed by the one or more processing units, further cause the one or more processing units to perform a s

Assignees

Inventors

Classifications

  • Trees · CPC title

  • Proximity, similarity or dissimilarity measures · CPC title

  • G06F16/906Primary

    Clustering; Classification · CPC title

  • Distances to closest patterns, e.g. nearest neighbour classification · CPC title

  • Matching criteria, e.g. proximity measures · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10949467B2 cover?
System and method of generating an index structure for indexing a plurality of unstructured data objects, including: generating a set of compact feature vectors, the set including a compact feature vector for each of the data objects, the compact feature vector for each data object including a sequence of hashed values that represent the data object; generating a plurality of twisted compact fe…
Who is the assignee on this patent?
Lu Yangdi, He Wenbo, Nabatchian Amirhosein, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F16/9027. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Mar 16 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).