Hashing techniques for data set similarity determination

US9311403B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-9311403-B1
Application numberUS-201113162061-A
CountryUS
Kind codeB1
Filing dateJun 16, 2011
Priority dateJun 16, 2010
Publication dateApr 12, 2016
Grant dateApr 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods, systems and computer program product embodiments for hashing techniques for determining similarity between data sets are described herein. A method embodiment includes, initializing a random number generator with a weighted min-hash value as a seed, wherein the weighted min-hash value approximates a similarity distance between data sets. A number of bits in the weighted min-hash value is determined by uniformly sampling an integer bit value using the random number generator. A system embodiment includes a repository configured to store a plurality of data sets and a hash generator configured to generate weighted min-hash values from the data sets. The system further includes a similarity determiner configured to determine a similarity between the data sets.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer implemented method of improving similarity distance approximation between data sets, comprising: initializing, using one or more processors, a random number generator with a weighted rain-hash value as a seed, wherein the weighted min-hash value approximates the similarity distance between two or more data sets; uniformly sampling, using the one or more processors, an integer bit value using the random number generator to determine a first number of bits used to represent a single weighted min-hash value; determining, using the one or more processors, an adjusted number of bits to sample based on distance estimate variances of a total number of bits to sample; and varying, using the one or more processors, the determined first number of bits used to represent the single, weighted min-hash value to determine a least number of bits capable of being used to represent the single weighted rain-hash value while achieving a target weighted rain-hash accuracy, wherein the weighted min-hash accuracy specifies a similarity distance approximation between the data sets; and adjusting a number of bits used to represent the single weighted rain-hash value based on the determined least number of bits. 2. The method of claim 1 , wherein the sampling minimizes inconsistency between the similarity distance and the approximation induced by the Hamming distance between bit samples. 3. The method of claim 1 , further comprising: selecting a plurality hash subsets from a plurality of weighted rain-hash values. 4. The method of claim 3 , wherein the selecting comprises: selecting the hash subsets using randomized hash selection. 5. The method of claim 3 , further comprising: adding hash subsets associated with the highest similarity distance approximation to a set of selected hashes. 6. An article of manufacture including a non-transitory computer-readable medium having instructions stored thereon that, when executed by a processing device, cause the processing device to perform operations comprising: initializing, using one or more processors, a random number generator with a weighted min-hash value as a seed, wherein the weighted min-hash value approximates a similarly distance between two or more data sets; uniformly sampling, using the one or more processors, an integer bit value using the random number generator to determine a first number of bits used to represent a single weighted min-hash value; determining, using the one or more processors, an adjusted number of bits to sample based on distance estimate variances of a total number of bits to sample; and varying, using the one or more processors, the determined first number of bits used to represent a single weighted min-hash value to determine a least number of bits capable of being used to represent, the single weighted rain-hash value while achieving a target weighted min-hash accuracy, wherein the weighted rain-hash accuracy specifies a similarity distance approximation between the data sets; and adjusting a number of bits used to represent the single weighted min-hash value based on the determined least number of bits. 7. The article of manufacture of claim 6 , wherein the sampling minimizes a Hamming distance between the data sets. 8. The article of manufacture of claim 6 , the operations further comprising: selecting a plurality hash subsets from a plurality of weighted rain-hash values. 9. The article of manufacture of claim 8 , the selecting comprising: selecting the hash subsets using randomized hash selection. 10. The article of manufacture of claim 8 , the operations further comprising: adding hash subsets with associated with the highest similarity distance approximation to a set of selected hashes. 11. A computer implemented method of improving similarity distance approximation between data sets, comprising: determining, using one or more processors, a number of bits to sample for a weighted min-hash value; initializing, using the one or more processors, a random number generator with the weighted min-hash value as a seed, wherein the weighted min-hash value approximates a similarity distance between two or more data sets; uniformly sampling, using the one or more processors, an integer bit value from numbers generated by the random number generator, wherein the sampled integer bit value comprises the determined number of bits to sample, wherein the sampling is performed in deterministic constant time, and wherein deterministic constant time sampling comprises pre-computing estimators of a distance between the two or more data sets based on random values; and determining a number of bits to sample based on a distance estimate variance of a total number of bits available and the determined number of bits; and varying, using the one or more processors, the determined first number of bits used to represent a single weighted min-hash value to determine a least number of bits capable of being used to represent the single weighted rain-hash value while achieving a target weighted min-hash accuracy, wherein the weighted min-hash accuracy specifies a similarity distance approximation between the data sets; and adjusting a number of bits used to represent the single weighted rain-hash value based on the determined least number of bits. 12. The method of claim 11 , further comprising determining a Hamming similarity of the determined number of bits to sample for the weighted min-hash value. 13. The method of claim 11 , wherein the one or more bits are sampled from one or more images.

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • using colour · CPC title

  • using metadata automatically derived from the content · CPC title

  • G06F16/951Primary

    Indexing; Web crawling techniques · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9311403B1 cover?
Methods, systems and computer program product embodiments for hashing techniques for determining similarity between data sets are described herein. A method embodiment includes, initializing a random number generator with a weighted min-hash value as a seed, wherein the weighted min-hash value approximates a similarity distance between data sets. A number of bits in the weighted min-hash value …
Who is the assignee on this patent?
Ioffe Sergey, Google Inc
What technology area does this patent fall under?
Primary CPC classification G06F17/30864. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 12 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).