Method for approximating similarity between objects

US11314598B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11314598-B2
Application number: US-201815964527-A
Country: US
Kind code: B2
Filing date: Apr 27, 2018
Priority date: Apr 27, 2018
Publication date: Apr 26, 2022
Grant date: Apr 26, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems and methods for determining similarity between sets of objects are disclosed. A set of hashes are generated for a set of objects. A similarity vector is generated for the set of hashes. The similarity vector is a compact representation of the set of hashes and of the corresponding set of objects. The similarity of the set of objects is determined by comparing the similarity vector of the set of objects with other similarity vectors. In a data protection system, the set of objects can be placed with the node or system that stores objects that are most similar to the set of objects being placed.
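
The hashing and averaging steps in the abstract can be illustrated with a short Python sketch. It is not the patented implementation. The hash function (SHA-1), the choice of one byte per vector entry, and the names `hash_to_vector` and `similarity_vector` are all assumptions made for illustration. The claims only require that each entry correspond to a portion of the hash and that each similarity-vector entry include a mean of a column.

```python
import hashlib
from statistics import fmean
from typing import Iterable, List


def hash_to_vector(obj: bytes, n: int = 20) -> List[int]:
    """Hash one object and represent the hash as a vector of n entries.

    Assumption: SHA-1 is used and each entry is one byte of the digest.
    """
    digest = hashlib.sha1(obj).digest()  # 20 bytes
    if not 1 <= n <= len(digest):
        raise ValueError("n must be between 1 and the digest length")
    return list(digest[:n])


def similarity_vector(objects: Iterable[bytes], n: int = 20) -> List[float]:
    """Build the matrix of hash vectors (one row per object) and
    reduce each of the n columns to its mean."""
    matrix = [hash_to_vector(obj, n) for obj in objects]
    if not matrix:
        raise ValueError("at least one object is required")
    # zip(*matrix) iterates over the columns of the matrix.
    return [fmean(column) for column in zip(*matrix)]


if __name__ == "__main__":
    objects = [b"chunk-1", b"chunk-2", b"chunk-3"]
    print(similarity_vector(objects, n=4))
```

Whatever the number of objects in the set, the result has n entries, which is what makes it a compact representation that is cheap to compare.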

First claim

Opening claim text (preview).

What is claimed is:

1. A method for placing a set of objects in a distributed data protection system, the method comprising: generating a hash for each object in a set of objects and representing each hash as a vector having n entries, wherein the set of hashes constitute a matrix of vectors and wherein each of the vectors is a row in the matrix, wherein each of the n entries of each hash is included in a different column of the matrix of vectors, wherein the matrix of vectors has n columns, wherein each entry in each column includes a value from one of the hashes and wherein each entry for each vector corresponds to a portion of the corresponding hash; generating a similarity vector for the set of objects, wherein the similarity vector has n entries, wherein each entry in the similarity vector is generated from one of the n columns such that each entry in the similarity vector represents corresponding entries from all of the vectors in the matrix of vectors, wherein the distributed data protection system includes a plurality of nodes, wherein the similarity vector represents each of the objects in the set of objects, wherein each entry in the similarity vector includes a mean of values in a corresponding column in the n columns of the similarity vector; comparing the similarity vector with destination similarity vectors associated with sets of objects already placed in the distributed data protection system, wherein the comparisons result in similarity values that determine how similar the set of objects is to each of the sets of objects; selecting a node from the plurality of nodes based on the similarity values; and placing the set of objects with the selected node.

2. The method of claim 1, wherein the columns are vertical and/or diagonal.

3. The method of claim 1, wherein each entry in each column includes at least one bit.

4. The method of claim 1, wherein comparing the similarity vector with the destination similarity vectors includes determining a Euclidean distance as a similarity value.

5. The method of claim 1, wherein each node includes a portion of an index used for de-duplicating the objects at each node.

6. The method of claim 5, further comprising backing up the set of objects at the selected node.

7. The method of claim 6, further comprising de-duplicating the set of objects at the selected node.

8. A method for placing a set of objects in a distributed data protection system, the method comprising: identifying a set of objects for placement in a data protection system; generating a set of hashes corresponding to the set of objects, wherein the set of hashes includes a hash for each of the objects in the set of objects; generating a vector matrix that includes vectors, wherein each of the vectors is generated from one of the hashes such that each vector corresponds to an object in the set of objects, wherein each of the vectors include n entries, wherein each of the n entries of each hash is included in a different column of the vector matrix, wherein the vector matrix includes n columns, wherein each entry in each column includes a value from one of the hashes and wherein each entry for each vector corresponds to a portion of the corresponding hash; generating a similarity vector for the set of objects, wherein the similarity vector has n entries, wherein each entry in the similarity vector is generated from one of the n columns such that each entry in the similarity vector represents corresponding entries from all of the vectors in the matrix of vectors, wherein the similarity vector is a compact representation of all objects in the set of objects, wherein each entry in the similarity vector includes a mean of values in a corresponding column in the n columns of the similarity vector; comparing the similarity vector with destination similarity vectors associated with sets of objects already placed in the distributed data protection system, wherein the comparisons result in similarity values that determine how similar the set of objects is to each of the sets of objects; selecting a node from the plurality of nodes based on the similarity values; and placing the set of objects with the selected node.

9. The method of claim 8, wherein the set of objects includes N objects and the set of hashes includes N hashes.

10. The method of claim 8, wherein the columns are vertical and/or diagonal.

11. The method of claim 8, wherein each entry in each column includes at least one bit.

12. The method of claim 8, further comprising rebalancing the objects stored at the plurality of nodes.

13. The method of claim 8, further comprising comparing the similarity vector with a subset of destination similarity vectors of each node.

14. The method of claim 8, wherein each node maintains destination similarity vectors for different sets of objects.

15. The method of claim 8, further comprising de-duplicating the set of objects at the selected node.

16. A non-transitory computer readable medium comprising computer executable instructions for performing operations of a method for placing a set of objects in a distributed data protection system, the method comprising: generating a hash for each object in a set of objects and representing each hash as a vector having n entries, wherein the set of hashes constitute a matrix of vectors and wherein each of the vectors is a row in the matrix, wherein each of the n entries of each hash is included in a different column of the matrix of vectors, wherein the matrix of vectors has n columns, wherein each entry in each column includes a value from one of the hashes and wherein each entry for each vector corresponds to a portion of the corresponding hash; generating a similarity vector for the set of objects, wherein the similarity vector has n entries, wherein each entry in the similarity vector is generated from one of the n columns such that each entry in the similarity vector represents corresponding entries from all of the vectors in the matrix of vectors, wherein the distributed data protection system includes a plurality of nodes, wherein the similarity vector represents each of the objects in the set of objects, wherein each entry in the similarity vector includes a mean of values in a corresponding column in the n columns of the similarity vector; comparing the similarity vector with destination similarity vectors associated with sets of objects already placed in the distributed data protection system, wherein the comparisons result in similarity values that determine how similar the set of objects is to each of the sets of objects; selecting a node from the plurality of nodes based on the similarity values; and placing the set of objects with the selected node.
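
The comparison and node-selection steps recited in the claims can be sketched as follows, using the Euclidean distance named in claim 4 as the similarity value. This is an illustrative sketch only. The function name `select_node`, the dictionary layout mapping node identifiers to their destination similarity vectors, and the rule that the smallest distance wins are assumptions, not details confirmed by the claim text.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def select_node(
    candidate: Sequence[float],
    destinations: Dict[str, List[Sequence[float]]],
) -> Tuple[Optional[str], float]:
    """Return the node holding the destination similarity vector
    closest to the candidate, together with that distance.

    Assumption: a smaller Euclidean distance means "more similar".
    """
    best_node: Optional[str] = None
    best_distance = math.inf
    for node_id, vectors in destinations.items():
        for dest in vectors:
            # math.dist raises ValueError if the lengths differ.
            distance = math.dist(candidate, dest)
            if distance < best_distance:
                best_node, best_distance = node_id, distance
    return best_node, best_distance


if __name__ == "__main__":
    # Hypothetical nodes, each keeping vectors for sets it already stores.
    destinations = {
        "node-a": [[0.0, 0.0], [10.0, 10.0]],
        "node-b": [[3.0, 4.0]],
    }
    print(select_node([3.0, 5.0], destinations))
```

Because each node may keep several destination similarity vectors (claim 14), the sketch scans all of them. Comparing against only a subset per node, as in claim 13, would reduce the work at the cost of a coarser match.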

Assignees

EMC IP Holding Co LLC

Inventors

Classifications

  • using de-duplication of the data

  • for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

  • Applying verification of the received information (cryptographic mechanisms or cryptographic arrangements for data integrity or data verification H04L9/32)

  • Hash-based (content-based indexing of textual data G06F16/31)

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11314598B2 cover?
Systems and methods for determining similarity between sets of objects are disclosed. A set of hashes are generated for a set of objects. A similarity vector is generated for the set of hashes. The similarity vector is a compact representation of the set of hashes and of the corresponding set of objects. The similarity of the set of objects is determined by comparing the similarity vector of th…
Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F11/1453. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 26, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).