Data compression using nearest neighbor cluster

US11977959B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11977959-B2
Application numberUS-201916412970-A
CountryUS
Kind codeB2
Filing dateMay 15, 2019
Priority dateMay 15, 2019
Publication dateMay 7, 2024
Grant dateMay 7, 2024

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are techniques for compressing data in a data storage system comprising searching a cluster of nearest neighbors, wherein the cluster has been created using a locality sensitive hashing algorithm, to determine if a data block can be compressed. In alternate embodiments, nearest neighbor clusters can be formed using unsupervised learning. Additionally, nearest neighbors can also be formed in alternate embodiments using one or more of the following algorithms: a k-means clustering algorithm, a k-medoids clustering algorithm, a mean shift algorithm, a generalized method of moment (GMM) algorithm, or a density based spatial clustering of applications with noise (DBSCAN) algorithm.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for compressing data in a data storage system comprising: searching a cluster of nearest neighbors to determine if a data block can be compressed, wherein the cluster has been created using a locality sensitive hashing function, the cluster has one or more statistical features that include an entropy, and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor; generating an exclusive OR (XOR) map representing how at least two data elements within the cluster differ from one another; using the XOR map with the cluster to maintain a difference between the at least two data elements within the cluster; and using an offload engine to calculate a RAID parity. 2. The method of claim 1 , further comprising: compressing the XOR map. 3. The method of claim 1 , wherein the locality sensitive hashing function is a secure hash algorithm 1 (“SHA-1”) or a Message Digest 5 (“MD5”) algorithm. 4. The method of claim 1 , wherein the cluster is created in an unsupervised learning environment. 5. The method of claim 1 , wherein the cluster is created using an offload engine. 6. The method of claim 5 , further comprising: identifying a hot block. 7. The method of claim 5 , further comprising: computing a message digest. 8. The method of claim 1 , wherein, instead of using locality sensitive hashing to create the cluster of nearest neighbors, one or more of the following algorithms has been used to create the cluster of nearest neighbors: a k-means clustering algorithm, a k-medoids clustering algorithm, a mean shift algorithm, a generalized method of moment (GMM) algorithm, or a density based spatial clustering of applications with noise (DBSCAN) algorithm. 9. A system for compressing data comprising: a memory comprising computer executable instructions; a processor executing the computer executable instructions, the computer-executable instructions when executed by the processor cause the processor to perform operations comprising: searching a cluster of nearest neighbors to determine if a data block can be compressed, wherein the cluster has been created using a locality sensitive hashing function, the cluster has one or more statistical features that include a chi square test, and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor; generating an exclusive OR (XOR) map representing how at least two data elements within the cluster differ from one another; using the XOR map with the cluster to maintain a difference between the at least two data elements within the cluster; and using an offload engine to calculate a RAID parity. 10. The system of claim 9 , further comprising: compressing the XOR map. 11. The system of claim 9 , wherein the locality sensitive hashing function is a secure hash algorithm 1 (“SHA-1”) or a Message Digest 5 (“MD5”) algorithm. 12. The system of claim 9 , wherein the cluster is created in an unsupervised learning environment. 13. The system of claim 9 , wherein the cluster is created using the offload engine. 14. The system of claim 13 , further comprising: computing a message digest. 15. The system of claim 13 , further comprising: computing an integrity value. 16. The system of claim 9 , wherein, instead of using locality sensitive hashing to create the cluster of nearest neighbors, one or more of the following algorithms has been used to create the cluster of nearest neighbors: a k-means clustering algorithm, a k-medoids clustering algorithm, a mean shift algorithm, a generalized method of moment (GMM) algorithm, or a density based spatial clustering of applications with noise (DBSCAN) algorithm. 17. A non-transitory, computer readable medium comprising code stored thereon that, when executed, performs the following acts: searching a cluster of nearest neighbors to determine if a data block can be compressed, wherein the cluster has been created using a locality sensitive hashing function, the cluster has one or more statistical features that include a Pearson correlation coefficient, and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor; generating an exclusive OR (XOR) map representing how at least two data elements within the cluster differ from one another; using the XOR map with the cluster to maintain a difference between the at least two data elements within the cluster; and using an offload engine to calculate a RAID parity. 18. The non-transitory, computer readable medium of claim 17 , wherein, instead of using locality sensitive hashing to create the cluster of nearest neighbors, one or more of the following algorithms has been used to create the cluster of nearest neighbors: a k-means clustering algorithm, a k-medoids clustering algorithm, a mean shift algorithm, a generalized method of moment (GMM) algorithm, or a density based spatial clustering of applications with noise (DBSCAN) algorithm.

Assignees

Inventors

Classifications

  • G06N20/00Primary

    Machine learning · CPC title

  • Logical and Boolean instructions, e.g. XOR, NOT · CPC title

  • with fixed number of clusters, e.g. K-means clustering · CPC title

  • Distances to closest patterns, e.g. nearest neighbour classification · CPC title

  • Non-supervised learning, e.g. competitive learning · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11977959B2 cover?
Disclosed are techniques for compressing data in a data storage system comprising searching a cluster of nearest neighbors, wherein the cluster has been created using a locality sensitive hashing algorithm, to determine if a data block can be compressed. In alternate embodiments, nearest neighbor clusters can be formed using unsupervised learning. Additionally, nearest neighbors can also be for…
Who is the assignee on this patent?
Emc Ip Holding Co Llc
What technology area does this patent fall under?
Primary CPC classification G06N20/00. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue May 07 2024 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).