What technology area does this patent fall under?

Primary CPC classification G06F3/0641. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Jun 08 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Deduplication using nearest neighbor cluster

US11029871B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11029871-B2
Application number	US-201916412946-A
Country	US
Kind code	B2
Filing date	May 15, 2019
Priority date	May 15, 2019
Publication date	Jun 8, 2021
Grant date	Jun 8, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed are techniques for data deduplication, which include methods, systems, or computer products for reducing data redundancy in a data storage system comprising searching a cluster of nearest neighbors, wherein the cluster has been created using a locality sensitive hashing algorithm, to determine if a data block has been stored in the data storage system prior to writing the data block. In alternate embodiments, the nearest neighbor clusters could be created using one or more of the following algorithms: k-means clustering algorithm, a k-medoids clustering algorithm, a mean shift algorithm, a generalized method of moment (GMM) algorithm, or a density based spatial clustering of applications with noise (DBSCAN) algorithm.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of reducing data redundancy in a data storage system comprising; searching a cluster of nearest neighbors to determine if a data block has been stored in the data storage system prior to writing the data block, wherein the cluster has been created using a locality sensitive hashing function and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor. 2. The method of claim 1 , further comprising: writing the data block if no match is found, else storing mapping information for the data block if a match is found within the cluster of nearest neighbors. 3. The method of claim 1 , wherein the nearest neighbor cluster is created with a machine learning module or an offload engine. 4. The method of claim 1 , wherein the locality sensitive hashing function is a secure hash algorithm 1 (“SHA-1”) or a Message Digest 5 (“MD5”) algorithm. 5. The method of claim 1 further comprising: compressing one or more data sets within the cluster of nearest neighbors. 6. The method of claim 1 , wherein the cluster of nearest neighbors includes a plurality of master blocks. 7. A system comprising: one or more processors; and a memory configured to: search a cluster of nearest neighbors to determine if a data block has been stored in a data storage system prior to writing the data block, wherein the cluster has been created using a locality sensitive hashing function and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor. 8. The system of claim 7 further configured to: write the data block if no match is found, else storing mapping information for the data block if a match is found within the cluster of nearest neighbors. 9. The system of claim 7 , wherein the nearest neighbor cluster is created with a machine learning module or an offload engine. 10. The system of claim 7 , wherein the locality sensitive hashing function is a secure hash algorithm 1 (“SHA-1”) or a Message Digest 5 (“MD5”) algorithm. 11. The system of claim 7 further configured to: compress one or more data sets within the cluster of nearest neighbors. 12. The system of claim 7 , wherein the cluster of nearest neighbors includes a plurality of master blocks. 13. A non-transitory, computer readable medium comprising code stored thereon that, when executed, performs the following acts: searching a cluster of nearest neighbors to determine if a data block has been stored in a data storage system prior to writing the data block, wherein the cluster has been created using a locality sensitive hashing function and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor. 14. The non-transitory, computer readable medium of claim 13 , wherein the code stored thereon, when executed, additionally performs the following acts: writing the data block if no match is found, else storing mapping information for the data block if a match is found within the cluster of nearest neighbors. 15. The non-transitory, computer readable medium of claim 13 , wherein the nearest neighbor cluster is created with a machine learning module or an offload engine. 16. The non-transitory, computer readable medium of claim 13 , wherein the locality sensitive hashing function is a secure hash algorithm 1 (“SHA-1”) or a Message Digest 5 (“MD5”) algorithm. 17. The non-transitory, computer readable medium of claim 13 , wherein the cluster of nearest neighbors includes a plurality of master blocks. 18. A non-transitory, computer readable medium comprising code stored thereon that, when executed, performs the following acts: searching a cluster of nearest neighbors to determine if a data block has been stored in a data storage system prior to writing the data block wherein the cluster has been created using a locality sensitive hashing function and determination of nearest neighbors is made by evaluating a plurality of hash values placed in a coordinate system having at least four dimensions in order to determine a distance between each neighbor; and compressing one or more data sets within the cluster of nearest neighbors.

Assignees

Emc Ip Holding Co Llc

Inventors

Classifications

G06F3/0608
Saving storage space on storage systems · CPC title
G06N20/00
Machine learning · CPC title
G06F3/0641Primary
De-duplication techniques · CPC title
G06F3/0659
Command handling arrangements, e.g. command buffers, queues, command scheduling · CPC title
G06F3/0689
Disk arrays, e.g. RAID, JBOD · CPC title

Patent family

Related publications grouped by family.

View patent family 73245272

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11029871B2 cover?: Disclosed are techniques for data deduplication, which include methods, systems, or computer products for reducing data redundancy in a data storage system comprising searching a cluster of nearest neighbors, wherein the cluster has been created using a locality sensitive hashing algorithm, to determine if a data block has been stored in the data storage system prior to writing the data block. …
Who is the assignee on this patent?: Emc Ip Holding Co Llc
What technology area does this patent fall under?: Primary CPC classification G06F3/0641. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Jun 08 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).

How to read this patent

Abstract

First claim

Assignees

Inventors

Classifications

Patent family

External sources

Related patents

Technologies for refining stochastic similarity search candidates

System and method for comparing plurality of documents

Data deduplication

Compression-based filtering for deduplication

Four-dimensional morton coordinate conversion processors, methods, systems, and instructions

Distributed deduplication using locality sensitive hashing

Frequently asked questions