Distributed deduplication using locality sensitive hashing

US2016019232A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2016019232-A1
Application numberUS-201414337070-A
CountryUS
Kind codeA1
Filing dateJul 21, 2014
Priority dateJul 21, 2014
Publication dateJan 21, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent. The deduplication manager selects a second extent from the found group of extents. The second closely matches the first extent and removes from the first extent one or more data chunks that are included in the first extent and the second extent. The deduplication manager associates, with the first extent, a pointer to the second extent for the removed one or more data chunks.

First claim

Opening claim text (preview).

What is claimed is: 1 . A method comprising: identifying a data item including a plurality of data chunks; generating, by a processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 2 . The method of claim 1 further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 3 . The method of claim 1 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 4 . The method of claim 1 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 5 . The method of claim 1 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 6 . The method of claim 1 , wherein the processing device executes a daemon to identify the plurality of data chunks. 7 . The method of claim 1 , wherein at least some of the plurality of data chunks of the first extent are sequentially addressed. 8 . A computer readable storage medium having instructions that, when executed by a processing device, cause the processing device to perform operations comprising: identifying a data item including a plurality of data chunks; generating, by the processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system, the first extent comprising a sketch that comprises the fingerprints for each of the plurality of data chunks; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent that closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 9 . The computer readable storage medium of claim 8 , the operations further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 10 . The computer readable storage medium of claim 8 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 11 . The computer readable storage medium of claim 8 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 12 . The computer readable storage medium of claim 8 , wherein at least some of the plurality of data chunks of the first extent are sequentially addresses. 13 . The computer readable storage medium of claim 8 , wherein, the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 14 . A computing device comprising: a memory; and a processing device coupled to the memory, the processing device to: identify a data item including a plurality of data chunks; generate a fingerprint for each of the plurality of data chunks; define a first extent on a first node in a distributed storage system; compare the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; select a second extent from the found group of extents, the second extent closely matching the first extent; remove from the first extent one or more data chunks that are included in the first extent and the second extent; and associate, with the first extent, a pointer to the second extent for the removed one or more data chunks. 15 . The computing device of claim 14 , the processing device further to store data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein when comparing the first extent to existing groups of similar extents, the processing device is to: identify the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identify the second extent as being associated with the first group. 16 . The computing device of claim 14 , wherein when comparing the first extent to existing groups of similar extents, the processing device is to: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 17 . The computing device of claim 14 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein when removing from the first extent one or more data chunks that are included in the first extent and the second extent, the processing device is to compare the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 18 . The computing device of claim 14 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 19 . The computing

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016019232A1 cover?
Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are simil…
Who is the assignee on this patent?
Red Hat Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Jan 21 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).