Indexing a deduplicated cache system by integrating fingerprints of underlying deduplicated storage system
US-9336143-B1 · May 10, 2016 · US
US2016019232A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2016019232-A1 |
| Application number | US-201414337070-A |
| Country | US |
| Kind code | A1 |
| Filing date | Jul 21, 2014 |
| Priority date | Jul 21, 2014 |
| Publication date | Jan 21, 2016 |
| Grant date | — |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent. The deduplication manager selects a second extent from the found group of extents. The second closely matches the first extent and removes from the first extent one or more data chunks that are included in the first extent and the second extent. The deduplication manager associates, with the first extent, a pointer to the second extent for the removed one or more data chunks.
Opening claim text (preview).
What is claimed is: 1 . A method comprising: identifying a data item including a plurality of data chunks; generating, by a processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 2 . The method of claim 1 further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 3 . The method of claim 1 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 4 . The method of claim 1 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 5 . The method of claim 1 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 6 . The method of claim 1 , wherein the processing device executes a daemon to identify the plurality of data chunks. 7 . The method of claim 1 , wherein at least some of the plurality of data chunks of the first extent are sequentially addressed. 8 . A computer readable storage medium having instructions that, when executed by a processing device, cause the processing device to perform operations comprising: identifying a data item including a plurality of data chunks; generating, by the processing device, a fingerprint for each of the plurality of data chunks; defining a first extent on a first node in a distributed storage system, the first extent comprising a sketch that comprises the fingerprints for each of the plurality of data chunks; comparing the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; selecting a second extent from the found group of extents, the second extent that closely matching the first extent; removing, by the processing device, from the first extent one or more data chunks that are included in the first extent and the second extent; and associating, with the first extent, a pointer to the second extent for the removed one or more data chunks. 9 . The computer readable storage medium of claim 8 , the operations further comprising storing data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein comparing the first extent to existing groups of similar extents comprises: identifying the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identifying the second extent as being associated with the first group. 10 . The computer readable storage medium of claim 8 , wherein comparing the first extent to existing groups of similar extents comprises: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 11 . The computer readable storage medium of claim 8 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein removing from the first extent one or more data chunks that are included in the first extent and the second extent comprises comparing the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 12 . The computer readable storage medium of claim 8 , wherein at least some of the plurality of data chunks of the first extent are sequentially addresses. 13 . The computer readable storage medium of claim 8 , wherein, the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 14 . A computing device comprising: a memory; and a processing device coupled to the memory, the processing device to: identify a data item including a plurality of data chunks; generate a fingerprint for each of the plurality of data chunks; define a first extent on a first node in a distributed storage system; compare the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent; select a second extent from the found group of extents, the second extent closely matching the first extent; remove from the first extent one or more data chunks that are included in the first extent and the second extent; and associate, with the first extent, a pointer to the second extent for the removed one or more data chunks. 15 . The computing device of claim 14 , the processing device further to store data identifying the existing groups of similar extents in an index, wherein the first extent is associated with a first group of the existing groups and wherein the second extent is associated with the first group, wherein when comparing the first extent to existing groups of similar extents, the processing device is to: identify the first group in the index using a sketch of the first extent and a locality-sensitive hashing algorithm; and identify the second extent as being associated with the first group. 16 . The computing device of claim 14 , wherein when comparing the first extent to existing groups of similar extents, the processing device is to: querying a node in the distributed storage system for the first extent; and receiving an indication from the node that the second extent is similar to the first extent. 17 . The computing device of claim 14 , wherein the second extent comprises a fingerprint for each of the plurality of data chunks, wherein when removing from the first extent one or more data chunks that are included in the first extent and the second extent, the processing device is to compare the fingerprints for each of the plurality of data chunks in the first extent with the fingerprints for each of the plurality of data chunks in the second extent. 18 . The computing device of claim 14 , wherein the first extent is represented by a sketch that comprises fingerprints of the plurality of data chunks. 19 . The computing
based on file chunks · CPC title
Physics · mapped topic
Related publications grouped by family.
Answers are generated from the same data shown on this page.