Performing block deduplication using block sequence classifications
US-10037336-B1 · Jul 31, 2018 · US
US11093454B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11093454-B2 |
| Application number | US-201715799117-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 31, 2017 |
| Priority date | Oct 31, 2017 |
| Publication date | Aug 17, 2021 |
| Grant date | Aug 17, 2021 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Embodiments are directed to techniques for performing deduplication. A method includes (a) obtaining a digest of a data block logically-positioned within a filesystem, the digest providing a hash value of data of the data block, (b) searching a Most Wanted Digest Cache (MWDC) within system memory for the digest, (c) locating an entry in the MWDC using the digest, wherein this locating indicates that the data block has the same data as another data block located elsewhere within the filesystem, the other data block having been previously persistently-stored, the entry having been added to the MWDC in response to the other data block having been deduplicated at least a plurality number of times, (d) locating a mapping structure referenced by the entry located from the MWDC, the mapping structure providing metadata about the other data block, and (e) deduplicating the data block and the other data block with reference to the located mapping structure.
Opening claim text (preview).
What is claimed is: 1. A method, performed by a computing device, of performing deduplication of data stored on the computing device, the method comprising: obtaining a digest of a data block logically-positioned within a filesystem managed by the computing device, the digest providing a hash value of data of the data block; searching a Most Wanted Digest Cache (MWDC) within system memory of the computing device for the digest; locating an entry in the MWDC within system memory using the digest, wherein this locating indicates that the data block has the same data as another data block located elsewhere within the filesystem, the other data block having been previously persistently-stored by the computing device, the entry having been added to the MWDC in response to the other data block having been deduplicated at least a plural number of times; locating a mapping structure referenced by the entry located from the MWDC, the mapping structure providing metadata about the other data block; deduplicating the data block and the other data block with reference to the located mapping structure; and performing a preliminary deduplication of the other data block by: reading a digest (hereinafter “third digest”) of yet another (hereinafter “third”) data block from yet another (hereinafter “third”) mapping structure, the third digest providing a hash value of data of the third data block, the third data block already being stored at a first address of persistent storage different than a second address of persistent storage where the other data block is stored, the third data block being pointed to by the third mapping structure; searching the MWDC within system memory of the computing device for the third digest; in response to failing to find the third digest in the MWDC, obtaining a persistent entry from a persistent digest log stored in secondary storage of the computing device using the digest of the third data block, wherein this obtaining indicates that the third data block has the same data as the other data block, the other data block having been previously persistently-stored by the computing device, the persistent entry having been added to the persistent digest log in response to the other data block having been analyzed by a background deduplication process; locating the mapping structure of the other data block as referenced by the persistent entry obtained from the persistent digest log; determining a popularity value of the other data block and comparing the popularity value to the plural number of times; in response to the popularity value of the other data block being at least the plural number of times, creating the entry in the MWDC indexed by the digest and inserting a reference to the mapping structure within the entry in the MWDC; and deduplicating the third data block and the other data block with reference to the located mapping structure. 2. The method of claim 1 , wherein obtaining the digest is performed in response to receiving a write request to write the data block to the filesystem on the computing device, the write request including data of the data block; and wherein obtaining the digest includes applying a cryptographic hashing algorithm to the data of the data block. 3. The method of claim 2 wherein deduplicating the data block and the other data block with reference to the located mapping structure includes: storing a reference to the mapping structure at a location within a file pointer structure of a file of the filesystem that corresponds to a location within the file where the data block is logically-positioned; and increasing a weight stored within the mapping structure to indicate that the other data block is now shared an additional time, the weight indicating how many locations within the filesystem share the other data block. 4. The method of claim 3 wherein deduplicating the data block and the other data block with reference to the located mapping structure further includes increasing a deduplication counter stored within the mapping structure to indicate that the other data block has been deduplicated an additional time, the deduplication counter indicating how many times the other data block has been deduplicated. 5. The method of claim 1 , wherein obtaining the digest is performed as part of a background deduplication process, the data block already being stored at a first address of persistent storage different than a second address of persistent storage where the other data block is stored, the data block being pointed to by another mapping structure, the other mapping structure being referenced by a mapping pointer at a location within a file pointer structure of a file of the filesystem that corresponds to a location within the file where the data block is logically-positioned; and wherein obtaining the digest includes reading the digest from the other mapping structure. 6. The method of claim 5 wherein deduplicating the data block and the other data block with reference to the located mapping structure includes: storing a reference to the mapping structure at the location within the file pointer structure of the file of the filesystem that corresponds to the location within the file where the data block is logically-positioned in place of the mapping pointer that referenced the other mapping structure; and increasing a weight stored within the mapping structure to indicate that the other data block is now shared an additional time, the weight indicating how many locations within the filesystem share the other data block. 7. The method of claim 6 wherein deduplicating the data block and the other data block with reference to the located mapping structure further includes increasing a deduplication counter stored within the mapping structure to indicate that the other data block has been deduplicated an additional time, the deduplication counter indicating how many times the other data block has been deduplicated. 8. The method of claim 1 wherein determining the popularity value of the other data block includes reading a weight value stored within the mapping structure, the weight value indicating how many locations within the filesystem share the other data block. 9. The method of claim 1 wherein determining the popularity value of the other data block includes reading a deduplication counter stored within the mapping structure, the deduplication counter indicating how many times the other data block has been deduplicated. 10. The method of claim 1 wherein performing the preliminary deduplication of the other data block further includes, in response to creating the entry in the MWDC: determining whether the MWDC contains more than a threshold number of entries; in response to determining that the MWDC contains more than the threshold number of entries, evicting a least recently used entry from the MWDC. 11. The method of claim 1 wherein the method further comprises, in response to deduplicating the data block and the other data block with reference to the located mapping structure, promoting the entry to a position within the MWDC indicating that it has been accessed more recently. 12. The method of claim 1 wherein the method further comprises, in response to deduplicating the data block and the other data block with reference to the located mapping structure: determining whether the entry has been accessed more than a threshold number of times; in response to determining that the entry has been accessed more than the threshold number of times promoting the entry from a lower level of the MWDC to a most recently accessed position of a higher level of the MWDC, the higher level of the MWDC providing mo
using de-duplication of the data · CPC title
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Hash-based (content-based indexing of textual data G06F16/31) · CPC title
File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.