Speeding deduplication using a most wanted digest cache

US11093454B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11093454-B2
Application numberUS-201715799117-A
CountryUS
Kind codeB2
Filing dateOct 31, 2017
Priority dateOct 31, 2017
Publication dateAug 17, 2021
Grant dateAug 17, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments are directed to techniques for performing deduplication. A method includes (a) obtaining a digest of a data block logically-positioned within a filesystem, the digest providing a hash value of data of the data block, (b) searching a Most Wanted Digest Cache (MWDC) within system memory for the digest, (c) locating an entry in the MWDC using the digest, wherein this locating indicates that the data block has the same data as another data block located elsewhere within the filesystem, the other data block having been previously persistently-stored, the entry having been added to the MWDC in response to the other data block having been deduplicated at least a plurality number of times, (d) locating a mapping structure referenced by the entry located from the MWDC, the mapping structure providing metadata about the other data block, and (e) deduplicating the data block and the other data block with reference to the located mapping structure.

First claim

Opening claim text (preview).

What is claimed is: 1. A method, performed by a computing device, of performing deduplication of data stored on the computing device, the method comprising: obtaining a digest of a data block logically-positioned within a filesystem managed by the computing device, the digest providing a hash value of data of the data block; searching a Most Wanted Digest Cache (MWDC) within system memory of the computing device for the digest; locating an entry in the MWDC within system memory using the digest, wherein this locating indicates that the data block has the same data as another data block located elsewhere within the filesystem, the other data block having been previously persistently-stored by the computing device, the entry having been added to the MWDC in response to the other data block having been deduplicated at least a plural number of times; locating a mapping structure referenced by the entry located from the MWDC, the mapping structure providing metadata about the other data block; deduplicating the data block and the other data block with reference to the located mapping structure; and performing a preliminary deduplication of the other data block by: reading a digest (hereinafter “third digest”) of yet another (hereinafter “third”) data block from yet another (hereinafter “third”) mapping structure, the third digest providing a hash value of data of the third data block, the third data block already being stored at a first address of persistent storage different than a second address of persistent storage where the other data block is stored, the third data block being pointed to by the third mapping structure; searching the MWDC within system memory of the computing device for the third digest; in response to failing to find the third digest in the MWDC, obtaining a persistent entry from a persistent digest log stored in secondary storage of the computing device using the digest of the third data block, wherein this obtaining indicates that the third data block has the same data as the other data block, the other data block having been previously persistently-stored by the computing device, the persistent entry having been added to the persistent digest log in response to the other data block having been analyzed by a background deduplication process; locating the mapping structure of the other data block as referenced by the persistent entry obtained from the persistent digest log; determining a popularity value of the other data block and comparing the popularity value to the plural number of times; in response to the popularity value of the other data block being at least the plural number of times, creating the entry in the MWDC indexed by the digest and inserting a reference to the mapping structure within the entry in the MWDC; and deduplicating the third data block and the other data block with reference to the located mapping structure. 2. The method of claim 1 , wherein obtaining the digest is performed in response to receiving a write request to write the data block to the filesystem on the computing device, the write request including data of the data block; and wherein obtaining the digest includes applying a cryptographic hashing algorithm to the data of the data block. 3. The method of claim 2 wherein deduplicating the data block and the other data block with reference to the located mapping structure includes: storing a reference to the mapping structure at a location within a file pointer structure of a file of the filesystem that corresponds to a location within the file where the data block is logically-positioned; and increasing a weight stored within the mapping structure to indicate that the other data block is now shared an additional time, the weight indicating how many locations within the filesystem share the other data block. 4. The method of claim 3 wherein deduplicating the data block and the other data block with reference to the located mapping structure further includes increasing a deduplication counter stored within the mapping structure to indicate that the other data block has been deduplicated an additional time, the deduplication counter indicating how many times the other data block has been deduplicated. 5. The method of claim 1 , wherein obtaining the digest is performed as part of a background deduplication process, the data block already being stored at a first address of persistent storage different than a second address of persistent storage where the other data block is stored, the data block being pointed to by another mapping structure, the other mapping structure being referenced by a mapping pointer at a location within a file pointer structure of a file of the filesystem that corresponds to a location within the file where the data block is logically-positioned; and wherein obtaining the digest includes reading the digest from the other mapping structure. 6. The method of claim 5 wherein deduplicating the data block and the other data block with reference to the located mapping structure includes: storing a reference to the mapping structure at the location within the file pointer structure of the file of the filesystem that corresponds to the location within the file where the data block is logically-positioned in place of the mapping pointer that referenced the other mapping structure; and increasing a weight stored within the mapping structure to indicate that the other data block is now shared an additional time, the weight indicating how many locations within the filesystem share the other data block. 7. The method of claim 6 wherein deduplicating the data block and the other data block with reference to the located mapping structure further includes increasing a deduplication counter stored within the mapping structure to indicate that the other data block has been deduplicated an additional time, the deduplication counter indicating how many times the other data block has been deduplicated. 8. The method of claim 1 wherein determining the popularity value of the other data block includes reading a weight value stored within the mapping structure, the weight value indicating how many locations within the filesystem share the other data block. 9. The method of claim 1 wherein determining the popularity value of the other data block includes reading a deduplication counter stored within the mapping structure, the deduplication counter indicating how many times the other data block has been deduplicated. 10. The method of claim 1 wherein performing the preliminary deduplication of the other data block further includes, in response to creating the entry in the MWDC: determining whether the MWDC contains more than a threshold number of entries; in response to determining that the MWDC contains more than the threshold number of entries, evicting a least recently used entry from the MWDC. 11. The method of claim 1 wherein the method further comprises, in response to deduplicating the data block and the other data block with reference to the located mapping structure, promoting the entry to a position within the MWDC indicating that it has been accessed more recently. 12. The method of claim 1 wherein the method further comprises, in response to deduplicating the data block and the other data block with reference to the located mapping structure: determining whether the entry has been accessed more than a threshold number of times; in response to determining that the entry has been accessed more than the threshold number of times promoting the entry from a lower level of the MWDC to a most recently accessed position of a higher level of the MWDC, the higher level of the MWDC providing mo

Assignees

Inventors

Classifications

  • using de-duplication of the data · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Hash-based (content-based indexing of textual data G06F16/31) · CPC title

  • File access structures, e.g. distributed indices (arrangements of input from, or output to, record carriers G06F3/06) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11093454B2 cover?
Embodiments are directed to techniques for performing deduplication. A method includes (a) obtaining a digest of a data block logically-positioned within a filesystem, the digest providing a hash value of data of the data block, (b) searching a Most Wanted Digest Cache (MWDC) within system memory for the digest, (c) locating an entry in the MWDC using the digest, wherein this locating indicates…
Who is the assignee on this patent?
Emc Ip Holding Co Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Aug 17 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).