Scalable post-process deduplication

US9946724B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-9946724-B1
Application numberUS-201414230863-A
CountryUS
Kind codeB1
Filing dateMar 31, 2014
Priority dateMar 31, 2014
Publication dateApr 17, 2018
Grant dateApr 17, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Implementations are provided herein for data deduplication, and more particularly, to post-process data deduplication on a large scale out storage system. Multiple techniques and implementations are disclosed that offer greater efficiency, higher performance, and more stability when performing post-process data deduplication at large scale. Disclosed implementations are based on a process for data deduplication involving four main phases: enumeration, commonality, sharing, and update. Multi-level hashing can be used to identify candidates for deduplication during the enumeration phase, providing a more efficient use of compute resources. In addition, datasets can be phase rotated through the post-process deduplication steps providing a more controllable deduplication environment as well as a more efficient use of resources.

First claim

Opening claim text (preview).

What is claimed is: 1. A system comprising: a memory that has stored thereon computer executable components; and at least one processor that executes the following computer executable components stored in the memory: a phase rotation component that generates a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset, and wherein the phase rotation component sends the first dataset to the enumeration component for ingestion; an enumeration component that ingests a dataset by: reading a set of low level hashes associated with the dataset, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system; analyzing the set of low level hashes and determining a set of potential matching candidates; generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers; and adding the set of high level hashes and associated logical block identifiers to a candidate table; a disk pool policy component that in response to the enumeration component generating a set of high level hashes, determines and associates a disk pool policy identifier with high level hashes in the set of high level hashes; a commonality component that determines a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, wherein the index table contains a set of high level hashes, associated disk pool policy identifiers, and associated shadow store logical block identifiers, and wherein the set of shareable blocks is based on common disk pool policy identifiers; and a sharing component that updates the file system based on the set of shareable blocks, wherein, in response to the sharing component updating the file system, the phase rotation component sends the second dataset to the enumeration component for ingestion. 2. The system of claim 1 , wherein the sharing component updates the file system by at least one of: storing a set of data blocks in the shadow store based on the block address associated with a shareable block in the set of shareable blocks; generating a shadow store pointer for a shareable block in the set of shareable blocks and updating a metadata structure associated with the shareable block with the shadow store pointer, wherein the shadow store pointer points to a shadow store block address; or updating the index table based on the high level hash associated with the shareable block and a shadow store logical block identifier. 3. The system of claim 1 , wherein the first dataset and the second dataset do not overlap. 4. The system of claim 1 , further comprising: a commonality range extension component that determines a largest shareable range for shareable blocks in the set of shareable blocks by analyzing neighboring blocks of the logical block identifiers associated with the shareable blocks. 5. The system of claim 1 , wherein the enumeration component ingests the dataset based on at least one of a sampled attribute associated with files of the dataset or an exclude attribute associated with files of the dataset. 6. The system of claim 1 , wherein the sharing component updates the file system by at least adding an entry to a reverse mapping table for shareable blocks in the set of shareable blocks wherein the entry includes at least a file identifier and a shadow store logical block identifier. 7. The system of claim 1 , wherein the index table is empty. 8. The system of claim 1 , wherein the low level hashes are 32-bit checksums. 9. The system of claim 1 , wherein the high level hashes are 160 bit SHA1 hashes. 10. A method comprising: generating a set of datasets of a file system, wherein the set of datasets includes at least a first dataset and a second dataset; ingesting the first dataset, wherein ingesting a dataset includes: scanning the dataset and reading a set of low level hashes based on the scanning, wherein low level hashes in the set of low level hashes are associated with a logical block identifier of the file system; analyzing the set of low level hashes and determining a set of potential matching candidates; generating a set of high level hashes based on the set of potential matching candidates and associated logical block identifiers; in response to the generating the set of high level hashes, determining a disk pool policy identifier for high level hashes in the set of high level hashes; associating the determined disk pool policy identifier with high level hashes in the set of high level hashes; and adding the set of high level hashes and associated logical block identifiers to a candidate table; determining a set of shareable blocks by comparing high level hashes in the set of high level hashes of the candidate table with other high level hashes of the candidate table and an index table, based in part on the set of shareable blocks having common disk pool policy identifiers, wherein the index table contains a set of high level hashes, associated disk pool identifers and associated shadow store logical block identifiers; updating the file system based on the set of shareable blocks; and in response to the updating the file system, ingesting the second dataset. 11. The method of claim 10 , wherein the updating the file system includes at least one of: storing a set of data blocks in the shadow store based on the block address associated with a shareable block in the set of shareable blocks; generating a shadow store pointer for a shareable block in the set of shareable blocks and updating a metadata structure associated with the shareable block with the shadow store pointer, wherein the shadow store pointer points to a shadow store logical block identifier; or updating the index table based on the high level hash associated with the shareable block and a shadow store block address. 12. The method of claim 10 , wherein the first dataset and the second dataset do not overlap. 13. The method of claim 10 , further comprising: analyzing neighboring blocks of the block addresses associated with the shareable blocks, determining a largest shareable range for shareable blocks in the set of shareable blocks based on the analyzing, wherein updating the file system is based on the largest shareable range. 14. The method of claim 10 , wherein the scanning the dataset is based on at least one of a sampled attribute associated with files of the dataset or an exclude attribute associated with files of the dataset. 15. The method of claim 10 , wherein the updating the file system is at least adding an entry to a reverse mapping table for shareable blocks in the set of shareable blocks wherein the entry includes at least a file identifier and a shadow store logical block identifier. 16. The method of claim 10 , wherein the index table is empty. 17. The method of claim 10 , wherein the low level hashes are 32-bit checksums. 18. The method of claim 10 , wherein the high level hashes are 160 bit SHA1 hashes.

Assignees

Inventors

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9946724B1 cover?
Implementations are provided herein for data deduplication, and more particularly, to post-process data deduplication on a large scale out storage system. Multiple techniques and implementations are disclosed that offer greater efficiency, higher performance, and more stability when performing post-process data deduplication at large scale. Disclosed implementations are based on a process for d…
Who is the assignee on this patent?
Emc Corp, Emc Ip Holding Co Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 17 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).