Using index partitioning and reconciliation for data deduplication
US-9785666-B2 · Oct 10, 2017 · US
US10222987B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10222987-B2 |
| Application number | US-201615041938-A |
| Country | US |
| Kind code | B2 |
| Filing date | Feb 11, 2016 |
| Priority date | Feb 11, 2016 |
| Publication date | Mar 5, 2019 |
| Grant date | Mar 5, 2019 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A data deduplication process maintains a data dictionary including a storage tablet and a secondary index containing data indicative of previously received data blocks. The tablet includes hashes of previous data blocks and the index includes one or more cuckoo filters storing fingerprints derived from block hashes of previous data blocks. When a new data block arrives, its block hash and fingerprint are generated. The storage tablet is queried with the block hash and the secondary index is queried with the fingerprint. If the dictionary contains no matching block hash or fingerprint, the new data block is stored in its entirety. If the dictionary contains a matching block hash or fingerprint, the new data block may be a duplicate data block that can be deduplicated by storing a reference to the previous data block instead of storing the new data block in its entirety.
Opening claim text (preview).
What is claimed is: 1. A data deduplication method, comprising: responsive to detecting a data block, generating a block hash for the data block; querying a data dictionary for data indicative of a previous occurrence of the block hash, the data dictionary comprising: an active storage tablet, comprising a plurality of records, each record including a previously received block hash and a corresponding storage location; and a secondary index including a plurality of augmented cuckoo filters (ACFs), wherein each ACF includes a plurality of ACF entries representing previously received block hashes, wherein each of the plurality of ACF entries includes: a fingerprint derived from a corresponding block hash; and a tablet index indicative of a particular storage tablet associated with the corresponding block hash; wherein the plurality of ACFs includes: a first layer ACF, representing N1 block hashes, wherein N1 is an integer greater than 1; and a second layer ACF, each second layer ACF representing N2 previously generated first layer ACFs; wherein querying the data dictionary includes: generating the fingerprint for the block hash; and querying the plurality of ACFs for a matching fingerprint; and responsive to a result of the querying indicating no previous occurrence of the block hash: storing the data block to a storage location in a storage medium; storing the block hash and the storage location as a record in the active storage tablet; inserting filter construction fields derived from the block hash as a record in a filter construction array; and subject to sufficient entries in the filter construction array, generating a new ACF from a plurality of records in the filter construction array. 2. The method of claim 1 wherein querying the plurality of ACFs includes: accessing a set of filter indices identifying a set of ACF entries within which the fingerprint may be located; and comparing the fingerprint with fingerprints stored in the set of ACF entries. 3. The method of claim 2 , wherein the filter construction fields include: a fingerprint field for the fingerprint; a tablet index field for the tablet index; and a set of filter index fields, each of the filter index fields for one of the set of filter indices. 4. The method of claim 3 , further comprising: generating the filter construction fields, said generating including hashing the block hash according to a fingerprint hashing algorithm to generate the fingerprint. 5. The method of claim 1 wherein generating a new ACF includes: generating a new first layer ACF every N1 filter construction array records and including the new first layer ACF in the secondary index; and generating a new second layer ACF every N2*N1 filter construction array records and replacing existing first layer ACFs in the secondary index with the new second layer ACF. 6. The method of claim 5 , wherein the secondary index includes at least one of: a first layer ACF; a second layer ACF; and a third layer ACF representing N3 previously generated second layer ACFs. 7. The method of claim 6 , wherein generating a new ACF includes: generating a new third layer ACF every N3*N2*N1 filter construction array records and replacing existing second layer ACFs in the secondary index with new third layer ACF. 8. The method of claim 1 , further comprising: responsive to detecting occupied storage tablet records exceeding a particular threshold: storing the active storage tablet to a storage tablet library; and creating a new active storage tablet, including associating a time stamp and a tablet index with the new active storage tablet. 9. The method of claim 8 , wherein the data dictionary further includes a storage tablet cache including: one or more active storage tablets into one of which newly ingested block hashes and corresponding storage locations are inserted; and one or more retrieved storage tablets comprising storage tablets retrieved from the storage tablet library. 10. The method of claim 9 , wherein the data dictionary resides in random access memory and the storage tablet library resides in persistent storage. 11. The method of claim 9 , further comprising: associating the data block with a corresponding data stream, wherein storing the block hash in the active storage table comprises storing the block hash in a particular active storage tablet associated with the data stream. 12. The method of claim 9 , further comprising: responsive to a result of the query indicating a previous occurrence of the block hash: determining whether the data block is a duplicate of a previous data block corresponding to the previous occurrence of the block hash; and responsive to determining that the data block is a duplicate of the previous data block, storing a reference to the previous data block, in lieu of storing the data block, at the storage location. 13. The method of claim 12 , wherein the previous occurrence of the block hash corresponds to an ACF entry in the secondary index and wherein the method includes: retrieving a storage tablet from a storage tablet library and storing the storage tablet retrieved as a retrieved storage tablet in the storage tablet cache. 14. The method of claim 1 , further comprising: querying a second data dictionary, associated with a second data storage device, with the block hash; and responsive to the query hitting in the data dictionary of the second data storage device, storing, in the active storage tablet, the block hash and the most recent storage location associated with the block hash. 15. A data deduplication method, comprising: responsive to detecting a data block, generating a block hash for the data block; querying a data dictionary for data indicative of a previous occurrence of the block hash, the data dictionary comprising: an active storage tablet, comprising a plurality of records, each record including a previously received block hash and a corresponding storage location; and a secondary index including a plurality of augmented cuckoo filters (ACFs), wherein each of the plurality of ACFs includes a plurality of ACF entries representing previously received block hashes, wherein each of the plurality of ACF entries includes: a fingerprint derived from a corresponding block hash; and a tablet index indicative of a particular storage tablet associated with the corresponding block hash; wherein querying the data dictionary includes: generating the fingerprint for the block hash; and querying the plurality of ACFs for a matching fingerprint, wherein querying the ACF includes: determining a set of filter indices identifying a corresponding set of ACF entries within which the fingerprint may be located; and comparing the fingerprint with a fingerprint stored in each of the set of ACF entries; responsive to a result of the querying of the data dictionary indicating no previous occurrence of the block hash: storing the data block to a storage location in a storage medium; storing the block hash and the storage location as a record in the active storage tablet; inserting filter construction fields derived from the block hash as a record in a filter construction array wherein the filter construction fields include: a fingerprint field for the fingerprint; a tablet index field for the tablet index; and a set of filter index fields, each of the filter index fields for one of the set of filter indices, wherein the set of filter indices includes a first filter index and a second filter index and wherein generating the filter construction fields includes: hashing the bl
Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays · CPC title
Physics · mapped topic
De-duplication techniques · CPC title
Physics · mapped topic
Saving storage space on storage systems · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.