De-duplication systems and methods for application-specific data
US-2016306708-A1 · Oct 20, 2016 · US
US10353884B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10353884-B2 |
| Application number | US-201715601388-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 22, 2017 |
| Priority date | Dec 3, 2014 |
| Publication date | Jul 16, 2019 |
| Grant date | Jul 16, 2019 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Multiple key-value stores may be employed to smooth out random updates (based on the extent ID) to the EMAP database. The updates to the EMAP database occur in a two-stage manner: (i) using an append-only log store for the first stage and (ii) using an on-disk hash store for the second stage. The append-only log store is used to convert the random updates to sequential write operations on the EMAP database. Once full, the contents of the log store are sorted and moved to the on-disk hash store, which holds the updates for a transient period of time to enable batching of the updates. Once sufficient batching of the extent map entries are accumulated, those entries are sorted and moved to the EMAP database. Thereafter, the EMAP database can be scanned to find extent map entries having identical checksum bits to perform data deduplication.
Opening claim text (preview).
What is claimed is: 1. A method comprising: appending extent identifiers of extents of a storage system into a log; based upon a threshold amount of the log being populated with extent identifiers, sorting and moving the extent identifiers into a hash store until expiration of a time period; sorting the extent identifiers in the hash store based upon expiration of the time period; and moving the extent identifiers from the hash store into a mapping structure mapping extent identifiers to storage location identifiers of corresponding extents. 2. The method of claim 1 , further comprising prioritizing deduplication of extents in the storage system based on the mapping structure. 3. The method of claim 1 , wherein an extent identifier comprises a checksum generated from an extent and a counter, wherein the counter distinguishes matching checksums. 4. The method of claim 1 , wherein the log comprises a key-value store and the hash store comprises a second key-value store, wherein the extent identifiers are the keys and storage location identifiers are the values. 5. The method of claim 1 , wherein the storage location identifiers are physical volume block numbers. 6. The method of claim 1 , wherein the appending comprises appending extent entries into the log, wherein an extent entry comprises an extent identifier and a corresponding storage location identifier. 7. The method of claim 6 , wherein moving the extent identifier from the hash store comprises: based on a determination that a node of the mapping structure includes the extent entry with a checksum of the extent identifier of the extent entry, inserting the extent entry from the hash store into the node and updating a score for the node based on a number of the extent entries in the node with a same checksum. 8. The method of claim 7 , comprising marking a dirty bit for a second node based on a determination that the second node has a second extent entry with a second checksum matching the checksum of the extent entry being inserted into the node. 9. The method of claim 1 , further comprising searching the log with a partial cuckoo hash of an extent based on a write operation to the storage system and performing deduplication based on finding a match. 10. A non-transitory computer-readable medium comprising instructions for performing a method, which when executed by a machine, causes the machine to: append extent identifiers of extents of a storage system into a first structure; based upon a threshold amount of the log being populated with extent identifiers, sort and move the extent identifiers into a second structure until expiration of a time period; sort the extent identifiers in the second structure based upon expiration of the time period; and update an extent map database with the extent identifiers from the second structure. 11. The non-transitory computer-readable medium of claim 10 , wherein the instructions cause the machine to prioritize deduplication of extents in the storage system based on the extent map database. 12. The non-transitory computer-readable medium of claim 10 , wherein an extent identifier comprises a checksum generated from an extent and a counter, wherein the counter distinguishes matching checksums. 13. The non-transitory computer-readable medium of claim 10 , wherein the first structure comprises a key-value store and the second structure comprises a second key-value store, wherein the extent identifiers are the keys and storage location identifiers are the values. 14. The non-transitory computer-readable medium of claim 13 , wherein the storage location identifiers are physical volume block numbers. 15. The non-transitory computer-readable medium of claim 10 , wherein the instructions cause the machine to: append extent entries into the first structure, wherein an extent entry comprises an extent identifier and a corresponding storage location identifier. 16. The non-transitory computer-readable medium of claim 15 , wherein the instructions cause the machine to: based on a determination that a node of an extent map database does not include the extent entry with a checksum of the extent identifier, inserting the extent entry from the first structure into a node of the extent map database. 17. The non-transitory computer-readable medium of claim 16 , wherein the instructions cause the machine to mark a dirty bit for a second node based on a determination that the second node has a second extent entry with a second checksum matching the checksum of the extent entry being inserted into the node. 18. The non-transitory computer-readable medium of claim 10 , wherein the instructions cause the machine to search the first structure with a partial cuckoo hash of an extent based on a write operation to the storage system and to perform deduplication based on finding a match. 19. A computing device comprising: a processor; a computer-readable medium comprising instructions executable by the processor to cause the processor to: append extent identifiers of extents of a storage system into a first structure; based upon a threshold amount of the log being populated with extent identifiers, sort and move the extent identifiers into a second structure until expiration of a time period; sort the extent identifiers in the second structure based upon expiration of the time period; and update an extent map database with the extent identifiers from the second structure. 20. The computing device of claim 19 , wherein the instructions cause the processor to search the second structure with a partial cuckoo hash of an extent based on a write operation to the storage system and to perform deduplication based on finding a match.
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409) · CPC title
Hash-based (content-based indexing of textual data G06F16/31) · CPC title
Saving storage space on storage systems · CPC title
Trees, e.g. B+trees · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.