Alignment fixing on a data protection system during continuous data replication to deduplicated storage
US-9772789-B1 · Sep 26, 2017 · US
US9921773B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9921773-B2 |
| Application number | US-201514743520-A |
| Country | US |
| Kind code | B2 |
| Filing date | Jun 18, 2015 |
| Priority date | Jun 18, 2014 |
| Publication date | Mar 20, 2018 |
| Grant date | Mar 20, 2018 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Deduplicated data storage is provided by presenting a virtual volume mapped by a translation table to a physical volume of a physical data storage system. The translation table maps sets of ranges of duplicate data blocks of the virtual volume to corresponding individual ranges of shared data blocks of the physical volume. A hash table for identifying duplicate data is indexed by a portion of a hash value calculated from newly written data blocks, and has entries each identifying an address alignment of the corresponding data block. In operation, existing entries are replaced with new entries for colliding data blocks having better address alignment, promoting wider address-space separation of the entries. Upon occurrence of a hit in the hash table, for a given data block in a range of newly written data blocks, data blocks of the range are compared to corresponding blocks in a range identified by the hit to maximize a size of a region to be identified by the translation table as duplicate data.
Opening claim text (preview).
What is claimed is: 1. A method of operating a host computer in conjunction with a physical data storage system to provide deduplicated persistent data storage to a higher-level service of the host computer, comprising: presenting a virtual volume to the service for storage of service data, the virtual volume being mapped by a translation table to a physical volume of the physical data storage system, the physical volume having a size expressed as a number of fixed-size data blocks, the translation table mapping sets of ranges of duplicate data blocks of the virtual volume to corresponding individual ranges of shared data blocks of the physical volume, the mapping being used to access ranges of the shared data blocks to complete service operations directed to corresponding ranges of the duplicate data blocks; and using a hash table to track contents of the physical volume and identify newly written data as duplicate data to be mapped by the translation table to corresponding shared data blocks of the physical volume, the hash table being indexed by a portion of a fixed-size hash value calculated from newly written data blocks and having a number of entries substantially less than the number of data blocks of the physical volume, each entry identifying an address alignment of the data block from which the entry was created, the use of the hash table including (1) replacing existing entries with respective new entries for colliding data blocks having address alignment at a higher power-of-two multiple of a block size to promote wider address-space separation of the entries, and (2) upon occurrence of a hit in the hash table for a given data block in a range of newly written data blocks, comparing the data blocks of the range to corresponding physical volume data blocks in a range identified by the hit and extending a corresponding region identified by the translation table as duplicate data to include matching ones of the data blocks to maximize a size of the corresponding region identified as duplicate data. 2. The method of claim 1 , wherein the mapping is used to divide an I/O request into multiple sub-requests for corresponding sub-regions of data, the I/O request spanning distinct ranges of the virtual volume having respective distinct mappings to the physical volume, the I/O request being divided such that the sub-requests are directed to respective ones of the distinct ranges. 3. The method of claim 1 , wherein the translation table includes entries for respective ranges, each entry having data distributed among a set of fields including a range field storing starting and ending addresses of the respective range, a type field storing a duplication type of the respective range, a reference count field storing a count of duplications of the respective range, and a mapping field storing a respective mapping between the virtual volume and the physical volume for the respective range, and wherein an entry is identified as being involved in an I/O request by comparing contents of the range field against starting and ending addresses of the I/O request, overlap indicating that the corresponding region is involved in the I/O request. 4. The method of claim 1 , wherein regions of the virtual volume are identified as respective types including deduplication source (DS) regions, deduplicated (DD) regions, moved (DM) regions, and unknown (DU) regions, the DS regions containing original or first instance data, the DD regions (a) containing data for which identical data has been found in another region and (b) being translated and having respective freed physical-volume space assigned to extend a size of the virtual volume, the DM regions being translated, and the DU regions being untranslated regions. 5. The method of claim 4 , further including region-type-dependent data writing operations including a DS region overwrite and a DD region overwrite, the DS region overwrite including (a) initially checking whether a DS region being written to has previously been translated, (b) if the DS region has not previously been translated, then identifying a free space region and translating the DS region to the free space region, and (c) if the DS region has previously been translated, then writing to the DS region without translating the DS region, the DD region overwrite including (a) identifying a free space region irrespective of a previous translation of the DD region, (b) translating the DD region to the free space region, and (c) changing the type of the DD region to DM. 6. The method of claim 1 , further including, when a certain amount of duplicate data has been found and deduplicated, extending a size of the virtual volume by appending free regions corresponding to regions of the physical volume that are unused by virtue of deduplication of the duplicate data. 7. The method of claim 1 , wherein the portion of the hash value that indexes the hash table is a first portion, and a remainder portion of the hash value is stored in a remainder portion field of each entry to provide full-width comparison of the hash value to the entries. 8. The method of claim 1 , wherein a size of the hash table measured as number of entries is less than 1/100 a size of the physical volume measured in data blocks. 9. The method of claim 1 , wherein the entries of the hash table have values used in an evaluation operation used to replace existing entries with new entries, the values including (1) a first value reflecting recent usefulness of the entries for detecting duplicates, (2) a second value reflecting address alignment of the entries, and (3) a third value reflecting aging of entries in the hash table, the evaluation operation including calculating and comparing respective evaluation values for the existing and new entries and replacing an existing entry with a new entry having a higher evaluation value. 10. The method of claim 9 , wherein (1) the first value is a count of hits identifying duplicate data based on an entry, (2) the second value is an offset address for the data block corresponding to the entry, and (3) the third value is a time value incremented when a predetermined number of writes have not resulted in replacing entries of the hash table with new entries. 11. The method of claim 1 , wherein, upon detection of a duplicate data block being written to the virtual volume, a process of extending a range of blocks including the duplicate data block is performed to maximize a size of the range for identification as a duplicate range, the process including (a) a search for adjacent duplicate data blocks for data blocks written immediately prior to the writing of the duplicate data block and not previously identified as duplicates based on the contents of the hash table, and (b) a comparison of subsequently written adjacent data blocks with corresponding existing data blocks to detect duplication without first using the hash table to identify potential hits. 12. The method of claim 11 , wherein the comparison further results in detecting non-duplicate new data blocks, and processing the non-duplicate new data blocks to create one or more new entries in the hash table distinct from an existing entry for the duplicate data block. 13. The method of claim 1 , wherein respective copies of the hash table are maintained in volatile memory and in persistent storage, and further including periodically synchronizing the two copies, the synchronizing including merging into both copies new entries having better evaluation according to an evaluation operation, the evaluation operation including factors for address alignment and age of the respective entries, the hash table being divided into sections and synchroni
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
in relation to response time · CPC title
Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title
Saving storage space on storage systems · CPC title
De-duplication techniques · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.