De-duplication systems and methods for application-specific data
US-2016306708-A1 · Oct 20, 2016 · US
US9659047B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9659047-B2 |
| Application number | US-201414559317-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 3, 2014 |
| Priority date | Dec 3, 2014 |
| Publication date | May 23, 2017 |
| Grant date | May 23, 2017 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
An extent map (EMAP) database may include one or more extent map entries configured to map extent IDs to PVBNs. Each extent ID may be apportioned into a most significant bit (MSB) portion, i.e., checksum bits, and a least significant bit (LSB) portion, i.e., duplicate bits. A hash may be applied to the data of the extent to calculate the checksum bits, which illustratively represent a fingerprint of the data. The duplicate bits may be configured to denote any reoccurrence of the checksum bits in the EMAP database, i.e., whether there is an existing extent with potentially identical data in a volume of the aggregate. Each extent map entry may be inserted on a node having one or more key/value pairs, wherein the key is the extent ID and the value is the PVBN. The EMAP database may be scanned and utilized to perform data deduplication.
Opening claim text (preview).
What is claimed is: 1. A system, comprising: a processor; and a non-transitory computer readable medium comprising program instructions executable by the processor to cause the system to: generate a checksum for data associated with a write request; determine if a node of a database includes one or more extent entries with a checksum matching the generated checksum; based on a determination that a node of the database includes one or more existing extent entries with a checksum matching the generated checksum, insert a new extent entry into the node, the new extent entry indicating the checksum and a value that is greater than a highest duplicate value of the one or more extent entries, wherein each of the one or more extent entries indicates the checksum and a different duplicate value; based on a determination that a node of the database does not include an extent entry with a checksum matching the generated checksum, select a node of the database based, at least in part, on the generated checksum; insert a new extent entry into the selected node of the database with the inserted, new extent entry indicating the checksum and a duplicate value that indicates the checksum has no duplicates in the database; scan the database to determine scores assigned to at least a subset of nodes of the database; for each of the nodes with a score that satisfies a threshold: identify a plurality of extent entries of the node having checksums that are identical, and perform data deduplication for data associated with the identified extent entries. 2. The system of claim 1 , wherein the program instructions further comprise program instructions to use an extent, corresponding to a first extent entry of the plurality of extent entries with a duplicate value indicating no duplicates, as a donor extent and one or more extents corresponding to the remaining extent entries of the plurality of extent entries as recipient extents. 3. The system of claim 1 , wherein the database is a B+ tree data structure. 4. A method, comprising: generating a first checksum for data associated with a write request; determining if a node of a database includes a set of one or more extent entries each having the first checksum; based on a determination that the node includes the set of extent entries each having the first checksum inserting a new extent entry into the node, the new extent entry having the first checksum and a duplicate value that is greater than a highest of duplicate values in the set of extent entries, wherein each extent entry of the set of extent entries has a different duplicate value; and based on a determination that a node of the database does not include an extent entry having the first checksum, selecting a node of the database based, at least in part, on the first checksum; inserting a new extent entry into the selected node with the inserted, new extent entry indicating the first checksum and a duplicate value that indicates the first checksum has no duplicates in the database; scanning the database to determine scores assigned to the nodes of the database; for each of the nodes with a score that satisfies a threshold: identifying a plurality of extent entries of the node having checksums that are identical, and performing data deduplication for data associated with the identified extent entries. 5. The method of claim 4 , further comprising using an extent, corresponding to a first extent entry of the plurality of extent entries, as a donor extent and extents corresponding to the remaining ones of the plurality of extent entries as recipient extents. 6. The method of claim 4 , wherein the database is a B+ tree data structure. 7. A non-transitory computer readable storage medium containing executable program instructions for execution by a processor comprising program instructions to: generate a checksum for data associated with a write request; determine if a node of a database includes one or more extent entries with a checksum matching the generated checksum; insert a new extent entry into the node that includes the one or more extent entries with the matching checksum, the new extent entry indicating the checksum and a duplicate value that is greater than a highest duplicate value of the one or more extent entries, based on a determination that the node includes one or more extent entries with a checksum matching the generated checksum; based on a determination that a node does not include an extent entry with a checksum matching the generated checksum, select a node of the database based, at least in part, on the generated checksum; insert a new extent entry into the selected node with the inserted, new extent entry indicating the checksum and a duplicate value that indicates that checksum has no duplicates in the database; scan the database to determine scores assigned to at least a subset of nodes of the database; for each of the nodes with a score that satisfies a threshold: identify a plurality of extent entries of the node haying the checksums that are identical, and perform data deduplication for data associated with the identified extent entries. 8. The non-transitory computer readable storage medium of claim 7 , further comprising program instructions to use an extent, corresponding to a first extent entry of the plurality of extent entries with a duplicate value indicating no duplicates, as a donor extent and one or more extents corresponding to the remaining extent entries as recipient extents. 9. The system of claim 1 , wherein the checksum and the duplicate value of each extent entry in the database is used as an extent identifier. 10. The system of claim 1 , wherein the program instructions further comprise program instructions executable by the processor to cause the system to: based on insertion of the new extent entry into the node with one or more extent entries with a checksum matching the generated checksum, update a marker for the node that indicates the node includes duplicate checksums across extent entries if not already set; and set a score value for the node to indicate the duplicate value of the new extent entry or a number of extent entries of the node with the checksum matching the generated checksum. 11. The system of claim 1 , wherein the program instructions further comprise program instructions executable by the processor to cause the system to map the checksum and the duplicate value to a physical volume block number corresponding to the write request. 12. The computer readable storage medium of claim 7 , wherein the checksum and the duplicate value of each extent entry in the database is used as an extent identifier. 13. The computer readable storage medium of claim 7 , wherein the program instructions further comprise program instructions to: based on insertion of the new extent entry into the node with one or more extent entries with a checksum matching the generated checksum, update a marker for the node that indicates the node includes duplicate checksums across extent entries if not already set; and set a score value for the node to indicate the duplicate value of the new extent entry or a number of extent entries of the node with the checksum matching the generated checksum. 14. The computer readable storage medium of claim 7 , further comprising program instructions to map the checksum and the duplicate value to a physical volume block number corresponding to the write request. 15. The method of claim 4 , wherein the checksum and the duplicate value of each extent entry in the database forms an extent identifier.
Saving storage space on storage systems · CPC title
Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409) · CPC title
Trees, e.g. B+trees · CPC title
De-duplication techniques · CPC title
Hash tables · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.