Method and system for maintaining persistent live segment records for garbage collection
US-9715505-B1 · Jul 25, 2017 · US
US11307765B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11307765-B2 |
| Application number | US-201916356818-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 18, 2019 |
| Priority date | Jul 27, 2015 |
| Publication date | Apr 19, 2022 |
| Grant date | Apr 19, 2022 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Data in a storage system is deduplicated after receiving from at least one writing entity requests for a plurality of write operations for a corresponding plurality of data blocks in a storage object. The received blocks are buffered and sorted in order and a sequence of clumps is created from the buffered blocks, where each clump comprises a grouping of at least one of the sorted, buffered blocks. A boundary is determined between at least one pair of clumps based at least in part on the content of at least one of the buffered blocks, and it is then determined whether at least one of the clumps is a duplicate of a previously stored clump.
Opening claim text (preview).
The invention claimed is: 1. A method for managing data in a storage system comprising: receiving, from at least one writing entity, requests for a plurality of write operations; wherein the plurality of write operations are for writing a plurality of received blocks to storage; wherein each block of the plurality of received blocks is associated with a Logical Block Address (LBA); creating a sequence of clumps from the plurality of received blocks, wherein each clump comprises a group of blocks, and wherein a first clump includes a first group of blocks in which the LBAs of the blocks are not contiguous in LBA space, wherein creating the sequence of clumps comprises: compressing each of the plurality of received blocks, sorting the plurality of compressed blocks in LBA order, and packing groups of blocks into corresponding clumps; creating a data structure for storing LBA-to-clump mappings; for each block of the plurality of received blocks, storing a corresponding entry in the data structure; within each entry in the data structure, storing: the LBA that is associated with the received block to which the entry corresponds, and a reference tuple that includes a clump identifier for the clump that stores data for the corresponding LBA of the entry and a position of the block within the clump; and after creating the data structure, using entries in the data structure to determine the clumps that store data for the received blocks. 2. The method of claim 1 wherein: the plurality of received blocks includes a particular block; the particular block is associated with a particular LBA; within the sequence of clumps, data for the particular block is stored in a particular clump; storing a corresponding entry in the data structure includes storing, within the data structure, a particular entry for the particular LBA; wherein the particular entry includes the particular LBA and a particular reference; and the particular reference includes: a particular clump identifier that identifies the particular clump; and a particular block position that indicates a position, within the particular clump, of the data that corresponds to the particular LBA. 3. The method of claim 2 wherein: the particular reference is a first reference associated with a first snapshot; the particular clump was written to storage during the first snapshot; the particular entry includes a second reference associated with a second snapshot; new data associated with the particular LBA was written to storage during the second snapshot; and the second reference includes: a second clump identifier that identifies a second clump that contains the new data; and a second block position that indicates a position, within the second clump, of the new data. 4. The method of claim 3 wherein the particular entry includes an indication of a snapshot ID of a snapshot during which the particular reference was created. 5. The method of claim 1 , in which the clump identifier stored in each reference is a clump fingerprint. 6. The method of claim 5 , in which the fingerprint is a hash value of the contents of the clump. 7. The method of claim 3 , further comprising scanning entries of the data structure to identify, as dead data blocks, those data blocks that are referenced by stale references, wherein a stale reference is a reference that (a) is associated with an LBA whose entry includes multiple references; and (b) is associated with a snapshot that is older than at least one other snapshot identified in the entry for the LBA. 8. The method of claim 7 , further comprising repacking clumps so as to retain only clumps having no dead data blocks. 9. The method of claim 1 , further comprising storing the data structure as a Log-Structured Merge tree. 10. One or more non-transitory computer-readable media storing instructions for managing data in a storage system, wherein the instructions, when executed by one or more processors, cause: receiving, from at least one writing entity, requests for a plurality of write operations; wherein the plurality of write operations are for writing a plurality of received blocks to storage; wherein each block of the plurality of received blocks is associated with a Logical Block Address (LBA); creating a sequence of clumps from the plurality of received blocks, wherein each clump comprises a group of blocks, and wherein a first clump includes a first group of blocks in which the LBAs of the blocks are not contiguous in LBA space, wherein creating the sequence of clumps comprises: compressing each of the plurality of received blocks; sorting the plurality of compressed blocks in LBA order; and packing groups of blocks into corresponding clumps; creating a data structure for storing LBA-to-clump mappings; for each block of the plurality of received blocks, storing a corresponding entry in the data structure; within each entry in the data structure, storing: the LBA that is associated with the received block to which the entry corresponds, and a reference tuple that includes a clump identifier for the clump that stores data for the corresponding LBA of the entry and a position of the block within the clump; and after creating the data structure, using entries in the data structure to determine the clumps that store data for the received blocks. 11. The one or more non-transitory computer-readable media of claim 10 wherein: the plurality of received blocks includes a particular block; the particular block is associated with a particular LBA; within the sequence of clumps, data for the particular block is stored in a particular clump; storing a corresponding entry in the data structure includes storing, within the data structure, a particular entry for the particular LBA; wherein the particular entry includes the particular LBA and a particular reference; and the particular reference includes: a particular clump identifier that identifies the particular clump; and a particular block position that indicates a position, within the particular clump, of the data that corresponds to the particular LBA. 12. The one or more non-transitory computer-readable media of claim 11 wherein: the particular reference is a first reference associated with a first snapshot; the particular clump was written to storage during the first snapshot; the particular entry includes a second reference associated with a second snapshot; new data associated with the particular LBA was written to storage during the second snapshot; and the second reference includes: a second clump identifier that identifies a second clump that contains the new data; and a second block position that indicates a position, within the second clump, of the new data. 13. The one or more non-transitory computer-readable media of claim 12 wherein the particular entry includes an indication of a snapshot ID of a snapshot during which the particular reference was created. 14. The one or more non-transitory computer-readable media of claim 10 , in which the clump identifier stored in each reference is a clump fingerprint. 15. The one or more non-transitory computer-readable media of claim 14 , in which the fingerprint is a hash value of the contents of the clump. 16. The one or more non-transitory computer-readable media of claim 12 , wherein execution of the instructions further causes scanning entries of the data structure to identify, as dead data blocks, those data blocks that are referenced by stale references, wherein a stale reference is a reference that (a) is associated with an L
Saving storage space on storage systems · CPC title
De-duplication techniques · CPC title
Digital input from, or digital output to, record carriers {, e.g. RAID, emulated record carriers or networked record carriers} · CPC title
Clustering or classification · CPC title
Indexing; Data structures therefor; Storage structures · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.