Inline and post-process data deduplication for a file system
US-2021109900-A1 · Apr 15, 2021 · US
US11797220B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11797220-B2 |
| Application number | US-202117408007-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 20, 2021 |
| Priority date | Aug 20, 2021 |
| Publication date | Oct 24, 2023 |
| Grant date | Oct 24, 2023 |
Official abstract text for this publication.
Data is ingested from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure. After data ingestion is complete, one or more duplicate data chunks that were stored during the data ingestion are determined and a second data structure is updated to include one or more entries corresponding to one or more determined duplicate data chunks.
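The two phases in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ingest`, `find_duplicates`, `first`, `second`, and `CHUNKS_PER_FILE` are hypothetical, SHA-1 is an assumed identifier scheme, and chunk files are modeled as in-memory byte arrays.

```python
import hashlib
from collections import defaultdict

# All names here are hypothetical; the source only speaks of a "first" and a
# "second" data structure.
CHUNKS_PER_FILE = 2  # assumed packing limit, chosen only for illustration


def ingest(chunks):
    """Store chunks in chunk files and record (chunk_id, offset) per chunk file."""
    chunk_files = {}  # chunk_file_id -> bytearray holding the chunk data
    first = {}        # chunk_file_id -> list of (chunk_id, offset)
    file_id = 0
    for i, data in enumerate(chunks):
        if i % CHUNKS_PER_FILE == 0:  # start a new chunk file
            file_id += 1
            chunk_files[file_id] = bytearray()
            first[file_id] = []
        chunk_id = hashlib.sha1(data).hexdigest()  # assumed identifier scheme
        offset = len(chunk_files[file_id])
        chunk_files[file_id] += data
        first[file_id].append((chunk_id, offset))
    return chunk_files, first


def find_duplicates(first, second, threshold=2):
    """After ingestion: record chunk ids that appear in >= threshold entries."""
    holders = defaultdict(list)  # chunk_id -> chunk_file_ids whose entry lists it
    for file_id, entry in first.items():
        for chunk_id in {cid for cid, _ in entry}:
            holders[chunk_id].append(file_id)
    for chunk_id, file_ids in holders.items():
        if len(file_ids) >= threshold and chunk_id not in second:
            # Assumption: the earliest chunk file keeps the canonical copy.
            second[chunk_id] = file_ids[0]
    return second
```

Nothing is compared against `second` while data is being written in this sketch; duplicates are only looked for once `ingest` has returned, which mirrors the "after data ingestion is complete" ordering in the abstract.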
Opening claim text (preview).
What is claimed is:
1. A method, comprising: ingesting data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structure corresponds to a first chunk file of the one or more chunk files and associates a chunk file identifier of the first chunk file with a first set of one or more chunk identifiers associated with a first set of one or more data chunks stored in the first chunk file, wherein the first entry includes a corresponding offset for the first set of one or more chunk identifiers within the first chunk file; and after data ingestion is complete, determining one or more duplicate data chunks that were stored during the data ingestion and updating a second data structure to include one or more entries corresponding to the one or more determined duplicate data chunks, wherein the second data structure is comprised of a second plurality of entries, wherein each of the second plurality of entries associates a corresponding chunk identifier of a stored data chunk with a corresponding chunk file identifier of a chunk file storing the stored data chunk, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes identifying a threshold number of entries associated with the first data structure that include a first chunk identifier included in the first set of one or more chunk identifiers and updating the second data structure to include a new entry that associates the first chunk identifier corresponding to a first data chunk with the first chunk file storing the first data chunk.
2. The method of claim 1, wherein the plurality of data chunks are variable sized data chunks.
3. The method of claim 1, wherein the first data structure and the second data structure are stored in a memory of a storage system.
4. The method of claim 1, wherein ingesting the data from the source system includes generating a tree data structure that enables the plurality of data chunks to be located.
5. The method of claim 1, wherein ingesting the data from the source system includes generating the corresponding chunk identifiers for each of the plurality of data chunks.
6. The method of claim 1, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes: selecting the first entry of the first data structure; and determining whether the first chunk identifier associated with the first entry is a same chunk identifier associated with a threshold number of other entries of the first data structure.
7. The method of claim 6, wherein determining the one or more duplicate data chunks that were stored during the data ingestion further includes updating the second data structure to include the first entry that associates the first chunk identifier corresponding to the first data chunk with the first chunk file storing the first data chunk in response to determining that the first chunk identifier associated with the first entry is the same chunk identifier associated with the threshold number of other entries of the first data structure.
8. The method of claim 7, further comprising deleting the first data chunk corresponding to the first chunk identifier associated with the first entry from one or more chunk files corresponding to the threshold number of other entries.
9. The method of claim 8, further comprising updating the other entries to unreference the first chunk identifier associated with the first data chunk.
10. The method of claim 1, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes: selecting a second entry of the first data structure; and determining whether a corresponding chunk identifier associated with the selected second entry is a same chunk identifier associated with a threshold number of other entries of the first data structure.
11. The method of claim 10, wherein determining the one or more duplicate data chunks that were stored during the data ingestion further includes modifying the chunk identifier associated with the selected second entry of the first data structure to be a different chunk identifier in response to determining that the chunk identifier associated with the selected second entry of the first data structure is not the same chunk identifier associated with the threshold number of other entries of the first data structure.
12. The method of claim 11, further comprising updating the selected second entry of the first data structure to reference the different chunk identifier in place of the chunk identifier associated with the selected second entry of the first data structure.
13. The method of claim 12, further comprising updating a node of a tree data structure that references the chunk identifier associated with the selected second entry to reference the different chunk identifier.
14. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: ingesting data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structure corresponds to a first chunk file of the one or more chunk files and associates a chunk file identifier of the first chunk file with a first set of one or more chunk identifiers associated with a first set of one or more data chunks stored in the first chunk file, wherein the first entry includes a corresponding offset for the first set of one or more chunk identifiers within the first chunk file; and after data ingestion is complete, determining one or more duplicate data chunks that were stored during the data ingestion and updating a second data structure to include one or more entries corresponding to the one or more determined duplicate data chunks, wherein the second data structure is comprised of a second plurality of entries, wherein each of the second plurality of entries associates a corresponding chunk identifier of a stored data chunk with a corresponding chunk file identifier of a chunk file storing the stored data chunk, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes identifying a threshold number of entries associated with the first data structure that include a first chunk identifier included in the first set of one or more chunk identifiers and updating the second data structure to include a new entry that associates the first chunk identifier corresponding to a first data chunk with the first chunk file storing the first data chunk.
15. The computer program product of claim 14, wherein the plurality of data chunks are variable sized data chunks.
16. A system, comprising: one or more processors configured to: ingest data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structur
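The cleanup steps recited in claims 6 through 9 can be read as the sketch below. It is an illustration under assumptions, not the claimed implementation: `dedupe_entry` and its parameters are hypothetical names, the first data structure is modeled as a mapping from chunk file identifier to `{chunk_id: offset}`, and each chunk file is modeled as a mapping from offset to chunk bytes so that deletion is a single dictionary operation.

```python
def dedupe_entry(first, second, chunk_files, file_id, chunk_id, threshold=1):
    """Keep the copy of `chunk_id` in `file_id`; drop copies held by other entries.

    first:       chunk_file_id -> {chunk_id: offset}   (assumed layout)
    second:      chunk_id -> chunk_file_id
    chunk_files: chunk_file_id -> {offset: chunk bytes} (assumed layout)
    """
    # Claim 6: is the same chunk id associated with enough *other* entries?
    others = [fid for fid, entry in first.items()
              if fid != file_id and chunk_id in entry]
    if len(others) < threshold:
        return False

    # Claim 7: associate the chunk id with the chunk file that keeps it.
    second[chunk_id] = file_id

    for fid in others:
        offset = first[fid].pop(chunk_id)  # claim 9: unreference the chunk id
        del chunk_files[fid][offset]       # claim 8: delete the duplicate chunk
    return True
```

A caller would run this for each selected entry once ingestion has finished. The sketch does not cover the branch in claims 10 through 13, where an identifier that falls below the threshold is changed to a different identifier.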
Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices · CPC title
Saving storage space on storage systems · CPC title
Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket · CPC title
Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP] · CPC title
De-duplication techniques · CPC title