Reducing memory usage in storing metadata

US11797220B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11797220-B2
Application number  US-202117408007-A
Country             US
Kind code           B2
Filing date         Aug 20, 2021
Priority date       Aug 20, 2021
Publication date    Oct 24, 2023
Grant date          Oct 24, 2023

Abstract

Official abstract text for this publication.

Data is ingested from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure. After data ingestion is complete, one or more duplicate data chunks that were stored during the data ingestion are determined and a second data structure is updated to include one or more entries corresponding to one or more determined duplicate data chunks.
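The two-phase flow in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class name `ChunkStore`, the use of SHA-256 as the chunk identifier, the default threshold of 2, and the choice to point the second structure at the first chunk file seen are all assumptions made for illustration. During ingestion only the per-chunk-file structure is written; the chunk-ID-to-chunk-file structure is populated afterwards and only for chunk identifiers that appear in at least `threshold` entries, which is consistent with the title's goal of keeping that metadata small.

```python
import hashlib
from collections import defaultdict


class ChunkStore:
    """Toy in-memory model; every name here is hypothetical."""

    def __init__(self):
        # chunk_file_id -> raw bytes of the chunk file
        self.chunk_files = defaultdict(bytearray)
        # "First data structure": chunk_file_id -> [(chunk_id, offset), ...]
        self.chunk_file_metadata = {}
        # "Second data structure": chunk_id -> chunk_file_id
        self.chunk_metadata = {}

    @staticmethod
    def chunk_id(data: bytes) -> str:
        # Assumption: a SHA-256 content hash serves as the chunk identifier.
        return hashlib.sha256(data).hexdigest()

    def ingest(self, chunk_file_id: str, chunks: list[bytes]) -> None:
        """Store chunks and record only the per-chunk-file entry."""
        entry = self.chunk_file_metadata.setdefault(chunk_file_id, [])
        for data in chunks:
            offset = len(self.chunk_files[chunk_file_id])
            self.chunk_files[chunk_file_id] += data
            entry.append((self.chunk_id(data), offset))

    def record_duplicates(self, threshold: int = 2) -> None:
        """After ingestion: index chunk IDs found in >= threshold entries."""
        holders = defaultdict(list)
        for file_id, entry in self.chunk_file_metadata.items():
            # Count each chunk ID at most once per chunk-file entry.
            for cid in {cid for cid, _ in entry}:
                holders[cid].append(file_id)
        for cid, file_ids in holders.items():
            if len(file_ids) >= threshold:
                # Assumption: keep the copy in the first chunk file seen.
                self.chunk_metadata[cid] = file_ids[0]


store = ChunkStore()
store.ingest("cf1", [b"aaa", b"bbb"])
store.ingest("cf2", [b"aaa", b"ccc"])
store.record_duplicates(threshold=2)
print(len(store.chunk_metadata))  # only the chunk shared by cf1 and cf2 is indexed
```

In this sketch the chunks unique to one chunk file never get an entry in `chunk_metadata`, so the second structure grows with the number of duplicated chunks rather than with the total number of chunks.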

First claim

Opening claim text (preview).

What is claimed is:

1. A method, comprising: ingesting data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structure corresponds to a first chunk file of the one or more chunk files and associates a chunk file identifier of the first chunk file with a first set of one or more chunk identifiers associated with a first set of one or more data chunks stored in the first chunk file, wherein the first entry includes a corresponding offset for the first set of one or more chunk identifiers within the first chunk file; and after data ingestion is complete, determining one or more duplicate data chunks that were stored during the data ingestion and updating a second data structure to include one or more entries corresponding to the one or more determined duplicate data chunks, wherein the second data structure is comprised of a second plurality of entries, wherein each of the second plurality of entries associates a corresponding chunk identifier of a stored data chunk with a corresponding chunk file identifier of a chunk file storing the stored data chunk, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes identifying a threshold number of entries associated with the first data structure that include a first chunk identifier included in the first set of one or more chunk identifiers and updating the second data structure to include a new entry that associates the first chunk identifier corresponding to a first data chunk with the first chunk file storing the first data chunk.

2. The method of claim 1, wherein the plurality of data chunks are variable sized data chunks.

3. The method of claim 1, wherein the first data structure and the second data structure are stored in a memory of a storage system.

4. The method of claim 1, wherein ingesting the data from the source system includes generating a tree data structure that enables the plurality of data chunks to be located.

5. The method of claim 1, wherein ingesting the data from the source system includes generating the corresponding chunk identifiers for each of the plurality of data chunks.

6. The method of claim 1, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes: selecting the first entry of the first data structure; and determining whether the first chunk identifier associated with the first entry is a same chunk identifier associated with a threshold number of other entries of the first data structure.

7. The method of claim 6, wherein determining the one or more duplicate data chunks that were stored during the data ingestion further includes updating the second data structure to include the first entry that associates the first chunk identifier corresponding to the first data chunk with the first chunk file storing the first data chunk in response to determining that the first chunk identifier associated with the first entry is the same chunk identifier associated with the threshold number of other entries of the first data structure.

8. The method of claim 7, further comprising deleting the first data chunk corresponding to the first chunk identifier associated with the first entry from one or more chunk files corresponding to the threshold number of other entries.

9. The method of claim 8, further comprising updating the other entries to unreference the first chunk identifier associated with the first data chunk.

10. The method of claim 1, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes: selecting a second entry of the first data structure; and determining whether a corresponding chunk identifier associated with the selected second entry is a same chunk identifier associated with a threshold number of other entries of the first data structure.

11. The method of claim 10, wherein determining the one or more duplicate data chunks that were stored during the data ingestion further includes modifying the chunk identifier associated with the selected second entry of the first data structure to be a different chunk identifier in response to determining that the chunk identifier associated with the selected second entry of the first data structure is not the same chunk identifier associated with the threshold number of other entries of the first data structure.

12. The method of claim 11, further comprising updating the selected second entry of the first data structure to reference the different chunk identifier in place of the chunk identifier associated with the selected second entry of the first data structure.

13. The method of claim 12, further comprising updating a node of a tree data structure that references the chunk identifier associated with the selected second entry to reference the different chunk identifier.

14. A computer program product embodied in a non-transitory computer readable medium and comprising computer instructions for: ingesting data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structure corresponds to a first chunk file of the one or more chunk files and associates a chunk file identifier of the first chunk file with a first set of one or more chunk identifiers associated with a first set of one or more data chunks stored in the first chunk file, wherein the first entry includes a corresponding offset for the first set of one or more chunk identifiers within the first chunk file; and after data ingestion is complete, determining one or more duplicate data chunks that were stored during the data ingestion and updating a second data structure to include one or more entries corresponding to the one or more determined duplicate data chunks, wherein the second data structure is comprised of a second plurality of entries, wherein each of the second plurality of entries associates a corresponding chunk identifier of a stored data chunk with a corresponding chunk file identifier of a chunk file storing the stored data chunk, wherein determining the one or more duplicate data chunks that were stored during the data ingestion includes identifying a threshold number of entries associated with the first data structure that include a first chunk identifier included in the first set of one or more chunk identifiers and updating the second data structure to include a new entry that associates the first chunk identifier corresponding to a first data chunk with the first chunk file storing the first data chunk.

15. The computer program product of claim 14, wherein the plurality of data chunks are variable sized data chunks.

16. A system, comprising: one or more processors configured to: ingest data from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure, wherein the first data structure includes a first plurality of entries that includes a first entry, wherein the first entry of the first data structur
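The per-entry handling described in claims 6 through 12 can be sketched as a single pass over the first data structure. This is a simplified illustration under stated assumptions, not the claimed implementation: the function name `post_process`, the plain-dict layouts, the omission of offsets, and the `"<chunk_id>@<chunk_file_id>"` scheme for producing a "different chunk identifier" are all hypothetical, and each chunk identifier is assumed to appear at most once per entry. The tree-node update of claim 13 is not modeled. Instead of deleting bytes, the function returns the (chunk file, chunk identifier) pairs whose stored copies would be deleted.

```python
def post_process(first, second, threshold=1):
    """Sketch of post-ingestion handling (all names are hypothetical).

    first:  chunk_file_id -> list of chunk_ids   (offsets omitted for brevity)
    second: chunk_id -> chunk_file_id            (filled in by this function)
    Returns the (chunk_file_id, chunk_id) copies that could be deleted.
    """
    to_delete = []
    for file_id, ids in first.items():
        for i, cid in enumerate(ids):
            # Other entries of the first structure holding the same chunk ID.
            others = [f for f, other_ids in first.items()
                      if f != file_id and cid in other_ids]
            if len(others) >= threshold:
                # Duplicate: record it in the second structure (claim 7),
                # mark the other copies for deletion (claim 8) and
                # unreference the chunk ID in the other entries (claim 9).
                second[cid] = file_id
                for f in others:
                    first[f].remove(cid)
                    to_delete.append((f, cid))
            else:
                # Not a duplicate: switch the entry to a different chunk ID
                # (claims 11 and 12). The "@file" suffix is an assumption.
                ids[i] = f"{cid}@{file_id}"
    return to_delete


first = {"cf1": ["a", "b"], "cf2": ["a", "c"]}
second = {}
deleted = post_process(first, second, threshold=1)
print(first, second, deleted)
```

With `threshold=1`, chunk `a` is shared by both chunk files, so it is the only identifier recorded in `second`; `b` and `c` are stored once each and are re-labelled in place.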

Assignees

Cohesity Inc

Inventors

Classifications

  • G06F3/0655 (Primary)

    Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices · CPC title

  • Saving storage space on storage systems · CPC title

  • Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket · CPC title

  • Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP] · CPC title

  • G06F3/0641 (Primary)

    De-duplication techniques · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11797220B2 cover?
Data is ingested from a source system including by storing a plurality of data chunks in one or more chunk files and storing corresponding chunk identifiers associated with the plurality of data chunks in a first data structure. After data ingestion is complete, one or more duplicate data chunks that were stored during the data ingestion are determined and a second data structure is updated to …
Who is the assignee on this patent?
Cohesity Inc
What technology area does this patent fall under?
Primary CPC classification G06F3/0655. Mapped technology areas include Physics.
When was this patent published?
Publication date Oct 24, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).