Data deduplication in a storage system

US11789917B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11789917-B2
Application numberUS-202217583365-A
CountryUS
Kind codeB2
Filing dateJan 25, 2022
Priority dateJan 25, 2022
Publication dateOct 17, 2023
Grant dateOct 17, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A storage control system receives a first data block to be written to a primary storage, and generates a content signature for the first data block. The storage control system adds a first entry for the first data block into a persistent deduplication database. The first entry comprises a key which comprises the content signature for the first data block. The persistent deduplication database comprises a tree data structure which comprises elements that are configured to store entries for data blocks. The storage control system merges the entries of at least two elements of the tree data structure to generate a set of merged entries which comprises the first entry for the first data block, and a second entry for a second data block, and commences a deduplication process in response to determining that the first entry and the second entry in the set of merged entries have matching keys.

First claim

Opening claim text (preview).

What is claimed is: 1. A method, comprising: receiving, by a storage control system, a first data block to be written to a primary storage; generating, by the storage control system, a content signature for the first data block; adding, by the storage control system, a first entry for the first data block into a persistent deduplication database, wherein the first entry for the first data block comprises a key which comprises the content signature for the first data block, and wherein the persistent deduplication database comprises a tree data structure which comprises elements that are configured to store entries for data blocks; merging, by the storage control system, the entries of at least two elements of the tree data structure to generate a set of merged entries which comprises the first entry for the first data block; determining, by the storage control system, whether the set of merged entries includes a second data block having a second entry with a key that matches the key of the first entry of the first data block; and commencing, by the storage control system, a deduplication process to determine if the first entry and the second entry correspond to duplicate data blocks, in response to determining that the first entry and the second entry in the set of merged entries have matching keys. 2. The method of claim 1 , wherein the tree data structure comprises a log-structured merge tree data structure, and wherein the elements comprise segments of the log-structured merge tree. 3. The method of claim 1 , wherein the tree data structure comprises a B-epsilon tree data structure, and wherein the elements comprise message buffers of internal nodes of the B-epsilon tree data structure. 4. The method of claim 1 , further comprising assigning, by the storage control system, a first virtual address for a logical address of the first data block, wherein the first entry further comprises the first virtual address and the logical address of the first data block, wherein the content signature and the first virtual address comprise a key:value pair for the first entry. 5. The method of claim 4 , further comprising: performing, by the storage control system, the deduplication process to determine whether the first data block comprises a same or similar content as the second data block; and in response to determining that the first data block comprises the same or similar content as the second data block: utilizing, by the storage control system, the logical address included in the first entry for the first data block, and a second virtual address included in the second entry for the second data block to update a logical-to-virtual pointer from the logical address of the first data block to the second virtual address included in the second entry for the second data block; and incrementing, by the storage control system, a reference count for the second virtual address. 6. The method of claim 1 , wherein generating the content signature for the first data block comprises generating a cryptographic hash value for content of the first data block. 7. The method of claim 1 , further comprising: writing, by the storage control system, the first data block to a persistent write cache; returning, by the storage control system, an acknowledge message to indicate that the first data block is written to the primary storage, in response to the first data block being stored in the persistent write cache; and destaging, by the storage control system, the first data block from the persistent write cache without writing the first data block to the primary storage, in response to the deduplication process validating that the first data block is a duplicate of existing data block in the primary storage. 8. The method of claim 1 , further comprising: searching, by the storage control system, a cached deduplication database to determine if the cached deduplication database comprises a cached entry having a content signature which matches the content signature of the first data block; in response to the storage control system determining that there is no cached entry having a content signature which matches the content signature of the first data block, the storage control system adds the first entry for the first data block into the persistent deduplication database; and in response to the storage control system identifying a given cached entry having a content signature which matches the content signature of the first data block, the storage control system performing an inline deduplication process to determine whether the first data block comprises a same or similar content as a third data block which is associated with the given cached entry. 9. An article of manufacture comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code is executable by one or more processors to implement a method which comprises: receiving, by a storage control system, a first data block to be written to a primary storage; generating, by the storage control system, a content signature for the first data block; adding, by the storage control system, a first entry for the first data block into a persistent deduplication database, wherein the first entry for the first data block comprises a key which comprises the content signature for the first data block, and wherein the persistent deduplication database comprises a tree data structure which comprises elements that are configured to store entries for data blocks; merging, by the storage control system, the entries of at least two elements of the tree data structure to generate a set of merged entries which comprises the first entry for the first data block; determining, by the storage control system, whether the set of merged entries includes a second data block having a second entry with a key that matches the key of the first entry of the first data block; and commencing, by the storage control system, a deduplication process to determine if the first entry and the second entry correspond to duplicate data blocks, in response to determining that the first entry and the second entry in the set of merged entries have matching keys. 10. The article of manufacture of claim 9 , wherein the tree data structure comprises a log-structured merge tree data structure, and wherein the elements comprise segments of the log-structured merge tree. 11. The article of manufacture of claim 9 , wherein the tree data structure comprises a B-epsilon tree data structure, and wherein the elements comprise message buffers of internal nodes of the B-epsilon tree data structure. 12. The article of manufacture of claim 9 , further comprising program code which is executable by the one or more processors to implement a method which comprises assigning, by the storage control system, a first virtual address for a logical address of the first data block, wherein the first entry further comprises the first virtual address and the logical address of the first data block, wherein the content signature and the first virtual address comprise a key:value pair for the first entry. 13. The article of manufacture of claim 12 , further comprising program code which is executable by the one or more processors to implement a method which comprises: performing, by the storage control system, the deduplication process to determine whether the first data block comprises a same or similar content as the second data block; and in response to determining that the first data block comprises the same or similar content as the second data block: utilizing, by the storage control system

Assignees

Inventors

Classifications

  • G06F16/215Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Trees, e.g. B+trees · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • De-duplication techniques · CPC title

  • Caching, prefetching or hoarding of files · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11789917B2 cover?
A storage control system receives a first data block to be written to a primary storage, and generates a content signature for the first data block. The storage control system adds a first entry for the first data block into a persistent deduplication database. The first entry comprises a key which comprises the content signature for the first data block. The persistent deduplication database c…
Who is the assignee on this patent?
Dell Products Lp
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Oct 17 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).