Techniques for de-duplicating data storage systems using a segmented index

US10614036B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-10614036-B1
Application numberUS-201615394376-A
CountryUS
Kind codeB1
Filing dateDec 29, 2016
Priority dateDec 29, 2016
Publication dateApr 7, 2020
Grant dateApr 7, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques have been provided for storing data in a de-duplicated manner on a data storage system in a manner that allows for real-time reference to an index that is too large to fit within memory. This may be accomplished by segmenting the index into smaller segments, stored on disk. Only a subset of the segments may be loaded into memory at a given time. A predictive filter is stored in memory for each segment, allowing a de-duplication driver to quickly predict whether any given new block is likely to be indexed by each segment. Since identical blocks are often stored in long identical sequences (e.g., upon copying a disk image to a disk for a virtual machine), once a segment stored on disk is referenced many times in a short period, it is loaded into memory to allow the remainder of the long sequence to be de-duplicated.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of storing data in a de-duplicated manner on a data storage system, the method comprising: receiving a block of data to be stored to a logical address of the data storage system; filtering the block of data against each of a plurality of predictive filters to predict whether the block of data is likely to be indexed on each respective segment of a plurality of segments of a de-duplication index, the plurality of segments being partitioned into a first set of segments stored within system memory of the data storage system and a second set of segments not stored within system memory of the data storage system, each of the plurality of predictive filters, including the predictive filters corresponding to the segments of the second set, being stored within system memory, wherein filtering includes reading the plurality of predictive filters from system memory; in response to determining that the block of data is predicted to likely be indexed by each of a subset of the segments stored within system memory, checking whether the block of data is actually indexed by any of the segments of the subset; and in response to determining that the block of data is actually indexed by a particular segment of the subset of the segments stored within system memory, the particular segment indicating a location on the data storage system where the data block was previously stored: causing the logical address of the data storage system to point to the indicated location on the data storage system where the data block was previously stored; and refraining from writing the received block of data to the logical address of the data storage system, wherein filtering the block of data against each of the plurality of predictive filters includes updating a respective counter for each segment based on a prediction result of the predictive filter corresponding to that segment, and wherein the method further comprises swapping a set of the segments out of system memory based at least in part on the respective counters of the set of the segments. 2. The method of claim 1 wherein filtering the block of data against each of the plurality of predictive filters further includes: hashing the block of data using a predetermined hash function to yield a hashed value; and applying each predictive filter to the hashed value to yield a binary prediction about whether or not the block of data is likely to be indexed on a segment corresponding to that predictive filter. 3. The method of claim 2 wherein: filtering the block of data against each of the plurality of predictive filters further includes updating the respective counter for each segment based on whether the predictive filter corresponding to that segment yielded an affirmative or negative prediction value; and the method further comprises: comparing the respective counters for each segment of the plurality of segments; swapping a first segment from the second set of segments into system memory; and swapping a second segment from the first set of segments out of system memory, the second segment having a smaller counter than the first segment. 4. The method of claim 3 wherein updating the respective counter for each segment based on whether the predictive filter corresponding to that segment yielded an affirmative or negative prediction value includes: incrementing the respective counter for segments whose predictive filter yielded an affirmative prediction value; and decrementing the respective counter for segments whose predictive filter yielded a negative prediction value. 5. The method of claim 3 wherein the method further comprises: comparing the respective counters for each segment of the plurality of segments to a threshold minimum; and in response to determining that a counter for a third segment has a value smaller than the threshold minimum, deleting the third segment from the data storage system. 6. The method of claim 2 wherein each predictive filter is a Bloom filter. 7. The method of claim 2 wherein the method further comprises: receiving another block of data to be stored to another logical address of the data storage system; hashing the other block of data using the predetermined hash function to yield another hashed value; applying each predictive filter to the other hashed value to yield another binary prediction about whether or not the other block of data is likely to be indexed on each segment corresponding to that predictive filter; and in response to determining that the other block of data is not predicted to be indexed by any segment of the first set of segments stored within system memory: writing the other block of data to a new location on the data storage system; adding the other hashed value and a pointer to the new location to an open segment of the first set of segments stored within system memory; and updating the predictive filter for the open segment to account for the other hashed value having been added. 8. The method of claim 7 wherein the method further comprises: in response to adding the other hashed value and the pointer to the new location to the open segment, determining that the open segment has become full; in response to determining that the open segment has become full, closing the open segment and creating a new open segment as part of the de-duplication index within the first set stored within system memory. 9. The method of claim 2 wherein the method further comprises: receiving another block of data to be stored to another logical address of the data storage system; hashing the other block of data using the predetermined hash function to yield another hashed value; applying each predictive filter to the other hashed value to yield another binary prediction about whether or not the other block of data is likely to be indexed on each segment corresponding to that predictive filter; in response to determining that the other block of data is predicted to likely be indexed by each of a subset of the segments stored within system memory, checking whether the other block of data is actually indexed by any of the segments of the subset; in response to determining that the block of data is not actually indexed by any segment of the subset of the segments stored within system memory: writing the other block of data to a new location on the data storage system; adding the other hashed value and a pointer to the new location to an open segment of the first set of segments stored within system memory; and updating the predictive filter for the open segment to account for the other hashed value having been added. 10. The method of claim 9 wherein the method further comprises: in response to adding the other hashed value and the pointer to the new location to the open segment, determining that the open segment has become full; in response to determining that the open segment has become full, closing the open segment and creating a new open segment as part of the de-duplication index within the first set stored within system memory. 11. The method of claim 1 wherein receiving, filtering, determining that the block of data is predicted to likely be indexed by each of the subset of the segments stored within system memory, and checking are performed without reading from persistent storage of the data storage system. 12. An apparatus comprising: persistent storage devices providing data storage; network interface circuitry configured to communicate with a host over a network; and processing circuitry coupled to memory to form a control circuit configured to store data in a de-duplicated manner in data storage by: receiving a b

Assignees

Inventors

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Hash-based (content-based indexing of textual data G06F16/31) · CPC title

  • for networked environments · CPC title

  • Database-specific techniques · CPC title

  • using de-duplication of the data · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10614036B1 cover?
Techniques have been provided for storing data in a de-duplicated manner on a data storage system in a manner that allows for real-time reference to an index that is too large to fit within memory. This may be accomplished by segmenting the index into smaller segments, stored on disk. Only a subset of the segments may be loaded into memory at a given time. A predictive filter is stored in memor…
Who is the assignee on this patent?
Emc Ip Holding Co Llc, AMC IP Holding Company LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 07 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 5 related publications on this page (citations in our corpus or others sharing the same primary CPC).