Memory efficient sanitization of a deduplicated storage system using a perfect hash function

US9317218B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-9317218-B1
Application numberUS-201313763522-A
CountryUS
Kind codeB1
Filing dateFeb 8, 2013
Priority dateFeb 8, 2013
Publication dateApr 19, 2016
Grant dateApr 19, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for sanitizing a storage system are described herein. In one embodiment, for each of fingerprints representing data chunks stored in a first container of the storage system, a lookup operation in a live bit vector based on the fingerprint is performed to determine whether a corresponding data chunk is live. In one embodiment, a bit in a copy bit vector corresponding to the data chunk is populated based on the lookup operation. In one embodiment, after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, data chunks represented by the CBV are copied from the first container to a second container, and records of the data chunks in the first container are erased.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for sanitizing a storage system, the method comprising: for each of fingerprints representing data chunks stored in a first of a plurality of containers of the storage system, identifying a bucket associated with the fingerprint by applying a first hash function to the fingerprint, wherein the bucket represents a subset of fingerprints; determining a second hash function corresponding to the identified bucket, the second hash function is a collision free hash function; applying the second hash function to the fingerprint to generate a hash value; performing a lookup operation in a single live bit vector using the hash value as an index to determine a live status of a corresponding data chunk, the single live bit vector having a plurality of bits and each indicating whether one of a plurality of data chunks stored in the plurality of containers is live or dead, and populating a bit in a copy bit vector (CBV) corresponding to the data chunk based on the lookup operation by copying the determined live status from the single live bit vector to the bit in the CBV, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein each bit in the CBV is referenced by a container ID and a data chunk ID associated with the fingerprint; after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; erasing records of the data chunks by padding a predetermined data value in the first container; and releasing the first container back to a pool of free containers for future reuse. 2. The method of claim 1 , further comprising: creating the collision-free hash function based on a plurality of fingerprints in an index; for each of a plurality of files stored in the storage system, obtaining a list of fingerprints representing data chunks of the file, and for each of the fingerprints, performing a hash operation on the fingerprint using the collision-free hash function; populating a bit in the live vector corresponding to the fingerprint based on the hash operation to indicate the corresponding data chunk is live. 3. The method of claim 2 , wherein a bit in the CBV corresponding to a data chunk is populated with a value based on a live vector bit that corresponds to the same data chunk. 4. The method of claim 3 , further comprising: receiving a data chunk to be stored in the storage system while sanitization is in progress; storing, in a buffer, a container identifier of a container storing the data chunk and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored; and populating a bit in the CBV based on the container identifier and storage location identifier stored in the buffer. 5. The method of claim 3 , wherein the CBV is included in a container index that maps a fingerprint of a data chunk to a container storing the data chunk, and wherein each of the bits in the CBV is set, at the start of the sanitization process, to a predetermined value indicating the corresponding data chunk is dead. 6. The method of claim 1 , wherein data chunks are copied from the first container to a second container if the first container contains at least one dead data chunk. 7. The method of claim 1 , wherein each of the data chunks is a deduplicated data chunk, and wherein at least one of the data chunks is referenced by multiple files in a file system of the storage system. 8. The method of claim 1 , wherein deduplication is disabled during the sanitization process. 9. The method of claim 2 , wherein the collision-free hash function is a perfect hash function. 10. A non-transitory computer-readable medium having instructions stored therein, which when executed by a computer, cause the computer to perform operations, the operations comprising: for each of fingerprints representing data chunks stored in a first of a plurality of containers of the storage system, identifying a bucket associated with the fingerprint by applying a first hash function to the fingerprint, wherein the bucket represents a subset of fingerprints; determining a second hash function corresponding to the identified bucket, the second hash function is a collision free hash function; applying the second hash function to the fingerprint to generate a hash value; performing a lookup operation in a single live bit vector using the hash value as an index to determine a live status of a corresponding data chunk, the single live bit vector having a plurality of bits and each indicating whether one of a plurality of data chunks stored in the plurality of containers is live or dead, and populating a bit in a copy bit vector (CBV) corresponding to the data chunk based on the lookup operation by copying the determined live status from the single live bit vector to the bit in the CBV, the CBV including a plurality of bits and each storing a bit value indicating whether a data chunk is to be copied, wherein each bit in the CBV is referenced by a container ID and a data chunk ID associated with the fingerprint; after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, copying live data chunks represented by the CBV from the first container to a second container; erasing records of the data chunks by padding a predetermined data value in the first container; and releasing the first container back to a pool of free containers for future reuse. 11. The non-transitory computer-readable medium of claim 10 , wherein the operations further comprise: creating the collision-free hash function based on a plurality of fingerprints in an index; for each of a plurality of files stored in the storage system, obtaining a list of fingerprints representing data chunks of the file, and for each of the fingerprints, performing a hash operation on the fingerprint using the collision-free hash function; populating a bit in the live vector corresponding to the fingerprint based on the hash operation to indicate the corresponding data chunk is live. 12. The non-transitory computer-readable medium of claim 11 , wherein a bit in the CBV corresponding to a data chunk is populated with a value based on a live vector bit that corresponds to the same data chunk. 13. The non-transitory computer-readable medium of claim 12 , wherein the operations further comprise: receiving a data chunk to be stored in the storage system while sanitization is in progress; storing, in a buffer, a container identifier of a container storing the data chunk and a storage location identifier identifying a chunk offset within the identified container in which the data chunk is stored; and populating a bit in the CBV based on the container identifier and storage location identifier stored in the buffer. 14. The non-transitory computer-readable medium of claim 12 , wherein the CBV is included in a container index that maps a fingerprint of a data chunk to a container storing the data chunk, and wherein each of the bits in the CBV is set, at the start of the sanitization process, to a predetermined value indicating the corresponding data chunk is dead. 15. The non-transitory computer-readable medium of claim 10 , wherein data chunks are copied from the first container to a second container if the first container contains at least one dead data chunk. 16. The non-transitory computer-readable medium of claim 10 ,

Assignees

Inventors

Classifications

  • G06F3/0655Primary

    Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices · CPC title

  • based on file chunks · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • hash tables · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9317218B1 cover?
Techniques for sanitizing a storage system are described herein. In one embodiment, for each of fingerprints representing data chunks stored in a first container of the storage system, a lookup operation in a live bit vector based on the fingerprint is performed to determine whether a corresponding data chunk is live. In one embodiment, a bit in a copy bit vector corresponding to the data chunk…
Who is the assignee on this patent?
Emc Corp
What technology area does this patent fall under?
Primary CPC classification G06F3/0655. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 19 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).