Probabilistic algorithm to check whether a file is unique for deduplication

US11669495B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11669495-B2
Application numberUS-201916552908-A
CountryUS
Kind codeB2
Filing dateAug 27, 2019
Priority dateAug 27, 2019
Publication dateJun 6, 2023
Grant dateJun 6, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Disclosed techniques include deduplication. Techniques include determining whether a file is unique, and depending on whether the file is unique, deduplicating only part of the file or the entire file. The techniques include processing the first chunk of a file to determine whether the hash of the chunk hash is already within a chunk hash table, and if not, then a percentage of chunks of the file is similarly processed. If any of the hashes of chunks are already in the chunk hash table, then at least some of file has been previously deduplicated, and file is not unique the storage system. If none of the processed chunks have a hash that is already in the chunk hash table, then the file is considered to be unique within chunk store and only a partial percentage of the file's chunks are deduplicated. Not all of a unique file's chunks are deduplicated.

First claim

Opening claim text (preview).

What is claimed is: 1. A method of deduplicating a first file, the method comprising: separating the first file into a first plurality of chunks; choosing a first chunk of the first file; determining a hash of the first chunk is not in a chunk hash data structure stored in a chunk store; determining, for a subset of the first plurality of chunks that is a percentage of the first plurality of chunks and that is less than all the first plurality of chunks, whether a hash of each of the chunks of the subset is in the chunk hash data structure, wherein each chunk of the subset of the first plurality of chunks is randomly selected from among all of the first plurality of chunks of the first file; and based on the determining for the subset that none of the hashes of the chunks of the subset are in the chunk hash data structure, including at least one of the chunks of the subset in the chunk hash data structure without including all of the first plurality of chunks in the chunk hash data structure. 2. The method of claim 1 , wherein the chunk hash data structure comprises a first plurality of key-value mappings between first plurality of keys and first plurality of values, the first plurality of keys each being a hash of a corresponding chunk. 3. The method of claim 1 , wherein the first chunk of the first file is in a given position within the first file, wherein the given position is chosen non-randomly. 4. The method of claim 3 , wherein the given position is a leading position of the first file. 5. The method of claim 1 , the method further comprising, based on determining the hash of the first chunk is not in the chunk hash data structure: adding a first key-value mapping to the chunk hash data structure, the first key-value mapping comprising (a) a first key that is the hash of the first chunk, and (b) a first value that is a chunk ID of the first chunk. 6. The method of claim 5 , the method further comprising, based on determining the hash of the first chunk is not in the chunk hash data structure: adding a second key-value mapping to a chunk ID data structure, wherein the chunk ID data structure comprises second plurality of key-value mappings between second plurality of keys and second plurality of values, the second plurality of keys being the chunk IDs of the chunk hash data structure, and the second plurality of values being sets of information about a corresponding chunk. 7. The method of claim 6 , wherein each of the sets of information about the corresponding chunk comprises at least one of: (a) the hash of the corresponding chunk, (b) a pointer to the corresponding chunk, or (c) a reference count of the corresponding chunk. 8. The method of claim 1 , further comprising: separating a second file into a second plurality of chunks; choosing a second chunk of the second file; determining a second hash of the second chunk of the second file is in the chunk hash data structure; and performing a process for including all of the second plurality of chunks of the second file in the chunk store, wherein the process for including all of the second plurality of chunks of the second file in the chunk store comprises deduplicating the second file using the chunk store. 9. The method of claim 1 , wherein a probability of at least one of the hash of the chunks of the subset being in the chunk hash data structure is based, at least in part, on the percentage of the first plurality of chunks included in the subset. 10. A non-transitory computer readable medium comprising instructions to be executed in a processor of a computer system, the instructions configured to cause the processor to: separate a file into a plurality of chunks; choose a chunk of the file; determine whether a hash of the chunk is in a chunk hash data structure stored in a chunk store; if the hash of the chunk is in the chunk hash data structure, perform a process for including all of the plurality of chunks of the file in the chunk store; and if the hash of the chunk is not in the chunk hash data structure: determine, for a subset of the plurality of chunks that is a percentage of the plurality of chunks and that is less than all the plurality of chunks, whether a hash of each of the chunks of the subset is in the chunk hash data structure, wherein each chunk of the subset of the plurality of chunks is randomly selected from among all of the plurality of chunks of the file; if the hash of any chunk of the subset is in the chunk hash data structure, perform the process for including all of the plurality of chunks of the file in the chunk store; and if none of the hashes of the chunks of the subset are in the chunk hash data structure, include at least one of the chunks of the subset in the chunk hash data structure without including all of the plurality of chunks in the chunk hash data structure. 11. The non-transitory computer readable medium of claim 10 , wherein the chunk hash data structure comprises a first plurality of key-value mappings between first plurality of keys and first plurality of values, the first plurality of keys each being a hash of a corresponding chunk. 12. The non-transitory computer readable medium of claim 10 , wherein the chunk of the file is in a given position within the file, wherein the given position is chosen non-randomly. 13. The non-transitory computer readable medium of claim 12 , wherein the given position is a leading position of the file. 14. The non-transitory computer readable medium of claim 10 , wherein the instructions are further configured to cause the processor to, if the hash of the chunk is not in the chunk hash data structure: add a first key-value mapping to the chunk hash data structure, the first key-value mapping comprising (a) a first key that is the hash of the chunk, and (b) a first value that is a chunk ID of the chunk. 15. The non-transitory computer readable medium of claim 14 , wherein the instructions are further configured to cause the processor to, if the hash of the chunk is not in the chunk hash data structure: add a second key-value mapping to a chunk ID data structure, wherein the chunk ID data structure comprises second plurality of key-value mappings between second plurality of keys and second plurality of values, the second plurality of keys being the chunk IDs of the chunk hash data structure, and the second plurality of values being sets of information about a corresponding chunk. 16. The non-transitory computer readable medium of claim 15 , wherein each of the sets of information about the corresponding chunk comprises at least one of: (a) the hash of the corresponding chunk, (b) a pointer to the corresponding chunk, or (c) a reference count of the corresponding chunk. 17. The non-transitory computer readable medium of claim 10 , wherein the process for including all of the plurality of chunks of the file in the chunk store comprises deduplicating the file using the chunk store. 18. The non-transitory computer readable medium of claim 10 , wherein a probability of at least one of the hash of the chunks of the subset being in the chunk hash data structure is based, at least in part, on the percentage of the plurality of chunks included in the subset. 19. A computer system comprising: a file; a chunk store; a chunk hash data structure; and at least one processor configured to: separate the file into a plurality of chunks; choose a chunk of the file; determine whether a hash of the chunk is in the chunk hash data structure stored in the chunk store; if the hash of the chun

Assignees

Inventors

Classifications

  • based on file chunks · CPC title

  • using file content signatures, e.g. hash values · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11669495B2 cover?
Disclosed techniques include deduplication. Techniques include determining whether a file is unique, and depending on whether the file is unique, deduplicating only part of the file or the entire file. The techniques include processing the first chunk of a file to determine whether the hash of the chunk hash is already within a chunk hash table, and if not, then a percentage of chunks of the fi…
Who is the assignee on this patent?
Vmware Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jun 06 2023 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).