Utilizing global digests caching in similarity based data deduplication

US9891857B2 · US · B2

Patent metadata
Publication number: US-9891857-B2
Application number: US-201313941742-A
Country: US
Kind code: B2
Filing date: Jul 15, 2013
Priority date: Jul 15, 2013
Publication date: Feb 13, 2018
Grant date: Feb 13, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions of similar repository data are found in a repository of data for each of the data chunks. The repository digests of the similar repository data are located and loaded into the global digests cache. The global digests cache contains digests previously loaded by other deduplication processes. The input digests of the input data are matched with the repository digests contained in the global digests cache for locating data matches. The processor prefers to match the input digests of the input data with the repository digests contained in the global digests cache which are of the similar repository data, rather than repository digests which are of other repository data that was not determined as similar to the input data chunks.
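The matching preference described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of SHA-1 as the digest, the tiny chunk size, and the representation of cached digests as plain dictionaries mapping a digest to a repository position are all assumptions made for the example. The similarity search itself is not shown.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def partition(data: bytes, chunk_size: int) -> List[bytes]:
    """Split input data into consecutive chunks (hypothetical helper)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def digest(block: bytes) -> str:
    """Digest of a block; SHA-1 is an assumption, the abstract names no hash."""
    return hashlib.sha1(block).hexdigest()


def match_digest(
    input_digest: str,
    similar_digests: Dict[str, int],
    other_cached_digests: Dict[str, int],
) -> Optional[Tuple[str, int]]:
    """Look up an input digest, preferring digests of similar repository data.

    Both dictionaries map digest -> repository position. Digests loaded for
    data found similar to the current input are consulted first; digests
    left in the cache by other deduplication processes are a fallback.
    """
    if input_digest in similar_digests:
        return ("similar", similar_digests[input_digest])
    if input_digest in other_cached_digests:
        return ("other", other_cached_digests[input_digest])
    return None


# Toy usage: the same digest exists in both sets, and the similar one wins.
chunks = partition(b"aaaabbbbcc", 4)
similar = {digest(b"aaaa"): 100}
other = {digest(b"aaaa"): 7, digest(b"bbbb"): 8}
results = [match_digest(digest(c), similar, other) for c in chunks]
```

In this toy run the first chunk matches the similar repository data even though another cached entry carries the same digest, the second chunk falls back to a digest loaded by another process, and the third finds no match.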

First claim

Opening claim text (preview).

What is claimed is:

1. A method for utilizing a global digests cache in similarity based data deduplication in a data deduplication system using a processor device in a computing environment, comprising:

    partitioning input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB);

    calculating input digest values for each of the input data chunks;

    finding positions of similar repository data in a repository of data for each of the input data chunks;

    locating and loading repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication processes;

    matching input digests of the input data and the repository digests contained in the global digests cache for locating data matches;

    preferring to match the input digests of the input data with the repository digests contained in the global digests cache which are of the similar repository data, rather than repository digests which are of other repository data that was not determined as similar to the input data chunks; and

    using the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.

2. The method of claim 1, wherein the global digests cache contains the plurality of digests previously loaded by the plurality of deduplication processes.

3. The method of claim 2, further including reusing at least one of the plurality of sequential arrays of digest entries of the global digests cache according to a least recently used (LRU) policy.

4. The method of claim 3, further including applying the LRU policy on the plurality of sequential arrays of digest entries of the digest entries of the plurality of digests in the global digests cache.

5. The method of claim 4, further including searching for the input digests by considering both the plurality of digests previously loaded by the plurality of deduplication processes and the digests of the similar repository data currently loaded into the global digests cache.

6. The method of claim 1, further including performing one of: calculating similarity values for each of the input data chunks, searching for matching similarity values in a search structure containing the similarity values, and matching the digest values of the input data with the repository digest values of the repository digests loaded into the global digests cache for locating the data matches.

7. The method of claim 1, further including matching input digests of the input data and repository digests contained in the global digests cache for finding data matches if the search for the similar repository data in the repository finds the similar repository data.

8. A system for utilizing a global digests cache in similarity based data deduplication in a data deduplication system of a computing environment, the system comprising:

    the data deduplication system;

    the global digests cache in association with the data deduplication system;

    a repository operating in the data deduplication system in communication with the global digests cache; and

    at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device:

    partitions input data into input data chunks, each of the input data chunks having a size of at least 16 Megabytes (MB),

    calculates input digest values for each of the input data chunks,

    finds positions of similar repository data in a repository of data for each of the input data chunks,

    locates and loads repository digests of the similar repository data into the global digests cache, wherein the global digests cache contains, prior to the loading of the repository digests of the similar repository data, at least a plurality of digests previously loaded by a plurality of deduplication processes,

    matching input digests of the input data and the repository digests contained in the global digests cache for locating data matches,

    prefers to match the input digests of the input data with the repository digests contained in the global digests cache which are of the similar repository data, rather than repository digests which are of other repository data that was not determined as similar to the input data chunks, and

    uses the positions of the similar repository data to locate and linearly load into the global digests cache, digests and digest block boundaries of the similar repository data in a sequence corresponding to a placement order of calculated values of the digests of the similar repository data, the placement order of the calculated values of the digests of the similar repository data correlative to an order in which the input digest values were individually calculated such that the digests of the similar repository data are each individually stored in the global digests cache based on a calculation time and order of when each of the input digests were first calculated when in un-deduplicated form, thereby storing the digests of the similar repository data in a linear and sequential form independent of a deduplicated form by which data the digests describe is stored, wherein the global digest cache comprises a pool of a plurality of sequential arrays of digest entries of the digests and a hash table for pointing to contents within the plurality of sequential arrays.

9. The system of claim 8, wherein the global digests cache contains the plurality of digests previously loaded by the plurality of deduplication processes.

10. The system of claim 9, wherein the at least one processor device reuses at least one of the plurality of sequential arrays of digest entries of the global digests cache according to a least recently used (LRU) policy.

11. The system of claim 10, wherein the at least one processor device applies the LRU policy on the plurality of sequential arrays of digest entries of the digest entries of the plurality of digests in the global digests cache.

12. The system of claim 11, wherein the at least one processor device searches for the input digests by considering both the plurality of digests previously loaded by the plurality of deduplication processes and the digests of the similar repository data currently loaded into the global digests cache within a time window reflected by the global digests cache, wherein the global digests cache reflects a window of time backwards from a current time.

13. The system of claim 8, wherein the at least one processor device performs one of: calculating similarity values for each of the
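The cache structure recited in the claims (a pool of sequential arrays of digest entries, a hash table pointing into those arrays, and LRU reuse of arrays) might be sketched as below. This is an illustrative sketch under stated assumptions, not the claimed implementation: the class name `GlobalDigestsCache`, its methods, the string digests, and the choice to track recency per array on lookup are all hypothetical. Digest block boundaries and the 16 MB chunk size are omitted.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple


class GlobalDigestsCache:
    """Pool of sequential digest arrays plus a hash table into them (sketch)."""

    def __init__(self, num_arrays: int) -> None:
        self.num_arrays = num_arrays
        # array_id -> digests in load order; dict order tracks recency,
        # with the least recently used array first.
        self.arrays: "OrderedDict[int, List[str]]" = OrderedDict()
        # digest -> (array_id, index within that array)
        self.table: Dict[str, Tuple[int, int]] = {}
        self.next_id = 0

    def load(self, digests: List[str]) -> int:
        """Load a sequence of repository digests into one array of the pool."""
        if len(self.arrays) >= self.num_arrays:
            # Pool is full: reuse the least recently used array.
            old_id, old_digests = self.arrays.popitem(last=False)
            for d in old_digests:
                # Drop only table entries that still point at the evicted array.
                if d in self.table and self.table[d][0] == old_id:
                    del self.table[d]
        array_id = self.next_id
        self.next_id += 1
        self.arrays[array_id] = list(digests)
        for index, d in enumerate(digests):
            # Simplification: a later load of the same digest overwrites
            # the earlier table entry.
            self.table[d] = (array_id, index)
        return array_id

    def lookup(self, d: str) -> Optional[Tuple[int, int]]:
        """Return (array_id, index) for a digest and mark its array as used."""
        hit = self.table.get(d)
        if hit is not None:
            self.arrays.move_to_end(hit[0])
        return hit


# Toy usage with a pool of two arrays.
cache = GlobalDigestsCache(num_arrays=2)
first = cache.load(["d1", "d2"])
second = cache.load(["d3"])
hit = cache.lookup("d1")      # touches the first array, so the second is now LRU
third = cache.load(["d4"])    # pool is full, the second array is reused
```

Storing each load as its own sequential array keeps the digests in the order they were loaded, while the hash table gives constant-time entry into any array. Applying LRU at array granularity, rather than per digest, follows the wording of claims 3 and 10.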

Assignees

Inventors

Classifications

  • Partitioned cache, e.g. separate instruction and operand caches · CPC title

  • in relation to data integrity, e.g. data losses, bit errors · CPC title

  • Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title

  • with dedicated cache, e.g. instruction or stack · CPC title

  • G06F3/0641 (Primary)

    De-duplication techniques · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9891857B2 cover?
Input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions of similar repository data are found in a repository of data for each of the data chunks. The repository digests of the similar repository data are located and loaded into the global digests cache. The global digests cache contains digests previously loaded by other deduplicati…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F12/0848. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 13, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).