Digest retrieval based on similarity search in data deduplication

US9547662B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9547662-B2
Application numberUS-201313839581-A
CountryUS
Kind codeB2
Filing dateMar 15, 2013
Priority dateMar 15, 2013
Publication dateJan 17, 2017
Grant dateJan 17, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

For digest retrieval based on similarity search in deduplication processing in a data deduplication system using a processor device in a computing environment, input data is partitioned into fixed sized data chunks. Similarity elements and digest block boundaries and digest values are calculated for each of the fixed sized data chunks. Matching similarity elements are searched for in a search structure containing the similarity elements for each of the fixed sized data chunks in a repository of data. Positions of similar data are located in the repository. The positions of the similar data are used to locate and load into the memory stored digest values and corresponding stored digest block boundaries of the similar data in the repository. The digest values and the corresponding digest block boundaries of the input data are matched with the stored digest values and the corresponding stored digest block boundaries to find data matches.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for retrieving digests based on a similarity search for efficient deduplication processing in a data deduplication system using a processor device in a computing environment, comprising: partitioning input data into data chunks; searching for matching similarity elements in a search structure containing similarity elements; finding positions of similar data in a repository of data; using the positions of the similar data to locate and load into a memory stored digest values of the similar data in the repository; matching digest values of the input data with the stored digest values loaded into memory to find data matches; calculating rolling hash values for each of the data chunks, wherein a rolling hash value is produced for each consecutive window of bytes in a byte offset; and selecting specified rolling hash values and associated positions of the specified rolling hash values as the similarity elements; wherein a single linear scan is used to produce both the similarity elements and the digest values to find the data matches. 2. The method of claim 1 , further including partitioning the input data into fixed sized data chunks. 3. The method of claim 1 , further including calculating digest block boundaries based on the rolling hash values, and calculating digest values corresponding to the digest blocks. 4. The method of claim 1 , further including defining the search structure to contain the similarity elements of data chunks stored in the repository of data. 5. The method of claim 1 , further including storing the digests in the repository in a form that corresponds to their occurrence in the data. 6. The method of claim 5 , further including locating in the repository digests corresponding to a specified interval of data, based on a position in the repository and a size of the specified interval of data. 7. The method of claim 1 , further including recording a data identity if an input digest value is identical to a stored digest value loaded in the memory, wherein the data identify includes the data covered in the input data and in repository data by the matching input digest and stored digest respectively. 8. A system for retrieving digests based on a similarity search for efficient deduplication processing in a data deduplication system of a computing environment, the system comprising: the data deduplication system; a repository operating in the data deduplication system; a memory in the data deduplication system; a search structure in association with the memory in the data deduplication system; and at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device: partitions input data into data chunks, searches for matching similarity elements in the search structure containing similarity elements, finds positions of similar data in a repository of data, uses the positions of the similar data to locate and load into the memory stored digest values of the similar data in the repository, matches digest values of the input data with the stored digest values loaded into memory to find data matches, calculates rolling hash values for each of the data chunks, wherein a rolling hash value is produced for each consecutive window of bytes in a byte offset, and selects specified rolling hash values and associated positions of the specified rolling hash values as the similarity elements; wherein a single linear scan is used to produce both the similarity elements and the digest values to find the data matches. 9. The system of claim 8 , wherein the at least one processor device partitions the input data into fixed sized data chunks. 10. The system of claim 8 , wherein the at least one processor device calculates digest block boundaries based on the rolling hash values, and calculating digest values corresponding to the digest blocks. 11. The system of claim 8 , wherein the search structure contains the similarity elements of data chunks stored in the repository of data. 12. The system of claim 8 , wherein the at least one processor device stores the digests in the repository in a form that corresponds to their occurrence in the data. 13. The system of claim 12 , wherein the at least one processor device locates in the repository digests corresponding to a specified interval of data, based on a position in the repository and a size of the specified interval of data. 14. The system of claim 8 , wherein the at least one processor device records a data identity if an input digest value is identical to a stored digest value loaded in the memory, wherein the data identify includes the data covered in the input data and in repository data by the matching input digest and stored digest respectively. 15. A computer program product for retrieving digests based on a similarity search for efficient deduplication processing in a data deduplication system using a processor device in a computing environment, the computer program product comprising a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising: an executable portion that partitions input data into data chunks; an executable portion that searches for matching similarity elements in the search structure containing similarity elements; an executable portion that finds positions of similar data in a repository of data; an executable portion that uses the positions of the similar data to locate and load into the memory stored digest values of the similar data in the repository; an executable portion that matches the digest values of the input data with the stored digest values loaded into memory to find data matches; an executable portion that calculates rolling hash values for each of the data chunks, wherein a rolling hash value is produced for each consecutive window of bytes in a byte offset; and an executable portion that selects specified rolling hash values and associated positions of the specified rolling hash values as the similarity elements; wherein a single linear scan is used to produce both the similarity elements and the digest values to find the data matches. 16. The computer program product of claim 15 , further including an executable portion that partitions the input data into fixed sized data chunks. 17. The computer program product of claim 15 , further including an executable portion that calculates digest block boundaries based on the rolling hash values, and calculating digest values corresponding to the digest blocks. 18. The computer program product of claim 15 , further including an executable portion that defines the search structure to contain the similarity elements of data chunks stored in the repository of data. 19. The computer program product of claim 15 , further including an executable portion that stores the digests in the repository in a form that corresponds to their occurrence in the data. 20. The computer program product of claim 19 , further including an executable portion that locates in the repository digests corresponding to a specified interval of data, based on a position in the repository and a size of the specified interval of data. 21. The computer program product of claim 15 , further including an executable portion that records a data identity if an input digest value is identical to a stored digest value loaded in the memory, whe

Assignees

Inventors

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9547662B2 cover?
For digest retrieval based on similarity search in deduplication processing in a data deduplication system using a processor device in a computing environment, input data is partitioned into fixed sized data chunks. Similarity elements and digest block boundaries and digest values are calculated for each of the fixed sized data chunks. Matching similarity elements are searched for in a search s…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 17 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).