Efficient calculation of similarity search values and digest block boundaries for data deduplication

US9244937B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9244937-B2
Application number  US-201313840094-A
Country             US
Kind code           B2
Filing date         Mar 15, 2013
Priority date       Mar 15, 2013
Publication date    Jan 26, 2016
Grant date          Jan 26, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

For efficient calculation of both similarity search values and boundaries of digest blocks in data deduplication, input data is partitioned into chunks, and for each chunk a set of rolling hash values is calculated. A single linear scan of the rolling hash values is used to produce both similarity search values and boundaries of the digest blocks of the chunk.
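
The single-scan idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which rolling hash is used or how the two outputs are derived from it, so everything specific here is an assumption made for illustration: the polynomial hash, the constants `WINDOW`, `BASE`, `MOD`, and `BOUNDARY_MASK`, the choice of the maximum hash value as the similarity search value, and the mask test for block boundaries. The function name `scan_chunk` is hypothetical.

```python
from typing import List, Tuple

# All constants below are hypothetical; the patent text shown here fixes none of them.
WINDOW = 8             # bytes covered by each rolling hash value
BASE = 257             # polynomial base
MOD = (1 << 61) - 1    # modulus keeping hash values bounded
BOUNDARY_MASK = 0x3F   # assumed rule: boundary when the low 6 bits are zero


def scan_chunk(chunk: bytes) -> Tuple[int, List[int]]:
    """One pass over a chunk yielding (similarity_value, block_boundaries).

    similarity_value: here, the largest rolling hash seen (an assumption).
    block_boundaries: end offsets of windows whose hash passes the mask test.
    """
    if len(chunk) < WINDOW:
        return 0, []

    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

    # Hash of the first window.
    h = 0
    for b in chunk[:WINDOW]:
        h = (h * BASE + b) % MOD

    similarity = h
    boundaries: List[int] = []
    pos = 0  # start offset of the current window

    while True:
        # The current hash value contributes to both calculations ...
        if h > similarity:
            similarity = h
        if h & BOUNDARY_MASK == 0:
            boundaries.append(pos + WINDOW)

        if pos + WINDOW >= len(chunk):
            break

        # ... and is then overwritten by the next one, so no hash list is stored.
        h = ((h - chunk[pos] * top) * BASE + chunk[pos + WINDOW]) % MOD
        pos += 1

    return similarity, boundaries
```

The point of the sketch is only the control flow: each rolling hash value is computed once, consulted for both outputs, and then discarded, rather than being stored and scanned twice.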

First claim

Opening claim text (preview).

What is claimed is:

1. A method for efficient calculation of both similarity search values and boundaries of digest blocks in a data deduplication system using a processor device in a computing environment, comprising: partitioning input data into data chunks; calculating a set of rolling hash values for each of the data chunks; using a single linear scan of the rolling hash values for producing both the similarity search values and the boundaries of the digest blocks; using each of the rolling hash values to contribute to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks; and discarding each of the rolling hash values after contributing to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks.

2. The method of claim 1, further including corresponding each of the rolling hash values to a consecutive window of bytes in byte offsets.

3. The method of claim 1, further including using the similarity search values to search for data similar to the input data in a repository of data.

4. The method of claim 1, further including using the boundaries of the digest blocks to calculate digest values for each of the data chunks for digests matching.

5. The method of claim 1, further including partitioning the input data into fixed sized data chunks.

6. A system for efficient calculation of both similarity search values and boundaries of digest blocks in a data deduplication system of a computing environment, the system comprising: the data deduplication system; a repository in the computing environment in communication with the data deduplication system; at least one processor device operable in the computing storage environment for controlling the data deduplication system, wherein the at least one processor device: partitions input data into data chunks, calculates a set of rolling hash values for each of the data chunks, uses a single linear scan of the rolling hash values for producing both the similarity search values and the boundaries of the digest blocks, uses each of the rolling hash values to contribute to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks, and discards each of the rolling hash values after contributing to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks.

7. The system of claim 6, wherein the at least one processor device corresponds each of the rolling hash values to a consecutive window of bytes in byte offsets.

8. The system of claim 6, wherein the at least one processor device uses the similarity search values to search for data similar to the input data in the repository of data.

9. The system of claim 6, wherein the at least one processor device uses the boundaries of the digest blocks to calculate digest values for each of the data chunks for digests matching.

10. The system of claim 6, wherein the at least one processor device partitions the input data into fixed sized data chunks.

11. A computer program product for efficient calculation of both similarity search values and boundaries of digest blocks in a data deduplication system using a processor device in a computing environment, the computer program product comprising a computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising: a first executable portion that partitions input data into data chunks; a second executable portion that calculates a set of rolling hash values for each of the data chunks; a third executable portion that uses a single linear scan of the rolling hash values for producing both the similarity search values and the boundaries of the digest blocks; a fourth executable portion that uses each of the rolling hash values to contribute to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks; and a fifth executable portion that discards each of the rolling hash values after contributing to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks.

12. The computer program product of claim 11, further including a sixth executable portion that corresponds each of the rolling hash values to a consecutive window of bytes in byte offsets.

13. The computer program product of claim 11, further including a sixth executable portion that uses the similarity search values to search for data similar to the input data in a repository of data.

14. The computer program product of claim 11, further including a sixth executable portion that uses the boundaries of the digest blocks to calculate digest values for each of the data chunks for digests matching.

15. The computer program product of claim 11, further including a sixth executable portion that partitions the input data into fixed sized data chunks.
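
Claims 4, 9, and 14 describe using the digest block boundaries to calculate digest values for matching. A minimal sketch of that step follows, assuming the boundaries are given as a sorted list of byte end offsets within a chunk. The claims do not name a digest function; SHA-1 is used here purely as a stand-in, and the function name `block_digests` is hypothetical.

```python
import hashlib
from typing import List


def block_digests(chunk: bytes, boundaries: List[int]) -> List[str]:
    """Split a chunk at the given end offsets and digest each block.

    boundaries: sorted end offsets (assumed format); the chunk end closes
    the final block if no boundary falls exactly there.
    SHA-1 is an illustrative choice, not one stated in the claims.
    """
    digests: List[str] = []
    start = 0
    for end in list(boundaries) + [len(chunk)]:
        if end > start:  # skip empty or repeated boundaries
            digests.append(hashlib.sha1(chunk[start:end]).hexdigest())
            start = end
    return digests
```

Digests produced this way could then be compared against digests stored for similar repository data to find matching blocks, which is the "digests matching" use the claims refer to.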

Assignees

Inventors

Classifications

  • Physics · mapped topic

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • with adaptation to user needs · CPC title

  • Ensuring data consistency and integrity · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9244937B2 cover?
For efficient calculation of both similarity search values and boundaries of digest blocks in data deduplication, input data is partitioned into chunks, and for each chunk a set of rolling hash values is calculated. A single linear scan of the rolling hash values is used to produce both similarity search values and boundaries of the digest blocks of the chunk.
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F17/30156. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 26, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).