Data de-duplication

US9632720B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9632720-B2
Application number  US-201414336799-A
Country             US
Kind code           B2
Filing date         Jul 21, 2014
Priority date       Aug 29, 2013
Publication date    Apr 25, 2017
Grant date          Apr 25, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method and device for data de-duplication, comprising: performing data chunk partition on a current data object by using a different standard in each of a plurality of logical passes; searching one or more first redundant data chunks of the current data object in each logic pass based on the data chunks partitioned on the current data object in the logical pass, respectively, and performing data de-duplication on the current data object based on all of the found first redundant data chunks of the current data object. Other embodiments of the present invention may also relate to a data de-duplication system and a corresponding computer program product.
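
The multi-pass search described in the abstract can be illustrated with a minimal sketch. It assumes fixed-length chunking with a different chunk length in each pass, which is one of the options named in claim 2. The function names, the chunk lengths 4 and 6, and the use of SHA-256 digests to compare chunks are illustrative assumptions, not details taken from the patent.

```python
import hashlib


def fixed_chunks(data: bytes, length: int):
    """Yield (offset, chunk) pairs using a fixed chunk length (hypothetical helper)."""
    for off in range(0, len(data), length):
        yield off, data[off:off + length]


def find_redundant(previous: bytes, current: bytes, lengths=(4, 6)):
    """Return (offset, length) spans of `current` that also occur as chunks of `previous`.

    Each chunk length is one logical pass: both objects are partitioned with
    the same standard, and chunks are compared by digest.
    """
    found = []
    for length in lengths:
        # Digests of the chunks of the previously processed object in this pass.
        seen = {hashlib.sha256(c).digest() for _, c in fixed_chunks(previous, length)}
        for off, chunk in fixed_chunks(current, length):
            if hashlib.sha256(chunk).digest() in seen:
                found.append((off, len(chunk)))
    return found


previous = b"AAAABBBBCCCC"
current = b"AAAAXXBBCCCC"
print(find_redundant(previous, current))
# [(0, 4), (8, 4), (6, 6)]
```

In this toy input the 4-byte pass finds 8 redundant bytes, and the 6-byte pass finds a span at offset 6 that covers 2 bytes the first pass missed. Spans found in different passes can overlap, which is why the claims go on to describe an overlap-elimination step.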

First claim

Opening claim text (preview).

What is claimed is:

1. A method for data de-duplication when processing a plurality of data objects, being executed on a hardware device, comprising: performing data chunk partition during each of a plurality of logical passes of a current data object of the plurality of data objects, where the data chunk partition is performed using a respective different standard in each logical pass of the plurality of logical passes; searching, during the each logical pass of the plurality of logical passes, to find one or more first redundant data chunks of the current data object based on data chunks partitioned on a data object previously processed using the respective different standard; and performing data de-duplication on the current data object based on all of the found first redundant data chunks of the current data object.

2. The method according to claim 1, wherein performing data chunk partition comprises at least one of the following: performing data chunk partition on the current data object with a fingerprint algorithm by using different fingerprint masks in the respective logical passes; performing data chunk partition on the current data object with a fixed length algorithm by using different data chunk lengths in the respective logical passes; and performing data chunk partition on the current data object by using different partition algorithms in the respective logical passes.

3. The method according to claim 1, wherein searching, during the each of the plurality of logical passes, to find one or more first redundant data chunks of the current data object comprises: searching, in each logical pass, first redundant data chunks of the current data object based on data chunks partitioned on a previous data object by using the standard of the logical pass and data chunks partitioned on the current data object by using the standard of the logical pass.

4. The method according to claim 1, wherein performing de-duplication on the current data object comprises: eliminating overlap portions existing between two or more first redundant data chunks found in each logical pass of the plurality of logical passes, based on offset and length of the first redundant data chunks; performing data de-duplication on the current data object through deleting second data redundant chunks, wherein the second redundant data chunks include first redundant data chunks with the overlap portions being deleted.

5. The method according to claim 4, wherein eliminating overlap portions existing between two or more first redundant data chunks comprises: sorting the first redundant data chunks according to offset of the first redundant data chunks; and merging two or more first redundant data chunks having overlap portions based on the sorted first redundant data chunks and according to length of the first redundant data chunks, so as to determine the second redundant data chunks of the current data object.

6. The method according to claim 5, further comprising: recovering the current data object according to a link stored for the second redundant data chunks.

7. The method according to claim 4, wherein the deleted second data redundant chunks are a plurality of discontinuous data chunks in a file.

8. The method according to claim 7, wherein a different fingerprint mask is used for each respective logical pass of the plurality of passes when performing the data chunk partition.

9. A computer program product comprising program code stored on a non-transitory computer readable storage medium that is configured to perform the method of claim 1 when executed by a data processing apparatus.

10. The method according to claim 1, wherein different data chunk distributions for a same data object are obtained in each of the plurality of logical passes.

11. A system for data de-duplication when processing a plurality of data objects, being executed on a hardware device, comprising: a memory; a data chunk partition unit configured to perform data chunk partition during each of a plurality of logical passes of a current data object of the plurality of data objects, where the data chunk partition is performed using a respective different standard in each logical pass of the plurality of logical passes; a first redundant data chunk determining unit configured to search, during the each logical pass of the plurality of logical passes, to find one or more first redundant data chunks of the current data object based on data chunks partitioned on a data object previously processed using the respective different standard; and a data de-duplication unit configured to perform data de-duplication on the current data object based on all of the found first redundant data chunks of the current data object.

12. The system according to claim 11, wherein the data chunk partition unit is configured to perform at least one of the following: performing data chunk partition on the current data object with a fingerprint algorithm by using different fingerprint masks in the respective logical passes; performing data chunk partition on the current data object with a fixed length algorithm by using different data chunk lengths in the respective logical passes; and performing data chunk partition on the current data object by using different partition algorithms in the respective logical passes.

13. The system according to claim 11, wherein the first redundant data chunk determining unit is configured to: search, in each logical pass, first redundant data chunks of the current data object based on data chunks partitioned on a previous data object by using the standard of the logical pass and data chunks partitioned on the current data object by using the standard of the logical pass.

14. The system according to claim 11, wherein the data de-duplication unit further comprises: an overlap portion eliminating unit configured to eliminate overlap portions existing between two or more first redundant data chunks found in each logical pass of the plurality of logical passes, based on offset and length of the first redundant data chunks; wherein the data de-duplication unit is configured to perform data de-duplication on the current data object through deleting second redundant data chunks, wherein the second redundant data chunks include the first redundant data chunks with the overlap portions being eliminated.

15. The system according to claim 14, wherein the overlap portion eliminating unit further comprises: a sorting unit configured to sort the first redundant data chunks according to offset of the first redundant data chunks; and a merging unit configured to merge two or more first redundant data chunks having overlap portions based on the sorted first redundant data chunks and according to length of the first redundant data chunks, so as to determine the second redundant data chunks of the current data object.

16. The system according to claim 15, further comprising: a recovering unit configured to recover the current data object according to a link stored for the second redundant data chunks.

17. The system according to claim 14, wherein the deleted second data redundant chunks are a plurality of discontinuous data chunks in a file.

18. The system according to claim 17, wherein a different fingerprint mask is used for each respective logical pass of the plurality of passes when performing the data chunk partition.

19. The system according to claim 11, wherein different data chunk distributions for a same data object are obtained in each of the plurality
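
The sort-and-merge step in claims 5 and 15 can be sketched as follows. This is an illustrative reading only: it assumes each redundant chunk is represented as an (offset, length) pair, and the name `merge_spans` is hypothetical rather than taken from the patent.

```python
def merge_spans(spans):
    """Sort (offset, length) spans by offset and merge the ones that overlap.

    The merged result plays the role of the "second redundant data chunks":
    no byte of the data object is covered by more than one span.
    """
    merged = []
    for off, length in sorted(spans):
        if merged and off < merged[-1][0] + merged[-1][1]:
            # Overlaps the previous span: extend it to cover both.
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((off, length))
    return merged


# Spans as they might come out of two passes with different chunk lengths.
spans = [(0, 4), (8, 4), (6, 6)]
print(merge_spans(spans))
# [(0, 4), (6, 6)]
```

Here the span at offset 8 lies entirely inside the span at offset 6, so the two collapse into one, and the merged spans cover 10 bytes without counting any byte twice. Spans that merely touch are kept separate in this sketch, since the claims speak of chunks "having overlap portions"; an implementation could reasonably choose to join adjacent spans as well.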

Assignees

IBM

Inventors

Classifications

  • G06F3/0641 · Primary

    De-duplication techniques · CPC title

  • Disk device · CPC title

  • Saving storage space on storage systems · CPC title

Patent family

Related publications grouped by family.

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9632720B2 cover?
A method and device for data de-duplication, comprising: performing data chunk partition on a current data object by using a different standard in each of a plurality of logical passes; searching one or more first redundant data chunks of the current data object in each logic pass based on the data chunks partitioned on the current data object in the logical pass, respectively, and performing d…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F3/0641. Mapped technology areas include Physics.
When was this patent published?
Publication date Apr 25, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).