Proximity and in-memory map based signature searching for duplicate data

US9766983B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9766983-B2
Application numberUS-13051408-A
CountryUS
Kind codeB2
Filing dateMay 30, 2008
Priority dateMar 5, 2008
Publication dateSep 19, 2017
Grant dateSep 19, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A computer implemented method and system obtains current signatures of data chunks and performs a proximity search of a library of previous signatures as a function of the likely location of corresponding data chunks. A full search of the library of previous signatures for those current signatures not found in the proximity search is also performed.

First claim

Opening claim text (preview).

The invention claimed is: 1. A method comprising: identifying a first plurality of signatures of data chunks in a first data backup of a plurality of data backups, wherein identifying the first plurality of signatures is based, at least in part, on the first data backup being a most recent of the plurality of data backups; performing a first search of the first plurality of signatures for each of a second plurality of signatures of data chunks for a second data backup, wherein the plurality of data backups precedes the second data backup, for each of the second plurality of signatures found in the first plurality of signatures, indicating that a corresponding data chunk is not to be stored for the second data backup; and after the first search, performing a second search of at least a third plurality of signatures of older data backups of the plurality of data backups, wherein performing the second search comprises, creating and updating an in-memory map with progressively older subsets of signatures of the third plurality of signatures that are incrementally loaded from a backup domain into memory, for each progressively older subset of signatures in the in-memory map, searching the in-memory map for the second plurality of signatures, and for each of the second plurality of signatures found in the in-memory map, removing a duplicate of a corresponding data chunk from either the second data backup or the older data backups. 2. The method of claim 1 wherein the second data backup corresponds to a volume on a machine. 3. The method of claim 1 , wherein indicating, for each of the second plurality of signatures found in the in-memory map, that the corresponding data chunk is not to be stored for the second backup comprises removing the signature found in the in-memory map from a list of the second plurality of signatures. 4. The method of claim 1 , wherein creating and updating the in-memory map comprises: after each searching of the in-memory map, removing a currently loaded one of the progressively older subsets of signatures from the in-memory map and inserting into the in-memory map a next most recent one of the progressively older subsets of signatures. 5. The method of claim 1 further comprising: in response to not finding a signature of the second plurality of signatures in the in-memory map, inserting into the in-memory map the signature that was not found; and in response to determining that a number of entries of the in-memory map reaches a threshold based on inserting the signature that was not found, removing the oldest signature from the in-memory map. 6. The method of claim 1 , wherein identifying the first plurality of signatures comprises: determining a source, destination, and backup method for the second data backup; identifying the first data backup as the last successful or incomplete backup from the source to the destination with the backup method; and accessing a file for the first data backup that includes the first plurality of signatures. 7. The method of claim 1 , wherein performing the second search is after completion of the second data backup. 8. The method of claim 1 wherein the in-memory map is a btree. 9. A method comprising: identifying a first plurality of signatures of data chunks of a first data backup of a plurality of data backups; loading a first subset of the first plurality of signatures from a backup domain into memory; creating an in-memory map with the first subset of signatures; searching the in-memory map for each of a second plurality of signatures of data chunks of a second data backup, wherein the first data backup precedes the second data backup and the first subset of signatures comprises most recent signatures of the first plurality of signatures; incrementally loading progressively older subsets of the first plurality of signatures from a backup domain into the memory; after each loading, removing at least one of the signatures from the in-memory map in order from oldest to youngest, wherein a number of the signatures removed from the in-memory map is based on a size boundary defined for the in-memory map; inserting currently loaded signatures from the memory into the in-memory map; for each of the second plurality of signatures found in the in-memory map, indicating that a corresponding data chunk is not to be stored for the second data backup; in response to not finding a signature of the second plurality of signatures in the in-memory map, inserting into the in-memory map the signature that was not found; and in response to determining that a number of entries of the in-memory map satisfies a threshold based on inserting the signature that was not found, removing the oldest of the signatures in the in-memory map. 10. The method of claim 9 wherein the second data backup corresponds to a volume on a machine. 11. The method of claim 9 wherein indicating, for each of the second plurality of signatures found in the in-memory map, that the corresponding data chunk is not to be stored for the second data backup comprises removing the signature found in the in-memory map from a list of the second plurality of signatures. 12. The method of claim 9 wherein the in-memory map is a btree. 13. A non-transitory computer readable storage medium comprising instructions for narrow duplicate searching of previous data backups, the instructions to: identify a first plurality of signatures of data chunks in a first data backup of a plurality of previous data backups, wherein identifying the first plurality of signatures is based, at least in part, on the first data backup being a most recent of the plurality of previous data backups; perform a first search of the first plurality of signatures for each of a second plurality of signatures of data chunks for a second data backup, wherein the first data backup precedes the second data backup and is a most recent of the plurality of previous data backups, for each of the second plurality of signatures found in the first plurality of signatures, indicate that a corresponding data chunk is not to be stored for the second data backup; and perform a second search of at least a third plurality of signatures of a third backup of the plurality of previous data backups that precedes the first backup, wherein the instructions to perform the second search comprise instructions to load a first subset of signatures of the third plurality of signatures from a backup domain into memory to create an in-memory map, search the in-memory map for the second plurality of signatures, and for each of the second plurality of signatures found in the in-memory map, indicate that a corresponding data chunk is not to be stored for the second data backup. 14. The non-transitory computer readable storage medium of claim 13 further comprising instructions to: load the subset of signatures from a backup domain into memory prior to searching. 15. The non-transitory computer readable storage medium of claim 13 , wherein the instructions to identify the first plurality of signatures comprise instructions to: determine a source, a destination, and a backup method for the second data backup; identify the first data backup as the last successful or incomplete backup from the source to the destination with the backup method; and access a file for the first data backup that includes the first plurality of signatures. 16. The non-transitory computer readable storage medium of claim 13 , wherein the instructions to perform the second search comprise instructions to: perform the second search after

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9766983B2 cover?
A computer implemented method and system obtains current signatures of data chunks and performs a proximity search of a library of previous signatures as a function of the likely location of corresponding data chunks. A full search of the library of previous signatures for those current signatures not found in the proximity search is also performed.
Who is the assignee on this patent?
Reddy Chandra, Parikh Prashant, Song Liqiu, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F11/1453. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Sep 19 2017 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).