Data boundary identification

US9690668B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9690668-B2
Application number  US-13051708-A
Country             US
Kind code           B2
Filing date         May 30, 2008
Priority date       Mar 5, 2008
Publication date    Jun 27, 2017
Grant date          Jun 27, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method obtain a set of data and identify successive subsets of data within the set of data. A boundary identifying hash is calculated on a subset of data and compared with a boundary indicating value. If the calculated boundary identifying hash matches the boundary indicating value, a natural boundary is identified in the set of data.
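The boundary test described in the abstract can be illustrated with a short Python sketch. It is not taken from the patent: the Adler-style rolling checksum, the window length of 16 bytes, the divisor of 61, the boundary indicating value of 0, and the names window_hash and find_boundaries are all assumptions chosen for illustration. The sketch slides a window over the data one byte at a time, updates the hash incrementally, and records a boundary wherever the hash matches the boundary indicating value.

```python
MOD = 65521  # largest prime below 2**16, the modulus used by Adler-32


def window_hash(block: bytes) -> int:
    """Adler-style hash of a single window, computed from scratch."""
    w = len(block)
    a = sum(block) % MOD
    b = sum((w - i) * x for i, x in enumerate(block)) % MOD
    return (b << 16) | a


def find_boundaries(data: bytes, window: int = 16,
                    divisor: int = 61, boundary_value: int = 0) -> list:
    """Return the offset just past every window whose hash matches.

    All parameter defaults are hypothetical; they only control how
    often a match occurs (roughly once per `divisor` positions).
    """
    boundaries = []
    if len(data) < window:
        return boundaries
    # Hash of the first window, computed directly.
    a = sum(data[:window]) % MOD
    b = sum((window - i) * data[i] for i in range(window)) % MOD
    end = window  # the window covers data[end - window:end]
    while True:
        if ((b << 16) | a) % divisor == boundary_value:
            boundaries.append(end)
        if end == len(data):
            break
        # Roll the window forward by one byte without rescanning it.
        outgoing, incoming = data[end - window], data[end]
        a = (a - outgoing + incoming) % MOD
        b = (b - window * outgoing + a) % MOD
        end += 1
    return boundaries
```

Because each hash depends only on the bytes inside the window, inserting bytes at the front of the data moves every later boundary by the same amount instead of changing where the boundaries fall relative to the content. That property is what makes such boundaries "natural" for detecting repeated data.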

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method comprising: identifying chunk boundaries in a data set and determining variable size data chunks within the data set based, on identifying the chunk boundaries, wherein identifying each of the chunk boundaries comprises, calculating, with a first hash function that is a rolling hash function, boundary identifying hashes based on progressively shifting a window over the data set until either reaching a maximum chunk size or matching a first boundary indicating value; after calculating each of the boundary identifying hashes, comparing the boundary identifying hash with the first boundary indicating value to determine whether the boundary identifying hash matches; before shifting the window, determining whether a size of data within the window and trailing the window since a preceding chunk boundary is equal to the maximum chunk size to determine whether the maximum chunk size has been reached; wherein each of the variable size data chunks is determined when the maximum chunk size is reached or a corresponding one of the chunk boundaries is identified, wherein each of the variable size data chunk comprises data within the window and data trailing the window since a preceding chunk boundary when the maximum chunk size is reached or when the corresponding one of the chunk boundaries is identified; calculating hashes of each of the variable size data chunks using a second hash function that is different than the first hash function; determining ones of the variable size data chunks to back up based on the calculated hashes of the variable size data chunks; and backing up the determined ones of the variable size data chunks.

2. The computer-implemented method of claim 1, wherein the first hash function is an Adler hash function and the second hash function is a MD5 hash function.

3. The computer-implemented method of claim 1, wherein identifying each of the chunk boundaries further comprises: for determining each of the variable size data chunks, comparing the boundary identifying hashes corresponding to the variable size data chunk with a second boundary indicating value as well as the first boundary indicating value until matching at least one of the boundary indicating values, wherein identifying the chunk boundary corresponding to the variable size data chunk is based on the second boundary indicating value when none of the boundary identifying hashes corresponding to the variable size data chunk matches the first boundary indicating value but at least one matches the second boundary indicating value.

4. The computer-implemented method of claim 1 further comprising: for each variable size data chunk determined based on identifying a chunk boundary, calculating, with the first hash function, a hash of the size of the variable size data chunk; and forming a signature for the variable size data chunk with the hash of the size of the variable size data chunk, the boundary identifying hash that identified the chunk boundary of the variable size data chunk, and the hash of the variable size data chunk; wherein determining ones of the variable size data chunks to back up comprises comparing the formed signatures to signatures in a database and indicating the ones of the variable size data chunks for which no matching signature is found as the ones of the variable size data chunks to back up.

5. The computer-implemented method of claim 1, wherein the first hash function is weaker than the second hash function.

6. The computer-implemented method of claim 1, wherein a size of the window is at least an identified minimum chunk size.

7. The computer-implemented method of claim 6 further comprising starting the window within each data subset occurring after each identified chunk boundary or occurring at a beginning of the data set, instead of starting the window at the beginning, wherein the data subsets are each at least the identified minimum chunk size.

8. One or more non-transitory machine readable media comprising program instructions for data chunk boundary identification, the program instructions to: identify chunk boundaries in a data set and determine variable size data chunks within the data set based on identification of the chunk boundaries within the data set, wherein the program instructions to identify each of the chunk boundaries comprise program instructions to, calculate, with a first hash function that is a rolling hash function, boundary identifying hashes based on progressively shifting a window over the data set until either reaching a maximum chunk size or matching a first boundary indicating value; after calculation of each boundary identifying hash, compare the boundary identifying hash with the first boundary indicating value to determine whether the boundary identifying hash matches; before shift of the window, determine whether a size of data within the window and trailing the window since a preceding chunk boundary is equal to the maximum chunk size to determine whether the maximum chunk size has been reached; wherein each of the variable size data chunks is determined when the maximum chunk size is reached or a corresponding one of the chunk boundaries is identified, wherein each of the variable size data chunk comprises data within the window and data trailing the window since a preceding chunk boundary when the maximum chunk size is reached or the corresponding one of the chunk boundaries is identified; calculate hashes of each of the variable size data chunks using a second hash function that is different than the first hash function; determine ones of the variable size data chunks to back up based on the calculated hashes of the variable size data chunks; and back up the determined ones of the variable size data chunks.

9. The non-transitory machine readable media of claim 8, wherein the program instructions to identify each of the chunk boundaries comprises program instructions to: for the determination of each of the variable size data chunks, compare the boundary identifying hashes corresponding to the variable size data chunk with a second boundary indicating value as well as the first boundary indicating value until matching at least one of the boundary indicating values, wherein identification of the chunk boundary corresponding to the variable size data chunk is based on the second boundary indicating value when none of the boundary identifying hashes corresponding to the variable size data chunk matches the first boundary indicating value but at least one matches the second boundary indicating value.

10. The non-transitory machine readable media of claim 8 further comprising program instructions to: for each variable size data chunk determined based on identifying a chunk boundary, calculate, with the first hash function, a hash of the size of the variable size data chunk; and form a signature for the variable size data chunk with the hash of the size of the variable size data chunk, the boundary identifying hash that identified the chunk boundary of the variable size data chunk, and the hash of the variable size data chunk; wherein the program instructions to determine ones of the variable size data chunks to back up comprises program instructions to compare the formed signatures to signatures in a database, and indicate the ones of the variable size data chunks for which no matching signature is found as the ones of the variable size data chunks to back up.

11. The non-transitory machine readable media of claim 8, wherein the first hash function is weaker than the second hash function.

12. The non-transitory machine readable media of claim 8, wherein a
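As a reading aid for claims 1 and 2, the Python sketch below walks through the claimed sequence: hash the window, compare with the boundary indicating value, check the maximum chunk size before shifting, then hash each finished chunk with MD5 to decide what to back up. It is an illustration under stated assumptions, not the patented implementation. The names split_chunks and select_for_backup, the 16-byte window, the 256-byte maximum, the divisor of 61, and the in-memory set standing in for a signature database are all hypothetical. For brevity the sketch recomputes zlib.adler32 on every window position instead of updating it incrementally, so it shows the control flow of the claim rather than a true rolling hash.

```python
import hashlib
import zlib


def split_chunks(data: bytes, window: int = 16, max_chunk: int = 256,
                 divisor: int = 61, boundary_value: int = 0) -> list:
    """Split data into variable size chunks (assumes max_chunk >= window)."""
    chunks = []
    start = 0  # offset just past the preceding chunk boundary
    n = len(data)
    while start < n:
        # The window restarts after each boundary and ends at `end`.
        end = min(start + window, n)
        while True:
            if end - start >= window:
                # First (weak) hash over the bytes currently in the window.
                h = zlib.adler32(data[end - window:end])
                if h % divisor == boundary_value:
                    break  # boundary identified by a hash match
            # Before shifting: stop at the size cap or at end of data.
            if end - start >= max_chunk or end >= n:
                break
            end += 1  # shift the window by one byte
        # The chunk holds the window plus the data trailing it since
        # the preceding boundary.
        chunks.append(data[start:end])
        start = end
    return chunks


def select_for_backup(chunks: list, known_digests: set) -> list:
    """Return chunks whose MD5 digest is not yet in known_digests.

    known_digests is a hypothetical stand-in for a signature database
    and is updated in place.
    """
    to_backup = []
    for chunk in chunks:
        digest = hashlib.md5(chunk).hexdigest()  # second (stronger) hash
        if digest not in known_digests:
            known_digests.add(digest)
            to_backup.append(chunk)
    return to_backup
```

Calling select_for_backup(split_chunks(data), store) twice with the same data and the same store returns every distinct chunk on the first call and an empty list on the second, which is the deduplication effect the claims describe.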

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9690668B2 cover?
A system and method obtain a set of data and identify successive subsets of data within the set of data. A boundary identifying hash is calculated on a subset of data and compared with a boundary indicating value. If the calculated boundary identifying hash matches the boundary indicating value, a natural boundary is identified in the set of data.
Who is the assignee on this patent?
Reddy Chandra, Karonde Pratap, Parikh Prashant, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F11/1453. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jun 27, 2017 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).