US9690668B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9690668-B2 |
| Application number | US-13051708-A |
| Country | US |
| Kind code | B2 |
| Filing date | May 30, 2008 |
| Priority date | Mar 5, 2008 |
| Publication date | Jun 27, 2017 |
| Grant date | Jun 27, 2017 |
Official abstract text for this publication.
A system and method obtain a set of data and identify successive subsets of data within the set of data. A boundary identifying hash is calculated on a subset of data and compared with a boundary indicating value. If the calculated boundary identifying hash matches the boundary indicating value, a natural boundary is identified in the set of data.
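The boundary test described in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `find_natural_boundaries`, the window size, and the use of a bit mask to compare only the low bits of the hash against the boundary indicating value are all assumptions made for illustration. Adler-32 is used because claim 2 names an Adler hash; for clarity the checksum is recomputed for each window position rather than updated incrementally as a true rolling hash would be.

```python
import zlib


def find_natural_boundaries(data: bytes, window: int = 16,
                            mask: int = 0x3F, boundary_value: int = 0) -> list[int]:
    """Return end offsets of every window whose hash matches the boundary value.

    Hypothetical helper: each successive subset is the `window` bytes ending
    at `end`; a match on the masked hash marks a natural boundary at `end`.
    """
    boundaries = []
    for end in range(window, len(data) + 1):
        # Boundary identifying hash of the current subset of data.
        h = zlib.adler32(data[end - window:end])
        # Compare with the boundary indicating value (masking is an assumption).
        if h & mask == boundary_value:
            boundaries.append(end)
    return boundaries
```

Because each hash depends only on the bytes inside the window, inserting data at the front of the set shifts the boundaries found in the unchanged data by the length of the insertion instead of moving them arbitrarily.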
Opening claim text (preview).
The invention claimed is:

1. A computer-implemented method comprising:
identifying chunk boundaries in a data set and determining variable size data chunks within the data set based, on identifying the chunk boundaries, wherein identifying each of the chunk boundaries comprises,
calculating, with a first hash function that is a rolling hash function, boundary identifying hashes based on progressively shifting a window over the data set until either reaching a maximum chunk size or matching a first boundary indicating value;
after calculating each of the boundary identifying hashes, comparing the boundary identifying hash with the first boundary indicating value to determine whether the boundary identifying hash matches;
before shifting the window, determining whether a size of data within the window and trailing the window since a preceding chunk boundary is equal to the maximum chunk size to determine whether the maximum chunk size has been reached;
wherein each of the variable size data chunks is determined when the maximum chunk size is reached or a corresponding one of the chunk boundaries is identified, wherein each of the variable size data chunk comprises data within the window and data trailing the window since a preceding chunk boundary when the maximum chunk size is reached or when the corresponding one of the chunk boundaries is identified;
calculating hashes of each of the variable size data chunks using a second hash function that is different than the first hash function;
determining ones of the variable size data chunks to back up based on the calculated hashes of the variable size data chunks; and
backing up the determined ones of the variable size data chunks.

2. The computer-implemented method of claim 1, wherein the first hash function is an Adler hash function and the second hash function is a MD5 hash function.

3. The computer-implemented method of claim 1, wherein identifying each of the chunk boundaries further comprises:
for determining each of the variable size data chunks, comparing the boundary identifying hashes corresponding to the variable size data chunk with a second boundary indicating value as well as the first boundary indicating value until matching at least one of the boundary indicating values, wherein identifying the chunk boundary corresponding to the variable size data chunk is based on the second boundary indicating value when none of the boundary identifying hashes corresponding to the variable size data chunk matches the first boundary indicating value but at least one matches the second boundary indicating value.

4. The computer-implemented method of claim 1 further comprising:
for each variable size data chunk determined based on identifying a chunk boundary, calculating, with the first hash function, a hash of the size of the variable size data chunk; and
forming a signature for the variable size data chunk with the hash of the size of the variable size data chunk, the boundary identifying hash that identified the chunk boundary of the variable size data chunk, and the hash of the variable size data chunk;
wherein determining ones of the variable size data chunks to back up comprises comparing the formed signatures to signatures in a database and indicating the ones of the variable size data chunks for which no matching signature is found as the ones of the variable size data chunks to back up.

5. The computer-implemented method of claim 1, wherein the first hash function is weaker than the second hash function.

6. The computer-implemented method of claim 1, wherein a size of the window is at least an identified minimum chunk size.

7. The computer-implemented method of claim 6 further comprising starting the window within each data subset occurring after each identified chunk boundary or occurring at a beginning of the data set, instead of starting the window at the beginning, wherein the data subsets are each at least the identified minimum chunk size.

8. One or more non-transitory machine readable media comprising program instructions for data chunk boundary identification, the program instructions to:
identify chunk boundaries in a data set and determine variable size data chunks within the data set based on identification of the chunk boundaries within the data set, wherein the program instructions to identify each of the chunk boundaries comprise program instructions to,
calculate, with a first hash function that is a rolling hash function, boundary identifying hashes based on progressively shifting a window over the data set until either reaching a maximum chunk size or matching a first boundary indicating value;
after calculation of each boundary identifying hash, compare the boundary identifying hash with the first boundary indicating value to determine whether the boundary identifying hash matches;
before shift of the window, determine whether a size of data within the window and trailing the window since a preceding chunk boundary is equal to the maximum chunk size to determine whether the maximum chunk size has been reached;
wherein each of the variable size data chunks is determined when the maximum chunk size is reached or a corresponding one of the chunk boundaries is identified, wherein each of the variable size data chunk comprises data within the window and data trailing the window since a preceding chunk boundary when the maximum chunk size is reached or the corresponding one of the chunk boundaries is identified;
calculate hashes of each of the variable size data chunks using a second hash function that is different than the first hash function;
determine ones of the variable size data chunks to back up based on the calculated hashes of the variable size data chunks; and
back up the determined ones of the variable size data chunks.

9. The non-transitory machine readable media of claim 8, wherein the program instructions to identify each of the chunk boundaries comprises program instructions to:
for the determination of each of the variable size data chunks, compare the boundary identifying hashes corresponding to the variable size data chunk with a second boundary indicating value as well as the first boundary indicating value until matching at least one of the boundary indicating values, wherein identification of the chunk boundary corresponding to the variable size data chunk is based on the second boundary indicating value when none of the boundary identifying hashes corresponding to the variable size data chunk matches the first boundary indicating value but at least one matches the second boundary indicating value.

10. The non-transitory machine readable media of claim 8 further comprising program instructions to:
for each variable size data chunk determined based on identifying a chunk boundary, calculate, with the first hash function, a hash of the size of the variable size data chunk; and
form a signature for the variable size data chunk with the hash of the size of the variable size data chunk, the boundary identifying hash that identified the chunk boundary of the variable size data chunk, and the hash of the variable size data chunk;
wherein the program instructions to determine ones of the variable size data chunks to back up comprises program instructions to compare the formed signatures to signatures in a database, and indicate the ones of the variable size data chunks for which no matching signature is found as the ones of the variable size data chunks to back up.

11. The non-transitory machine readable media of claim 8, wherein the first hash function is weaker than the second hash function.

12. The non-transitory machine readable media of claim 8, wherein a
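
The flow recited in claims 1 and 2 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the names `chunk_data` and `select_chunks_to_back_up`, the default sizes, the bit mask applied to the hash before comparison, and the in-memory set standing in for a signature database are all hypothetical. The sketch assumes `window <= max_chunk`, and it recomputes Adler-32 for each window position rather than rolling it incrementally.

```python
import hashlib
import zlib


def chunk_data(data: bytes, window: int = 16, max_chunk: int = 4096,
               mask: int = 0xFF, boundary_value: int = 0) -> list[bytes]:
    """Split data into variable size chunks (hypothetical helper).

    A chunk ends when the window hash matches the boundary value, when the
    chunk reaches max_chunk bytes, or when the data runs out.
    """
    chunks = []
    start = 0  # offset just past the preceding chunk boundary
    while start < len(data):
        end = min(start + window, len(data))  # window begins at the chunk start
        while end < len(data):
            # First (weaker) hash: boundary identifying hash of the window.
            h = zlib.adler32(data[end - window:end])
            if h & mask == boundary_value:
                break  # chunk boundary identified
            # Before shifting the window, check the maximum chunk size.
            if end - start >= max_chunk:
                break
            end += 1  # shift the window by one byte
        chunks.append(data[start:end])
        start = end
    return chunks


def select_chunks_to_back_up(chunks: list[bytes], known_hashes: set[str]) -> list[bytes]:
    """Return chunks whose second (stronger) hash is not yet known.

    known_hashes is an assumed stand-in for a store of previously backed-up
    chunk hashes; it is updated in place.
    """
    to_back_up = []
    for chunk in chunks:
        digest = hashlib.md5(chunk).hexdigest()  # second hash function
        if digest not in known_hashes:
            known_hashes.add(digest)
            to_back_up.append(chunk)
    return to_back_up
```

Running `select_chunks_to_back_up` a second time over the same chunks with the same `known_hashes` set returns an empty list, which is the deduplication effect the claims describe: only chunks without a matching hash are backed up.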
using compression, e.g. sparse files · CPC title
using de-duplication of the data · CPC title
Physics · mapped topic