Efficient tape backup using deduplicated data
US-9448739-B1 · Sep 20, 2016 · US
US2017123711A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2017123711-A1 |
| Application number | US-201514928848-A |
| Country | US |
| Kind code | A1 |
| Filing date | Oct 30, 2015 |
| Priority date | Oct 30, 2015 |
| Publication date | May 4, 2017 |
| Grant date | — |
Official abstract text for this publication.
A method and system for deduplicating data for a data storage system using similarity determinations are described. A tape library is arranged in a hierarchy of tape groups and tape plexes. Tape groups are an admin visible entity and are comprised of multiple tape plexes (at least equal to the number of replicas in a tape group). Tape plexes in turn comprise multiple tape cartridges. Data files and objects received within a time period are initially staged in a disk cache where they are logically segregated into cliques based on their expected deduplication ratios. These cliques are then evaluated for the amount of duplication they have with data existing in tape plexes. Based on the number of replicas being written, the top few tape plexes are selected from within the tape group. The cliques are deduplicated with data on the selected tape plexes, compressed, and written to tape.
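The selection step in the abstract (rank tape plexes by how much duplicate data they already hold, pick as many as there are replicas, then deduplicate, compress, and write) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: fixed-size chunking, SHA-256 chunk fingerprints, similarity measured as the fraction of a clique's chunks already present on a plex, zlib for compression, and the names `split_chunks`, `fingerprint`, `similarity`, `select_plexes`, and `write_clique`.

```python
import hashlib
import zlib

CHUNK_SIZE = 4  # hypothetical; unrealistically small so the example is easy to follow


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data into fixed-size chunks (an assumed chunking scheme)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


def fingerprint(chunk: bytes) -> bytes:
    """Hash one chunk; SHA-256 is an assumption, the abstract names no algorithm."""
    return hashlib.sha256(chunk).digest()


def similarity(clique: bytes, plex_index: set[bytes]) -> float:
    """Fraction of the clique's distinct chunks already stored on the plex."""
    fps = {fingerprint(c) for c in split_chunks(clique)}
    return len(fps & plex_index) / len(fps) if fps else 0.0


def select_plexes(clique: bytes, plexes: dict[str, set[bytes]], replicas: int) -> list[str]:
    """Return the names of the `replicas` plexes with the highest similarity."""
    ranked = sorted(plexes, key=lambda name: similarity(clique, plexes[name]), reverse=True)
    return ranked[:replicas]


def write_clique(clique: bytes, plexes: dict[str, set[bytes]], replicas: int) -> dict[str, bytes]:
    """Deduplicate the clique against each selected plex, then compress what is new."""
    written = {}
    for name in select_plexes(clique, plexes, replicas):
        index = plexes[name]
        new_chunks = []
        for chunk in split_chunks(clique):
            fp = fingerprint(chunk)
            if fp not in index:  # only chunks the plex does not already hold are written
                index.add(fp)
                new_chunks.append(chunk)
        written[name] = zlib.compress(b"".join(new_chunks))
    return written
```

In this sketch each plex is represented only by its set of chunk fingerprints. A plex that already holds most of a clique receives very little new data, which is the motivation for ranking by similarity before writing.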
Opening claim text (preview).
What is claimed is:
1. A data storage system comprising: a memory resource to store instructions; one or more processors using the instructions stored in the memory resource to: receive data to be stored at the data storage system; determine a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; select one or more of the plurality of storage elements based on the determined similarity; and write the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.
2. The system of claim 1, comprising further instructions used by the one or more processors to: identify patterns of bytes within the received data; separate the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determine a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; select one or more of the plurality of storage elements based on the subset similarity; and write the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.
3. The system of claim 2, comprising further instructions used by the one or more processors to: determine the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and store the generated subset fingerprints in association with the selected storage elements.
4. The system of claim 1, comprising further instructions used by the one or more processors to: copy the received data to create one or more replicas; and write each of the one or more replicas to one of the selected storage elements, including, for each of the replicas, deduplicating the replica with the data stored on that storage element.
5. The system of claim 4, wherein selecting the one or more of the plurality of storage elements comprises selecting the storage element with a highest similarity, and for each replica, selecting the storage element with a next highest similarity.
6. The system of claim 1, wherein the data stored on each of the plurality of storage elements are divided into windows based on how recently the data was stored, and wherein similarity is only determined between the received data and data from a predetermined number of recent windows on each of the plurality of storage elements.
7. The system of claim 1, wherein each of the plurality of storage elements comprises multiple linear tape cartridges.
8. A method of writing data in a data storage system, the method being implemented by one or more processors and comprising: receiving data to be stored at the data storage system; determining a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the determined similarity; and writing the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.
9. The method of claim 8, further comprising: identifying patterns of bytes within the received data; separating the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determining a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the subset similarity; and writing the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.
10. The method of claim 9, further comprising: determining the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and storing the generated subset fingerprints in association with the selected storage elements.
11. The method of claim 8, further comprising: copying the received data to create one or more replicas; and writing each of the one or more replicas to one of the selected storage elements, including, for each of the replicas, deduplicating the replica with the data stored on that storage element.
12. The method of claim 11, wherein selecting the one or more of the plurality of storage elements comprises selecting the storage element with a highest similarity, and for each replica, selecting the storage element with a next highest similarity.
13. The method of claim 8, wherein the data stored on each of the plurality of storage elements are divided into windows based on how recently the data was stored, and wherein similarity is only determined between the received data and data from a predetermined number of recent windows on each of the plurality of storage elements.
14. The method of claim 8, wherein each of the plurality of storage elements comprises multiple linear tape cartridges.
15. A non-transitory computer-readable medium that stores instructions, executable by one or more processors, to cause the one or more processors to perform operations that comprise: receiving data to be stored at a data storage system; determining a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the determined similarity; and writing the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.
16. The non-transitory computer-readable medium of claim 15, storing further instructions used by the one or more processors to perform operations that comprise: identifying patterns of bytes within the received data; separating the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determining a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the subset similarity; and writing the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.
17. The non-transitory computer-readable medium of claim 16, storing further instructions used by the one or more processors to perform operations that comprise: determining the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and storing the generated subset fingerprints
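The windowing limitation in claims 6 and 13, combined with the "highest, then next highest" replica placement in claims 5 and 12, can be sketched as follows. This is an illustrative reading rather than the claimed implementation: the `StorageElement` class, the representation of a window as a set of fingerprint strings, the overlap-ratio similarity measure, and the function names are all hypothetical.

```python
class StorageElement:
    """Hypothetical storage element holding fingerprint windows, oldest first."""

    def __init__(self, name: str, windows: list[set[str]]):
        self.name = name
        self.windows = windows

    def recent_fingerprints(self, n: int) -> set[str]:
        """Union of the fingerprints in the n most recent windows."""
        recent = self.windows[-n:] if n > 0 else []
        return set().union(*recent)


def windowed_similarity(fps: set[str], element: StorageElement, recent_windows: int) -> float:
    """Overlap ratio computed only against the element's recent windows."""
    if not fps:
        return 0.0
    return len(fps & element.recent_fingerprints(recent_windows)) / len(fps)


def place_copies(fps: set[str], elements: list[StorageElement],
                 copies: int, recent_windows: int) -> list[StorageElement]:
    """First entry has the highest similarity; each further copy takes the next highest."""
    ranked = sorted(
        elements,
        key=lambda e: windowed_similarity(fps, e, recent_windows),
        reverse=True,
    )
    return ranked[:copies]


def record_write(fps: set[str], element: StorageElement) -> None:
    """Store the new fingerprints in association with the element's newest window."""
    if not element.windows:
        element.windows.append(set())
    element.windows[-1].update(fps)
```

Restricting the comparison to a fixed number of recent windows bounds how many stored fingerprints must be consulted per element. The trade-off in this sketch is that duplicates held only in older windows are not counted, so an element can rank lower than its full contents would suggest.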
Saving storage space on storage systems · CPC title
Libraries, e.g. tape libraries, jukebox · CPC title
De-duplication techniques · CPC title
Replication mechanisms · CPC title
Improving or facilitating administration, e.g. storage management · CPC title