Deduplicating data for a data storage system using similarity determinations

US2017123711A1 · US · A1

Patent metadata
Field               Value
Publication number  US-2017123711-A1
Application number  US-201514928848-A
Country             US
Kind code           A1
Filing date         Oct 30, 2015
Priority date       Oct 30, 2015
Publication date    May 4, 2017
Grant date          (not listed)

Abstract

Official abstract text for this publication.

A method and system for deduplicating data for a data storage system using similarity determinations are described. A tape library is arranged in a hierarchy of tape groups and tape plexes. Tape groups are an admin visible entity and are comprised of multiple tape plexes (at least equal to the number of replicas in a tape group). Tape plexes in turn comprise multiple tape cartridges. Data files and objects received within a time period are initially staged in a disk cache where they are logically segregated into cliques based on their expected deduplication ratios. These cliques are then evaluated for the amount of duplication they have with data existing in tape plexes. Based on the number of replicas being written, the top few tape plexes are selected from within the tape group. The cliques are deduplicated with data on the selected tape plexes, compressed, and written to tape.
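The selection step in the abstract can be illustrated with a minimal sketch. It assumes each tape plex is summarized by a set of chunk fingerprints held in memory and that similarity is a simple count of shared fingerprints; the publication does not specify either choice. The names `fingerprint`, `select_plexes`, `write_clique`, and `plex_index`, and the use of SHA-256 and zlib, are hypothetical and chosen only for illustration.

```python
import hashlib
import zlib
from typing import Dict, List, Set


def fingerprint(chunk: bytes) -> str:
    """Content hash used as a chunk's identity (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def select_plexes(clique_fps: Set[str],
                  plex_index: Dict[str, Set[str]],
                  replicas: int) -> List[str]:
    """Pick the `replicas` plexes that already hold the most clique fingerprints."""
    overlap = {name: len(clique_fps & stored) for name, stored in plex_index.items()}
    # Highest overlap first; ties are broken by plex name to keep results stable.
    ranked = sorted(overlap, key=lambda name: (-overlap[name], name))
    return ranked[:replicas]


def write_clique(chunks: List[bytes],
                 plex_index: Dict[str, Set[str]],
                 replicas: int) -> Dict[str, bytes]:
    """Deduplicate a clique against each selected plex and compress what is left."""
    by_fp = {fingerprint(c): c for c in chunks}
    payloads: Dict[str, bytes] = {}
    for name in select_plexes(set(by_fp), plex_index, replicas):
        # Keep only chunks the plex does not already store.
        new_fps = [fp for fp in by_fp if fp not in plex_index[name]]
        payloads[name] = zlib.compress(b"".join(by_fp[fp] for fp in new_fps))
        # Record the new fingerprints so later cliques can deduplicate against them.
        plex_index[name].update(new_fps)
    return payloads
```

In this sketch the returned payloads stand in for the data that would be written to tape. A plex that already holds most of a clique receives a small payload, which is the motivation for ranking plexes by overlap before writing.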

First claim

Opening claim text (preview).

What is claimed is:

1. A data storage system comprising: a memory resource to store instructions; one or more processors using the instructions stored in the memory resource to: receive data to be stored at the data storage system; determine a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; select one or more of the plurality of storage elements based on the determined similarity; and write the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.

2. The system of claim 1, comprising further instructions used by the one or more processors to: identify patterns of bytes within the received data; separate the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determine a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; select one or more of the plurality of storage elements based on the subset similarity; and write the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.

3. The system of claim 2, comprising further instructions used by the one or more processors to: determine the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and store the generated subset fingerprints in association with the selected storage elements.

4. The system of claim 1, comprising further instructions used by the one or more processors to: copy the received data to create one or more replicas; and write each of the one or more replicas to one of the selected storage elements, including, for each of the replicas, deduplicating the replica with the data stored on that storage element.

5. The system of claim 4, wherein selecting the one or more of the plurality of storage elements comprises selecting the storage element with a highest similarity, and for each replica, selecting the storage element with a next highest similarity.

6. The system of claim 1, wherein the data stored on each of the plurality of storage elements are divided into windows based on how recently the data was stored, and wherein similarity is only determined between the received data and data from a predetermined number of recent windows on each of the plurality of storage elements.

7. The system of claim 1, wherein each of the plurality of storage elements comprises multiple linear tape cartridges.

8. A method of writing data in a data storage system, the method being implemented by one or more processors and comprising: receiving data to be stored at the data storage system; determining a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the determined similarity; and writing the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.

9. The method of claim 8, further comprising: identifying patterns of bytes within the received data; separating the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determining a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the subset similarity; and writing the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.

10. The method of claim 9, further comprising: determining the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and storing the generated subset fingerprints in association with the selected storage elements.

11. The method of claim 8, further comprising: copying the received data to create one or more replicas; and writing each of the one or more replicas to one of the selected storage elements, including, for each of the replicas, deduplicating the replica with the data stored on that storage element.

12. The method of claim 11, wherein selecting the one or more of the plurality of storage elements comprises selecting the storage element with a highest similarity, and for each replica, selecting the storage element with a next highest similarity.

13. The method of claim 8, wherein the data stored on each of the plurality of storage elements are divided into windows based on how recently the data was stored, and wherein similarity is only determined between the received data and data from a predetermined number of recent windows on each of the plurality of storage elements.

14. The method of claim 8, wherein each of the plurality of storage elements comprises multiple linear tape cartridges.

15. A non-transitory computer-readable medium that stores instructions, executable by one or more processors, to cause the one or more processors to perform operations that comprise: receiving data to be stored at a data storage system; determining a similarity between the received data and data stored on each of a plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the determined similarity; and writing the received data to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the received data with the data stored on that storage element.

16. The non-transitory computer-readable medium of claim 15, storing further instructions used by the one or more processors to perform operations that comprise: identifying patterns of bytes within the received data; separating the received data into one or more subsets based on the identified patterns of bytes; and for each of the one or more subsets: determining a subset similarity between the subset and data stored on each of the plurality of storage elements at the data storage system; selecting one or more of the plurality of storage elements based on the subset similarity; and writing the subset to the one or more selected storage elements, including, for each of the selected storage elements, deduplicating the subset with the data stored on that storage element.

17. The non-transitory computer-readable medium of claim 16, storing further instructions used by the one or more processors to perform operations that comprise: determining the subset similarity by (i) applying a hashing algorithm to the subset to generate a subset fingerprint, and (ii) comparing the subset fingerprint to stored fingerprints corresponding to the data stored on each of the plurality of storage elements; and storing the generated subset fingerprints
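Claims 3, 5, and 6 (and their method counterparts) combine fingerprint comparison, recency windows, and ranked placement of replicas. The sketch below shows one way these could fit together. It assumes fingerprints are SHA-256 digests of fixed blocks and that similarity is the fraction of incoming fingerprints found in the recent windows; the claims do not prescribe either. `StorageElement`, `subset_fingerprints`, `similarity`, and `rank_targets` are hypothetical names, not terms from the publication.

```python
import hashlib
from typing import Iterable, List, Set


def subset_fingerprints(blocks: Iterable[bytes]) -> Set[str]:
    """Hash each block of a subset (SHA-256 is an assumed hashing algorithm)."""
    return {hashlib.sha256(block).hexdigest() for block in blocks}


class StorageElement:
    """A storage element whose stored fingerprints are grouped into windows."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.windows: List[Set[str]] = []  # oldest window first

    def store(self, fps: Set[str]) -> None:
        """Record a newly written batch of fingerprints as the latest window."""
        self.windows.append(set(fps))

    def recent(self, n: int) -> Set[str]:
        """Union of fingerprints in the n most recent windows."""
        merged: Set[str] = set()
        for window in (self.windows[-n:] if n > 0 else []):
            merged |= window
        return merged


def similarity(fps: Set[str], element: StorageElement, recent_windows: int) -> float:
    """Fraction of incoming fingerprints already present in the recent windows."""
    if not fps:
        return 0.0
    return len(fps & element.recent(recent_windows)) / len(fps)


def rank_targets(fps: Set[str], elements: List[StorageElement],
                 recent_windows: int, copies: int) -> List[str]:
    """Highest-similarity element first; each further copy takes the next highest."""
    ranked = sorted(elements,
                    key=lambda e: (-similarity(fps, e, recent_windows), e.name))
    return [e.name for e in ranked[:copies]]
```

Restricting the comparison to a fixed number of recent windows bounds how many fingerprints are examined per element, at the cost of missing duplicates that exist only in older data.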

Assignees

Netapp Inc

Inventors

Classifications

  • Saving storage space on storage systems · CPC title

  • Libraries, e.g. tape libraries, jukebox · CPC title

  • G06F3/0641 · Primary

    De-duplication techniques · CPC title

  • Replication mechanisms · CPC title

  • Improving or facilitating administration, e.g. storage management · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017123711A1 cover?
A method and system for deduplicating data for a data storage system using similarity determinations are described. A tape library is arranged in a hierarchy of tape groups and tape plexes. Tape groups are an admin visible entity and are comprised of multiple tape plexes (at least equal to the number of replicas in a tape group). Tape plexes in turn comprise multiple tape cartridges. Data files…
Who is the assignee on this patent?
Netapp Inc
What technology area does this patent fall under?
Primary CPC classification G06F3/0641. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 4, 2017 (kind code A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).