Data synchronization using redundancy detection

US9910906B2 · US · B2

Patent metadata
Publication number: US-9910906-B2
Application number: US-201514750944-A
Country: US
Kind code: B2
Filing date: Jun 25, 2015
Priority date: Jun 25, 2015
Publication date: Mar 6, 2018
Grant date: Mar 6, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Managing data in a cloud computing environment, including data transfers. File level and block level similarities are identified, including for archive and nested archive files, residing on datacenters and regional repositories. A replication plan is generated based on receiving a replication instruction, and further based on similarity clusters by transferring unique data blocks and files from best available sources including regional repositories.
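
The similarity detection described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes zip archives only, uses SHA-256 as the checksum, compares whole files rather than blocks, and keys each cluster by the set of nodes holding the data. The names `leaf_files`, `similarity_clusters`, `make_zip`, and the node labels are hypothetical.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def leaf_files(data: bytes):
    """Yield the contents of non-archive files, recursing into nested zips."""
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        with zipfile.ZipFile(buf) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from leaf_files(zf.read(info))
    else:
        yield data


def similarity_clusters(node_files):
    """Group checksums by the set of nodes that hold the same content.

    node_files maps a node name to a list of raw file contents (bytes).
    Returns {frozenset(node names): sorted list of checksums}.
    """
    holders = defaultdict(set)
    for node, blobs in node_files.items():
        for blob in blobs:
            for content in leaf_files(blob):
                holders[hashlib.sha256(content).hexdigest()].add(node)
    clusters = defaultdict(list)
    for checksum, nodes in holders.items():
        clusters[frozenset(nodes)].append(checksum)
    return {nodes: sorted(sums) for nodes, sums in clusters.items()}


def make_zip(members):
    """Build an in-memory zip from {member name: bytes} (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


# Hypothetical example: a nested archive on one node, loose files on another.
inner = make_zip({"lib.so": b"shared library"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"hello"})
node_files = {
    "dc-east": [outer],
    "dc-west": [b"shared library", b"only west"],
}
clusters = similarity_clusters(node_files)
```

In this example the file inside the nested archive on `dc-east` lands in the same cluster as the loose copy on `dc-west`, so a transfer between the two would not need to send it again.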

First claim

Opening claim text (preview).

What is claimed is:

1. A method for transferring data on a plurality of computing nodes, comprising: receiving a request to transfer a first dataset from a source datacenter to a target datacenter; generating a plurality of similarity clusters, wherein each of the plurality of similarity clusters identifies a grouping of data blocks and comprises a list of hash codes of the data blocks and further comprises an image cluster identifier, and wherein the plurality of similarity clusters indicate a block-level similarity between data stored on a first computing node with the data stored on at least one other computing node among the plurality of computing nodes, wherein data stored on at least one computing node in the plurality of computing nodes comprises archived data, and wherein generating the plurality of similarity clusters comprises: extracting the archived data; comparing checksums of the extracted data; and generating the plurality of similarity clusters based on comparing the checksums.

2. The method of claim 1, wherein additional data stored on the at least one computing node or on another computing node in the plurality of computing nodes, or both, comprises virtual machine (VM) image data, and wherein generating the similarity clusters further comprises: comparing checksums of the identified files with additional checksums of the VM image data; and generating the plurality of similarity clusters based on comparing the checksums with the additional checksums.

3. The method of claim 1, further comprising: receiving an instruction to replicate a designated data set, stored on a source computing node, on a target computing node, wherein the source and target computing nodes are among the plurality of computing nodes; identifying a set of similarity clusters that are associated with the designated data set from among the plurality of similarity clusters; identifying a first subset of the set of similarity clusters, wherein data associated with the first subset of similarity clusters is stored only on the source computing node; identifying a second subset of the set of similarity clusters, wherein data associated with the second subset of similarity clusters is stored at least on the source computing node and on the target computing node; and identifying a third subset of the set of similarity clusters, wherein data associated with the third subset of similarity clusters is stored on the source computing node and a set of computing nodes other than the source computing node and other than the target computing node.

4. The method of claim 3, further comprising generating a data replication plan, wherein the generating comprises: identifying the source computing node as a source for replicating the data associated with the first subset of similarity clusters; identifying at least one computing node among the set of computing nodes other than the source computing node and other than the target computing node as a source for replicating the data associated with the third subset of similarity clusters; and generating the data transfer plan based on the identifying.

5. The method of claim 4, further comprising: generating an instruction to replicate the designated data set on the target computing node based on the data replication plan, whereby replication of the data associated with the second subset of similarity clusters on the target computing node is performed without transferring the data to the target computing node.

6. The method of claim 4, where generating the data transfer plan further comprises: identifying a set of data repositories associated with a region of the source computing node, a region of the at least one computing node, or both; wherein generating the data transfer plan is further based on identifying the set of data repositories.

7. The method of claim 5, further comprising: de-duplicating the un-archived data; generating the plurality of similarity clusters based on the de-duplicating.

8. The method of claim 1, wherein the un-archiving comprises: recursively un-archiving nested archived data.

9. The method of claim 1, wherein a format of the archived data is one of: tar.gz, tar.bz2, tar.xz, tgz, zip, tar, rar, rpm, and tcdriver.

10. A computer system for managing data on a plurality of computing nodes, comprising: a computer device having a processor and a tangible storage device; and a program embodied on the storage device for execution by the processor, the program having a plurality of program instructions for generating a plurality of similarity clusters, wherein each of the plurality of similarity clusters identifies a grouping of data blocks and comprises a list of hash codes of the data blocks and further comprises an image cluster identifier, and wherein the plurality of similarity clusters indicate a block-level similarity between data stored on a first computing node with the data stored on at least one other computing node among the plurality of computing nodes, wherein data stored on at least one computing node in the plurality of computing nodes comprises archived data, and wherein generating the plurality of similarity clusters comprises: extracting the archived data; comparing checksums of the extracted data; and generating the plurality of similarity clusters based on comparing the checksums.

11. The system of claim 10, wherein additional data stored on the at least one computing node or on another computing node in the plurality of computing nodes, or both, comprises virtual machine (VM) image data, and wherein generating the similarity clusters further comprises: comparing checksums of the identified files with additional checksums of the VM image data; and generating the plurality of similarity clusters based on comparing the checksums with the additional checksums.

12. The system of claim 10, wherein the program instructions further comprise instructions for: receiving an instruction to replicate a designated data set, stored on a source computing node, on a target computing node, wherein the source and target computing nodes are among the plurality of computing nodes; identifying a set of similarity clusters that are associated with the designated data set from among the plurality of similarity clusters; identifying a first subset of the set of similarity clusters, wherein data associated with the first subset of similarity clusters is stored only on the source computing node; identifying a second subset of the set of similarity clusters, wherein data associated with the second subset of similarity clusters is stored at least on the source computing node and on the target computing node; and identifying a third subset of the set of similarity clusters, wherein data associated with the third subset of similarity clusters is stored on the source computing node and a set of computing nodes other than the source computing node and other than the target computing node.

13. The system of claim 12, wherein the program instructions further comprise instructions for generating a data replication plan, wherein the generating comprises: identifying the source computing node as a source for replicating the data associated with the first subset of similarity clusters; identifying at least one computing node among the set of computing nodes other than the source computing node and other than the target computing node as a source for replicating the data associated with the third subset of similarity clusters; and generating the data transfer plan based on the identifying.

14. The system of claim 13, wherein the program instructions further
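
The three-way split of similarity clusters in claims 3 to 5 (and the matching system claims) can be sketched as follows. This is an illustration under assumptions, not the claimed method: the `Cluster` type, the `plan_replication` function, the plan's field names, and the rule for picking an alternate source (alphabetically first node) are all hypothetical placeholders. The claims do not specify how the best available source is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    cluster_id: str
    nodes: frozenset  # names of the nodes that hold this cluster's data


def plan_replication(clusters, source, target):
    """Partition clusters of a data set on `source` by where the data lives."""
    plan = {
        "from_source": [],        # first subset: data only on the source
        "already_on_target": [],  # second subset: no transfer needed
        "from_other": {},         # third subset: cluster id -> alternate node
    }
    for cluster in clusters:
        if source not in cluster.nodes:
            continue  # not part of the designated data set on the source
        if target in cluster.nodes:
            plan["already_on_target"].append(cluster.cluster_id)
        elif cluster.nodes == frozenset({source}):
            plan["from_source"].append(cluster.cluster_id)
        else:
            # Placeholder selection rule (assumption): take the first
            # alternate node by name. A real planner would weigh region,
            # bandwidth, or load.
            others = sorted(cluster.nodes - {source})
            plan["from_other"][cluster.cluster_id] = others[0]
    return plan


# Hypothetical example with one cluster in each category plus an unrelated one.
clusters = [
    Cluster("c1", frozenset({"src"})),
    Cluster("c2", frozenset({"src", "tgt"})),
    Cluster("c3", frozenset({"src", "repo-a", "repo-b"})),
    Cluster("c4", frozenset({"repo-a"})),
]
plan = plan_replication(clusters, "src", "tgt")
```

Here `c2` is skipped because the target already holds it, `c1` must come from the source, and `c3` can be fetched from a node other than the source.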

Assignees

IBM

Inventors

Classifications

  • Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title

  • G06F16/178 (Primary)

    Techniques for file synchronisation in file systems · CPC title

  • for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS] · CPC title

  • Hypervisor-specific management and integration aspects · CPC title

  • Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9910906B2 cover?
Managing data in a cloud computing environment, including data transfers. File level and block level similarities are identified, including for archive and nested archive files, residing on datacenters and regional repositories. A replication plan is generated based on receiving a replication instruction, and further based on similarity clusters by transferring unique data blocks and files from…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/178. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 6, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).