Scalable de-duplication for storage systems

US9239843B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9239843-B2
Application number  US-63890709-A
Country             US
Kind code           B2
Filing date         Dec 15, 2009
Priority date       Dec 15, 2009
Publication date    Jan 19, 2016
Grant date          Jan 19, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for performing storage system de-duplication. The method includes accessing a plurality of initial partitions of files of a storage system and performing a de-duplication on each of the initial partitions. For each duplicate found, an indicator comprising the metadata that is similar across said each duplicate is determined. For each indicator, indicators are determined that infer a likelihood that data objects with said indicators contain duplicate data is high. Optimized partitions are generated in accordance with the chosen indicators. A de-duplication process is subsequently performed on each of the optimized partitions.
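
The first two steps of the abstract (de-duplicate an initial partition, then record the metadata that the duplicates have in common) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the file records, the metadata keys `ext` and `owner`, and the use of SHA-256 content hashes to detect duplicates are hypothetical and are not taken from the patent text shown on this page.

```python
import hashlib
from collections import defaultdict

def find_duplicates(files):
    """Group the files of one partition by content hash.

    Returns only the groups with more than one member, i.e. the duplicates.
    Hash-based detection is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for f in files:
        digest = hashlib.sha256(f["data"]).hexdigest()
        groups[digest].append(f)
    return [g for g in groups.values() if len(g) > 1]

def shared_metadata(group, keys=("ext", "owner")):
    """Return the (key, value) pairs on which every file in the group agrees.

    Each pair plays the role of an "indicator" in the abstract's wording.
    """
    first = group[0]
    return {(k, first[k]) for k in keys if all(f[k] == first[k] for f in group)}

# Hypothetical initial partition with one pair of identical files.
partition = [
    {"name": "a.iso", "ext": "iso", "owner": "alice", "data": b"image-1"},
    {"name": "b.iso", "ext": "iso", "owner": "bob", "data": b"image-1"},
    {"name": "c.txt", "ext": "txt", "owner": "alice", "data": b"notes"},
]

duplicate_groups = find_duplicates(partition)
indicators = [shared_metadata(g) for g in duplicate_groups]
```

In this toy partition the two `.iso` files share content and a file extension but not an owner, so the only indicator recorded for that duplicate group is the extension.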

First claim

Opening claim text (preview).

What is claimed is:

1. A method for performing storage system de-duplication, comprising: accessing a plurality of initial partitions of files of a storage system, wherein each of the plurality of initial partitions has object properties; performing a de-duplication on each of the initial partitions; for each duplicate partition found from the plurality of initial partitions, determining an indicator comprising metadata that is similar across said each duplicate partition, wherein the metadata is determined based on the object properties of the initial partitions; for each of the determined indicators, determining a ratio of the number of times the respective metadata is common across duplicate partitions of the initial partitions to the number of times the respective metadata is common across non-duplicate partitions of the initial partitions, wherein the determined indicators having high ratios weighted across all of the initial partitions are designated as chosen indicators; generating optimized partitions in accordance with the chosen indicators, wherein the chosen indicators are combined to generate the optimized partitions, wherein each optimized partition includes a separate de-duplication index structure, wherein each separate de-duplication index structure is distributed across data servers, and wherein each data server is responsible for performing de-duplication between a subset of the files according to the separate de-duplication index structure; and performing a de-duplication on each of the optimized partitions.

2. The method of claim 1, wherein the plurality of initial partitions are randomly selected partitions of files of the storage system.

3. The method of claim 1, wherein the chosen indicators enable the generation of optimized partitions in a plurality of different directions across the files of the storage system.

4. The method of claim 1, wherein Boolean operations are used to combine chosen indicators to generate optimized partitions.

5. The method of claim 1, wherein optimized partitions are generated iteratively to detect duplicates within the files of the storage system.

6. The method of claim 1, wherein the de-duplication performed on each of the optimized partitions is performed out of band with respect to the storage system.

7. The method of claim 1, wherein the de-duplication performed on each of the optimized partitions is performed in-band with respect to the storage system.

8. A non-transitory computer readable storage medium having stored thereon computer executable instructions that, if executed by a computer system, cause the computer system to perform a method comprising: out of a plurality of partitions of files on a file storage system, accessing a plurality of initial partitions, wherein each of the plurality of initial partitions has object properties; performing a de-duplication on each of the initial partitions; for each duplicate partition found from the plurality of initial partitions, determining an indicator comprising metadata that is similar across said each duplicate partition, wherein the metadata is determined based on the object properties of the initial partitions; for each of the determined indicators, determining a ratio of the number of times the respective metadata is common across duplicate partitions of the initial partitions to the number of times the respective metadata is common across non-duplicate partitions of the initial partitions, wherein the determined indicators having high ratios weighted across all of the initial partitions are designated as chosen indicators; generating optimized partitions in accordance with the chosen indicators, wherein the chosen indicators are combined to generate the optimized partitions, wherein each optimized partition includes a separate de-duplication index structure, wherein each separate de-duplication index structure is distributed across data servers, and wherein each data server is responsible for performing de-duplication between a subset of the files according to the separate de-duplication index structure; and performing a de-duplication on each of the optimized partitions.

9. The non-transitory computer readable storage medium of claim 8, wherein the plurality of initial partitions are user defined partitions of files of the storage system.

10. The non-transitory computer readable storage medium of claim 8, wherein the chosen indicators enable the generation of optimized partitions in a plurality of different directions across the files of the storage system.

11. The non-transitory computer readable storage medium of claim 8, wherein Boolean operations are used to combine chosen indicators to generate optimized partitions.

12. The non-transitory computer readable storage medium of claim 8, wherein optimized partitions are generated iteratively to detect duplicates within the files of the storage system.

13. The non-transitory computer readable storage medium of claim 8, wherein the de-duplication performed on each of the optimized partitions is performed out of band with respect to the storage system.

14. The non-transitory computer readable storage medium of claim 8, wherein the de-duplication performed on each of the optimized partitions is performed in-band with respect to the storage system.

15. A server computer system, comprising: a computer system having a computer processor coupled to non-transitory computer readable storage media and executing computer readable code which causes the computer system to: access a plurality of initial partitions of files of a storage system, wherein each of the plurality of initial partitions has object properties; perform a de-duplication on each of the initial partitions; for each duplicate partition found from the plurality of initial partitions, determine an indicator comprising metadata that is similar across said each duplicate partition, wherein the metadata is determined based on the object properties of the initial partitions; for each of the determined indicators, determining a ratio of the number of times the respective metadata is common across duplicate partitions of the initial partitions to the number of times the respective metadata is common across non-duplicate partitions of the initial partitions, wherein the determined indicators having high ratios weighted across all of the initial partitions are designated as chosen indicators; generate optimized partitions in accordance with the chosen indicators, wherein the chosen indicators are combined to generate the optimized partitions, wherein each optimized partition includes a separate de-duplication index structure, wherein each separate de-duplication index structure is distributed across data servers, and wherein each data server is responsible for performing de-duplication between a subset of the files according to the separate de-duplication index structure; and perform a de-duplication on each of the optimized partitions.

16. The server computer system of claim 1, wherein the plurality of initial partitions are randomly selected partitions of files of the storage system.

17. The server computer system of claim 1, wherein the chosen indicators enable the generation of optimized partitions in a plurality of different directions across the files of the storage system.

18. The server computer system of claim 1, wherein Boolean operations are used to combine chosen indicators to generate optimized partitions.

19. The server computer system of claim 1, wherein optimized partitions are generated
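
The indicator-selection step in claim 1 (a duplicate to non-duplicate ratio per indicator) and the Boolean combination in claim 4 can be sketched as follows. This is an illustrative reading, not the patented implementation: the function names, the threshold of 2.0, the summing of counts over all initial partitions as the "weighting", and the sample numbers are all assumptions introduced here.

```python
from collections import Counter

def indicator_ratios(dup_counts, nondup_counts):
    """Ratio of how often an indicator is common among duplicates
    to how often it is common among non-duplicates."""
    ratios = {}
    for indicator, dup in dup_counts.items():
        nondup = nondup_counts.get(indicator, 0)
        # An indicator never seen among non-duplicates gets an infinite ratio.
        ratios[indicator] = dup / nondup if nondup else float("inf")
    return ratios

def choose_indicators(ratios, threshold=2.0):
    """Keep indicators whose ratio meets the (hypothetical) threshold."""
    return {ind for ind, r in ratios.items() if r >= threshold}

def optimized_partition(files, all_of=(), any_of=()):
    """Select files by a Boolean combination of indicators:
    every indicator in all_of (AND) and, if given, at least one in any_of (OR)."""
    def has(f, indicator):
        key, value = indicator
        return f.get(key) == value
    return [
        f for f in files
        if all(has(f, i) for i in all_of)
        and (not any_of or any(has(f, i) for i in any_of))
    ]

# Made-up counts, assumed to be summed over all initial partitions.
dup_counts = Counter({("ext", "iso"): 8, ("owner", "alice"): 3})
nondup_counts = Counter({("ext", "iso"): 2, ("owner", "alice"): 6})

ratios = indicator_ratios(dup_counts, nondup_counts)
chosen = choose_indicators(ratios)

files = [
    {"name": "a.iso", "ext": "iso", "owner": "alice"},
    {"name": "b.iso", "ext": "iso", "owner": "bob"},
    {"name": "c.txt", "ext": "txt", "owner": "alice"},
]
part = optimized_partition(files, all_of=chosen)
```

With these sample counts the extension indicator scores 8 / 2 = 4.0 and is chosen, while the owner indicator scores 3 / 6 = 0.5 and is dropped, so the optimized partition groups the two `.iso` files for a second de-duplication pass.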

Assignees

Inventors

Classifications

  • based on file chunks · CPC title

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9239843B2 cover?
A method for performing storage system de-duplication. The method includes accessing a plurality of initial partitions of files of a storage system and performing a de-duplication on each of the initial partitions. For each duplicate found, an indicator comprising the metadata that is similar across said each duplicate is determined. For each indicator, indicators are determined that infer a li…
Who is the assignee on this patent?
Agrawal Mukund, Sridharan Srineet, Symantec Corp
What technology area does this patent fall under?
Primary CPC classification G06F16/1752. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 19, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).