De-duplication deployment planning

US9280551B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9280551-B2
Application number: US-201313908955-A
Country: US
Kind code: B2
Filing date: Jun 3, 2013
Priority date: Jun 3, 2013
Publication date: Mar 8, 2016
Grant date: Mar 8, 2016

Abstract

Official abstract text for this publication.

Assignment of files to a de-duplication domain. The address space of data files is divided into multiple containers. For each container, a file metadata scan is performed to obtain file system metadata, which is aggregated and summarized in a content feature summary. A content similarity prediction measurement is measured between containers from the generated content feature summaries, and files from each container are assigned to a de-duplication domain based upon the content similarity prediction measurement.
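
The flow in the abstract (summarize each container from metadata, compare summaries, group similar containers) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the byte-weighted (file type, size range) distribution, the cosine similarity, the greedy grouping, the size boundaries, and the 0.8 cutoff are all choices made for this example, and the names `type_size_distribution`, `cosine_similarity`, and `assign_domains` are hypothetical.

```python
from collections import Counter
from math import sqrt


def type_size_distribution(files, size_edges=(1 << 10, 1 << 20, 1 << 30)):
    """Summarize a container as bytes of content per (extension, size range).

    `files` is an iterable of (name, size_in_bytes) pairs taken from a
    metadata scan. The size boundaries are illustrative assumptions.
    """
    dist = Counter()
    for name, size in files:
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        bucket = sum(size >= edge for edge in size_edges)  # index of size range
        dist[(ext, bucket)] += size
    return dist


def cosine_similarity(p, q):
    """Cosine similarity of two sparse distributions (an assumed metric)."""
    dot = sum(p[k] * q[k] for k in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def assign_domains(containers, min_similarity=0.8):
    """Greedy grouping: a container joins the first domain whose first
    member is similar enough, otherwise it starts a new domain."""
    summaries = {cid: type_size_distribution(f) for cid, f in containers.items()}
    domains = []
    for cid in containers:
        for domain in domains:
            if cosine_similarity(summaries[cid], summaries[domain[0]]) >= min_similarity:
                domain.append(cid)
                break
        else:
            domains.append([cid])
    return domains


containers = {
    "home": [("a.doc", 2000), ("b.doc", 3000)],
    "home2": [("c.doc", 2500), ("d.doc", 1500)],
    "media": [("x.mp4", 5_000_000)],
}
print(assign_domains(containers))  # [['home', 'home2'], ['media']]
```

Only metadata is read here, never file contents, which is the point of planning a deployment from a scan. A real planner would likely also weigh domain capacity, which this sketch ignores.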

First claim

Opening claim text (preview).

We claim:

1. A computer program product for deploying de-duplication, the computer program product comprising a computer readable program storage device having program code embodied therewith, the program code executable by a processor to:
    divide files corresponding to an address space into multiple containers;
    perform a file metadata scan, including obtaining attributes for files in each container;
    aggregate the file attributes into characterizations for each attribute dimension and generate a content feature summary for each container based on a selection window and a signature list, wherein the content feature summary incorporates the characterizations and summarizes the signature list, wherein generating the content feature summary comprises program code to compute one or more discrete file summaries, and wherein computing a discrete file summary comprises program code to:
        initialize the signature list;
        select a file from a subset of files within one of the containers, and extract features from one or more attributes of the selected file;
        compute a signature from the one or more extracted features, wherein the signature comprises a numerical value; and
        add the signature to the signature list in response to the numerical value being less than a first threshold associated with the selection window;
    measure a content similarity prediction measurement between containers from the generated content feature summary; and
    assign files from each container to a de-duplication domain based on the computed content similarity prediction measurement.

2. The computer program product of claim 1, wherein the file attributes include file system metadata.

3. The computer program product of claim 2, further comprising program code to estimate a discrete file similarity through use of the one or more discrete file summaries, each discrete file summary including a list of records, each record corresponding to one file, and each signature being computed from a file name and a file size.

4. The computer program product of claim 2, wherein the content similarity prediction measurement comprises an owner group distribution similarity, and wherein the measurement of the owner group distribution similarity comprises program code to compute the owner group distribution similarity, including program code to compare owner group distributions for two or more containers, and compute a correlation between the groups based on the comparison.

5. The computer program product of claim 4, wherein the computation of the owner group distribution similarity further comprises program code to determine a set of owner groups, process metadata for files in a container, associate each file with one of the owner groups, and compute an owner group distribution.

6. The computer program product of claim 2, wherein the measurement of the content similarity prediction measurement comprises program code to compute type and size distributions within the containers, compare the distributions between the containers, and compute a similarity between containers based on the comparison.

7. The computer program product of claim 6, wherein the type and size distribution of a container measures a quantity of content within the container associated with each member of a listed set of type and size-range pairs.

8. The computer program product of claim 1, wherein the assignment of files comprises program code to place containers with high similarity into a common de-duplication domain.

9. A system comprising:
    a processing unit in communication with memory, and data storage having files corresponding to an address space divided into containers;
    one or more tools in communication with the processing unit, the tools to deploy de-duplication within the address space, including the tools to:
        perform a metadata scan to obtain attributes for files in each container;
        aggregate the file attributes into characterizations for each attribute dimension and generate a content feature summary for each container based on a selection window and a signature list, wherein the content feature summary incorporates the characterizations and summarizes the signature list, wherein generating the content feature summary comprises the one or more tools to compute one or more discrete file summaries, and wherein computing a discrete file summary comprises the tools to:
            initialize the signature list;
            select a file from a subset of files within one or more of the containers, and extracting features from one or more attributes of the selected file;
            compute a signature from the one or more extracted features, wherein the signature comprises a numerical value; and
            add the signature to the signature list in response to the numerical value being less than the first threshold associated with the selection window;
        measure a content similarity prediction measurement between the containers from the generated content feature summary; and
        assign files from each container to a de-duplication domain based on the computed content similarity prediction measurement.

10. The system of claim 9, further comprising the one or more tools to estimate a discrete file similarity through use of the one or more discrete file summaries, each discrete file summary including a list of records, each record corresponding to one file, and each signature being computed from a file name and a file size.

11. The system of claim 9, wherein the content similarity prediction measurement comprises an owner group distribution similarity, and wherein the measurement of the owner group distribution similarity comprises the one or more tools to compute the owner group distribution similarity, including the one or more tools to compare owner group distributions for two or more containers, and compute a correlation between the groups based on the comparison.

12. The system of claim 11, wherein the computation of the owner group distribution similarity further comprises the one or more tools to determine a set of owner groups, process metadata for files in a container, associate each file with one of the owner groups, and compute an owner group distribution.

13. The system of claim 9, wherein the measurement of the content similarity prediction measurement comprises the one or more tools to compute type and size distributions within the containers, compare the distributions between the containers, and compute a similarity between the containers based on the comparison, wherein the type and size distribution of a container measures a quantity of content within the container associated with each member of a listed set of type and size-range pairs.

14. The computer program product of claim 1, further comprising program code to generate a statistical distribution of each obtained attribute across the plurality of files in each container, and wherein the measurement of the content similarity prediction measurement includes program code to compare statistical distributions between containers.

15. The system of claim 9, further comprising the one or more tools to generate a statistical distribution of each obtained attribute across the plurality of files in each container, and wherein the measurement of the content similarity prediction measurement includes the one or more tools to compare statistical distributions between containers.

16. The computer program product of claim 1, wherein computing the signatures comprises program code to apply a hash function to the extracted features.

17. The system of claim 9, wherein computing the signatures comprises the one or more tools to apply a hash f
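
The discrete file summary steps recited in claims 1, 3, and 16 (initialize a signature list, hash features of each file to a number, keep the number only if it falls below a threshold tied to the selection window) might look something like the sketch below. Several details are assumptions made for illustration and are not taken from the claims: SHA-256 truncated to 64 bits as the hash function, the selection window expressed as a fraction of the hash range, and Jaccard overlap of the sampled signatures as the similarity estimate. The names `signature`, `discrete_file_summary`, and `estimated_similarity` are hypothetical.

```python
import hashlib


def signature(name: str, size: int) -> int:
    """Hash a file name and size to a 64-bit integer.

    SHA-256 truncated to 8 bytes is an assumption; the claims only say
    that a hash function is applied to the extracted features.
    """
    digest = hashlib.sha256(f"{name}\x00{size}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def discrete_file_summary(files, window_fraction=0.25):
    """Build the signature list for one container.

    `files` is an iterable of (name, size) pairs. `window_fraction` is an
    assumed way to express the selection window: only signatures in the
    lowest fraction of the 64-bit hash range are kept.
    """
    threshold = int(window_fraction * 2**64)
    signatures = []  # initialize the signature list
    for name, size in files:
        sig = signature(name, size)  # numerical value from extracted features
        if sig < threshold:  # keep only values inside the selection window
            signatures.append(sig)
    return signatures


def estimated_similarity(summary_a, summary_b):
    """Estimate file overlap as the Jaccard index of the sampled signatures
    (an assumed estimator, not one named in the claims)."""
    a, b = set(summary_a), set(summary_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


files_a = [("a.txt", 10), ("b.txt", 20), ("c.txt", 30)]
files_b = [("a.txt", 10), ("b.txt", 20), ("d.txt", 40)]
# A window of 1.0 keeps every signature, so the estimate is exact here.
print(estimated_similarity(discrete_file_summary(files_a, 1.0),
                           discrete_file_summary(files_b, 1.0)))  # 0.5
```

Because both containers apply the same hash and the same threshold, a file present in both is either sampled in both summaries or in neither. That is why comparing the small sampled lists can stand in for comparing the full file sets.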

Assignees

IBM

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • Physics · mapped topic

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 8, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).