Cognitive deduplication-aware data placement in large scale storage systems

US10558646B2 · US · B2

Patent metadata
Field: Value
Publication number: US-10558646-B2
Application number: US-201715582703-A
Country: US
Kind code: B2
Filing date: Apr 30, 2017
Priority date: Apr 30, 2017
Publication date: Feb 11, 2020
Grant date: Feb 11, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for a data placement that attempts to predict the most suitable placement, in terms of data reduction, of a newly created storage volume based on the volume's known attributes and the current placement of volumes to deduplication domains is disclosed. The system uses machine learning to perform improved deduplication-aware placement. The system attempts to predict the deduplication domain where a newly created volume would eventually have the best content sharing. The system does this by using the known attributes of the volume at the time of creation, such as owner, volume name, initial size, creation time, and the history of data already in the system and its placement.
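
As a rough illustration of how attributes known at creation time might be turned into numeric inputs for a learning model, here is a minimal sketch. The `Volume` dataclass, the `attribute_features` helper, and the encoding choices (log-scaled size, hour of creation, per-domain share of volumes with the same owner) are all hypothetical and chosen for illustration; the abstract does not specify any particular encoding.

```python
import math
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Volume:
    # Hypothetical record of the attributes the abstract lists.
    owner: str
    name: str
    initial_size_bytes: int
    created: datetime


def attribute_features(vol: Volume, domains: Dict[str, List[Volume]]) -> List[float]:
    """Encode creation-time attributes and current placement as numbers.

    Assumed encoding, not taken from the patent.
    """
    features = [
        math.log2(max(vol.initial_size_bytes, 1)),  # initial size on a log scale
        float(vol.created.hour),                    # hour of day of creation
    ]
    # One feature per domain (sorted for a stable order): the fraction of
    # volumes already placed there that share the new volume's owner.
    for domain_id in sorted(domains):
        placed = domains[domain_id]
        same_owner = sum(1 for v in placed if v.owner == vol.owner)
        features.append(same_owner / len(placed) if placed else 0.0)
    return features
```

The resulting list could then be passed to whatever supervised model is trained on the existing volumes.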

First claim

Opening claim text (preview).

What is claimed is:

1. A method for deduplication of data stored in a computer system, the method implemented in the computer system comprising a processor, memory accessible by the processor, computer program instructions stored in the memory and executable by the processor, and data stored in the memory and accessible by the processor, the method comprising: receiving, at the computer system, a volume comprising at least one volume attribute; generating, at the computer system, a feature vector associated with the volume based on the at least one volume attribute, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; and generating, at the computer system, a recommended placement for the volume into existing deduplication domains using a trained model receiving the feature vector; wherein the trained model is trained, at the computer system, using a supervised learning algorithm that uses a set of input feature vectors and target variables, wherein the set of input feature vectors are generated, at the computer system, based on a set of existing volume attributes for existing volumes stored in the existing deduplication domains, and wherein the target variables are generated, at the computer system, based on capacity statistics comprise an estimate of physical size in each of the deduplication domains.

2. The method of claim 1, wherein the set of existing volume attributes comprises a volume owner.

3. The method of claim 1, wherein the set of existing volume attributes comprises a volume name.

4. The method of claim 3, wherein the set of input feature vectors is derived from the volume name by using natural language processing.

5. The method of claim 1, wherein the set of existing volume attributes comprises an initial size.

6. The method of claim 1, wherein the set of existing volume attributes comprises a creation time.

7. The method of claim 1, wherein the capacity statistics are calculated based on a physical capacity each of the existing volumes would have consumed in each of the deduplication domains.

8. The method of claim 1, wherein generating target variables comprises calculating a score according to the physical capacity in each deduplication domain in the deduplication domains.

9. The method of claim 1, wherein generating target variables comprises calculating a label representing the deduplication domain in the deduplication domains in which each volume in the existing volumes requires the least physical capacity.

10. The method of claim 1, wherein generating a recommended placement comprises generating a label of a best domain for placement.

11. The method of claim 1, wherein generating a recommended placement comprises generating a score for each deduplication domain in the deduplication domains.

12. The method of claim 1, wherein the supervised learning algorithm comprises a support vector machine.

13. The method of claim 1, wherein the supervised learning algorithm comprises a decision tree.

14. The method of claim 1, further comprising: calculating probabilities for each deduplication domain in the deduplication domains according to the scores and using the probabilities for placing the newly created volume.

15. A computer system for deduplication of data stored in the computer system comprising a processor, memory accessible by the processor, computer program instructions stored in the memory and executable by the processor, and data stored in the memory and accessible by the processor to implement: at least two deduplication domains comprising a set of volumes, wherein the set of volumes comprise a set of volume attributes; and a storage management component that manages the at least two deduplication domains and is configured to: receive a new volume attribute associated with a new volume; generate a new feature vector based on the new volume attributes, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; and apply a model to the new feature vector to generate a recommended placement for the new volume; wherein the model uses a supervised learning algorithm that is trained using target variables and a set of input feature vectors by receiving the set of volume attributes, calculating the set of input feature vectors for each volume attribute in the set of volumes attributes, calculating capacity statistics for each volume in the set of volumes by considering all possible placements of each volume in the set of volumes to every deduplication domain in the at least two deduplication domains, wherein capacity statistics comprise an estimate of physical size in each domain in the at least two deduplication domains, and generating the target variables for each volume in the set of volumes based on the capacity statistics.

16. The system of claim 15, wherein the set of volume attributes comprises a volume name.

17. The system of claim 16, wherein the set of input feature vectors is derived from the volume name by using natural language processing.

18. The system of claim 15, wherein the supervised learning algorithm comprises a support vector machine.

19. The system of claim 15, wherein the supervised learning algorithm comprises a decision tree.

20. A computer program product comprising a non-transitory computer readable storage having program instructions embodied therewith, the program instructions executable by a computer, to cause the computer to perform a method for deduplication of data stored in the computer comprising: receiving, at the computer, a new volume; generating, at the computer, a new feature vector associated with the new volume based on at least one volume attribute of the new volume, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; applying, at the computer, a model comprising target variables to the new feature vector to generate a recommended placement for the volume; and placing, at the computer, the new volume into existing deduplication domains based on the recommended placement; wherein the model applies a supervised learning algorithm that: receives, at the computer, a set of existing volume attributes for existing volumes stored in the existing deduplication domains, calculates, at the computer, a set of input feature vectors for each ex
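
Three steps named in the claims can be sketched in a few lines: a per-domain lexical-similarity feature for the volume name (claim 1), a training label for the domain needing the least physical capacity (claim 9), and probabilities derived from per-domain scores (claim 14). This is only an illustrative sketch under stated assumptions. The function names are hypothetical, `difflib.SequenceMatcher` stands in for whatever similarity measure the patent intends, and the softmax is an assumed way to turn scores into probabilities; the claims do not name either technique.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List


def name_similarity_features(new_name: str,
                             domain_names: Dict[str, List[str]]) -> List[float]:
    """One feature per domain: the best similarity ratio between the new
    volume's name and any volume name already in that domain.

    SequenceMatcher is an assumed stand-in for "lexical similarity".
    """
    features = []
    for domain_id in sorted(domain_names):  # sorted for a stable feature order
        ratios = [SequenceMatcher(None, new_name, existing).ratio()
                  for existing in domain_names[domain_id]]
        features.append(max(ratios, default=0.0))  # empty domain -> 0.0
    return features


def target_label(physical_size_by_domain: Dict[str, float]) -> str:
    """Training label: the domain in which the volume is estimated to
    require the least physical capacity."""
    return min(physical_size_by_domain, key=physical_size_by_domain.get)


def placement_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-domain scores (higher is better) into probabilities.

    Softmax is an assumption; the claims only say probabilities are
    calculated "according to the scores".
    """
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp(s - peak) for d, s in scores.items()}
    total = sum(exps.values())
    return {d: e / total for d, e in exps.items()}
```

For example, `name_similarity_features("db-prod-03", {"d0": ["db-prod-01"], "d1": ["web-cache"]})` yields a higher value for `d0` than for `d1`, since the new name differs from `db-prod-01` by a single character.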

Assignees

Inventors

Classifications

  • Ensuring data consistency and integrity

  • using kernel methods, e.g. support vector machines [SVM]

  • Machine learning

  • Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10558646B2 cover?
A method for a data placement that attempts to predict the most suitable placement, in terms of data reduction, of a newly created storage volume based on the volume's known attributes and the current placement of volumes to deduplication domains is disclosed. The system uses machine learning to perform improved deduplication-aware placement. The system attempts to predict the deduplication doma…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/2365. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 11, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).