Cognitive deduplication-aware data placement in large scale storage systems

US10558646B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10558646-B2
Application number	US-201715582703-A
Country	US
Kind code	B2
Filing date	Apr 30, 2017
Priority date	Apr 30, 2017
Publication date	Feb 11, 2020
Grant date	Feb 11, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title: What the patent document calls the invention.
Abstract: A short plain-language summary of the technical disclosure.
Assignees and inventors: Who owns or filed the patent and who is credited as inventor.
Key dates: Filing, priority, publication, and grant dates set the timeline.
First independent claim: The legal scope of protection; read this for what is actually claimed.
CPC / IPC classifications: Technology tags used to group this patent with similar filings.
Citations and related patents: Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for a data placement that attempts to predict the most suitable placement, in terms of data reduction, of a newly created storage volume based on the volume's known attributes and the current placement of volumes to deduplication domains is disclosed. The system uses machine learning to perform improved deduplication-aware placement. The system attempts to predict the deduplication domain where a newly created volume would eventually have the best content sharing. The system does this by using the known attributes of the volume at the time of creation, such as owner, volume name, initial size, creation time, and the history of data already in the system and its placement.
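
The abstract lists the inputs (owner, volume name, initial size, creation time, and existing placements), and the first claim adds a per-domain feature for the lexical similarity between volume names. The following is a minimal sketch of how such a feature vector might be assembled. The `Volume` fields, the use of `difflib.SequenceMatcher` as the similarity metric, the owner-overlap feature, and the sample data are all assumptions made for illustration; the patent text shown here does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Volume:
    """Attributes known when a volume is created (field names are illustrative)."""
    name: str
    owner: str
    initial_size_gb: float
    creation_hour: int  # hour of day, 0-23


def name_similarity(name: str, domain_names: list[str]) -> float:
    """Best lexical similarity between `name` and any volume name in a domain.

    SequenceMatcher.ratio() is a stand-in metric chosen for this sketch.
    """
    if not domain_names:
        return 0.0
    return max(SequenceMatcher(None, name.lower(), other.lower()).ratio()
               for other in domain_names)


def feature_vector(new: Volume, domains: dict[str, list[Volume]]) -> list[float]:
    """Build one flat feature vector for a newly created volume."""
    features = [new.initial_size_gb, float(new.creation_hour)]
    for domain_id in sorted(domains):  # fixed order keeps feature positions stable
        members = domains[domain_id]
        # One name-similarity feature per deduplication domain.
        features.append(name_similarity(new.name, [v.name for v in members]))
        # Fraction of the domain's volumes that share the new volume's owner.
        same_owner = sum(v.owner == new.owner for v in members)
        features.append(same_owner / len(members) if members else 0.0)
    return features


# Hypothetical current placement of volumes to two deduplication domains.
domains = {
    "domain-a": [Volume("oracle-db-01", "alice", 500.0, 9),
                 Volume("oracle-db-02", "alice", 500.0, 10)],
    "domain-b": [Volume("vm-image-linux", "bob", 40.0, 14)],
}
new_volume = Volume("oracle-db-03", "alice", 500.0, 11)
vector = feature_vector(new_volume, domains)
print(vector)
```

The resulting vector has two volume-level entries followed by two entries per domain. A trained model, such as the support vector machine or decision tree named in the dependent claims, would take this vector as input.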

First claim

Opening claim text (preview).

What is claimed is:
1. A method for deduplication of data stored in a computer system, the method implemented in the computer system comprising a processor, memory accessible by the processor, computer program instructions stored in the memory and executable by the processor, and data stored in the memory and accessible by the processor, the method comprising: receiving, at the computer system, a volume comprising at least one volume attribute; generating, at the computer system, a feature vector associated with the volume based on the at least one volume attribute, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; and generating, at the computer system, a recommended placement for the volume into existing deduplication domains using a trained model receiving the feature vector; wherein the trained model is trained, at the computer system, using a supervised learning algorithm that uses a set of input feature vectors and target variables, wherein the set of input feature vectors are generated, at the computer system, based on a set of existing volume attributes for existing volumes stored in the existing deduplication domains, and wherein the target variables are generated, at the computer system, based on capacity statistics comprise an estimate of physical size in each of the deduplication domains.
2. The method of claim 1, wherein the set of existing volume attributes comprises a volume owner.
3. The method of claim 1, wherein the set of existing volume attributes comprises a volume name.
4. The method of claim 3, wherein the set of input feature vectors is derived from the volume name by using natural language processing.
5. The method of claim 1, wherein the set of existing volume attributes comprises an initial size.
6. The method of claim 1, wherein the set of existing volume attributes comprises a creation time.
7. The method of claim 1, wherein the capacity statistics are calculated based on a physical capacity each of the existing volumes would have consumed in each of the deduplication domains.
8. The method of claim 1, wherein generating target variables comprises calculating a score according to the physical capacity in each deduplication domain in the deduplication domains.
9. The method of claim 1, wherein generating target variables comprises calculating a label representing the deduplication domain in the deduplication domains in which each volume in the existing volumes requires the least physical capacity.
10. The method of claim 1, wherein generating a recommended placement comprises generating a label of a best domain for placement.
11. The method of claim 1, wherein generating a recommended placement comprises generating a score for each deduplication domain in the deduplication domains.
12. The method of claim 1, wherein the supervised learning algorithm comprises a support vector machine.
13. The method of claim 1, wherein the supervised learning algorithm comprises a decision tree.
14. The method of claim 1, further comprising: calculating probabilities for each deduplication domain in the deduplication domains according to the scores and using the probabilities for placing the newly created volume.
15. A computer system for deduplication of data stored in the computer system comprising a processor, memory accessible by the processor, computer program instructions stored in the memory and executable by the processor, and data stored in the memory and accessible by the processor to implement: at least two deduplication domains comprising a set of volumes, wherein the set of volumes comprise a set of volume attributes; and a storage management component that manages the at least two deduplication domains and is configured to: receive a new volume attribute associated with a new volume; generate a new feature vector based on the new volume attributes, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; and apply a model to the new feature vector to generate a recommended placement for the new volume; wherein the model uses a supervised learning algorithm that is trained using target variables and a set of input feature vectors by receiving the set of volume attributes, calculating the set of input feature vectors for each volume attribute in the set of volumes attributes, calculating capacity statistics for each volume in the set of volumes by considering all possible placements of each volume in the set of volumes to every deduplication domain in the at least two deduplication domains, wherein capacity statistics comprise an estimate of physical size in each domain in the at least two deduplication domains, and generating the target variables for each volume in the set of volumes based on the capacity statistics.
16. The system of claim 15, wherein the set of volume attributes comprises a volume name.
17. The system of claim 16, wherein the set of input feature vectors is derived from the volume name by using natural language processing.
18. The system of claim 15, wherein the supervised learning algorithm comprises a support vector machine.
19. The system of claim 15, wherein the supervised learning algorithm comprises a decision tree.
20. A computer program product comprising a non-transitory computer readable storage having program instructions embodied therewith, the program instructions executable by a computer, to cause the computer to perform a method for deduplication of data stored in the computer comprising: receiving, at the computer, a new volume; generating, at the computer, a new feature vector associated with the new volume based on at least one volume attribute of the new volume, wherein the feature vector is generated by adding a feature to the feature vector for each of a plurality of deduplication domains, representing a lexical similarity between a name of the volume and volume names in each of the plurality of deduplication domains, determining capacity savings for removing the volume from each of the plurality of deduplication domains, and estimating a deduplication domain having a greatest capacity savings from among the plurality of deduplication domains; applying, at the computer, a model comprising target variables to the new feature vector to generate a recommended placement for the volume; and placing, at the computer, the new volume into existing deduplication domains based on the recommended placement; wherein the model applies a supervised learning algorithm that: receives, at the computer, a set of existing volume attributes for existing volumes stored in the existing deduplication domains, calculates, at the computer, a set of input feature vectors for each ex
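
Claims 7 to 9 and 14 describe how the training targets and the final placement probabilities are derived from capacity statistics. The sketch below illustrates one possible reading. It assumes each volume and each domain can be modelled as a set of chunk fingerprints, that a domain's set excludes the volume being evaluated, and that scores are converted to probabilities with a softmax. None of these choices is stated in the claim text; the function names and sample data are hypothetical.

```python
import math


def physical_size(volume_chunks, domain_chunks):
    """Chunks the volume would add to a domain (those not already stored there)."""
    return len(volume_chunks - domain_chunks)


def capacity_statistics(volume_chunks, domains):
    """Estimated physical size of the volume in every candidate domain."""
    return {d: physical_size(volume_chunks, chunks) for d, chunks in domains.items()}


def best_domain_label(stats):
    """Label target: the domain where the volume needs the least physical capacity."""
    return min(sorted(stats), key=lambda d: stats[d])  # ties go to the first id


def domain_scores(stats, logical_size):
    """Score target: fraction of the volume's chunks saved in each domain."""
    return {d: 1.0 - size / logical_size for d, size in stats.items()}


def placement_probabilities(scores):
    """Turn per-domain scores (higher is better) into probabilities via softmax."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    weights = {d: math.exp(s - peak) for d, s in scores.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


# Hypothetical chunk fingerprints for one existing volume and two domains.
volume = {"c1", "c2", "c3", "c4"}
domains = {
    "domain-a": {"c1", "c2", "c3", "x1"},
    "domain-b": {"c4", "y1", "y2"},
}

stats = capacity_statistics(volume, domains)
label = best_domain_label(stats)
scores = domain_scores(stats, logical_size=len(volume))
probs = placement_probabilities(scores)
print(stats, label, probs)
```

In a training loop, `label` or `scores` would serve as the target variable paired with each existing volume's feature vector. Sampling a placement from `probs`, rather than always taking the top domain, is one way to read the "using the probabilities" language in claim 14.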

Assignees

IBM

Inventors

Classifications

G06F16/2365 · Primary · Ensuring data consistency and integrity
G06N20/10 · using kernel methods, e.g. support vector machines [SVM]
G06N20/00 · Machine learning
G06N5/01 · Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Patent family

Related publications grouped by family.

View patent family 63916643

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10558646B2 cover?
A method for a data placement that attempts to predict the most suitable placement, in terms of data reduction, of a newly created storage volume based on the volume's known attributes and the current placement of volumes to deduplication domains is disclosed. The system uses machine learning to perform improved deduplication-aware placement. The system attempts to predict the deduplication doma…

Who is the assignee on this patent?
IBM

What technology area does this patent fall under?
Primary CPC classification G06F16/2365. Mapped technology areas include Physics.

When was this patent published?
Published Feb 11, 2020 (kind code B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?
We list 7 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Related patents

Ensuring reproducibility in an artificial intelligence infrastructure

Rules recommendation based on customer feedback

Edge-based adaptive machine learning for object recognition

Selective deduplication

System and method for improved placement of blocks in a deduplication-erasure code environment

De-duplication deployment planning

Data volume placement techniques
