Clustering storage systems for sharing of data patterns used for deduplication

US11775483B2 · US · B2

Patent metadata
Publication number: US-11775483-B2
Application number: US-202017136484-A
Country: US
Kind code: B2
Filing date: Dec 29, 2020
Priority date: Dec 9, 2020
Publication date: Oct 3, 2023
Grant date: Oct 3, 2023

Abstract

Official abstract text for this publication.

An apparatus comprises at least one processing device configured to collect, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems and to cluster the plurality of storage systems into one or more data pattern sharing clusters based at least in part on the collected data patterns, a given one of the one or more data pattern sharing clusters comprising two or more of the plurality of storage systems. The at least one processing device is also configured to identify, for the given data pattern sharing cluster, a subset of the collected data patterns and to provide, to the two or more storage systems of the given data pattern sharing cluster, the identified subset of the data patterns, wherein the identified subset of the collected data patterns are utilized by the two or more storage systems in performing data deduplication.
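
As an illustration of the collection step described above, the following Python sketch gathers each system's most frequent patterns and arranges them into a table of per-system frequencies. The patent text shown here does not disclose code, so this is only a sketch under stated assumptions: the function names (`top_patterns`, `frequency_matrix`), the system names, and the two-character strings standing in for block contents are all hypothetical.

```python
from collections import Counter

def top_patterns(chunks, n):
    """Return the n most frequent patterns seen on one system as {pattern: count}."""
    return dict(Counter(chunks).most_common(n))

def frequency_matrix(per_system):
    """Arrange per-system reports into rows of pattern counts.

    per_system maps a system name to its {pattern: count} report.
    Returns (systems, patterns, rows), where rows[i][j] is the count of
    patterns[j] on systems[i]; a pattern a system did not report counts as 0.
    """
    systems = sorted(per_system)
    patterns = sorted({p for report in per_system.values() for p in report})
    rows = [[per_system[s].get(p, 0) for p in patterns] for s in systems]
    return systems, patterns, rows

# Hypothetical data: each string stands in for the content of one stored block.
reports = {
    "array-a": top_patterns(["00", "00", "00", "ff", "ab"], n=2),
    "array-b": top_patterns(["00", "00", "ff", "ff", "cd"], n=2),
    "array-c": top_patterns(["7e", "7e", "7e", "55", "55", "00"], n=2),
}
systems, patterns, rows = frequency_matrix(reports)
print(patterns)  # ['00', '55', '7e', 'ff']
print(rows)      # [[3, 0, 0, 1], [2, 0, 0, 2], [0, 2, 3, 0]]
```

Each row of `rows` is one storage system's frequency vector, which is the kind of input a clustering step could then group by similarity.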

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured to perform steps of: collecting, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems; clustering the plurality of storage systems into two or more data pattern sharing clusters based at least in part on the collected data patterns, a first one of the two or more data pattern sharing clusters comprising a first subset of two or more of the plurality of storage systems, a second one of the two or more data pattern sharing clusters comprising a second subset of two or more of the plurality of storage systems, the second subset being different than the first subset; identifying, for the first and second data pattern sharing clusters, respective first and second subsets of the collected data patterns, the second subset of the collected data patterns being different than the first subset of the collected data patterns; providing the first subset of the collected data patterns to the two or more storage systems of the first data pattern sharing cluster, wherein the first subset of the collected data patterns are utilized by the two or more storage systems in the first data pattern sharing cluster for performing data deduplication; and providing the second subset of the collected data patterns to the two or more storage systems of the second data pattern sharing cluster, wherein the second subset of the collected data patterns are utilized by the two or more storage systems in the second data pattern sharing cluster for performing data deduplication; wherein identifying the first subset of the collected data patterns comprises selecting, for inclusion in the first subset of the collected data patterns, at least one data pattern collected from data deduplication software running on at least one of the two or more storage systems of the first data pattern sharing cluster which is not utilized by data deduplication software running on at least one of the two or more storage systems of the second data pattern sharing cluster.

2. The apparatus of claim 1 wherein the two or more storage systems in the first data pattern sharing cluster implement inline pattern detection for performing data deduplication, the inline pattern detection utilizing the first subset of the collected data patterns.

3. The apparatus of claim 2 wherein the inline pattern detection of a given one of the two or more storage systems in the first data pattern sharing cluster utilizes a set of predefined data patterns, the first subset of the collected data patterns comprising at least one data pattern not in the set of predefined data patterns.

4. The apparatus of claim 1 wherein collecting the data patterns comprises collecting, from each of the plurality of storage systems, a designated number of most frequently occurring data patterns for data stored in that storage system.

5. The apparatus of claim 1 wherein clustering the plurality of storage systems into the two or more data pattern sharing clusters comprises utilizing a mean-shift clustering algorithm.

6. The apparatus of claim 5 wherein the mean-shift clustering algorithm utilizes multidimensional scaling to achieve dimensionality reduction for the collected data patterns.

7. The apparatus of claim 6 wherein the multidimensional scaling takes as input a first data structure with entries characterizing a frequency of observation of each of the collected data patterns on each of the plurality of storage systems and provides as output a second data structure that projects the frequency of observation of each of the collected data patterns from a first dimension to a second dimension lower than the first dimension.

8. The apparatus of claim 6 wherein the mean-shift clustering algorithm produces a data structure that tags ones of the plurality of storage systems with labels corresponding to ones of the two or more data pattern sharing clusters to which the plurality of storage systems belong.

9. The apparatus of claim 1 wherein collecting the data patterns comprises generating a first data structure with entries denoting a frequency at which each of the collected data patterns is observed on each of the plurality of storage systems over a given time period.

10. The apparatus of claim 9 wherein clustering the plurality of storage systems takes as input the first data structure and produces a second data structure that tags the entries of the first data structure for each of the plurality of storage system with labels corresponding to ones of the two or more data pattern sharing clusters to which the plurality of storage systems belong.

11. The apparatus of claim 10 wherein identifying the first subset of the collected data patterns for the first data pattern sharing cluster comprises sorting the collected data patterns based at least in part on mean frequency of occurrence across the two or more storage systems in the first data pattern sharing cluster and selecting a designated number of the collected data patterns having a highest mean frequency of occurrence across the two or more storage systems in the first data pattern sharing cluster as the first subset of the collected data patterns for the first data pattern sharing cluster.

12. The apparatus of claim 1 wherein identifying the first subset of the collected data patterns for the first data pattern sharing cluster is based at least in part on frequencies of occurrence of the collected data patterns in each of the two or more storage systems of the first data pattern sharing cluster.

13. The apparatus of claim 1 wherein the at least one processing device is part of a monitoring and analytics platform external to the plurality of storage systems.

14. The apparatus of claim 13 wherein the monitoring and analytics platform comprises a cloud-based monitoring and analytics platform.

15. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device to perform steps of: collecting, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems; clustering the plurality of storage systems into two or more data pattern sharing clusters based at least in part on the collected data patterns, a first one of the two or more data pattern sharing clusters comprising a first subset of two or more of the plurality of storage systems, a second one of the two or more data pattern sharing clusters comprising a second subset of two or more of the plurality of storage systems, the second subset being different than the first subset; identifying, for the first and second data pattern sharing clusters, respective first and second subsets of the collected data patterns, the second subset of the collected data patterns being different than the first subset of the collected data patterns; providing the first subset of the collected data patterns to the two or more storage systems of the first data pattern sharing cluster, wherein the first subset of the collected data patterns are utilized by the two or more storage systems in the first data pattern sharing cluster for performing data deduplication; and providing the second subset of the collected data patterns to the two or more storage systems of the second data pattern sharing cluster, wherein the second subset of the collected data patterns are util
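
The clustering and selection steps recited in claims 5 and 8 through 11 can be pictured with the Python sketch below. It is a minimal sketch under stated assumptions, not the patented implementation: it uses a flat-kernel mean shift written from scratch, it skips the multidimensional scaling of claims 6 and 7 and clusters the raw frequency vectors directly, and the bandwidth, the frequency values, and the names `mean_shift` and `top_patterns_for_cluster` are hypothetical.

```python
import math

def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift: move each mode to the mean of the input points
    within `bandwidth` of it until nothing moves, then give points whose modes
    coincide the same cluster label."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        moved = False
        for i, m in enumerate(modes):
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            new = [sum(col) / len(near) for col in zip(*near)]
            if math.dist(new, m) > 1e-9:
                moved = True
            modes[i] = new
        if not moved:
            break
    # Modes closer than half a bandwidth are treated as the same cluster.
    centers, labels = [], []
    for m in modes:
        for k, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

def top_patterns_for_cluster(rows, patterns, labels, cluster, k):
    """Rank patterns by mean frequency across the systems tagged `cluster`
    and return the k highest-ranked patterns."""
    members = [r for r, lab in zip(rows, labels) if lab == cluster]
    means = [sum(col) / len(members) for col in zip(*members)]
    ranked = sorted(zip(patterns, means), key=lambda pm: pm[1], reverse=True)
    return [p for p, _ in ranked[:k]]

# Hypothetical frequency table: one row per storage system, one column per pattern.
patterns = ["00", "55", "7e", "ff"]
rows = [
    [3, 0, 0, 1],  # array-a
    [2, 0, 0, 2],  # array-b
    [0, 2, 3, 0],  # array-c
    [0, 1, 4, 0],  # array-d
]
labels = mean_shift(rows, bandwidth=2.0)
shared = {c: top_patterns_for_cluster(rows, patterns, labels, c, k=2)
          for c in sorted(set(labels))}
print(labels)  # [0, 0, 1, 1]
print(shared)  # {0: ['00', 'ff'], 1: ['7e', '55']}
```

In this toy run the two clusters receive different pattern subsets, which mirrors the claim language requiring the second subset of collected data patterns to differ from the first. Mean shift does not need the number of clusters in advance, but the result depends on the bandwidth, which would need tuning on real data.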

Assignees

Dell Products Lp

Inventors

Classifications

  • De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title

  • G06F16/285 · Primary

    Clustering or classification · CPC title

  • G06F3/067 · Primary

    Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title

  • Saving storage space on storage systems · CPC title

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11775483B2 cover?
An apparatus comprises at least one processing device configured to collect, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems and to cluster the plurality of storage systems into one or more data pattern sharing clusters based at least in part on the collected data patterns, a given one of the one or more data pattern sharing clusters compri…
Who is the assignee on this patent?
Dell Products Lp
What technology area does this patent fall under?
Primary CPC classification G06F16/1748. Mapped technology areas include Physics.
When was this patent published?
Publication date: Oct 3, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).