Deduplication analysis
US-2022027250-A1 · Jan 27, 2022 · US
US11775483B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11775483-B2 |
| Application number | US-202017136484-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 29, 2020 |
| Priority date | Dec 9, 2020 |
| Publication date | Oct 3, 2023 |
| Grant date | Oct 3, 2023 |
Official abstract text for this publication.
An apparatus comprises at least one processing device configured to collect, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems and to cluster the plurality of storage systems into one or more data pattern sharing clusters based at least in part on the collected data patterns, a given one of the one or more data pattern sharing clusters comprising two or more of the plurality of storage systems. The at least one processing device is also configured to identify, for the given data pattern sharing cluster, a subset of the collected data patterns and to provide, to the two or more storage systems of the given data pattern sharing cluster, the identified subset of the data patterns, wherein the identified subset of the collected data patterns are utilized by the two or more storage systems in performing data deduplication.
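The collection step in the abstract can be illustrated with a small sketch. This is not the patent's implementation. It assumes each storage system reports counts for its most frequently occurring patterns (as in claim 4) and that these reports are merged into one frequency table (as in claim 9). The function names `top_patterns` and `frequency_table`, the system names, and the hex-string patterns are all hypothetical.

```python
from collections import Counter

def top_patterns(blocks, n):
    """Return the n most frequent patterns in one system's blocks as {pattern: count}."""
    return dict(Counter(blocks).most_common(n))

def frequency_table(reports):
    """Merge per-system reports into a table: system -> [count per pattern].

    Columns follow the sorted union of all reported patterns; a pattern a
    system did not report is recorded with a count of 0.
    """
    patterns = sorted({p for counts in reports.values() for p in counts})
    table = {
        system: [counts.get(p, 0) for p in patterns]
        for system, counts in reports.items()
    }
    return patterns, table

# Hypothetical sample data: each string stands in for a repeated block pattern.
reports = {
    "array-a": top_patterns(["00", "00", "ff", "ab", "00", "ff"], n=2),
    "array-b": top_patterns(["00", "cd", "cd", "cd", "00"], n=2),
}
patterns, table = frequency_table(reports)
print(patterns)
print(table)
```

Each row of the resulting table is one storage system's frequency vector, which is the kind of input a clustering step could consume.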
Opening claim text (preview).
What is claimed is:
1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the at least one processing device being configured to perform steps of: collecting, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems; clustering the plurality of storage systems into two or more data pattern sharing clusters based at least in part on the collected data patterns, a first one of the two or more data pattern sharing clusters comprising a first subset of two or more of the plurality of storage systems, a second one of the two or more data pattern sharing clusters comprising a second subset of two or more of the plurality of storage systems, the second subset being different than the first subset; identifying, for the first and second data pattern sharing clusters, respective first and second subsets of the collected data patterns, the second subset of the collected data patterns being different than the first subset of the collected data patterns; providing the first subset of the collected data patterns to the two or more storage systems of the first data pattern sharing cluster, wherein the first subset of the collected data patterns are utilized by the two or more storage systems in the first data pattern sharing cluster for performing data deduplication; and providing the second subset of the collected data patterns to the two or more storage systems of the second data pattern sharing cluster, wherein the second subset of the collected data patterns are utilized by the two or more storage systems in the second data pattern sharing cluster for performing data deduplication; wherein identifying the first subset of the collected data patterns comprises selecting, for inclusion in the first subset of the collected data patterns, at least one data pattern collected from data deduplication software running on at least one of the two or more storage systems of the first data pattern sharing cluster which is not utilized by data deduplication software running on at least one of the two or more storage systems of the second data pattern sharing cluster.
2. The apparatus of claim 1 wherein the two or more storage systems in the first data pattern sharing cluster implement inline pattern detection for performing data deduplication, the inline pattern detection utilizing the first subset of the collected data patterns.
3. The apparatus of claim 2 wherein the inline pattern detection of a given one of the two or more storage systems in the first data pattern sharing cluster utilizes a set of predefined data patterns, the first subset of the collected data patterns comprising at least one data pattern not in the set of predefined data patterns.
4. The apparatus of claim 1 wherein collecting the data patterns comprises collecting, from each of the plurality of storage systems, a designated number of most frequently occurring data patterns for data stored in that storage system.
5. The apparatus of claim 1 wherein clustering the plurality of storage systems into the two or more data pattern sharing clusters comprises utilizing a mean-shift clustering algorithm.
6. The apparatus of claim 5 wherein the mean-shift clustering algorithm utilizes multidimensional scaling to achieve dimensionality reduction for the collected data patterns.
7. The apparatus of claim 6 wherein the multidimensional scaling takes as input a first data structure with entries characterizing a frequency of observation of each of the collected data patterns on each of the plurality of storage systems and provides as output a second data structure that projects the frequency of observation of each of the collected data patterns from a first dimension to a second dimension lower than the first dimension.
8. The apparatus of claim 6 wherein the mean-shift clustering algorithm produces a data structure that tags ones of the plurality of storage systems with labels corresponding to ones of the two or more data pattern sharing clusters to which the plurality of storage systems belong.
9. The apparatus of claim 1 wherein collecting the data patterns comprises generating a first data structure with entries denoting a frequency at which each of the collected data patterns is observed on each of the plurality of storage systems over a given time period.
10. The apparatus of claim 9 wherein clustering the plurality of storage systems takes as input the first data structure and produces a second data structure that tags the entries of the first data structure for each of the plurality of storage system with labels corresponding to ones of the two or more data pattern sharing clusters to which the plurality of storage systems belong.
11. The apparatus of claim 10 wherein identifying the first subset of the collected data patterns for the first data pattern sharing cluster comprises sorting the collected data patterns based at least in part on mean frequency of occurrence across the two or more storage systems in the first data pattern sharing cluster and selecting a designated number of the collected data patterns having a highest mean frequency of occurrence across the two or more storage systems in the first data pattern sharing cluster as the first subset of the collected data patterns for the first data pattern sharing cluster.
12. The apparatus of claim 1 wherein identifying the first subset of the collected data patterns for the first data pattern sharing cluster is based at least in part on frequencies of occurrence of the collected data patterns in each of the two or more storage systems of the first data pattern sharing cluster.
13. The apparatus of claim 1 wherein the at least one processing device is part of a monitoring and analytics platform external to the plurality of storage systems.
14. The apparatus of claim 13 wherein the monitoring and analytics platform comprises a cloud-based monitoring and analytics platform.
15. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code when executed by at least one processing device causes the at least one processing device to perform steps of: collecting, from a plurality of storage systems, data patterns for data stored in the plurality of storage systems; clustering the plurality of storage systems into two or more data pattern sharing clusters based at least in part on the collected data patterns, a first one of the two or more data pattern sharing clusters comprising a first subset of two or more of the plurality of storage systems, a second one of the two or more data pattern sharing clusters comprising a second subset of two or more of the plurality of storage systems, the second subset being different than the first subset; identifying, for the first and second data pattern sharing clusters, respective first and second subsets of the collected data patterns, the second subset of the collected data patterns being different than the first subset of the collected data patterns; providing the first subset of the collected data patterns to the two or more storage systems of the first data pattern sharing cluster, wherein the first subset of the collected data patterns are utilized by the two or more storage systems in the first data pattern sharing cluster for performing data deduplication; and providing the second subset of the collected data patterns to the two or more storage systems of the second data pattern sharing cluster, wherein the second subset of the collected data patterns are util
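The clustering and selection steps recited in claims 5 and 11 can be sketched as follows. This is a simplified illustration, not the claimed implementation. It assumes a flat-kernel mean shift written in plain Python, skips the multidimensional scaling of claims 6 and 7, and ranks patterns by mean frequency within each cluster. The function names, the `bandwidth` value, and the sample frequency table are hypothetical.

```python
import math

def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift; returns one cluster label per input point."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        shifted = []
        for m in modes:
            # Neighbors are the original points within `bandwidth` of the current mode.
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            if not near:  # guard against an empty neighborhood
                shifted.append(m)
                continue
            shifted.append([sum(col) / len(near) for col in zip(*near)])
        modes = shifted
    # Modes that ended up within half a bandwidth of each other share a label.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

def top_k_for_cluster(patterns, table, systems, labels, cluster, k):
    """Return the k patterns with the highest mean frequency across a cluster's systems."""
    members = [s for s, lab in zip(systems, labels) if lab == cluster]
    means = [
        sum(table[s][j] for s in members) / len(members)
        for j in range(len(patterns))
    ]
    ranked = sorted(zip(patterns, means), key=lambda pm: pm[1], reverse=True)
    return [p for p, _ in ranked[:k]]

# Hypothetical frequency table: rows are storage systems, columns are patterns.
patterns = ["00", "ff", "ab", "cd"]
systems = ["sys-a", "sys-b", "sys-c", "sys-d"]
table = {
    "sys-a": [9, 7, 1, 0],
    "sys-b": [9, 7, 0, 1],
    "sys-c": [1, 0, 9, 6],
    "sys-d": [0, 1, 9, 6],
}

labels = mean_shift([table[s] for s in systems], bandwidth=4.0)
first_subset = top_k_for_cluster(patterns, table, systems, labels, cluster=0, k=2)
second_subset = top_k_for_cluster(patterns, table, systems, labels, cluster=1, k=2)
print(labels, first_subset, second_subset)
```

In this toy data the two clusters end up with different pattern subsets, which mirrors the requirement in claim 1 that the second subset differ from the first. A production system would more likely rely on an established library for mean shift and multidimensional scaling than on a hand-rolled loop.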
De-duplication implemented within the file system, e.g. based on file segments (de-duplication techniques in storage systems for the management of data blocks G06F3/0641) · CPC title
Clustering or classification · CPC title
Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title
Saving storage space on storage systems · CPC title
Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title