System and method for balancing compression and read performance in a storage system

US11144507B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11144507-B2
Application numberUS-201916259510-A
CountryUS
Kind codeB2
Filing dateJan 28, 2019
Priority dateSep 26, 2013
Publication dateOct 12, 2021
Grant dateOct 12, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associated with one or more data chunks that are dissimilar to the first group but are likely accessed together. The first group of the similar data chunks and its associated dissimilar data chunks are compressed and stored in the first storage area.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising: identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, storing a plurality of entries of a data structure, wherein each of the entries corresponds to one of the sketches and its respective chunk ID, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the second feature is determined based on a plurality of features of a data chunk that is associated with the chunk ID; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container; and storing a plurality of storage containers in a persistent storage device of the storage system including the first storage container. 2. The method of claim 1 , further comprising: associating a second portion of the similar data chunks as a second group with a second storage area; associating with the second storage area one or more data chunks that are dissimilar to the second group but are likely accessed together; and compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage area. 3. The method of claim 1 , wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold. 4. The method of claim 1 , wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files. 5. The method of claim 1 , wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed. 6. The method of claim 1 , wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time. 7. The method of claim 6 , wherein data chunks that have been recently accessed are not reorganized based on their similarity. 8. The method of claim 1 , wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks. 9. The method of claim 8 , wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk. 10. The method of claim 1 , further comprising: determining that a third data chunk compressed and stored in a third storage area and a fourth data chunk compressed and stored in a fourth storage area are accessed frequently; and reorganizing data chunks stored in the third and fourth storage areas, such that the third data chunk and the fourth data chunk are compressed and stored together regardless whether they are similar. 11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations of balancing data compression and read performance of data chunks of a storage system, the operations comprising: identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, storing a plurality of entries of a data structure, wherein each of the entries corresponds to one of the sketches and its respective chunk ID, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the second feature is determined based on a plurality of features of a data chunk that is associated with the chunk ID; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container; and storing a plurality of storage containers in a persistent storage device of the storage system including the first storage container. 12. The machine-readable medium of claim 11 , wherein the operations further comprise: associating a second portion of the similar data chunks as a second group with a second storage area; associating with the second storage area one or more data chunks that are dissimilar to the second group but are likely accessed together; and compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage area. 13. The machine-readable medium of claim 11 , wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold. 14. The machine-readable medium of claim 11 , wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files. 15. The machine-readable medium of claim 11 , wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed. 16. The machine-readable medium of claim 11 , wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time. 17. The machine-readable medium of claim 16 , wherein data chunks that have been recently accessed are not reorganized based on their similarity. 18. The machine-readable medium of claim 11 , wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks. 19. The machine-readable medium of claim 18 , wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk. 20. The machine-readable medium of claim 11 , wherein the operations further comprise: determining that a third data chunk compressed and stored in a third storage area and a fourth data chunk compressed and stored in a fourth storage area are accessed frequently; and reorganizing data chunks stored in the third and fourth storage areas, such that the third data chunk and the fourth data chunk are compressed and stored together regardle

Assignees

Inventors

Classifications

  • using compression, e.g. sparse files · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11144507B2 cover?
Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associat…
Who is the assignee on this patent?
Emc Ip Holding Co Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/1744. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Oct 12 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).