US10216754B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10216754-B1 |
| Application number | US-201314038632-A |
| Country | US |
| Kind code | B1 |
| Filing date | Sep 26, 2013 |
| Priority date | Sep 26, 2013 |
| Publication date | Feb 26, 2019 |
| Grant date | Feb 26, 2019 |
Official abstract text for this publication.
Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associated with one or more data chunks that are dissimilar to the first group but are likely accessed together. The first group of the similar data chunks and its associated dissimilar data chunks are compressed and stored in the first storage area.
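The trade-off described in the abstract can be illustrated with a minimal sketch. It assumes zlib as the compressor and a simple offset table as the region layout; neither is specified by the publication, and the names `pack_region` and `read_chunk` are hypothetical. Similar chunks compress well together because later versions repeat earlier bytes, while the co-accessed dissimilar chunk rides along so that a single region load serves both.

```python
import random
import zlib

def pack_region(similar_chunks, coaccessed_chunks):
    """Compress similar chunks plus co-accessed dissimilar chunks as one region."""
    chunks = list(similar_chunks) + list(coaccessed_chunks)
    index, offset = [], 0
    for c in chunks:
        index.append((offset, len(c)))  # where each chunk sits once decompressed
        offset += len(c)
    return zlib.compress(b"".join(chunks)), index

def read_chunk(region, index, i):
    """Load (decompress) the entire region, then slice out chunk i."""
    data = zlib.decompress(region)
    start, length = index[i]
    return data[start:start + length]

# Demo data (assumed): three versions of one chunk and one unrelated chunk.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(4096))
v2 = v1[:2000] + b"edit" + v1[2000:]
v3 = v1[:100] + b"another edit" + v1[100:]
other = bytes(rng.getrandbits(8) for _ in range(4096))

region, index = pack_region([v1, v2, v3], [other])
separate = sum(len(zlib.compress(c)) for c in (v1, v2, v3, other))
print(len(region), "bytes as one region vs", separate, "bytes compressed separately")
```

Reading `other` costs one region load here, the same load that already brought in `v1` through `v3`.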
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising:
identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;
associating a first portion of the similar data chunks as a first group with a first storage container;
associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;
compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and
storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (IO) transactions is reduced.
2. The method of claim 1, further comprising: associating a second portion of the similar data chunks as a second group with a second storage area; associating with the second storage area one or more data chunks that are dissimilar to the second group but are likely accessed together; and compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage area.
3. The method of claim 1, wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold.
4. The method of claim 1, wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files.
5. The method of claim 1, wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed.
6. The method of claim 1, wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time.
7. The method of claim 6, wherein data chunks that have been recently accessed are not reorganized based on their similarity.
8. The method of claim 1, wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks.
9. The method of claim 8, wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk.
10. The method of claim 1, further comprising: determining that a third data chunk compressed and stored in a third storage area and a fourth data chunk compressed and stored in a fourth storage area are accessed frequently; and reorganizing data chunks stored in the third and fourth storage areas, such that the third data chunk and the fourth data chunk are compressed and stored together regardless whether they are similar.
11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for balancing data compression and read performance of data chunks of a storage system, the operations comprising:
identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system;
ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other;
associating a first portion of the similar data chunks as a first group with a first storage container;
associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together;
compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and
storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is acce
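The sketch-and-sort steps recited in claims 1 and 11 can be pictured with a short, non-authoritative example. Everything concrete in it is an assumption made for illustration: the 16-byte sliding window used as the "sub-region", the SHA-256 window hash, the four linear transforms in `COEFFS`, two features per super feature, and the per-position maximum used for the region sketch. The function names `sketch`, `order_by_sketch`, and `region_sketch` are hypothetical and do not come from the publication.

```python
import hashlib
from itertools import groupby

WINDOW = 16                # assumed sub-region (sliding window) size in bytes
MASK = (1 << 64) - 1
# Assumed linear transforms (a, b); each one yields one feature.
COEFFS = [(0x9E3779B97F4A7C15, 1), (0xC2B2AE3D27D4EB4F, 3),
          (0x165667B19E3779F9, 5), (0xD6E8FEB86659FD93, 7)]
FEATURES_PER_SF = 2        # assumed number of features concatenated per super feature

def sketch(chunk: bytes) -> tuple:
    """Return a tuple of super features for one chunk (hypothetical layout)."""
    hashes = [int.from_bytes(hashlib.sha256(chunk[i:i + WINDOW]).digest()[:8], "big")
              for i in range(max(1, len(chunk) - WINDOW + 1))]
    # Each feature is the maximum transformed hash over all sub-regions.
    features = [max((a * h + b) & MASK for h in hashes) for a, b in COEFFS]
    super_features = []
    for s in range(0, len(features), FEATURES_PER_SF):
        group = features[s:s + FEATURES_PER_SF]
        concat = b"".join(f.to_bytes(8, "big") for f in group)
        # A super feature is a hash of the concatenated maximum hashes.
        super_features.append(hashlib.sha256(concat).hexdigest()[:16])
    return tuple(super_features)

def order_by_sketch(entries):
    """entries: list of (chunk_id, sketch). Sort on the first feature, then
    sort each subset that shares a first feature on the second feature."""
    by_first = sorted(entries, key=lambda e: e[1][0])
    ordered = []
    for _, subset in groupby(by_first, key=lambda e: e[1][0]):
        ordered.extend(sorted(subset, key=lambda e: e[1][1]))
    return ordered

def region_sketch(sketches):
    """Assumed rule: per position, keep the maximum chunk super feature."""
    return tuple(max(column) for column in zip(*sketches))
```

After `order_by_sketch`, entries whose sketches share super features sit next to each other, so neighbouring chunk IDs are candidates for the same compression region. Because a maximum over many windows rarely changes when only a few bytes are edited, near-duplicate chunks tend to keep the same super features, although this toy version makes no guarantee of that.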
Physics · mapped topic
using compression, e.g. sparse files · CPC title