Who is the assignee on this patent?

Emc Corp, Emc Ip Holding Co Llc

What technology area does this patent fall under?

Primary CPC classification G06F17/30153. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Feb 26 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).

System and method for balancing compression and read performance in a storage system

US10216754B1 · US · B1

Patent metadata
Field	Value
Publication number	US-10216754-B1
Application number	US-201314038632-A
Country	US
Kind code	B1
Filing date	Sep 26, 2013
Priority date	Sep 26, 2013
Publication date	Feb 26, 2019
Grant date	Feb 26, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associated with one or more data chunks that are dissimilar to the first group but are likely accessed together. The first group of the similar data chunks and its associated dissimilar data chunks are compressed and stored in the first storage area.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method for balancing data compression and read performance of data chunks of a storage system, the method comprising: identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is accessed by loading an entire compression region of a container associated with the data chunk into a memory, such that a number of input and output (TO) transactions is reduced. 2. The method of claim 1 , further comprising: associating a second portion of the similar data chunks as a second group with a second storage area; associating with the second storage area one or more data chunks that are dissimilar to the second group but are likely accessed together; and compressing and storing the second group of the similar data chunks and its associated dissimilar data chunks in the second storage area. 3. The method of claim 1 , wherein a number of similar data chunks associated with the first storage container is limited to a predetermined minimum or maximum threshold. 4. The method of claim 1 , wherein the dissimilar data chunks are located near one or more of the similar data chunks in one or more files. 5. The method of claim 1 , wherein the dissimilar data chunks were accessed within a predetermined period of time in which the similar data chunks were accessed. 6. The method of claim 1 , wherein the similar data chunks are identified from data chunks associated with one or more files that have not been accessed for a predetermined period of time. 7. The method of claim 6 , wherein data chunks that have been recently accessed are not reorganized based on their similarity. 8. The method of claim 1 , wherein the dissimilar chunks include a second group of similar data chunks that is not similar to the first group of similar data chunks. 9. The method of claim 8 , wherein the similar data chunks of the first group represents different versions of a first data chunk, and wherein the similar data chunks of the second group represents different versions of a second data chunk. 10. The method of claim 1 , further comprising: determining that a third data chunk compressed and stored in a third storage area and a fourth data chunk compressed and stored in a fourth storage area are accessed frequently; and reorganizing data chunks stored in the third and fourth storage areas, such that the third data chunk and the fourth data chunk are compressed and stored together regardless whether they are similar. 11. A non-transitory machine-readable medium having instructions stored therein, which when executed by a processor, cause the processor to perform operations for balancing data compression and read performance of data chunks of a storage system, the operations comprising: identifying similar data chunks based on sketches of a plurality of data chunks stored in the storage system; ordering the similar data chunks of the storage system to be positioned close to each other by scanning a metadata to retrieve chunk identifiers (IDs) and sketches of the plurality of data chunks, wherein each sketch includes a plurality of super features, each super feature being based on hashing one or more concatenated maximum hashes or minimum hashes of sub-regions of the corresponding data chunk, storing the chunk IDs and sketches in a data structure, wherein the data structure includes a plurality of entries, each corresponding to one of the sketches and its respective chunk ID, and sorting the entries of the data structure based on the sketches of the plurality of data chunks of the storage system, including determining that a first sketch of the sketches includes a first feature and a second feature, sorting the entries of the data structure based on the first feature, identifying a subset of the entries of the data structure that are associated with the first feature, and sorting the subset of the entries of the data structure based on the second feature, wherein the similar data chunks of the storage system are rearranged based on the sorted entries such that similar data chunks of the storage system are positioned close to each other; associating a first portion of the similar data chunks as a first group with a first storage container; associating with the first storage container one or more data chunks that are dissimilar to the first group but are likely accessed together; compressing the first group of the similar data chunks and its associated dissimilar data chunks in a first compression region of the first storage container, wherein the first storage container contains a plurality of compression regions, each compression region storing a plurality of data chunks and is represented by a region sketch that is generated based on sketches of the plurality of data chunks stored therein for purposes of identifying similar data chunks, wherein the region sketch is generated by one or more selected super features for the container, wherein the one or more selected super features includes: a maximum chunk super feature, or a minimum chunk super feature; and storing the first storage container in a persistent storage device of the storage system that stores a plurality of storage containers, wherein a data chunk stored in the persistent storage device is acce

Assignees

Inventors

Classifications

G06F17/30153Primary
Physics · mapped topic
G06F16/1744Primary
using compression, e.g. sparse files · CPC title

Patent family

Related publications grouped by family.

View patent family 65410931

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10216754B1 cover?: Techniques for balancing data compression and read performance of data chunks of a storage system are described herein. According to one embodiment, similar data chunks are identified based on sketches of a plurality of data chunks stored in the storage system. A first portion of the similar data chunks as a first group is associated with a first storage area. The first storage area is associat…
Who is the assignee on this patent?: Emc Corp, Emc Ip Holding Co Llc
What technology area does this patent fall under?: Primary CPC classification G06F17/30153. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Feb 26 2019 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).