Processing device configured for efficient generation of compression estimates for datasets

US11609883B2 · US · B2

Patent metadata
Publication number: US-11609883-B2
Application number: US-201815991380-A
Country: US
Kind code: B2
Filing date: May 29, 2018
Priority date: May 29, 2018
Publication date: Mar 21, 2023
Grant date: Mar 21, 2023

Abstract

Official abstract text for this publication.

An apparatus in one embodiment comprises at least one processing device comprising a processor coupled to a memory. The processing device is configured to identify a dataset to be scanned to generate a compression estimate for that dataset, to designate a scan criterion to be utilized in the scan, and for each of a plurality of pages of the dataset, to scan the page, where scanning the page includes performing a computation on the page to obtain a page result, determining whether or not the page result satisfies the designated scan criterion, and responsive to the page result satisfying the designated scan criterion, updating a corresponding entry of a compression estimate table for the dataset. The processing device generates the compression estimate for the dataset based at least in part on contents of the compression estimate table. The scan criterion may comprise, for example, a designated content-based signature prefix, or a designated subset inclusion characteristic defining a polynomial-based signature subspace.
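The prefix-based scan described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: SHA-1 as the content-based signature, the one-byte zero prefix, the eight-byte page identifier, an initial counter value of 1, and the names `scan_dataset`, `PREFIX`, and `ID_BYTES` are all assumptions chosen for illustration.

```python
import hashlib

PREFIX = b"\x00"  # hypothetical designated signature prefix (one zero byte)
ID_BYTES = 8      # hypothetical: table keys keep only the first 8 signature bytes


def scan_dataset(pages, prefix=PREFIX, id_bytes=ID_BYTES):
    """Scan pages in order and return a compression estimate table.

    The table maps a truncated signature (the page identifier) to a counter.
    """
    table = {}
    for page in pages:
        # Computation on the page yields the page result (here, a SHA-1 digest).
        signature = hashlib.sha1(page).digest()
        # Scan criterion: the signature must start with the designated prefix.
        if not signature.startswith(prefix):
            continue
        page_id = signature[:id_bytes]
        # Insert with an assumed initial value of 1, or increment if present.
        table[page_id] = table.get(page_id, 0) + 1
    return table
```

If signatures are uniformly distributed, a one-byte prefix is expected to select about 1 in 256 pages, so the prefix length effectively sets the sampling ratio. Because selection depends only on page content, every copy of a given page is either always sampled or never sampled.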

First claim

Opening claim text (preview).

What is claimed is:

1. An apparatus comprising: at least one processing device comprising a processor coupled to a memory; the processing device being configured: to identify a dataset stored on a first storage system to be scanned to generate a first compression estimate for that dataset; to designate a scan criterion to be utilized in the scan; for each of a plurality of pages of the dataset, to scan the page by: performing a computation on the page to obtain a page result; determining whether or not the page result satisfies the designated scan criterion; and responsive to the page result satisfying the designated scan criterion, updating a corresponding entry of a compression estimate table for the dataset; to generate the first compression estimate for the dataset based at least in part on contents of the compression estimate table; to generate a second compression estimate for the dataset based at least in part on the contents of the compression estimate table and enhanced compression functionality available in a second storage system; and to automatically migrate the dataset from the first storage system to the second storage system for compression based at least in part on the second compression estimate indicating that a threshold level of enhanced compression is achieved at the second storage system; wherein scanning the plurality of pages of the dataset comprises sequentially scanning through the plurality of pages of the dataset and applying the designated scan criterion individually to each of the page results of respective ones of the plurality of pages as part of the scanning; wherein the designated scan criterion utilized in the scanning of the plurality of pages of the dataset defines a subspace of a total scan space for the scan; and wherein the designated scan criterion utilized in the scanning of the plurality of pages of the dataset further establishes a sampling ratio of the scanned pages as part of the scanning, based at least in part on the defined subspace of the total scan space for the scan.

2. The apparatus of claim 1 wherein the processing device is implemented in one of: a host device configured to communicate over a network with the first storage system that stores the dataset; and the first storage system that stores the dataset.

3. The apparatus of claim 1 wherein the dataset comprises a set of one or more logical storage volumes of the first storage system.

4. The apparatus of claim 1 wherein the designated scan criterion comprises a designated content-based signature prefix and scanning the page comprises: computing a content-based signature for the page; comparing an initial portion of the content-based signature to the designated content-based signature prefix; and responsive to a match between the initial portion and the designated content-based signature prefix, updating a corresponding entry of the compression estimate table for the dataset.

5. The apparatus of claim 4 wherein the designated content-based signature prefix comprises a specified number of initial content-based signature bytes with the initial bytes each having a designated value.

6. The apparatus of claim 1 wherein the designated scan criterion comprises a designated subset inclusion characteristic and scanning the page comprises: computing a polynomial-based signature for the page; determining whether or not the polynomial-based signature satisfies the designated subset inclusion characteristic; and responsive to the polynomial-based signature satisfying the designated subset inclusion characteristic, computing a content-based signature for the page and updating a corresponding entry of the compression estimate table for the dataset based at least in part on the content-based signature.

7. The apparatus of claim 6 wherein the designated subset inclusion characteristic specifies that application of a designated function to the polynomial-based signature produces a particular result.

8. The apparatus of claim 6 wherein the polynomial-based signature comprises an n-bit cyclic redundancy check (CRC) value.

9. The apparatus of claim 1 wherein updating a corresponding entry of the compression estimate table for a given one of the pages of the dataset comprises one of the following operations (i) and (ii): (i) responsive to a page identifier of the given page not already being present in the compression estimate table, inserting the page identifier into the compression estimate table and setting an associated counter to an initial value; and (ii) responsive to the page identifier already being present in the compression estimate table, incrementing its associated counter.

10. The apparatus of claim 1 wherein the corresponding entry is configured to include a page identifier and further wherein the page identifier comprises a specified number of initial bytes of a content-based signature of that page.

11. The apparatus of claim 1 wherein the compression estimate table for the dataset comprises a plurality of entries for respective ones of the pages of that dataset and wherein each of the entries is configured to include a page identifier that comprises less than an entire content-based signature of its corresponding page.

12. The apparatus of claim 1 wherein generating the first compression estimate for the dataset based at least in part on contents of the compression estimate table further comprises: computing a partial compression estimate based at least in part on compression values associated with respective entries of the compression estimate table; and scaling the partial compression estimate to obtain the first compression estimate for the dataset; wherein scaling the partial compression estimate comprises processing the partial compression estimate utilizing an inverse of the sampling ratio.

13. The apparatus of claim 1 wherein the processing device is configured to adjust one or more characteristics of a storage configuration of the dataset based at least in part on the first compression estimate generated for the dataset.

14. The apparatus of claim 1 wherein the processing device is configured: to generate one or more additional compression estimates for respective ones of one or more additional datasets; and to select a particular one of the datasets for compression based at least in part on their respective compression estimates.

15. A method comprising: identifying a dataset stored on a first storage system to be scanned to generate a first compression estimate for that dataset; designating a scan criterion to be utilized in the scan; for each of a plurality of pages of the dataset, scanning the page by: performing a computation on the page to obtain a page result; determining whether or not the page result satisfies the designated scan criterion; and responsive to the page result satisfying the designated scan criterion, updating a corresponding entry of a compression estimate table for the dataset; generating the first compression estimate for the dataset based at least in part on the contents of the compression estimate table; generating a second compression estimate for the dataset based at least in part on the contents of the compression estimate table and enhanced compression functionality available in a second storage system; and automatically migrating the dataset from the first storage system to the second storage system for compression based at least in part on the second compression estimate indicating that a threshold level of enhanced compression is achieved at the second storage system; wherein scanning the plurality of pages
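The CRC-based subset test of claims 6 to 8, the inverse-ratio scaling of claim 12, and a threshold check in the spirit of the migration step can be sketched as follows. This is only an illustration under stated assumptions: `zlib.crc32` stands in for the n-bit CRC, "CRC modulo a constant equals zero" stands in for the designated function, `zlib.compress` supplies the per-entry compression value, each distinct sampled page is counted once, and the names `in_subset`, `scan`, `estimate_compressed_bytes`, `should_migrate`, and `MODULUS` are hypothetical.

```python
import hashlib
import zlib

MODULUS = 16  # hypothetical: keep pages whose CRC32 is divisible by 16 (ratio 1/16)


def in_subset(page, modulus=MODULUS):
    """Subset inclusion test: an assumed function of the CRC must equal zero."""
    return zlib.crc32(page) % modulus == 0


def scan(pages, modulus=MODULUS, id_bytes=8):
    """Return {page_id: [count, compressed_size]} for pages inside the subspace."""
    table = {}
    for page in pages:
        if not in_subset(page, modulus):
            continue  # cheap CRC test failed, so the costlier hash is skipped
        page_id = hashlib.sha1(page).digest()[:id_bytes]
        if page_id in table:
            table[page_id][0] += 1  # identifier already present: increment
        else:
            # New identifier: counter starts at 1, store an assumed compression value.
            table[page_id] = [1, len(zlib.compress(page))]
    return table


def estimate_compressed_bytes(table, sampling_ratio):
    """Scale the partial estimate by the inverse of the sampling ratio."""
    # Assumption: each distinct sampled page contributes its compressed size once.
    partial = sum(size for _count, size in table.values())
    return partial / sampling_ratio


def should_migrate(logical_bytes, estimated_bytes, threshold_ratio):
    """Hypothetical policy: migrate when the estimated reduction meets a threshold."""
    return estimated_bytes > 0 and logical_bytes / estimated_bytes >= threshold_ratio
```

With `MODULUS = 16` the expected sampling ratio is 1/16 for uniformly distributed CRC values, so a caller would pass `sampling_ratio=1 / 16` to `estimate_compressed_bytes`. A second estimate for a different storage system could reuse the same table with a different compression value per entry, which is one possible reading of the claim rather than a confirmed design.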

Assignees

EMC IP Holding Co LLC

Inventors

Classifications

  • G06F3/0638 (Primary)

    Organizing or formatting or addressing of data

  • based on delta files

  • hash tables

  • Saving storage space on storage systems

  • In-line storage system

Frequently asked questions

Answers are generated from the same data shown on this page.

Who is the assignee on this patent?
EMC IP Holding Co LLC
What technology area does this patent fall under?
Primary CPC classification G06F3/0638. Mapped technology areas include Physics.
When was this patent published?
Publication date: Mar 21, 2023 (kind code B2). Legal status and post-grant events are not shown on this page.