Data skipping and compression through partitioning of data

US2017046367A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2017046367-A1
Application number: US-201514821915-A
Country: US
Kind code: A1
Filing date: Aug 10, 2015
Priority date: Aug 10, 2015
Publication date: Feb 16, 2017
Grant date:

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Conventionally, in addition to indexing, a synopsis of a base table of a database is used to skip and compress data. However, scanning of the entire synopsis for all queries is required, which takes a long time when the synopsis gets significantly big in a large data warehouse. A method for efficient data skipping and compression through vertical partitioning of data is provided to eliminate the cost of synopsis storage overhead while enabling the synopsis search functionality.
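
The conventional approach mentioned in the abstract can be pictured with a minimal sketch. It assumes the synopsis stores one (min, max) pair per block of the base table; the function names, block size, and sample values are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple


def build_synopsis(values: List[int], block_size: int) -> List[Tuple[int, int]]:
    """Return one (min, max) entry per block of `block_size` values."""
    synopsis = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        synopsis.append((min(block), max(block)))
    return synopsis


def blocks_to_scan(synopsis: List[Tuple[int, int]], low: int, high: int) -> List[int]:
    """Indices of blocks whose [min, max] range overlaps the query range.

    Every synopsis entry is visited, so the cost grows with the number
    of blocks even when almost all of them end up being skipped.
    """
    return [i for i, (lo, hi) in enumerate(synopsis) if hi >= low and lo <= high]


# Hypothetical sample data: three blocks of four values each.
values = [3, 5, 8, 12, 40, 41, 45, 47, 90, 95, 97, 99]
synopsis = build_synopsis(values, block_size=4)
candidates = blocks_to_scan(synopsis, low=42, high=50)
```

In this sketch only the middle block survives the range check, but the whole synopsis still has to be read to find that out, which is the overhead the abstract says grows in a large data warehouse.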

First claim

Opening claim text (preview).

What is claimed is:

1. A method for scanning and skipping data blocks, the method comprising: partitioning projection of each data value of a set of data values into a plurality of data types, wherein the date types include numerical and/or comparable bytes value; and storing the plurality of data types in a set of separate columns, wherein there is a separate column for each data type; wherein: at least the step of storing the plurality of data types is performed by computer software running on computer hardware.

2. The method of claim 1, further comprising: retrieving a data block, the data block comprising a subset of the plurality of data types; applying a plurality of predicates on the data block, the plurality of predicates corresponding to the plurality of data types; skipping the data block, conditioned upon the failure of any predicate of the plurality of predicates; and returning the data block, conditioned upon the passing of each predicate of the plurality of predicates.

3. The method of claim 1, wherein the step of partitioning projection of each data value includes: transforming each data value into a transformed value through the use of a custom formula that includes a geospatial grid; and dividing the transformed value into a set of most significant digits and least significant digits.

4. The method of claim 1, further comprising: identifying a set of correlated values from the plurality of data types.

5. The method of claim 4, wherein the set of correlated values includes a prefix, a postfix, and/or a set of substrings.

6. The method of claim 4, further comprising: storing only once the set of correlated values.

7. The method of claim 1, further comprising: compressing the plurality of data types in the set of separate columns.

8. The method of claim 1, further comprising: sorting the plurality of data types in the set of separate columns.

9. The method of claim 1, further comprising: generating a set of range summaries of the plurality of data types.

10. A computer program product for scanning and skipping data blocks, the computer program product comprising a computer readable storage medium having stored thereon: first program instructions programmed to partition projection of each data value of a set of data values into a plurality of data types, wherein the date types include numerical and/or comparable bytes value; and second program instructions programmed to store the plurality of data types in a set of separate columns, wherein there is a separate column for each data type; wherein: at least the step of storing the plurality of data types is performed by computer software running on computer hardware.

11. The computer program product of claim 10, further comprising: third program instructions programmed to retrieve a data block, the data block comprising a subset of the plurality of data types; fourth program instructions programmed to apply a plurality of predicates on the data block, the plurality of predicates corresponding to the plurality of data types; fifth program instructions programmed to skip the data block, conditioned upon the failure of any predicate of the plurality of predicates; and sixth program instructions programmed to return the data block, conditioned upon the passing of each predicate of the plurality of predicates.

12. The computer program product of claim 10, further comprising: third program instructions programmed to identify a set of correlated values from the plurality of data types.

13. The computer program product of claim 12, wherein the set of correlated values includes a prefix, a postfix, and/or a set of substrings.

14. The computer program product of claim 12, further comprising: fourth program instructions programmed to store only once the set of correlated values.

15. A computer system for scanning and skipping data blocks, the computer system comprising: a processor(s) set; and a computer readable storage medium; wherein: the processor set is structured, located, connected, and/or programmed to run program instructions stored on the computer readable storage medium; and the program instructions include: first program instructions programmed to partition projection of each data value of a set of data values into a plurality of data types, wherein the date types include numerical and/or comparable bytes value; and second program instructions programmed to store the plurality of data types in a set of separate columns, wherein there is a separate column for each data type; wherein: at least the step of storing the plurality of data types is performed by computer software running on computer hardware.

16. The computer system of claim 15, further comprising: third program instructions programmed to retrieve a data block, the data block comprising a subset of the plurality of data types; fourth program instructions programmed to apply a plurality of predicates on the data block, the plurality of predicates corresponding to the plurality of data types; fifth program instructions programmed to skip the data block, conditioned upon the failure of any predicate of the plurality of predicates; and sixth program instructions programmed to return the data block, conditioned upon the passing of each predicate of the plurality of predicates.

17. The computer system of claim 15, further comprising: third program instructions programmed to compress the plurality of data types in the set of separate columns.

18. The computer system of claim 15, further comprising: third program instructions programmed to sort the plurality of data types in the set of separate columns.

19. The computer system of claim 15, further comprising: third program instructions programmed to generate a set of range summaries of the plurality of data types.

20. The computer system of claim 15, further comprising: third program instructions programmed to identify a set of correlated values from the plurality of data types.
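
One way to picture claims 1 to 3 is the sketch below. It is an illustration under stated assumptions, not the patented implementation: the split point `SPLIT`, the function names, the block size, and the sample values are all hypothetical, and the geospatial-grid transformation of claim 3 is left out. Each value is divided into most significant and least significant digits, each part goes into its own column, and a block is skipped as soon as the predicate for any column fails.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical split point: the low three decimal digits form the
# "least significant" part, the remaining digits the "most significant" part.
SPLIT = 1000


def partition(values: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split each value into two parts and store each part in its own column."""
    msd = [v // SPLIT for v in values]
    lsd = [v % SPLIT for v in values]
    return msd, lsd


def scan_blocks(
    columns: Sequence[Sequence[int]],
    predicates: Sequence[Callable[[int], bool]],
    block_size: int,
) -> List[int]:
    """Return indices of blocks in which every column predicate passes.

    A column predicate passes for a block if at least one value of that
    column in the block satisfies it. `all` stops at the first failing
    column, so later columns of a skipped block are never read.
    Returned blocks are candidates only and may contain false positives.
    """
    passing = []
    for start in range(0, len(columns[0]), block_size):
        ok = all(
            any(pred(v) for v in col[start:start + block_size])
            for col, pred in zip(columns, predicates)
        )
        if ok:
            passing.append(start // block_size)
    return passing


# Hypothetical sample data, three blocks with block_size=3.
values = [12001, 12450, 12999, 47010, 47500, 47999, 83005, 83400]
msd, lsd = partition(values)

target = 47500
hits = scan_blocks(
    [msd, lsd],
    [lambda m: m == target // SPLIT, lambda l: l == target % SPLIT],
    block_size=3,
)
```

With this sample data the most significant column holds long runs of repeated values (12, 12, 12, 47, 47, 47, 83, 83), which hints at why splitting values this way can also help the compression step mentioned in claim 7.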

Assignees

Inventors

Classifications

  • Ensuring data consistency and integrity · CPC title

  • G06F16/215 · Primary

    Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

  • Unary operations; Data partitioning operations · CPC title

  • Physics · mapped topic

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2017046367A1 cover?
Conventionally, in addition to indexing, a synopsis of a base table of a database is used to skip and compress data. However, scanning of the entire synopsis for all queries is required, which takes a long time when the synopsis gets significantly big in a large data warehouse. A method for efficient data skipping and compression through vertical partitioning of data is provided to eliminate th…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/215. Mapped technology areas include Physics.
When was this patent published?
Publication date Feb 16, 2017 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).