Encoding data stored in a column-oriented manner

US9325344B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9325344-B2
Application numberUS-201113206827-A
CountryUS
Kind codeB2
Filing dateAug 10, 2011
Priority dateDec 3, 2010
Publication dateApr 26, 2016
Grant dateApr 26, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Data stored in a column-oriented manner is encoded using a data mining algorithm for finding column patterns among a set of data tuples, where each data tuple contains a set of columns, and the data mining algorithm treats all columns and all column combinations and column ordering similarly or in the same manner when looking for column patterns. Column values are ordered occurring in the column patterns based on their frequencies into a prefix tree, where the prefix tree defines a pattern order. The data tuples are sorted according to the pattern order, resulting in sorted data tuples, and columns of the sorted data tuples are encoded using run-length encoding.

First claim

Opening claim text (preview).

What is claimed is: 1. A method for encoding data stored in a column-oriented manner, comprising: finding column patterns among a set of data tuples, wherein each data tuple contains a set of columns, and wherein the column patterns are identified based upon a single column individually, and upon a combination of columns together, by grouping each column together with one or more other columns, without regard for column order, and evaluating rows of the grouped columns to determine the column patterns; identifying single column values occurring in the column patterns, in order to reduce a number of nodes of a prefix tree comprising a plurality of nodes; ordering the single column values occurring in the column patterns based on their corresponding frequencies into the prefix tree, wherein the prefix tree defines a pattern order and is constructed to share the nodes corresponding to the single column values occurring more frequently; sorting the data tuples according to the pattern order, resulting in sorted data tuples; and encoding columns of the sorted data tuples using run-length encoding. 2. The method according to claim 1 , further comprising: dividing the sorted data tuples into smaller already precompressed parts; and recursively using the smaller already precompressed parts to find the column patterns among the sorted data tuples. 3. The method according to claim 1 , further comprising using only values of different columns in each data tuple. 4. The method according to claim 1 , further comprising using a limited number of columns of a single table as the set of columns. 5. The method according to claim 4 , further comprising interpreting the single table as a transactional database with the data tuples as transactions and pairs of column ID and column value as items. 6. The method according to claim 5 , further comprising: generating an output list containing at least one of all column value combinations and all closed column value combinations under a minimum support threshold, which is dynamically adaptable to reduce a size of the output list; and bounding a maximum length of each column value combination by the number of columns used in each data tuple. 7. The method according to claim 4 , further comprising using only columns of the single table having certain characteristics promising a high compression ratio for building the data tuples. 8. The method according to claim 7 , further comprising: estimating cardinality for columns of the single table using an incremental cardinality algorithm; and using the cardinality as a characteristic promising a high compression ratio. 9. The method according to claim 1 , further comprising: assigning a pair of column ID and column value to each node of the prefix tree, each node having an ID field which is unique within the prefix tree and two counter fields; and linking the nodes together by at least one of pattern links, column links or rank links. 10. The method according to claim 1 , further comprising modifying paths of the prefix tree of the column patterns by weighting or swapping the nodes of the prefix tree to optimize the prefix tree. 11. A data processing system for encoding data stored in a column-oriented manner, comprising: a processor configured to: find column patterns among a set of data tuples, wherein each data tuple contains a set of columns, and wherein the column patterns are identified based upon a single column individually, and upon a combination of columns together, by grouping each column together with one or more other columns, without regard for column order, and evaluating rows of the grouped columns to determine the column patterns; identify single column values occurring in the column patterns, in order to reduce a number of nodes of a prefix tree comprising a plurality of nodes; order the single column values occurring in the column patterns based on their corresponding frequencies into the prefix tree, wherein the prefix tree defines a pattern order and is constructed to share the nodes corresponding to the single column values occurring more frequently; sort the data tuples according to the pattern order, resulting in sorted data tuples; and encode columns of the sorted data tuples using run-length encoding. 12. The data processing system according to claim 11 , wherein the sorting of the data tuples by the processor includes dividing the sorted data tuples into smaller already precompressed parts, and performing a recursive run for finding the column patterns among the sorted data tuples. 13. A computer program product comprising: a non-transitory computer readable storage medium having a computer readable program code embodied therewith, the computer readable program code configured to perform operations of: finding column patterns among a set of data tuples, wherein each data tuple contains a set of columns, and wherein the column patterns are identified based upon a single column individually, and upon a combination of columns together, by grouping each column together with one or more other columns, without regard for column order, and evaluating rows of the grouped columns to determine the column patterns; identifying single column values occurring in the column patterns, in order to reduce a number of nodes of a prefix tree comprising a plurality of nodes; ordering the single column values occurring in the column patterns based on their corresponding frequencies into the prefix tree, wherein the prefix tree defines a pattern order and is constructed to share the nodes corresponding to the single column values occurring more frequently; sorting the data tuples according to the pattern order, resulting in sorted data tuples; and encoding columns of the sorted data tuples using run-length encoding. 14. The computer program product of claim 13 , where the operations further comprise: dividing the sorted data tuples into smaller already precompressed parts; and recursively using the smaller already precompressed parts to find the column patterns among the sorted data tuples. 15. The computer program product of claim 13 , wherein the operations use only values of different columns in each data tuple. 16. The computer program product of claim 13 , wherein the operations use a limited number of columns of a single table as the set of columns. 17. The computer program product of claim 16 , the operations further comprising interpreting the single table as a transactional database with the data tuples as transactions and pairs of column ID and column value as items. 18. The computer program product of claim 17 , wherein the operations further comprise: generating an output list containing at least one of all column value combinations and all closed column value combinations under a minimum support threshold, which is dynamically adaptable to reduce a size of the output list; and bounding a maximum length of each column value combination by the number of columns used in each data tuple. 19. The computer program product of claim 16 , the operations further comprising using only columns of the single table having certain characteristics promising a high compression ratio for building the data tuples. 20. The computer program product of claim 19 , the operations further comprising: estimating cardinality for columns of the single table using an incremental cardinality algorithm; and using the cardinality as a characteristic promising a high compression ratio. 21. The comput

Assignees

Inventors

Classifications

  • H03M7/40Primary

    Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code · CPC title

  • Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind · CPC title

  • Physics · mapped topic

  • Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9325344B2 cover?
Data stored in a column-oriented manner is encoded using a data mining algorithm for finding column patterns among a set of data tuples, where each data tuple contains a set of columns, and the data mining algorithm treats all columns and all column combinations and column ordering similarly or in the same manner when looking for column patterns. Column values are ordered occurring in the colum…
Who is the assignee on this patent?
Beier Felix, Draese Oliver, Stolze Knut, and 1 more
What technology area does this patent fall under?
Primary CPC classification H03M7/40. Mapped technology areas include Electricity.
When was this patent published?
Publication date Tue Apr 26 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).