Avoidance of intermediate data skew in a massive parallel processing environment
US-2015186465-A1 · Jul 2, 2015 · US
US9569493B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9569493-B2 |
| Application number | US-201314144893-A |
| Country | US |
| Kind code | B2 |
| Filing date | Dec 31, 2013 |
| Priority date | Dec 31, 2013 |
| Publication date | Feb 14, 2017 |
| Grant date | Feb 14, 2017 |
Official abstract text for this publication.
A computer-implemented method for minimizing join operation processing time within a database system based on estimated joined table spread of the database system is provided. The computer-implemented method includes estimating the value distribution of data in a joined table, wherein the joined table is a result of a join operation between two instances of tables of a database system. The computer-implemented method further includes determining boundaries for partitioning at least one range of attributes of the estimated value distribution, wherein the boundaries for partitioning at least one range of attributes of the estimated value distribution correspond to a same number of rows of the joined table. The computer-implemented method further includes determining at least one assignment of the determined partition of the at least one range of attributes to processing units of the database system.
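The three steps in the abstract (estimate the joined table's value distribution, choose range boundaries that hold equal numbers of joined rows, assign ranges to processing units) can be illustrated with a minimal sketch. It assumes an equi-join on a single key, so the number of joined rows for a key value is the product of that value's row counts in the two inputs. The function names `estimate_join_rows`, `equal_row_boundaries`, and `assign_unit` are hypothetical and do not come from the patent.

```python
import bisect
from collections import Counter


def estimate_join_rows(left_keys, right_keys):
    """Estimate joined rows per key value as count_left * count_right.

    Assumption: a plain equi-join on one key column.
    """
    left, right = Counter(left_keys), Counter(right_keys)
    return {k: left[k] * right[k] for k in left.keys() & right.keys()}


def equal_row_boundaries(rows_per_value, n_units):
    """Return upper boundaries of key ranges holding roughly equal joined rows.

    At most n_units - 1 boundaries are produced; a single very frequent
    value cannot be split by range boundaries alone.
    """
    total = sum(rows_per_value.values())
    target = total / n_units
    boundaries, running = [], 0
    for value in sorted(rows_per_value):
        running += rows_per_value[value]
        # Close the current range once the cumulative row count
        # reaches the next multiple of the per-unit target.
        if len(boundaries) < n_units - 1 and running >= target * (len(boundaries) + 1):
            boundaries.append(value)
    return boundaries


def assign_unit(value, boundaries):
    """Map a key value to the index of the processing unit owning its range."""
    return bisect.bisect_left(boundaries, value)


rows = estimate_join_rows([1, 1, 1, 1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5, 5])
bounds = equal_row_boundaries(rows, n_units=2)
```

In this toy input, key `1` alone accounts for 8 of the 13 estimated joined rows, so it gets a range to itself while the remaining keys share the second unit. Partitioning the inputs evenly by row count before the join would not reveal that imbalance.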
Opening claim text (preview).
What is claimed is:

1. A computer-implemented method for minimizing join operation processing time within a database system based on estimated joined table spread of the database system, the computer-implemented method comprising the steps of:

estimating, by one or more processors, value distribution of data in a joined table, wherein the joined table is a result of a join operation between two instances of tables of a database system and the estimated value distribution data is based on a count of a number of rows with a particular value of the two instances of the tables of the join operation and calculation of a number of rows of the joined table;

determining, by the one or more processors, boundaries for partitioning at least one range of attributes of the estimated value distribution of data, wherein the boundaries for partitioning is based on a density distribution function of the joined table, and wherein the density distribution function represents a spread of data within a data column of the joined table, the joined table being a result of a join operation between two instances of tables of the database system; and

determining, by the one or more processors, at least one assignment of the determined partition of the at least one range of attributes to processing units of the database system;

providing, by the one or more processors, parameterized functions that represent the estimated value distribution;

determining the total number of rows in the join table;

determining, by the one or more processors, a set of estimates for the number of rows in the join table in a first set of consequent common attribute value ranges based on the parameterized functions;

determining, by the one or more processors, a parametrized function representing the value distribution of the common attribute in the join table by solving a first equation system involving definite integrals of the parametrized function over the first set of consequent attribute ranges and the set of estimates for numbers of rows; and

determining, by the one or more processors, a second set of consequent common attribute value ranges by solving a second equation system involving definite integrals of the parametrized function over the second set of consequent common attribute value ranges, each definite integral equaling the total number of rows in the join table divided by the number of processing nodes.

2. The computer-implemented method according to claim 1, wherein the estimated value distribution of data is a density distribution function based on values of columns of the joined table of the database system.

3. The computer-implemented method according to claim 2, further includes: splitting, by the one or more processors, the joined table, into at least one partitioning range of attributes of the estimated value distribution data.

4. The computer-implemented method according to claim 1, wherein the assignment of the determined partition of the at least one range of attributes is based on defined hash functions of data distribution within the processing units.

5. The computer-implemented method according to claim 4, wherein the defined hash functions are adaptive to map value of columns within the processing units.

6. The computer-implemented method according to claim 5, wherein the defined hash functions decrease intermediate processing impact of query execution time of the processing units of a database system.
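The two equation systems at the end of claim 1 can be made concrete with a deliberately simple sketch. The claim does not name a function family, so the code below assumes a piecewise-constant density (a histogram), which is only one possible choice. Under that assumption the first system reduces to one equation per range (height times width equals the estimated row count), and the second system is solved by inverting the cumulative integral. The names `fit_piecewise_density` and `equal_mass_boundaries` are hypothetical.

```python
def fit_piecewise_density(edges, estimates):
    """First system: choose heights so that the integral of the density
    over [edges[i], edges[i+1]] equals estimates[i]."""
    return [est / (hi - lo) for lo, hi, est in zip(edges, edges[1:], estimates)]


def equal_mass_boundaries(edges, heights, n_nodes):
    """Second system: find boundaries such that the density integrates to
    total / n_nodes between consecutive boundaries."""
    total = sum(h * (hi - lo) for lo, hi, h in zip(edges, edges[1:], heights))
    if total <= 0:
        raise ValueError("density must integrate to a positive row count")
    target = total / n_nodes
    boundaries = []
    accumulated = 0.0  # integral from edges[0] up to the start of the current range
    k = 1              # index of the next boundary to place
    for lo, hi, h in zip(edges, edges[1:], heights):
        mass = h * (hi - lo)
        while k < n_nodes and accumulated + mass >= k * target:
            # Solve accumulated + h * (x - lo) == k * target for x.
            boundaries.append(lo + (k * target - accumulated) / h)
            k += 1
        accumulated += mass
    return boundaries


# Illustrative numbers only: three attribute ranges with estimated joined row counts.
edges = [0, 10, 20, 30]
estimates = [60, 30, 10]
heights = fit_piecewise_density(edges, estimates)
node_boundaries = equal_mass_boundaries(edges, heights, n_nodes=4)
```

With these illustrative numbers most of the joined rows fall in the first attribute range, so two of the three boundaries land inside it and the remaining sparse ranges are merged into the last node's share. A smoother function family would change how each system is solved but not the structure of the two steps.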
Join order optimisation · CPC title
Selectivity estimation or determination · CPC title
Intermediate data storage techniques for performance improvement · CPC title
Join operations · CPC title
Hash tables · CPC title