Partitioned join with dense inner table representation
US-9953057-B2 · Apr 24, 2018 · US
US2018165331A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018165331-A1 |
| Application number | US-201615374158-A |
| Country | US |
| Kind code | A1 |
| Filing date | Dec 9, 2016 |
| Priority date | Dec 9, 2016 |
| Publication date | Jun 14, 2018 |
| Grant date | — |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A massively parallel processing shared nothing relational database management system includes a plurality of storages assigned to a plurality of compute nodes. The system comprises a non-transitory memory having instructions and one or more processors in communication with the memory. The one or more processors execute the instructions to store a set of data in a first set of storages in the plurality of storages. The first set of data is hashed into a repartitioned set of data. The first set of storages is reassigned to a second set of compute nodes in the plurality of compute nodes. The repartitioned set of data is distributed to the second set of compute nodes and a database operation is performed on the repartitioned set of data by the second set of compute nodes.
Opening claim text (preview).
What is claimed is: 1 . A massively parallel processing shared nothing relational database management system comprising: a plurality of storages assigned to a plurality of compute nodes; a non-transitory memory storing instructions; and one or more processors in communication with the non-transitory memory, wherein the one or more processors execute the instructions to: store a set of data in a first set of storages in the plurality of storage, the first set of storages assigned to a first set of compute nodes in the plurality of compute nodes; repartition the set of data by hashing into a repartitioned set of data; reassign the first set of storages to a second set of compute nodes in the plurality of compute nodes; distribute the repartitioned set of data to the second set of compute nodes; and perform a database operation on the repartitioned set of data by the second set of compute nodes. 2 . The system of claim 1 , wherein the repartition the set of data includes forming smaller hash buckets of the set of data by hashing. 3 . The system of claim 1 , wherein the repartition is omitted when a repartition key is the same key used to partition the set of data. 4 . The system of claim 1 , wherein the reassign includes form a network connections between the first set of storages and the second set of compute nodes, and wherein the distribute includes distribute the repartitioned set of data to the second set of compute nodes by way of the network connections. 5 . The system of claim 4 , wherein the first set of storages and the first set of compute nodes form a shared nothing node in the system, and wherein the database operation includes at least one of an inner join, scan and redistribute. 6 . The system of claim 5 , wherein the first set of storages include at least an integrated circuit memory to store the set of data, and wherein the first set of compute nodes include at least an integrated circuit processor coupled to the integrated circuit memory by a signal path to transfer the set of data. 7 . The system of claim 1 , further comprising the one or more processors executing the instructions to: obtain a plurality of logic plans that include the database operation on the set of data stored in the first set of storages; determining a cost of redistributing the set of data to at least another compute node for each logic plan in the plurality of logic plans; determining a cost reduction from inter-partition parallelism for each logic plan in the plurality of logic plans; and selecting a logic plan from the plurality of logic plans based on the cost of redistributing the set of data and the cost reduction from inter-partition parallelism. 8 . A computer-implemented method for accessing data, the method comprising: obtaining a plurality of logic plans to respond to a query; determining a cost of redistributing a set of data stored in a storage assigned to a compute node to at least another compute node for each logic plan in the plurality of logic plans; determining a cost reduction from inter-partition parallelism for each logic plan in the plurality of logic plans; and selecting a logic plan from the plurality of logic plans based on the cost of redistributing the set of data and the cost reduction from inter-partition parallelism. 9 . The computer-implemented method of claim 8 , wherein the logic plan includes at least one database operation on the set of data. 10 . The computer-implemented method of claim 9 , wherein the at least one database operation on the set of data includes at least one of a join, hash aggregation and redistribution. 11 . The computer-implemented method of claim 8 , wherein the determining the cost of redistributing the set of data comprises: calculating a number of tuples to be processed in the set of data; calculating a width of a tuple in the set of data; calculating a hashing cost factor for the set of data; calculating an average data transfer speed through a network coupled between the storage and at least the another compute node; calculating a degree of inter-partition parallelism with a skew factor; and calculating the cost of redistributing the set of data in response to at least the number of tuples to be processed, the width of the tuple, the hashing cost factor, the average data transfer speed and the degree of inter-partition parallelism with the skew factor. 12 . The computer-implemented method of claim 11 , wherein the skew factor represents a data skew associated with the set of data. 13 . The computer-implemented method of claim 8 , wherein the determining the cost reduction from inter-partition parallelism comprises: calculating an operator cost of a hash join on the data set; calculating an operator cost of a hash aggregate on the data set; calculating a hashing cost factor for the set of data; calculating a degree of inter-partition parallelism with a skew factor; and calculating the cost reduction from inter-parallelism in response to the operator cost of the hash join or the operator cost of the hash aggregate and the degree of inter-partition parallelism with the skew factor. 14 . The computer-implemented method of claim 13 , wherein the skew factor is computed based on a percentage of a most common value in the data set. 15 . The computer-implemented method of claim 8 , wherein the computer-implemented method is performed at least partially by a massively parallel processing shared nothing relational database management system. 16 . A non-transitory computer-readable medium storing computer instructions, that when executed by one or more processors, cause one or more processors to perform the steps of: store a set of data in a first set of storages in a plurality of storages, the first set of storages assigned to a first set of compute nodes in a plurality of compute nodes; obtain a plurality of logic plans to respond a query that accesses the set of data; determine a cost of redistributing the set of data stored in the first set of storages to a second set of compute nodes for each logic plan in the plurality of logic plans; determine a cost reduction from inter-partition parallelism for each logic plan in the plurality of logic plans; select a logic plan from the plurality of logic plans based on the cost of redistributing the set of data and the cost reduction from inter-partition parallelism; repartition the set of data by hashing into a repartitioned set of data; reassign the first set of storages to the second set of compute nodes; distribute the repartitioned set of data to the second set of compute nodes; and perform a database operation on the repartitioned set of data by the second set of compute nodes to provide an answer to the query. 17 . The non-transitory computer-readable medium of claim 16 , wherein the plurality of storages and plurality of compute nodes are included in a massively parallel processing shared nothing relational database management system. 18 . The non-transitory computer-readable medium of claim 16 , wherein determine the cost of redistributing the set of data comprises: calculate a number of tuples to be processed in the set of data; calculate a width of a tuple in the set of data; calculate a hashing cost factor for the set of data; calculate an average data transfer speed through a network coupled between at the first set of storages and the second set of compute nodes; calculate a degree of inter-partition parallelism with a skew factor; and calculate the cost of redistribut
Selectivity estimation or determination · CPC title
of parallel queries · CPC title
Join order optimisation · CPC title
Unary operations; Data partitioning operations · CPC title
Hash tables · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.