Avoidance of intermediate data skew in a massive parallel processing environment
US-9569494-B2 · Feb 14, 2017 · US
US11853323B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11853323-B2 |
| Application number | US-202318118595-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 7, 2023 |
| Priority date | Feb 19, 2014 |
| Publication date | Dec 26, 2023 |
| Grant date | Dec 26, 2023 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A method, apparatus, and system for join operations of a plurality of relations that are distributed over a plurality of storage locations over a network of computing components.
Opening claim text (preview).
What is claimed is: 1. A method, comprising: receiving a relational join query comprising a join operation, an indication of a first relation and a second relation to be joined, and a predicate, wherein the first relation and the second relation are partitioned over processing nodes of a cluster; determining, by a processing device prior to starting distribution of the first or second relation to a plurality of probe operators of a probe operation, whether to distribute the first relation to the probe operation using a broadcast join or to distribute the second relation to the probe operation using a re-partitioning join, wherein the determining is based at least in part on an estimated size of the second relation and a cost metric; based on the determining, distributing the first relation or the second relation to the processing nodes of the cluster associated with the probe operation; and performing, at the processing nodes of the cluster associated with the probe operation, the relational join query using at least one of a hash join, a sort-merge join, or a nested-loop join to generate a third relation that contains all combinations of tuples in the first relation and the second relation that satisfy the predicate. 2. The method of claim 1 , wherein the determining is further based on an actual size of the first relation. 3. The method of claim 2 , wherein the actual size of the first relation is determined during execution of the join operation. 4. The method of claim 2 , further comprising building a hash index for the first relation, wherein the actual size of the first relation is determined based on the building of the hash index. 5. The method of claim 2 , wherein the first relation comprises tuples, the method further comprising: forwarding the tuples to a plurality of build operators in accordance with a partition move; and determining a total number of tuples processed for the partition move; wherein the actual size of the first relation is based on the total number of tuples processed for the partition move. 6. The method of claim 1 , further comprising, upon determining to distribute the first relation to the probe operation using the broadcast join: setting links between the plurality of build operators and the plurality of probe operators to broadcast links; setting links between the second relation and the plurality of probe operators to synchronous links; and sending the first relation through the broadcast links so that each partition of the first relation is broadcasted to every one of the plurality of probe operators. 7. The method of claim 1 , further comprising, upon determining to distribute the second relation to the probe operation using the re-partitioning join: setting links between the plurality of build operators and the plurality of probe operators to synchronous links; setting links between the second relation and the plurality of probe operators to partition links; and sending the second relation through the partition links to the plurality of probe operators. 8. A system, comprising: a memory to store a plurality of relations; and a processing device operatively coupled with the memory, the processing device to: receive a relational join query comprising a join operation, an indication of a first relation and a second relation to be joined, and a predicate, wherein the first relation and the second relation are partitioned over processing nodes of a cluster; determine, prior to starting distribution of the first or second relation to a plurality of probe operators of a probe operation, whether to distribute the first relation to the probe operation using a broadcast join or to distribute the second relation to the probe operation using a re-partitioning join, wherein the determination is based at least in part on an estimated size of the second relation and a cost metric; based on the determination, distribute the first relation or the second relation to the processing nodes of the cluster associated with the probe operation; and perform, at the processing nodes of the cluster associated with the probe operation, the relational join query using at least one of a hash join, a sort-merge join, or a nested-loop join to generate a third relation that contains all combinations of tuples in the first relation and the second relation that satisfy the predicate. 9. The system of claim 8 , wherein the determination is further based on an actual size of the first relation. 10. The system of claim 9 , wherein the actual size of the first relation is determined during execution of the join operation. 11. The system of claim 9 , wherein the processing device is further to build a hash index for the first relation, wherein the actual size of the first relation is determined based on the build of the hash index. 12. The system of claim 9 , wherein the first relation comprises tuples, and the processing device is further to: forward the tuples to the plurality of build operators in accordance with a partition move; and determine a total number of tuples processed for the partition move; wherein the actual size of the first relation is based on the total number of tuples processed for the partition move. 13. The system of claim 8 , wherein if the processing device determines to distribute the first relation to the probe operation using the broadcast join, the processing device is further to: set links between the plurality of build operators and the plurality of probe operators to broadcast links; set links between the second relation and the plurality of probe operators to synchronous links; and send the first relation through the broadcast links so that each partition of the first relation is broadcasted to every one of the plurality of probe operators. 14. The system of claim 8 , wherein if the processing device determines to distribute the second relation to the probe operation using the re-partitioning join, the processing device is further to: set links between the plurality of build operators and the plurality of probe operators to synchronous links; set links between the second relation and the plurality of probe operators to partition links; and send the second relation through the partition links to the plurality of probe operators. 15. A non-transitory computer readable medium having instructions stored thereon that, when executed by a processing device, cause the processing device to: receive a relational join query comprising a join operation, an indication of a first relation and a second relation to be joined, and a predicate, wherein the first relation and the second relation are partitioned over processing nodes of a cluster; determine, prior to starting distribution of the first or second relation to a plurality of probe operators of a probe operation, whether to distribute the first relation to the probe operation using a broadcast join or to distribute the second relation to the probe operation using a re-partitioning join, wherein the determination is based at least in part on an estimated size of the second relation and a cost metric; based on the determination, distribute the first relation or the second relation to the processing nodes of the cluster associated with the probe operation; and perform, at the processing nodes of the cluster associated with the probe operation, the relational join query using at least one of a hash join, a sort-merge join, or a nested-loop join to generate a third relation that contains all combinations of tuples in the first relation and the second relation that satisfy the predicate.
Asynchronous replication or reconciliation · CPC title
Intra-oral devices · CPC title
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
the resource being the memory · CPC title
considering hardware capabilities · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.