Efficient computations and network communications in a distributed computing environment
US-2018336075-A1 · Nov 22, 2018 · US
US11455302B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11455302-B2 |
| Application number | US-202017007801-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 31, 2020 |
| Priority date | May 15, 2020 |
| Publication date | Sep 27, 2022 |
| Grant date | Sep 27, 2022 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Methods for distributed histogram computation in a framework utilizing data stream sketches and samples are performed by systems and devices. Distributions of large data sets are scanned once and processed by a computing pool, without sorting, to generate local sketches and value samples of each distribution. The local sketches and samples are utilized to construct local histograms on which cardinality estimates are obtained for query plan generation of distributed queries against distributions. Local statistics of distributions are also merged and consolidated to construct a global histogram representative of the entire data set. The global histogram is utilized to determine a cardinality estimation for query plan generation of incoming queries against the entire data set. The addition of new data to a data set or distribution involves a scan of the new data from which new statistics are generated and then merged with existing statistics for a new global histogram.
Opening claim text (preview).
What is claimed is: 1. A system comprising: a control node that is hardware-based and that is configured to: generate a statistical query task directed acyclic graph that is associated with a data set; and assign a plurality of distributed queries, respectively associated with distributions of the data set and generated by dividing the statistical query task directed acyclic graph, to a plurality of processing nodes that are hardware-based; the plurality of processing nodes, one or more of which, at least partially in parallel for each distribution respectively, are configured to: generate by each of the plurality of processing nodes, data structures that respectively store first data associated with respective frequencies of values, second data associated with a number of distinct values, and third data associated with a random sampling of the values; construct, by each of the plurality of processing nodes, a histogram respectively based on the first data, the second data, and the third data of distributions processed by said each of the plurality of processing nodes; merge, by a first processing node of the plurality of processing nodes, the first data generated by the plurality of processing nodes, and the second data generated by the plurality of processing nodes; and merge, by a second processing node of the plurality of processing nodes, the third data generated by the plurality of processing nodes; and a third processing node of the plurality of processing nodes configured to construct a global histogram of the data set based on the merged first data, the merged second data, and the merged third data. 2. The system of claim 1 , wherein the plurality of processing nodes, one or more of which, at least partially in parallel on each distribution respectively, are configured to: transmit, by each of the plurality of processing nodes to each other processing node and subsequent to said constructing, a count value indicative of a number of rows in respective distributions processed by each of the plurality of processing nodes. 3. The system of claim 1 , wherein the first processing node, the second processing node, and the third processing node are a same processing node. 4. The system of claim 1 , wherein the control node, subsequent to said constructing the global histogram, is configured to: generate a query plan, having a cardinality estimate based at least on the global histogram, of an incoming query directed to the data set; generate a plurality of distributed queries respectively associated with distributions of the data set based at least on the incoming query and the estimated cardinality; assign the plurality of distributed queries to one or more of the plurality of processing nodes; and return a query result based on performance of the plurality of distributed queries. 5. The system of claim 4 , wherein the control node is configured to: receive the incoming query prior to said constructing the global histogram. 6. The system of claim 1 , wherein constructing the global histogram is performed without sorting the data set and is not based on sorting the data set. 7. The system of claim 1 , wherein the plurality of processing nodes, one or more of which, at least partially in parallel for each distribution respectively, and to construct the histogram, are configured to: determine a subset of values in the distribution based at least on the first data, each value in the subset having a respective frequency that meets or exceeds a threshold value; and construct the histogram as including a separate bin in the histogram for each value in the subset and as including additional bins having equi-depth partitioning in the histogram by allocating, for each other value in the distribution that is not included in the subset, a value to an additional bin via quantile distribution. 8. A method performed by a computing system, the method comprising: performing by a control node: generating a statistical query task directed acyclic graph associated with a data set; dividing the statistical query task directed acyclic graph into a plurality of distributed queries respectively associated with distributions of the data set; and assigning the plurality of distributed queries to a plurality of processing nodes; performing, at least partially in parallel, by one or more of the plurality of processing nodes on each distribution: generating by each of the plurality of processing nodes, data structures that respectively store first data associated with respective frequencies of values, second data associated with a number of distinct values, and third data associated with a random sampling of the values; constructing, by each of the plurality of processing nodes, a histogram respectively based on the first data, the second data, and the third data of distributions processed by each of the plurality of processing nodes; transmitting, by each of the plurality of processing nodes to each other processing node, a count value indicative of a number of rows in respective distributions processed by each of the plurality of processing nodes; merging, by a first processing node of the plurality of processing nodes, data structures having the first data generated by the plurality of processing nodes, and data structures having the second data generated by the plurality of processing nodes; and merging, by a second processing node of the plurality of processing nodes, data structures having the third data generated by the plurality of processing nodes; and constructing, by a third processing node of the plurality of processing nodes, a global histogram of the data set based on the merged first data, the merged second data, and the merged third data. 9. The method of claim 8 , wherein the first processing node, the second processing node, and the third processing node are a same processing node. 10. The method of claim 8 , further comprising: subsequent to said constructing the global histogram, performing by the control node: generating a query plan, having a cardinality estimate based at least on the global histogram, of an incoming query directed to the data set; and generating a plurality of distributed queries respectively associated with a distribution of the data set based at least on the incoming query and the estimated cardinality. 11. The method of claim 10 , further comprising: subsequent to said generating the plurality of distributed queries respectively associated with the distribution of the data set, performing by the control node: assigning the plurality of distributed queries to one or more of the plurality of processing nodes; and returning a query result based on performance of the plurality of distributed queries. 12. The method of claim 10 , further comprising: receiving the incoming query prior to said constructing the global histogram. 13. The method of claim 8 , wherein constructing the global histogram is performed without sorting the data set and is not based on sorting the data set. 14. A method performed by a computing system, the method comprising: performing a construction of a histogram by: determining information associated with the data set based on a scan of a data set; generating a first data structure, based on the information, that stores first data associated with respective frequencies of values in the data set; generating a second data structure, based on the information, that stores second data associated with a number of distinct values in the data set; generating a third data structure, based on the information, that stores third data associated with a rando
Selectivity estimation or determination · CPC title
Grouping and aggregation · CPC title
Data stream processing; Continuous queries · CPC title
of parallel queries · CPC title
Approximate or statistical queries · CPC title
Related publications grouped by family.
Answers are generated from the same data shown on this page.