Distributed histogram computation framework using data stream sketches and samples

US11455302B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-11455302-B2
Application numberUS-202017007801-A
CountryUS
Kind codeB2
Filing dateAug 31, 2020
Priority dateMay 15, 2020
Publication dateSep 27, 2022
Grant dateSep 27, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods for distributed histogram computation in a framework utilizing data stream sketches and samples are performed by systems and devices. Distributions of large data sets are scanned once and processed by a computing pool, without sorting, to generate local sketches and value samples of each distribution. The local sketches and samples are utilized to construct local histograms on which cardinality estimates are obtained for query plan generation of distributed queries against distributions. Local statistics of distributions are also merged and consolidated to construct a global histogram representative of the entire data set. The global histogram is utilized to determine a cardinality estimation for query plan generation of incoming queries against the entire data set. The addition of new data to a data set or distribution involves a scan of the new data from which new statistics are generated and then merged with existing statistics for a new global histogram.

First claim

Opening claim text (preview).

What is claimed is: 1. A system comprising: a control node that is hardware-based and that is configured to: generate a statistical query task directed acyclic graph that is associated with a data set; and assign a plurality of distributed queries, respectively associated with distributions of the data set and generated by dividing the statistical query task directed acyclic graph, to a plurality of processing nodes that are hardware-based; the plurality of processing nodes, one or more of which, at least partially in parallel for each distribution respectively, are configured to: generate by each of the plurality of processing nodes, data structures that respectively store first data associated with respective frequencies of values, second data associated with a number of distinct values, and third data associated with a random sampling of the values; construct, by each of the plurality of processing nodes, a histogram respectively based on the first data, the second data, and the third data of distributions processed by said each of the plurality of processing nodes; merge, by a first processing node of the plurality of processing nodes, the first data generated by the plurality of processing nodes, and the second data generated by the plurality of processing nodes; and merge, by a second processing node of the plurality of processing nodes, the third data generated by the plurality of processing nodes; and a third processing node of the plurality of processing nodes configured to construct a global histogram of the data set based on the merged first data, the merged second data, and the merged third data. 2. The system of claim 1 , wherein the plurality of processing nodes, one or more of which, at least partially in parallel on each distribution respectively, are configured to: transmit, by each of the plurality of processing nodes to each other processing node and subsequent to said constructing, a count value indicative of a number of rows in respective distributions processed by each of the plurality of processing nodes. 3. The system of claim 1 , wherein the first processing node, the second processing node, and the third processing node are a same processing node. 4. The system of claim 1 , wherein the control node, subsequent to said constructing the global histogram, is configured to: generate a query plan, having a cardinality estimate based at least on the global histogram, of an incoming query directed to the data set; generate a plurality of distributed queries respectively associated with distributions of the data set based at least on the incoming query and the estimated cardinality; assign the plurality of distributed queries to one or more of the plurality of processing nodes; and return a query result based on performance of the plurality of distributed queries. 5. The system of claim 4 , wherein the control node is configured to: receive the incoming query prior to said constructing the global histogram. 6. The system of claim 1 , wherein constructing the global histogram is performed without sorting the data set and is not based on sorting the data set. 7. The system of claim 1 , wherein the plurality of processing nodes, one or more of which, at least partially in parallel for each distribution respectively, and to construct the histogram, are configured to: determine a subset of values in the distribution based at least on the first data, each value in the subset having a respective frequency that meets or exceeds a threshold value; and construct the histogram as including a separate bin in the histogram for each value in the subset and as including additional bins having equi-depth partitioning in the histogram by allocating, for each other value in the distribution that is not included in the subset, a value to an additional bin via quantile distribution. 8. A method performed by a computing system, the method comprising: performing by a control node: generating a statistical query task directed acyclic graph associated with a data set; dividing the statistical query task directed acyclic graph into a plurality of distributed queries respectively associated with distributions of the data set; and assigning the plurality of distributed queries to a plurality of processing nodes; performing, at least partially in parallel, by one or more of the plurality of processing nodes on each distribution: generating by each of the plurality of processing nodes, data structures that respectively store first data associated with respective frequencies of values, second data associated with a number of distinct values, and third data associated with a random sampling of the values; constructing, by each of the plurality of processing nodes, a histogram respectively based on the first data, the second data, and the third data of distributions processed by each of the plurality of processing nodes; transmitting, by each of the plurality of processing nodes to each other processing node, a count value indicative of a number of rows in respective distributions processed by each of the plurality of processing nodes; merging, by a first processing node of the plurality of processing nodes, data structures having the first data generated by the plurality of processing nodes, and data structures having the second data generated by the plurality of processing nodes; and merging, by a second processing node of the plurality of processing nodes, data structures having the third data generated by the plurality of processing nodes; and constructing, by a third processing node of the plurality of processing nodes, a global histogram of the data set based on the merged first data, the merged second data, and the merged third data. 9. The method of claim 8 , wherein the first processing node, the second processing node, and the third processing node are a same processing node. 10. The method of claim 8 , further comprising: subsequent to said constructing the global histogram, performing by the control node: generating a query plan, having a cardinality estimate based at least on the global histogram, of an incoming query directed to the data set; and generating a plurality of distributed queries respectively associated with a distribution of the data set based at least on the incoming query and the estimated cardinality. 11. The method of claim 10 , further comprising: subsequent to said generating the plurality of distributed queries respectively associated with the distribution of the data set, performing by the control node: assigning the plurality of distributed queries to one or more of the plurality of processing nodes; and returning a query result based on performance of the plurality of distributed queries. 12. The method of claim 10 , further comprising: receiving the incoming query prior to said constructing the global histogram. 13. The method of claim 8 , wherein constructing the global histogram is performed without sorting the data set and is not based on sorting the data set. 14. A method performed by a computing system, the method comprising: performing a construction of a histogram by: determining information associated with the data set based on a scan of a data set; generating a first data structure, based on the information, that stores first data associated with respective frequencies of values in the data set; generating a second data structure, based on the information, that stores second data associated with a number of distinct values in the data set; generating a third data structure, based on the information, that stores third data associated with a rando

Assignees

Inventors

Classifications

  • Selectivity estimation or determination · CPC title

  • Grouping and aggregation · CPC title

  • Data stream processing; Continuous queries · CPC title

  • of parallel queries · CPC title

  • Approximate or statistical queries · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11455302B2 cover?
Methods for distributed histogram computation in a framework utilizing data stream sketches and samples are performed by systems and devices. Distributions of large data sets are scanned once and processed by a computing pool, without sorting, to generate local sketches and value samples of each distribution. The local sketches and samples are utilized to construct local histograms on which car…
Who is the assignee on this patent?
Microsoft Technology Licensing Llc
What technology area does this patent fall under?
Primary CPC classification G06F16/24545. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Sep 27 2022 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).