Framework for managing dynamic configurations of data intake and query systems deployed in remote computing environments
US-12182151-B1 · Dec 31, 2024 · US
US2018239792A1 · US · A1
| Field | Value |
|---|---|
| Publication number | US-2018239792-A1 |
| Application number | US-201815900730-A |
| Country | US |
| Kind code | A1 |
| Filing date | Feb 20, 2018 |
| Priority date | Feb 17, 2017 |
| Publication date | Aug 23, 2018 |
| Grant date | — |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
A process creates a data sketch for a data set with many rows. A user selects data fields from the data source. The process allocates storage for N bins, where each bin has storage space for a key value and an associated counter value (which is initialized to zero). The process sequentially accesses the rows from the data source (e.g., as a stream). For each row, the process computes a respective key value using data values for the selected data fields. When the respective key value matches a key value for a respective bin, the process increments the counter value for the respective bin. Otherwise, the process identifies a respective bin with a smallest counter value c. The process increments the counter value of the respective bin, and with probability 1/(1+c), replaces the key value of the respective bin with the respective key value.
Opening claim text (preview).
What is claimed is: 1 . A method of data sketching, comprising: at a computer system having one or more processors and memory storing one or more programs configured for execution by the one or more processors: selecting a plurality of data fields from a data source having a plurality of data rows; determining a number N of bins for aggregating rows of the data source; allocating storage in the memory for N bins, each bin comprising storage space for a key value and an associated counter value, with the associated counter value initialized to zero; sequentially processing the plurality of rows in the data source, including, for each row: computing a respective key value for the respective row using data values for the selected plurality of data fields in the respective row; when the respective key value matches a key value for a respective bin, incrementing the counter value for the respective bin; when the respective key value does not match any bin key value and there exists one or more bins having a counter value of zero: selecting a respective bin having a counter value of zero; setting the key value of the respective bin to be the respective key value; and setting the counter value of the respective bin to be one; when the respective key value does not match any bin key value and all of the N bins have non-zero counter values: identifying a respective bin with a smallest counter value c; incrementing the counter value of the respective bin; and with probability 1/(1+c), replacing the key value of the respective bin with the respective key value. 2 . The method of claim 1 , wherein the data source is streaming. 3 . The method of claim 1 , further comprising: receiving a query from a client to identify M frequent items from the N bins, wherein M is a positive integer less than N; in response to receiving the query: selecting, from the N bins, M bins having the largest counter values; and returning the key values of the selected Mbins to the client. 4 . The method of claim 1 , further comprising: receiving a query from a client to estimate a count of rows from the data source satisfying a user-specified filter condition; in response to receiving the query: determining a subset of the bins whose key values satisfy the user-specified filter condition; computing a sum of the counter values for the bins in the determined subset; and returning, to the client, the sum as an estimate of the count of rows from the data source satisfying the filter condition. 5 . The method of claim 1 , wherein determining the number N of bins for aggregating rows of the data source comprises: receiving user specification of an error limit c for estimating subset sums for the data set; receiving a user estimate P of a fraction of the rows in the data set that will satisfy typical subset sum filters; and selecting the number N to be a positive integer satisfying N ≥ 1 P ɛ 2 . 6 . The method of claim 1 , wherein the selected data fields are f 1 , f 2 , . . . , f k , and computing the respective key value for the respective row comprises forming a concatenation f 1 (r)+f 2 (r)+ . . . +f k (r), where f i (r) specifies the f i data field value for the row r for i=1, 2, . . . , k, and f i (r) casts the corresponding data field value as a string when the data type of the data field f i is not a string. 7 . The method of claim 1 , wherein the selected data fields are f 1 , f 2 , . . . , f k , and computing the respective key value for the respective row comprises forming a k-tuple (f 1 (r), f 2 (r), . . . , f k (r)), where f i (r) specifies the f i data field value for the row r for i=1, 2, . . . , k. 8 . The method of claim 1 , wherein each bin further comprises storage space for a hash of the bin key value, and matching a respective key value to a key value for a respective bin comprises: computing a respective hash value of the respective key value; and comparing the respective hash value to bin hash values. 9 . The method of claim 1 , further comprising partitioning data values for a first data field, of the selected plurality of data fields, into a plurality of distinct partitions, each partition consisting of a respective list or range of data values; wherein matching the respective key value to a respective bin value comprises: identifying the data values of the first data field in the respective key value and in the respective bin key value, and determining that the identified data values are in a same partition. 10 . A computer system having one or more computing devices, each computing device having one or more processors and memory, wherein the memory stores one or more programs configured for execution by the one or more processors, and the one or more programs comprise instructions for: selecting a plurality of data fields from a data source having a plurality of data rows; determining a number N of bins for aggregating rows of the data source; allocating storage in the memory for Nbins, each bin comprising storage space for a key value and an associated counter value, with the associated counter value initialized to zero; sequentially processing the plurality of rows in the data source, including, for each row: computing a respective key value for the respective row using data values for the selected plurality of data fields in the respective row; when the respective key value matches a key value for a respective bin, incrementing the counter value for the respective bin; when the respective key value does not match any bin key value and there exists one or more bins having a counter value of zero: selecting a respective bin having a counter value of zero; setting the key value of the respective bin to be the respective key value; and setting the counter value of the respective bin to be one; when the respective key value does not match any bin key value and all of the N bins have non-zero counter values: identifying a respective bin with a smallest counter value c; incrementing the counter value of the respective bin; and with probability 1/(1+c), replacing the key value of the respective bin with the respective key value. 11 . The computer system of claim 10 , wherein the data source is streaming. 12 . The computer system of claim 10 , wherein the one or more programs further comprise instructions for: receiving a query from a client to identify M frequent items from the N bins, wherein M is a positive integer less than N; in response to receiving the query: selecting, from the N bins, M bins having the largest counter values; and returning the key values of the selected M bins to the client. 13 . The computer system of claim 10 , wherein the one or more programs further comprise instructions for: receiving a query from a client to estimate a count of rows from the data source satisfying a user-specified filter condition; in response to receiving the query: determining a subset of the bins whose key values satisfy the user-specified filter condition; computing a sum of the counter values for the bins in the determined subset; and returning, to the client, the sum as
Query optimisation · CPC title
Management thereof · CPC title
Query execution · CPC title
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Physics · mapped topic
Related publications grouped by family.
Answers are generated from the same data shown on this page.