Unbiased Space-Saving Data Sketches for Estimating Disaggregated Subset Sums and Estimating Frequent Items

US2018239792A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2018239792-A1
Application numberUS-201815900730-A
CountryUS
Kind codeA1
Filing dateFeb 20, 2018
Priority dateFeb 17, 2017
Publication dateAug 23, 2018
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A process creates a data sketch for a data set with many rows. A user selects data fields from the data source. The process allocates storage for N bins, where each bin has storage space for a key value and an associated counter value (which is initialized to zero). The process sequentially accesses the rows from the data source (e.g., as a stream). For each row, the process computes a respective key value using data values for the selected data fields. When the respective key value matches a key value for a respective bin, the process increments the counter value for the respective bin. Otherwise, the process identifies a respective bin with a smallest counter value c. The process increments the counter value of the respective bin, and with probability 1/(1+c), replaces the key value of the respective bin with the respective key value.

First claim

Opening claim text (preview).

What is claimed is: 1 . A method of data sketching, comprising: at a computer system having one or more processors and memory storing one or more programs configured for execution by the one or more processors: selecting a plurality of data fields from a data source having a plurality of data rows; determining a number N of bins for aggregating rows of the data source; allocating storage in the memory for N bins, each bin comprising storage space for a key value and an associated counter value, with the associated counter value initialized to zero; sequentially processing the plurality of rows in the data source, including, for each row: computing a respective key value for the respective row using data values for the selected plurality of data fields in the respective row; when the respective key value matches a key value for a respective bin, incrementing the counter value for the respective bin; when the respective key value does not match any bin key value and there exists one or more bins having a counter value of zero: selecting a respective bin having a counter value of zero; setting the key value of the respective bin to be the respective key value; and setting the counter value of the respective bin to be one; when the respective key value does not match any bin key value and all of the N bins have non-zero counter values: identifying a respective bin with a smallest counter value c; incrementing the counter value of the respective bin; and with probability 1/(1+c), replacing the key value of the respective bin with the respective key value. 2 . The method of claim 1 , wherein the data source is streaming. 3 . The method of claim 1 , further comprising: receiving a query from a client to identify M frequent items from the N bins, wherein M is a positive integer less than N; in response to receiving the query: selecting, from the N bins, M bins having the largest counter values; and returning the key values of the selected Mbins to the client. 4 . The method of claim 1 , further comprising: receiving a query from a client to estimate a count of rows from the data source satisfying a user-specified filter condition; in response to receiving the query: determining a subset of the bins whose key values satisfy the user-specified filter condition; computing a sum of the counter values for the bins in the determined subset; and returning, to the client, the sum as an estimate of the count of rows from the data source satisfying the filter condition. 5 . The method of claim 1 , wherein determining the number N of bins for aggregating rows of the data source comprises: receiving user specification of an error limit c for estimating subset sums for the data set; receiving a user estimate P of a fraction of the rows in the data set that will satisfy typical subset sum filters; and selecting the number N to be a positive integer satisfying N ≥ 1 P   ɛ 2 . 6 . The method of claim 1 , wherein the selected data fields are f 1 , f 2 , . . . , f k , and computing the respective key value for the respective row comprises forming a concatenation f 1 (r)+f 2 (r)+ . . . +f k (r), where f i (r) specifies the f i data field value for the row r for i=1, 2, . . . , k, and f i (r) casts the corresponding data field value as a string when the data type of the data field f i is not a string. 7 . The method of claim 1 , wherein the selected data fields are f 1 , f 2 , . . . , f k , and computing the respective key value for the respective row comprises forming a k-tuple (f 1 (r), f 2 (r), . . . , f k (r)), where f i (r) specifies the f i data field value for the row r for i=1, 2, . . . , k. 8 . The method of claim 1 , wherein each bin further comprises storage space for a hash of the bin key value, and matching a respective key value to a key value for a respective bin comprises: computing a respective hash value of the respective key value; and comparing the respective hash value to bin hash values. 9 . The method of claim 1 , further comprising partitioning data values for a first data field, of the selected plurality of data fields, into a plurality of distinct partitions, each partition consisting of a respective list or range of data values; wherein matching the respective key value to a respective bin value comprises: identifying the data values of the first data field in the respective key value and in the respective bin key value, and determining that the identified data values are in a same partition. 10 . A computer system having one or more computing devices, each computing device having one or more processors and memory, wherein the memory stores one or more programs configured for execution by the one or more processors, and the one or more programs comprise instructions for: selecting a plurality of data fields from a data source having a plurality of data rows; determining a number N of bins for aggregating rows of the data source; allocating storage in the memory for Nbins, each bin comprising storage space for a key value and an associated counter value, with the associated counter value initialized to zero; sequentially processing the plurality of rows in the data source, including, for each row: computing a respective key value for the respective row using data values for the selected plurality of data fields in the respective row; when the respective key value matches a key value for a respective bin, incrementing the counter value for the respective bin; when the respective key value does not match any bin key value and there exists one or more bins having a counter value of zero: selecting a respective bin having a counter value of zero; setting the key value of the respective bin to be the respective key value; and setting the counter value of the respective bin to be one; when the respective key value does not match any bin key value and all of the N bins have non-zero counter values: identifying a respective bin with a smallest counter value c; incrementing the counter value of the respective bin; and with probability 1/(1+c), replacing the key value of the respective bin with the respective key value. 11 . The computer system of claim 10 , wherein the data source is streaming. 12 . The computer system of claim 10 , wherein the one or more programs further comprise instructions for: receiving a query from a client to identify M frequent items from the N bins, wherein M is a positive integer less than N; in response to receiving the query: selecting, from the N bins, M bins having the largest counter values; and returning the key values of the selected M bins to the client. 13 . The computer system of claim 10 , wherein the one or more programs further comprise instructions for: receiving a query from a client to estimate a count of rows from the data source satisfying a user-specified filter condition; in response to receiving the query: determining a subset of the bins whose key values satisfy the user-specified filter condition; computing a sum of the counter values for the bins in the determined subset; and returning, to the client, the sum as

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2018239792A1 cover?
A process creates a data sketch for a data set with many rows. A user selects data fields from the data source. The process allocates storage for N bins, where each bin has storage space for a key value and an associated counter value (which is initialized to zero). The process sequentially accesses the rows from the data source (e.g., as a stream). For each row, the process computes a respecti…
Who is the assignee on this patent?
Tableau Software Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/2272. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Aug 23 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).