Query processing in data analysis

US11445240B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11445240-B2
Application number  US-202017109701-A
Country             US
Kind code           B2
Filing date         Dec 2, 2020
Priority date       Oct 26, 2016
Publication date    Sep 13, 2022
Grant date          Sep 13, 2022

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

In implementations of the subject matter described herein, a solution for query processing is provided. In this solution, data subsets are pre-stored, for example in a fast-access storage device, for data analysis, each including data entries corresponding to one or more dimensions. If two or more data subsets are needed to cover the target dimensions corresponding to query items in a received query, the query is decomposed into subqueries instead of falling back to analyzing a source data set that is not stored. The decomposition ensures that the target dimension(s) corresponding to the query item(s) in each subquery can be covered by a single data subset. The data subset is analyzed for each subquery, and a query result for the query is determined based on the analysis results of the subqueries. In this way, the query result can be obtained quickly from the available data subsets.
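
The coverage check and decomposition described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the claims drive the decomposition with correlations between query items, whereas this example substitutes a simple greedy cover. The function name `decompose`, the subset names, and the dimension names are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def decompose(
    target_dims: FrozenSet[str],
    subsets: Dict[str, FrozenSet[str]],
) -> List[Tuple[str, FrozenSet[str]]]:
    """Map target dimensions onto pre-stored data subsets.

    Returns a list of (subset_name, dimensions) pairs. A single pair means
    one subset covered the whole query and no decomposition was needed.
    """
    # Case 1: one pre-stored subset covers every target dimension.
    for name, dims in subsets.items():
        if target_dims <= dims:
            return [(name, target_dims)]

    # Case 2: two or more subsets are needed. Greedy stand-in for the
    # correlation-based grouping in the claims: repeatedly pick the subset
    # that covers the most still-uncovered dimensions.
    remaining = set(target_dims)
    plan: List[Tuple[str, FrozenSet[str]]] = []
    while remaining:
        best = max(subsets, key=lambda n: len(remaining & subsets[n]))
        covered = remaining & subsets[best]
        if not covered:
            raise ValueError(f"no stored subset covers: {sorted(remaining)}")
        plan.append((best, frozenset(covered)))
        remaining -= covered
    return plan


# Hypothetical pre-stored subsets, keyed by name, valued by covered dimensions.
stored = {
    "A": frozenset({"country", "device"}),
    "B": frozenset({"device", "browser"}),
    "C": frozenset({"browser", "os"}),
}

single = decompose(frozenset({"country", "device"}), stored)
split = decompose(frozenset({"country", "browser"}), stored)
```

Here `single` resolves to one subset, while `split` yields one subquery per subset. Each subquery would then be answered from its own subset and the partial results combined into the final query result.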

First claim

Opening claim text (preview).

We claim:

1. A computer-implemented method, comprising: receiving a query including a plurality of query items associated with a plurality of target dimensions of a data entry; determining at least two of a plurality of data subsets are needed to cover the plurality of target dimensions, at least one of the plurality of data subsets including data entries corresponding to at least one of the plurality of target dimensions; in response to determining that the at least two of the plurality of data subsets are needed to cover the plurality of target dimensions, decomposing the query into a plurality of subqueries, each of the plurality of subqueries having at least one of the plurality of query items, wherein decomposing the query into the plurality of subqueries comprises: determining correlations between respective pairs of query items among the plurality of query items, wherein determining the correlations comprises determining mutual information between the respective pairs of query items based on probabilities of presence of the plurality of query items in corresponding target dimensions; and determining associations of the plurality of target dimensions based on the plurality of target dimensions corresponding to the respective data subsets and the correlations; and determining a query result for the query by analyzing a data entry in the plurality of data subsets that is corresponding to a target dimension associated with the at least one query item in each of the plurality of subqueries.

2. The method of claim 1, wherein decomposing the query into the plurality of subqueries further comprises: decomposing, based on the determined associations of the plurality of target dimensions, the query into the plurality of subqueries such that target dimensions corresponding to a subquery having one or more of the plurality of query items are determined as having an association.

3. The method of claim 1, wherein determining the associations of the plurality of target dimensions comprises: determining two of the plurality of target dimensions as having an association based on at least one of: a correlation between a pair of query items associated with the two target dimensions being greater than a threshold correlation; the two target dimensions being covered by a first data subset of the plurality of data subsets; and the two target dimensions and a further target dimension having an association with one of the two target dimensions being covered by a second data subset of the plurality of data subsets.

4. The method of claim 1, further comprising: creating a plurality of candidate data subsets from a source data set based on a predetermined coverage rate for combinations of source dimensions of the source data set, each of the plurality of candidate data subsets covering at least two of the source dimensions; combining at least two of the plurality of candidate data subsets into a combined candidate data subset such that the combined candidate data subset covers source dimensions of the at least two candidate data subsets; identifying, from the plurality of candidate data subsets, a candidate data subset with source dimensions covered by the combined candidate data subset; and determining the plurality of data subsets based on remaining candidate data subsets other than the identified candidate data subset.

5. The method of claim 4, further comprising selecting the at least two candidate data subsets by: determining a data size of each of the plurality of candidate data subsets; and selecting, from the plurality of candidate data subsets, the at least two candidate data subsets with respective data sizes smaller than a threshold data size.

6. The method of claim 5, wherein determining the data size of each of the plurality of candidate data subsets comprises: sampling a plurality of data entries from data entries included in a given candidate data subset; determining a first number of different data entries and a second number of data entries having a frequency of occurrence lower than a threshold frequency among the sampled plurality of data entries; determining, based on the first number and the second number, a number of different data entries included in the given candidate data subset; and determining, based on the number of the different data entries, the data size of the given candidate data subset.

7. The method of claim 4, wherein determining the plurality of data subsets based on the remaining candidate data subsets comprises: determining whether a total data size of the remaining candidate data subsets exceeds a storage space available for storing the plurality of data subsets; and in response to the total data size being equal to or smaller than the storage space, determining the remaining candidate data subsets as the plurality of data subsets.

8. A device, comprising: a processing unit; a memory coupled to the processing unit and storing instructions thereon, the instructions, when executed by the processing unit, cause the processing unit to: receive a query including a plurality of query items associated with a plurality of target dimensions of a data entry; determine whether at least two of a plurality of data subsets are needed to cover the plurality of target dimensions, at least one of the plurality of data subsets including data entries corresponding to at least one of the plurality of target dimensions; in response to determining that the at least two of the plurality of data subsets are needed to cover the plurality of target dimensions, decompose the query into a plurality of subqueries, each of the plurality of subqueries having at least one of the plurality of query items, wherein decomposing the query into the plurality of subqueries comprises: determine correlations between respective pairs of query items among the plurality of query items, wherein determining the correlations comprises determining mutual information between the respective pairs of query items based on probabilities of presence of the plurality of query items in corresponding target dimensions; and determine associations of the plurality of target dimensions based on the plurality of target dimensions corresponding to the respective data subsets and the correlations; and determine a query result for the query by analyzing a data entry in the plurality of data subsets that is corresponding to a target dimension associated with the at least one query item in each of the plurality of subqueries.

9. The device of claim 8, wherein, to decompose the query into the plurality of subqueries, the processing unit is further caused to: decompose, based on the determined associations of the plurality of target dimensions, the query into the plurality of subqueries such that target dimensions corresponding to a subquery having one or more of the plurality of query items are determined as having an association.

10. The device of claim 8, wherein, to determine the association of the plurality of target dimensions, the processing unit is caused to: determining two of the plurality of target dimensions as having an association based on at least one of: a correlation between a pair of query items associated with the two target dimensions being greater than threshold correlation; the two target dimensions being covered by a first data subset of the plurality of data subsets; and the two target dimensions and a further target dimension having an association with one of the two target dimensions being covered by a second data subset of the plurality of data subsets.

11. The device of claim 8, wherein the processing unit is further caused to: create a plurality of c
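
Claims 1 and 8 recite mutual information between pairs of query items, computed from probabilities of presence, and claims 3 and 10 compare a correlation against a threshold. The claim text shown here does not give the exact formula, so the following is only a sketch under stated assumptions: each query item is treated as a binary present/absent event, the standard mutual information of two binary variables is used, and the function names and the threshold value are hypothetical.

```python
import math


def mutual_information(p_a: float, p_b: float, p_ab: float) -> float:
    """Mutual information (in nats) of two binary presence events.

    p_a  -- probability that query item a is present
    p_b  -- probability that query item b is present
    p_ab -- probability that both are present together
    """
    # Joint distribution over (a present?, b present?), derived from the inputs.
    joint = {
        (1, 1): p_ab,
        (1, 0): p_a - p_ab,
        (0, 1): p_b - p_ab,
        (0, 0): 1.0 - p_a - p_b + p_ab,
    }
    marginal_a = {1: p_a, 0: 1.0 - p_a}
    marginal_b = {1: p_b, 0: 1.0 - p_b}

    mi = 0.0
    for (x, y), p_xy in joint.items():
        # Cells with zero probability contribute nothing to the sum.
        if p_xy > 0.0:
            mi += p_xy * math.log(p_xy / (marginal_a[x] * marginal_b[y]))
    return mi


def is_associated(p_a: float, p_b: float, p_ab: float,
                  threshold: float = 0.1) -> bool:
    """Hypothetical threshold test: correlation greater than a threshold."""
    return mutual_information(p_a, p_b, p_ab) > threshold


independent = mutual_information(0.5, 0.5, 0.25)  # items occur independently
dependent = mutual_information(0.5, 0.5, 0.5)     # items always occur together
```

Independent items give zero mutual information, while items that always appear together give the maximum for this case (log 2 nats). Under this sketch, only the second pair would be marked as associated and kept in the same subquery.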

Assignees

Inventors

Classifications

  • Content synchronisation processes, e.g. decoder synchronisation · CPC title

  • Run-time optimisation · CPC title

  • G06F16/283 · Primary

    Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP · CPC title

  • Adapting the video stream to a specific local network, e.g. a Bluetooth® network · CPC title

  • Data partitioning, e.g. horizontal or vertical partitioning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11445240B2 cover?
In implementations of the subject matter described herein, a solution for query processing is provided. In this solution, data subsets are pre-stored for example in a fast access storage device for data analysis, each including data entries corresponding to one or more dimensions. If two or more data subsets are needed to cover target dimensions corresponding to query items in a received query,…
Who is the assignee on this patent?
Microsoft Technology Licensing LLC
What technology area does this patent fall under?
Primary CPC classification G06F16/283. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 13, 2022 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).