Run time prediction for data queries

US10133775B1 · US · B1

Patent metadata
FieldValue
Publication numberUS-10133775-B1
Application numberUS-201414219802-A
CountryUS
Kind codeB1
Filing dateMar 19, 2014
Priority dateMar 19, 2014
Publication dateNov 20, 2018
Grant dateNov 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are described for modeling data query execution time based on a cost of data queries, where the cost provides a measure of the processing resources used by the data query while executing. Using regression analysis or other statistical methods, a model may be generated that enables the prediction of the query execution time based on the query cost. In some cases, the model may be generated based on a linear regression analysis of previously measured execution times and previously determined data query costs. The model may be stored and employed prior to, or during, the subsequent execution of a data query, to predict the execution time of the data query. Data queries that execute substantially longer than the predicted execution time may be terminated.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer-implemented method, comprising: accessing cost data describing costs of executing a plurality of data queries on one or more data storage systems, the costs indicating consumption of processing resources during execution of the plurality of data queries; accessing execution time data describing previous execution times of the plurality of data queries; generating a preliminary model based on a linear regression analysis that correlates the costs and the previous execution times of the plurality of data queries; determining a variance that characterizes a distribution, relative to the preliminary model, of pairs of the cost data and the execution time data associated with individual executions of the plurality of data queries; identifying a subset of the pairs of the cost data and the execution time data, the subset omitting one or more of the pairs of the cost data and the execution time data that are outside a multiplicative factor times the variance; generating a model based on the linear regression analysis that provides a first predicted execution time as a function of a cost of a data query; executing the data query on the one or more data storage systems; determining that the data query exceeds a threshold execution time while executing on the one or more data storage systems; based on the determination, employing the model to calculate a second predicted execution time of the data query based on the cost of the data query; and based at least partly on the second predicted execution time, determining a time period for executing the data query on the one or more data storage systems such that the data query is predicted to complete within the time period, wherein the time period includes a sum of the second predicted execution time and a multiplicative factor times the variance that characterizes a distribution of a combined query cost and execution time data, relative to the model. 2. The method of claim 1 , further comprising: executing the data query on the one or more data storage systems; and terminating the data query based on a determination that the executing of the data query exceeds the time period. 3. The method of claim 1 , wherein the calculating of the second predicted execution time of the data query is further based on one or more parameters; and the one or more parameters include one or more of: memory capacity used by the one or more data storage systems, processing resources used by the one or more data storage systems, a proportion of the processing resources available during the execution of the plurality of data queries, a number of nodes that executed the plurality of data queries, or an amount of data processed during the execution of the plurality of data queries. 4. A system, comprising: at least one computing device configured to implement one or more services, wherein the one or more services are configured to: generate a preliminary model based on a linear regression analysis that correlates costs and previous execution times of a plurality of data queries; determine a variance that characterizes a distribution, relative to the preliminary model, of pairs of cost data and execution time data associated with individual executions of the plurality of data queries; identify a subset of the pairs of the cost data and the execution time data, the subset omitting one or more of the pairs of the cost data and the execution time data that are outside a multiplicative factor times the variance; generate a model based on the regression analysis that provides a first predicted execution time as a function of a cost of data query; employ the model to calculate a second predicted execution time of the data query based on a cost of the data query; and based at least partly on the second predicted execution time, determine a time period for executing the data query on one or more data storage systems such that the data query is predicted to complete within the time period, and wherein the time period includes a sum of the second predicted execution time and a multiplicative factor times a variance that characterizes a distribution of a combined query cost and execution time data, relative to the model. 5. The system of claim 4 , wherein the regression analysis is a linear regression analysis. 6. The system of claim 4 , wherein the model is based at least in part on the regression analysis that correlates the costs and the previous execution times of the plurality of data queries on the one or more data storage systems, such that the model is associated with the one or more data storage systems. 7. The system of claim 6 , wherein the one or more data storage systems are managed at least partly using a relational database management system (RDBMS). 8. The system of claim 4 , wherein: the one or more services are further configured to access the execution time data describing one or more recent execution times of the data query executing on the one or more data storage systems within a recent period of time; and the second predicted execution time of the data query is further based on the one or more recent execution times of the data query. 9. The system of claim 4 , wherein the employing of the model to calculate the second predicted execution time of the data query is based on determining that the data query exceeded a threshold execution time while executing on the one or more data storage systems. 10. The system of claim 4 , wherein the one or more services are further configured to: execute the data query on the one or more data storage systems; and terminate the data query based on a determination that the executing of the data query exceeds the time period. 11. The system of claim 4 , wherein the one or more services are further configured to: access the cost data describing the costs of executing the plurality of data queries on the one or more data storage systems; access the execution time data describing the previous execution times of the plurality of data queries executing on the one or more data storage systems; and perform the regression analysis to generate the model. 12. The system of claim 4 , wherein the variance that characterizes distribution, relative to the model, includes one or more parameters and the previous execution times associated with individual executions of the plurality of data queries. 13. The system of claim 4 , wherein the calculating of the second predicted execution time of the data query is further based on one or more parameters; and the one or more parameters include one or more of: memory capacity used by the one or more data storage systems, processing resources used by the one or more data storage systems, a proportion of the processing resources available during the execution of the plurality of data queries, a number of nodes that executed the plurality of data queries, or an amount of data processed during the execution of the plurality of data queries. 14. One or more non-transitory computer-readable media storing instructions which, when executed by at least one processor, instruct the at least one processor to perform actions comprising: generating a preliminary model based on a linear regression analysis that correlates costs and previous execution times of a plurality of data queries; determining a variance that characterizes a distribution, relative to the preliminary model, of pairs of cost data and execution time data associated with individual executions of the plurality of data queries; identifying a subset of the pairs of the cost data and the execution time data, the sub

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10133775B1 cover?
Techniques are described for modeling data query execution time based on a cost of data queries, where the cost provides a measure of the processing resources used by the data query while executing. Using regression analysis or other statistical methods, a model may be generated that enables the prediction of the query execution time based on the query cost. In some cases, the model may be gene…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/2453. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Nov 20 2018 00:00:00 GMT+0000 (Coordinated Universal Time) (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).