Data processing method and related apparatus
US-2024152807-A1 · May 9, 2024 · US
US10133775B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10133775-B1 |
| Application number | US-201414219802-A |
| Country | US |
| Kind code | B1 |
| Filing date | Mar 19, 2014 |
| Priority date | Mar 19, 2014 |
| Publication date | Nov 20, 2018 |
| Grant date | Nov 20, 2018 |
A practical reading order for non-experts. Skip the full description unless you need deep technical detail.
What the patent document calls the invention.
A short plain-language summary of the technical disclosure.
Who owns or filed the patent and who is credited as inventor.
Filing, priority, publication, and grant dates set the timeline.
The legal scope of protection — read this for what is actually claimed.
Technology tags used to group this patent with similar filings.
Prior art links and similar publications in this corpus.
Official abstract text for this publication.
Techniques are described for modeling data query execution time based on a cost of data queries, where the cost provides a measure of the processing resources used by the data query while executing. Using regression analysis or other statistical methods, a model may be generated that enables the prediction of the query execution time based on the query cost. In some cases, the model may be generated based on a linear regression analysis of previously measured execution times and previously determined data query costs. The model may be stored and employed prior to, or during, the subsequent execution of a data query, to predict the execution time of the data query. Data queries that execute substantially longer than the predicted execution time may be terminated.
Opening claim text (preview).
What is claimed is: 1. A computer-implemented method, comprising: accessing cost data describing costs of executing a plurality of data queries on one or more data storage systems, the costs indicating consumption of processing resources during execution of the plurality of data queries; accessing execution time data describing previous execution times of the plurality of data queries; generating a preliminary model based on a linear regression analysis that correlates the costs and the previous execution times of the plurality of data queries; determining a variance that characterizes a distribution, relative to the preliminary model, of pairs of the cost data and the execution time data associated with individual executions of the plurality of data queries; identifying a subset of the pairs of the cost data and the execution time data, the subset omitting one or more of the pairs of the cost data and the execution time data that are outside a multiplicative factor times the variance; generating a model based on the linear regression analysis that provides a first predicted execution time as a function of a cost of a data query; executing the data query on the one or more data storage systems; determining that the data query exceeds a threshold execution time while executing on the one or more data storage systems; based on the determination, employing the model to calculate a second predicted execution time of the data query based on the cost of the data query; and based at least partly on the second predicted execution time, determining a time period for executing the data query on the one or more data storage systems such that the data query is predicted to complete within the time period, wherein the time period includes a sum of the second predicted execution time and a multiplicative factor times the variance that characterizes a distribution of a combined query cost and execution time data, relative to the model. 2. The method of claim 1 , further comprising: executing the data query on the one or more data storage systems; and terminating the data query based on a determination that the executing of the data query exceeds the time period. 3. The method of claim 1 , wherein the calculating of the second predicted execution time of the data query is further based on one or more parameters; and the one or more parameters include one or more of: memory capacity used by the one or more data storage systems, processing resources used by the one or more data storage systems, a proportion of the processing resources available during the execution of the plurality of data queries, a number of nodes that executed the plurality of data queries, or an amount of data processed during the execution of the plurality of data queries. 4. A system, comprising: at least one computing device configured to implement one or more services, wherein the one or more services are configured to: generate a preliminary model based on a linear regression analysis that correlates costs and previous execution times of a plurality of data queries; determine a variance that characterizes a distribution, relative to the preliminary model, of pairs of cost data and execution time data associated with individual executions of the plurality of data queries; identify a subset of the pairs of the cost data and the execution time data, the subset omitting one or more of the pairs of the cost data and the execution time data that are outside a multiplicative factor times the variance; generate a model based on the regression analysis that provides a first predicted execution time as a function of a cost of data query; employ the model to calculate a second predicted execution time of the data query based on a cost of the data query; and based at least partly on the second predicted execution time, determine a time period for executing the data query on one or more data storage systems such that the data query is predicted to complete within the time period, and wherein the time period includes a sum of the second predicted execution time and a multiplicative factor times a variance that characterizes a distribution of a combined query cost and execution time data, relative to the model. 5. The system of claim 4 , wherein the regression analysis is a linear regression analysis. 6. The system of claim 4 , wherein the model is based at least in part on the regression analysis that correlates the costs and the previous execution times of the plurality of data queries on the one or more data storage systems, such that the model is associated with the one or more data storage systems. 7. The system of claim 6 , wherein the one or more data storage systems are managed at least partly using a relational database management system (RDBMS). 8. The system of claim 4 , wherein: the one or more services are further configured to access the execution time data describing one or more recent execution times of the data query executing on the one or more data storage systems within a recent period of time; and the second predicted execution time of the data query is further based on the one or more recent execution times of the data query. 9. The system of claim 4 , wherein the employing of the model to calculate the second predicted execution time of the data query is based on determining that the data query exceeded a threshold execution time while executing on the one or more data storage systems. 10. The system of claim 4 , wherein the one or more services are further configured to: execute the data query on the one or more data storage systems; and terminate the data query based on a determination that the executing of the data query exceeds the time period. 11. The system of claim 4 , wherein the one or more services are further configured to: access the cost data describing the costs of executing the plurality of data queries on the one or more data storage systems; access the execution time data describing the previous execution times of the plurality of data queries executing on the one or more data storage systems; and perform the regression analysis to generate the model. 12. The system of claim 4 , wherein the variance that characterizes distribution, relative to the model, includes one or more parameters and the previous execution times associated with individual executions of the plurality of data queries. 13. The system of claim 4 , wherein the calculating of the second predicted execution time of the data query is further based on one or more parameters; and the one or more parameters include one or more of: memory capacity used by the one or more data storage systems, processing resources used by the one or more data storage systems, a proportion of the processing resources available during the execution of the plurality of data queries, a number of nodes that executed the plurality of data queries, or an amount of data processed during the execution of the plurality of data queries. 14. One or more non-transitory computer-readable media storing instructions which, when executed by at least one processor, instruct the at least one processor to perform actions comprising: generating a preliminary model based on a linear regression analysis that correlates costs and previous execution times of a plurality of data queries; determining a variance that characterizes a distribution, relative to the preliminary model, of pairs of cost data and execution time data associated with individual executions of the plurality of data queries; identifying a subset of the pairs of the cost data and the execution time data, the sub
Plan optimisation · CPC title
Query optimisation · CPC title
Database tuning (G06F16/2282 takes precedence; database performance monitoring G06F11/3409) · CPC title
Physics · mapped topic
Physics · mapped topic
Related publications grouped by family.
Answers are generated from the same data shown on this page.