Determining statistics for cost-based optimization of a workflow

US9367594B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9367594-B2
Application numberUS-201313853416-A
CountryUS
Kind codeB2
Filing dateMar 29, 2013
Priority dateMar 29, 2013
Publication dateJun 14, 2016
Grant dateJun 14, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques, systems, and articles of manufacture for determining statistics for cost-based optimization of a workflow. A method includes generating individual sets of statistics for each intermediate relation of a workflow, wherein said intermediate relations comprise results of stages of any plan of the workflow, and wherein each individual set of statistics computes cardinality of the corresponding intermediate relation; determining a global set of statistics for the workflow, wherein said global set of statistics comprises at least one of the individual sets of statistics for each of the intermediate relations; instrumenting a given plan of the workflow to collect the global set of statistics during execution; executing the given plan to collect the global set of statistics; and determining a plan of the workflow with the lowest cost by comparing the cost of multiple plans, wherein the cost of each plan is derived from the global set of statistics.

First claim

Opening claim text (preview).

What is claimed is: 1. A method comprising: generating one or more individual sets of statistics for each of one or more intermediate relations of a workflow, wherein said one or more intermediate relations comprise results of one or more stages of any plan of the workflow, and wherein each individual set of statistics computes input cardinality of a corresponding intermediate relation of the workflow; determining which statistics from the one or more individual sets of statistics to designate as a global set of statistics for the workflow, wherein said global set of statistics comprises at least one of the one or more individual sets of statistics for each of the one or more intermediate relations, and wherein said determining which statistics to designate as the global set of statistics comprises: computing a cost for collecting each individual set of statistic of the one or more individual sets of statistics based on corresponding computed input cardinality of the individual set of statistic; wherein the at least one of the one or more individual sets of statistics for each of the one or more intermediate relations comprised in the global set of statistics for the workflow has a lowest computed cost; instrumenting multiple distinct plans of the workflow to be executed, wherein said instrumenting comprises identifying each of the multiple distinct plans by identifying multiple distinct orderings of multiple distinct operators of the workflow, wherein the multiple distinct operators comprise two or more of (i) a select operator, (ii) a project operator, (iii) a join operator, (iv) a group-by operator, (v) a transformation operator, (vi) a filter operator, and (vii) a user-defined function operator; executing each of the instrumented multiple distinct plans to collect statistics for the global set of statistics for each of the executed multiple distinct plans; and determining one of the multiple plans of the workflow with a lowest cost by comparing a cost of each of the multiple plans of the workflow, wherein the cost of each of the multiple plans is derived from (a) the global set of statistics for each of the multiple distinct plans and (b) a cost model that estimates the cost based on (i) cardinalities of input relations, (ii) central processing unit (CPU) and disk-access speeds, and (iii) memory availability, and wherein said determining comprises adhering to one or more constrained resources associated with the workflow, wherein said one or more constrained resources comprise one or more memory constraints comprising at least a limit on total memory available for computing statistics; using the determined one of the multiple plans with the lowest cost for each of one or more subsequent runs of the workflow; wherein the steps are carried out by a computer device. 2. The method of claim 1 , wherein the workflow comprises an Extract-Transform-Load (ETL) workflow. 3. The method of claim 1 , wherein the workflow comprises a structured query language (SQL) statement. 4. The method of claim 1 , wherein said cost for collecting each individual set of statistic of the one or more individual sets of statistics comprises a memory cost for holding the one or more individual sets of statistics. 5. The method of claim 1 , wherein said cost for collecting each individual set of statistic of the one or more individual sets of statistics comprises computational overhead for measuring the one or more individual sets of statistics. 6. The method of claim 1 , wherein said one or more constrained resources comprise one or more error constraints. 7. The method of claim 1 , comprising: implementing one or more semantics to leverage existing statistics. 8. The method of claim 1 , comprising: adhering to one or more requirements imposed due to materialized intermediate relations during reordering of the workflow. 9. The method of claim 1 , further comprising repeating (i) said instrumenting, (ii) said executing, and (iii) said determining one of the multiple plans of the workflow per a given temporal interval. 10. An article of manufacture comprising a computer readable storage medium having computer readable instructions tangibly embodied thereon which, when implemented, cause a computer to carry out a plurality of method steps comprising: generating one or more individual sets of statistics for each of one or more intermediate relations of a workflow, wherein said one or more intermediate relations comprise results of one or more stages of any plan of the workflow, and wherein each individual set of statistics computes input cardinality of a corresponding intermediate relation of the workflow; determining which statistics from the one or more individual sets of statistics to designate as a global set of statistics for the workflow, wherein said global set of statistics comprises at least one of the one or more individual sets of statistics for each of the one or more intermediate relations, and wherein said determining which statistics to designate as the global set of statistics comprises: computing a cost for collecting each individual set of statistic of the one or more individual sets of statistics based on corresponding computed input cardinality of the individual set of statistic; wherein the at least one of the one or more individual sets of statistics for each of the one or more intermediate relations comprised in the global set of statistics for the workflow has a lowest computed cost; instrumenting multiple distinct plans of the workflow to be executed, wherein said instrumenting comprises identifying each of the multiple distinct plans by identifying multiple distinct orderings of multiple distinct operators of the workflow, wherein the multiple distinct operators comprise two or more of (i) a select operator, (ii) a project operator, (iii) a join operator, (iv) a group-by operator, (v) a transformation operator, (vi) a filter operator, and (vii) a user-defined function operator; executing each of the instrumented multiple distinct plans to collect statistics for the global set of statistics for each of the executed multiple distinct plans; and determining one of the multiple plans of the workflow with a lowest cost by comparing a cost of each of the multiple plans of the workflow, wherein the cost of each of the multiple plans is derived from (a) the global set of statistics for each of the multiple distinct plans and (b) a cost model that estimates the cost based on (i) cardinalities of input relations, (ii) central processing unit (CPU) and disk-access speeds, and (iii) memory availability, and wherein said determining comprises adhering to one or more constrained resources associated with the workflow, wherein said one or more constrained resources comprise one or more memory constraints comprising at least a limit on total memory available for computing statistics; using the determined one of the multiple plans with the lowest cost for each of one or more subsequent runs of the workflow. 11. The article of manufacture of claim 10 , wherein said cost for collecting each individual set of statistic of the one or more individual sets of statistics comprises a memory cost for holding the one or more individual sets of statistics and/or computational overhead for measuring the one or more individual sets of statistics. 12. The article of manufacture of claim 10 , wherein said one or more constrained resources comprise one or more error constraints. 13. The article of manufacture of claim 10 , wherein the plurality of method steps further comprises repeating (i) said instrumenting, (ii) said executing, and (iii) said determining one of the multiple plan

Assignees

Inventors

Classifications

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9367594B2 cover?
Techniques, systems, and articles of manufacture for determining statistics for cost-based optimization of a workflow. A method includes generating individual sets of statistics for each intermediate relation of a workflow, wherein said intermediate relations comprise results of stages of any plan of the workflow, and wherein each individual set of statistics computes cardinality of the corresp…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F16/254. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jun 14 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).