Parallel computing execution plan optimization

US9235446B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9235446-B2
Application numberUS-201213530490-A
CountryUS
Kind codeB2
Filing dateJun 22, 2012
Priority dateJun 22, 2012
Publication dateJan 12, 2016
Grant dateJan 12, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The use of statistics collected during the parallel distributed execution of the tasks of a job may be used to optimize the performance of the task or similar recurring tasks. An execution plan for a job is initially generated, in which the execution plan includes tasks. Statistics regarding operations performed in the tasks are collected while the tasks are executed via parallel distributed execution. Another execution plan is then generated for another recurring job, in which the additional execution plan has at least one task in common with the execution plan for the job. The additional execution plan is subsequently optimized based at least on the statistics to produce an optimized execution plan.

First claim

Opening claim text (preview).

What is claimed is: 1. A computer storage medium storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising: generating a first execution plan for a job that includes a plurality of tasks; collecting statistics regarding operations performed in the plurality of tasks while the plurality of tasks are executed via parallel distributed execution; generating a second execution plan for a recurring job, the second execution plan having at least one task in common with the first execution plan for the job; and optimizing the second execution plan based at least on the statistics to produce an optimized execution plan, wherein the optimizing includes at least one of: reducing a number of parallelisms for operations at a tail end of the job; or replacing pair-wise joins of data with broadcast joins of data process by at least one of the operations. 2. The computer storage medium of claim 1 , further comprising: executing a plurality of further tasks specified in the optimized execution plan to generate execution results for the recurring job; and outputting the execution results of the recurring job. 3. The computer storage medium of claim 2 , further comprising: collecting additional statistics regarding additional operations performed in the further tasks while the further tasks are executed via parallel distributed execution; and optimizing the additional operations on-the-fly during the parallel distributed execution of the further tasks based at least on the additional statistics. 4. The computer storage medium of claim 1 , wherein the generating the first execution plan includes translating a job script for the job into the first execution plan. 5. The computer storage medium of claim 1 , wherein the collecting the statistics includes: inserting collectors into a set of operations selected from the operations performed in the tasks; receiving local statistical data from the collectors while the set of the operations are executed via parallel distributed execution; and composing the local statistical data provided by the collectors into the statistics. 6. The computer storage medium of claim 5 , further comprising selecting the set of operations from the operations performed in the task by ignoring at least one operation that produces a predictable result. 7. The computer storage medium of claim 5 , wherein the optimizing includes matching the statistics captured by the collectors to expressions specified in the second execution plan based at least on hash values that represent locations of the collectors in a query graph of the first execution plan. 8. The computer storage medium of claim 1 , wherein the statistics include a first number of distinctive values in a data table column and a second number of values that repeat over a predetermined fraction of rows in the data table column, and wherein the optimizing includes avoiding skewing in partitioning of operations in the recurring job based at least on the first number and the second number. 9. The computer storage medium of claim 8 , further comprising: estimating the first number via distributed hash sketching; and estimating the second number by at least receiving frequency estimates obtained by each collector, totaling up the frequency estimates, and tallying up distinct values, wherein each frequency estimate is obtained by a corresponding collector through lossy counting of a subset of data observed by a task that is monitored by the corresponding collector. 10. The computer storage medium of claim 1 , wherein the optimizing includes at least one of: determining a sequence of operations based at least on operator selectivity and computation costs of the operations; or determining a degree of parallelism for the operations, a number of partitions for the operations, or operations to fit into a particular task based at least on the computation costs of the operations and a size of data to be processed. 11. A computer-implemented method, comprising: generating a first execution plan for a job that includes a plurality of tasks; collecting statistics regarding operations performed in the plurality of tasks while the plurality of tasks are executed via parallel distributed execution; optimizing a second execution plan for a recurring job based at least on the statistics to produce an optimized execution plan, the second execution plan having at least one task in common with the first execution plan, wherein the optimizing includes at least one of: reducing a number of parallelisms for operations at a tail end of the job; or replacing pair-wise joins of data with broadcast joins of data process by at least one of the operations; executing a plurality of further tasks specified in the optimized execution plan to generate execution results for the recurring job; and outputting the execution results of the recurring job. 12. The computer-implemented method of claim 11 , further comprising: collecting additional statistics regarding additional operations performed in the further tasks while the further tasks are executed via parallel distributed execution; and optimizing the additional operations on-the-fly during the parallel distributed execution of the further tasks based at least on the additional statistics. 13. The computer-implemented method of claim 11 , wherein the collecting the statistics includes: inserting collectors into a set of operations selected from the operations performed in the tasks; receiving local statistical data from the collectors while the set of the operations are executed via parallel distributed execution; and composing the local statistical data provided by the collectors into the statistics. 14. The computer-implemented method of claim 11 , wherein the optimizing includes at least one of: determining a sequence of operations based at least on operator selectivity and computation costs of the operations; or determining a degree of parallelism for the operations, a number of partitions for the operations, or operations to fit into a particular task based at least on the computation costs of operations. 15. The computer-implemented method of claim 11 , wherein the collecting the statistics includes: selecting a set of operations from the operations performed in the tasks by ignoring at least one operation that produces a predictable result; inserting collectors into the set of operations; and receiving at least some of the statistics from the collectors. 16. A computing device, comprising: one or more processors; and a memory that includes a plurality of computer-executable components executed by the one or more processors, comprising: a compiler that generates a first execution plan for a job that includes a plurality of tasks; at least one job manager that collects statistics regarding operations performed in the plurality of tasks while the plurality of tasks are executed via parallel distributed execution; and a compile-time optimizer that optimizes a second execution plan based at least on the statistics to produce an optimized execution plan, the second execution plan having at least one task in common with the first execution plan, wherein the compile-time optimizer is configured to perform at least one of: reducing a number of parallelisms for operations at a tail end of the job; or replacing pair-wise joins of data with broadcast joins of data process by at least one of the operations. 17. The computing device of claim 16 , wherein the compiler generates the second e

Assignees

Inventors

Classifications

  • G06F9/5066Primary

    Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title

  • Querying · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9235446B2 cover?
The use of statistics collected during the parallel distributed execution of the tasks of a job may be used to optimize the performance of the task or similar recurring tasks. An execution plan for a job is initially generated, in which the execution plan includes tasks. Statistics regarding operations performed in the tasks are collected while the tasks are executed via parallel distributed ex…
Who is the assignee on this patent?
Bruno Nicolas, Zhou Jingren, Kandula Srikanth, and 3 more
What technology area does this patent fall under?
Primary CPC classification G06F9/5066. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Jan 12 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).