Cost-based optimization of configuration parameters and cluster sizing for hadoop

US9367601B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9367601-B2
Application number  US-201313843347-A
Country             US
Kind code           B2
Filing date         Mar 15, 2013
Priority date       Mar 26, 2012
Publication date    Jun 14, 2016
Grant date          Jun 14, 2016

Abstract

Official abstract text for this publication.

Cost-based optimization of configuration parameters and cluster sizing for distributed data processing systems are disclosed. According to an aspect, a method includes receiving at least one job profile of a MapReduce job. The method also includes using the at least one job profile to predict execution of the MapReduce job within a plurality of different predetermined settings of a distributed data processing system. Further, the method includes determining one of the predetermined settings that optimizes performance of the MapReduce job. The method may also include automatically adjusting the distributed data processing system to the determined predetermined setting.
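
The selection step in the abstract (predict the job under several predetermined settings, then pick the best one) can be illustrated with a minimal Python sketch. The abstract does not give the prediction model, so everything here is an assumption made for illustration: the names `JobProfile`, `Setting`, `predict_seconds`, and `best_setting` are hypothetical, and the wave-based cost formula is a deliberately simple stand-in, not the patented method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical job profile measured from one run of the job."""
    input_bytes: int             # dataflow: total map input size
    map_selectivity: float       # dataflow: map output bytes / map input bytes
    map_cost_per_byte: float     # cost: seconds of map work per input byte
    reduce_cost_per_byte: float  # cost: seconds of reduce work per shuffled byte


@dataclass(frozen=True)
class Setting:
    """Hypothetical candidate setting: cluster size plus job configuration."""
    nodes: int
    map_slots_per_node: int
    reduce_tasks: int


def predict_seconds(profile, setting, split_bytes=128 * 2**20, task_overhead=1.0):
    """Toy cost model (an assumption): tasks run in waves over the free slots."""
    num_maps = -(-profile.input_bytes // split_bytes)  # ceiling division
    map_slots = setting.nodes * setting.map_slots_per_node
    map_waves = -(-num_maps // map_slots)
    map_task = task_overhead + split_bytes * profile.map_cost_per_byte

    shuffled = profile.input_bytes * profile.map_selectivity
    reduce_slots = setting.nodes  # assumed: one reduce slot per node
    reduce_waves = -(-setting.reduce_tasks // reduce_slots)
    reduce_task = task_overhead + (
        shuffled / setting.reduce_tasks
    ) * profile.reduce_cost_per_byte

    return map_waves * map_task + reduce_waves * reduce_task


def best_setting(profile, settings):
    """Return the predetermined setting with the lowest predicted run time."""
    return min(settings, key=lambda s: predict_seconds(profile, s))
```

A caller would build one `JobProfile`, list the candidate `Setting` values, and pass both to `best_setting`; applying the chosen setting to a real cluster is outside the scope of this sketch.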

First claim

Opening claim text (preview).

What is claimed:

1. A method comprising: at a processor and memory: collecting monitoring data during execution of a MapReduce job comprising a first set of configuration settings on a MapReduce framework, wherein collecting the monitoring data comprises applying dynamic instrumentation to the MapReduce job and the MapReduce framework during execution of the MapReduce job on the MapReduce framework, wherein applying dynamic instrumentation comprises applying a specified set of event-condition-action (ECA) rules to the MapReduce framework, and wherein an event comprises at least one of a entry function, exit function, memory allocation, system call event occurring during execution of the MapReduce job on the MapReduce framework; generating dataflow fields and cost fields of a job profile of the MapReduce job based on the collected monitoring data; and communicating the job profile to a prediction process configured to predict the behavior of the MapReduce job comprising a second set of configuration settings on the MapReduce framework using the job profile.

2. The method of claim 1, wherein collecting monitoring data comprises receiving one of a dataflow measure, an execution time measure, and a resource usage measure during execution of the MapReduce job on the MapReduce framework.

3. The method of claim 1, wherein collecting monitoring data comprises receiving run-time monitoring information from the MapReduce job and MapReduce framework during execution of the MapReduce job on the MapReduce framework.

4. The method of claim 1, wherein the data flow fields of the job profile comprises dataflow information associated with the collected monitoring data of the MapReduce job.

5. The method of claim 4, wherein the dataflow information comprises one of a size of data and I/O transfer processed during execution of the MapReduce job.

6. The method of claim 1, wherein the cost fields of the job profile comprise cost information associated with the collected monitoring data of the MapReduce job.

7. The method of claim 6, wherein the cost information comprises one of resource usage and execution time associated with the collected monitoring data of the MapReduce job.

8. The method of claim 1, wherein the job profile fields comprises one of a dataflow statistics field and a cost statistic field.

9. The method of claim 1, wherein using the job profile comprises using the job profile to simulate execution of the MapReduce job on the MapReduce framework, the second set of configuration settings of the MapReduce job being different from the first set of configuration settings used to collect the monitoring data.

10. The method of claim 1, wherein the MapReduce job comprises map tasks and reduce tasks.

11. The method of claim 1, wherein an action comprises at least one of obtaining the duration of a function call, examining the memory state, and counting the number of bytes transferred during execution of the MapReduce job on the MapReduce framework.

12. The method of claim 1, wherein at least one of the dataflow fields comprises the amount of data flowing through at least one task phase during the execution of the MapReduce job on the MapReduce framework.

13. A computing device comprising: a processor and memory configured to: collect monitoring data during execution of a MapReduce job comprising a first set of configuration settings on the MapReduce framework, wherein collecting the monitoring data comprises applying dynamic instrumentation to the MapReduce job and the MapReduce framework during execution of the MapReduce job on the MapReduce framework, wherein applying dynamic instrumentation comprises applying a specified set of event-condition-action (ECA) rules to the MapReduce framework, and wherein an event comprises at least one of a entry function, exit function, memory allocation, system call event occurring during execution of the MapReduce job on the MapReduce framework; generate dataflow fields and cost fields of a job profile based on the collected monitoring data; and communicate the job profile to a prediction process configured to predict the behavior of the MapReduce job comprising a second set of configuration settings on the MapReduce framework using the job profile.
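
The event-condition-action idea in claims 1 and 11 can be sketched in Python as follows. This is only an illustration under stated assumptions: a function wrapper stands in for true dynamic instrumentation, string lengths stand in for byte counts, and the names `Rule`, `instrument`, `count_input`, `record_exit`, `build_profile`, and `toy_map` are hypothetical and do not come from the patent or from any MapReduce framework.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One ECA rule: when `event` occurs and `condition` holds, run `action`."""
    event: str                            # "entry" or "exit" of a function
    condition: Callable[[dict], bool]     # predicate over the event context
    action: Callable[[dict, dict], None]  # updates the monitoring data


def instrument(fn, phase, rules, data, clock=time.perf_counter):
    """Wrap fn so matching rules fire on entry and exit, leaving fn unchanged."""
    def fire(event, ctx):
        for rule in rules:
            if rule.event == event and rule.condition(ctx):
                rule.action(ctx, data)

    def wrapper(records):
        ctx = {"phase": phase, "in_bytes": sum(len(r) for r in records)}
        fire("entry", ctx)
        start = clock()
        out = fn(records)
        ctx["elapsed"] = clock() - start
        ctx["out_bytes"] = sum(len(r) for r in out)
        fire("exit", ctx)
        return out

    return wrapper


def count_input(ctx, data):
    # Action: count the data entering the phase.
    data[ctx["phase"] + "_in_bytes"] += ctx["in_bytes"]


def record_exit(ctx, data):
    # Action: count the data leaving the phase and the duration of the call.
    data[ctx["phase"] + "_out_bytes"] += ctx["out_bytes"]
    data[ctx["phase"] + "_seconds"] += ctx["elapsed"]


RULES = [
    Rule("entry", lambda ctx: True, count_input),
    Rule("exit", lambda ctx: ctx["phase"] in ("map", "reduce"), record_exit),
]


def build_profile(data):
    """Split the monitoring data into dataflow fields and cost fields."""
    dataflow = {k: v for k, v in data.items() if k.endswith("_bytes")}
    cost = {k: v for k, v in data.items() if k.endswith("_seconds")}
    return {"dataflow": dataflow, "cost": cost}


def toy_map(records):
    """Stand-in map function: split each line into words."""
    return [word for line in records for word in line.split()]


monitoring = defaultdict(float)
profiled_map = instrument(toy_map, "map", RULES, monitoring)
words = profiled_map(["a bb", "ccc"])
job_profile = build_profile(monitoring)
```

The resulting `job_profile` dictionary holds the two groups of fields that the claims describe; handing it to a prediction process is left out of this sketch.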

Assignees

Inventors

Classifications

  • for planning or managing the needed capacity · CPC title

  • Performance evaluation by statistical analysis · CPC title

  • Performance evaluation by tracing or monitoring · CPC title

  • Monitoring of software · CPC title

  • for parallel or distributed programming · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9367601B2 cover?
Cost-based optimization of configuration parameters and cluster sizing for distributed data processing systems are disclosed. According to an aspect, a method includes receiving at least one job profile of a MapReduce job. The method also includes using the at least one job profile to predict execution of the MapReduce job within a plurality of different predetermined settings of a distributed …
Who is the assignee on this patent?
Univ Duke
What technology area does this patent fall under?
Primary CPC classification G06F11/3404. Mapped technology areas include Physics.
When was this patent published?
Publication date Jun 14, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).