System and method for proactive task scheduling of a copy of outlier task in a computing environment

US9307048B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-9307048-B2
Application numberUS-97933810-A
CountryUS
Kind codeB2
Filing dateDec 28, 2010
Priority dateDec 28, 2010
Publication dateApr 5, 2016
Grant dateApr 5, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The described implementations relate to distributed computing. One implementation provides a system that can include an outlier detection component that is configured to identify an outlier task from a plurality of tasks based on runtimes of the plurality of tasks. The system can also include a cause evaluation component that is configured to evaluate a cause of the outlier task. For example, the cause of the outlier task can be an amount of data processed by the outlier task, contention for resources used to execute the outlier task, or a communication link with congested bandwidth that is used by the outlier task to input or output data. The system can also include one or more processing devices configured to execute one or more of the components.

First claim

Opening claim text (preview).

The invention claimed is: 1. A system comprising: one or more processing devices; and one or more computer-readable storage media storing instructions which, when executed by the one or more processing devices, configure the one or more processing devices to: identify an outlier task from a plurality of tasks of a phase of a job based on corresponding runtimes of the plurality of tasks, the outlier task being identified while the outlier task is executing and taking longer to complete than other tasks from the phase of the job, wherein the plurality of tasks share the same code; make a determination whether the outlier task has more input data to process than the other tasks of the phase of the job; in a first instance when the determination is that the outlier task has more input data to process than the other tasks of the phase of the job, continue to let the outlier task execute without scheduling a copy of the outlier task responsive to the determination; and in a second instance when the determination is that the outlier task does not have more input data to process than the other tasks of the phase of the job compare an estimated remaining time for the outlier task to complete to an estimated time for the copy of the outlier task to complete, and when the estimated time for the copy of the outlier task to complete is less than the estimated remaining time for the outlier task to complete, schedule the copy of the outlier task. 2. The system according to claim 1 , wherein the instructions further configure the one or more processing devices to: in the second instance, determine whether a cause of the outlier task is contention for resources by evaluating progress reports reflecting processor utilization, memory utilization, or both on a server that is currently executing the outlier task. 3. The system according to claim 1 , wherein the outlier task executes concurrently with an individual other task from the plurality of tasks. 4. The system according to claim 2 , wherein the instructions configure the one or more processing devices to: responsive to determining that the cause of the outlier task is contention for resources, schedule the copy of the outlier task on a different server than the outlier task. 5. The system according to claim 4 , wherein the instructions further configure the one or more processing devices to: schedule at least one subsequent task from a subsequent processing phase of the job after completion of the outlier task. 6. The system according to claim 1 , wherein the instructions further configure the one or more processing devices to: schedule the plurality of tasks in a plurality of processing slots, the outlier task executing within an individual processing slot. 7. The system according to claim 6 , wherein the instructions further configure the one or more processing devices to: determine whether there is an available processing slot; and in the second instance, begin executing the copy of the outlier task in the available processing slot while the outlier task continues executing within the individual processing slot. 8. The system according to claim 1 , wherein the phase is a reducing phase of the job and the reduce phase is preceded by a mapping phase. 9. The system according to claim 1 , embodied as a scheduling server that initiates the plurality of tasks on other servers. 10. The system according to claim 1 , wherein the instructions further configure the one or more processing devices to, in the second instance: determine the estimated remaining time for the outlier task to complete based on an amount of the input data that the outlier task has to process and a rate at which the outlier task processes the input data. 11. The system according to claim 10 , wherein the instructions further configure the one or more processing devices to, in the second instance: determine the estimated time for the copy of the outlier task to complete based on processing rates at which the other tasks in the phase are processing the input data. 12. The system according to claim 10 , wherein the instructions further configure the one or more processing devices to: determine the estimated time for the copy of the outlier task to complete based on the amount of the input data that the outlier task has to process. 13. A system comprising: one or more processing devices; and one or more computer-readable storage devices comprising instructions which, when executed by one or more processing devices, cause the one or more processing devices to: monitor execution of a plurality of tasks associated with a job, the plurality of tasks comprising an individual task that is processing input data; after the individual task has already processed some of the input data and is continuing to process remaining input data: determine a rate at which the individual task is processing the input data and an amount of the remaining input data that the individual task has yet to process, determine an estimated remaining time for the individual task to complete based on the rate at which the individual task is processing the input data and the amount of the remaining input data, determine a predicted completion time for a new copy of the individual task, determine an estimated probability that the new copy of the individual task will complete sooner than the individual task based on the estimated remaining time for the individual task to complete and the predicted completion time for the new copy of the task, and while the individual task continues executing, schedule the new copy of the individual task that is currently executing when the estimated probability that the new copy of the individual task will complete sooner than the individual task exceeds a threshold. 14. The system according to claim 13 , wherein the instructions further cause the one or more processing devices to: schedule the new copy of the individual task to run concurrently with the individual task without killing and restarting the individual task. 15. The system according to claim 13 , wherein the instructions further cause the one or more processing devices to: determine whether an available processing slot is available for scheduling the new copy of the individual task; and schedule the new copy of the individual task when there is an available processing slot. 16. The system according to claim 14 , wherein the instructions further cause the one or more processing devices to: limit a number of copies of the individual task to a fixed number. 17. A method performed by at least one computing device, the method comprising: determining an estimated probability that output data of a completed task will be lost due to a fault on a server that executed the completed task, wherein the completed task has processed input data to obtain the output data; determining an estimated time to repeat the completed task based on an amount of the input data that was processed by the completed task; determining a cost to recompute the completed task based on both the estimated probability that the output data of the completed task will be lost and the estimated time to repeat the completed task; determining another estimated time to replicate the output data of the completed task by transferring the output data to another server; comparing the another estimated time to replicate the output data to the cost to recompute the completed task to determine whether to replicate the output data on the another server; and replicating the output data on the another server when th

Assignees

Inventors

Classifications

  • Constraint · CPC title

  • H04L67/325Primary

    Electricity · mapped topic

  • involving task migration · CPC title

  • G06F9/5038Primary

    considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration (scheduling strategies G06F9/4881 and subgroups) · CPC title

  • H04L67/62Primary

    Establishing a time schedule for servicing the requests · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9307048B2 cover?
The described implementations relate to distributed computing. One implementation provides a system that can include an outlier detection component that is configured to identify an outlier task from a plurality of tasks based on runtimes of the plurality of tasks. The system can also include a cause evaluation component that is configured to evaluate a cause of the outlier task. For example, t…
Who is the assignee on this patent?
Kandula Srikanth, Ananthanarayanan Ganesh, Greenberg Albert, and 4 more
What technology area does this patent fall under?
Primary CPC classification H04L67/325. Mapped technology areas include Electricity.
When was this patent published?
Publication date Tue Apr 05 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).