Dynamic Shard Allocation Adjustment

US2016011901A1 · US · A1

Patent metadata
FieldValue
Publication numberUS-2016011901-A1
Application numberUS-201414327338-A
CountryUS
Kind codeA1
Filing dateJul 9, 2014
Priority dateJul 9, 2014
Publication dateJan 14, 2016
Grant date

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The present disclosure relates to dynamically adjusting shard allocation during parallel processing operations. One example method includes determining a target completion time for a batch data processing job of an input data set performed by a plurality of tasks, each of the plurality of tasks processing a different input shard including a different portion of the input data set; identifying a first task having an estimated completion time greater than the target completion time of the batch data processing job; and splitting the first input shard into a first split input shard and a second split input shard different from the first split input shard, the first split input shard including a first portion of the first input shard, and the second split input shard including a second portion of the first input shard different from the first portion.

First claim

Opening claim text (preview).

What is claimed is: 1 . A computer-implemented method executed by one or more processors, the method comprising: determining a target completion time for a batch data processing job of an input data set, the batch data processing job performed by a plurality of tasks, each of the plurality of tasks processing a different input shard that includes a different portion of the input data set; identifying, from the plurality of tasks, a first task having an estimated completion time greater than the target completion time of the batch data processing job, the estimated completion time representing an estimated time at which the first task will complete processing the portion of the input data set at a first input shard being processed by the first task; and splitting the first input shard into a first split input shard and a second split input shard different from the first split input shard, the first split input shard including a first portion of the first input shard, and the second split input shard including a second portion of the first input shard different from the first portion. 2 . The method of claim 1 , further comprising: assigning the first split input shard to be processed by a second task; and inserting the second split input shard into a pool of unassigned shard to await processing by an available task. 3 . The method of claim 2 , wherein the second task is the first task and the third task is an idle task associated with the batch data processing job. 4 . The method of claim 1 , further comprising: identifying that the batch data processing job has entered an endgame state where all shards associated with the input data set are being processed by a task and there is at least one idle task associated with the batch processing job, wherein identifying the first task and splitting the first input shard are performed after identifying that the batch data processing job has entered the endgame state. 5 . The method of claim 1 , wherein each of the plurality of tasks includes an estimated completion time representing an estimated time at which the task will complete processing its input shard. 6 . The method of claim 5 , wherein the target completion time is determined based at least in part on the estimated completion times for the plurality of tasks for the batch data processing job. 7 . The method of claim 5 , further comprising determining the estimated completion time for each of the plurality of tasks based at least in part on at least one of: a size of a portion of the input shard remaining to be processed for each of the plurality of tasks, a processing rate associated with each of the plurality of tasks for a portion of the input shard already processed for each of the plurality of tasks, or a recent processing rate associated with each of the plurality of tasks for a portion of the input shard processed within a time window for each of the plurality of tasks. 8 . The method of claim 1 , wherein the target completion time is determined based on at least one of: historical information associated with the batch data processing job, or historical information associated with other batch data processing jobs. 9 . The method of claim 1 , wherein the target completion time is a deadline specified by an administrator of the batch data processing job for finishing the batch data processing job. 10 . The method of claim 1 , wherein identifying the first task and splitting the first input shard are performed by the first task itself. 11 . The method of claim 10 , wherein identifying the first task includes: receiving, by the first task from a supervisor process associated with the batch data processing job, the target completion time; and comparing, by the first task, the target completion time to the estimated completion time to determine that the first task will not complete by the target completion time. 12 . The method of claim 11 , wherein splitting the first task includes: sending, by the first task, an indication of a portion of the first input shard that the first task will not be able to process by the target completion time to the supervisor process; and assigning, by the supervisor process, the portion of the first input shard to an idle task for processing. 13 . The method of claim 1 , wherein determining a target completion time for a batch data processing job includes computing an average of the estimated completion times of the plurality of tasks and of estimated completion times for one or more idle tasks associated with the batch data processing job, wherein the estimated completion times for the one or more idle tasks have a value of 0. 14 . A non-transitory, computer-readable medium storing instructions operable when executed to cause at least one processor to perform operations comprising: determining a target completion time for a batch data processing job of an input data set, the batch data processing job performed by a plurality of tasks, each of the plurality of tasks processing a different input shard that includes a different portion of the input data set; identifying, from the plurality of tasks, a first task having an estimated completion time greater than the target completion time of the batch data processing job, the estimated completion time representing an estimated time at which the first task will complete processing the portion of the input data set at a first input shard being processed by the first task; and splitting the first input shard into a first split input shard and a second split input shard different from the first split input shard, the first split input shard including a first portion of the first input shard, and the second split input shard including a second portion of the first input shard different from the first portion. 15 . The computer-readable medium of claim 14 , further comprising: assigning the first split input shard to be processed by a second task; and inserting the second split input shard into a pool of unassigned shard to await processing by an available task. 16 . The computer-readable medium of claim 15 , wherein the second task is the first task and the third task is an idle task associated with the batch data processing job. 17 . The computer-readable medium of claim 14 , the operations further comprising: identifying that the batch data processing job has entered an endgame state where all shards associated with the input data set are being processed by a task and there is at least one idle task associated with the batch processing job, wherein identifying the first task and splitting the first input shard are performed after identifying that the batch data processing job has entered the endgame state. 18 . The computer-readable medium of claim 14 , wherein each of the plurality of tasks includes an estimated completion time representing an estimated time at which the task will complete processing its input shard. 19 . The computer-readable medium of claim 18 , wherein the target completion time is determined based at least in part on the estimated completion times for the plurality of tasks for the batch data processing job. 20 . A system comprising: memory for storing data; and one or more processors operable to perform operations comprising: determining a target completion time for a batch data processing job of an input data set, the batch data processing job performed by a plurality of tasks, each of the plurality of tasks processing a different input shard that includes a differ

Assignees

Inventors

Classifications

  • G06F9/4843Primary

    by program, e.g. task dispatcher, supervisor, operating system · CPC title

  • G06F9/46Primary

    Multiprogramming arrangements · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2016011901A1 cover?
The present disclosure relates to dynamically adjusting shard allocation during parallel processing operations. One example method includes determining a target completion time for a batch data processing job of an input data set performed by a plurality of tasks, each of the plurality of tasks processing a different input shard including a different portion of the input data set; identifying a…
Who is the assignee on this patent?
Google Inc
What technology area does this patent fall under?
Primary CPC classification G06F9/4843. Mapped technology areas include Physics.
When was this patent published?
Publication date Thu Jan 14 2016 00:00:00 GMT+0000 (Coordinated Universal Time) (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).