What technology area does this patent fall under?

Primary CPC classification G06F9/4881. Mapped technology areas include Physics.

When was this patent published?

Publication date Tue Jan 26 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Dataflow execution time estimation for in-memory distributed processing framework

US10901782B2 · US · B2

Patent metadata
Field	Value
Publication number	US-10901782-B2
Application number	US-201816040774-A
Country	US
Kind code	B2
Filing date	Jul 20, 2018
Priority date	Jul 20, 2018
Publication date	Jan 26, 2021
Grant date	Jan 26, 2021

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Techniques are provided for dataflow execution time estimation for distributed processing frameworks. An exemplary method comprises: obtaining an input dataset for a dataflow for execution; determining a substantially minimal data unit for a given operation of the dataflow processed by the given operation; estimating a number of rounds required to execute a number of data units in the input dataset using nodes assigned to execute the given operation; determining an execution time spent by the given operation to process one data unit; estimating the execution time for the given operation based on the execution time spent by the given operation to process one data unit and the number of rounds required to execute the number of data units in the input dataset; and executing the given operation with the input dataset. A persistent cost model is optionally employed to record the execution times of known dataflow operations.

First claim

Opening claim text (preview).

What is claimed is: 1. A method, comprising: obtaining an input dataset for a dataflow for execution in a distributed processing framework comprising a plurality of processing nodes; determining a data unit for a given operation of the dataflow, wherein the data unit is based on a size of a partition of data processed by each of the processing nodes assigned to execute the given operation, wherein the partitioning of the input dataset is determined for the given operation of the dataflow; estimating a number of rounds required to execute a number of data units in the input dataset using the processing nodes of the distributed processing framework assigned to execute the given operation, wherein the number of rounds is based on a relative number of the data units processed by each of the processing nodes assigned to execute the given operation; determining, using at least one processing device, a round execution time for one or more of the processing nodes assigned to execute the given operation to process one data unit; and estimating, using the at least one processing device, an operation execution time for the given operation to process the input dataset based on the round execution time and the number of rounds required to execute the number of data units in the input dataset, wherein the given operation is executed with the input dataset in one or more of the distributed processing framework and a second distributed processing framework. 2. The method of claim 1 , wherein the step of determining the round execution time further comprises the steps of instrumenting source code of the given operation to obtain the start and end times of the given operation; and executing the instrumented code to capture the round execution time of the given operation. 3. The method of claim 1 , further comprising the step of generating a persistent cost model containing the operation execution times of known dataflow operations. 4. The method of claim 3 , wherein the persistent cost model is instantiated in a data structure indexed by the given operation, an input data structure, input data, and an infrastructure of the distributed processing framework; and wherein the persistent cost model indicates the total execution time of the given operation. 5. The method of claim 3 , wherein the persistent cost model provides an estimate for an execution of a newly submitted dataflow, and wherein the persistent cost model is extended as new operations are executed. 6. The method of claim 1 , further comprising the step of capturing the execution time of the execution of the given operation and comparing the captured execution time with the estimated operation execution time from the estimating step. 7. The method of claim 6 , further comprising the step of updating the operation execution time for the given operation in a cost model when the difference between the captured execution time and the estimated operation execution time from the estimating step exceeds a predefined threshold. 8. The method of claim 1 , wherein the execution of the given operation comprises executing a single round of the given operation when the given operation comprises a sample operation. 9. A system, comprising: a memory; and at least one processing device, coupled to the memory, operative to implement the following steps: obtaining an input dataset for a dataflow for execution in a distributed processing framework comprising a plurality of processing nodes; determining a data unit for a given operation of the dataflow, wherein the data unit is based on a size of a partition of data processed by each of the processing nodes assigned to execute the given operation, wherein the partitioning of the input dataset is determined for the given operation of the dataflow; estimating a number of rounds required to execute a number of data units in the input dataset using the processing nodes of the distributed processing framework assigned to execute the given operation, wherein the number of rounds is based on a relative number of the data units processed by each of the processing nodes assigned to execute the given operation; determining, using at least one processing device a round execution time for one or more of the processing nodes assigned to execute the given operation to process one data unit; and estimating, using the at least one processing device, an operation execution time for the given operation to process the input dataset based on the round execution time and the number of rounds required to execute the number of data units in the input dataset, wherein the given operation is executed with the input dataset in one or more of the distributed processing framework and a second distributed processing framework. 10. The system of claim 9 , wherein the step of determining the round execution time further comprises the steps of instrumenting source code of the given operation to obtain the start and end times of the given operation; and executing the instrumented code to capture the round execution time of the given operation. 11. The system of claim 9 , further comprising the step of generating a persistent cost model containing the operation execution times of known dataflow operations. 12. The system of claim 11 , wherein the persistent cost model provides an estimate for an execution of a newly submitted dataflow, and wherein the persistent cost model is extended as new operations are executed. 13. The system of claim 9 , further comprising the step of capturing the execution time of the execution of the given operation and comparing the captured execution time with the estimated operation execution time from the estimating step. 14. The system of claim 9 , wherein the execution of the given operation comprises executing a single round of the given operation when the given operation comprises a sample operation. 15. A computer program product, comprising a non-transitory machine-readable storage medium having encoded therein executable code of one or more software programs, wherein the one or more software programs when executed by at least one processing device perform the following steps: obtaining an input dataset for a dataflow for execution in a distributed processing framework comprising a plurality of processing nodes; determining a data unit for a given operation of the dataflow, wherein the data unit is based on a size of a partition of data processed by each of the processing nodes assigned to execute the given operation, wherein the partitioning of the input dataset is determined for the given operation of the dataflow; estimating a number of rounds required to execute a number of data units in the input dataset using the processing nodes of the distributed processing framework assigned to execute the given operation, wherein the number of rounds is based on a relative number of the data units processed by each of the processing nodes assigned to execute the given operation; determining, using at least one processing device, a round execution time for one or more of the processing nodes assigned to execute the given operation to process one data unit; and estimating, using the at least one processing device, an operation execution time for the given operation to process the input dataset based on the round execution time and the number of rounds required to execute the number of data units in the input dataset, wherein the given operation is executed with the input dataset in one or more of the distributed processing framework and a second distributed processing framework. 16. The computer program product of claim 1

Assignees

Emc Ip Holding Co Llc

Inventors

Classifications

G06F9/4881Primary
Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title
G06F9/485
Task life-cycle, e.g. stopping, restarting, resuming execution (G06F9/4881 takes precedence) · CPC title
G06F9/5083
Techniques for rebalancing the load in a distributed system · CPC title

Patent family

Related publications grouped by family.

View patent family 69161771

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10901782B2 cover?: Techniques are provided for dataflow execution time estimation for distributed processing frameworks. An exemplary method comprises: obtaining an input dataset for a dataflow for execution; determining a substantially minimal data unit for a given operation of the dataflow processed by the given operation; estimating a number of rounds required to execute a number of data units in the input dat…
Who is the assignee on this patent?: Emc Ip Holding Co Llc
What technology area does this patent fall under?: Primary CPC classification G06F9/4881. Mapped technology areas include Physics.
When was this patent published?: Publication date Tue Jan 26 2021 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?: We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).