Distributed execution of data processing pipelines

US10866831B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10866831-B2
Application number  US-201816009996-A
Country             US
Kind code           B2
Filing date         Jun 15, 2018
Priority date       Jun 15, 2018
Publication date    Dec 15, 2020
Grant date          Dec 15, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method for executing a data processing pipeline may be provided. The method may include identifying a file providing a runtime environment required for executing a series of data processing operations comprising the data processing pipeline. The file may be identified based on one or more tags associated with the data processing pipeline. The one or more tags may specify at least one runtime requirement for the series of data processing operations. The file may be executed to generate an executable package that includes a plurality of components required for executing the series of data processing operations. The series of data processing operations included in the data processing pipeline may be executed by at least executing the executable package to provide the runtime environment required for executing the series of data processing operations. Related systems and articles of manufacture, including computer program products, are also provided.
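The tag-based identification described in the abstract can be illustrated with a minimal sketch. It assumes that tags are plain strings and that a file qualifies when the tags it supports cover every tag the pipeline requires; the text on this page does not specify the matching rule. The names `RuntimeFile` and `find_runtime_file`, the file names, and the example tags are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RuntimeFile:
    """A script that, when executed, builds an executable package (hypothetical model)."""
    name: str
    supported_tags: frozenset


def find_runtime_file(required_tags: Iterable[str],
                      candidates: Iterable[RuntimeFile]) -> Optional[RuntimeFile]:
    """Return the first candidate whose tags cover every required tag, else None."""
    required = frozenset(required_tags)
    for candidate in candidates:
        # Assumed rule: the required tags must be a subset of the supported tags.
        if required <= candidate.supported_tags:
            return candidate
    return None


files = [
    RuntimeFile("python-base.script", frozenset({"python3"})),
    RuntimeFile("python-ml.script", frozenset({"python3", "numpy"})),
]
selected = find_runtime_file({"python3", "numpy"}, files)
print(selected.name if selected else "no match")
```

A subset test is only one plausible reading of "matching"; an exact-equality or scored match would fit the same description.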

First claim

Opening claim text (preview).

What is claimed is:

1. A system, comprising: at least one data processor; and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: dividing a graph representative of a data processing pipeline, the data processing pipeline including a series of data processing operations, the graph being divided into a first subgraph and a second subgraph, each of the first subgraph and the second subgraph including some but not all of the series of data processing operations included in the data processing pipeline; identifying a first file providing a first runtime environment required for executing a first data processing operation included in the first subgraph; identifying a second file providing a second runtime environment required for executing a second data processing operation included in the second subgraph, each of the first file and the second file comprising a script including a sequence of instructions, and executing the sequence of instructions generates an executable package that includes a plurality of components required for executing each of the first data processing operation and the second data processing operation; and triggering an execution of the data processing pipeline by at least sending the first file to a first computing node in a cluster of computing nodes and the second file to a second computing node in the cluster of computing nodes, the first computing node being scheduled to execute the first data processing operation included in the first subgraph, and the second computing node being scheduled to execute the second data processing operation included in the second subgraph.

2. The system of claim 1, wherein the first computing node is further scheduled to execute a third data processing operation included in the first subgraph, wherein the second runtime environment is required for executing the third data processing operation, and wherein the execution of the data processing pipeline is further triggered by at least sending, to the first computing node, the second file.

3. The system of claim 1, further comprising: scheduling the first computing node to execute the first data processing operation and the second computing node to execute the second data processing operation, the scheduling being based on at least one of a load balance and available resources across the cluster of computing nodes.

4. The system of claim 1, wherein the plurality of components required for executing the first data processing operation includes at least one of a programming code, a runtime, a library, an environment variable, and a configuration file.

5. The system of claim 1, wherein the graph includes a plurality of nodes interconnected by one or more edges, wherein each of the plurality of nodes corresponds to a data processing operation from the series of data processing operations, and wherein the one or more edges indicate a flow of data between different data processing operations.

6. The system of claim 1, wherein the first file is identified based at least on a first tag associated with the first file matching a second tag associated with the first data processing operation, wherein the first tag specifies at least one runtime requirement supported by the first file, and wherein the second tag specifies at least one runtime requirement of the first data processing operation.

7. The system of claim 1, wherein executing the first data processing operation includes at least one of receiving data from a database and sending data to the database.

8. A computer-implemented method, comprising: dividing a graph representative of a data processing pipeline, the data processing pipeline including a series of data processing operations, the graph being divided into a first subgraph and a second subgraph, each of the first subgraph and the second subgraph including some but not all of the series of data processing operations included in the data processing pipeline; identifying a first file providing a first runtime environment required for executing a first data processing operation included in the first subgraph; identifying a second file providing a second runtime environment required for executing a second data processing operation included in the second subgraph, each of the first file and the second file comprising a script including a sequence of instructions, and executing the sequence of instructions generates an executable package that includes a plurality of components required for executing each of the first data processing operation and the second data processing operation; and triggering an execution of the data processing pipeline by at least sending the first file to a first computing node in a cluster of computing nodes and the second file to a second computing node in the cluster of computing nodes, the first computing node being scheduled to execute the first data processing operation included in the first subgraph, and the second computing node being scheduled to execute the second data processing operation included in the second subgraph.

9. The method of claim 8, wherein the first computing node is further scheduled to execute a third data processing operation included in the first subgraph, wherein the second runtime environment is required for executing the third data processing operation, and wherein the execution of the data processing pipeline is further triggered by at least sending, to the first computing node, the second file.

10. The method of claim 8, further comprising: scheduling the first computing node to execute the first data processing operation and the second computing node to execute the second data processing operation, the scheduling being based on at least one of a load balance and available resources across the cluster of computing nodes.

11. The method of claim 8, wherein the plurality of components required for executing the first data processing operation includes at least one of a programming code, a runtime, a library, an environment variable, and a configuration file.

12. The method of claim 8, wherein the graph includes a plurality of nodes interconnected by one or more edges, wherein each of the plurality of nodes corresponds to a data processing operation from the series of data processing operations, and wherein the one or more edges indicate a flow of data between different data processing operations.

13. The method of claim 8, wherein the first file is identified based at least on a first tag associated with the first file matching a second tag associated with the first data processing operation, wherein the first tag specifies at least one runtime requirement supported by the first file, and wherein the second tag specifies at least one runtime requirement of the first data processing operation.

14. A non-transitory computer-readable medium storing instructions, which when executed by at least one data processor, result in operations comprising: dividing a graph representative of a data processing pipeline, the data processing pipeline including a series of data processing operations, the graph being divided into a first subgraph and a second subgraph, each of the first subgraph and the second subgraph including some but not all of the series of data processing operations included in the data processing pipeline; identifying a first file providing a first runtime environment required for executing a first data processing operation included in the first subgraph; identifying a second file providing a second runtime environment required for executing a second data processing operation included in …
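The steps recited in the independent claims (dividing the pipeline graph, identifying a file per operation, and sending files to scheduled computing nodes) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the split along a topological order, the least-loaded-first scheduling rule, and every operation, file, and node name below are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Pipeline graph: each operation maps to the operations that consume its output.
edges = {
    "read": ["clean"],
    "clean": ["train"],
    "train": ["write"],
    "write": [],
}
# Hypothetical mapping from each operation to the file providing its runtime.
runtime_file = {
    "read": "db-client.script",
    "clean": "python-ml.script",
    "train": "python-ml.script",
    "write": "db-client.script",
}
# Hypothetical current load per computing node (lower means less busy).
node_load = {"node-a": 0.7, "node-b": 0.2, "node-c": 0.4}


def divide(edges):
    """Split operations into two subgraphs along a topological order (assumed strategy)."""
    # TopologicalSorter expects node -> predecessors, so invert the edges first.
    predecessors = {op: set() for op in edges}
    for src, targets in edges.items():
        for dst in targets:
            predecessors[dst].add(src)
    order = list(TopologicalSorter(predecessors).static_order())
    middle = len(order) // 2
    return order[:middle], order[middle:]


def plan(edges, runtime_file, node_load):
    """Assign each subgraph to a node and list the files that node must receive."""
    subgraphs = divide(edges)
    # Least-loaded nodes first: one simple reading of load-based scheduling.
    nodes = sorted(node_load, key=node_load.get)
    return {
        node: {"operations": ops,
               "files": sorted({runtime_file[op] for op in ops})}
        for node, ops in zip(nodes, subgraphs)
    }


for node, assignment in plan(edges, runtime_file, node_load).items():
    print(node, assignment)
```

In this chain-shaped example each subgraph contains operations that need both runtimes, so each scheduled node receives both files, which mirrors the situation in dependent claims 2 and 9.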

Assignees

SAP SE

Inventors

Classifications

  • G06F9/5027 (Primary)

    the resource being a machine, e.g. CPUs, Servers, Terminals · CPC title

  • Grid computing · CPC title

  • G06F9/4881 (Primary)

    Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues · CPC title

  • Techniques for rebalancing the load in a distributed system · CPC title

  • Task decomposition · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10866831B2 cover?
A method for executing a data processing pipeline may be provided. The method may include identifying a file providing a runtime environment required for executing a series of data processing operations comprising the data processing pipeline. The file may be identified based on one or more tags associated with the data processing pipeline. The one or more tags may specify at least one runtime …
Who is the assignee on this patent?
SAP SE
What technology area does this patent fall under?
Primary CPC classification G06F9/5027. Mapped technology areas include Physics.
When was this patent published?
Publication date Dec 15, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 12 related publications on this page (citations in our corpus or others sharing the same primary CPC).