Partition-based data stream processing framework

US10635644B2 · US · B2

Patent metadata
FieldValue
Publication numberUS-10635644-B2
Application numberUS-201314077167-A
CountryUS
Kind codeB2
Filing dateNov 11, 2013
Priority dateNov 11, 2013
Publication dateApr 28, 2020
Grant dateApr 28, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A control node of a multi-tenant stream processing service receives a request indicating an operation to be performed on data records of a particular data stream. Based on a stream partitioning policy, the control node determines an initial number of worker nodes to be used. The control node configures a worker node to perform the operation on received records. In response to a determination that the worker node is in an unhealthy state, the control node configures a replacement worker node.

First claim

Opening claim text (preview).

What is claimed is: 1. A system, comprising: one or more computing devices that implement a multi-tenant stream processing service configured to: implement one or more programmatic interfaces configured to receive configuration information for processing a data stream, wherein the configuration information specifies a plurality of processing stages for the processing, and for one or more individual ones of the processing stages: (a) a partitioning policy to divide data records received by the processing stage over time into a plurality of stream partitions to be distributed among worker nodes of the processing stage, (b) a processing operation to be performed by the worker nodes of the processing stage on the stream partitions in accordance with the partitioning policy, and (c) an output distribution descriptor for results of the processing stage; receive, from a particular client via the one or more programmatic interfaces, configuration information for processing a particular data stream specifying at least a first processing stage to be performed according to a first partitioning policy specified in the configuration information and a second processing stage to be performed according to a second partitioning policy specified in the configuration information, and wherein output of the first processing stage is to be provided as input to the second processing stage; determine, for the second processing stage, based at least in part on the second partitioning policy and an estimated performance capability of resources to be deployed as worker nodes for the second processing stage, an initial number of worker nodes for the second processing stage; assign, based at least in part on the second partitioning policy, individual ones of a plurality of stream partitions of the output of the first processing stage to respective ones of the initial number of worker nodes for the second processing stage, wherein one or more of the stream partitions are assigned to a particular worker node of the second processing stage and another one of the stream partitions is assigned to another worker node of the second processing stage; cause the particular worker node of the second processing stage to (a) receive data records of the one or more stream partitions in accordance with the second partitioning policy, (b) perform a particular processing operation on received data records of the one or more stream partitions, (c) store progress records indicating most recently processed data records that the particular worker node has processed, and (d) transfer results of the particular processing operation to one or more destinations in accordance with a particular output distribution descriptor for the second processing stage; cause the other worker node to perform the particular processing operation on received data records of the other stream partition; monitor a health state of the particular worker node; and in response to a determination that the particular worker node is in an undesired state, configure a replacement worker node for the second processing stage to replace the particular worker node, wherein the replacement worker node accesses a progress record stored by the particular worker node being replaced to identify, based at least in part on the progress record, a next data record of the one or more stream partitions on which to perform the particular processing operation by the replacement worker node. 2. The system as recited in claim 1 , wherein an output distribution descriptor for the first processing stage that indicates that results of the first processing stage are to be distributed, as data records of a different data stream, to one or more ingestion nodes of the second processing stage configured for the different data stream, in accordance with the second partitioning policy. 3. The system as recited in claim 1 , wherein to cause the particular worker node and the other worker node to perform the particular processing operation, the one or more computing devices are further configured to: instantiate and configure respective virtual machines as the particular worker node and the other worker node. 4. The system as recited in claim 1 , wherein the one or more computing devices is further configured to: in response to a determination that a different worker node configured to perform another processing operation for another processing stage is in an undesired state, configure a different replacement worker node to perform the other processing operation on one or more subsequently-received data records determined without accessing a progress record stored by the different worker node. 5. The system as recited in claim 1 , wherein the one or more computing devices is further configured to: in response to a determination that a workload level at a different worker node of the first processing stage meets a triggering criterion, implement a stage reconfiguration operation comprising one or more of: (a) a dynamic repartitioning of the particular data stream performed while additional data records of the particular data stream continue to be processed, (b) an assignment of an alternate worker node to at least one stream partition previously being processed at the different worker node, (c) a change to a number of worker nodes configured for the first processing stage, or (d) a transfer of a worker node of the first processing stage from one server to another server. 6. A method, comprising: performing, by one or more computing devices that implement a multi-tenant stream processing service: implementing one or more programmatic interfaces configured to receive configuration information for processing a data stream, wherein the configuration information specifies a plurality of processing stages for the processing, and for one or more individual ones of the processing stages: (a) a partitioning policy to divide data records received by the processing stage over time into a plurality of stream partitions to be distributed among worker nodes of the processing stage, (b) a processing operation to be performed by the worker nodes of the processing stage on the stream partitions in accordance with the partitioning policy, and (c) an output distribution descriptor for results of the processing stage; receiving, from a particular client via the one or more programmatic interfaces, configuration information for processing a particular data stream specifying at least a first processing stage to be performed according to a first partitioning policy specified in the configuration information and a second processing stage to be performed according to a second partitioning policy specified in the configuration information, and wherein output of the first processing stage is to be provided as input to the second processing stage; determining, for the second processing stage, based at least in part on the second partitioning policy and an estimated performance capability of resources to be deployed as worker nodes for the second processing stage, an initial number of worker nodes for the second processing stage; assigning, based at least in part on the second partitioning policy, individual ones of a plurality of stream partitions of the output of the first processing stage to respective ones of the initial number of worker nodes for the second processing stage, wherein one or more of the stream partitions are assigned to a particular worker node of the second processing stage and another one of the stream partitions is assigned to another worker node of the second processing stage; causing the particular worker node of the second processing stage to perform: (a) receiving data records of the one or more stream partitions in accordance with the second partitioning policy, (b) performing a particular proce

Assignees

Inventors

Classifications

  • where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title

  • eliminating a faulty processor or activating a spare · CPC title

  • Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mappping at compile time, see G06F8/451) · CPC title

  • Backup restoration techniques · CPC title

  • G06F16/21Primary

    Design, administration or maintenance of databases · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10635644B2 cover?
A control node of a multi-tenant stream processing service receives a request indicating an operation to be performed on data records of a particular data stream. Based on a stream partitioning policy, the control node determines an initial number of worker nodes to be used. The control node configures a worker node to perform the operation on received records. In response to a determination th…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F16/21. Mapped technology areas include Physics.
When was this patent published?
Publication date Tue Apr 28 2020 00:00:00 GMT+0000 (Coordinated Universal Time) (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).