Cost-aware replication of intermediate data in dataflows

US8949558B2 · US · B2

Patent metadata
Field: Value
Publication number: US-8949558-B2
Application number: US-201113097200-A
Country: US
Kind code: B2
Filing date: Apr 29, 2011
Priority date: Apr 29, 2011
Publication date: Feb 3, 2015
Grant date: Feb 3, 2015

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Described herein are methods, systems, apparatuses and products for cost-aware replication of intermediate data in dataflows. An aspect provides receiving at least one measurement indicative of a reliability cost associated with executing a dataflow; computing a degree of replication of at least one intermediate data set in the dataflow based on the reliability cost; and communicating at least one replication factor to at least one component of a system responsible for replication of the at least one intermediate data set in the dataflow; wherein the at least one intermediate data set is replicated according to the replication factor. Other embodiments are disclosed.
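The abstract's three steps (receive cost measurements, compute a degree of replication, communicate a replication factor) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the `Stage` fields, the independent per-copy loss probability `loss_prob`, the storage budget used as the constraint, and the brute-force search. The sketch only shows the general idea of minimizing a reliability cost that is the sum of a replication cost and a regeneration cost.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Stage:
    """Hypothetical per-stage measurements (names and units are assumptions)."""
    name: str
    replication_cost: float   # cost of creating one extra copy
    regeneration_cost: float  # cost of recomputing the data set if every copy is lost
    size: float               # storage used by one copy


def reliability_cost(stage, copies, loss_prob):
    """Replication cost plus expected regeneration cost for one stage.

    Assumes each copy is lost independently with probability loss_prob,
    so the data set must be regenerated with probability loss_prob ** copies.
    """
    replication = (copies - 1) * stage.replication_cost
    expected_regeneration = (loss_prob ** copies) * stage.regeneration_cost
    return replication + expected_regeneration


def choose_replication_factors(stages, loss_prob, max_copies, storage_budget):
    """Exhaustively search copy counts per stage; return the cheapest feasible plan."""
    best_plan, best_cost = None, float("inf")
    for plan in product(range(1, max_copies + 1), repeat=len(stages)):
        # Constraint (an assumption): extra copies must fit in the storage budget.
        extra_storage = sum((c - 1) * s.size for s, c in zip(stages, plan))
        if extra_storage > storage_budget:
            continue
        cost = sum(reliability_cost(s, c, loss_prob) for s, c in zip(stages, plan))
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    if best_plan is None:
        raise ValueError("no plan satisfies the storage budget")
    return {s.name: c for s, c in zip(stages, best_plan)}, best_cost


if __name__ == "__main__":
    stages = [
        Stage("map", replication_cost=1.0, regeneration_cost=50.0, size=10.0),
        Stage("join", replication_cost=5.0, regeneration_cost=8.0, size=40.0),
    ]
    plan, cost = choose_replication_factors(
        stages, loss_prob=0.1, max_copies=3, storage_budget=50.0
    )
    print(plan, round(cost, 3))
```

In this toy model the stage that is expensive to regenerate and cheap to copy gets an extra replica, while the stage that is cheap to regenerate keeps a single copy. Exhaustive search is only practical for a handful of stages; a real system would presumably use a proper optimization method.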

First claim

Opening claim text (preview).

What is claimed is:

1. A computer program product comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to receive at least one measurement indicative of a reliability cost associated with executing a dataflow; computer readable program code configured to compute a degree of replication of at least one intermediate data set in the dataflow based on the reliability cost; wherein the reliability cost comprises a metric based at least on a sum of two sub-costs, the two sub-costs comprising a cost of replication for the at least one intermediate data set in the dataflow and a cost of regeneration for the at least one intermediate data set in the dataflow; computer readable program code configured to minimize the reliability cost; and computer readable program code configured to communicate at least one replication factor, corresponding to the computed degree of replication, to at least one component responsible for replication of the at least one intermediate data set in the dataflow; wherein the at least one intermediate data set is replicated according to the replication factor.

2. The computer program product according to claim 1, wherein the at least one measurement indicative of a reliability cost associated with executing the dataflow includes at least one measurement relevant to determining how much intermediate data to replicate at one or more stages of the dataflow.

3. The computer program product according to claim 1, wherein to compute at least one replication factor further comprises solving a constrained optimization problem.

4. The computer program product according to claim 3, wherein the constrained optimization problem takes into account at least one of the cost of replication for the at least one intermediate data set in the dataflow and the cost of regeneration for the at least one intermediate data set in the dataflow.

5. The computer program product according to claim 4, wherein the cost of replication comprises a cost incurred for at least one of creating or destroying replicas of the at least one intermediate data set in the dataflow.

6. The computer program product according to claim 4, wherein the cost of regeneration comprises a cost of regenerating the at least one intermediate data set in the dataflow.

7. The computer program product according to claim 1, wherein the at least one measurement indicative of a reliability cost associated with executing the dataflow is obtained from at least one sensor that monitors compute stages in the dataflow at run time.

8. The computer program product according to claim 1, wherein the computer readable program code configured to compute a degree of replication of at least one intermediate data set in the dataflow is further configured to compute a degree of replication responsive to at least one of: a predetermined, periodic timing mechanism; and a completion of a compute stage in the dataflow.

9. The computer program product according to claim 1, wherein the at least one replication factor comprises at least one of: an instruction to replicate an intermediate data set of the dataflow for a particular stage; and an instruction to delete a replica for an intermediate data set of the dataflow already replicated for a particular stage.

10. The computer program product according to claim 1, wherein the dataflow comprises a stage wise data computation process in which at least one subsequent stage depends on an intermediate data set computed at a preceding stage.

11. The computer program product according to claim 1, further comprising computer readable program code configured to provide software as a service in a distributed computing environment.

12. A method comprising: receiving at least one measurement indicative of a reliability cost associated with executing a dataflow; computing a degree of replication of at least one intermediate data set in the dataflow based on the reliability cost; wherein the reliability cost comprises a metric based at least on a sum of two sub-costs, the two sub-costs comprising a cost of replication for the at least one intermediate data set in the dataflow and a cost of regeneration for the at least one intermediate data set in the dataflow; minimizing the reliability cost; and communicating at least one replication factor, corresponding to the computed degree of replication, to at least one component responsible for replication of the at least one intermediate data set in the dataflow; wherein the at least one intermediate data set is replicated according to the replication factor.

13. The method according to claim 12, wherein to compute at least one replication factor further comprises solving a constrained optimization problem.

14. The method according to claim 13, wherein the constrained optimization problem takes into account at least one of the cost of replication for the at least one intermediate data set in the dataflow and the cost of regeneration for the at least one intermediate data set in the dataflow.

15. The method according to claim 14, wherein the cost of replication comprises a cost incurred for at least one of creating or destroying replicas of the at least one intermediate data set in the dataflow.

16. The method according to claim 14, wherein the cost of regeneration comprises a cost of regenerating the at least one intermediate data set in the dataflow.

17. The method according to claim 12, wherein the at least one measurement indicative of a reliability cost associated with executing the dataflow is obtained from at least one sensor that monitors compute stages in the dataflow at run time.

18. The method according to claim 12, wherein computing a degree of replication of at least one intermediate data set in the dataflow further comprises computing a degree of replication responsive to at least one of: a predetermined, periodic timing mechanism; and a completion of a compute stage in the dataflow.

19. The method according to claim 12, wherein the at least one replication factor comprises at least one of: an instruction to replicate an intermediate data set of the dataflow for a particular stage; and an instruction to delete a replica for an intermediate data set of the dataflow already replicated for a particular stage.

20. The method according to claim 12, wherein the dataflow comprises a stage wise data computation process in which at least one subsequent stage depends on an intermediate data set computed at a preceding stage.

21. The method according to claim 12, further comprising providing software as a service in a distributed computing environment.

22. A system comprising: at least one processor; and a memory device operatively connected to the at least one processor; wherein, responsive to execution of program instructions accessible to the at least one processor, the at least one processor is configured to: receive at least one measurement indicative of a reliability cost associated with executing a dataflow; compute a degree of replication of at least one intermediate data set in the dataflow based on the reliability cost; wherein the reliability cost comprises a metric based at least on a sum of two sub-costs, the two sub-costs comprising a cost of replication for the at least one intermediate data set in the dataflow and a cost of regeneration for the a

Assignees

Inventors

Classifications

  • Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor · CPC title

  • Physics · mapped topic

  • G06F12/02 · Primary

    Addressing or allocation; Relocation (program address sequencing G06F9/00; arrangements for selecting an address in a digital store G11C8/00) · CPC title

  • Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mapping at compile time, see G06F8/451) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US8949558B2 cover?
Described herein are methods, systems, apparatuses and products for cost-aware replication of intermediate data in dataflows. An aspect provides receiving at least one measurement indicative of a reliability cost associated with executing a dataflow; computing a degree of replication of at least one intermediate data set in the dataflow based on the reliability cost; and communicating at least …
Who is the assignee on this patent?
Castillo Claris, Steinder Malgorzata, Tantawi Asser Nasreldin, and 1 more
What technology area does this patent fall under?
Primary CPC classification G06F12/02. Mapped technology areas include Physics.
When was this patent published?
Publication date: Feb 3, 2015 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).