Systems and Methods for Efficient Data Preprocessing of Machine Learning Workloads
US-2024403138-A1 · Dec 5, 2024 · US
US9256460B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9256460-B2 |
| Application number | US-201313843425-A |
| Country | US |
| Kind code | B2 |
| Filing date | Mar 15, 2013 |
| Priority date | Mar 15, 2013 |
| Publication date | Feb 9, 2016 |
| Grant date | Feb 9, 2016 |
Official abstract text for this publication.
Techniques are disclosed for qualified checkpointing of a data flow model having data flow operators and links connecting the data flow operators. A link of the data flow model is selected based on a set of checkpoint criteria. A checkpoint is generated for the selected link. The checkpoint is selected from different checkpoint types. The generated checkpoint is assigned to the selected link. The data flow model, having at least one link with no assigned checkpoint, is executed.
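As an illustration only, the select / generate / assign / execute sequence in the abstract might be sketched as follows. Every name here (`Link`, `DataFlowModel`, `assign_checkpoints`, `is_qualified`, `execute`, and the `before_sort` criterion) is hypothetical and not taken from the patent. The "execution" step only records what would happen on each link, so this is a toy model under those assumptions rather than the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# The five checkpoint types named in the claims.
CHECKPOINT_TYPES = ("retargetable", "connection", "parallel", "bottleneck", "recovery")


@dataclass
class Link:
    """A directed link between two data flow operators (hypothetical structure)."""
    source: str
    target: str
    checkpoint: Optional[str] = None  # None means no checkpoint is assigned


@dataclass
class DataFlowModel:
    operators: List[str]
    links: List[Link] = field(default_factory=list)


# A criterion inspects a link and returns a checkpoint type, or None if it does not apply.
Criterion = Callable[[Link], Optional[str]]


def assign_checkpoints(model: DataFlowModel, criteria: List[Criterion]) -> None:
    """Assign a checkpoint to each link that satisfies at least one criterion."""
    for link in model.links:
        for criterion in criteria:
            kind = criterion(link)
            if kind is not None:
                if kind not in CHECKPOINT_TYPES:
                    raise ValueError(f"unknown checkpoint type: {kind}")
                link.checkpoint = kind
                break  # first matching criterion wins (an assumption of this sketch)


def is_qualified(model: DataFlowModel) -> bool:
    """True when at least one link has no assigned checkpoint."""
    return any(link.checkpoint is None for link in model.links)


def execute(model: DataFlowModel) -> List[str]:
    """Toy execution: report, per link, whether a checkpoint would be written."""
    trace = []
    for link in model.links:
        action = f"write {link.checkpoint} checkpoint" if link.checkpoint else "stream through"
        trace.append(f"{link.source}->{link.target}: {action}")
    return trace


# Hypothetical four-operator flow; only the link feeding "sort" is checkpointed.
model = DataFlowModel(
    operators=["read", "transform", "sort", "write"],
    links=[Link("read", "transform"), Link("transform", "sort"), Link("sort", "write")],
)


def before_sort(link: Link) -> Optional[str]:
    return "bottleneck" if link.target == "sort" else None


assign_checkpoints(model, [before_sort])
trace = execute(model)
```

In this sketch two of the three links keep `checkpoint=None`, which corresponds to the abstract's condition that the executed model has at least one link with no assigned checkpoint.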
Opening claim text (preview).
What is claimed is:

1. A computer program product for qualified checkpointing of a data flow model having a plurality of data flow operators and a plurality of links connecting the data flow operators, the computer program product comprising: a non-transitory computer-readable medium having program code embodied therewith, the program code executable by one or more computer processors to: select a link of the plurality of links of the data flow model, that satisfies one or more checkpoint criteria from a predefined set of checkpoint criteria; generate a checkpoint for the selected link, wherein the checkpoint is selected from a retargetable checkpoint, a connection checkpoint, a parallel checkpoint, a bottleneck checkpoint, and a recovery checkpoint, wherein at least four of the retargetable checkpoint, the connection checkpoint, the parallel checkpoint, the bottleneck checkpoint, and the recovery checkpoint are generable by the program code, wherein the generated checkpoint is assigned to the selected link; and execute the data flow model, wherein at least one link of the plurality of links of the data flow model has no assigned checkpoint.

2. The computer program product of claim 1, wherein the predefined set of checkpoint criteria specifies to: generate the retargetable checkpoint between two sub-flows of different processing types; and generate the connection checkpoint between two sub-flows having different design focus properties.

3. The computer program product of claim 2, wherein the predefined set of checkpoint criteria further specifies to: generate the parallel checkpoint between two sub-flows having a measure of pipeline parallelism beyond a predefined threshold; generate the bottleneck checkpoint between an upstream sub-flow and a data flow operator identified as having a potential bottleneck; and generate the recovery checkpoint between an upstream sub-flow and a data flow operator having a highest measure of likelihood of failing among the plurality of data flow operators of the data flow model.

4. The computer program product of claim 3, wherein the data flow model is executable across different runtime engine types, wherein a first sub-flow of the data flow is executable on a retargetable engine type, and wherein multiple sub-flows of the data flow are executable in parallel.

5. The computer program product of claim 4, whereby the data flow supports both performance enhancement and failure recovery, without requiring full checkpointing of the data flow, wherein full checkpointing of the data flow comprises assigning a respective checkpoint to each link of the data flow.

6. The computer program product of claim 5, wherein the program code is of an application, wherein the application includes a request handler component, an engine selector component, an engine manager component, a score composer component, and an execution manager component.

7. The computer program product of claim 1, wherein generating the checkpoint for the selected link comprises: generating the retargetable checkpoint between two sub-flows of different processing types.

8. The computer program product of claim 1, wherein generating the checkpoint for the selected link comprises: generating the connection checkpoint between two sub-flows having different design focus properties.

9. The computer program product of claim 1 wherein generating the checkpoint for the selected link comprises: generating the parallel checkpoint between two sub-flows having a measure of pipeline parallelism beyond a predefined threshold.

10. The computer program product of claim 1, wherein generating the checkpoint for the selected link comprises: generating the bottleneck checkpoint between an upstream sub-flow and a data flow operator identified as having a potential bottleneck.

11. A system for qualified checkpointing of a data flow model having a plurality of data flow operators and a plurality of links connecting the data flow operators, the system comprising: one or more computer processors; a memory containing a program which, when executed by the one or more computer processors, is configured to perform an operation comprising: selecting a link of the plurality of links of the data flow model, that satisfies one or more checkpoint criteria from a predefined set of checkpoint criteria; generating a checkpoint for the selected link, wherein the checkpoint is selected from a retargetable checkpoint, a connection checkpoint, a parallel checkpoint, a bottleneck checkpoint, and a recovery checkpoint, wherein at least four of the retargetable checkpoint, the connection checkpoint, the parallel checkpoint, the bottleneck checkpoint, and the recovery checkpoint are generable by the program, wherein the generated checkpoint is assigned to the selected link; and executing the data flow model, wherein at least one link of the plurality of links of the data flow model has no assigned checkpoint.

12. The system of claim 11, wherein the predefined set of checkpoint criteria specifies to: generate the retargetable checkpoint between two sub-flows of different processing types; and generate the connection checkpoint between two sub-flows having different design focus properties.

13. The system of claim 12, wherein the predefined set of checkpoint criteria further specifies to: generate the parallel checkpoint between two sub-flows having a measure of pipeline parallelism beyond a predefined threshold; generate the bottleneck checkpoint between an upstream sub-flow and a data flow operator identified as having a potential bottleneck; and generate the recovery checkpoint between an upstream sub-flow and a data flow operator having a highest measure of likelihood of failing among the plurality of data flow operators of the data flow model.

14. The system of claim 13, wherein the data flow model is executable across different runtime engine types, wherein a first sub-flow of the data flow is executable on a retargetable engine type, and wherein multiple sub-flows of the data flow are executable in parallel.

15. The system of claim 14, whereby the data flow supports both performance enhancement and failure recovery, without requiring full checkpointing of the data flow, wherein full checkpointing of the data flow comprises assigning a respective checkpoint to each link of the data flow; wherein the program includes a request handler component, an engine selector component, an engine manager component, a score composer component, and an execution manager component.

16. The system of claim 11, wherein generating the checkpoint for the selected link comprises: generating the retargetable checkpoint between two sub-flows of different processing types.

17. The system of claim 11, wherein generating the checkpoint for the selected link comprises: generating the connection checkpoint between two sub-flows having different design focus properties.

18. The system of claim 11, wherein generating the checkpoint for the selected link comprises: generating the parallel checkpoint between two sub-flows having a measure of pipeline parallelism beyond a predefined threshold.

19. The system of claim 11, wherein generating the checkpoint for the selected link comprises: generating the bottleneck checkpoint between an upstream sub-flow and a data flow operator identified as having a potential bottleneck.

20. The system of claim 11, wherein generating the checkpoint for the selected link comprises: generating the recovery checkpo
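The checkpoint criteria recited in claims 2 and 3 (and mirrored in claims 12 and 13) could be expressed as a small rule function. The sketch below is a hedged illustration: the `SubFlow` fields, the example values such as `"database"` and `"performance"`, the function name `choose_checkpoint`, and above all the order in which the rules are tested are assumptions made here. The claims list the criteria but do not state a precedence among them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubFlow:
    """Hypothetical description of a sub-flow or operator at one end of a link."""
    name: str
    processing_type: str              # e.g. "database" or "parallel" (illustrative values)
    design_focus: str                 # e.g. "performance" or "reliability" (illustrative)
    potential_bottleneck: bool = False


def choose_checkpoint(
    upstream: SubFlow,
    downstream: SubFlow,
    pipeline_parallelism: float,      # measure for this pair of sub-flows
    parallelism_threshold: float,     # the "predefined threshold" of claim 3
    most_failure_prone: str,          # name of the operator most likely to fail
) -> Optional[str]:
    """Return a checkpoint type for the link upstream -> downstream, or None."""
    # Claim 2: sub-flows of different processing types.
    if upstream.processing_type != downstream.processing_type:
        return "retargetable"
    # Claim 2: sub-flows with different design focus properties.
    if upstream.design_focus != downstream.design_focus:
        return "connection"
    # Claim 3: pipeline parallelism beyond the predefined threshold.
    if pipeline_parallelism > parallelism_threshold:
        return "parallel"
    # Claim 3: downstream operator identified as a potential bottleneck.
    if downstream.potential_bottleneck:
        return "bottleneck"
    # Claim 3: downstream operator has the highest likelihood of failing.
    if downstream.name == most_failure_prone:
        return "recovery"
    # No criterion satisfied: the link stays without a checkpoint.
    return None


extract = SubFlow("extract", "database", "performance")
cleanse = SubFlow("cleanse", "parallel", "performance")
join = SubFlow("join", "parallel", "performance", potential_bottleneck=True)
load = SubFlow("load", "parallel", "performance")

kinds = [
    choose_checkpoint(extract, cleanse, 0.2, 0.5, "load"),  # processing types differ
    choose_checkpoint(cleanse, join, 0.2, 0.5, "load"),     # join flagged as bottleneck
    choose_checkpoint(join, load, 0.2, 0.5, "load"),        # load is most failure-prone
    choose_checkpoint(cleanse, load, 0.2, 0.5, "join"),     # no criterion applies
]
```

Returning `None` for links that meet no criterion is what keeps the result short of full checkpointing, which claim 5 defines as assigning a checkpoint to every link.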
Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs (mapping at compile time, see G06F8/451) · CPC title
Point-in-time backing up or restoration of persistent data · CPC title
Saving or restoring of program or task context · CPC title
Error detection or correction of the data by redundancy in operations (error detection or correction of the data by redundancy in hardware G06F11/16) · CPC title
Using snapshots, i.e. a logical point-in-time copy of the data · CPC title