Proactive failure recovery model for distributed computing using a checkpoint frequency determined by a MTBF threshold

US9348710B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9348710-B2
Application number  US-201414445369-A
Country             US
Kind code           B2
Filing date         Jul 29, 2014
Priority date       Jul 29, 2014
Publication date    May 24, 2016
Grant date          May 24, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

This disclosure generally describes methods and systems, including computer-implemented methods, computer-program products, and computer systems, for providing a proactive failure recovery model for distributed computing. One computer-implemented method includes building a virtual tree-like computing structure of a plurality of computing nodes, for each computing node of the virtual tree-like computing structure, performing, by a hardware processor, a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node, determining whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold, migrating a process from the computing node to a different computing node acting as a recovery node, and resuming execution of the process on the different computing node.
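The checkpoint decision in the abstract can be illustrated with a minimal sketch. It follows the rule stated in claim 6: checkpoint when a node's MTBF falls below the minimum threshold, then lower that threshold to the new MTBF. Everything else here is an assumption made for illustration. `NodeState`, `estimate_mtbf`, and `maybe_checkpoint` are hypothetical names, the MTBF estimate is the textbook ratio of operating time to failure count rather than the patent's node failure prediction model, and the maximum threshold is left out because this page does not say how it is used.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical per-node bookkeeping; not a structure named in the patent."""
    name: str
    min_threshold: float  # hours; lowered each time a checkpoint is taken
    checkpoints: int = 0  # number of checkpoints taken so far


def estimate_mtbf(observed_hours: float, failure_count: int) -> float:
    """Textbook MTBF estimate: operating time divided by number of failures.

    Assumption: this stands in for the patent's failure prediction model,
    whose details are not given on this page.
    """
    if failure_count == 0:
        return float("inf")  # no failures observed yet
    return observed_hours / failure_count


def maybe_checkpoint(node: NodeState, mtbf: float) -> bool:
    """Apply the claim 6 rule: checkpoint when MTBF < minimum threshold,
    then set the minimum threshold equal to that MTBF."""
    if mtbf < node.min_threshold:
        node.checkpoints += 1       # placeholder for actually saving process state
        node.min_threshold = mtbf   # the threshold tracks the lowest MTBF seen
        return True
    return False


if __name__ == "__main__":
    node = NodeState(name="node-1", min_threshold=100.0)
    mtbf = estimate_mtbf(observed_hours=1000.0, failure_count=20)
    print(mtbf, maybe_checkpoint(node, mtbf), node.min_threshold)
```

Because the threshold drops to the latest MTBF after each checkpoint, a node whose MTBF holds steady or improves is not checkpointed again. Only a further decline in predicted reliability triggers another checkpoint in this sketch.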

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method, comprising: building a virtual tree-like computing structure of a plurality of computing nodes; for each computing node of the virtual tree-like computing structure, performing, by a hardware processor, a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node; determining whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold; migrating a process from the computing node to a different computing node acting as a recovery node; and resuming execution of the process on the different computing node.

2. The method of claim 1, further comprising: collecting at least a computing power and node location parameter for each computing node; dividing the computing nodes into collections based on their node location parameter; and sorting the computing nodes within each collection based on the computing power parameter.

3. The method of claim 2, further comprising: identifying an upper-limit and lower-limit threshold used to determine levels of the sorted computing nodes; sorting the computing nodes within each collection into horizontal levels based on the computing power parameter and the identified upper-limit and lower-limit thresholds; recording the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and populating each node-record-information table with a designated recovery node.

4. The method of claim 3, wherein the upper-limit and lower-limit thresholds are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.

5. The method of claim 1, wherein the MTBF is calculated based at least upon a network or data storage failure.

6. The method of claim 1, further comprising: creating a checkpoint when the MTBF of the computing node is less than the minimum threshold; and updating the minimum threshold associated with the computing node to equal the MTBF.

7. The method of claim 6, further comprising: determining that a failure of the computing node has occurred; and using the last checkpoint taken for the computing node as a process state.

8. A non-transitory, computer-readable medium storing computer-readable instructions, the instructions executable by a computer and configured to: build a virtual tree-like computing structure of a plurality of computing nodes; for each computing node of the virtual tree-like computing structure, perform a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node; determine whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold; migrate a process from the computing node to a different computing node acting as a recovery node; and resume execution of the process on the different computing node.

9. The medium of claim 8, further including instructions to: collect at least a computing power and node location parameter for each computing node; divide the computing nodes into collections based on their node location parameter; and sort the computing nodes within each collection based on the computing power parameter.

10. The medium of claim 9, further including instructions to: identify an upper-limit and lower-limit threshold used to determine levels of the sorted computing nodes; sort the computing nodes within each collection into horizontal levels based on the computing power parameter and the identified upper-limit and lower-limit thresholds; record the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and populate each node-record-information table with a designated recovery node.

11. The medium of claim 10, wherein the upper-limit and lower-limit thresholds are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.

12. The medium of claim 8, wherein the MTBF is calculated based at least upon a network or data storage failure.

13. The medium of claim 8, further including instructions to: create a checkpoint when the MTBF of the computing node is less than the minimum threshold; and update the minimum threshold associated with the computing node to equal the MTBF.

14. The medium of claim 13, further including instructions to: determine that a failure of the computing node has occurred; and use the last checkpoint taken for the computing node as a process state.

15. A computer system, comprising: at least one hardware processor interoperably coupled with a memory storage and configured to: build a virtual tree-like computing structure of a plurality of computing nodes; for each computing node of the virtual tree-like computing structure, perform a node failure prediction model to calculate a mean time between failure (MTBF) associated with the computing node; determine whether to perform a checkpoint of the computing node based on a comparison between the calculated MTBF and a maximum and minimum threshold; migrate a process from the computing node to a different computing node acting as a recovery node; and resume execution of the process on the different computing node.

16. The system of claim 15, further configured to: collect at least a computing power and node location parameter for each computing node; divide the computing nodes into collections based on their node location parameter; and sort the computing nodes within each collection based on the computing power parameter.

17. The system of claim 16, further configured to: identify an upper-limit and lower-limit threshold used to determine levels of the sorted computing nodes; sort the computing nodes within each collection into horizontal levels based on the computing power parameter and the identified upper-limit and lower-limit thresholds; record the horizontal level placement and a vertical placement into a node-record-information table associated with each computing node; and populate each node-record-information table with a designated recovery node.

18. The system of claim 17, wherein the upper-limit and lower-limit thresholds are determined from a cross plot of the collected computing power and node location parameters for each computing node and the vertical placement is determined based at least on the node location parameter for each computing node.

19. The system of claim 15, wherein the MTBF is calculated based at least upon a network or data storage failure.

20. The system of claim 15, further configured to: create a checkpoint when the MTBF of the computing node is less than the minimum threshold; and update the minimum threshold associated with the computing node to equal the MTBF; determine that a failure of the computing node has occurred; and use the last checkpoint taken for the computing node as a process state.
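The tree-building steps in claims 2 and 3 (group nodes by location, sort each group by computing power, assign horizontal levels from two thresholds, and record a recovery node per node) can be sketched as below. This is an illustration under stated assumptions, not the patented procedure. `NodeRecord` and `build_tree` are hypothetical names. The thresholds are passed in as plain numbers, whereas the claims derive them from a cross plot. The mapping of two thresholds to three levels and the choice of the next node in the same location group as the recovery node are guesses made only to keep the example concrete.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRecord:
    """Hypothetical stand-in for a node-record-information table entry."""
    name: str
    location: str              # node location parameter
    power: float               # computing power parameter (arbitrary units)
    level: int = 0             # horizontal level placement
    vertical: str = ""         # vertical placement (here: the location group)
    recovery_node: Optional[str] = None


def build_tree(nodes, lower_limit, upper_limit):
    """Group by location, sort by power, assign levels and recovery nodes."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.location].append(node)

    for location, members in groups.items():
        # Most powerful nodes first within each location collection.
        members.sort(key=lambda n: n.power, reverse=True)
        for node in members:
            node.vertical = location
            # Assumed mapping: two thresholds split nodes into three levels.
            if node.power >= upper_limit:
                node.level = 0
            elif node.power >= lower_limit:
                node.level = 1
            else:
                node.level = 2
        # Assumed policy: the next node in the sorted group (wrapping around)
        # is the designated recovery node; a lone node gets none.
        if len(members) > 1:
            for i, node in enumerate(members):
                node.recovery_node = members[(i + 1) % len(members)].name
    return dict(groups)


if __name__ == "__main__":
    cluster = [
        NodeRecord("a", "rack1", 8.0),
        NodeRecord("b", "rack1", 32.0),
        NodeRecord("c", "rack2", 16.0),
        NodeRecord("d", "rack1", 16.0),
    ]
    for location, members in build_tree(cluster, 10.0, 20.0).items():
        for n in members:
            print(location, n.name, n.level, n.recovery_node)
```

Keeping the recovery node inside the same location group is one plausible reading of the vertical placement. A real deployment might prefer a recovery node in a different location so that a site-wide outage does not take out both nodes.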

Assignees

  • Saudi Arabian Oil Co

Inventors

Classifications

  • Restarting or rejuvenating · CPC title

  • Checkpointing the instruction stream · CPC title

  • G06F11/203 · Primary

    using migration · CPC title

  • within a central processing unit [CPU] · CPC title

  • involving logging of persistent data for recovery · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9348710B2 cover?
This disclosure generally describes methods and systems, including computer-implemented methods, computer-program products, and computer systems, for providing a proactive failure recovery model for distributed computing. One computer-implemented method includes building a virtual tree-like computing structure of a plurality of computing nodes, for each computing node of the virtual tree-like c…
Who is the assignee on this patent?
Saudi Arabian Oil Co
What technology area does this patent fall under?
Primary CPC classification G06F11/203. Mapped technology areas include Physics.
When was this patent published?
Publication date: May 24, 2016 (kind code B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).