Fault detection and recovery as a service

US9240937B2 · US · B2

Patent metadata
Field               Value
Publication number  US-9240937-B2
Application number  US-201113076963-A
Country             US
Kind code           B2
Filing date         Mar 31, 2011
Priority date       Mar 31, 2011
Publication date    Jan 19, 2016
Grant date          Jan 19, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

The monitoring by a monitoring node of a process performed by a monitored node is often devised as a tightly coupled interaction, but such coupling may reduce the re-use of monitoring resources and processes and increase the administrative complexity of the monitoring scenario. Instead, fault detection and recovery may be designed as a non-proprietary service, wherein a set of monitored nodes, together performing a set of processes, may register for monitoring by a set of monitoring nodes. In the event of a failure of a process, or of an entire monitored node, the monitoring nodes may collaborate to initiate a restart of the processes on the same or a substitute monitored node (possibly in the state last reported by the respective processes). Additionally, failure of a monitoring node may be detected, and all monitored nodes assigned to the failed monitoring node may be reassigned to a substitute monitoring node.
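The service described in the abstract can be illustrated with a minimal sketch. It is not taken from the patent: the class names, the least-loaded assignment policy, and the in-memory state table are all assumptions made for illustration. The sketch shows registration with a monitoring node set, restarting processes on a substitute monitored node in their last reported state, and reassigning the monitored nodes of a failed monitoring node.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringNode:
    name: str
    # Monitored nodes assigned to this monitor (its "monitored node subset").
    assigned: set = field(default_factory=set)


class MonitoringService:
    """Hypothetical registry shared by a set of monitoring nodes."""

    def __init__(self, monitor_names):
        self.monitors = {n: MonitoringNode(n) for n in monitor_names}
        # (monitored node, process) -> last state reported by the process
        self.last_state = {}

    def _least_loaded(self):
        # Assumed policy: fewest assigned nodes wins, ties broken by name.
        return min(self.monitors.values(), key=lambda m: (len(m.assigned), m.name))

    def register(self, monitored_node, process):
        """Register a process and return the name of the responsible monitor."""
        monitor = next(
            (m for m in self.monitors.values() if monitored_node in m.assigned),
            None,
        )
        if monitor is None:
            monitor = self._least_loaded()
            monitor.assigned.add(monitored_node)
        self.last_state.setdefault((monitored_node, process), None)
        return monitor.name

    def report_state(self, monitored_node, process, state):
        self.last_state[(monitored_node, process)] = state

    def restart_on_substitute(self, failed_node, substitute_node):
        """Move all processes of a failed node to a substitute node.

        The substitute is assumed to be registered already. Returns the
        (process, state) pairs to restart, using the last reported state.
        """
        restarted = []
        for node, process in sorted(k for k in self.last_state if k[0] == failed_node):
            state = self.last_state.pop((node, process))
            self.last_state[(substitute_node, process)] = state
            restarted.append((process, state))
        for monitor in self.monitors.values():
            monitor.assigned.discard(failed_node)
        return restarted

    def handle_monitor_failure(self, failed_name):
        """Remove a failed monitor and hand its subset to a substitute."""
        failed = self.monitors.pop(failed_name)
        substitute = self._least_loaded()
        substitute.assigned |= failed.assigned
        return substitute.name


svc = MonitoringService(["m1", "m2"])
svc.register("node-a", "web")
svc.register("node-b", "db")
svc.report_state("node-a", "web", "step-3")
restarted = svc.restart_on_substitute("node-a", "node-b")
new_monitor = svc.handle_monitor_failure("m1")
```

Because the state table is keyed by node and process rather than held by one monitor, any surviving monitor could perform the restart in this sketch; the patent leaves open how monitoring nodes actually share or synchronize such status.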

First claim

Opening claim text (preview).

What is claimed is:

1. A method of configuring a first monitoring node having a processor to monitor monitored nodes executing at least one process, the first monitoring node included in a monitoring node set comprising at least one other monitoring node, respective monitoring nodes assigned to monitor a monitored node subset, the method comprising: responsive to receiving from a monitored node a request for the first monitoring node to monitor at least one process executing on the monitored node: adding the monitored node to the monitored node subset assigned to the first monitoring node; and registering the at least one process of the monitored node for monitoring; responsive to receiving from the monitored node a logic set associating, for respective states of a process executing on the monitored node, a logic to be performed responsive to the monitored node reporting the state, store the logic set in association with the respective states of the monitored node; after storing the logic set and responsive to detecting that the process of the monitored node has entered a selected state, perform, at the first monitoring node and on behalf of the monitored node, the logic associated with the selected state of the process in the logic set of the monitored node; and responsive to detecting a failure of at least one process of the monitored node, restarting the process.

2. The method of claim 1: at least one monitored node configured to send to the monitoring node, within a notification period, a persistence indicator; and detecting a failure comprising: detecting an elapsing of a notification period without having received a persistence indicator from the monitored node.

3. The method of claim 1: the first monitoring node monitoring at least two monitored nodes; and restarting at least one process on a monitored node comprising: selecting a substitute monitored node, and restarting at least one process of the monitored node on the substitute monitored node.

4. The method of claim 3: detecting a failure of at least one process of a monitored node comprising: detecting a failure of a monitored node; and restarting the least one process on the monitored node comprising: restarting all processes of the monitored node on the substitute monitored node.

5. The method of claim 1: at least one process of a monitored node configured to report to the first monitoring node at least one state of at least one process; and the method further comprising: responsive to receiving a report of a state of at least one process reported by a monitored node, storing the state of the process.

6. The method of claim 5, restarting a process on a monitored node comprising: restarting the process on a monitored node at the state last reported by the process.

7. The method of claim 1: the monitoring nodes of the monitoring node set configured to store at least one status of at least one monitored node; and the method further comprising: synchronizing the at least one status of at least one monitored node with at least one other monitoring node of the monitoring node set.

8. The method of claim 1: detecting a failure of at least one process of a monitored node comprising: detecting a failure of a monitored node; and restarting the least one process on the monitored node comprising: conferring with at least one other monitoring node of the monitoring node set to choose a substitute monitored node for the monitored node.

9. The method of claim 1, respective monitored nodes assigned for monitoring by at least one monitoring node of the monitoring node set.

10. The method of claim 9, registering at least one process of a monitored node for monitoring comprising: conferring with at least one other monitoring node of the monitoring node set to choose a monitoring node for monitoring the monitored node.

11. The method of claim 1, further comprising: the failed monitoring node configured to send to the monitoring node, within a notification period, a persistence indicator; and responsive to detecting an elapsing of a notification period without having received a persistence indicator from the failed monitoring node: removing the failed monitoring node from the monitoring node set; among the monitoring nodes of the monitoring node set, choosing a substitute monitoring node; and reassigning the monitored node subset of the failed monitoring node to the substitute monitoring node.

12. The method of claim 1, further comprising: responsive to receiving from a monitored node a failure indicator of a failed monitoring node: removing the failed monitoring node from the monitoring node set; among the monitoring nodes of the monitoring node set, choosing a substitute monitoring node; and reassigning the monitored node subset of the failed monitoring node to the substitute monitoring node.

13. The method of claim 1, further comprising: responsive to detecting a failure of a failed monitoring node of the monitoring node set: removing the failed monitoring node from the monitoring node set; conferring with at least one other monitoring node of the monitoring node set to choose a substitute monitoring node for the monitored node; and reassigning the monitored node subset of the failed monitoring node to the substitute monitoring node.

14. The method of claim 1, further comprising: responsive to detecting a failure of a failed monitoring node of the monitoring node set: removing the failed monitoring node from the monitoring node set; among the monitoring nodes of the monitoring node set, choosing a substitute monitoring node; reassigning the monitored node subset of the failed monitoring node to the substitute monitoring node; and sending to the monitoring nodes of the monitored node subset a reassignment notification that identifies the substitute monitoring node for the monitored node.

15. The method of claim 1, further comprising: responsive to detecting a failure of a failed monitoring node of the monitoring node set: removing the failed monitoring node from the monitoring node set; among the monitoring nodes of the monitoring node set, choosing a substitute monitoring node; and reassigning the monitored node subset of the failed monitoring node to the substitute monitoring node.

16. The method of claim 1, wherein: the logic set provided by the monitored node specifies a failure logic to be performed responsive to detecting a failure of the monitored node; and the method further comprises: responsive to detecting a failure of a failed monitoring node, perform the failure logic associated with the failure of the monitored node in the logic set.

17. A method of configuring a monitored node executing at least one process on a processor to be monitored by a monitoring node set, the method comprising: responsive to receiving a notification of an assignment of the monitored node to a first monitoring node of the monitoring node set: setting the first monitoring node as a selected monitoring node, and sending to the first monitoring node a logic set associating, for respective states of a process executing on the monitored node, a logic to be performed by the monitoring node responsive to the monitored node reporting the state; sending to the selected monitoring node a request to register at least one process executing on the monitored node for monitoring by the monitoring node; after sending the logic set to the first monitoring node, reporting the state of the process to the monitoring node, wherein the state reported to the monitoring node is
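Two mechanisms from the claims lend themselves to a short illustration: the logic set of claim 1, which maps process states to logic the monitoring node runs on the monitored node's behalf, and the persistence indicator of claim 2, whose absence for a full notification period is treated as a failure. The sketch below is one possible reading, not the patented implementation. The class `ProcessMonitor`, its method names, and the explicit `now` timestamps (used instead of a real clock so the behavior is deterministic) are all assumptions.

```python
class ProcessMonitor:
    """Hypothetical monitoring node tracking processes of one monitored node."""

    def __init__(self, notification_period):
        self.notification_period = notification_period
        self.last_seen = {}   # process -> time of last persistence indicator
        self.logic_sets = {}  # process -> {state: callable}
        self.restarts = []    # processes this monitor decided to restart

    def register(self, process, now):
        # Registration starts the first notification period.
        self.last_seen[process] = now

    def store_logic_set(self, process, logic_set):
        # Store the monitored node's logic keyed by process state.
        self.logic_sets[process] = dict(logic_set)

    def persistence_indicator(self, process, now):
        # Heartbeat-style signal that the process is still alive.
        self.last_seen[process] = now

    def report_state(self, process, state):
        # Run the logic associated with the reported state, if any.
        logic = self.logic_sets.get(process, {}).get(state)
        return logic(process) if logic is not None else None

    def check(self, now):
        # A process fails when a full period elapses without an indicator.
        failed = [
            p for p, seen in self.last_seen.items()
            if now - seen > self.notification_period
        ]
        for p in failed:
            self.restarts.append(p)
            self.last_seen[p] = now  # the restarted process gets a fresh period
        return failed


mon = ProcessMonitor(notification_period=5.0)
mon.register("web", now=0.0)
mon.store_logic_set("web", {"overloaded": lambda p: f"scale out {p}"})
action = mon.report_state("web", "overloaded")
mon.persistence_indicator("web", now=4.0)
early = mon.check(now=8.0)
late = mon.check(now=9.5)
```

Here `early` is empty because only 4.0 time units have passed since the last indicator, while `late` contains `"web"` because 5.5 units exceed the 5.0-unit period. Recording the restart in a list stands in for whatever restart mechanism a real deployment would use.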

Assignees

Microsoft Technology Licensing Llc

Inventors

Katiyar Atul, Polinati Chinna Babu

Classifications

  • H04L43/12 (Primary)

    Network monitoring probes · CPC title

  • without idle spare hardware · CPC title

  • where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title

  • Restarting or rejuvenating · CPC title

  • Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available (error or fault processing without redundancy G06F11/0703; error detection or correction by redundancy in data representation G06F11/08; error detection or correction of the data by redundancy in operations G06F11/14; error detection or correction by redundancy in hardware G06F11/16) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9240937B2 cover?
The monitoring by a monitoring node of a process performed by a monitored node is often devised as a tightly coupled interaction, but such coupling may reduce the re-use of monitoring resources and processes and increase the administrative complexity of the monitoring scenario. Instead, fault detection and recovery may be designed as a non-proprietary service, wherein a set of monitored nodes, …
Who is the assignee on this patent?
Microsoft Technology Licensing Llc. The individuals named on the record are Katiyar Atul and Polinati Chinna Babu.
What technology area does this patent fall under?
Primary CPC classification H04L43/12. Mapped technology areas include Electricity.
When was this patent published?
Publication date Jan 19, 2016 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).