Software defined failure detection of many nodes

US10547499B2 · US · B2

Patent metadata
Field · Value
Publication number · US-10547499-B2
Application number · US-201715694866-A
Country · US
Kind code · B2
Filing date · Sep 4, 2017
Priority date · Sep 4, 2017
Publication date · Jan 28, 2020
Grant date · Jan 28, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present systems and methods may provide the capability to monitor and detect failure of nodes in a data center environment by using a software defined failure detector that can be adjusted to varying conditions and data center topology. In an embodiment, a computer-implemented method for monitoring and detecting failure of electronic systems may comprise, in a system comprising a plurality of networked computer systems, defining at least one failure detection agent to monitor operation of other failure detection agents running on at least some of the electronic systems, and defining, at the controller, and transmitting, from the controller, topology information defining a topology of the failure detection agents to the failure detection agents, wherein the topology information includes information defining which failure detection agents each failure detection agent is to monitor.
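As a rough illustration of the arrangement the abstract describes, the sketch below has a controller define a monitoring topology and transmit to each failure detection agent the list of agents it is to monitor. The `Agent` and `Controller` classes, the ring layout, and the `fanout` parameter are assumptions made for this example only; the patent does not prescribe any of them.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical failure detection agent; it only stores what it is told."""
    agent_id: str
    monitored: list = field(default_factory=list)

    def receive_topology(self, targets):
        # Replace the previous assignment with the one the controller sent.
        self.monitored = list(targets)


class Controller:
    """Hypothetical controller that defines and transmits a monitoring topology."""

    def __init__(self, agents):
        self.agents = {a.agent_id: a for a in agents}

    def define_ring(self, fanout=1):
        # Assumed layout: each agent monitors its next `fanout` neighbours
        # in sorted-id order, wrapping around, and never monitors itself.
        ids = sorted(self.agents)
        n = len(ids)
        return {
            ids[i]: [ids[(i + k) % n] for k in range(1, min(fanout, n - 1) + 1)]
            for i in range(n)
        }

    def transmit(self, topology):
        # Stand-in for a network send: hand each agent its own target list.
        for agent_id, targets in topology.items():
            self.agents[agent_id].receive_topology(targets)


agents = [Agent(f"node-{i}") for i in range(4)]
controller = Controller(agents)
topology = controller.define_ring(fanout=2)
controller.transmit(topology)
```

Because the topology is plain data held by the controller, adjusting it to new conditions amounts to computing a different mapping and transmitting it again; the agents themselves do not change.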

First claim

Opening claim text (preview).

The invention claimed is:

1. A computer-implemented method for monitoring and detecting failure of electronic systems comprising: in a system comprising a plurality of networked computer systems, defining a plurality of failure detection agents to monitor operation of other failure detection agents running on at least some of the electronic systems; defining, at a controller, and transmitting, from the controller, topology information defining a topology of the failure detection agents to the failure detection agents; and wherein the topology information includes a first topology information defining which failure detection agents each failure detection agent is to monitor, a second topology information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected, and a third topology information defining a topology for propagating instructions to the failure detection agents, and wherein the first topology information, the second topology information, and the third topology information are independent of each other.

2. The method of claim 1, wherein each failure detection agent is configured to communicate failure information to at least a controller, to at least one other failure detection agent, or both.

3. The method of claim 2, wherein the topology information further includes information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected.

4. The method of claim 3, wherein the topology information further includes information defining the topology information that is to be propagated to the failure detection agents.

5. The method of claim 3, wherein the controller is configured to: receive a notification of a suspicion of a failure, and resolve the suspicion of the failure to determine whether to refute or accept the suspicion of the failure.

6. The method of claim 3, wherein at least one failure detection agent is configured to: receive a notification of a suspicion of a failure, and resolve the suspicion of the failure to determine whether to refute or accept the suspicion of the failure.

7. The method of claim 1, further comprising: modifying, at the controller, the topology information based on changes in conditions notified to the controller from at least one failure detection agent during operation of the electronic systems.

8. A system for monitoring and detecting failure of electronic systems comprising: at least one controller, implemented in a computer system comprising a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor, the controller configured to define and transmit, to a plurality of failure detection agents, topology information defining a topology of the failure detection agents, wherein the topology information includes a first topology information defining which failure detection agents each failure detection agent is to monitor, a second topology information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected, and a third topology information defining a topology for propagating instructions to the failure detection agents, and wherein the first topology information, the second topology information, and the third topology information are independent of each other; and a plurality of failure detection agents, each failure detection agent implemented in a computer system comprising a processor, memory accessible by the processor, and computer program instructions stored in the memory and executable by the processor, each failure detection agent configured to receive the topology information from the at least one controller and to monitor operation of other failure detection agents based on the received the topology information.

9. The system of claim 8, wherein each failure detection agent is further configured to communicate failure information to at least one controller, to at least one other failure detection agent, or both.

10. The system of claim 9, wherein the topology information further includes information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected.

11. The system of claim 10, wherein the topology information further includes information defining the topology information that is to be propagated to the failure detection agents.

12. The system of claim 10, wherein at least one controller is further configured to: receive a notification of a suspicion of a failure, and resolve the suspicion of the failure to determine whether to refute or accept the suspicion of the failure.

13. The system of claim 10, wherein at least one failure detection agent is further configured to: receive a notification of a suspicion of a failure, and resolve the suspicion of the failure to determine whether to refute or accept the suspicion of the failure.

14. The system of claim 8, wherein the controller may be further configured to modify the topology information based on changes in conditions notified to the controller from at least one failure detection agent during operation of the electronic systems.

15. A computer program product for monitoring and detecting failure of electronic systems, the computer program product comprising a non-transitory computer readable storage having program instructions embodied therewith, the program instructions executable by a computer, to cause the computer to perform a method comprising: in a system comprising a plurality of networked computer systems, defining a plurality of failure detection agents to monitor operation of other failure detection agents running on at least some of the electronic systems; and defining, at a controller, and transmitting, from the controller, topology information defining a topology of the failure detection agents to the failure detection agents; wherein the topology information includes a first topology information defining which failure detection agents each failure detection agent is to monitor, a second topology information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected, and a third topology information defining a topology for propagating instructions to the failure detection agents, and wherein the first topology information, the second topology information, and the third topology information are independent of each other.

16. The computer program product of claim 15, wherein each failure detection agent is configured to communicate failure information to at least a controller, to at least one other failure detection agent, or both.

17. The computer program product of claim 16, wherein the topology information further includes information defining which failure detection agents each failure detection agent is to notify when a failure is detected or suspected.

18. The computer program product of claim 17, wherein the topology information further includes information defining the topology information that is to be propagated to the failure detection agents.

19. The computer program product of claim 17, wherein at least one of the controller or at least one failure detection agent is configured to: receive a notification of a suspicion of a failure, and resolve the suspicion of the failure to determine whether to refute or accept the suspicion of the failure.

20. The computer
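The claims name three independent topologies (who monitors whom, who is notified, and how instructions propagate), a step that resolves a suspicion by accepting or refuting it, and a controller that modifies the topology as conditions change. The sketch below is one way these pieces could fit together. All names (`TopologyInfo`, `resolve_suspicion`, `drop_failed`) are hypothetical, and the resolution rule (refute as soon as any other prober reaches the suspect) is an assumption; the claims do not say how a suspicion is resolved.

```python
from dataclasses import dataclass, field


@dataclass
class TopologyInfo:
    """Three separate mappings of agent id -> set of agent ids."""
    monitor: dict = field(default_factory=dict)    # first topology
    notify: dict = field(default_factory=dict)     # second topology
    propagate: dict = field(default_factory=dict)  # third topology

    def slice_for(self, agent_id):
        # The part of each topology that concerns a single agent.
        return {
            "monitor": sorted(self.monitor.get(agent_id, ())),
            "notify": sorted(self.notify.get(agent_id, ())),
            "propagate": sorted(self.propagate.get(agent_id, ())),
        }


def resolve_suspicion(probe_results):
    """Assumed rule: refute if any prober reached the suspect, else accept.

    `probe_results` maps a prober id to True (reached) or False (no reply).
    """
    return "refute" if any(probe_results.values()) else "accept"


def drop_failed(topology, failed):
    """Rewire only the monitoring topology once `failed` is accepted as down.

    The notify and propagate mappings are left untouched here to show that
    the three topologies can change independently of one another.
    """
    orphans = topology.monitor.pop(failed, set())
    for watcher, targets in topology.monitor.items():
        if failed in targets:
            targets.discard(failed)
            # The watcher inherits the failed agent's targets, except itself.
            targets.update(orphans - {watcher})


topo = TopologyInfo(
    monitor={"a": {"b"}, "b": {"c"}, "c": {"a"}},
    notify={"a": {"controller"}, "b": {"controller"}, "c": {"controller"}},
    propagate={"controller": {"a"}, "a": {"b", "c"}},
)
before = topo.slice_for("a")

# Agent "a" suspects "b"; two other parties probe "b" and neither gets a reply.
verdict = resolve_suspicion({"c": False, "controller": False})
if verdict == "accept":
    drop_failed(topo, "b")
after = topo.slice_for("a")
```

In this run agent `a` starts out monitoring `b`; once the suspicion is accepted it takes over `b`'s former target `c`, while its notification and propagation entries stay as they were.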

Assignees

Inventors

Classifications

  • Management of faults, events, alarms or notifications · CPC title

  • Network monitoring probes · CPC title

  • comprising network management agents or mobile agents therefor · CPC title

  • for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection (management of faults, events, alarms or notifications in data switching networks H04L41/06) · CPC title

  • H04L67/10 · Primary

    in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10547499B2 cover?
Embodiments of the present systems and methods may provide the capability to monitor and detect failure of nodes in a data center environment by using a software defined failure detector that can be adjusted to varying conditions and data center topology. In an embodiment, a computer-implemented method for monitoring and detecting failure of electronic systems may comprise, in a system comprisi…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification H04L67/10. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Jan 28, 2020 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).