Identifying alarms for a root cause of a problem in a data processing system

US9497072B2 · US · B2

Patent metadata
Publication number: US-9497072-B2
Application number: US-201414242861-A
Country: US
Kind code: B2
Filing date: Apr 1, 2014
Priority date: Apr 1, 2014
Publication date: Nov 15, 2016
Grant date: Nov 15, 2016

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods for monitoring a networked computing environment and for consolidating multiple alarms under a single root cause are described. In some embodiments, in response to detecting an alert corresponding with a performance issue in a networked computing environment, a root cause identification tool may aggregate a plurality of alarms from a plurality of performance management tools monitoring the networked computing environment. The root cause identification tool may then generate a failure graph associated with the performance issue based on the plurality of alarms, determine a first set of leaf nodes of the failure graph, determine a first chain of failures based on the first set of leaf nodes, suppress (or hide) alarms that are not associated with the first chain of failures, and output a consolidated alarm associated with the first chain of failures.
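The steps in the abstract (build a failure graph, find its leaf nodes, pick a chain of failures, suppress unrelated alarms, output one consolidated alarm) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. It assumes that edges point from a failure to the failure that caused it, that the chain is chosen by taking the first leaf found, and that each alarm is a dictionary with a `node` key. All function names, node names, and alarm fields here are hypothetical.

```python
from collections import deque

def leaf_nodes(graph, root):
    """Return nodes reachable from root that have no outgoing edges."""
    seen, leaves, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        children = graph.get(node, [])
        if not children:
            leaves.append(node)  # no further cause recorded: candidate root cause
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return leaves

def chain_of_failures(graph, root, leaf):
    """Return one path from the root alert to the given leaf, or None."""
    parents = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node == leaf:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None

def consolidate(graph, root, alarms):
    """Keep alarms on the chosen chain and suppress the rest."""
    leaves = leaf_nodes(graph, root)
    # Assumption: take the first leaf found; the source does not say how to choose.
    chain = chain_of_failures(graph, root, leaves[0])
    kept = [a for a in alarms if a["node"] in chain]
    suppressed = [a for a in alarms if a["node"] not in chain]
    return {"root_cause": chain[-1], "chain": chain,
            "alarms": kept, "suppressed": suppressed}

# Hypothetical failure graph: each key was caused by the nodes in its list.
graph = {
    "app_unavailable": ["db_timeout"],
    "db_timeout": ["server_down"],
    "server_down": ["power_failure"],
}
alarms = [
    {"node": "app_unavailable", "source": "application-level monitor"},
    {"node": "db_timeout", "source": "application-level monitor"},
    {"node": "power_failure", "source": "system-level monitor"},
    {"node": "disk_usage_high", "source": "system-level monitor"},  # unrelated
]
result = consolidate(graph, "app_unavailable", alarms)
print(result["root_cause"])  # power_failure
print(result["chain"])
```

In this example the `disk_usage_high` alarm is not on the chain from the alert to the root cause, so it ends up in `suppressed`, and only the chain-related alarms would be reported.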

First claim

Opening claim text (preview).

What is claimed is:

1. A method for monitoring a networked computing environment, comprising: detecting an alert corresponding with a performance issue within the networked computing environment; aggregating data from a plurality of monitoring applications monitoring the networked computing environment, the aggregated data comprises a plurality of alarms; generating a failure graph based on the aggregated data, the failure graph comprises a plurality of nodes and a set of directed edges, each directed edge of the set of directed edges corresponds with a causal relationship between a pair of the plurality of nodes, the alert corresponds with a root node of the failure graph; detecting a second alert corresponding with a second performance issue within the networked computing environment, the alert corresponds with a first application-level failure of a first application within the networked computing environment and the second alert corresponds with a second application-level failure of a second application within the networked computing environment different from the first application; generating a second failure graph based on the aggregated data, the second alert corresponds with a root node of the second failure graph; identifying a first set of leaf nodes associated with the failure graph; identifying a second set of leaf nodes associated with the second failure graph; and identifying a common leaf node that is common to both the failure graph and the second failure graph, the first set of leaf nodes comprises the common leaf node and the second set of leaf nodes comprises the common leaf node; identifying a first leaf node of the plurality of nodes, the first leaf node corresponds with a root cause of the performance issue; determining a first chain of failures corresponding with the first leaf node and the root node of the failure graph; and outputting a consolidated alarm corresponding with the first chain of failures.

2. The method of claim 1, wherein: the plurality of monitoring applications comprises an application-level monitor, a network-level monitor, and a system-level monitor.

3. The method of claim 1, wherein: the aggregated data comprises log file data generated by devices within the networked computing environment.

4. The method of claim 1, wherein: the aggregated data comprises help desk ticket information associated with help desk tickets covering performance issues affecting the networked computing environment.

5. The method of claim 1, further comprising: suppressing each alarm of the plurality of alarms that is not associated with a node in the first chain of failures.

6. The method of claim 1, wherein: the outputting a consolidated alarm corresponding with the first chain of failures comprises transmitting a message specifying the root cause of the performance issue to a target recipient.

7. The method of claim 1, wherein: the outputting a consolidated alarm corresponding with the first chain of failures comprises transmitting a message providing information only associated with the first chain of failures to a target recipient.

8. The method of claim 1, wherein: the plurality of monitoring applications comprises an application-level monitor, the application-level monitor generates a first alarm of the plurality of alarms in response to a performance metric for an application being outside of an acceptable range, the acceptable range is determined based on a time of day.

9. The method of claim 1, wherein: the performance issue comprises an unavailability of the first application, the networked computing environment comprises a plurality of servers, the root cause of the performance issue comprises a power failure to a first server of the plurality of servers.

10. A system for monitoring a networked computing environment, comprising: a network interface configured to receive data from a plurality of monitoring applications monitoring the networked computing environment; and a processor configured to detect an alert corresponding with a performance issue within the networked computing environment and aggregate the data from the plurality of monitoring applications, the processor configured to generate a failure graph based on the aggregated data, the failure graph comprises a plurality of nodes and a set of directed edges, each directed edge of the set of directed edges corresponds with a causal relationship between a pair of the plurality of nodes, the alert corresponds with a root node of the failure graph, the processor configured to detect a second alert corresponding with a second performance issue within the networked computing environment, the alert corresponds with a first application-level failure of a first application within the networked computing environment and the second alert corresponds with a second application-level failure of a second application within the networked computing environment different from the first application, the processor configured to generate a second failure graph based on the aggregated data, the second alert corresponds with a root node of the second failure graph, the processor configured to identify a common leaf node that is common to both the failure graph and the second failure graph, the processor configured to identify a first leaf node of the plurality of nodes, the first leaf node corresponds with a root cause of the performance issue, the processor configured to determine a first chain of failures corresponding with the first leaf node and the root node of the failure graph and output a consolidated alarm corresponding with the first chain of failures.

11. The system of claim 10, wherein: the plurality of monitoring applications comprises an application-level monitor, a network-level monitor, and a system-level monitor.

12. The system of claim 10, wherein: the aggregated data comprises a plurality of alarms generated by the plurality of monitoring applications monitoring the networked computing environment.

13. The system of claim 10, wherein: the aggregated data comprises log file data generated by devices within the networked computing environment.

14. The system of claim 10, wherein: the aggregated data comprises help desk ticket information associated with help desk tickets covering performance issues affecting the networked computing environment.

15. The system of claim 10, wherein: the processor configured to output the consolidated alarm by transmitting a message providing information only associated with the first chain of failures to a target recipient.

16. A computer program product, comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to detect an alert corresponding with a performance issue within a networked computing environment; computer readable program code configured to aggregate data from a plurality of monitoring applications monitoring the networked computing environment, the aggregated data comprises a plurality of alarms; computer readable program code configured to generate a failure graph based on the aggregated data, the failure graph comprises a plurality of nodes and a set of directed edges, each directed edge of the set of directed edges corresponds with a causal relationship between a pair of the plurality of nodes, the alert corresponds with a root node of the failure graph; computer readable program code configured to detect a second alert corresponding with a second performance iss
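The independent claims recite two failure graphs, each rooted at a different application-level alert, and a leaf node common to both. One way to picture that step is to collect the leaves reachable from each root and intersect the two sets. The sketch below is only an illustration under assumptions: graphs are adjacency dictionaries whose edges point from a failure to its cause, and every function and node name is hypothetical rather than taken from the patent.

```python
def reachable_leaves(graph, root):
    """Collect nodes reachable from root that have no outgoing edges."""
    stack, seen, leaves = [root], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = graph.get(node, [])
        if not children:
            leaves.add(node)  # leaf: no deeper cause is recorded
        stack.extend(children)
    return leaves

def common_leaves(graph_a, root_a, graph_b, root_b):
    """Return leaves shared by two failure graphs."""
    return reachable_leaves(graph_a, root_a) & reachable_leaves(graph_b, root_b)

# Hypothetical graphs for two different applications' alerts.
first_graph = {
    "app1_unavailable": ["server1_down"],
    "server1_down": ["power_failure"],
}
second_graph = {
    "app2_slow": ["server1_down", "network_congestion"],
    "server1_down": ["power_failure"],
}

shared = common_leaves(first_graph, "app1_unavailable",
                       second_graph, "app2_slow")
print(shared)  # {'power_failure'}
```

Here the second application also has a leaf (`network_congestion`) that the first does not share, so only `power_failure` is a candidate root cause that would explain both alerts.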

Assignees

Ca Inc

Inventors

Classifications

  • H04L41/065 (Primary)

    involving logical or physical relationship, e.g. grouping and hierarchies · CPC title

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • for graphical visualisation of monitoring data · CPC title

  • Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

  • comprising specially adapted graphical user interfaces [GUI] · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9497072B2 cover?
Methods for monitoring a networked computing environment and for consolidating multiple alarms under a single root cause are described. In some embodiments, in response to detecting an alert corresponding with a performance issue in a networked computing environment, a root cause identification tool may aggregate a plurality of alarms from a plurality of performance management tools monitoring …
Who is the assignee on this patent?
Ca Inc
What technology area does this patent fall under?
Primary CPC classification H04L41/065. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Nov 15, 2016 (B2). Legal status and post-grant events are not shown on this page.