Root cause detection and corrective action diagnosis system

US11269718B1 · US · B1

Patent metadata
Publication number: US-11269718-B1
Application number: US-202016915993-A
Country: US
Kind code: B1
Filing date: Jun 29, 2020
Priority date: Jun 29, 2020
Publication date: Mar 8, 2022
Grant date: Mar 8, 2022

Abstract

Official abstract text for this publication.

Methods, systems, and computer-readable media for automatically detecting root causes of anomalies occurring in information technology (IT) systems are disclosed. In some embodiments, data of a service graph depicting dependencies between nodes or services of the IT infrastructure is traversed to determine propagation patterns of anomaly symptoms/alarms through the IT infrastructure. Also, a causal inference model is used to determine probabilities that an observed propagation pattern corresponds to a stored propagation pattern, wherein a close correspondence indicates that the current anomaly is likely caused by a similar root cause as a past anomaly that caused the stored propagation pattern.
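The pattern-matching idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how correspondence between an observed and a stored propagation pattern is measured, so the Jaccard similarity over propagation edges used here is an assumption standing in for the causal inference model's probabilities. The names `StoredPattern`, `jaccard`, and `rank_root_causes`, along with the sample services and root causes, are hypothetical.

```python
from dataclasses import dataclass

# (service that alarmed first, dependent service that alarmed next)
Edge = tuple[str, str]


@dataclass(frozen=True)
class StoredPattern:
    """A past anomaly: its known root cause and how its alarms propagated."""
    root_cause: str
    edges: frozenset[Edge]


def jaccard(a: frozenset[Edge], b: frozenset[Edge]) -> float:
    """Overlap of two edge sets in [0, 1]; 1.0 means identical patterns.

    Assumption: a plain similarity score used in place of a learned probability.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_root_causes(
    observed: frozenset[Edge], stored: list[StoredPattern]
) -> list[tuple[str, float]]:
    """Score every stored pattern against the observed one, best match first."""
    scored = [(p.root_cause, jaccard(observed, p.edges)) for p in stored]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical data: alarms spread from "db" to "api" to "web".
observed = frozenset({("db", "api"), ("api", "web")})
stored = [
    StoredPattern("load balancer misconfiguration", frozenset({("lb", "web")})),
    StoredPattern(
        "database connection pool exhausted",
        frozenset({("db", "api"), ("api", "web")}),
    ),
]
ranking = rank_root_causes(observed, stored)
```

A close match (a score near 1.0) suggests the current anomaly may share a root cause with the stored one. A real system would likely also weigh timing and alarm types, which this sketch leaves out.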

First claim

Opening claim text (preview).

What is claimed is:

1. An information technology (IT) operations monitoring system, comprising: one or more computing devices configured to: generate, or receive access to, a service graph for the IT operations, wherein the service graph comprises: a graphical representation of the IT operations indicating logical locations of network nodes of the IT operations relative to other network nodes of the IT operations; and service dependencies between services implemented at respective ones of the nodes of the IT operations; determine whether a quantity of active alarms for the IT operations exceeds one or more alarm thresholds; and provide an indication of one or more root cause conditions causing the active alarms in response to determining the number of active alarms exceeds the one or more alarm thresholds, wherein to provide the indication of the one or more root cause conditions, the one or more computing devices are further configured to: traverse service graph data corresponding to a plurality of moments in time during which respective ones of the active alarms were triggered; identify one or more alarm groupings for the active alarms based on logical locations of nodes associated with the active alarms as indicated in the service graph and based on service dependencies of services associated with the active alarms as indicated in the service graph; generate one or more probable root cause inferences based on applying the identified one or more alarm groupings to a causal analysis model, wherein the causal analysis model comprises alarm grouping patterns and associated probable root causes; and select, from the generated one or more probable root cause inferences, respective ones of the one or more probable root cause inferences that have associated probabilities greater than a probability threshold, wherein the selected one or more probable root cause inferences are assigned as the one or more root cause conditions causing the active alarms.

2. The IT operations monitoring system of claim 1, further comprising: a correction of error database, wherein the causal analysis model is generated using data stored in the correction of error database, and wherein the one or more computing devices are configured to: update the correction of error database with data collected for events causing a given quantity of active alarms for the IT operations to exceed the one or more alarm thresholds.

3. The IT operations monitoring system of claim 2, wherein to update the correction of error database, the one or more computing devices are configured to: automatically collect data for the events with active alarms exceeding the one or more alarm thresholds; automatically associate one or more assigned root cause conditions, assigned for the events, with the collected data; and in response to receiving a validation indication for the assigned one or more root cause conditions, add the assigned one or more root cause conditions and associated collected data to the correction of error database.

4. The IT operations monitoring system of claim 2, wherein to populate the correction of error database, the one or more computing devices are configured to: inject one or more known error conditions into the IT operations; automatically collect data related to one or more events caused by the injection of the one or more known error conditions; and add the collected data for the one or more events and the known error conditions to the correction of error database, wherein the known error conditions are associated with the collected data.

5. The IT operations monitoring system of claim 1, wherein the one or more computing devices are further configured to: provide an event ticket for an event causing the active alarms, wherein the event ticket comprises: the indication of the one or more root cause conditions causing the active alarms; and one or more recommended corrective actions to be taken to resolve the one or more root cause conditions causing the active alarms.

6. A method, comprising: determining a quantity of active alarms for a monitored system exceeds one or more alarm thresholds; traversing service graph data for the monitored system corresponding to a plurality of moments in time to identify one or more alarm patterns, wherein the service graph comprises: service dependencies between services implemented at respective nodes of the monitored system; and determining one or more root cause conditions causing the active alarms based on applying the identified one or more alarm patterns to a causal analysis model, wherein the causal analysis model comprises alarm patterns and associated root causes.

7. The method of claim 6, wherein the service graph further comprises: logical locations of the nodes of the monitored system relative to other ones of the nodes of the monitored system, and wherein the method further comprises: identifying one or more alarm groupings based on the logical locations of nodes associated with the active alarms and based on service dependencies of services associated with the active alarms, wherein the one or more alarm patterns comprise the identified alarm groupings.

8. The method of claim 6, further comprising: generating the service graph, wherein generating the service graph comprises: determining a graphical representation for logical locations of the nodes of the monitored system; and determining dependencies for services implemented via the respective ones of the nodes of the monitored system that are dependent on other services implemented at other respective ones of the nodes of the monitored system.

9. The method of claim 6, further comprising: providing a user interface to a customer of a monitoring service that performs said determining the quantity of active alarms exceeds the one or more alarm thresholds, said traversing the service graph data, and said determining the one or more root causes; receiving, via the user interface, authorization to access operational data for the nodes of the monitored system, wherein the monitored system is a system of the customer of the monitoring service; and providing the determined one or more root cause conditions to the customer.

10. The method of claim 6, further comprising: providing an event ticket for an event causing the active alarms, wherein the event ticket comprises: the determined one or more root cause conditions; and one or more recommended corrective actions to be taken to resolve the one or more root cause conditions causing the active alarms.

11. The method of claim 10, wherein the determined one or more root cause conditions and the one or more corrective actions are determined using information stored in a correction of error database, wherein the method further comprises: automatically collecting data for other events with a quantity of active alarms exceeding the one or more alarm thresholds; automatically associating one or more assigned root cause conditions for the other events with the collected data; and in response to receiving a validation indication for the assigned one or more root cause conditions, adding the assigned one or more root cause conditions and associated collected data for the other events to the correction of error database.

12. The method of claim 11, further comprising: determining, for the other events, one or more recommended corrective actions to be taken to resolve the one or more assigned root cause conditions; assigning tags to the one or more corrective actions determined for the other events, wherein the tags associate the one or more corrective actions with the one or more root cau
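The core steps recited in claim 1 (alarm threshold check, alarm grouping by service dependency, lookup in a causal analysis model, and selection by probability threshold) can be sketched as below. This is only an illustration under stated assumptions, not the patented implementation: alarms are grouped as connected components of the dependency graph, the causal analysis model is a plain dictionary keyed by exact alarm groupings, and the time dimension and logical node locations recited in the claim are omitted. All function names, service names, and probabilities are hypothetical.

```python
from collections import defaultdict


def group_alarms(
    alarmed: set[str], depends_on: dict[str, set[str]]
) -> list[frozenset[str]]:
    """Group alarmed services that are linked by a dependency edge.

    Assumption: a grouping is a connected component among alarmed services.
    """
    neighbors: dict[str, set[str]] = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            if service in alarmed and dep in alarmed:
                neighbors[service].add(dep)
                neighbors[dep].add(service)

    groups: list[frozenset[str]] = []
    seen: set[str] = set()
    for start in sorted(alarmed):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over alarmed neighbors
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        groups.append(frozenset(component))
    return groups


def diagnose(
    alarmed: set[str],
    depends_on: dict[str, set[str]],
    causal_model: dict[frozenset[str], dict[str, float]],
    alarm_threshold: int,
    probability_threshold: float,
) -> list[tuple[str, float]]:
    """Return probable root causes, most probable first."""
    if len(alarmed) <= alarm_threshold:
        return []  # not enough active alarms to trigger root cause analysis
    inferences = []
    for group in group_alarms(alarmed, depends_on):
        # Assumption: exact-match lookup; a real model would match approximately.
        for root_cause, probability in causal_model.get(group, {}).items():
            if probability > probability_threshold:
                inferences.append((root_cause, probability))
    return sorted(inferences, key=lambda item: item[1], reverse=True)


# Hypothetical service graph: web depends on api, api on db, batch on queue.
depends_on = {"web": {"api"}, "api": {"db"}, "batch": {"queue"}}
alarmed = {"web", "api", "db", "batch"}
causal_model = {
    frozenset({"web", "api", "db"}): {"database outage": 0.9, "network partition": 0.3},
    frozenset({"batch"}): {"scheduler crash": 0.6},
}
result = diagnose(alarmed, depends_on, causal_model,
                  alarm_threshold=3, probability_threshold=0.5)
```

With these sample values, only inferences whose probability exceeds 0.5 are kept, so "network partition" is dropped. Raising `alarm_threshold` to 4 suppresses the analysis entirely, since the number of active alarms no longer exceeds the threshold.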

Assignees

Amazon Tech Inc

Inventors

Classifications

  • G06F11/079 (Primary)

    Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • Threshold · CPC title

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title

  • where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11269718B1 cover?
Methods, systems, and computer-readable media for automatically detecting root causes of anomalies occurring in information technology (IT) systems are disclosed. In some embodiments, data of a service graph depicting dependencies between nodes or services of the IT infrastructure is traversed to determine propagation patterns of anomaly symptoms/alarms through the IT infrastructure. Also, a ca…
Who is the assignee on this patent?
Amazon Tech Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/079. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 8, 2022 (B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).