Fault tolerant root cause analysis system

US10417079B2 · US · B2

Patent metadata
Field               Value
Publication number  US-10417079-B2
Application number  US-201715485848-A
Country             US
Kind code           B2
Filing date         Apr 12, 2017
Priority date       Mar 30, 2017
Publication date    Sep 17, 2019
Grant date          Sep 17, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Embodiments of the present disclosure relate to a fault tolerant root cause analysis (RCA) system that is able to handle calculation failures during runtime. Calculations (e.g., evaluation of a diagnostic model for a specific component or device) that are performed during the RCA are integrated using different resources of the system under analysis. In order to make a final diagnosis, the resources exchange messages containing calculation inputs and outputs. Calculation problems due to calculation failures in a particular resource can be resolved efficiently which reduces resource utilization and minimizes failure propagation to other parts of the system. Accordingly, the system is able to recover and output a diagnosis even if some of the resources fail or generate problems.
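The idea in the abstract can be illustrated with a small sketch: each resource evaluates its own diagnostic model and reports the result as a message, a failing calculation stays contained in that resource, and a diagnosis is still produced from the remaining results. Everything below is an assumption made for illustration. The names Message, evaluate, diagnose, the example resources, and the 0.5 threshold do not appear in the patent, and this is not the patented implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Message:
    """Hypothetical message carrying one calculation output between resources."""
    sender: str
    value: Optional[float]        # None means the calculation failed
    error: Optional[str] = None


def evaluate(name: str, model: Callable[[float], float], metric: float) -> Message:
    """Run one resource's diagnostic model and wrap the outcome in a message."""
    try:
        return Message(sender=name, value=model(metric))
    except Exception as exc:
        # The failure is reported as data, so it does not propagate to other resources.
        return Message(sender=name, value=None, error=str(exc))


def diagnose(messages: List[Message], threshold: float = 0.5) -> Dict[str, List[str]]:
    """Build a diagnosis from the results that arrived; list the resources that failed."""
    suspects = [m.sender for m in messages if m.value is not None and m.value > threshold]
    failed = [m.sender for m in messages if m.value is None]
    return {"suspects": suspects, "failed_resources": failed}


def broken_model(metric: float) -> float:
    """Stand-in for a diagnostic model whose calculation fails at runtime."""
    raise RuntimeError("sensor offline")


# Illustrative resources and metric values (assumed, not from the patent).
models = {"db": lambda x: x / 100.0, "cache": broken_model, "web": lambda x: x / 100.0}
metrics = {"db": 90.0, "cache": 10.0, "web": 20.0}

inbox = [evaluate(name, models[name], metrics[name]) for name in models]
result = diagnose(inbox)
print(result)
```

In this sketch the "cache" calculation fails, yet the diagnosis still names "db" as the suspect and records "cache" as a failed resource.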

First claim

Opening claim text (preview).

What is claimed is:
1. A method comprising: providing, by a strategy manager, a supervising strategy for root cause analysis of a distributed system to a plurality of child devices, each child device comprising a diagnostic model, an actor instance, and the supervising strategy, wherein the supervising strategy is encoded at each child device; enabling communications, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, for the plurality of child devices, the communications comprising calculations made by the diagnostic model; based on the communications, identifying a fault in the root cause analysis for one of the plurality of child devices, the fault preventing a final diagnosis for the root cause analysis; and applying, by the actor instance of the one of the plurality of child devices, the supervising strategy and making a final diagnosis for the root cause analysis of the distributed system.
2. The method of claim 1, wherein the fault indicates missing calculations or a failed or disconnected child device.
3. The method of claim 1, wherein the diagnostic model is received from a model repository and encoded in a mathematical structure.
4. The method of claim 1, wherein the calculations are provided asynchronously to an inference engine for evaluation and management.
5. The method of claim 1, wherein the system further comprises a system definition repository that defines an order and details of an evaluation of diagnostic processes and device resources used to perform the calculations.
6. The method of claim 1, wherein the distributed system is a hierarchical actor based system.
7. The method of claim 1, wherein the strategy manager comprises a repository of supervising strategies.
8. The method of claim 1, wherein input to the diagnostic model comprises data coming from a stream of metrics of the child device.
9. The method of claim 1, wherein input to the diagnostic model comprises data coming from an output of another child device.
10. The method of claim 1, wherein the supervising strategy defines circumstances of the fault and the child devices included in the application of the supervising strategy.
11. The method of claim 1, wherein the supervising strategy is dynamically updated by the strategy manager to respond to a current state of the distributed system.
12. The method of claim 1, wherein the supervising strategy is dynamically updated by the strategy manager at runtime.
13. The method of claim 1, wherein the supervising strategies comprise restarting calculations.
14. The method of claim 13, wherein calculations are restarted only on failed or disconnected child devices.
15. The method of claim 13, wherein calculations are restarted for all devices.
16. The method of claim 1, wherein the supervising strategies comprise managing device shutdown.
17. The method of claim 1, wherein the supervising strategies comprise moving calculation execution from a failed child device to another child device.
18. The method of claim 1, wherein the actor messaging protocol provides asynchronous and parallel communication between the manager and the plurality of child devices in a hierarchical structure.
19. A method comprising: receiving, at a child device, a supervising strategy for root cause analysis of a distributed system comprising a strategy manager and a plurality of child devices, the child device comprising a diagnostic model, an actor instance, and the supervising strategy, wherein the supervising strategy is encoded on the child device; communicating, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, calculations made by the diagnostic model to the strategy manager; based on the strategy manager identifying a fault in the root cause analysis for one of the plurality of child devices, the fault preventing a final diagnosis for the root cause analysis and applying the supervising strategy to mitigate the fault and provide fault tolerance in the root cause analysis.
20. A computerized system comprising: a processor; and a non-transitory computer storage medium storing computer-useable instructions that, when used by the processor, cause the processor to: provide, by a strategy manager, a supervising strategy for root cause analysis of a distributed system to a plurality of child devices, each child device comprising a diagnostic model, an actor instance, and the supervising strategy; enable communications, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, for the plurality of child devices, the communications comprising calculations made by the diagnostic model; based on the communications received via the actor messaging protocol, identify a fault in the root cause analysis for a child device of the plurality of child devices, the fault indicating missing calculations or a failed or disconnected child device; and apply the supervising strategy to restart calculations, manage device shutdown of the child device, or move calculation execution from the child device to another child device of the plurality of child devices device.
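The supervising strategies named in claims 13 to 17 (restart only the failed child, restart all children, move the calculation to another child) can be sketched as follows. This is a simplified illustration under stated assumptions: it uses direct function calls rather than an actor messaging protocol, it assumes a restart clears the fault, it omits device shutdown, and the names Strategy, Child, run_rca and make_children are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Strategy(Enum):
    RESTART_FAILED = "restart only the failed child"    # compare claim 14
    RESTART_ALL = "restart calculations on every child" # compare claim 15
    MOVE = "move the calculation to another child"      # compare claim 17


@dataclass
class Child:
    """Hypothetical child device holding one diagnostic model."""
    name: str
    model: Callable[[float], float]
    healthy: bool = True
    runs: int = 0  # how many calculations this child has been asked to run

    def calculate(self, metric: float) -> Optional[float]:
        """Return the model output, or None when the child is failed."""
        self.runs += 1
        return self.model(metric) if self.healthy else None


def run_rca(children: List[Child], metrics: Dict[str, float], strategy: Strategy) -> str:
    """Collect calculations, apply the strategy to missing ones, return a diagnosis."""
    results = {c.name: c.calculate(metrics[c.name]) for c in children}
    missing = [c for c in children if results[c.name] is None]
    if missing and strategy is Strategy.RESTART_ALL:
        for c in missing:
            c.healthy = True  # assumption: a restart clears the fault
        results = {c.name: c.calculate(metrics[c.name]) for c in children}
    elif strategy is Strategy.RESTART_FAILED:
        for c in missing:
            c.healthy = True  # assumption: a restart clears the fault
            results[c.name] = c.calculate(metrics[c.name])
    elif strategy is Strategy.MOVE:
        for c in missing:
            # Raises StopIteration if no healthy child exists; acceptable for a sketch.
            backup = next(peer for peer in children if peer.healthy)
            backup.runs += 1  # the backup runs the failed child's model on its behalf
            results[c.name] = c.model(metrics[c.name])
    # Final diagnosis in this sketch: the child with the highest model score.
    return max(results, key=lambda name: results[name])


def make_children() -> List[Child]:
    """Three illustrative children; 'cache' starts in a failed state."""
    return [
        Child("db", lambda x: x / 100.0),
        Child("cache", lambda x: x / 50.0, healthy=False),
        Child("web", lambda x: x / 100.0),
    ]


metrics = {"db": 40.0, "cache": 45.0, "web": 20.0}
for strategy in Strategy:
    kids = make_children()
    print(strategy.name, run_rca(kids, metrics, strategy), [c.runs for c in kids])
```

Each strategy reaches the same diagnosis here; they differ in how much work is repeated. Restarting only the failed child costs one extra calculation, restarting all children repeats every calculation, and moving the calculation leaves the failed child untouched while loading a healthy peer.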

Assignees

Ca Inc

Inventors

Classifications

  • Probabilistic graphical models, e.g. probabilistic networks · CPC title

  • Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence · CPC title

  • Error or fault reporting or storing · CPC title

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • G06F11/079 · Primary

    Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10417079B2 cover?
Embodiments of the present disclosure relate to a fault tolerant root cause analysis (RCA) system that is able to handle calculation failures during runtime. Calculations (e.g., evaluation of a diagnostic model for a specific component or device) that are performed during the RCA are integrated using different resources of the system under analysis. In order to make a final diagnosis, the resou…
Who is the assignee on this patent?
Ca Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/079. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 17, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).