Root cause analysis
US-2018101426-A1 · Apr 12, 2018 · US
US10417079B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10417079-B2 |
| Application number | US-201715485848-A |
| Country | US |
| Kind code | B2 |
| Filing date | Apr 12, 2017 |
| Priority date | Mar 30, 2017 |
| Publication date | Sep 17, 2019 |
| Grant date | Sep 17, 2019 |
Official abstract text for this publication.
Embodiments of the present disclosure relate to a fault tolerant root cause analysis (RCA) system that is able to handle calculation failures during runtime. Calculations (e.g., evaluation of a diagnostic model for a specific component or device) that are performed during the RCA are integrated using different resources of the system under analysis. In order to make a final diagnosis, the resources exchange messages containing calculation inputs and outputs. Calculation problems due to calculation failures in a particular resource can be resolved efficiently, which reduces resource utilization and minimizes failure propagation to other parts of the system. Accordingly, the system is able to recover and output a diagnosis even if some of the resources fail or generate problems.
Opening claim text (preview).
What is claimed is:
1. A method comprising: providing, by a strategy manager, a supervising strategy for root cause analysis of a distributed system to a plurality of child devices, each child device comprising a diagnostic model, an actor instance, and the supervising strategy, wherein the supervising strategy is encoded at each child device; enabling communications, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, for the plurality of child devices, the communications comprising calculations made by the diagnostic model; based on the communications, identifying a fault in the root cause analysis for one of the plurality of child devices, the fault preventing a final diagnosis for the root cause analysis; and applying, by the actor instance of the one of the plurality of child devices, the supervising strategy and making a final diagnosis for the root cause analysis of the distributed system.
2. The method of claim 1, wherein the fault indicates missing calculations or a failed or disconnected child device.
3. The method of claim 1, wherein the diagnostic model is received from a model repository and encoded in a mathematical structure.
4. The method of claim 1, wherein the calculations are provided asynchronously to an inference engine for evaluation and management.
5. The method of claim 1, wherein the system further comprises a system definition repository that defines an order and details of an evaluation of diagnostic processes and device resources used to perform the calculations.
6. The method of claim 1, wherein the distributed system is a hierarchical actor based system.
7. The method of claim 1, wherein the strategy manager comprises a repository of supervising strategies.
8. The method of claim 1, wherein input to the diagnostic model comprises data coming from a stream of metrics of the child device.
9. The method of claim 1, wherein input to the diagnostic model comprises data coming from an output of another child device.
10. The method of claim 1, wherein the supervising strategy defines circumstances of the fault and the child devices included in the application of the supervising strategy.
11. The method of claim 1, wherein the supervising strategy is dynamically updated by the strategy manager to respond to a current state of the distributed system.
12. The method of claim 1, wherein the supervising strategy is dynamically updated by the strategy manager at runtime.
13. The method of claim 1, wherein the supervising strategies comprise restarting calculations.
14. The method of claim 13, wherein calculations are restarted only on failed or disconnected child devices.
15. The method of claim 13, wherein calculations are restarted for all devices.
16. The method of claim 1, wherein the supervising strategies comprise managing device shutdown.
17. The method of claim 1, wherein the supervising strategies comprise moving calculation execution from a failed child device to another child device.
18. The method of claim 1, wherein the actor messaging protocol provides asynchronous and parallel communication between the manager and the plurality of child devices in a hierarchical structure.
19. A method comprising: receiving, at a child device, a supervising strategy for root cause analysis of a distributed system comprising a strategy manager and a plurality of child devices, the child device comprising a diagnostic model, an actor instance, and the supervising strategy, wherein the supervising strategy is encoded on the child device; communicating, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, calculations made by the diagnostic model to the strategy manager; based on the strategy manager identifying a fault in the root cause analysis for one of the plurality of child devices, the fault preventing a final diagnosis for the root cause analysis and applying the supervising strategy to mitigate the fault and provide fault tolerance in the root cause analysis.
20. A computerized system comprising: a processor; and a non-transitory computer storage medium storing computer-useable instructions that, when used by the processor, cause the processor to: provide, by a strategy manager, a supervising strategy for root cause analysis of a distributed system to a plurality of child devices, each child device comprising a diagnostic model, an actor instance, and the supervising strategy; enable communications, via an actor messaging protocol of the actor instance that is isolated from the diagnostic model, for the plurality of child devices, the communications comprising calculations made by the diagnostic model; based on the communications received via the actor messaging protocol, identify a fault in the root cause analysis for a child device of the plurality of child devices, the fault indicating missing calculations or a failed or disconnected child device; and apply the supervising strategy to restart calculations, manage device shutdown of the child device, or move calculation execution from the child device to another child device of the plurality of child devices device.
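As an informal illustration of the supervising strategies named in claims 13 to 17, the following Python sketch shows one way a supervisor might react when a child device returns no calculation. It is not the patented implementation: the names `Strategy`, `ChildDevice`, and `supervise` are hypothetical, the "diagnostic model" is a plain function, a restart is assumed to always succeed, and the "final diagnosis" is simply the device with the highest score.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional


class Strategy(Enum):
    """Hypothetical labels for the strategies described in the claims."""
    RESTART_FAILED = "restart_failed"  # restart only failed devices (cf. claim 14)
    RESTART_ALL = "restart_all"        # restart every device (cf. claim 15)
    MOVE = "move"                      # run the work elsewhere (cf. claim 17)


@dataclass
class ChildDevice:
    name: str
    model: Callable[[float], float]  # stand-in for a diagnostic model
    healthy: bool = True
    runs: int = 0                    # number of calculations executed here

    def run(self, model: Callable[[float], float], metric: float) -> Optional[float]:
        """Execute a model on this device; None signals a missing calculation."""
        if not self.healthy:
            return None
        self.runs += 1
        return model(metric)

    def calculate(self, metric: float) -> Optional[float]:
        """Evaluate this device's own diagnostic model."""
        return self.run(self.model, metric)


def supervise(children: List[ChildDevice],
              metrics: Dict[str, float],
              strategy: Strategy) -> str:
    """Collect calculations, apply the strategy on failure, return a diagnosis."""
    results = {c.name: c.calculate(metrics[c.name]) for c in children}
    failed = [c for c in children if results[c.name] is None]

    if failed:
        if strategy is Strategy.MOVE:
            # Assumption: any healthy device can evaluate another device's model.
            helper = next((c for c in children if c.healthy), None)
            if helper is None:
                raise RuntimeError("no healthy device available")
            for c in failed:
                results[c.name] = helper.run(c.model, metrics[c.name])
        else:
            targets = failed if strategy is Strategy.RESTART_FAILED else children
            for c in targets:
                c.healthy = True  # simulated restart, assumed to succeed
                results[c.name] = c.calculate(metrics[c.name])

    # Toy "final diagnosis": the device whose model reports the highest score.
    return max(results, key=lambda name: results[name])
```

In this sketch the choice of strategy changes how much work is repeated: restarting only the failed device reruns one calculation, restarting all devices reruns every calculation, and moving execution leaves the failed device untouched while a healthy device does the extra work.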
Probabilistic graphical models, e.g. probabilistic networks · CPC title
Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence · CPC title
Error or fault reporting or storing · CPC title
in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title
Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title