Identifying root causes of failures in a deployed distributed application using historical fine grained machine state data

US9772898B2 · US · B2

Patent metadata
Field: Value
Publication number: US-9772898-B2
Application number: US-201514852006-A
Country: US
Kind code: B2
Filing date: Sep 11, 2015
Priority date: Sep 11, 2015
Publication date: Sep 26, 2017
Grant date: Sep 26, 2017

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Methods and arrangements for identifying root causes of system failures in a distributed system said method including: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; comparing the failed machine state against the healthy map model; identifying, based on the comparison, at least one root cause of the failed machine state; and displaying, on a display device, a ranked list comprising the at least one root cause. Other variants and embodiments are broadly contemplated herein.
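
The steps listed in the abstract can be illustrated with a minimal sketch. The text on this page does not define the data structure behind the healthy map model, so the sketch assumes, purely for illustration, that each snapshot is a flat dictionary of machine-state keys and values and that the model counts which values each key took while the system was healthy. The names `build_healthy_model` and `rank_root_causes`, the sample keys, and the rarity-based scoring rule are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict

def build_healthy_model(history):
    """Aggregate healthy snapshots: for each state key, count each observed value.
    (Assumed representation; the patent text here does not specify one.)"""
    model = defaultdict(lambda: defaultdict(int))
    for snapshot in history:
        for key, value in snapshot.items():
            model[key][value] += 1
    return model

def rank_root_causes(failed_state, model, n_snapshots):
    """Score each state key by how unusual it is relative to the healthy model,
    then return (key, score) pairs sorted from most to least suspicious."""
    scores = {}
    for key, value in failed_state.items():
        seen = model.get(key, {}).get(value, 0)
        scores[key] = 1.0 - seen / n_snapshots  # 1.0 means never seen while healthy
    # A key that was usually present while healthy but is absent now is also suspect.
    for key, values in model.items():
        if key not in failed_state:
            scores[key] = sum(values.values()) / n_snapshots
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical historical data: three identical healthy snapshots.
history = [
    {"proc:nginx": "running", "cfg:max_conn": "1024", "disk:/var": "ok"},
    {"proc:nginx": "running", "cfg:max_conn": "1024", "disk:/var": "ok"},
    {"proc:nginx": "running", "cfg:max_conn": "1024", "disk:/var": "ok"},
]
# Hypothetical failed state: one configuration value differs.
failed = {"proc:nginx": "running", "cfg:max_conn": "16", "disk:/var": "ok"}

ranked = rank_root_causes(failed, build_healthy_model(history), len(history))
for key, score in ranked:
    print(f"{key}: {score:.2f}")
```

In this toy run the changed configuration setting is ranked first because its value never appeared in any healthy snapshot. A real system would likely need a richer model than value counts, for example the property graphs described in the dependent claims.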

First claim

Opening claim text (preview).

What is claimed is:

1. A method of identifying root causes of system failures in a distributed system, said method comprising: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; comparing the failed machine state against the healthy map model; identifying, based on the comparison, at least one root cause of the failed machine state; and displaying, on a display device, a ranked list comprising the at least one root cause.

2. The method according to claim 1, wherein the machine state data comprise at least one of: a process, a connection, a configuration setting, an application metric, a resource attribute, a disk attribute, a processor attribute, and a memory attribute.

3. The method according to claim 1, wherein the machine state data is collected at predetermined intervals; and the historical machine state data are updated when the machine state data is collected.

4. The method according to claim 1, further comprising: identifying dependencies between interconnected entities within the distributed system; and creating a property graph representation based on the identified dependencies.

5. The method according to claim 4, further comprising creating the healthy map model by aggregating a plurality of property graph representations, wherein each property graph relates to a particular snapshot of machine state data.

6. The method according to claim 4, further comprising: determining a failure time, wherein the failure time is associated with a machine state failure; and determining a healthy time, wherein the healthy time is associated with a healthy state of the machine state and its dependencies prior to the failure time.

7. The method according to claim 6, further comprising: categorizing the machine state data collected between the healthy time and the failure time; wherein the categorizing comprises determining if a machine state is at least one of new, missing, changed, and unchanged.

8. The method according to claim 7, further comprising generating at least one seed-anomaly score, using an inference algorithm for machine states within the categorized machine state data.

9. The method according to claim 8, further comprising modifying the at least one seed-anomaly score, based on an iterative graph convergence algorithm; wherein the ranked list is based on the modified at least one seed-anomaly score.

10. An apparatus for identifying root causes of system failures in a distributed system apparatus comprising: at least one processor; and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.

11. A computer program product for identifying root causes of system failures in a distributed system, said computer program product comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.

12. The computer program product according to claim 11, wherein the machine state data comprise at least one of: a process, a connection, a configuration setting, an application metric, a resource attribute, a disk attribute, a processor attribute, and a memory attribute.

13. The computer program product according to claim 11, wherein the machine state data is collected at predetermined intervals; and the historical machine state data are updated when the machine state data is collected.

14. The computer program product according to claim 11, wherein the computer readable program code comprises: computer readable program code that identifies dependencies between interconnected entities within the distributed system; and creates a property graph representation based on the identified dependencies.

15. The computer program product according to claim 14, wherein the computer readable program code comprises: computer readable program code that creates the healthy map model by aggregating a plurality of property graph representations, wherein each property graph relates to a particular snapshot of machine state data.

16. The computer program product according to claim 15, wherein the computer readable program code comprises: computer readable program code that determines a failure time, wherein the failure time is associated with a machine state failure; and determines a healthy time, wherein the healthy time is associated with a healthy state of the machine state and its dependencies prior to the failure time.

17. The computer program product according to claim 16, wherein the computer readable program code comprises: computer readable program code that categorizes the machine state data collected between the healthy time and the failure time; wherein the categorizing comprises determining if a machine state is at least one of new, missing, changed, and unchanged.

18. The computer program product according to claim 17, wherein the computer readable program code comprises: computer readable program code that generates at least one seed-anomaly score, using an inference algorithm for machine states within the categorized machine state data.

19. The computer program product according to claim 18, wherein the computer readable program code comprises: computer readable program code that modifies the at least one seed-anomaly score, based on an iterative graph convergence algorithm; wherein the ranked list is based on the modified at least one seed-anomaly score.

20. A method comprising: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on
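
Claims 7 to 9 describe categorizing machine states, seeding anomaly scores, and refining them over the dependency graph. The claims do not name the inference algorithm or the iterative graph convergence algorithm, so the sketch below substitutes simple stand-ins: fixed per-category seed weights and a damped score propagation along dependency edges, similar in spirit to personalized PageRank. The function names, the `SEED_WEIGHTS` values, the `damping` parameter, and the sample entities are all assumptions made for illustration.

```python
def categorize(healthy, failed):
    """Label each machine state as new, missing, changed, or unchanged
    by comparing the healthy-time snapshot with the failure-time snapshot."""
    labels = {}
    for key in healthy.keys() | failed.keys():
        if key not in healthy:
            labels[key] = "new"
        elif key not in failed:
            labels[key] = "missing"
        elif healthy[key] != failed[key]:
            labels[key] = "changed"
        else:
            labels[key] = "unchanged"
    return labels

# Hypothetical seed weights; the claims do not give numeric values.
SEED_WEIGHTS = {"new": 1.0, "missing": 1.0, "changed": 0.8, "unchanged": 0.0}

def propagate(seeds, depends_on, damping=0.5, tol=1e-9, max_iter=100):
    """Stand-in for the iterative graph convergence step: each entity passes a
    damped, evenly split share of its score to the entities it depends on,
    repeating until the scores stop changing."""
    scores = dict(seeds)
    for _ in range(max_iter):
        updated = dict(seeds)
        for node, deps in depends_on.items():
            for dep in deps:
                share = damping * scores.get(node, 0.0) / len(deps)
                updated[dep] = updated.get(dep, 0.0) + share
        delta = max((abs(updated[n] - scores.get(n, 0.0)) for n in updated), default=0.0)
        scores = updated
        if delta < tol:
            break
    return scores

# Hypothetical snapshots at the healthy time and at the failure time.
healthy = {"app": "ok", "db": "fast", "cfg:db_pool": "50", "cache": "up"}
failed = {"app": "error", "db": "slow", "cfg:db_pool": "5", "cache": "up"}
# Hypothetical dependencies: app depends on db and cache; db depends on its config.
depends_on = {"app": ["db", "cache"], "db": ["cfg:db_pool"]}

labels = categorize(healthy, failed)
seeds = {key: SEED_WEIGHTS[label] for key, label in labels.items()}
scores = propagate(seeds, depends_on)
ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
for key, score in ranked:
    print(f"{key}: {score:.2f}")
```

With these assumed weights, the changed configuration setting ends up ranked above the database and the application, because anomaly scores accumulate on the entity that the other anomalous entities depend on. Normalizing each entity's outgoing share and keeping `damping` below 1 ensures the iteration converges even if the dependency graph contains cycles.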

Assignees

Inventors

Classifications

  • where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title

  • Display for diagnostics, e.g. diagnostic result display, self-test user interface · CPC title

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis · CPC title

  • involving time analysis · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9772898B2 cover?
Methods and arrangements for identifying root causes of system failures in a distributed system said method including: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine sta…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F11/079. Mapped technology areas include Physics.
When was this patent published?
Publication date Sep 26, 2017 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).