Log analytics for problem diagnosis
US-2016124823-A1 · May 5, 2016 · US
US9772898B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-9772898-B2 |
| Application number | US-201514852006-A |
| Country | US |
| Kind code | B2 |
| Filing date | Sep 11, 2015 |
| Priority date | Sep 11, 2015 |
| Publication date | Sep 26, 2017 |
| Grant date | Sep 26, 2017 |
Official abstract text for this publication.
Methods and arrangements for identifying root causes of system failures in a distributed system, said method including: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; comparing the failed machine state against the healthy map model; identifying, based on the comparison, at least one root cause of the failed machine state; and displaying, on a display device, a ranked list comprising the at least one root cause. Other variants and embodiments are broadly contemplated herein.
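The steps in the abstract (record snapshots, build a healthy model, compare a failed state against it, rank the candidates) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the flat key/value snapshot format, the function names `build_healthy_model` and `rank_deviations`, and the rarity-based score. The abstract does not say how the healthy map model is stored or how the ranking is computed.

```python
from collections import Counter, defaultdict

def build_healthy_model(snapshots):
    """Aggregate historical snapshots: count the values seen for each state key."""
    model = defaultdict(Counter)
    for snap in snapshots:
        for key, value in snap.items():
            model[key][value] += 1
    return model

def rank_deviations(model, failed_state, n_snapshots):
    """Rank state keys by how rarely the failed value appeared while healthy.

    The score is a hypothetical rarity measure, not the patent's formula.
    """
    scores = {}
    for key, value in failed_state.items():
        seen = model[key][value] if key in model else 0
        scores[key] = 1.0 - seen / n_snapshots
    # Keys that were present in healthy history but are absent at failure time.
    for key in model:
        if key not in failed_state:
            scores[key] = sum(model[key].values()) / n_snapshots
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(key, score) for key, score in ranked if score > 0]

# Hypothetical machine state snapshots (keys and values are made up).
history = [
    {"nginx.process": "running", "db.port": 5432, "app.max_conn": 100},
    {"nginx.process": "running", "db.port": 5432, "app.max_conn": 100},
    {"nginx.process": "running", "db.port": 5432, "app.max_conn": 200},
]
failed = {"nginx.process": "running", "db.port": 5433, "app.max_conn": 200}

model = build_healthy_model(history)
ranked = rank_deviations(model, failed, len(history))
for key, score in ranked:
    print(f"{key}: {score:.2f}")
```

In this toy data the never-before-seen `db.port` value ranks first, the occasionally seen `app.max_conn` value ranks second, and the unchanged process state is dropped from the list.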
Opening claim text (preview).
What is claimed is:

1. A method of identifying root causes of system failures in a distributed system, said method comprising: utilizing at least one processor to execute computer code that performs the steps of: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on the historical machine state data, a healthy map model; detecting at least one failed machine state in the distributed system; comparing the failed machine state against the healthy map model; identifying, based on the comparison, at least one root cause of the failed machine state; and displaying, on a display device, a ranked list comprising the at least one root cause.
2. The method according to claim 1, wherein the machine state data comprise at least one of: a process, a connection, a configuration setting, an application metric, a resource attribute, a disk attribute, a processor attribute, and a memory attribute.
3. The method according to claim 1, wherein the machine state data is collected at predetermined intervals; and the historical machine state data are updated when the machine state data is collected.
4. The method according to claim 1, further comprising: identifying dependencies between interconnected entities within the distributed system; and creating a property graph representation based on the identified dependencies.
5. The method according to claim 4, further comprising creating the healthy map model by aggregating a plurality of property graph representations, wherein each property graph relates to a particular snapshot of machine state data.
6. The method according to claim 4, further comprising: determining a failure time, wherein the failure time is associated with a machine state failure; and determining a healthy time, wherein the healthy time is associated with a healthy state of the machine state and its dependencies prior to the failure time.
7. The method according to claim 6, further comprising: categorizing the machine state data collected between the healthy time and the failure time; wherein the categorizing comprises determining if a machine state is at least one of new, missing, changed, and unchanged.
8. The method according to claim 7, further comprising generating at least one seed-anomaly score, using an inference algorithm for machine states within the categorized machine state data.
9. The method according to claim 8, further comprising modifying the at least one seed-anomaly score, based on an iterative graph convergence algorithm; wherein the ranked list is based on the modified at least one seed-anomaly score.
10. An apparatus for identifying root causes of system failures in a distributed system, said apparatus comprising: at least one processor; and a computer readable storage medium having computer readable program code embodied therewith and executable by the at least one processor, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.
11. A computer program product for identifying root causes of system failures in a distributed system, said computer program product comprising: a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code that records, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; computer readable program code that creates, based on the historical machine state data, a healthy map model; computer readable program code that detects at least one failed machine state in the distributed system; computer readable program code that compares the failed machine state against the healthy map model; computer readable program code that identifies, based on the comparison, at least one root cause of the failed machine state; and computer readable program code that displays, on a display device, a ranked list comprising the at least one root cause.
12. The computer program product according to claim 11, wherein the machine state data comprise at least one of: a process, a connection, a configuration setting, an application metric, a resource attribute, a disk attribute, a processor attribute, and a memory attribute.
13. The computer program product according to claim 11, wherein the machine state data is collected at predetermined intervals; and the historical machine state data are updated when the machine state data is collected.
14. The computer program product according to claim 11, wherein the computer readable program code comprises: computer readable program code that identifies dependencies between interconnected entities within the distributed system; and creates a property graph representation based on the identified dependencies.
15. The computer program product according to claim 14, wherein the computer readable program code comprises: computer readable program code that creates the healthy map model by aggregating a plurality of property graph representations, wherein each property graph relates to a particular snapshot of machine state data.
16. The computer program product according to claim 15, wherein the computer readable program code comprises: computer readable program code that determines a failure time, wherein the failure time is associated with a machine state failure; and determines a healthy time, wherein the healthy time is associated with a healthy state of the machine state and its dependencies prior to the failure time.
17. The computer program product according to claim 16, wherein the computer readable program code comprises: computer readable program code that categorizes the machine state data collected between the healthy time and the failure time; wherein the categorizing comprises determining if a machine state is at least one of new, missing, changed, and unchanged.
18. The computer program product according to claim 17, wherein the computer readable program code comprises: computer readable program code that generates at least one seed-anomaly score, using an inference algorithm for machine states within the categorized machine state data.
19. The computer program product according to claim 18, wherein the computer readable program code comprises: computer readable program code that modifies the at least one seed-anomaly score, based on an iterative graph convergence algorithm; wherein the ranked list is based on the modified at least one seed-anomaly score.
20. A method comprising: recording, in a storage device, collected machine state data, wherein the collected machine state data are added to historical machine state data; creating, based on
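Claims 4 through 9 (and their counterparts 14 through 19) describe a dependency graph, a new/missing/changed/unchanged categorization between a healthy time and a failure time, seed-anomaly scores, and an iterative graph convergence step. The following Python sketch is one way such a pipeline could look. The claims name neither the inference algorithm nor the convergence algorithm, so the per-category weights in `SEED_WEIGHTS`, the damped averaging rule in `propagate`, and all entity names are assumptions chosen for illustration, not the patented method.

```python
def categorize(healthy, failed):
    """Label each state key as new, missing, changed, or unchanged."""
    categories = {}
    for key in healthy.keys() | failed.keys():
        if key not in healthy:
            categories[key] = "new"
        elif key not in failed:
            categories[key] = "missing"
        elif healthy[key] != failed[key]:
            categories[key] = "changed"
        else:
            categories[key] = "unchanged"
    return categories

# Hypothetical seed-anomaly weights per category (not from the patent).
SEED_WEIGHTS = {"new": 1.0, "missing": 1.0, "changed": 0.8, "unchanged": 0.0}

def propagate(seed, depends_on, damping=0.5, tol=1e-9, max_iter=100):
    """Iterate scores to a fixed point over the dependency graph.

    Assumed update rule: an entity keeps part of its seed score and
    inherits the mean score of the entities that depend on it, so
    suspicion flows from symptoms toward upstream dependencies.
    """
    dependents = {node: [] for node in seed}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)
    scores = dict(seed)
    for _ in range(max_iter):
        updated = {}
        for node in seed:
            incoming = [scores.get(m, 0.0) for m in dependents[node]]
            mean_in = sum(incoming) / len(incoming) if incoming else 0.0
            updated[node] = (1 - damping) * seed[node] + damping * mean_in
        delta = max(abs(updated[n] - scores[n]) for n in seed)
        scores = updated
        if delta < tol:
            break
    return scores

# Hypothetical three-tier system: web depends on app, app depends on db.
healthy_state = {"web": "up", "app": "v1", "db": "port=5432"}
failed_state = {"web": "up", "app": "v2", "db": "port=5433"}
depends_on = {"web": ["app"], "app": ["db"], "db": []}

categories = categorize(healthy_state, failed_state)
seed = {key: SEED_WEIGHTS[cat] for key, cat in categories.items()}
scores = propagate(seed, depends_on)
ranked = sorted(scores, key=lambda k: (-scores[k], k))
print(ranked)
```

With `damping` below 1 the update is a contraction, so the loop settles on a unique fixed point regardless of the starting scores. In this toy example `app` and `db` start with equal seed scores, and propagation lifts `db` above `app` because an anomalous entity depends on it.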
where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems (multiprogramming arrangements G06F9/46; allocation of resources G06F9/50) · CPC title
Display for diagnostics, e.g. diagnostic result display, self-test user interface · CPC title
in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title
using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis · CPC title
involving time analysis · CPC title