Automated root-cause analysis for distributed systems using tracing-data
US-11645141-B2 · May 9, 2023 · US
US11789804B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-11789804-B1 |
| Application number | US-202217589556-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jan 31, 2022 |
| Priority date | Oct 18, 2021 |
| Publication date | Oct 17, 2023 |
| Grant date | Oct 17, 2023 |
Official abstract text for this publication.
A method of identifying a root cause of a failure for a trace within a microservices-based application includes determining if a root span of the trace is an error span resulting in an error experienced by a user at a front end of the microservices-based application. If the root span of the trace is an error span, the method analyzes a plurality of spans comprising the trace to determine if the trace comprises at least one leaf error span. If the trace comprises a single leaf error span, the method attributes the root cause of the failure in the trace to a service associated with the single leaf error span. If the trace comprises multiple leaf error spans the method attributes the root cause of the failure in the trace to a service associated with a leaf error span of the multiple leaf error spans comprising a latest starting timestamp.
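The attribution rule in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the `Span` record, its field names, and the helper names are assumptions made for the example. It uses the narrower definition from claim 1, where a leaf error span is the last error span of an unbroken chain of error spans starting at the root. The abstract does not say how ties on the starting timestamp are resolved; the sketch simply takes the first maximum.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span record; field names are assumptions for this sketch.
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ts: float           # starting timestamp, any consistent unit
    is_error: bool


def leaf_error_spans(spans: list[Span]) -> list[Span]:
    """Return error spans that end an unbroken error chain from the root."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    roots = children.get(None, [])
    if len(roots) != 1 or not roots[0].is_error:
        return []  # root span is not an error span: nothing to attribute
    leaves: list[Span] = []
    stack = [roots[0]]
    while stack:
        span = stack.pop()
        error_children = [c for c in children.get(span.span_id, []) if c.is_error]
        if error_children:
            stack.extend(error_children)  # the error chain continues downward
        else:
            leaves.append(span)           # the error chain ends at this span
    return leaves


def root_cause_service(spans: list[Span]) -> Optional[str]:
    """Attribute the failure to the leaf error span with the latest start."""
    leaves = leaf_error_spans(spans)
    if not leaves:
        return None
    return max(leaves, key=lambda s: s.start_ts).service


trace = [
    Span("a", None, "frontend", 0.0, True),
    Span("b", "a", "checkout", 1.0, True),
    Span("c", "b", "payment", 2.0, True),         # leaf error span
    Span("d", "a", "recommendation", 3.0, True),  # leaf error span, starts later
    Span("e", "a", "cart", 4.0, False),
    Span("f", "e", "inventory", 5.0, True),       # error, but chain broken at "e"
]
print(root_cause_service(trace))  # prints "recommendation"
```

In the sample trace, the `inventory` error is ignored because its parent span succeeded, so the chain from the root is broken. Of the two remaining leaf error spans, `recommendation` starts later and receives the attribution.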
Opening claim text (preview).
What is claimed is:

1. A method of identifying a root cause of a failure for a trace within a microservices-based application, the method comprising: determining if a root span of the trace is an error span resulting in an error experienced by a user at a front end of the microservices-based application; responsive to a determination that the root span of the trace is the error span, analyzing a plurality of spans comprising the trace to determine if the trace comprises at least one leaf error span that is a last error span of a chain of unbroken error spans starting at the root span; responsive to a determination that the trace comprises the at least one leaf error span, attributing the root cause of the failure in the trace to a service associated with the at least one leaf error span; and responsive to a determination that the trace comprises multiple leaf error spans, attributing the root cause of the failure in the trace to a service associated with a leaf error span of the multiple leaf error spans that comprises a latest starting timestamp.
2. The method of claim 1, wherein the trace is associated with a workflow, wherein the workflow is operable to group together a plurality of spans in the trace generated in response to a client process implemented by a group of services comprised within the microservices-based application.
3. The method of claim 1, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure.
4. The method of claim 1, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and further comprising: computing metrics for the service associated with the root cause of the failure using the global tag.
5. The method of claim 1, further comprising: displaying the trace as a graphical element in a graphical user interface, wherein the graphical element visually indicates which service in the trace is associated with the root cause of the failure.
6. The method of claim 1, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and further comprising: computing metrics for the service associated with the root cause of the failure using the global tag and a data set associated with a metric time series modality.
7. The method of claim 1, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and further comprising: computing metrics for the service associated with the root cause of the failure using the global tag and a data set associated with a metric events modality.
8. The method of claim 1, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and further comprising: computing metrics for the service associated with the root cause of the failure using the global tag, wherein the metrics comprise: request; error; and latency related metrics.
9. The method of claim 1, further comprising: displaying the trace as a graphical element in a graphical user interface, wherein the graphical element visually indicates which service in the trace is associated with the root cause of the failure; and providing a client with information regarding a service team connected with the service associated with the root cause of the failure through the graphical user interface.
10. A non-transitory computer-readable medium having computer-readable program code embodied therein for causing a computer system to perform a method of identifying a root cause of a failure for a trace within a microservices-based application, the method comprising: determining if a root span of the trace is an error span resulting in an error experienced by a user at a front end of the microservices-based application; responsive to a determination that the root span of the trace is an error span, analyzing a plurality of spans comprising the trace to determine if the trace comprises at least one leaf error span that is a last error span of a chain of unbroken error spans starting at the root span; responsive to a determination that the trace comprises at least one leaf error span, attributing the root cause of the failure in the trace to a service associated with the at least one leaf error span; and responsive to a determination that the trace comprises multiple leaf error spans, attributing the root cause of the failure in the trace to a service associated with a leaf error span of the multiple leaf error spans that comprises a latest starting timestamp.
11. The non-transitory computer-readable medium of claim 10, wherein the trace is associated with a workflow, wherein the workflow groups together a plurality of spans in the trace generated in response to a client process implemented by a group of services comprised within the microservices-based application.
12. The non-transitory computer-readable medium of claim 10, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure.
13. The non-transitory computer-readable medium of claim 10, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and wherein the method further comprises: computing metrics for the service associated with the root cause of the failure using the global tag.
14. The non-transitory computer-readable medium of claim 10, wherein the method further comprises: displaying the trace as a graphical element in a graphical user interface, wherein the graphical element visually indicates which service in the trace is associated with the root cause of the failure.
15. The non-transitory computer-readable medium of claim 10, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and wherein the method further comprises: computing metrics for the service associated with the root cause of the failure using the global tag and a data set associated with a metric time series modality.
16. The non-transitory computer-readable medium of claim 10, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and wherein the method further comprises: computing metrics for the service associated with the root cause of the failure using the global tag and a data set associated with a metric events modality.
17. The non-transitory computer-readable medium of claim 10, wherein the trace is tagged with a global tag comprising a name of the service associated with the root cause of the failure, and wherein the method further comprises: computing metrics for the service associated with the root cause of the failure using the global tag, wherein the metrics comprise: request; error; and latency related metrics.
18. The non-transitory computer-readable medium of claim 10, wherein the method further comprises: displaying the trace as a service graph in a graphical user interface, wherein the service graph visually indicates which service in the trace is associated with the root cause of the failure; and providing a client information regarding a service team connected with the service associated with the root cause of the failure through the graphical user interface.
19. A system for performing a method of identifying a root cause of a failure for a trace within a
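
Several dependent claims describe tagging the trace with a global tag that names the root-cause service and then computing request, error, and latency metrics from that tag. The Python sketch below is one loose way to picture this. The tag key `root_cause_service`, the dictionary layout of a trace, and the `duration_ms` field are all hypothetical, and the claims do not specify how the metrics are defined; here "requests" is taken to be the total number of traces seen.

```python
from __future__ import annotations

from collections import defaultdict

ROOT_CAUSE_TAG = "root_cause_service"  # hypothetical tag key, not named in the claims


def tag_root_cause(trace: dict, service: str) -> dict:
    """Return a copy of the trace carrying a trace-wide tag naming the service."""
    tags = {**trace.get("tags", {}), ROOT_CAUSE_TAG: service}
    return {**trace, "tags": tags}


def metrics_by_root_cause(traces: list[dict]) -> dict[str, dict[str, float]]:
    """Group tagged traces by service and derive simple error and latency figures."""
    durations: dict[str, list[float]] = defaultdict(list)
    for t in traces:
        service = t.get("tags", {}).get(ROOT_CAUSE_TAG)
        if service is not None:
            durations[service].append(t["duration_ms"])  # assumed duration field
    total = len(traces)  # request count across all traces, tagged or not
    return {
        svc: {
            "error_traces": len(d),
            "error_share": len(d) / total,
            "mean_latency_ms": sum(d) / len(d),
        }
        for svc, d in durations.items()
    }


traces = [
    tag_root_cause({"duration_ms": 100.0}, "payment"),
    tag_root_cause({"duration_ms": 300.0}, "payment"),
    tag_root_cause({"duration_ms": 80.0}, "cart"),
    {"duration_ms": 50.0, "tags": {}},  # successful trace, no root-cause tag
]
print(metrics_by_root_cause(traces)["payment"])
```

Storing the service name once at the trace level, rather than on individual spans, lets the aggregation step group by root cause without walking the span tree again.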
Environments for analysis, debugging or testing of software · CPC title
by tracing the execution of the program · CPC title
by runtime analysis (performance monitoring G06F11/3466) · CPC title
Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title
Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title