Failure analysis system for a distributed storage system

US11599435B2 · US · B2

Patent metadata
Publication number: US-11599435-B2
Application number: US-201916540080-A
Country: US
Kind code: B2
Filing date: Aug 14, 2019
Priority date: Jun 26, 2019
Publication date: Mar 7, 2023
Grant date: Mar 7, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A failure analysis system identifies a root cause of a failure (or other health issue) in a virtualized computing environment and provides a recommendation for remediation. The failure analysis system uses a model-based reasoning (MBR) approach that involves building a model describing the relationships/dependencies of elements in the various layers of the virtualized computing environment, and the model is used by an inference engine to generate facts and rules for reasoning to identify an element in the virtualized computing environment that is causing the failure. Then, the failure analysis system uses a decision tree analysis (DTA) approach to perform a deep diagnosis of the element, by traversing a decision tree that was generated by combining the rules for reasoning provided by the MBR approach, in conjunction with examining data collected by health monitors. The result of the DTA approach is then used to generate the recommendation for remediation.
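
The model-based reasoning step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the element names, the `MODEL` dictionary, and the single rule in `suspect_elements` are hypothetical, chosen only to show how a dependency model across layers can narrow a set of unhealthy elements down to a likely source.

```python
from collections import deque

# Hypothetical model: each element maps to the elements it depends on.
# The layers (VM -> host -> storage cluster -> disk/network) are assumptions.
MODEL = {
    "vm-1": ["host-1"],
    "vm-2": ["host-2"],
    "host-1": ["storage-cluster"],
    "host-2": ["storage-cluster"],
    "storage-cluster": ["disk-group-1", "network-1"],
    "disk-group-1": [],
    "network-1": [],
}


def dependencies_of(model, element):
    """Return every element reachable from `element` via dependency edges."""
    seen, queue = set(), deque([element])
    while queue:
        current = queue.popleft()
        for dep in model.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def suspect_elements(model, unhealthy):
    """Illustrative rule: an unhealthy element is a suspect only if none of
    the elements it depends on are themselves unhealthy."""
    return sorted(
        e for e in unhealthy if not (dependencies_of(model, e) & unhealthy)
    )


# Facts: elements currently reported unhealthy by (hypothetical) health checks.
UNHEALTHY = {"vm-1", "vm-2", "host-1", "host-2", "storage-cluster", "network-1"}
SUSPECTS = suspect_elements(MODEL, UNHEALTHY)
```

In this sketch, every unhealthy element except `network-1` depends on another unhealthy element, so `network-1` is the only suspect handed on for deeper diagnosis.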

First claim

Opening claim text (preview).

We claim:

1. A method to address health issues indicative of operating conditions in a virtualized computing environment that includes at least one host, the method comprising: monitoring, by a health check agent installed at the at least one host, health of the virtualized computing environment; detecting, based on health check information provided by the health check agent from the monitoring, a health issue that has manifested in the virtualized computing environment; generating, by an automated tool, a model that represents elements at multiple layers of the virtualized computing environment, and connections and relationships between the elements; using model-based reasoning to identify an element, amongst the elements, in the virtualized computing environment that is a source of the health issue, wherein the model-based reasoning uses the model, representing the connections and the relationships between the elements in the virtualized computing environment, to determine facts and rules for identification of the element that is the source of the health issue; using a decision tree analysis to identify a root cause of the health issue at the identified element, wherein a decision tree for use in the decision tree analysis is generated by injecting a fault into the virtualized computing environment to determine types, locations, and number of failures that are generated in the virtualized computing environment due to the injected fault; based on a result of the decision tree analysis that identifies the root cause of the health issue, generating a recommendation for remediation of the health issue; and performing the recommended remediation of the health issue.

2. The method of claim 1, wherein the automated tool comprises a diagnostics tool, and wherein generating the model includes applying the diagnostics tool to the virtualized computing environment to discover and collect information about the elements in the virtualized computing environment.

3. The method of claim 1, wherein the decision tree, for use in the decision tree analysis, is further generated by one or more of: using results of the injected fault as starting points for a machine-learning technique to evolve the decision tree; analyzing internal program logic of the elements in the virtualized computing environment; or analyzing processes that were historically used to troubleshoot health issues that were reported in the virtualized computing environment.

4. The method of claim 1, wherein the elements in the virtualized computing environment include elements of a distributed storage system that are arranged in storage clusters, and wherein at least one of the health issues includes a cluster partition issue or other storage-operation-related issue in the distributed storage system.

5. The method of claim 1, wherein using the decision tree analysis to identify the root cause of the health issue includes evaluating the health check information and configuration information while traversing a branch of the decision tree.

6. The method of claim 1, wherein: the facts determined from the model are used to generate the rules, and the rules are combined to form the decision tree for the decision tree analysis.

7. The method of claim 1, further comprising updating either or both the model and the decision tree for the decision tree analysis, in response to identifying a new root cause associated with a particular health issue, so that the updated model or the updated decision tree are usable to analyze other health issues that are similar to the particular health issue.

8. A non-transitory computer-readable medium having instructions stored thereon, which in response to execution by one or more processors, cause the one or more processors to perform or control performance of operations to address health issues indicative of operating conditions in a virtualized computing environment that includes at least one host, the operations comprising: monitoring, by a health check agent installed at the at least one host, health of the virtualized computing environment; detecting, based on health check information provided by the health check agent from the monitoring, a health issue that has manifested in the virtualized computing environment; generating, by an automated tool, a model that represents elements at multiple layers of the virtualized computing environment, and connections and relationships between the elements; using model-based reasoning to identify an element, amongst the elements, in the virtualized computing environment that is a source of the health issue, wherein the model-based reasoning uses the model, representing the connections and the relationships between the elements in the virtualized computing environment, to determine facts and rules for identification of the element that is the source of the health issue; using a decision tree analysis to identify a root cause of the health issue at the identified element, wherein a decision tree for use in the decision tree analysis is generated by injecting a fault into the virtualized computing environment to determine types, locations, and number of failures that are generated in the virtualized computing environment due to the injected fault; based on a result of the decision tree analysis that identifies the root cause of the health issue, generating a recommendation for remediation of the health issue; and performing the recommended remediation of the health issue.

9. The non-transitory computer-readable medium of claim 8, wherein the automated tool comprises a diagnostics tool, and wherein generating the model includes applying the diagnostics tool to the virtualized computing environment to discover and collect information about the elements in the virtualized computing environment.

10. The non-transitory computer-readable medium of claim 8, wherein the decision tree, for use in the decision tree analysis, is further generated by one or more of: using results of the injected fault as starting points for a machine-learning technique to evolve the decision tree; analyzing internal program logic of the elements in the virtualized computing environment; or analyzing processes that were historically used to troubleshoot health issues that were reported in the virtualized computing environment.

11. The non-transitory computer-readable medium of claim 8, wherein the elements in the virtualized computing environment include elements of a distributed storage system that are arranged in storage clusters, and wherein at least one of the health issues includes a cluster partition issue or other storage-operation-related issue in the distributed storage system.

12. The non-transitory computer-readable medium of claim 8, wherein using the decision tree analysis to identify the root cause of the health issue includes evaluating the health check information and configuration information while traversing a branch of the decision tree.

13. The non-transitory computer-readable medium of claim 8, wherein: the facts determined from the model are used to generate the rules, and the rules are combined to form the decision tree for the decision tree analysis.

14. The non-transitory computer-readable medium of claim 8, wherein the operations further comprise: updating either or both the model and the decision tree for the decision tree analysis, in response to identifying a new root cause associated with a particular health issue, so that the updated model or the updated decision tree are usable to analyze other health issues that are similar to the particular health issue.

15.
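
The decision tree analysis described in the claims (traversing a branch while evaluating health check information and configuration information, then producing a recommendation) can be sketched in Python as follows. This is an illustrative assumption, not the claimed implementation: the `Node` class, the `diagnose` function, the example tree for a cluster partition issue, and the keys `unreachable_hosts`, `mtu`, and `expected_mtu` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """A decision node (has `check`) or a leaf (has a root cause and advice)."""
    check: Optional[Callable[[dict, dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    root_cause: str = ""
    recommendation: str = ""


def diagnose(node, health, config):
    """Walk one branch of the tree, evaluating each check against the
    health check information and the configuration information."""
    while node.check is not None:
        node = node.yes if node.check(health, config) else node.no
    return node.root_cause, node.recommendation


# Hypothetical tree for a cluster partition issue at a network element.
TREE = Node(
    check=lambda health, config: health["unreachable_hosts"] > 0,
    yes=Node(
        check=lambda health, config: config["mtu"] != config["expected_mtu"],
        yes=Node(
            root_cause="MTU mismatch",
            recommendation="Set the MTU to the expected value on all hosts",
        ),
        no=Node(
            root_cause="Host network link down",
            recommendation="Check the physical uplinks of the unreachable hosts",
        ),
    ),
    no=Node(
        root_cause="No partition detected",
        recommendation="No action required",
    ),
)

RESULT = diagnose(
    TREE,
    health={"unreachable_hosts": 2},
    config={"mtu": 1500, "expected_mtu": 9000},
)
```

Keeping each check as a function of both inputs mirrors the claim language in which health check information and configuration information are evaluated together along a branch. New root causes could be supported by adding leaves or checks without changing `diagnose`.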

Assignees

Inventors

Classifications

  • Trees, e.g. B+trees · CPC title

  • Knowledge engineering; Knowledge acquisition · CPC title

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title

  • using expert systems · CPC title

  • Hypervisor-specific management and integration aspects · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11599435B2 cover?
A failure analysis system identifies a root cause of a failure (or other health issue) in a virtualized computing environment and provides a recommendation for remediation. The failure analysis system uses a model-based reasoning (MBR) approach that involves building a model describing the relationships/dependencies of elements in the various layers of the virtualized computing environment, and…
Who is the assignee on this patent?
VMware Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/2257. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 7, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 6 related publications on this page (citations in our corpus or others sharing the same primary CPC).