Troubleshooting for a distributed storage system by cluster wide correlation analysis

US11714701B2 · US · B2

Patent metadata
Field: Value
Publication number: US-11714701-B2
Application number: US-202117539219-A
Country: US
Kind code: B2
Filing date: Dec 1, 2021
Priority date: Dec 1, 2021
Publication date: Aug 1, 2023
Grant date: Aug 1, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue in the distributed system, and executes workflows to remedy the particular issue.
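
The correlation step described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the patented implementation: the `Correlation` and `MetricRecord` types, the metric names, and the matching rule (same category, plus the same component or the same operation) are all hypothetical. The three tag fields mirror the category, component, and operation items that claim 5 lists as possible correlation information.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Correlation:
    """Hypothetical correlation tag attached to a metric record."""
    category: str   # category of the issue, e.g. "performance"
    component: str  # impacted component, e.g. "disk"
    operation: str  # impacted operation, e.g. "write"


@dataclass
class MetricRecord:
    """A stored metric sample plus its correlation tag (illustrative names)."""
    source: str     # a host name, or "cluster" for a cluster-level record
    metric: str
    value: float
    correlation: Correlation


def find_potential_causes(cluster_record: MetricRecord,
                          host_history: List[MetricRecord]) -> List[MetricRecord]:
    """Return stored host-level records whose tag matches the cluster-level tag.

    Assumed matching rule: same category, and the same component or operation.
    """
    target = cluster_record.correlation
    return [
        rec for rec in host_history
        if rec.correlation.category == target.category
        and (rec.correlation.component == target.component
             or rec.correlation.operation == target.operation)
    ]


# Made-up sample data for the demonstration.
host_history = [
    MetricRecord("host-1", "disk_latency_ms", 45.0,
                 Correlation("performance", "disk", "write")),
    MetricRecord("host-2", "nic_errors", 3.0,
                 Correlation("network", "nic", "resync")),
    MetricRecord("host-3", "disk_latency_ms", 52.0,
                 Correlation("performance", "disk", "read")),
]
cluster_record = MetricRecord("cluster", "write_latency_ms", 38.0,
                              Correlation("performance", "disk", "write"))

causes = find_potential_causes(cluster_record, host_history)
print([c.source for c in causes])  # prints ['host-1', 'host-3']
```

In this sketch the matched host-level records stand in for the "potential causes" that would then be mapped to troubleshooting workflows.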

First claim

Opening claim text (preview).

We claim:

1. A method to perform troubleshooting in a distributed system that includes a plurality of hosts arranged in a cluster, the method comprising: obtaining host-level metric information pertaining to health of a host in the cluster; based at least in part on the host-level metric information, obtaining cluster-level metric information pertaining to health of the cluster; using the cluster-level metric information to identify at least one troubleshooting workflow to execute to remedy a particular issue in the distributed system; and executing the at least one troubleshooting workflow.

2. The method of claim 1, wherein: the at least one troubleshooting workflow comprises a stored historical troubleshooting workflow that was previously used to solve a previous issue that correlates to the particular issue, and the at least one troubleshooting workflow further comprises a new troubleshooting workflow that is generated based at least in part on the host-level metric information or the cluster-level metric information.

3. The method of claim 1, wherein executing the at least one workflow includes: executing a first troubleshooting workflow; if the particular issue still exists after execution of the first troubleshooting workflow, then executing a second troubleshooting workflow; and if the particular issue is remedied after execution of the first troubleshooting workflow, then ending troubleshooting of the particular issue.

4. The method of claim 1, further comprising: receiving the host-level metric information, which is collected by a health monitoring agent, with first correlation information attached thereto; receiving the cluster-level metric information, which is generated by aggregating the host-level metric information, with second correlation information attached thereto; and storing the host-level metric information and the cluster-level metric information, having the first and second correlation information attached thereto, as historical information for use in identifying troubleshooting workflows for future troubleshooting.

5. The method of claim 4, wherein the first correlation information and the second correlation information each specify at least one of: a category of the particular issue, at least one component of the host that is impacted by the particular issue, and at least one operation of the host that is impacted by the particular issue.

6. The method of claim 1, wherein using the cluster-level metric information to identify the at least one troubleshooting workflow includes: using correlation information attached to the cluster-level metric information to search for matching correlation information; based on the matching correlation information, identifying at least one potential cause of the particular issue; and based on the identified at least one potential cause of the particular issue, identifying the at least one troubleshooting workflow to execute to remedy the particular issue.

7. The method of claim 1, wherein the distributed system comprises a distributed storage system provided by a virtualized computing environment.

8. A non-transitory computer-readable medium having instructions stored thereon, which in response to execution by one or more processors, cause the one or more processors to perform or control performance of a method to perform troubleshooting in a distributed system that includes a plurality of hosts arranged in a cluster, wherein the method comprises: determining first correlation information for host-level metric information pertaining to health of a host in the cluster, and storing the host-level metric information along with the first correlation information; determining second correlation information for cluster-level metric information pertaining to health of the cluster, wherein the cluster-level metric information is based at least in part on the host-level metric information; comparing the second correlation information with the stored first correlation information to determine at least one potential cause of a particular issue in the distributed system; and calling a troubleshooting workflow engine to troubleshoot the particular issue by identifying at least one troubleshooting workflow to execute to remedy the particular issue, wherein the at least one troubleshooting workflow corresponds to the at least one potential cause.

9. The non-transitory computer-readable medium of claim 8, wherein the method further comprises: storing the cluster-level metric information along with the second correlation information, wherein the host-level metric information and the cluster-level metric information stored along with their respective first and second correlation information are stored as historical data; and using the stored historical data to perform additional troubleshooting in the distributed system, by: using the historical data to identify at least another potential cause of another particular issue in the distributed system; and calling the troubleshooting workflow engine to troubleshoot the additional particular issue by identifying at least one additional troubleshooting workflow to execute to remedy the additional particular issue.

10. The non-transitory computer-readable medium of claim 8, wherein the first correlation information and the second correlation information each specify at least one of: a category of the particular issue, at least one component of the host that is impacted by the particular issue, and at least one operation of the host that is impacted by the particular issue.

11. The non-transitory computer-readable medium of claim 8, wherein comparing the second correlation information with the stored first correlation information to determine the at least one potential cause of the particular issue includes: using the second correlation information for the cluster-level metric information to search for matching correlation information; based on the matching correlation information, identifying the at least one potential cause of the particular issue; and providing the at least one potential cause to the troubleshooting workflow engine to enable the troubleshooting workflow engine to determine, based on the identified at least one potential cause of the particular issue, the at least one troubleshooting workflow to execute to remedy the particular issue.

12. The non-transitory computer-readable medium of claim 8, wherein the particular issue corresponds to a cluster-level performance issue that is visible to a user of the distributed system, wherein the host-level metric information corresponds to a host-level performance issue that is invisible to the user due to fault-tolerant capability of the distributed system, and wherein the cluster-level performance issue becomes visible to the user when the host-level performance issue begins to affect performance of the cluster.

13. The non-transitory computer-readable medium of claim 8, further comprising generating an alarm that provides a notification of the particular issue in the distributed system.

14. The non-transitory computer-readable medium of claim 8, wherein the method further includes: instructing execution of a first troubleshooting workflow provided by the troubleshooting workflow engine; if the particular issue still exists after execution of the first troubleshooting workflow, then instructing execution of a second troubleshooting workflow provided by the troubleshooting workflow engine; and if the particular issue is remedied after execution of the first troubleshooting workflow, then instructing an end to troubleshooting of the particular issue.
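
The execute-then-recheck sequence recited in claims 3 and 14 can be sketched as a simple loop. This is a minimal illustration, not the patent's workflow engine: the `run_workflows` helper, the workflow names, and the simulated `state` dictionary are hypothetical, and extending the claims' "first" and "second" workflows to an ordered list of any length is an assumption made here for brevity.

```python
from typing import Callable, List, Tuple

# A workflow is modelled as a (name, action) pair; both are illustrative.
Workflow = Tuple[str, Callable[[], None]]


def run_workflows(workflows: List[Workflow],
                  issue_exists: Callable[[], bool]) -> Tuple[List[str], bool]:
    """Execute workflows in order and stop once the issue is remedied.

    Returns the names of the workflows that actually ran and a flag that
    is True if the issue was remedied, False if every workflow was tried
    and the issue still exists.
    """
    executed: List[str] = []
    for name, action in workflows:
        action()
        executed.append(name)
        if not issue_exists():
            # Issue remedied: end troubleshooting, skip remaining workflows.
            return executed, True
    return executed, False


# Simulated cluster state for the demonstration.
state = {"issue": True}


def restart_agent() -> None:
    """Hypothetical first workflow that does not fix the simulated issue."""


def rebalance() -> None:
    """Hypothetical second workflow that clears the simulated issue."""
    state["issue"] = False


def replace_disk() -> None:
    """Hypothetical third workflow; not reached in this run."""
    state["issue"] = False


workflows = [
    ("restart_agent", restart_agent),
    ("rebalance", rebalance),
    ("replace_disk", replace_disk),
]
executed, remedied = run_workflows(workflows, lambda: state["issue"])
print(executed, remedied)  # prints ['restart_agent', 'rebalance'] True
```

The third workflow never runs because the check after the second one reports that the issue is gone, which corresponds to "ending troubleshooting of the particular issue" in claim 3.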

Assignees

Inventors

Classifications

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40)

  • Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

  • in relation to data integrity, e.g. data losses, bit errors

  • Command handling arrangements, e.g. command buffers, queues, command scheduling

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11714701B2 cover?
A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue i…
Who is the assignee on this patent?
VMware Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/0793. Mapped technology areas include Physics.
When was this patent published?
Publication date Aug 1, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).