What technology area does this patent fall under?

Primary CPC classification G06F11/0793. Mapped technology areas include Physics.

When was this patent published?

Publication date Aug 1, 2023 (B2). Legal status and post-grant events are not shown on this page.

What related patents are in patentsdb?

We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).

Troubleshooting for a distributed storage system by cluster wide correlation analysis

US11714701B2 · US · B2

Patent metadata
Field	Value
Publication number	US-11714701-B2
Application number	US-202117539219-A
Country	US
Kind code	B2
Filing date	Dec 1, 2021
Priority date	Dec 1, 2021
Publication date	Aug 1, 2023
Grant date	Aug 1, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

Title
What the patent document calls the invention.
Abstract
A short plain-language summary of the technical disclosure.
Assignees and inventors
Who owns or filed the patent and who is credited as inventor.
Key dates
Filing, priority, publication, and grant dates set the timeline.
First independent claim
The legal scope of protection — read this for what is actually claimed.
CPC / IPC classifications
Technology tags used to group this patent with similar filings.
Citations and related patents
Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue in the distributed system, and executes workflows to remedy the particular issue.
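
As an illustration only, the cluster-wide correlation step summarized above can be sketched as a lookup of stored host-level records whose correlation information matches that of a cluster-level issue. This is a minimal sketch, not the patented implementation: the three correlation fields follow the categories named in claims 5 and 10 (issue category, impacted component, impacted operation), while the class names, the `cause` label, the sample hosts, and the exact-match rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Correlation:
    """Correlation information; fields follow claims 5 and 10."""
    category: str   # category of the issue
    component: str  # impacted host component
    operation: str  # impacted host operation


@dataclass
class HostRecord:
    """Stored host-level metric with correlation info (hypothetical layout)."""
    host: str
    metric: str
    value: float
    correlation: Correlation
    cause: str  # hypothetical human-readable cause label


def potential_causes(cluster_corr: Correlation,
                     history: List[HostRecord]) -> List[Tuple[str, str]]:
    """Return (host, cause) pairs whose stored correlation info matches.

    Exact equality is an assumption; the patent only says "matching".
    """
    return [(r.host, r.cause) for r in history if r.correlation == cluster_corr]


# Hypothetical sample data.
history = [
    HostRecord("host-1", "disk_latency_ms", 48.0,
               Correlation("performance", "disk", "write"), "degraded disk"),
    HostRecord("host-2", "packet_drop_pct", 0.1,
               Correlation("network", "nic", "replication"), "flaky NIC"),
]
causes = potential_causes(Correlation("performance", "disk", "write"), history)
```

In this sketch a cluster-level write-performance issue is traced back to the one host whose stored record carries the same correlation information.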

First claim

Opening claim text (preview).

We claim:

1. A method to perform troubleshooting in a distributed system that includes a plurality of hosts arranged in a cluster, the method comprising: obtaining host-level metric information pertaining to health of a host in the cluster; based at least in part on the host-level metric information, obtaining cluster-level metric information pertaining to health of the cluster; using the cluster-level metric information to identify at least one troubleshooting workflow to execute to remedy a particular issue in the distributed system; and executing the at least one troubleshooting workflow.

2. The method of claim 1, wherein: the at least one troubleshooting workflow comprises a stored historical troubleshooting workflow that was previously used to solve a previous issue that correlates to the particular issue, and the at least one troubleshooting workflow further comprises a new troubleshooting workflow that is generated based at least in part on the host-level metric information or the cluster-level metric information.

3. The method of claim 1, wherein executing the at least one workflow includes: executing a first troubleshooting workflow; if the particular issue still exists after execution of the first troubleshooting workflow, then executing a second troubleshooting workflow; and if the particular issue is remedied after execution of the first troubleshooting workflow, then ending troubleshooting of the particular issue.

4. The method of claim 1, further comprising: receiving the host-level metric information, which is collected by a health monitoring agent, with first correlation information attached thereto; receiving the cluster-level metric information, which is generated by aggregating the host-level metric information, with second correlation information attached thereto; and storing the host-level metric information and the cluster-level metric information, having the first and second correlation information attached thereto, as historical information for use in identifying troubleshooting workflows for future troubleshooting.

5. The method of claim 4, wherein the first correlation information and the second correlation information each specify at least one of: a category of the particular issue, at least one component of the host that is impacted by the particular issue, and at least one operation of the host that is impacted by the particular issue.

6. The method of claim 1, wherein using the cluster-level metric information to identify the at least one troubleshooting workflow includes: using correlation information attached to the cluster-level metric information to search for matching correlation information; based on the matching correlation information, identifying at least one potential cause of the particular issue; and based on the identified at least one potential cause of the particular issue, identifying the at least one troubleshooting workflow to execute to remedy the particular issue.

7. The method of claim 1, wherein the distributed system comprises a distributed storage system provided by a virtualized computing environment.

8. A non-transitory computer-readable medium having instructions stored thereon, which in response to execution by one or more processors, cause the one or more processors to perform or control performance of a method to perform troubleshooting in a distributed system that includes a plurality of hosts arranged in a cluster, wherein the method comprises: determining first correlation information for host-level metric information pertaining to health of a host in the cluster, and storing the host-level metric information along with the first correlation information; determining second correlation information for cluster-level metric information pertaining to health of the cluster, wherein the cluster-level metric information is based at least in part on the host-level metric information; comparing the second correlation information with the stored first correlation information to determine at least one potential cause of a particular issue in the distributed system; and calling a troubleshooting workflow engine to troubleshoot the particular issue by identifying at least one troubleshooting workflow to execute to remedy the particular issue, wherein the at least one troubleshooting workflow corresponds to the at least one potential cause.

9. The non-transitory computer-readable medium of claim 8, wherein the method further comprises: storing the cluster-level metric information along with the second correlation information, wherein the host-level metric information and the cluster-level metric information stored along with their respective first and second correlation information are stored as historical data; and using the stored historical data to perform additional troubleshooting in the distributed system, by: using the historical data to identify at least another potential cause of another particular issue in the distributed system; and calling the troubleshooting workflow engine to troubleshoot the additional particular issue by identifying at least one additional troubleshooting workflow to execute to remedy the additional particular issue.

10. The non-transitory computer-readable medium of claim 8, wherein the first correlation information and the second correlation information each specify at least one of: a category of the particular issue, at least one component of the host that is impacted by the particular issue, and at least one operation of the host that is impacted by the particular issue.

11. The non-transitory computer-readable medium of claim 8, wherein comparing the second correlation information with the stored first correlation information to determine the at least one potential cause of the particular issue includes: using the second correlation information for the cluster-level metric information to search for matching correlation information; based on the matching correlation information, identifying the at least one potential cause of the particular issue; and providing the at least one potential cause to the troubleshooting workflow engine to enable the troubleshooting workflow engine to determine, based on the identified at least one potential cause of the particular issue, the at least one troubleshooting workflow to execute to remedy the particular issue.

12. The non-transitory computer-readable medium of claim 8, wherein the particular issue corresponds to a cluster-level performance issue that is visible to a user of the distributed system, wherein the host-level metric information corresponds to a host-level performance issue that is invisible to the user due to fault-tolerant capability of the distributed system, and wherein the cluster-level performance issue becomes visible to the user when the host-level performance issue begins to affect performance of the cluster.

13. The non-transitory computer-readable medium of claim 8, further comprising generating an alarm that provides a notification of the particular issue in the distributed system.

14. The non-transitory computer-readable medium of claim 8, wherein the method further includes: instructing execution of a first troubleshooting workflow provided by the troubleshooting workflow engine; if the particular issue still exists after execution of the first troubleshooting workflow, then instructing execution of a second troubleshooting workflow provided by the troubleshooting workflow engine; and if the particular issue is remedied after execution of the first troubleshooting workflow, then instructing an end to troubleshooting of the particular issue.
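
For readers who find the claim language dense, the try-then-fall-back execution recited in claims 3 and 14 can be sketched roughly as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the claims recite a first and a second workflow, whereas this sketch generalizes to an ordered list, and the function name `run_workflows`, the `issue_exists` check, and the sample workflows are all hypothetical.

```python
from typing import Callable, Iterable, List

Workflow = Callable[[], None]


def run_workflows(workflows: Iterable[Workflow],
                  issue_exists: Callable[[], bool]) -> bool:
    """Execute workflows in order until the issue no longer exists.

    Returns True if the issue was remedied, False if every workflow ran
    and the issue still exists.
    """
    for workflow in workflows:
        workflow()
        if not issue_exists():
            return True  # remedied: end troubleshooting
    return False


# Hypothetical usage: the second workflow clears the simulated issue,
# so the third one is never executed.
state = {"broken": True}
log: List[str] = []


def rebalance() -> None:
    log.append("rebalance")


def restart_service() -> None:
    log.append("restart_service")
    state["broken"] = False


def replace_disk() -> None:
    log.append("replace_disk")


remedied = run_workflows([rebalance, restart_service, replace_disk],
                         lambda: state["broken"])
```

Checking the issue after each workflow, rather than running all of them, matches the claim wording that troubleshooting ends once the issue is remedied.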

Assignees

VMware Inc

Inventors

Classifications

G06F11/0793 · Primary
Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title
G06F3/067
Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS] · CPC title
G06F3/0619
in relation to data integrity, e.g. data losses, bit errors · CPC title
G06F3/0659
Command handling arrangements, e.g. command buffers, queues, command scheduling · CPC title
G06F11/0709
in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

Patent family

Related publications grouped by family.

View patent family 86500175

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11714701B2 cover?: A troubleshooting technique provides faster and more efficient troubleshooting of issues in a distributed system, such as a distributed storage system provided by a virtualized computing environment. The distributed system includes a plurality of hosts arranged in a cluster. The troubleshooting technique uses cluster-wide correlation analysis to identify potential causes of a particular issue i…
Who is the assignee on this patent?: VMware Inc

Related patents

Method and system for performing intelligent orchestration within a hybrid cloud

Server failure predictive model

Event correlation in cloud computing
