Hypervisor remedial action for a virtual machine in response to an error message from the virtual machine
US-2016216992-A1 · Jul 28, 2016 · US
US10152382B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-10152382-B2 |
| Application number | US-201615239612-A |
| Country | US |
| Kind code | B2 |
| Filing date | Aug 17, 2016 |
| Priority date | Oct 26, 2015 |
| Publication date | Dec 11, 2018 |
| Grant date | Dec 11, 2018 |
Official abstract text for this publication.
A method and system for monitoring a virtual machine cluster, the method comprising: sending, by a first physical machine, a virtual machine state parameter query instruction to a virtual machine in the virtual machine cluster at a first preset time interval; sending, by the virtual machine, response information to the first physical machine in response to receiving the query instruction; determining, by the first physical machine, that the virtual machine is faulty if the response information does not arrive within a second preset time, judging whether the faulty virtual machine satisfies a preset restart condition, and, if it does, sending a virtual machine restart instruction to a second physical machine on which the faulty virtual machine runs; and restarting, by the second physical machine, the faulty virtual machine according to the restart instruction. The disclosure can be used to monitor virtual machines and recover a faulty virtual machine, thereby improving the availability of the virtual machine cluster and shortening service intervals.
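The query-and-timeout detection described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the `Monitor` class, its method names, and the choice to measure the timeout from the oldest unanswered query are all assumptions made for illustration. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VmStatus:
    """Per-VM bookkeeping kept by the monitoring machine (hypothetical)."""
    last_query_at: Optional[float] = None   # time the most recent query was sent
    pending_since: Optional[float] = None   # time of the oldest unanswered query


class Monitor:
    """Plays the role of the first physical machine (illustrative only)."""

    def __init__(self, query_interval: float, response_timeout: float) -> None:
        self.query_interval = query_interval      # "first preset time interval"
        self.response_timeout = response_timeout  # "second preset time"
        self.vms: Dict[str, VmStatus] = {}

    def register(self, vm_id: str) -> None:
        self.vms[vm_id] = VmStatus()

    def due_for_query(self, now: float) -> List[str]:
        """VMs that have never been queried or whose interval has elapsed."""
        return [
            vm_id for vm_id, s in self.vms.items()
            if s.last_query_at is None
            or now - s.last_query_at >= self.query_interval
        ]

    def record_query(self, vm_id: str, now: float) -> None:
        s = self.vms[vm_id]
        s.last_query_at = now
        if s.pending_since is None:
            # Start the timeout clock at the first unanswered query.
            s.pending_since = now

    def record_response(self, vm_id: str, now: float) -> None:
        # Any response clears the outstanding-query clock.
        self.vms[vm_id].pending_since = None

    def faulty(self, now: float) -> List[str]:
        """VMs whose oldest unanswered query is older than the timeout."""
        return [
            vm_id for vm_id, s in self.vms.items()
            if s.pending_since is not None
            and now - s.pending_since > self.response_timeout
        ]
```

In this sketch a VM that keeps answering is never reported, while a VM that stops answering is reported once its oldest unanswered query is older than `response_timeout`. Whether the patent measures the timeout from the first or the latest query is not stated in the text shown here.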
Opening claim text (preview).
What is claimed is:

1. A method for monitoring a virtual machine cluster, comprising:
   sending, by a first physical machine, a virtual machine state parameter query instruction to a virtual machine in the virtual machine cluster at a first preset time interval;
   sending, by the virtual machine, response information to the first physical machine in response to receiving the query instruction;
   determining, by the first physical machine, that the virtual machine is faulty, in response to the response information beyond a second preset time, judging, by the first physical machine, whether the faulty virtual machine satisfies a preset restart condition, and sending, by the first physical machine, a virtual machine restart instruction to a second physical machine on which the faulty virtual machine runs, if the faulty virtual machine satisfies the preset restart condition;
   restarting, by the second physical machine, the faulty virtual machine according to the virtual machine restart instruction;
   sending, by the second physical machine, a restart response signal to the first physical machine, when restarting the faulty virtual machine;
   obtaining, by the first physical machine, an address of the faulty virtual machine from pre-recorded meta-information of virtual machines in response to receiving the restart response signal, connecting, by the first physical machine, to the restarted virtual machine according to the address, and sending, by the first physical machine, a first service process restart signal to the restarted virtual machine; and
   starting, by the restarted virtual machine, a service process of the restarted virtual machine according to the first service process restart signal.

2. The method according to claim 1, wherein the sending of a virtual machine restart instruction to a second physical machine on which the faulty virtual machine runs, if the faulty virtual machine satisfies the preset restart condition, comprises:
   sending the virtual machine restart instruction to the second physical machine, if a ratio of the faulty virtual machines is smaller than a preset ratio; or
   sending the virtual machine restart instruction to the second physical machine, if an interval from a preceding virtual machine restart or reconstruction of the faulty virtual machine exceeds a third preset time.

3. The method according to claim 1, further comprising:
   determining, by the first physical machine, a restart failure of the faulty virtual machine in response to the restart response signal being not received within a preset time after sending the virtual machine restart instruction, and sending, by the first physical machine, a virtual machine reconstruction instruction to a third physical machine in response to times of the restart failure reaching preset times, wherein the third physical machine is a physical machine, except for the second physical machine, in a host physical machine cluster of the virtual machine cluster; and
   reconstructing, by the third physical machine, the faulty virtual machine according to the virtual machine reconstruction instruction.

4. The method according to claim 3, further comprising:
   sending, by the third physical machine, a reconstruction response signal to the first physical machine;
   obtaining, by the first physical machine, meta-information of the faulty virtual machine from the meta-information of the virtual machines in response to receiving the reconstruction response signal, and sending, by the first physical machine, a node recovery instruction to the reconstructed virtual machine according to the obtained meta-information; and
   downloading, by the reconstructed virtual machine, previously backed-up incremental data associated with a previous management node from a remote storage according to the node recovery instruction, if it is determined that the reconstructed virtual machine is a management node according to the node recovery instruction;
   recovering, by the reconstructed virtual machine, metadata of the reconstructed management node based on the incremental data;
   accepting, by the reconstructed virtual machine, a registration of a computing node in the virtual machine cluster; and
   registering, by the reconstructed virtual machine, to the management node in the virtual machine cluster according to the node recovery instruction, if it is determined that the reconstructed virtual machine is a computing node according to the node recovery instruction.

5. The method according to claim 4, further comprising:
   determining reconstruction success and sending a reconstruction success indication signal, by the reconstructed management node, to the first physical machine in response to a ratio of computing nodes in the virtual machine cluster registered within a preset time being larger than or equal to a preset ratio, and sending, by the reconstructed management node, a reconstruction failure indication alarm signal to the first physical machine in response to the ratio of computing nodes in the virtual machine cluster registered within the preset time being smaller than the preset ratio; and
   submitting, by the first physical machine, a received user job to the reconstructed management node according to the reconstruction success indication signal, and displaying, by the first physical machine, an alarm prompt according to the reconstruction failure indication alarm signal.

6. The method according to claim 5, further comprising: executing the following operations by using the first physical machine:
   determining whether the faulty virtual machines comprise a management node and whether a ratio of faulty computing nodes exceeds a threshold according to the meta-information of the virtual machines;
   determining that the virtual machine cluster is faulty, in response to determining that the faulty virtual machines comprises a management node or the ratio the faulty computing nodes exceeds the threshold;
   continuing to receive user jobs, and stopping submitting user jobs to the management node in the virtual machine cluster, in response to the virtual machine cluster being faulty;
   judging whether the restarted or reconstructed virtual machines comprise a management node and whether the ratio of the faulty computing nodes exceeds the threshold, in response to the response information being from the restarted or reconstructed virtual machines;
   determining that the virtual machine cluster is recovered from a fault, in response to determining the restarted or reconstructed virtual machines comprising a management node and the ratio of the faulty computing nodes not exceeding the threshold;
   continuing to submit jobs to the management node in the virtual machine cluster, in response to the virtual machine cluster being recovered from the fault, determining whether a job running before the fault of the virtual machine cluster is incomplete according to job state information queried from the management node, if yes, submitting a next job, and if not, submitting the incomplete job, wherein the job state information is obtained by the management node according to a job log of the computing node; and
   continuing to receive user jobs and submitting the user jobs to the management node in the virtual machine cluster, in response to determining that the faulty virtual machines comprise no management node and the ratio of the faulty computing nodes does not exceed the threshold.

7. The method according to claim 6, further comprising:
   periodically backing up, by the management node in the virtual machine cluster, incremental operation logs into the remote storage; and
   periodically merging, by the remote storage, the backed-up operation logs and deleting, by the remote storage, the operation logs prior to a merging time.

8. The method
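
The decision rules in claims 2, 3, and 6 (when a restart is allowed, when to escalate from restart to reconstruction, and when the cluster as a whole counts as faulty) can be sketched as a few small Python functions. This is an illustrative reading of the claim language, not the patented implementation: the `Policy` fields, the function names, the convention of passing `float("inf")` for a virtual machine with no earlier restart, and the rule of picking the first alternative host are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    """Tunable thresholds; the names are hypothetical, not from the patent."""
    max_faulty_ratio: float     # "preset ratio" in claim 2
    min_recovery_gap: float     # "third preset time" in claim 2, in seconds
    max_restart_failures: int   # "preset times" in claim 3
    cluster_fault_ratio: float  # "threshold" in claim 6


def may_restart(faulty: int, total: int,
                seconds_since_last_recovery: float, policy: Policy) -> bool:
    """Claim 2: either condition alone is enough to send a restart.

    Pass float("inf") for a VM that was never restarted or reconstructed
    (an assumption; the claim does not cover that case).
    """
    ratio_ok = total > 0 and faulty / total < policy.max_faulty_ratio
    gap_ok = seconds_since_last_recovery > policy.min_recovery_gap
    return ratio_ok or gap_ok


def next_action(restart_failures: int, policy: Policy) -> str:
    """Claim 3: reconstruct once restart failures reach the preset count."""
    if restart_failures >= policy.max_restart_failures:
        return "reconstruct"
    return "restart"


def pick_reconstruction_host(hosts: Sequence[str],
                             current_host: str) -> Optional[str]:
    """Claim 3: any host in the cluster except the one the VM ran on.

    Taking the first candidate is an arbitrary choice made for this sketch.
    """
    for host in hosts:
        if host != current_host:
            return host
    return None


def cluster_is_faulty(management_node_faulty: bool, faulty_compute: int,
                      total_compute: int, policy: Policy) -> bool:
    """Claim 6: faulty if a management node is down or too many
    computing nodes are down."""
    if management_node_faulty:
        return True
    return (total_compute > 0
            and faulty_compute / total_compute > policy.cluster_fault_ratio)
```

Keeping each rule as a pure function of counts and thresholds makes the conditions easy to test in isolation. A real monitor would also have to track the restart response timeout from claim 3 to decide when a restart attempt counts as failed.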
by exceeding a time limit, i.e. time-out, e.g. watchdogs · CPC title
Restarting or rejuvenating · CPC title
without idle spare hardware · CPC title
involving virtual machines · CPC title
Virtual · CPC title