Systems and methods for preventing split-brain scenarios in high-availability clusters

US10114713B1 · US · B1

Patent metadata
Publication number: US-10114713-B1
Application number: US-201615244092-A
Country: US
Kind code: B1
Filing date: Aug 23, 2016
Priority date: Jan 3, 2014
Publication date: Oct 30, 2018
Grant date: Oct 30, 2018

Abstract

Official abstract text for this publication.

A computer-implemented method for preventing split-brain scenarios in high-availability clusters may include (1) detecting, at a first node of a high-availability cluster, a partitioning event that isolates the first node from a second node of the high-availability cluster, (2) broadcasting, from a health-status server and after the partitioning event has occurred, a cluster-health message to the first node that includes at least a health status of the second node that is based on whether the health-status server received a node-health message from the second node, and (3) reacting, at the first node and based at least in part on whether the first node received the cluster-health message, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster. Various other methods, systems, and computer-readable media are also disclosed.
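Steps (2) and (3) of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `ClusterHealthMessage`, `build_cluster_health`, `react_to_partition`, and `Reaction` are hypothetical, and the `REMAIN_AS_IS` outcome for a peer that is reported healthy is an assumption, since the abstract does not say what the node does in that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class Reaction(Enum):
    LEAVE_CLUSTER = "leave_cluster"  # the isolated node removes itself
    TAKE_OVER = "take_over"          # the node assumes the peer's computing tasks
    REMAIN_AS_IS = "remain_as_is"    # assumption: not stated in the abstract


@dataclass(frozen=True)
class ClusterHealthMessage:
    # Maps node name -> True if the server received a node-health message from it.
    healthy: Dict[str, bool]


def build_cluster_health(members: Iterable[str],
                         heard_from: Iterable[str]) -> ClusterHealthMessage:
    """Step (2): the health-status server summarizes which nodes reported in."""
    reported = set(heard_from)
    return ClusterHealthMessage({node: node in reported for node in members})


def react_to_partition(message: Optional[ClusterHealthMessage],
                       peer: str) -> Reaction:
    """Step (3): the node reacts based on whether a message arrived at all."""
    if message is None:
        # No cluster-health message arrived, so this node cannot rule out
        # that it is the isolated one; it leaves the cluster.
        return Reaction.LEAVE_CLUSTER
    if not message.healthy.get(peer, False):
        # The server did not hear from the peer, so the peer is treated
        # as not healthy and its tasks are assumed.
        return Reaction.TAKE_OVER
    return Reaction.REMAIN_AS_IS
```

For example, `react_to_partition(None, peer="node-a")` yields `Reaction.LEAVE_CLUSTER`, while a message built with `heard_from=["node-b"]` leads node-b to take over from node-a. Because both outcomes hinge on a single third-party observer, the two sides of a partition cannot both conclude that they should be active.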

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: detecting, at a standby node of a high-availability cluster, a partitioning event that isolates the standby node from an active node of the high-availability cluster; after the partitioning event has occurred: broadcasting, from a health-status server, a cluster-health message to at least the standby node, wherein: the health-status server is separate and distinct from the standby node and the active node; the cluster-health message comprises at least a health status of the active node; the health status of the active node is based at least in part on whether the health-status server received a node-health message from the active node after the partitioning event occurred; reacting, at the standby node, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster by performing, based at least in part on whether the standby node received the cluster-health message from the health-status server, at least one of: leaving the high-availability cluster; assuming at least one computing task assigned to the active node.

2. The computer-implemented method of claim 1, wherein reacting to the partitioning event comprises: determining, at the standby node, that the standby node did not receive the cluster-health message from the health-status server; leaving, at the standby node and in response to not receiving the cluster-health message from the health-status server, the high-availability cluster.

3. The computer-implemented method of claim 1, wherein reacting to the partitioning event is further based at least in part on the health status of the active node indicated by the cluster-health message.

4. The computer-implemented method of claim 3, wherein reacting to the partitioning event comprises: determining, at the standby node, that the health status of the active node indicated by the cluster-health message indicates that the active node is not healthy; assuming, at the standby node and in response to the active node being not healthy, the at least one computing task assigned to the active node.

5. The computer-implemented method of claim 4, wherein determining that the health status of the active node indicated by the cluster-health message indicates that the active node is not healthy comprises determining that the health status of the active node indicated by the cluster-health message indicates that the health-status server did not receive a node-health message from the active node during a predetermined grace period after the partitioning event occurs.

6. The computer-implemented method of claim 1, wherein broadcasting the cluster-health message to the standby node comprises: receiving, at the health-status server, the node-health message from the active node, wherein the node-health message from the active node comprises health-status information about the active node; creating, at the health-status server, the cluster-health message such that it includes at least the health-status information about the active node; sending, from the health-status server, the cluster-health message to the standby node.

7. The computer-implemented method of claim 1, further comprising ensuring that the standby node reacts to the partitioning event by: receiving, via a user-space thread running on the standby node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a hardware module of the standby node that reboots the standby node after a predetermined time period has passed since the hardware module is last updated; rebooting, via the hardware module and in response to the predetermined time period having passed since the hardware module was last updated, the standby node.

8. The computer-implemented method of claim 1, further comprising ensuring that the standby node reacts to the partitioning event by: receiving, via a user-space thread running on the standby node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a kernel-space thread running on the standby node that reboots the standby node after a predetermined time period has passed since the kernel-space thread is last updated; rebooting, via the kernel-space thread and in response to the predetermined time period having passed since the kernel-space thread was last updated, the standby node.

9. The computer-implemented method of claim 1, further comprising ensuring that the standby node reacts to the partitioning event by: receiving, via a user-space thread running on the standby node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a kernel-space thread running on the standby node, wherein: the kernel-space thread updates a hardware module of the standby node in response to being updated by the user-space thread; the hardware module reboots the standby node after a predetermined time period has passed since the hardware module is last updated; updating, via the kernel-space thread and in response to being updated by the user-space thread, the hardware module; rebooting, via the hardware module and in response to the predetermined time period having passed since the hardware module was last updated, the standby node.

10. The computer-implemented method of claim 1, further comprising periodically sending, from each node of the high-availability cluster to the health-status server, an additional node-health message that indicates the health status of the node.

11. The computer-implemented method of claim 1, further comprising periodically broadcasting, from the health-status server, an additional cluster-health message to each node of the high-availability cluster, wherein the cluster-health message: is based on node-health messages received at the health-status server from nodes of the high-availability cluster; indicates a health status for each node of the high-availability cluster.

12. A system comprising: a detecting module that detects, at a standby node of a high-availability cluster, a partitioning event that isolates the standby node from an active node of the high-availability cluster; a broadcasting module that broadcasts, from a health-status server and after the partitioning event has occurred, a cluster-health message to at least the standby node, wherein: the health-status server is separate and distinct from the standby node and the active node; the cluster-health message comprises at least a health status of the active node; the health status of the active node is based at least in part on whether the health-status server received a node-health message from the active node after the partitioning event occurred; a reacting module that reacts, at the standby node and after the partitioning event has occurred, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster by performing, based at least in part on whether the standby node received the cluster-health message from the health-status server, at least one of: causing the standby node to leave the high-availability cluster; causing the standby node to assume at least one computing task assigned to the active node; at least one physical processor that executes the detecting module, the broadcasting m
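The update chain in claims 7 to 9 (user-space thread, then kernel-space thread, then a hardware module that reboots the node once it goes un-updated for too long) resembles a watchdog timer. The Python sketch below simulates that chain with explicit timestamps instead of real threads or hardware. `RebootTimer`, `KernelRelay`, and `on_cluster_health_message` are hypothetical names, and the 3-second timeout in the usage note is an arbitrary illustrative value, not one taken from the patent.

```python
class RebootTimer:
    """Stand-in for the hardware module: fires once it goes un-updated too long."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        self.timeout_s = timeout_s
        self.last_update = now
        self.rebooted = False

    def update(self, now: float) -> None:
        # Each update restarts the countdown.
        self.last_update = now

    def tick(self, now: float) -> bool:
        """Return True once the timeout has elapsed since the last update."""
        if now - self.last_update >= self.timeout_s:
            self.rebooted = True  # a real module would reset the machine here
        return self.rebooted


class KernelRelay:
    """Stand-in for the kernel-space thread of claim 9: forwards each update."""

    def __init__(self, hardware: RebootTimer) -> None:
        self.hardware = hardware

    def update(self, now: float) -> None:
        self.hardware.update(now)


def on_cluster_health_message(relay: KernelRelay, now: float) -> None:
    """User-space handler: any cluster-health message refreshes the chain."""
    relay.update(now)
```

With `RebootTimer(timeout_s=3.0)` and cluster-health messages handled at t=1.0 and t=2.0, `tick(now=4.0)` returns `False` and `tick(now=5.0)` returns `True`. The point of the chain is that a node which stops receiving cluster-health messages, or whose user-space thread stalls, is rebooted without depending on that same user-space code to notice the problem.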

Assignees

Juniper Networks Inc

Classifications

  • in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

  • H04L43/10 (Primary)

    Active monitoring, e.g. heartbeat, ping or trace-route · CPC title

  • switching over of hardware resources · CPC title

  • by checking functioning · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10114713B1 cover?
A computer-implemented method for preventing split-brain scenarios in high-availability clusters may include (1) detecting, at a first node of a high-availability cluster, a partitioning event that isolates the first node from a second node of the high-availability cluster, (2) broadcasting, from a health-status server and after the partitioning event has occurred, a cluster-health message to t…
Who is the assignee on this patent?
Juniper Networks Inc
What technology area does this patent fall under?
Primary CPC classification H04L43/10. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Oct 30, 2018 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 3 related publications on this page (citations in our corpus or others sharing the same primary CPC).