Systems and methods for preventing split-brain scenarios in high-availability clusters

US9450852B1 · US · B1

Patent metadata
Field               Value
Publication number  US-9450852-B1
Application number  US-201414146804-A
Country             US
Kind code           B1
Filing date         Jan 3, 2014
Priority date       Jan 3, 2014
Publication date    Sep 20, 2016
Grant date          Sep 20, 2016

Abstract

Official abstract text for this publication.

A computer-implemented method for preventing split-brain scenarios in high-availability clusters may include (1) detecting, at a first node of a high-availability cluster, a partitioning event that isolates the first node from a second node of the high-availability cluster, (2) broadcasting, from a health-status server and after the partitioning event has occurred, a cluster-health message to the first node that includes at least a health status of the second node that is based on whether the health-status server received a node-health message from the second node, and (3) reacting, at the first node and based at least in part on whether the first node received the cluster-health message, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster. Various other methods, systems, and computer-readable media are also disclosed.
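The three steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class and function names (`HealthStatusServer`, `ClusterHealthMessage`, `react_to_partition`), the timeout parameter, and the use of plain timestamps are all assumptions made for illustration. The server marks a node healthy only if it received a node-health message from that node recently, and the isolated node reacts based only on whether a cluster-health message reached it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class ClusterHealthMessage:
    # Hypothetical payload: node name -> True if the server heard from it recently.
    node_health: Dict[str, bool]


class HealthStatusServer:
    """Hypothetical server, separate from the cluster nodes it reports on."""

    def __init__(self, nodes: Iterable[str], timeout_s: float):
        self.timeout_s = timeout_s  # assumed freshness window, not from the patent
        self.last_seen: Dict[str, Optional[float]] = {n: None for n in nodes}

    def on_node_health(self, node: str, now: float) -> None:
        # Record the arrival time of a node-health message.
        self.last_seen[node] = now

    def build_cluster_health(self, now: float) -> ClusterHealthMessage:
        # A node counts as healthy only if its last message is recent enough.
        return ClusterHealthMessage({
            name: seen is not None and now - seen <= self.timeout_s
            for name, seen in self.last_seen.items()
        })


def react_to_partition(msg: Optional[ClusterHealthMessage]) -> str:
    """Reaction of the isolated node after it detects a partitioning event.

    No cluster-health message received -> yield its tasks to the other node.
    Message received -> keep performing its tasks.
    """
    return "yield" if msg is None else "continue"


if __name__ == "__main__":
    server = HealthStatusServer(["active", "standby"], timeout_s=3.0)
    server.on_node_health("active", now=10.0)
    server.on_node_health("standby", now=5.0)
    msg = server.build_cluster_health(now=11.0)
    print(msg.node_health, react_to_partition(msg), react_to_partition(None))
```

The sketch decides only on whether a message arrived. A fuller policy could also inspect the other node's reported health status before deciding, which the claims describe as an optional refinement.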

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method comprising: detecting, at an active node of a high-availability cluster, a partitioning event that isolates the active node from a standby node of the high-availability cluster; after the partitioning event has occurred: broadcasting, from a health-status server, a cluster-health message to at least the active node, wherein: the health-status server is separate and distinct from the active node and the standby node; the cluster-health message comprises at least a health status of the standby node; the health status of the standby node is based at least in part on whether the health-status server received a node-health message from the standby node after the partitioning event occurred; reacting, at the active node, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster by performing, based at least in part on whether the active node received the cluster-health message from the health-status server, at least one of: yielding, at the active node and in response to not receiving the cluster-health message from the health-status server, at least one computing task assigned to the active node to the standby node; continuing to perform, at the active node and in response to receiving the cluster-health message from the health-status server, the at least one computing task assigned to the active node.

2. The computer-implemented method of claim 1, wherein reacting to the partitioning event comprises: determining, at the active node, that the active node did not receive the cluster-health message from the health-status server; yielding, at the active node and in response to not receiving the cluster-health message from the health-status server, the at least one computing task assigned to the active node to the standby node.

3. The computer-implemented method of claim 1, wherein reacting to the partitioning event comprises: determining, at the active node, that the active node received the cluster-health message from the health-status server; continuing to perform, at the active node and in response to receiving the cluster-health message from the health-status server, the at least one computing task assigned to the active node.

4. The computer-implemented method of claim 1, wherein reacting to the partitioning event is further based at least in part on the health status of the standby node indicated by the cluster-health message.

5. The computer-implemented method of claim 1, wherein broadcasting the cluster-health message to the active node comprises: receiving, at the health-status server, the node-health message from the standby node, wherein the node-health message from the standby node comprises health-status information about the standby node; creating, at the health-status server, the cluster-health message such that it includes at least the health-status information about the standby node; sending, from the health-status server, the cluster-health message to the active node.

6. The computer-implemented method of claim 1, further comprising ensuring that the active node reacts to the partitioning event by: receiving, via a user-space thread running on the active node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a hardware module of the active node that reboots the active node after a predetermined time period has passed since the hardware module is last updated; rebooting, via the hardware module and in response to the predetermined time period having passed since the hardware module was last updated, the active node.

7. The computer-implemented method of claim 1, further comprising ensuring that the active node reacts to the partitioning event by: receiving, via a user-space thread running on the active node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a kernel-space thread running on the active node that reboots the active node after a predetermined time period has passed since the kernel-space thread is last updated; rebooting, via the kernel-space thread and in response to the predetermined time period having passed since the kernel-space thread was last updated, the active node.

8. The computer-implemented method of claim 1, further comprising ensuring that the active node reacts to the partitioning event by: receiving, via a user-space thread running on the active node, any cluster-health message from the health-status server; updating, via the user-space thread and in response to receiving any cluster-health message from the health-status server, a kernel-space thread running on the active node, wherein: the kernel-space thread updates a hardware module of the active node in response to being updated by the user-space thread; the hardware module reboots the active node after a predetermined time period has passed since the hardware module is last updated; updating, via the kernel-space thread and in response to being updated by the user-space thread, the hardware module; rebooting, via the hardware module and in response to the predetermined time period having passed since the hardware module was last updated, the active node.

9. The computer-implemented method of claim 1, further comprising periodically sending, from each node of the high-availability cluster to the health-status server, an additional node-health message that indicates the health status of the node.

10. The computer-implemented method of claim 1, further comprising periodically broadcasting, from the health-status server, an additional cluster-health message to each node of the high-availability cluster, wherein the cluster-health message: is based on node-health messages received at the health-status server from nodes of the high-availability cluster; indicates a health status for each node of the high-availability cluster.

11. A system comprising: a detecting module that detects, at an active node of a high-availability cluster, a partitioning event that isolates the active node from a standby node of the high-availability cluster; a broadcasting module that broadcasts, from a health-status server and after the partitioning event has occurred, a cluster-health message to at least the active node, wherein: the health-status server is separate and distinct from the active node and the standby node; the cluster-health message comprises at least a health status of the standby node; the health status of the standby node is based at least in part on whether the health-status server received a node-health message from the standby node after the partitioning event occurred; a reacting module that reacts, at the active node and after the partitioning event has occurred, to the partitioning event such that the partitioning event does not result in a split-brain scenario within the high-availability cluster by performing, based at least in part on whether the active node received the cluster-health message from the health-status server, at least one of: causing, in response to not receiving the cluster-health message from the health-status server, the active node to yield at least one computing task assigned to the active node to the standby node; causing, in response to receiving the cluster-health message from the health-status server, the active node to continue to perform the at least one computing task assigned to the active node; at least one physical process
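The reboot-on-silence mechanism described in the dependent claims (a timer that is refreshed whenever a cluster-health message arrives and reboots the node once a predetermined period passes without a refresh) follows the general watchdog-timer pattern. The sketch below is a pure-software stand-in, assuming a hypothetical `Watchdog` class, an injectable clock, and an `on_expire` callback in place of a real reboot. It does not model the hardware module or the kernel-space thread themselves.

```python
import time
from typing import Callable


class Watchdog:
    """Software stand-in for the hardware module or kernel-space thread."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s      # the "predetermined time period"
        self.on_expire = on_expire      # stands in for rebooting the node
        self.clock = clock              # injectable so the logic can be tested
        self.last_update = clock()
        self.expired = False

    def update(self) -> None:
        """Called by the receiving thread for every cluster-health message."""
        self.last_update = self.clock()

    def poll(self) -> bool:
        """Fire on_expire once if the timeout has passed since the last update."""
        if not self.expired and self.clock() - self.last_update >= self.timeout_s:
            self.expired = True
            self.on_expire()
        return self.expired


if __name__ == "__main__":
    now = [0.0]                         # fake clock for a deterministic demo
    wd = Watchdog(5.0, lambda: print("reboot"), clock=lambda: now[0])
    now[0] = 4.0
    wd.update()                         # a cluster-health message arrived
    now[0] = 9.0
    wd.poll()                           # 5 s of silence: the callback fires
```

Keeping the timer outside the thread that receives messages is the point of the pattern: if that thread stalls or stops receiving messages, nothing refreshes the timer and the node is still forced to react.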

Assignees

Juniper Networks Inc

Inventors

Classifications

  • H04L43/10 · Primary

    Active monitoring, e.g. heartbeat, ping or trace-route · CPC title

  • by checking functioning · CPC title

  • switching over of hardware resources · CPC title

  • in which an application is distributed across nodes in the network (software deployment G06F8/60; multiprogramming arrangements G06F9/46) · CPC title

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9450852B1 cover?
A computer-implemented method for preventing split-brain scenarios in high-availability clusters may include (1) detecting, at a first node of a high-availability cluster, a partitioning event that isolates the first node from a second node of the high-availability cluster, (2) broadcasting, from a health-status server and after the partitioning event has occurred, a cluster-health message to t…
Who is the assignee on this patent?
Juniper Networks Inc
What technology area does this patent fall under?
Primary CPC classification H04L43/10. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Sep 20, 2016 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 4 related publications on this page (citations in our corpus or others sharing the same primary CPC).