System and method for self-healing a database server in a cluster

US10169138B2 · US · B2

Patent metadata
Publication number: US-10169138-B2
Application number: US-201615010502-A
Country: US
Kind code: B2
Filing date: Jan 29, 2016
Priority date: Sep 22, 2015
Publication date: Jan 1, 2019
Grant date: Jan 1, 2019

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A system and method for implementing a database system is presented. A database cluster can comprise multiple database servers. Each database server is configured to regularly compile various statistics upon the occurrence of a triggering event. These statistics can be stored along with the statistics of each database server in the cluster of database servers. Upon the occurrence of various conditions, corrective actions can be implemented. The conditions can include the inability to achieve performance thresholds. The conditions also can include not meeting the performance of other database servers in the cluster. The corrective action can include removing a server temporarily from the cluster or rebooting the server. In addition, a database server can cause the corrective action on other database servers in the cluster. Other embodiments also are disclosed.
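The two conditions named in the abstract (missing a performance threshold, and lagging behind the other servers in the cluster) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the patent: the `ServerStats` record, the `find_conditions` function, the numeric limits in `THRESHOLDS`, the `PEER_FACTOR` multiplier, and the use of a plain dictionary as the shared data store. Only a subset of the statistics is modeled.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class ServerStats:
    """Hypothetical record one server writes to the shared data store."""
    server_id: str
    memory_usage: float  # fraction of memory in use, 0.0 to 1.0
    cpu_load: float      # fraction of CPU in use, 0.0 to 1.0
    error_rate: float    # failed queries divided by total queries


# Assumed absolute limits; the patent does not give numeric values.
THRESHOLDS = {"memory_usage": 0.90, "cpu_load": 0.85, "error_rate": 0.05}
# Assumed multiplier: how far above the cluster average counts as lagging.
PEER_FACTOR = 2.0


def find_conditions(store: Dict[str, ServerStats]) -> Dict[str, List[str]]:
    """Map each unhealthy server id to the reasons it was flagged."""
    problems: Dict[str, List[str]] = {}
    for metric, limit in THRESHOLDS.items():
        # Cluster-wide average of this metric, used for the peer comparison.
        average = mean(getattr(stats, metric) for stats in store.values())
        for stats in store.values():
            value = getattr(stats, metric)
            if value > limit:
                problems.setdefault(stats.server_id, []).append(
                    f"{metric} over threshold")
            elif average > 0 and value > PEER_FACTOR * average:
                problems.setdefault(stats.server_id, []).append(
                    f"{metric} worse than peers")
    return problems
```

Calling `find_conditions` on a store in which one server reports 95% memory usage while its peers sit near 40% would flag only that server. A real deployment would still have to decide which corrective action follows from each reason.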

First claim

Opening claim text (preview).

What is claimed is:

1. A system comprising: a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, wherein each database server in the plurality of database servers is configured to: receive a triggering action comprising: receiving an indication that a minimum timer has expired; and receiving a pre-determined number of queries; detect a suspicious observation; discover that a particular server is underperforming; compile a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following: memory usage, disk activity levels, CPU load, and error rates; and store the plurality of statistics in a data store accessible by: (1) each database server in the plurality of database servers; and (2) a load balancer; and the load balancer configured to: allocate queries among the plurality of database servers using load balancing techniques; determine when a condition has occurred by: accessing the plurality of statistics in the data store; and determining that a malfunctioning database server of the plurality of database servers is malfunctioning, comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds; initiate an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and perform a corrective action on the malfunctioning database server comprising: determining that the malfunctioning database server cannot correct itself; writing an entry in the data store indicating that the malfunctioning database server is not available; causing the malfunctioning database server to no longer receive instructions; and forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.

2. The system of claim 1, wherein: determining that the malfunctioning database server of the plurality of database servers is malfunctioning comprises: comparing one or more of the plurality of statistics stored in the data store by the malfunctioning database server to an average of all of the plurality of statistics stored in the data store.

3. The system of claim 1, wherein: performing the corrective action comprises restarting the malfunctioning database server.

4. A method being implemented via execution of computing instructions configured to run at one or more processors and configured to be stored at non-transitory computer-readable media, the method comprising: in a plurality of database servers, each database server in the plurality of database servers hosting shards of a database, each shard of the shards of the database having been split from a partition of the database and each partition of the database having been split from the database, each database server in the plurality of database servers having a unique identifier such that a status of each database server in the plurality of database servers can be accessed by other servers in the plurality of database servers, performing acts of: receiving a triggering action comprising: receiving an indication that a minimum timer has expired; and receiving a pre-determined number of queries; detecting a suspicious observation; discovering that a particular server is underperforming; compiling a plurality of statistics regarding itself, wherein the plurality of statistics is chosen from one of the following: memory usage, disk activity levels, CPU load, and error rates; and storing the plurality of statistics in a data store accessible by: (1) each database server in the plurality of database servers; and (2) a load balancer; and in the load balancer configured to allocate queries among the plurality of database servers using load balancing techniques, performing acts of: determining when a condition has occurred by: accessing the plurality of statistics in the data store; and determining that a malfunctioning database server of the plurality of database servers is malfunctioning comprising determining when one or more of the plurality of statistics stored in the data store by the malfunctioning database server does not meet performance thresholds; initiating an automatic self-corrective action in a database server in the plurality of database servers, the automatic self-corrective action comprising the database server taking itself out of a rotation for a predetermined amount of time configured to allow the database server to catch up; and performing a corrective action on the database server comprising: determining that the malfunctioning database server cannot correct itself; writing an entry in the data store indicating that the malfunctioning database server is not available; causing the malfunctioning database server to no longer receive instructions; and forwarding shard-level queries originally directed to the malfunctioning database server to one or more other database servers of the plurality of database servers.

5. The method of claim 4, wherein: determining that the malfunctioning database server of the plurality of database servers is malfunctioning comprises: comparing one or more of the plurality of statistics stored in the data store by the malfunctioning database server to an average of all of the plurality of statistics stored in the data store.

6. The method of claim 4, wherein: performing the corrective action comprises restarting the malfunctioning database server.

7. A method comprising: sending a first incoming instruction to a database server selected from a first plurality of database servers or a second plurality of database servers, using load balancing techniques; retrieving server information for each database server in a cluster of database servers; processing the first incoming instruction to extract a first query in a database server belonging to a first server set and selected from the first plurality of database servers; sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from the second plurality of database servers; sending the first query from the database server belonging to the first server set and selected from the first plurality of database servers to a database server belonging to the first server set and selected from a third plurality of database servers; executing the first query in the database server belonging to the first server set and selected from the first plurality of database servers, the database server belonging to the first server set and selected from the second plurality of database servers, and the database server belonging to the first server set and selected from the third plurality of database servers; sending a second incoming instruction to a database server selected from the first plurality of database servers or the second plurality of database servers, using the load balancing techniques; processing the second in
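Two mechanisms recited in claims 1 and 4 can be sketched in a few lines: the triggering action (a minimum timer expiring and a pre-determined number of queries arriving) and the load balancer's corrective action (writing an unavailability entry to the data store and forwarding queries elsewhere). This is a hedged illustration, not the patented implementation. The class names `StatsTrigger` and `LoadBalancer`, the round-robin policy, the `available` key, and the dictionary standing in for the data store are all assumptions. The sketch reads the claim's "and" as requiring both trigger conditions, and it ignores which servers actually host a given shard, which a real forwarding step would have to respect.

```python
import time
from typing import Callable, Dict, List, Optional


class StatsTrigger:
    """Fires once the minimum timer has expired AND enough queries arrived."""

    def __init__(self, min_interval_s: float, min_queries: int,
                 clock: Callable[[], float] = time.monotonic):
        self.min_interval_s = min_interval_s
        self.min_queries = min_queries
        self.clock = clock
        self.last_fired = clock()
        self.queries_seen = 0

    def record_query(self) -> bool:
        """Count one query; return True when statistics should be compiled."""
        self.queries_seen += 1
        timer_expired = self.clock() - self.last_fired >= self.min_interval_s
        if timer_expired and self.queries_seen >= self.min_queries:
            # Reset both conditions so the next trigger starts from scratch.
            self.last_fired = self.clock()
            self.queries_seen = 0
            return True
        return False


class LoadBalancer:
    """Round-robin over servers not marked unavailable in the data store."""

    def __init__(self, server_ids: List[str], data_store: Dict[str, dict]):
        self.server_ids = server_ids
        self.data_store = data_store  # hypothetical shared store
        self.next_index = 0

    def mark_unavailable(self, server_id: str) -> None:
        # Write an entry so every reader of the store sees the status.
        self.data_store.setdefault(server_id, {})["available"] = False

    def route(self, preferred: Optional[str] = None) -> str:
        """Return the preferred server if available, else the next healthy one."""
        if preferred is not None and self._available(preferred):
            return preferred
        for _ in range(len(self.server_ids)):
            candidate = self.server_ids[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.server_ids)
            if self._available(candidate):
                return candidate
        raise RuntimeError("no database server available")

    def _available(self, server_id: str) -> bool:
        # Servers with no entry in the store are treated as available.
        return self.data_store.get(server_id, {}).get("available", True)
```

After `mark_unavailable("db-2")`, a call such as `route(preferred="db-2")` falls through to the round-robin loop and returns another server, which mirrors the forwarding step in the claims. The self-corrective step, where a server removes itself from rotation for a fixed period, is not modeled here.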

Assignees

Wal Mart Stores Inc, Walmart Apollo Llc

Inventors

Classifications

  • in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems · CPC title

  • Storage of error reports, e.g. persistent data storage, storage using memory protection · CPC title

  • Backup restoration techniques · CPC title

  • Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title

  • Real-time · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10169138B2 cover?
It covers a database cluster in which each database server compiles statistics about itself when a triggering event occurs and stores them alongside the statistics of the other servers in the cluster. When a server fails to meet performance thresholds or falls behind its peers, a corrective action is taken, such as temporarily removing the server from the cluster or rebooting it. A server can also cause the corrective action on other servers in the cluster.
Who is the assignee on this patent?
Wal Mart Stores Inc, Walmart Apollo Llc
What technology area does this patent fall under?
Primary CPC classification G06F11/0793. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 1, 2019 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).