Self-healing system for distributed services and applications

US10528427B1 · US · B1

Patent metadata
Publication number: US-10528427-B1
Application number: US-201615177841-A
Country: US
Kind code: B1
Filing date: Jun 9, 2016
Priority date: Jun 9, 2016
Publication date: Jan 7, 2020
Grant date: Jan 7, 2020

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A self-healing system configured to automatically restore non-responsive or failed applications to a normal operating state. A self-healing system may restart an application after confirming that the application itself has failed, and not an underlying dependency. The self-healing system may also evaluate a server hosting an application reported as being non-responsive to determine whether that server has itself failed. If an application is non-responsive or has failed on an otherwise healthy host, and the dependent services used by the application are available, the self-healing system automatically restores the application to a responsive state. To do so, the self-healing system may generate a run list specifying a sequence of scripts invoked to restore the application to the responsive state.
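The decision flow in the abstract can be sketched as a small triage function. This is an illustrative sketch only, not the patented implementation: the names `Observation`, `Action`, and `triage` are hypothetical, and the three boolean inputs are assumed to come from health checks that are not shown here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of a triage pass (names are assumptions)."""
    NONE = "none"
    REPORT_HOST_FAILURE = "report_host_failure"
    RESTORE_DEPENDENCIES = "restore_dependencies"
    RESTART_APPLICATION = "restart_application"


@dataclass
class Observation:
    """Results of health checks assumed to have been run elsewhere."""
    app_responsive: bool
    host_responsive: bool
    dependencies_available: bool


def triage(obs: Observation) -> Action:
    """Pick an action by ruling out host and dependency failures first."""
    if obs.app_responsive:
        return Action.NONE
    if not obs.host_responsive:
        # The host itself has failed; restarting the application will not help.
        return Action.REPORT_HOST_FAILURE
    if not obs.dependencies_available:
        # The fault lies in a dependency, not in the application itself.
        return Action.RESTORE_DEPENDENCIES
    # Healthy host, available dependencies: the application itself has failed.
    return Action.RESTART_APPLICATION
```

Checking the host and the dependencies before restarting reflects the abstract's point that a restart is only useful once the application itself is confirmed as the failing component.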

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for restoring a non-responsive application, the computer-implemented method comprising: monitoring a plurality of servers hosting a distributed application for unresponsiveness; performing a health check on an instance of the distributed application hosted on a first server of the plurality of servers, by attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining that the first server is responsive by: establishing a shell connection with the first server; and receiving a response from the first server via the shell connection; and automatically initiating a restoration process, the restoration process including: identifying a cause of the malfunctioning state of the instance of the distributed application; generating, based on the cause of the malfunctioning state, an upstream profile identifying servers hosting upstream dependent computing services required by the instance of the distributed application; generating a downstream profile identifying servers hosting downstream dependent computing services which rely on the instance of the distributed application; removing artifacts associated with the instance of the distributed application, wherein the artifacts include at least an open transaction of the instance of the distributed application; killing processes associated with the instance of the distributed application; restarting the servers identified in the upstream profile thereby restoring availability of the upstream dependent computing services; restarting the instance of the distributed application; and restarting the servers identified in the downstream profile thereby restoring availability of downstream dependent computing services.

2. The computer-implemented method of claim 1, further comprising starting, stopping, or restarting one of: the instance of the distributed application, the downstream dependent computing services, and the upstream dependent computing services.

3. The computer-implemented method of claim 1, further comprising: confirming at least one upstream dependent computing service of the upstream dependent computing services is available on a second server of the plurality of servers; and restarting the instance of the distributed application on the first server.

4. The computer-implemented method of claim 1, further comprising: stopping, on a third server of the plurality of servers, at least one downstream dependent computing service of the downstream dependent computing services; and restarting the at least one downstream dependent computing service on the third server after restoring the instance of the distributed application to a responsive state on the first server.

5. The computer-implemented method of claim 1, further comprising determining whether the upstream dependent computing services are available.

6. The computer-implemented method of claim 5, wherein, upon determining that the upstream dependent computing services are not available, scripts are invoked to: stop the instance of the distributed application on the first server; restart the upstream dependent computing services; and start the instance of the distributed application on the first server.

7. The computer-implemented method of claim 1, further comprising: determining that the first server is not available; and generating a message indicating the first server has become non-responsive.

8. The computer-implemented method of claim 1, wherein a first one of the downstream dependent computing services comprises one of a web server and a database.

9. The computer-implemented method of claim 1, wherein the first server comprises an instance of a virtual machine (VM) hosted on a cloud computing platform.

10. The computer-implemented method of claim 1, further comprising, confirming neither the distributed application nor the first server has been placed in a maintenance mode state.

11. A computer-implemented method for restoring a non-responsive application, the method comprising: determining a health status of an instance of a distributed application hosted on a first server, wherein determining the health status of the instance of the distributed application includes attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining a health status for at least a first upstream dependent computing service required by the instance of the distributed application, wherein the first upstream dependent computing service is hosted on a second server; upon determining the health status of the first upstream dependent computing service indicates a non-responsive status, determining a health status for the first server and the second server by: establishing a shell connection with each of the first server and the second server; and receiving a response from the first server and the second server via the shell connection; identifying a cause of the non-responsive status; and upon determining the health status of the first server and the second server indicates the first server and the second server are available, initiating a restoration process to restore the instance of the distributed application to a responsive state wherein the restoration process comprises: generating, based on the cause of the non-responsive state, an upstream profile identifying servers hosting upstream dependent computing services required by the instance of the distributed application; removing artifacts associated with the instance of the distributed application, wherein the artifacts include at least an open transaction of the instance of the distributed application; killing processes associated with the instance of the distributed application; restarting the servers identified in the upstream profile thereby restoring availability of the upstream dependent computing services; and restarting the first server thereby restoring availability of the distributed application.

12. A non-transitory computer-readable storage medium storing instructions, which when executed on a processor, perform an operation for restoring a non-responsive application, the operation comprising: monitoring a plurality of servers hosting a distributed application for unresponsiveness; performing a health check on an instance of the distributed application hosted on a first server of the plurality of servers, by attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining that the first server is responsive by: establishing a shell connection with the first server; and receiving a response from the first server via the shell connection; and automatically initiating a restoration process, the restoration process including: identifying a cause of the malfunctioning state of the instance of the distributed application; generating,
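The ordering of the restoration steps recited in claim 1 can be sketched as a run-list builder. This is a minimal sketch under stated assumptions: the function `build_run_list`, the step labels (`remove_artifacts`, `kill_processes`, `restart_server`, `restart_instance`), and the server names in the usage note are all hypothetical, and the upstream and downstream profiles are modeled simply as lists of server names.

```python
from typing import List


def build_run_list(
    app: str,
    host: str,
    upstream_servers: List[str],
    downstream_servers: List[str],
) -> List[str]:
    """Return restoration steps in the order recited in claim 1.

    Each entry is a hypothetical script invocation, shown as a string.
    """
    target = f"{app}@{host}"
    steps = [
        # Clear leftovers such as open transactions before stopping processes.
        f"remove_artifacts {target}",
        f"kill_processes {target}",
    ]
    # Upstream dependencies come back first, since the application needs them.
    steps += [f"restart_server {server}" for server in upstream_servers]
    steps.append(f"restart_instance {target}")
    # Downstream services rely on the application, so they restart last.
    steps += [f"restart_server {server}" for server in downstream_servers]
    return steps
```

For example, `build_run_list("billing", "app-01", ["db-01"], ["web-01"])` yields five steps: artifact removal, process kill, the `db-01` restart, the instance restart, and finally the `web-01` restart.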

Assignees

Intuit Inc

Inventors

Classifications

  • for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection (management of faults, events, alarms or notifications in data switching networks H04L41/06)

  • Profiles

  • Fully automatic configuration

  • by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

  • Real-time

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US10528427B1 cover?
A self-healing system configured to automatically restore non-responsive or failed applications to a normal operating state. A self-healing system may restart an application after confirming that the application itself has failed—and not an underlying dependency failure. The self-healing system may also evaluate a server hosting an application reported as being non-responsive to determine wheth…
Who is the assignee on this patent?
Intuit Inc
What technology area does this patent fall under?
Primary CPC classification H04L41/0886. Mapped technology areas include Electricity.
When was this patent published?
Publication date: Jan 7, 2020 (kind code B1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).