Systems and Methods for Enhancing the Availability of Multi-Tier Applications on Cloud Computing Platforms
US-2015378743-A1 · Dec 31, 2015 · US
US10528427B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10528427-B1 |
| Application number | US-201615177841-A |
| Country | US |
| Kind code | B1 |
| Filing date | Jun 9, 2016 |
| Priority date | Jun 9, 2016 |
| Publication date | Jan 7, 2020 |
| Grant date | Jan 7, 2020 |
Official abstract text for this publication.
A self-healing system configured to automatically restore non-responsive or failed applications to a normal operating state. A self-healing system may restart an application after confirming that the application itself has failed, rather than an underlying dependency. The self-healing system may also evaluate a server hosting an application reported as being non-responsive to determine whether that server has itself failed. If an application is non-responsive or has failed on an otherwise healthy host, and the dependent services used by the application are available, the self-healing system automatically restores the application to a responsive state. To do so, the self-healing system may generate a run list specifying a sequence of scripts invoked to restore the application to the responsive state.
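The decision logic summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Observation` type, the `build_run_list` function, and every script name in the returned lists are hypothetical names chosen for illustration. The branch for unavailable dependencies follows the stop, restart, start order described in claim 6, and the unresponsive-host branch follows the notification described in claim 7.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """Hypothetical snapshot of what the monitor has learned so far."""
    app_responsive: bool
    host_responsive: bool
    dependencies_available: bool


def build_run_list(obs: Observation) -> List[str]:
    """Return an ordered list of (hypothetical) script names to invoke.

    An empty list means no restoration is warranted.
    """
    if obs.app_responsive:
        # Nothing to heal.
        return []
    if not obs.host_responsive:
        # The host itself has failed: report it rather than restart the app.
        return ["notify_host_down"]
    if not obs.dependencies_available:
        # The dependency is the problem: restore it around the application.
        return ["stop_app", "restart_dependencies", "start_app"]
    # Healthy host, healthy dependencies: the application itself has failed.
    return ["remove_artifacts", "kill_app_processes", "start_app"]
```

For example, `build_run_list(Observation(False, True, True))` yields the application-only restart sequence, while an unresponsive host yields only a notification step.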
Opening claim text (preview).
What is claimed is:
1. A computer-implemented method for restoring a non-responsive application, the computer-implemented method comprising: monitoring a plurality of servers hosting a distributed application for unresponsiveness; performing a health check on an instance of the distributed application hosted on a first server of the plurality of servers, by attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining that the first server is responsive by: establishing a shell connection with the first server; and receiving a response from the first server via the shell connection; and automatically initiating a restoration process, the restoration process including: identifying a cause of the malfunctioning state of the instance of the distributed application; generating, based on the cause of the malfunctioning state, an upstream profile identifying servers hosting upstream dependent computing services required by the instance of the distributed application; generating a downstream profile identifying servers hosting downstream dependent computing services which rely on the instance of the distributed application; removing artifacts associated with the instance of the distributed application, wherein the artifacts include at least an open transaction of the instance of the distributed application; killing processes associated with the instance of the distributed application; restarting the servers identified in the upstream profile thereby restoring availability of the upstream dependent computing services; restarting the instance of the distributed application; and restarting the servers identified in the downstream profile thereby restoring availability of downstream dependent computing services.
2. The computer-implemented method of claim 1, further comprising starting, stopping, or restarting one of: the instance of the distributed application, the downstream dependent computing services, and the upstream dependent computing services.
3. The computer-implemented method of claim 1, further comprising: confirming at least one upstream dependent computing service of the upstream dependent computing services is available on a second server of the plurality of servers; and restarting the instance of the distributed application on the first server.
4. The computer-implemented method of claim 1, further comprising: stopping, on a third server of the plurality of servers, at least one downstream dependent computing service of the downstream dependent computing services; and restarting the at least one downstream dependent computing service on the third server after restoring the instance of the distributed application to a responsive state on the first server.
5. The computer-implemented method of claim 1, further comprising determining whether the upstream dependent computing services are available.
6. The computer-implemented method of claim 5, wherein, upon determining that the upstream dependent computing services are not available, scripts are invoked to: stop the instance of the distributed application on the first server; restart the upstream dependent computing services; and start the instance of the distributed application on the first server.
7. The computer-implemented method of claim 1, further comprising: determining that the first server is not available; and generating a message indicating the first server has become non-responsive.
8. The computer-implemented method of claim 1, wherein a first one of the downstream dependent computing services comprises one of a web server and a database.
9. The computer-implemented method of claim 1, wherein the first server comprises an instance of a virtual machine (VM) hosted on a cloud computing platform.
10. The computer-implemented method of claim 1, further comprising, confirming neither the distributed application nor the first server has been placed in a maintenance mode state.
11. A computer-implemented method for restoring a non-responsive application, the method comprising: determining a health status of an instance of a distributed application hosted on a first server, wherein determining the health status of the instance of the distributed application includes attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining a health status for at least a first upstream dependent computing service required by the instance of the distributed application, wherein the first upstream dependent computing service is hosted on a second server; upon determining the health status of the first upstream dependent computing service indicates a non-responsive status, determining a health status for the first server and the second server by: establishing a shell connection with each of the first server and the second server; and receiving a response from the first server and the second server via the shell connection; identifying a cause of the non-responsive status; and upon determining the health status of the first server and the second server indicates the first server and the second server are available, initiating a restoration process to restore the instance of the distributed application to a responsive state wherein the restoration process comprises: generating, based on the cause of the non-responsive state, an upstream profile identifying servers hosting upstream dependent computing services required by the instance of the distributed application; removing artifacts associated with the instance of the distributed application, wherein the artifacts include at least an open transaction of the instance of the distributed application; killing processes associated with the instance of the distributed application; restarting the servers identified in the upstream profile thereby restoring availability of the upstream dependent computing services; and restarting the first server thereby restoring availability of the distributed application.
12. A non-transitory computer-readable storage medium storing instructions, which when executed on a processor, perform an operation for restoring a non-responsive application, the operation comprising: monitoring a plurality of servers hosting a distributed application for unresponsiveness; performing a health check on an instance of the distributed application hosted on a first server of the plurality of servers, by attempting a test transaction with the instance of the distributed application; determining, based on an analysis of the test transaction, that the instance of the distributed application is in a malfunctioning state, wherein the malfunctioning state indicates the instance of the distributed application is experiencing latency above a threshold; determining that the first server is responsive by: establishing a shell connection with the first server; and receiving a response from the first server via the shell connection; and automatically initiating a restoration process, the restoration process including: identifying a cause of the malfunctioning state of the instance of the distributed application; generating,
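As a rough illustration of claim 1, the Python sketch below times a test transaction against a latency threshold and then lays out the restoration steps in the order the claim recites them. The function names, the step labels, and the choice to treat a raised exception as a malfunction are assumptions made for this example; the claims do not specify them, and the shell-connection check on the host is omitted.

```python
import time
from typing import Callable, List, Sequence, Tuple


def is_malfunctioning(test_transaction: Callable[[], object],
                      latency_threshold_s: float) -> bool:
    """Run a test transaction and flag latency above the threshold.

    Assumption: a transaction that raises is also treated as malfunctioning.
    """
    start = time.monotonic()
    try:
        test_transaction()
    except Exception:
        return True
    return (time.monotonic() - start) > latency_threshold_s


def restoration_plan(instance: str,
                     upstream_servers: Sequence[str],
                     downstream_servers: Sequence[str]) -> List[Tuple[str, str]]:
    """Order the steps as in claim 1: clean up, upstream, instance, downstream."""
    plan: List[Tuple[str, str]] = [
        ("remove_artifacts", instance),   # e.g. open transactions
        ("kill_processes", instance),
    ]
    plan += [("restart_server", s) for s in upstream_servers]
    plan.append(("restart_instance", instance))
    plan += [("restart_server", s) for s in downstream_servers]
    return plan
```

Restarting upstream servers before the instance, and downstream servers after it, means each tier comes back only once the services it depends on are available again.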
for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection (management of faults, events, alarms or notifications in data switching networks H04L41/06) · CPC title
Profiles · CPC title
Fully automatic configuration · CPC title
by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure · CPC title
Real-time · CPC title