Automating the production of runbook workflows
US-9891971-B1 · Feb 13, 2018 · US
US10282248B1 · US · B1
| Field | Value |
|---|---|
| Publication number | US-10282248-B1 |
| Application number | US-201816201471-A |
| Country | US |
| Kind code | B1 |
| Filing date | Nov 27, 2018 |
| Priority date | Nov 27, 2018 |
| Publication date | May 7, 2019 |
| Grant date | May 7, 2019 |
Official abstract text for this publication.
Disclosed are hardware and techniques for correcting computer process faults by identifying risk associated with correcting a computer process fault and computer processes that may depend on the corrected computer process. The interdependent computer processes in a network may be determined by evaluating a stream of process break flags from a monitoring component coupled to the network. Each computer process break flag in the stream of computer process break flags indicates a process fault detected by the monitoring component and is correlated to a corrective response. The break flag and the corrective response are assigned a risk. A risk matrix accounts for interdependencies between computer processes and identified corrective actions. A final response strategy that corrects the computer process faults is determined using the assigned risk and computer system interdependence. A runbook stores the final response strategy, which may be updated based on changing computer process interdependencies and assigned risk.
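The abstract's idea of combining break risk, fix risk, and process interdependence into one response decision can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `RiskMatrix` layout, the 0.0 to 1.0 value ranges, the process and action names, and the scoring rule (expected benefit minus an interdependency-weighted penalty) are hypothetical and are not taken from the patent, which does not disclose a specific formula here.

```python
from dataclasses import dataclass, field


@dataclass
class RiskMatrix:
    """Hypothetical in-memory form of a risk assessment matrix for one process."""
    process: str
    break_risk: float  # assumed scale: likelihood the breakdown occurs, 0.0 to 1.0
    fix_risk: dict = field(default_factory=dict)         # action -> likelihood it fixes the fault
    interdependency: dict = field(default_factory=dict)  # other process -> dependence on `process`


def score_action(matrix, action, disrupted):
    """Assumed rule: expected benefit minus the dependence of each disrupted process."""
    benefit = matrix.break_risk * matrix.fix_risk[action]
    penalty = sum(matrix.interdependency.get(p, 0.0) for p in disrupted)
    return benefit - penalty


def pick_optimal(matrix, disruptions):
    """Return the highest-scoring action; `disruptions` maps action -> processes it disturbs."""
    return max(
        matrix.fix_risk,
        key=lambda a: score_action(matrix, a, disruptions.get(a, ())),
    )


# Illustrative values only.
matrix = RiskMatrix(
    process="auth-service",
    break_risk=0.8,
    fix_risk={"restart": 0.9, "failover": 0.7},
    interdependency={"checkout": 0.6, "reporting": 0.1},
)
disruptions = {"restart": ["checkout"], "failover": ["reporting"]}
optimal = pick_optimal(matrix, disruptions)
```

In this example the restart is more likely to fix the fault, but it disturbs a heavily dependent process, so the assumed rule prefers the failover. A real rule set could weigh these factors quite differently.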
Opening claim text (preview).
What is claimed is:
1. An apparatus, comprising: a memory storing programming code; and a triage processing component, coupled to the memory and, via a communication interface, to a monitoring component that monitors operation of computer implemented processes of a network, operable to execute the stored programming code, that when executed causes the triage processing component to perform functions, including functions to: receive, from the monitoring component, a first process break event indicating a symptom of a potential operational breakdown of a computer implemented process; evaluate the received first process break event for a correlation to a possible cause of the potential operational breakdown of the computer process; based on the correlation to the possible cause of the potential operational breakdown of the computer process, identify possible corrective actions that can be implemented to fix the computer implemented process to prevent the potential operational breakdown; assign a break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer implemented process; assign a respective fix risk assessment value to each of the identified possible corrective actions; populate a risk assessment matrix with the assigned break risk assessment value and the fix risk assessment value assigned to each of the identified possible corrective actions, wherein the risk assessment matrix has elements representing the computer implemented process, a plurality of other computer implemented processes, and an interdependency rating that quantifies a level of interdependence of each of the plurality of the other computer implemented processes on the computer implemented process; access a runbook including a plurality of corrective actions that correct potential operational breakdowns of computer implemented processes of the network; obtain a list of corrective actions correlated to the first process break event from the runbook; and modify the list of corrective actions based on a rule set applied to the risk assessment matrix, wherein the modified list of corrective actions includes at least one of the identified possible corrective actions as an optimal corrective action.
2. The apparatus of claim 1, wherein: the assigned break event risk assessment value has a range from a value indicating the potential operational breakdown has a high likelihood of occurring to a value indicating the potential operation breakdown has a low likelihood of occurring; and the respective fix risk assessment value assigned to each of the identified possible corrective action has a range from a value indicating the potential operational breakdown has a high likelihood of being fixed to a value indicating the potential operation breakdown has a low likelihood of being fixed by the respective identified possible corrective action.
3. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions when modifying the list of corrective actions in the runbook, including functions to: assign an interdependency rating to each of the possible corrective actions in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of each of the computer implemented processes that may be affected by application of each of the possible corrective actions in the list of corrective actions; populate the risk assessment matrix with the interdependency rating of each of the possible corrective actions in the list of corrective actions; evaluate the risk assessment matrix, based on the assigned interdependency rating of each of the possible corrective action in the list of corrective actions to one another; and in response to the evaluation of the risk assessment matrix, flag a respective corrective action from the list of corrective actions as the optimal corrective action.
4. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions prior to the runbook being modified, including functions to: identify interdependency risk patterns in the risk assessment matrix populated with the assigned break risk assessment value and the fix risk assessment value assigned for each of the identified corrective actions, wherein the identified interdependency risk patterns indicate risks related to procedures in the runbook and effects of implementing procedures on the computer implemented processes in the network; and generate, based on the identified interdependency risk patterns, a response strategy incorporating at least one of the procedures from the list of corrective actions.
5. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions, including functions to: receive an additional process break event indicating an additional symptom of another or the same potential operational breakdown of the computer implemented process.
6. The apparatus of claim 5, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions, including functions to: update correlations to possible causes of the potential operational breakdown of the computer implemented processes by analyzing the received additional process break event in conjunction with the first process break event; based on the updated correlations, update the list of corrective actions; and generate updated break risk assessment values for the potential operational breakdown of the computer implemented process and updated fix risk assessment values for each corrective action in the updated list of corrective actions.
7. The apparatus of claim 6, wherein the triage processing component is coupled to receive one or more process break events from multiple monitoring circuits that monitor computer implemented processes in the network; and the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: receive subsequent process break events from one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received subsequent break events, break risk assessment values and fix risk assessment values; populate the risk assessment matrix using the generated break risk assessment values and fix risk assessment values; identify one procedure in a revised list of procedures for implementing one corrective action to fix the potential operational breakdowns indicated by the subsequent break events; and modify the runbook to include the identified one procedure as the procedure to implement when the potential operational breakdown requires fixing.
8. The apparatus of claim 7, wherein the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: produce a copy of the populated risk assessment matrix; receive successive process break events that follow the subsequent process break events from the one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received successive process break events, break risk assessment values and fix risk assessment values of the successive process break events; populate the copy of the risk assessment matrix using the generated break risk assessment values and fix risk assessment values to produce a revised risk assessment matrix; analyze the break risk assessment values and the fix risk
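The claim text above recites two runbook-related steps that a short sketch may make more concrete: modifying the list of corrective actions obtained from the runbook by applying a rule set to the risk assessment matrix, and populating a copy of the matrix with values generated from later break events. The sketch below is only a rough illustration under stated assumptions. The dictionary layouts, the event and action names, the `min_fix` threshold, and the rule itself (drop actions with a low likelihood of fixing the fault, then order the rest) are all hypothetical; the claims do not specify the contents of the rule set.

```python
import copy

# Hypothetical runbook: break event name -> ordered list of corrective actions.
runbook = {"db-timeout": ["restart-db", "increase-pool", "failover-replica"]}

# Hypothetical populated matrix for the process behind the "db-timeout" event.
matrix = {
    "break_risk": 0.4,
    "fix_risk": {"restart-db": 0.6, "increase-pool": 0.3, "failover-replica": 0.8},
}


def modify_actions(actions, fix_risk, min_fix=0.5):
    """Assumed rule set: keep actions likely to fix the fault, most likely first."""
    kept = [a for a in actions if fix_risk.get(a, 0.0) >= min_fix]
    return sorted(kept, key=lambda a: fix_risk[a], reverse=True)


def revise_matrix(original, new_values):
    """Populate a copy of the matrix with newly generated values; the original is untouched."""
    revised = copy.deepcopy(original)
    revised["break_risk"] = new_values["break_risk"]
    revised["fix_risk"].update(new_values["fix_risk"])
    return revised


# First pass: the list obtained from the runbook, modified by the assumed rule.
first_list = modify_actions(runbook["db-timeout"], matrix["fix_risk"])

# Later break events raise the break risk and change one fix risk value.
revised = revise_matrix(matrix, {"break_risk": 0.7, "fix_risk": {"increase-pool": 0.9}})

# The runbook entry is rewritten from the revised matrix; its first item is the preferred action.
runbook["db-timeout"] = modify_actions(runbook["db-timeout"], revised["fix_risk"])
```

Working on a deep copy keeps the earlier matrix available for comparison with the revised one, which matches the claim's distinction between the populated matrix and its revised copy.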
Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40) · CPC title
Storage of error reports, e.g. persistent data storage, storage using memory protection · CPC title
Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title
Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06) · CPC title