Predictive batch job failure detection and remediation

US2023018199A1 · US · A1

Patent metadata
Field: Value
Publication number: US-2023018199-A1
Application number: US-202117379583-A
Country: US
Kind code: A1
Filing date: Jul 19, 2021
Priority date: Jul 19, 2021
Publication date: Jan 19, 2023
Grant date: (none listed)

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

Systems, methods, and computer programming products for predicting, preventing and remediating failures of batch jobs being executed and/or queued for processing at future scheduled time. Batch job parameters, messages and system logs are stored in knowledge bases and/or inputted into AI models for analysis. Using predictive analytics and/or machine learning, batch job failures are predicted before the failures occur. Mappings of processes used by each batch job, historical data from previous batch jobs, and data identifying the success or failure thereof build an archive that can be refined over time through active learning feedback and AI modeling to predictively recommend actions that have historically prevented or remediated failures. Recommended actions are reported to the system administrator or automatically applied. As job failures occur over time, mappings of the current system log to logs for the unsuccessful batch jobs help root cause analysis become simpler and more automated.

First claim

Opening claim text (preview).

What is claimed is:

1. A computer-implemented method for alleviating a batch job failure comprising: creating, by a processor, a knowledge base including an archive of failed batch job histories comprising time series data of workflow logs, messages and invoked processes associated with failed batch jobs; generating, by the processor, a table of processes mapping processes invoked by batch jobs to the workflow logs and the messages associated with the batch jobs; monitoring, by the processor, the messages, the workflow logs and process-level information of the batch jobs being executed; matching, by the processor, error messages or process failures of a current system log to error messages and the invoked processes of the batch job histories contained in the knowledge base; displaying, by the processor, a root cause analysis of the error messages or process failures of a current system log and a recommended remediation action for alleviating the batch job failure; and updating, by the processor, the knowledge base with feedback comprising results of applying the recommended remediation action.

2. The computer-implemented method of claim 1, further comprising: scanning, by the processor, the table of processes for a potential batch job failure within a batch job queue, wherein the potential job failure is recognized as a queued batch job scheduled to invoke a same failed process as the process failures of the current system log; proactively flagging, by the processor, the potential batch job failure of the batch job queue; and transmitting, by the processor, a notification advising one or more actions to prevent the potential batch job failure from occurring.

3. The computer-implemented method of claim 2, wherein the one or more actions for preventing the potential batch job failure of selected from the group consisting of terminating batches anticipated to fail, restarting the batches anticipated to fail, placing the execution of the batches anticipated to fail on hold, and fixing the failed process.

4. The computer-implemented method of claim 1, further comprising: creating, by the processor, a second knowledge base including an archive of successful batch job histories including the time series data of the workflow logs, the messages and the invoked processes associated with successful batch jobs; detecting, by the processor, an anomaly as a function of comparing the workflow logs, the messages and the process level information of currently processing batch jobs to the archive of successful batch job histories and the archive of failed batch job histories on a per-process basis; flagging, by the processor, the anomaly at a process level; and transmitting, by the processor, a notification describing the anomaly at the process level.

5. The computer-implemented method of claim 4, further comprising: scanning, by the processor, the table of processes for queued batch jobs scheduled to invoke processes affected by the anomaly; flagging, by the processor, the queued batch jobs scheduled to invoke the processes affected by the anomaly as predicted batch job failures; and transmitting, by the processor, a notification describing the predicted batch job failures.

6. The computer-implemented method of claim 1, further comprising: generating, by the processor, time series data from tasks of the batch jobs being executed, capturing code paths invoked by the tasks; identifying, by the processor, invoked processes of the batch jobs being executed based on the code paths invoked by the tasks; training, by the processor, an RNN/LSTM model to predict using the time series data, expected process invocations for each of the batch jobs being executed; scanning, by the processor, the table of processes for the expected process invocations identified by the RNN/LSTM model for the batch jobs being executed and batch jobs present in a batch job queue; and flagging, by the processor, predicted batch job failures from the batch jobs being executed and the batch jobs in batch job queue, as a function of a combination of the expected process invocations and the archive of failed batch job histories.

7. The computer-implemented method of claim 6, further comprising: classifying, by the processor, workloads based on time series data from tasks of the batch jobs being executed into workloads that exhibit a linear invocation of processes per task and workloads that exhibit a varying invocation of processes, wherein predicting using the time series data, the expected process invocations for each of the batch jobs being executed workloads is only applied to the workloads that exhibit a varying invocation of processes.

8. A computer program product comprising: one or more computer readable storage media having computer-readable program instructions stored on the one or more computer readable storage media, said program instructions executes a computer-implemented method comprising: creating a knowledge base including an archive of failed batch job histories comprising time series data of workflow logs, messages and invoked processes associated with failed batch jobs; generating a table of processes mapping processes invoked by batch jobs to the workflow logs and messages associated with the batch jobs; monitoring the messages, the workflow logs and process-level information of the batch jobs being executed; matching error messages or process failures of a current system log to error messages and the invoked processes of the batch job histories contained in the knowledge base; displaying a root cause analysis of the error messages or process failures of the current system log and a recommended remediation action for alleviating the batch job failure; and updating the knowledge base with feedback comprising results of applying the recommended remediation action.

9. The computer program product of claim 8, further comprising: scanning the table of processes for a potential batch job failure within a batch job queue, wherein the potential job failure is recognized as a queued batch job scheduled to invoke a same failed process as the process failures of the current system log; proactively flagging the potential batch job failure of the batch job queue; and transmitting a notification advising one or more actions to prevent the potential batch job failure from occurring.

10. The computer program product of claim 9, wherein the one or more actions for preventing the potential batch job failure of selected from the group consisting of terminating batches anticipated to fail, restarting the batches anticipated to fail, placing the execution of the batches anticipated to fail on hold, and fixing the failed process.

11. The computer program product of claim 8 further comprising: creating a second knowledge base including an archive of successful batch job histories including time series data of the workflow logs, the messages and the invoked processes associated with successful batch jobs; detecting an anomaly as a function of comparing the workflow logs, the messages and the process level information of currently processing batch jobs to the archive of successful batch job histories and the archive of failed batch job histories on a per-process basis; flagging the anomaly at a process level; and transmitting a notification describing the anomaly at the process level.

12. The computer program product of claim 11 further comprising: scanning the table of processes for queued batch jobs scheduled to invoke processes affected by the anomaly; flagging the queued batch jobs scheduled to invoke the processes affected by the ano
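
The queue-scanning step of claims 2 and 9 (flagging queued jobs that are scheduled to invoke a process that has already failed) can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed method: the table here maps processes directly to job names rather than to workflow logs and messages, and `build_process_table`, `flag_queued_jobs`, and the sample job names are hypothetical. The `Action` values paraphrase the options listed in claims 3 and 10.

```python
from enum import Enum


class Action(Enum):
    """Preventive actions paraphrased from claims 3 and 10."""
    TERMINATE = "terminate the batch"
    RESTART = "restart the batch"
    HOLD = "place the batch on hold"
    FIX_PROCESS = "fix the failed process"


def build_process_table(jobs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert a job -> invoked-processes mapping into process -> jobs."""
    table: dict[str, set[str]] = {}
    for job, processes in jobs.items():
        for process in processes:
            table.setdefault(process, set()).add(job)
    return table


def flag_queued_jobs(table: dict[str, set[str]], queue: list[str],
                     failed_processes: set[str]) -> list[str]:
    """Return queued jobs, in queue order, that invoke any failed process."""
    at_risk: set[str] = set()
    for process in failed_processes:
        at_risk |= table.get(process, set())
    return [job for job in queue if job in at_risk]


# Illustrative data only; job and process names are made up.
jobs = {
    "nightly_billing": ["extract", "db_sync"],
    "report_gen": ["render"],
    "archive": ["db_sync", "compress"],
}
table = build_process_table(jobs)
queue = ["report_gen", "archive", "nightly_billing"]

for job in flag_queued_jobs(table, queue, failed_processes={"db_sync"}):
    # Which action to advise is a policy decision; HOLD is an arbitrary choice here.
    print(f"Potential failure: {job} -> suggested action: {Action.HOLD.value}")
```

Inverting the mapping once means each scan costs one lookup per failed process rather than a pass over every job's process list.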

Assignees

Kyndryl Inc

Inventors

Classifications

  • Storage of error reports, e.g. persistent data storage, storage using memory protection · CPC title

  • Dumping, i.e. gathering error/state information after a fault for later diagnosis · CPC title

  • where the computing system component is a software system · CPC title

  • Root cause analysis, i.e. error or fault diagnosis (in a hardware test environment G06F11/22; in a software test environment G06F11/36) · CPC title

  • Recurrent networks, e.g. Hopfield networks · CPC title

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US2023018199A1 cover?
Systems, methods, and computer programming products for predicting, preventing and remediating failures of batch jobs being executed and/or queued for processing at future scheduled time. Batch job parameters, messages and system logs are stored in knowledge bases and/or inputted into AI models for analysis. Using predictive analytics and/or machine learning, batch job failures are predicted be…
Who is the assignee on this patent?
Kyndryl Inc
What technology area does this patent fall under?
Primary CPC classification G06F11/0793. Mapped technology areas include Physics.
When was this patent published?
Publication date: Jan 19, 2023 (A1). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 1 related publication on this page (citations in our corpus or others sharing the same primary CPC).