Second failure data capture in co-operating multi-image systems

US9921950B2 · US · B2

Patent metadata
Publication number: US-9921950-B2
Application number: US-201615068910-A
Country: US
Kind code: B2
Filing date: Mar 14, 2016
Priority date: Aug 8, 2012
Publication date: Mar 20, 2018
Grant date: Mar 20, 2018

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

A method captures diagnostic trace information in a computer system having a plurality of software images. Information is received that is associated with a first failure in a first one of the plurality of software images. The received information is distributed to others of the plurality of software images. Further information is captured that is associated with a second failure in another one of the plurality of software images. The information associated with a first failure in a first one of said plurality of software images is combined with the information associated with a second failure in another of said plurality of software images, and the combined information is analyzed in order to determine a cause of the first failure.
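The flow in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `FailureInfo`, `SoftwareImage`, `distribute`, and `analyze` are hypothetical, and the "analysis" is reduced to checking whether both failures name the same software component.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FailureInfo:
    image_id: str    # image in which the failure occurred
    component: str   # failing software component (hypothetical field)
    detail: str      # free-form diagnostic detail


@dataclass
class SoftwareImage:
    image_id: str
    known_failures: list = field(default_factory=list)

    def receive(self, info):
        # Store failure information distributed from another image.
        self.known_failures.append(info)

    def capture_failure(self, component, detail):
        # Capture information about a failure occurring in this image.
        return FailureInfo(self.image_id, component, detail)


def distribute(info, images):
    # Send the first failure's information to every other image.
    for image in images:
        if image.image_id != info.image_id:
            image.receive(info)


def analyze(first, second):
    # Combine both failure records and report a component they share, if any.
    combined = [first, second]
    counts = Counter(f.component for f in combined)
    component, hits = counts.most_common(1)[0]
    return component if hits > 1 else None


images = [SoftwareImage("a"), SoftwareImage("b"), SoftwareImage("c")]
first = images[0].capture_failure("db-driver", "timeout")
distribute(first, images)
second = images[1].capture_failure("db-driver", "null pointer")
suspect = analyze(first, second)
```

Here the communications mechanism is just a function call; in the claims it may be a load balancer, hypervisor, operating system, monitoring software, or a peer-to-peer mechanism.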

First claim

Opening claim text (preview).

The invention claimed is:

1. A method for capturing diagnostic trace information in a computer system having a plurality of software images, the method comprising: receiving, from a first one of the plurality of images via a communications mechanism, information associated with a first failure in the first one of the plurality of software images, wherein the communications mechanism interconnects the plurality of software images; distributing, via the communications mechanism, the information to others of the plurality of software images; capturing information associated with a second failure in another one of the plurality of software images; determining whether a same software component has failed in the first one and the another one of the plurality of software images; and in response to determining the same software component has failed in the first one and the another one of the plurality of software images, capturing a detailed trace diagnostic information for the software component in the another one of the plurality of software images; combining the information associated with the first failure in the first one of the plurality of software images and the information associated with the second failure in another one of the plurality of software images; analyzing the combined information to determine a cause of the first failure; and based on the cause of the first failure, identifying one or more actions to prevent further failures.

2. The method of claim 1, wherein the communications mechanism is one of a load balancer, a hypervisor, an operating system, monitoring software, and a peer-to-peer communication mechanism.

3. The method of claim 1, wherein distributing the information to others of the plurality of software images further comprises distributing a first portion of the information to first ones of the plurality of software images and distributing a second portion of the information to second ones of the plurality of software images.

4. The method of claim 1, wherein a period for completing the capturing of information expires after a predetermined time period.

5. The method of claim 1, wherein a period for completing the capturing of information expires following a detection of a second failure.

6. The method of claim 1, wherein: each of the software images further comprises one or more processes or threads; the information received is associated with a first failure in a first one of the processes or threads; the information distributed is distributed to others of the processes or threads; and the information captured is associated with a second failure in another one of the processes or threads.

7. The method of claim 1, wherein the information received identifies a factor external to the software images as the cause of the first failure.

8. The method of claim 1, further comprising checking, prior to the receiving step, whether one or more of other ones of the plurality of software images is executing the same software as the first one of the plurality of software images.

9. The method of claim 1, wherein a period in which the capturing of information is performed continues until completion of the analyzing of the combined information to determine the cause of the first failure.

10. The method of claim 1, further comprising: in response to starting at least one of the plurality of software images after a failure, increasing a level of information that is captured for the at least one of the plurality of software images responsive to a subsequent failure.

11. The method of claim 1, wherein the information associated with the first failure is used to configure others of the plurality of software images to capture an increased level of trace diagnostic information responsive to a failure.

12. The method of claim 11, wherein the increased level of trace diagnostic information is captured by others of the plurality of software images responsive to a failure within a predetermined time period, the method further comprising: in response to the predetermined time period expiring, reverting a level of trace diagnostic information that is captured by others of the plurality of software images to a second predetermined level.

13. The method of claim 11, wherein the increased level of trace diagnostic information is captured by others of the plurality of software images responsive to a failure within a predetermined time period, the method further comprising: in response to the predetermined time period expiring, reverting a level of trace diagnostic information that is captured by others of the plurality of software images to a level prior to the first failure.

14. The method of claim 11, further comprising: in response to a sufficient amount of trace diagnostic information being captured, reverting the level of trace diagnostic information that is captured by others of the plurality of software images responsive to a failure to a level that was established prior to the first failure.

15. The method of claim 11, further comprising: load balancing the capturing of the trace diagnostic information across the plurality of software images, wherein each one of the plurality of software images captures trace diagnostic information for a particular one or more parts of a software stack.

16. The method of claim 11, further comprising: load balancing the capturing of the trace diagnostic information across the plurality of software images, wherein each one of the plurality of software images captures a particular one or more parts of a particular subset of the detailed trace diagnostic information.

17. The method of claim 1, wherein each of the plurality of software images has at least one of a non-identical software stack and a non-identical workload.

18. The method of claim 1, wherein the cause of the failure is at least one of: a particular software component, a particular process signal, an input/output (I/O) error, and a memory shortage.

19. The method of claim 1, further comprising: in response to analyzing the first failure in the first one of the plurality of software images, analyzing the information associated with the second failure based on findings associated with the analysis of the first failure to determine a cause of the second failure; and based on the cause of the second failure, identifying at least one action to prevent further failures.
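Claims 11 to 13 describe raising the trace level on the other images after a first failure and reverting it once a predetermined time period expires. A minimal sketch of that behaviour, assuming a hypothetical `TraceController` class, numeric trace levels, and a caller-supplied clock value (none of which come from the patent text), might look like this:

```python
class TraceController:
    """Tracks the trace level one image should use (illustrative only)."""

    def __init__(self, normal_level=1, raised_level=3, window_seconds=60.0):
        self.normal_level = normal_level      # level in use before the first failure
        self.raised_level = raised_level      # increased level after a peer failure
        self.window_seconds = window_seconds  # the "predetermined time period"
        self._raised_at = None                # time the level was raised, if any

    def on_peer_failure(self, now):
        # Another image reported a failure: start capturing more detail.
        self._raised_at = now

    def level(self, now):
        # Return the raised level while the window is open, otherwise revert.
        if self._raised_at is not None and now - self._raised_at < self.window_seconds:
            return self.raised_level
        self._raised_at = None
        return self.normal_level


ctl = TraceController()
before = ctl.level(0.0)     # no failure seen yet
ctl.on_peer_failure(10.0)
during = ctl.level(20.0)    # inside the 60-second window
after = ctl.level(70.0)     # window has expired
```

Passing the current time in as an argument keeps the sketch deterministic; a real system would more likely read a monotonic clock.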

Assignees

Inventors

Classifications

  • in a system implementing multitasking (multitasking per se G06F9/46)

  • in a virtual computing platform, e.g. logically partitioned systems

  • Performance evaluation by tracing or monitoring

  • Error or fault detection not based on redundancy (power supply failures G06F1/30; network fault management H04L41/06)

  • Remedial or corrective actions (recovery from an exception in an instruction pipeline G06F9/3861; by retry G06F11/1402; for recovering from a failure of a protocol instance or entity H04L69/40)

Patent family

Related publications grouped by family.

External sources

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US9921950B2 cover?
A method captures diagnostic trace information in a computer system having a plurality of software images. Information is received that is associated with a first failure in a first one of the plurality of software images. The received information is distributed to others of the plurality of software images. Further information is captured that is associated with a second failure in another one…
Who is the assignee on this patent?
IBM
What technology area does this patent fall under?
Primary CPC classification G06F11/0715. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 20, 2018 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 8 related publications on this page (citations in our corpus or others sharing the same primary CPC).