What is claimed is:
1. A method comprising:
determining, by a computing system, a health status for each of a plurality of software components for each of a plurality of time periods, wherein the health status comprises a healthy status and an unhealthy status; determining, by the computing system, a problem software component in the plurality of software components with the unhealthy status at a certain point in time; determining, by the computing system, a set of software components in the plurality of software components that are linked by dependency relationships to the problem software component, wherein multiple software components in the set of software components have the unhealthy status at the certain point in time; tracking, by the computing system, a plurality of events at which software components in the set of software components went from the healthy status to the unhealthy status, wherein the plurality of events correspond to points in time prior to the certain point in time; and rolling back in time through the plurality of events, by the computing system, to locate a software component in the set of software components that was first in time to have its health status go from the healthy status to the unhealthy status.
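As a rough illustration of the steps recited in claim 1, the following Python sketch gathers the components linked to a problem component and walks back through recorded status-change events to find the earliest healthy-to-unhealthy transition. The names HealthEvent, linked_components, and first_unhealthy, the integer timestamps, and the choice to follow dependency links only in the "depends on" direction are assumptions made for this example and are not taken from the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthEvent:
    time: int           # hypothetical integer timestamp of the transition
    component: str      # component whose health status changed
    to_unhealthy: bool  # True for healthy -> unhealthy, False for the reverse


def linked_components(problem, depends_on):
    """Return the problem component plus everything it depends on, transitively."""
    seen = {problem}
    stack = [problem]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def first_unhealthy(problem, depends_on, events, at_time):
    """Roll back through events before at_time and return the component
    whose healthy -> unhealthy transition happened first, or None."""
    linked = linked_components(problem, depends_on)
    candidates = [
        e for e in events
        if e.component in linked and e.to_unhealthy and e.time < at_time
    ]
    earliest = None
    # Visit events newest to oldest; the last one visited is the earliest.
    for event in sorted(candidates, key=lambda e: e.time, reverse=True):
        earliest = event
    return earliest.component if earliest else None


# Hypothetical data: web depends on api, api depends on db; cache is unrelated.
depends_on = {"web": ["api"], "api": ["db"], "db": []}
events = [
    HealthEvent(5, "cache", True),
    HealthEvent(10, "db", True),
    HealthEvent(12, "api", True),
    HealthEvent(15, "web", True),
]
root = first_unhealthy("web", depends_on, events, at_time=20)
```

In this sketch the unrelated "cache" event is ignored because "cache" is not linked to "web", so the search settles on "db". Recovery events are not considered here; a fuller version would likely need to account for components that became healthy again before the point in time of interest.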
2. The method of claim 1, further comprising:
identifying, by the computing system, the software component that was the first in time to have its health status go from the healthy status to the unhealthy status as a root cause of the problem software component having the unhealthy status at the certain point in time.
3. The method of claim 1, wherein the rolling back in time through the plurality of events to locate the software component in the set that was the first in time to have its health status go from the healthy status to the unhealthy status comprises the computing system:
based on the plurality of events, determining a previous state of the set of software components at which point the health status for a software component in the set of software components on which the problem software component depends went from the healthy status to the unhealthy status; and determining that the software component on which the problem software component depends was a cause of the unhealthy status of the problem software component.
4. The method of claim 1, further comprising the computing system:
displaying a dependency map for a selected state of the set of software components on a user interface display coupled to the computing system, wherein the dependency map shows dependencies between the set of software components, wherein the dependency map further shows whether each software component in the set of software components has the healthy status or the unhealthy status at the selected state.
5. The method of claim 1, further comprising:
displaying, on a user interface display coupled to the computing system, an event-line that comprises an indicator for each of the plurality of events (904); receiving a selection of the indicator for one of the plurality of events; and displaying, on the user interface display, a dependency map that comprises the set of software components in response to receiving a selection of the indicator, wherein the dependency map shows dependencies between the set of software components, wherein the dependency map further shows whether each software component in the set of software components has the healthy status or the unhealthy status at the event that corresponds to the selected indicator.
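The per-event view described in claims 4 and 5 implies that the health status of every component can be reconstructed for a selected event. One possible way to do that, sketched below in Python, is to replay the recorded events up to the selected time. The function name state_at, the tuple layout (time, component, to_unhealthy), and the assumption that every component starts out healthy are all hypothetical and are not drawn from the claims.

```python
def state_at(components, events, selected_time):
    """Return {component: status} as of selected_time by replaying events.

    Assumes every component is healthy before its first recorded event.
    """
    status = {name: "healthy" for name in components}
    # Tuples sort by time first, so this replays events in chronological order.
    for time, component, to_unhealthy in sorted(events):
        if time > selected_time:
            break
        status[component] = "unhealthy" if to_unhealthy else "healthy"
    return status


# Hypothetical event log: (time, component, to_unhealthy)
events = [(10, "db", True), (12, "api", True), (18, "db", False)]
components = ["web", "api", "db"]

snapshot_early = state_at(components, events, selected_time=12)
snapshot_late = state_at(components, events, selected_time=20)
```

A display layer could then color each node of the dependency map from the returned dictionary; the rendering itself is outside the scope of this sketch.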
6. The method of claim 1, wherein the health status of a software component for a point in time is based on a deviation between a value of a metric for the software component at that point in time and a baseline value for the metric for the software component.
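A minimal Python sketch of the deviation test in claim 6 follows. The claim does not say how large a deviation must be to count as unhealthy, so the relative tolerance max_relative_deviation and its default of 0.25 are assumptions chosen only for illustration, as is the function name. The sketch also assumes a nonzero baseline.

```python
def status_from_deviation(value, baseline, max_relative_deviation=0.25):
    """Classify a metric value by how far it strays from its baseline.

    max_relative_deviation is a hypothetical tolerance, not a value
    stated in the claims. The baseline must be nonzero.
    """
    deviation = abs(value - baseline) / abs(baseline)
    return "unhealthy" if deviation > max_relative_deviation else "healthy"


# Hypothetical response times in milliseconds against a 100 ms baseline.
normal = status_from_deviation(110, 100)
degraded = status_from_deviation(180, 100)
```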
7. The method of claim 1, wherein the determining, by a computing system, a health status for each of a plurality of software components for each of a plurality of time periods comprises:
determining, by the computing system, a health score for each of the plurality of software components for each of the plurality of time periods, wherein the health score for a given software component for each respective time period is based on a metric for the given software component for the respective time period, wherein a health score at or above a threshold score indicates the healthy status and a health score below the threshold score indicates the unhealthy status.
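The thresholding in claim 7 can be sketched as follows. The claim only requires that the score be based on a metric, so the scoring function here (100 minus the error rate expressed as a percentage), the threshold of 70, and the function names are assumptions made for the example.

```python
def health_score(error_rate):
    """Hypothetical scoring: 100 for no errors, 0 when every request fails."""
    return 100.0 * (1.0 - error_rate)


def statuses_by_period(error_rates, threshold=70.0):
    """Map {component: [error rate per period]} to a status per period.

    A score at or above the threshold is healthy; below it is unhealthy.
    """
    return {
        component: [
            "healthy" if health_score(rate) >= threshold else "unhealthy"
            for rate in rates
        ]
        for component, rates in error_rates.items()
    }


# Hypothetical error rates for two components over two time periods.
observed = {"api": [0.05, 0.5], "db": [0.0, 0.1]}
statuses = statuses_by_period(observed)
```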
8. The method of claim 7, wherein the determining the health score for each of the plurality of software components comprises the computing system:
accessing a baseline for a metric of a first software component of the plurality of software components, wherein the baseline is associated with a point in time; and determining that a value for the metric of the first software component at the point in time was outside of a predicted value for the metric given the baseline.
9. The method of claim 1, wherein the determining, by a computing system, a health status for each of a plurality of software components for each of a plurality of time periods comprises:
detecting an event that indicates an anomaly with respect to a particular software component; and determining that the health status of the particular software component is the unhealthy status in response to detecting the event.
10. The method of claim 1, further comprising:
recording, by the computing system, the plurality of events in computer readable storage in response to determining that the health status of ones of the plurality of software components has changed either from the healthy status to the unhealthy status or from the unhealthy status to the healthy status.
11. The method of claim 1, further comprising:
for each respective software component of the plurality of software components, performing the following: collecting, by the computing system, performance metrics for the respective software component for a certain time interval; determining a health score for the respective software component for the certain time interval based on values for the performance metrics for the respective software component for the certain time interval; accessing a health score for the respective software component for a previous time interval to the certain time interval; determining whether a change from the health score for the previous time interval to the health score for the certain time interval indicates a change between the healthy status and the unhealthy status for the respective software component; recording a first event that indicates that the health score for the respective software component went from the healthy status to the unhealthy status during the certain time interval in response to a determination that the change in health score so indicated; and recording a second event that indicates that the health score for the respective software component went from the unhealthy status to the healthy status during the certain time interval in response to a determination that the change in health score so indicated.
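The interval-by-interval comparison in claim 11 might look roughly like the Python sketch below. It assumes the health scores have already been computed for consecutive intervals and that a single threshold separates healthy from unhealthy, as in claim 7. The function name, the event tuple layout, and the threshold value are hypothetical.

```python
def record_transitions(component, scores, threshold=70.0):
    """Compare each interval's score with the previous one and record an
    event whenever the status crosses the threshold in either direction.

    Returns a list of (interval_index, component, transition) tuples.
    """
    events = []
    for index in range(1, len(scores)):
        previous_healthy = scores[index - 1] >= threshold
        current_healthy = scores[index] >= threshold
        if previous_healthy and not current_healthy:
            events.append((index, component, "healthy->unhealthy"))
        elif not previous_healthy and current_healthy:
            events.append((index, component, "unhealthy->healthy"))
    return events


# Hypothetical scores for one component over five consecutive intervals.
db_events = record_transitions("db", [90, 80, 40, 30, 75])
```

Only changes of status produce events in this sketch; a drop from 90 to 80 or from 40 to 30 is not recorded because the status stays the same on both sides.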
12. The method of claim 1, further comprising:
determining dependencies between the plurality of software components as the plurality of software components process transactions.
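Claim 12 does not say how dependencies are derived from transactions. One plausible reading, sketched below in Python, is that each transaction is observed as an ordered chain of components, and each component is treated as depending on the one it hands off to next. The trace format and the function name are assumptions for illustration only.

```python
def dependencies_from_transactions(traces):
    """Build {component: set of components it depends on} from call chains.

    Each trace is assumed to be an ordered list of component names, where
    each component calls the next one in the list.
    """
    depends_on = {}
    for trace in traces:
        for caller, callee in zip(trace, trace[1:]):
            depends_on.setdefault(caller, set()).add(callee)
            depends_on.setdefault(callee, set())
    return depends_on


# Hypothetical transaction traces observed at runtime.
traces = [["web", "api", "db"], ["web", "auth"]]
dependency_map = dependencies_from_transactions(traces)
```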
13. An apparatus, comprising:
a storage device; and a processor in communication with the storage device, wherein the processor: determines a health status for each of a plurality of software components for a plurality of time periods, wherein the health status comprises a healthy status and an unhealthy status; determines a problem software component of the plurality of software components with the