In recent years, substantial research has been devoted to monitoring and predicting performance degradation in real-world complex systems within large entities such as nuclear power plants, electrical grids, and distributed computing systems. Such systems pose special challenges because they operate in uncertain environments, are highly dynamic, and exhibit emergent behaviors that can lead to catastrophic failure. Discrete-time Markov chains (DTMCs) provide important tools for analyzing such systems, because they represent dynamic behavior succinctly, provide a means to measure uncertainty, and can be used to quantify the potential for change in system performance. Moreover, DTMCs can be extended to be time-inhomogeneous, i.e., to represent behavior that varies over long durations. To date, DTMCs have been proposed for tasks such as fault detection and long-term equipment condition monitoring in real-world complex systems. However, the scope of these models has generally been restricted to describing states and state transitions that directly concern fault conditions or states of degradation. Less work has been done on using DTMCs to represent a more complete range of the states a system may enter during normal operation. Of special interest are sequences of states that make up failure scenarios, in which a system evolves from a normal operating state into an undesirable state that leads to widespread performance degradation. Unfortunately, the use of large DTMCs often involves large search spaces, a problem that in part motivates our work. This paper describes progress on developing an approach for using larger, more detailed DTMC models of operational complex systems to uncover potential failure scenarios. The approach uses a combination of methods to perturb a DTMC, simulate alternative system evolutions, and identify scenarios in which a system proceeds from normal operation to failure.
Key to the approach is the use of graph theory techniques to reduce the size of the search space involved in exploring alternative behaviors. We show how graph theory techniques can be used to identify critical state transitions which can be perturbed to simulate performance degradation. Using critical transitions, it is also possible to estimate the rate of performance degradation and to understand how this rate is likely to change in response to increased failure incidence. Examples are provided of the use of this approach on a DTMC of significant size to identify failure scenarios in a distributed resource allocation system.
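As a rough illustration of what perturbing a DTMC transition and comparing system evolutions can look like, the following Python sketch uses a hypothetical three-state chain (normal, degraded, failed) with made-up probabilities. It is not the model or the method of this paper: the function names (`step`, `perturb`, `failure_probability`), the proportional renormalization of the perturbed row, and the hand-picked transition are all assumptions made for the example, and no graph-theoretic selection of critical transitions is attempted.

```python
# Hypothetical 3-state DTMC: 0 = normal, 1 = degraded, 2 = failed (absorbing).
# All probabilities below are illustrative assumptions, not data from the paper.
P = [
    [0.95, 0.05, 0.00],
    [0.30, 0.60, 0.10],
    [0.00, 0.00, 1.00],
]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def perturb(matrix, i, j, delta):
    """Return a copy of matrix with entry (i, j) raised by delta.

    The remaining entries of row i are scaled down proportionally so that
    the row still sums to 1 (one possible choice of renormalization).
    """
    row = matrix[i]
    new_pij = row[j] + delta
    if not 0.0 <= new_pij <= 1.0:
        raise ValueError("perturbed probability must stay within [0, 1]")
    rest = 1.0 - row[j]
    scale = (1.0 - new_pij) / rest if rest > 0 else 0.0
    new_row = [new_pij if k == j else row[k] * scale for k in range(len(row))]
    result = [list(r) for r in matrix]
    result[i] = new_row
    return result


def failure_probability(matrix, start, failed, steps):
    """Probability mass in the absorbing failed state after the given number of steps."""
    dist = [0.0] * len(matrix)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist[failed]


# Raise the degraded -> failed transition (chosen by hand here) and compare.
Q = perturb(P, 1, 2, 0.10)
baseline = failure_probability(P, start=0, failed=2, steps=50)
perturbed = failure_probability(Q, start=0, failed=2, steps=50)
print(f"baseline: {baseline:.4f}  perturbed: {perturbed:.4f}")
```

Because the failed state is absorbing, the mass it holds after a fixed number of steps equals the probability of having failed by then, so repeating the comparison for increasing `delta` gives a crude view of how the degradation rate responds to a higher failure incidence on that one transition.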
