Abstract
Traditional risk-based design processes seek to mitigate operational hazards by manually identifying possible faults and devising corresponding mitigation strategies—a tedious process which critically relies on the designer’s limited knowledge. In contrast, resilience-based design seeks to embody generic hazard-mitigating properties in the system to mitigate unknown hazards, often by modeling the system’s response to potential randomly generated hazardous events. This work creates a framework to adapt these scenario generation approaches to the traditional risk-based design process to synthetically generate fault modes by representing them as a unique combination of internal component fault states, which can then be injected and simulated in a model of system failure dynamics. Based on these simulations, the designer may then better understand the underlying failure mechanisms and mitigate them by design. The performance of this approach is evaluated in a model of an autonomous rover, where cluster analysis shows that elaborating the faulty state-space in the drive system uncovers a wider range of possible hazardous trajectories and failure consequences within each trajectory than would be uncovered from manual mode identification. However, this increase in hazard information gained from exhaustive mode sampling comes at a high computational expense, highlighting the need for advanced, efficient methods to search and sample the faulty state-space.
1 Introduction
One key motivator for the incorporation of resilience in complex engineered systems is that one cannot always foresee the hazards that the system may encounter in operations [1–3]. Since individual hazard scenarios that are likely to unfold are not fully known, it is desirable for the system to have generic hazard-mitigating properties and behaviors to achieve what is referred to as “graceful extensibility” [4]—the ability of the system to adapt to surprise events. That is, rather than solely designing the system to specifically mitigate a certain set of identified hazards (as is traditionally done in the failure modes and effects analysis (FMEA) process; see Chapter 10 in Ref. [5]), one additionally wishes to design the system to be able to mitigate hazards that are unknown and have not yet been identified.
While many approaches for resilience incorporation have been presented, most [6–8] do not consider unknown hazards directly, instead focusing on improving the system’s dynamic response to known hazards. We place methods that do account for unknown hazards in three categories:
— Discursive design approaches involve enabling the identification of potential fault modes by providing information (e.g., historical data [9–11] or failure archetypes [12]) to help the designer better expand their understanding of potential failure causes and behaviors. While these approaches enable the designer to identify more events that would have otherwise been unknown, they are limited by the historical data that exists.
— Structural design approaches involve changing the structural, behavioral, and parametric properties of the system to achieve a desired level of inherent resilience. This includes network structure-based design—where the network structure of the system is modified to prevent failures from propagating and provide redundant functionality in the case of a failure [13–15]—and capacity-based design—where a buffer is placed on the system’s known failure limits to account for failure causes outside the designer’s knowledge [16]. While these approaches enable the designer to incorporate the generic property of resilience into the system, they are limited in their ability to determine how the system should mitigate any specific hazardous scenario (e.g., to determine appropriate contingency management actions).
— Finally, scenario-based design approaches involve procedurally generating scenarios for the system to respond to, which can come from the representation of the system itself (e.g., failures in individual or joint components) [17,18] or some combination of randomly or selectively generated parameters (internal or external) [19–21]. While these methods enable designers to uncover further scenarios that they may not otherwise consider, they are limited in terms of the sampled distributions and the modeled failure dynamics (which may differ in the real system).
Of these, the scenario-based approaches align best with the framework of using dynamic simulations to determine the system’s post-fault recovery performance (e.g., in the resilience triangle [6]). One open research question for the scenario-based methods is how to best generate the set of scenarios over which to analyze the system. While prior work in this area has demonstrated the incorporation of resilience with respect to events that originate outside the system (e.g., changing planetary conditions [19], adversarial attacks [20–22]), little work has translated this perspective into the traditional engineering process for designing a system’s contingency management, where the hazards of interest arise from internal failure mechanisms that lead to system failure. Conversely, while traditional risk-based design approaches (e.g., FMEA) enable one to consider the system’s ability to handle identified fault modes, they are limited in their ability to discover new modes. As a result, these methods are most effective during the redesign of an existing system with operational hazard data [23,24]. When designing a new system, however, there is much less data available to inform design. Hence, to aid resilient design, there is an opportunity to use scenario-based approaches to help the designer discover new modes that they might not otherwise identify.
The main contribution of this work is the development of a synthetic mode generation approach which leverages fault simulation to enable the designer to consider not just the consequences of discrete identified fault modes, but the faulty state-space resulting from the underlying failure mechanisms of the system. Because this assessment happens in simulation, a large number of potential fault modes can be evaluated rapidly, making it possible to exhaustively evaluate system resilience over the space of hazard-producing parameters. We further demonstrate how to evaluate the performance of these approaches in a model of an autonomous rover whose task is to follow line markings, using cluster analysis to compare the space of consequences revealed by this approach with that revealed by manual identification. Section 2 presents the background about risk-based design and resilience simulation, Sec. 3 presents the implementation and use of synthetic mode generation in design, Sec. 4 presents the evaluation of different mode generation approaches in the rover model, and Secs. 5 and 6 present the discussion of and conclusions from this evaluation, respectively.
2 Background
To contextualize how this approach fits within the risk-based design process and previous approaches for the simulation-based design of resilience, this section covers how fault risk is considered in engineering design and how existing fault simulation approaches have been leveraged in this process.
2.1 Risk-Based Design.
Removing risk is an inherent part of the engineering process. Traditional mechanical engineering analysis is often oriented around preventing defined modes of failure such as stress and fatigue [25]. General consideration of risk is traditionally taken into account in the early engineering process using the FMEA, in which failure modes are identified, their effects are determined, and overall risk is evaluated by estimating likelihood and severity (see Chapter 10 in Ref. [5]). The FMEA is a tabular analysis across the entire system and can be developed for the system’s functions, components, and manufacturing processes as the design process advances through the conceptual, embodiment, and construction design stages (Chapter 10 in Ref. [5]). While the FMEA is a helpful risk identification tool, it can be time consuming, difficult to iterate on, and prone to designer biases—especially the designer’s lack of knowledge in the fault mode brainstorming process [26].
Simulation-based design approaches can help resolve these limitations by automating the evaluation of failure consequences with a model that determines how a fault causes hazardous states given the component/functional dependencies and/or system behavior. With a fault simulation, one can thus automatically generate the FMEA iteratively over time as the system changes without having to go through a detailed expert-driven analysis each time [27]. Fault simulation approaches also enable one to evaluate a larger set of fault scenarios (such as joint-fault modes [28] or faults injected over a range of model parameters, e.g., injection times [29]) to reveal more information about how and in what circumstances hazardous scenarios unfold. In this article, we further build on these simulation-based hazard identification approaches to generate new modes and further reveal the faulty state-space inherent to a given system’s behavior and parameters.
2.2 Fault Simulation for Resilience.
Simulation is used in the resilience-based design process to evaluate the system response to hazardous scenarios. A number of prior formalisms have been developed for fault simulation [30], including network topology models [31], failure logic/graph-based error propagation [32], dynamic behavioral simulation [33,34], and stochastic simulation [35]. In early design, it is often necessary to consider the high-level hazards inherent to the conceptual design of the system. To accomplish this, frameworks such as functional-failure identification and propagation [36], inherent behavior of functional models [37], and function-based failure propagation [38] have been developed, which propagate faults through the high-level function structure of the system to understand the risks [39]. This work uses the fmdtools package and framework to evaluate faults, which builds on these early design approaches and is capable of embodying a number of simulation formalisms by using an object-oriented dynamic representation of the system structure and behavior [40].
One of the more common applications of fault injection in practice is software design, where it is used to ensure that the software will continue to perform well in the presence of hazards that will inevitably occur in operations [41,42]. Since software can run on a number of different machines and faults can additionally come from external attacks, software flaws, and operator errors [42], one may not be able to accurately predict the distributions of the underlying faults [43]. Thus, while a number of software fault injection tools use fault distributions based on the physics of failure of the underlying hardware, randomly sampling possible faults [43–46] and orchestrated attack approaches [47] are also used. In some situations (where the relative probability of faults is within the same order of magnitude), the results from fault mode sampling are quite similar to the results one would get if the underlying distribution is known [43]. These same processes are used in the field of chaos engineering, the main difference being that chaos engineering approaches test the software in production, rather than in a controlled test setup [48]. However, while these processes are used often in software design, they have seen less use in a general engineered (i.e., cyber-physical) systems context. The goal of this work is thus to adapt these types of methods to this domain in the context of simulation-based design, where model parametrics may be used to evaluate a system in a wide variety of circumstances.
2.3 Research Gap.
The simulation approaches discussed in Sec. 2.1 require fault modes to perform the respective studies. While there are some automated failure analysis methods in the literature that can assist in the design of complex systems, they are either application specific or do not assist with the identification of fault modes. While automated FMEA methods use computation to assist the evaluation of fault modes, they still require the designer to specify these modes up-front [49,50]. Automated fault tree generation methods [51] and event tree generation methods [52–55] can additionally create failure scenarios from system structure, but typically generate very high-level faults (e.g., subsystem “failure”), rather than specific changes in parameters that lead to failure. Similarly, early design risk analysis methods [37,56–58] can generate large sets of fault scenarios from a failure model, but the scenario generation used in these techniques suffers from the same limitations—the designer must still define the underlying fault modes used to create scenarios. More sophisticated scenario generation techniques are often used in late-stage requirement verification and validation methods [59–61] to verify performance over hazardous scenarios, but unfortunately these methods have not been brought into early design, where there is opportunity to use them to identify scenarios unforeseen by the designer. The goal of this work is to adapt this concept of scenario generation to the exploration and discovery of potential fault modes in early design.
In addition, the difference between using fault injection to incorporate resilience in software systems (as described in Sec. 2.2) and using it in a general engineered system is the physical constraints that shape the underlying design trade space. In software fault injection, the faulty state-space is often defined by potential hardware modes (e.g., bit-flips), which creates a large faulty state-space (because of the large numbers of bits), the distribution of which is relatively uniform. In the early design context, however, the modes are not fully identified and the underlying distribution is not necessarily known very well. The physics of failure of the system will make it such that some modes are very prevalent, while others are not. In addition, the space of potential design solutions available in early design is quite large and often does not come “for free” as it would in software design—fail-safes and redundancies could come at the cost of significant loss of efficiency. Thus, to best leverage the adaptation presented in this study, it is important to consider the costs associated with possible failure mitigations—if a mitigation comes with significant costs, it may be necessary to balance this cost against the cost of unknown failures (for which more data would be preferred to justify the decision). If the mitigations come at no cost, precise valuation does not have to be considered. For example, designing the contingency management in the control system in Fig. 2 requires no consideration of trade-offs because each individual fault state is unique and new fault information only helps the rover decide what to do in that particular instance.
3 Method
Synthetic mode generation augments the traditional manual failure mode identification process used in scenario-based design for resilience by leveraging the ability of the underlying simulation to be parameterized in terms of performance-affecting states. Since modes can be represented as a change in parameters affecting the performance and/or functionality of the system, the set of possible modes can in turn be represented as points (or, more precisely, regions) within the space of these parameters. Considering this, synthetic mode generation seeks to help the designer explore this space to reveal novel types of failures that would otherwise not be identified. On the basis of this information, they can then mitigate these revealed types of failures and their underlying failure mechanisms, as shown in Fig. 1 and explained in more detail in the next subsections.
3.1 Design/Analysis Process.
In the process shown in Fig. 1, one first develops a model of the dynamics of (nominal and faulty) system behaviors (represented by the component behaviors and control system modes in Fig. 1). Then, hazard-affecting fault states and their respective domains are identified by the designer based on the set constraints inherent to the underlying failure dynamics (see Sec. 3.2). The identified states define a faulty state-space, which is then sampled to create a set of potential hazardous modes (see Sec. 3.3). These modes are then simulated in the model, generating a set of responses that are then analyzed to identify unique failure trajectories and the underlying mechanisms that cause these trajectories (see Sec. 3.4). Based on this analysis, these mechanisms can be mitigated in the design of the system, such as in its structure (e.g., component architecture), parametrics (e.g., safety factor), or control policy (e.g., contingency management).
The main difference between this approach and the conventional manual identification approach used in the FMEA process is that it is less limited by designer experience. Instead of needing specific, known failure modes to be identified from previous experience (which may not be present for a new system), the process only needs the designer to identify parameters that affect system behavior and define feasible ranges for these parameters. This has both advantages and pitfalls. The goal of this approach is to gather a much larger set of potential modes than could otherwise be manually identified, as illustrated in Fig. 3. As shown, in the future (designed) system, there will be a set of fault modes that will occur in future operations. The goal of risk-based design is to mitigate these modes; however, in a manual identification process, the designer may not be able to think of and thus mitigate all of them in advance. In contrast, synthetic mode generation spans the entire faulty state-space, thus casting a wider net to represent future occurrences. However, it also has the potential to introduce more modes into the analysis that will not occur in the future system or are unlikely to occur. This limitation is especially salient when potential mitigations have known downsides (e.g., to performance or cost). Thus, this process still relies on the designer’s judgement to estimate or model the probability of these sets of scenarios (and the resulting failure trajectories) to determine their overall risk impact and determine what design changes may be justified to mitigate this risk. The main goal is to extend the designer’s understanding of potential failures, rather than to supplant existing risk and resilience design processes.
3.2 Formalism.
Synthetic mode generation happens in the context of an overall scenario-based design approach involving a dynamic model that is simulated over a number of different scenarios. Thus, it relies on some assumptions about model form, as illustrated in Fig. 2. As shown, in this formalism, a function in the model may be composed of a control system and a component architecture. A component in the component architecture has physical behaviors (f1, f2, and f3) that translate the inputs of the function X into outputs Y at each time in the simulation. The high-level control system controls the component architecture behaviors by switching, depending on the current variable states, between modes that may reconfigure the component architecture or modify component behavior. In Fig. 2, for example, the system alternates between modes m1 and m2 depending on the value of input variable X3, which in turn controls the action variable a and thus the behaviors in equations f1 and f2. These attributes effectively define the nominal operations of the system.
In this formalism, potential hazards to system function are then represented using fault states, which are parameters (h1, h2, and h3) that modify the behaviors of the component by taking on different values than in the nominal state. These fault states are incorporated in the behaviors by modifying the underlying equations with mathematical operations (e.g., addition/subtraction for constant drift or multiplication/division for amplification/reduction) incorporating the modifying parameters. Based on the properties of these modified equations, the designer then defines continuous (e.g., (0, 2)) or discrete (e.g., {−1, 0, 10}) domains for these states. To adequately define these domains, set constraints must be identified for the fault parameters based on the physical limitations of the equations. These constraints can be identified by determining the physical limitations of the underlying equation (i.e., the values that would cause the equations to give results without physical meaning). For example, zero values would define a constraint for fault states that are divisors to avoid infinite output values, while fault states within periodic functions would only be defined over the period of the function to avoid repeated values. However, it should be noted that set constraints may be difficult to identify with certainty. In this case, the designer may instead choose to start with a wide range of potential parameters at low resolution and successively decrease the range while increasing the resolution to focus on the region of interest that produces distinct and physically meaningful failure modalities.
With this faulty state-space defined, a hazardous mode (e.g., H1, H2, …, Hn) is then a tuple of fault states (h1, h2, and h3) where at least one state is off-nominal. Within this formalism, hazards may then be controlled via model parametrics, component architecture, or the control system. For example, in Fig. 2, the control system, which switches the modes of the system from m1 to m2, can make this switch depending on sensed hazardous states S1 or S2.
To illustrate this formalism, consider the case of a household lamp. In this system, the function of providing light is embodied by a component or set of components (light bulbs), which translates input electrical potential into visible light, and is controlled by a control system (light switch), which determines whether electricity flows through the bulb based on states input by the user. Fault modes then arise through physical modification of the component behavior (e.g., burning out, damaged contacts, excess heat), which would correspond to changes in the underlying equations (e.g., setting circuit inverse resistance, and thus energy flow, to zero). However, the underlying fault states may take on more values than just the identified burnout fault mode, such as dimming (below-nominal inverse resistance) or increased power draw (above-nominal inverse resistance). Thus, the limiting set constraints defining the inverse resistance fault state would range between zero (since negative power consumption is not possible) and the maximum possible current/power draw, which would lead to imminent burnout. To complete the illustration of the formalism, hazard management would be accomplished by the user detecting that the bulb has failed and replacing it.
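To make this formalism concrete, the sketch below (plain Python, not the fmdtools API) encodes the lamp illustration: the component behavior equation is parameterized by an inverse-resistance fault state with a bounded domain, and individual hazardous modes are simply particular values of that state. The nominal value and upper bound used here are hypothetical placeholders.

```python
# Minimal sketch of the Sec. 3.2 formalism using the lamp illustration.
# G_NOM and G_MAX are hypothetical placeholder values, not taken from the article.
from dataclasses import dataclass

G_NOM = 1.0   # nominal inverse resistance (arbitrary units)
G_MAX = 5.0   # upper set constraint: draws above this imply imminent burnout


@dataclass
class BulbFaultState:
    """Fault state h (inverse resistance), constrained to the domain [0, G_MAX]."""
    g: float = G_NOM

    def __post_init__(self):
        if not 0.0 <= self.g <= G_MAX:
            raise ValueError("fault state value outside its set constraints")


def bulb_behavior(voltage: float, switch_on: bool, h: BulbFaultState) -> dict:
    """Component behavior f: power drawn and light output, modified by fault state h.g."""
    current = voltage * h.g if switch_on else 0.0   # g = 0 reproduces the burnout mode
    power = voltage * current
    return {"power": power, "light": power}         # idealized: light proportional to power


# A hazardous mode is a particular (off-nominal) value of the fault-state tuple:
burnout = BulbFaultState(g=0.0)    # no current flows, so no light
dimming = BulbFaultState(g=0.5)    # below-nominal inverse resistance
overdraw = BulbFaultState(g=4.0)   # above-nominal draw, approaching the burnout limit
print(bulb_behavior(120.0, True, overdraw))
```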
3.3 Mode Elaboration.
Given the underlying assumptions about the domain of the faulty parameters, the resulting faulty state-space can be defined as continuous, discrete, or mixed continuous-discrete. When parameters are continuous, the space cannot be represented completely and thus must be sampled with a set resolution, resulting in a range elaboration approach where each fault state is represented as discrete values Dj = {hj,min, hj,min + rj, …, hj,max}, where rj is the resolution for the fault state. Alternatively, when the parameters are considered to be discrete, the space can be represented directly with the given discrete values, resulting in a set elaboration approach. In both approaches, as the number of fault states m (and the number of possible values |Dj| for each) increases, the number of potential fault modes increases combinatorially, on the order of |D|^m (more precisely, the product of the |Dj| over all fault states).
As a result, as the faulty state-space increases in size, it may become necessary to reduce model computational costs and lower the number of states needed to query by (1) constraining the set (e.g., specifying exclusivity between perturbations of identified fault states), (2) only considering a subset of n joint perturbations at a time (e.g., using the single-state elaboration approach shown in Fig. 3), or (3) sampling from the set in a way that admissibly represents the set of hazards for the purpose of the analysis (e.g., by adjusting the resolution or using Monte Carlo or adaptive sampling techniques that perform well in higher dimensionality spaces). In particular, there is opportunity here to use existing knowledge about the system to determine where fault states are likely to interact (requiring joint sampling) or be independent (enabling separated sampling) based on factors such as proximity and the underlying physics of failure. It may also be helpful to limit the set of possible fault state values to discrete quantities of interest (e.g., low, nominal, high) rather than treating the domain as continuous—a demonstration of this use of the set elaboration approach is thus included in the demonstration in Sec. 4.
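As a rough illustration of these elaboration strategies, the sketch below enumerates a faulty state-space over the drive-system fault states named in Sec. 4 (friction, transfer, and drift) with itertools; the domains, nominal values, and resolutions shown are hypothetical placeholders rather than the values used in the demonstration.

```python
# Minimal sketch of range, set, and single-state elaboration.
# Domains, nominal values, and resolutions are hypothetical placeholders.
import itertools
import numpy as np

nominal = {"friction": 0.0, "transfer": 1.0, "drift": 0.0}

# Range elaboration: D_j = {h_min, h_min + r_j, ..., h_max} for each fault state.
ranges = {"friction": np.linspace(0.0, 20.0, 41),    # resolution r = 0.5
          "transfer": np.linspace(0.0, 1.0, 11),     # resolution r = 0.1
          "drift":    np.linspace(-0.5, 0.5, 21)}    # resolution r = 0.05
range_modes = [dict(zip(ranges, vals)) for vals in itertools.product(*ranges.values())]

# Set elaboration: only discrete quantities of interest per state (e.g., low/nominal/high).
sets = {"friction": [0.0, 1.0, 10.0],
        "transfer": [0.0, 0.5, 1.0],
        "drift":    [-0.2, 0.0, 0.2]}
set_modes = [dict(zip(sets, vals)) for vals in itertools.product(*sets.values())]

# Single-state elaboration: only one fault state perturbed from nominal at a time.
single_modes = [dict(nominal, **{name: float(val)})
                for name, vals in ranges.items() for val in vals
                if val != nominal[name]]

# The |D|^m growth is what motivates constraining or sampling this space.
# (In practice the all-nominal combination would also be excluded, since a hazardous
#  mode requires at least one off-nominal state.)
print(len(range_modes), len(set_modes), len(single_modes))
```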
3.4 Simulation and Analysis.
To inform analysis, the scenarios are then simulated in the system behavioral model to determine their effects. If there are multiple time-steps in the simulation, the distribution of effects over the interval may be represented by sampling different injection times, per Ref. [29]. It may also be possible to sample fault modes successively, as is done in stress testing approaches [21]. Regardless, the simulations produce histories of states over time, as well as end-states S(H) for the simulations. In previous frameworks for resilience quantification [40], these simulation results have been classified in terms of severity and/or cost; however, to give the designer insight into the underlying failure mechanisms, we propose further analyzing the responses to identify the unique failure trajectories that occur as a result of failure modes.
In this work, clustering methods are used to group scenarios that produce similar results. Clustering methods group points that are similar in defined dimensions (e.g., distance in space), essentially grouping failure results that represent distinct modalities of failure. Based on this information, the designer can then identify the distinct failure trajectories and evaluate their relative severity and occurrence within the model responses, as well as the distributions (min/max values) within clusters. Finally, together with the underlying fault states used to generate these responses, these clusters can be used to identify the mechanisms and parameter ranges that cause each type of failure. For example, in an aircraft, particular ranges of control-surface parameters could be identified that lead to the spiral, stall, or pitch-down conditions observed in the simulation results.
4 Demonstration
To demonstrate the mode generation approach, we use a model of an autonomous rover to show how this approach may be implemented in a model, as well as what results may be expected of it (Secs. 4.1 and 4.2). It is then used to compare (range and set-type) synthetic mode elaboration approaches with a more traditional mode identification-based approach (Sec. 4.3).
4.1 Rover Model.
To demonstrate this framework, this article presents the design of an autonomous rover. This rover was modeled at a high level to perform a basic autonomous driving task, with its functional model shown in Fig. 4. As shown, this model encompasses the rover power and control systems, as well as its drive system, avionics, and interactions with its environment (i.e., movement and position with respect to a map). The task is to follow a given line from a given starting location to a given end location. While many different input lines can be used for different routes, the route used in this article for demonstration purposes follows a simple L-curve, as shown in Fig. 6. If the rover deviates from the centerline, it may go off course and crash into its surroundings. When the distance from the centerline is greater than 1 m, the rover can no longer see the centerline and stops moving because it has crashed. The major fault effect considered here is thus how far the rover deviates from the centerline in fault scenarios, with priority given to fault scenarios that directly lead to a crash.
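A minimal sketch of this failure criterion is shown below: the rover's deviation from the centerline is computed against the route, and deviations greater than 1 m are classified as crashes. The L-curve waypoints here are hypothetical placeholders rather than the actual route geometry.

```python
# Minimal sketch of the line-distance metric and crash criterion described above.
# The centerline waypoints are hypothetical placeholders for the L-curve route.
import numpy as np

centerline = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])  # simple L-shaped route


def line_distance(pos: np.ndarray) -> float:
    """Minimum distance from the rover position to any centerline segment."""
    dists = []
    for a, b in zip(centerline[:-1], centerline[1:]):
        t = np.clip(np.dot(pos - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
        dists.append(np.linalg.norm(pos - (a + t * (b - a))))
    return min(dists)


def classify(pos: np.ndarray) -> str:
    """Deviations beyond 1 m from the centerline are treated as crashes (off course)."""
    return "crash (off course)" if line_distance(pos) > 1.0 else "on course"


print(classify(np.array([4.0, 0.4])), classify(np.array([12.0, 3.0])))
```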
4.2 Approach Setup.
To address these faults and show the value of synthetic mode sampling in this design scenario, this work compares traditional manual identification with the set elaboration and range elaboration approaches defined in Sec. 3.3. For simplicity of demonstration, this analysis focused solely on faults originating from the drive function, the goal of which is to move and turn the rover given external power and control inputs for velocity and heading. It also focused on a single fault time where the rover is in the middle of turning (rather than sampling several times). In the drive system, the following fault states were identified to define the faulty state-space:
— friction, which is resistance that makes the rover require more power to move a given distance;
— transfer, which is the ability of the rover to move forward in a given time-step at a level of input power; and
— drift, which is the misalignment of the rover trajectory from its intended heading.
In addition, the following fault modes were identified manually for comparison:
— stuck, where the rover is blocked from moving forward;
— stuck_left and stuck_right, where the left or right side of the rover is blocked from moving, respectively; and
— elec_open, where power is disconnected from the rover motors.
4.3 Analysis Comparison.
Simulating each of these fault modes in the model at a given time results in the fault trajectories shown in Fig. 6. As shown, the manually identified modes mainly result in the rover stopping at the fault injection location, with one (stuck) resulting in the rover advancing further but not reaching the goal. In the set elaboration approach, several more trajectories are uncovered in which the rover makes it closer to the goal. In the range elaboration approach, a much larger space of trajectories is uncovered, including ones where the rover turns around and travels along the line backwards. The corresponding distribution of line distances (the metric for identifying when crashes occur) for each approach is shown in Fig. 7. As shown, the raw number of scenarios in the range elaboration approach is orders of magnitude higher than in the set elaboration and manual identification approaches, producing many more scenarios that result in a crash. In addition, the coverage of the space of failure severities (i.e., the distance from the line) was much larger for the elaboration approaches, with a range of 0.0 m to 1.5 m (for the range approach) instead of 0.5 m to 1.0 m for manual identification. As a result, relying on manual identification or the set elaboration approach in this instance would have resulted in missing the possibility that the rover goes beyond a line distance of 1.0 m, the threshold for going off-course. This ability to identify a broader range of scenarios is important because (1) it helps one to better identify the true worst-case scenario possible and (2) low-severity failures can often play an important role throughout the product lifecycle if they occur more often than expected, driving up maintenance cost and making more severe joint-fault modes more likely.
While the range elaboration approach uncovers a wider range of fault severities, it should be noted that the majority (note the log scale in Fig. 7) are still within the range expected from the set elaboration approach and manual identification—only a few additional severities are caught in the tails of the distribution. However, given that the range elaboration approach uses such a high resolution, one may wonder whether it has actually uncovered more meaningful hazard information. That is, simulating a large number of fault state combinations at high resolution may result in a large number of scenarios that essentially play out in the same way and do not add anything more to the analysis. To evaluate the extent to which the range approach is affected by this, Sec. 4.3.1 uses clustering to identify and group similar simulation results and thus discover how much information was truly gained.
4.3.1 Cluster Analysis.
Clustering was used to identify sets of fault state combinations that caused similar results. To perform this evaluation, the DBSCAN algorithm [62] in scikit-learn [63] was used to group the final x–y locations of the simulations in the responses from the three approaches. The DBSCAN algorithm was used to identify duplicates because it (as a density-based algorithm) identifies densely packed clusters of points and neglects outliers (as opposed to other clustering methods, which group all points spatially into a set number of categories). As shown in Fig. 8, the algorithm identified four clusters (and the −1 cluster, which is made of response points that do not fit with any of the others). Scenarios in the 0 cluster essentially immediately immobilize the rover, either directly or by making it crash into its surroundings. The other clusters include stopping at the end location (cluster 1, which represents successes) and stopping after veering left (cluster 2) or right (cluster 3). However, the most interesting scenarios are the unclustered ones (cluster −1), which are distributed throughout and represent scenarios where the rover may have too much resistance or may veer in an adverse direction while still remaining on course.
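A minimal sketch of this clustering step is shown below, using DBSCAN from scikit-learn on the final x–y positions of the simulated scenarios; the synthetic end positions and the eps/min_samples settings are hypothetical and chosen only for illustration.

```python
# Minimal sketch of the cluster analysis: DBSCAN groups end locations into dense
# clusters and labels outlying responses -1. End positions and settings are hypothetical.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# One row per simulated fault scenario: final (x, y) rover position.
end_xy = rng.normal(loc=5.0, scale=0.2, size=(50, 2))                 # e.g., stops near the fault location
end_xy = np.vstack([end_xy, [[10.0, 10.0], [2.0, 8.0], [9.0, 1.0]]])  # a few outlying end states

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(end_xy)

# Scenarios labeled -1 fit no dense cluster; the rest index distinct failure trajectories.
for lab in np.unique(labels):
    members = end_xy[labels == lab]
    print(f"cluster {lab:>2}: {len(members):>3} scenarios, "
          f"x in [{members[:, 0].min():.2f}, {members[:, 0].max():.2f}]")
```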
Analysis of clusters in each approach
| Cluster | Description | Coverage metric | Identified | Set | Range |
|---|---|---|---|---|---|
| −1 | Unclustered, distributed throughout | # Scenarios | 0 | 3 | 59 |
| | | Coverage loss | 1.00 | 0.78 | 0.00 |
| | | Worst case | 0.00 | 1.45 | 1.41 |
| 0 | Near fault location | # Scenarios | 4 | 18 | 914 |
| | | Coverage loss | 0.13 | 0.05 | 0.00 |
| | | Worst case | 0.76 | 0.98 | 1.05 |
| 1 | At finish | # Scenarios | 1 | 2 | 4 |
| | | Coverage loss | 1.00 | 0.78 | 0.15 |
| | | Worst case | 0.63 | 0.97 | 1.01 |
| 2 | Veered left | # Scenarios | 0 | 0 | 9 |
| | | Coverage loss | 1.00 | 1.00 | 0.00 |
| | | Worst case | 0.00 | 0.00 | 0.72 |
| 3 | Veered right | # Scenarios | 0 | 1 | 4 |
| | | Coverage loss | 1.00 | 1.00 | 0.00 |
| | | Worst case | 0.00 | 0.61 | 0.63 |
4.3.2 Summary.
A summary of the results provided by each approach is presented in Table 2. As shown, the set elaboration approach simulated a much larger number of scenarios than manual mode identification, while the range elaboration approach simulated over an order of magnitude more scenarios than the set elaboration approach. This results in a much larger computational cost. While this is not a significant burden for this model, it would become a burden if more parameters were included or if the simulation itself were more computationally expensive. However, the use of these approaches uncovered substantially more hazardous trajectories, which was reflected in both the number of clusters represented in these approaches and the number of scenarios that did not fit in any cluster. In general, while the set elaboration approach represented an improvement on manual fault identification, many more unclustered scenarios were identified by the range elaboration approach. Of the failure clusters that were elaborated in the range elaboration approach, only a small proportion were identified beforehand, and even the set elaboration approach was not able to represent all clusters. Thus, even though the range elaboration approach requires a much larger number of scenarios—many of which are essentially duplicates—it uncovers a much larger portion of the faulty state-space than would be possible to view otherwise. Even though there are many approximately duplicate scenarios, more information is also uncovered. This is likely an artifact of the fault states explored—while we would expect some fault states to produce relatively consistent results at different levels (e.g., friction should just slow the rover down in different ways), a specific level of drift can cause a great variety of possible results, since it additionally modifies the direction of the rover. As a result, some levels of drift can, for example, make the rover turn around and travel backward, as shown in Fig. 6. However, this type of result is very sensitive to the precise value of the fault state parameter and is thus only detectable by elaborating (or exploring) the full space. This shows how full faulty state-space elaboration can identify more modes that would not be considered otherwise—by uncovering specific values of fault state parameters that change the faulty behavior of the system.
5 Discussion
The synthetic mode generation approach presented here has a number of implications for resilience-informed design. The design of resilience is often approached from the perspective that the designer does not necessarily know the form or distribution of hazards that the system may be subjected to. Thus, to minimize risk, a very broad and generic tolerance and adaptiveness needs to be built into the system to enable it to respond well to all hazards it could encounter. In this sense, from the perspective of resilience, it is very important to uncover new knowledge of the faulty state-space, even if those faults seem unrealistic or improbable. As shown in Sec. 4.3.1, synthetic mode generation can encourage this by elaborating an entire range of fault state combinations for hazard simulation and assessment, rather than the small list that can be directly identified by a designer. This approach can essentially be seen as the translation of resilience simulation and scenario generation approaches—where a large number of externally driven hazardous scenarios are presented to a system—to the traditional engineering design scenario where the hazards to prevent are internal to the system (i.e., component failures). While this method would be infeasible to perform manually due to the large space of modes, performing this sort of analysis in a computational environment is entirely practicable because the evaluation of fault mode consequences is performed by a computationally inexpensive simulation.
However, from a risk-based design perspective, there are assumptions embedded in this approach which are important to consider. Given that a probability distribution has not been specified or assumed for fault state combinations, it may be difficult to understand how to weigh individual worst-case failure trajectories that only arise at particular fault state parameter value combinations: they could be highly improbable because of the narrow range under which they are realized (if the multivariate distribution is uniform), or they could be highly likely (if the probability density over the range where they are realized is high). Thus, to consider the (probability-weighted) risk of these events, an appropriate probability model should be implemented over the faulty state-space and sampled, instead of sampling the space uniformly. However, it should be noted that any such model will be subject to considerable uncertainties when there is limited information (e.g., in the early design process).
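As a sketch of this probability-weighted alternative, fault-state values could be drawn from an assumed probability model rather than elaborated on a uniform grid, as shown below; the marginal distributions and their parameters are purely hypothetical assumptions made for illustration.

```python
# Minimal sketch of probability-weighted sampling over the faulty state-space.
# The marginal distributions and parameters below are hypothetical assumptions.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000

samples = {
    "friction": rng.exponential(scale=1.0, size=n_samples),            # mostly small added resistance
    "transfer": np.clip(rng.normal(1.0, 0.2, n_samples), 0.0, 1.0),    # near-nominal, bounded to [0, 1]
    "drift":    rng.normal(0.0, 0.05, n_samples),                      # small misalignments most likely
}
modes = [dict(zip(samples, vals)) for vals in zip(*samples.values())]

# Each sampled mode would then be simulated; averaging consequence metrics over the
# sample approximates the probability-weighted risk rather than an unweighted count.
print(modes[0])
```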
Section 4 shows that this approach can uncover meaningful information about the faulty state-space, which can help the designer understand how to most comprehensively mitigate potential failure events. For example, when viewing all of the different possible trajectories from the manually identified modes, a designer may set a requirement that the system shut down if it deviates too far from the line to prevent it from crashing. However, this would still leave open the hazardous faulty behavior uncovered by the mode sampling approach in which the rover follows the line backward. A more comprehensive and safer requirement might instead be for the rover to not deviate too far from its intended trajectory (position over time) while staying acceptably close to the line.
A major challenge for applying this method is how best to specify and sample the faulty state-space. First, it may not be clear how to determine the limits of the domain for each fault state, or, more broadly, what constraints limit the domain of possible joint fault state values. This can lead to difficulties—if the domains for the fault states are defined too narrowly, the approach loses its ability to capture faults that are outliers. For example, in the design of a car, the designer may have an imagined range of hazardous operating temperatures that could prove incorrect if the car ended up being operated during an extreme weather event. As a result, they would leave out potential resulting fault modes in this process. However, if the domain is specified too broadly, the analysis may be dominated by fault modes that are impossible to realize (e.g., operating temperatures below 100 K or above 4000 K). As a result, it is important for the designer to appropriately exercise judgement over ranges (as they would when manually identifying modes). This could be approached in practice by choosing broad ranges of parameters and then narrowing them based on their effects in the model (i.e., whether the values are realizable). Second, it may not be clear how best to sample the set of fault modes from the faulty state-space. From Sec. 4, we know that using the range elaboration approach with high resolution increases the amount of risk-related information that can be extracted, but this comes at a large increase in computational cost. When the fault sampling approach elaborates the entire set of fault modes from ranges, designers need to balance computational cost with knowledge gained. While this can be done on the basis of cost–benefit assessment (i.e., quantifying the value of hazard information [64]), it also highlights the need for more efficient sampling methods. For example, in the analysis process, one may choose the resolution of fault states by first performing a sensitivity analysis to identify the influence of each fault state on the outcome (putting lower resolution on lower-impact states) and then successively increasing the resolution until there is minimal gain in hazard-related information.
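The successive-refinement part of this suggestion could be sketched as follows, with a placeholder simulate_mode() standing in for the behavioral model: the elaboration resolution is repeatedly doubled until the number of distinct outcome groups (here, DBSCAN clusters of end states) stops growing, indicating minimal gain in hazard-related information. The domains and placeholder behavior are hypothetical.

```python
# Minimal sketch of successive resolution refinement of the faulty state-space.
# simulate_mode() and the domains are hypothetical placeholders for the user's model.
import itertools
import numpy as np
from sklearn.cluster import DBSCAN

domains = {"friction": (0.0, 20.0), "transfer": (0.0, 1.0), "drift": (-0.5, 0.5)}


def simulate_mode(mode: dict) -> tuple:
    """Placeholder behavioral simulation; returns a final (x, y) rover position."""
    return (mode["transfer"] * 10.0, mode["drift"] * 10.0)


def n_outcome_groups(points_per_state: int) -> int:
    """Elaborate the space at the given resolution and count distinct end-state groups."""
    grids = [np.linspace(lo, hi, points_per_state) for lo, hi in domains.values()]
    end_xy = np.array([simulate_mode(dict(zip(domains, vals)))
                       for vals in itertools.product(*grids)])
    return len(set(DBSCAN(eps=0.5, min_samples=3).fit_predict(end_xy)))


prev, points = 0, 3
while True:
    found = n_outcome_groups(points)
    if found <= prev:            # minimal information gain: stop refining
        break
    prev, points = found, points * 2
print(f"settled on {points // 2} points per fault state ({prev} distinct outcome groups)")
```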
6 Conclusions
This work presented a method to synthetically generate fault modes for resilience analysis and resilience-based design. This approach works by elaborating a faulty state-space—a set of fault modes made up of parameters that affect the behavior of the system. The resulting space is much larger than a list of fault modes that the designer might uncover, but it also has a greater potential to reveal discrete faulty behaviors that can arise in the system. This is shown in the rover example in Sec. 4, where synthetic mode elaboration approaches find a larger quantity of potential hazardous outcomes, revealing previously unknown hazardous behaviors and better identifying worst-case scenarios for each behavior. However, both of these increases in hazard information come at the expense of simulation time, which increases substantially with every increase in resolution—an important consideration when this approach is used throughout a complex or computationally expensive model, or as a part of an iterative design process.
6.1 Limitations and Future Work.
This approach opens a number of potential directions for future work. Because of the computational expense of synthetic mode generation, and the potential for essentially identical duplicate modes to be generated and evaluated, future work should extract the high-level modes uncovered by an approach like this so that they can (1) be tractably understood and accounted for by the designer and (2) be more readily simulated at a lower computational cost. In this work, clustering was used to identify groups of modes that were essentially duplicative of each other. However, given that there is a distribution of modes within each cluster, it may not be clear how to represent the cluster fault states and parameter values as a whole. Future work should thus determine how to best represent these groups for tractable analysis and design.
Reducing computational cost is also an important consideration for optimization. Since sampling modes generates more hazard information, it could lead to more optimal designs if used in an optimization loop. However, if not implemented appropriately, it could also slow the optimization process while not meaningfully affecting the optimal variables. Future work should show how much (and when) value can be gained by elaborating the faulty state-space so that designers can understand when (and at what resolution) to use it in resilience optimization. Finally, to mitigate the trade-off between faulty state-space information and computational cost, future work should develop strategies that efficiently map out the faulty state-space by searching for parameters that lead to different outcomes. Scenario generation approaches such as Monte Carlo sampling, Latin hypercube sampling, Bayesian optimization, and adaptive stress testing [21] should all be explored to find how to most efficiently represent the faulty state-space without wasting computation on essentially duplicate fault modes.
Acknowledgment
This research was partially conducted at NASA Ames Research Center. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The data and information that support the findings of this article are freely available.