Significant efforts have been recently devoted to the qualitative and quantitative evaluation of resilience in engineering systems. Current resilience evaluation methods, however, have mainly focused on business supply chains and civil infrastructure, and need to be extended for application in engineering design. A new resilience metric is proposed in this paper for the design of mechanical systems to bridge this gap, by investigating the effects of recovery activity and system failure paths on system resilience. The defined resilience metric is connected to design through time-dependent system reliability analysis. This connection enables us to design a system for a specific resilience target in the design stage. Since computationally expensive computer simulations are usually used in design, a surrogate modeling method is developed to efficiently perform time-dependent system reliability analysis. Based on the time-dependent system reliability analysis, dominant system failure paths are enumerated and then the system resilience is estimated. The connection between the proposed resilience assessment method and design is explored through sensitivity analysis and component importance measure (CIM). Two numerical examples are used to illustrate the effectiveness of the proposed resilience assessment method.
Resilience refers to the ability of a system to recover to its normal operating condition after the occurrence of disruptive events. Since its first definition in the 1970s, the modeling and definition of resilience have been widely studied in ecology, social science, and economics.
Even though resilience has been intensively studied in the above areas, its development in the engineering field is still in the early stages. Resilience assessment of engineering systems has gained increasing interest in recent years. From the perspective of definition, in 2009, the American Society of Mechanical Engineers (ASME) defined resilience as a system's ability to rapidly recover to full function after disruption; Ouyang and Wang evaluated the annual resilience of a system under multihazard events; Ayyub proposed a resilience metric by considering aging effects and different types of vulnerability and recoverability scenarios; and Reed et al. developed a method to evaluate the resilience of networked infrastructure. Other definitions of resilience metrics have also been proposed, and a detailed review can be found in Ref. . From the perspective of application, Yodo and Wang [10,11] assessed the resilience of an electric motor supply chain using Bayesian networks (BNs); Panteli and Mancarella assessed the resilience of electrical power infrastructure; Baroud et al. evaluated the resilience of an inland waterway network based on the CIM method proposed by Barker et al.; and Spiegler et al. estimated supply chain resilience using a control engineering approach.
The above literature review shows that current studies of resilience in engineering systems have focused on problems related to supply chains [10,11,15], waterway networks, power infrastructure, and civil infrastructure systems. The resulting resilience metrics are difficult to apply in engineering design. To fill the gap between resilience assessment and engineering design, the first quantitative attempt was made by Youn et al. in 2011, who developed a resilience-driven design framework. After that, Mehrpouyan et al. investigated the resilience of complex engineered system design by employing a graph spectral approach in the design of system architecture. The resilience design framework proposed by Youn et al. was designed primarily for prognostics and health management (PHM), which is associated with the detectability of failure events. In the method proposed by Mehrpouyan et al., resilience is affected by the physical connections between components. In addition, Wang and Li [18,19] studied the redundancy allocation of an engineering system by considering failure interactions; this is also related to resilience, since redundancy can increase the reliability and decrease the vulnerability of a system.
Considering that self-healing is usually difficult for traditional mechanical systems, the recovery of mechanical systems is often achieved through repair or replacement. For different components, the recovery probability, ability, and required time are different. In this situation, according to the definition of resilience, a resilient mechanical system should be a system that has low quality loss after recovery and requires a short time to recover. Besides, there are numerous failure paths for a system with multiple components. For different failure paths, the recovery properties are different. Based on these observations, a new resilience metric is proposed in this paper. Since resilience is usually time-dependent and uncertainty is inherent in design, the proposed resilience metric is connected with design through time-dependent system reliability analysis. Time-dependent system reliability computation requires a large number of runs for realistic systems [20,21]. In this paper, a surrogate model-based method is developed to reduce the computational burden. The connection between the proposed resilience assessment and design optimization is investigated through resilience sensitivity analysis and CIM.
The contributions of this paper are thus summarized as: (1) the definition of a new resilience metric, which connects design with resilience assessment; (2) a new time-dependent system reliability analysis method for resilience assessment; (3) a strategy for the efficient evaluation of resilience based on time-dependent system reliability analysis; and (4) investigation of resilient design through sensitivity analysis and CIM.
The remainder of the paper is organized as follows: Section 2 provides background concepts on resilience and time-dependent reliability analysis. Section 3 presents the proposed resilience assessment method. Two numerical examples are used to illustrate the proposed method in Sec. 4. Concluding remarks are provided in Sec. 5.
In this section, we first briefly review two resilience metrics that have a qualitative connection to design. After that, we summarize the concept of time-dependent reliability analysis.
Resilience of an Engineering System.
Figure 1 illustrates a generalized representation of system resilience, which consists of three key elements, namely, reliability, vulnerability, and recoverability.
The reliability element is associated with the probability that the system performs satisfactorily in the presence of disruptive events. It can be time-independent or time-dependent. High reliability implies low probability of performing unsatisfactorily. However, high reliability requires large initial investment. The vulnerability element describes the degraded performance of the system after disruptive events. If a disruption occurs, a system with higher vulnerability will have a more severe failure consequence than a system with lower vulnerability. The recoverability element quantifies how quickly and how well a system can recover to its normal state after disruption. Inspired by the three elements of system resilience, various models and definitions of resilience have been proposed in recent years. Two representative definitions are the resilience metrics proposed by Youn et al.  and Ayyub . (It should also be noted that robustness is an element overlapping between reliability and vulnerability).
where is system failure event, is correct diagnosis, is correct prognosis, is mitigation/recovery event, is the probability of correct diagnosis , is the probability of correct prognosis , and is the probability of correct recovery.
The resilience metric proposed in Eqs. (1) and (2) focuses on the restoration of the system using PHM methods. In order to increase the resilience of a system, the resilience design problem finally becomes a sensor network design problem, which is associated with the probability of correct diagnosis and prognosis. However, the resilience metric given in Eq. (2) does not include the vulnerability element in Fig. 1. For example, for two systems with identical reliability and , it is apparent that the system with lower vulnerability has a higher resilience. But Eq. (2) fails to represent this situation.
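The limitation noted above can be made concrete with a small sketch. Because Eqs. (1) and (2) are not reproduced in this text, the code assumes the commonly cited form of the metric of Youn et al., in which resilience is reliability plus a restoration term equal to the failure probability multiplied by the probabilities of correct diagnosis, correct prognosis, and correct recovery. The function name and the numerical inputs are hypothetical.

```python
def youn_resilience(reliability, p_diagnosis, p_prognosis, p_recovery):
    """Resilience = reliability + restoration (assumed product form).

    restoration = P(failure) * P(correct diagnosis)
                  * P(correct prognosis) * P(correct recovery)
    """
    for p in (reliability, p_diagnosis, p_prognosis, p_recovery):
        if not 0.0 <= p <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    restoration = (1.0 - reliability) * p_diagnosis * p_prognosis * p_recovery
    return reliability + restoration


# Hypothetical numbers, for illustration only.
psi = youn_resilience(0.9, 0.95, 0.9, 0.8)
```

Note that the function has no argument describing how much performance remains after failure, so two systems with the same reliability and the same PHM probabilities receive the same value regardless of their vulnerability.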
where is the time instant of failure initialization, is the time to failure, is the time to recovery, is the failure profile, is the recovery profile, is the duration of failure, and is the duration of recovery. measures the robustness and redundancy, and measures the resourcefulness and rapidity. As shown in Fig. 2, three failure events and six recovery events have been considered in Ayyub's resilience metric .
All three elements of the original resilience definition in Fig. 1 are included in the metric defined in Eq. (3). A review of alternative definitions and metrics of system resilience is available in Ref. . The literature review shows that most current definitions of resilience metrics have not been explicitly connected to engineering design. The purpose of this paper is to develop a resilience metric that can be quantitatively connected to time-dependent reliability analysis and design optimization. In Secs. 2.2 and 3, we first briefly introduce the concept of time-dependent reliability analysis and then propose a new resilience metric to connect engineering design with resilience.
Time-Dependent Reliability Analysis.
where is probability, “∀” means “for all”, and is the time duration of interest. The corresponding time-dependent failure probability is given by .
Time-dependent reliability analysis has been intensively studied during the past years . The efforts in time-dependent reliability analysis have led to a group of time-dependent reliability analysis methods, such as the upcrossing rate methods , surrogate model-based methods [27,28], sampling-based approaches , and composite limit-state function methods . Next, we develop the proposed approach to perform resilience assessment based on time-dependent reliability analysis.
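As a point of reference for the methods listed above, the time-dependent failure probability can be estimated by brute-force Monte Carlo simulation on a discretized time grid. The sketch below is only illustrative: the sign convention (failure when the limit state is negative), the degrading-margin limit state, and all numerical values are assumptions, not taken from this paper.

```python
import numpy as np


def time_dependent_failure_probability(g, x_samples, t_grid):
    """Estimate the failure probability over [0, t] for every t in t_grid.

    g(x, t) must broadcast x of shape (n, 1) against t of shape (m,) and
    return an (n, m) array. g < 0 is taken to mean failure (an assumption).
    """
    g_values = g(x_samples[:, None], t_grid)            # (n, m) trajectories
    # A trajectory has failed by time t if it was negative at any earlier step.
    failed_by_t = np.logical_or.accumulate(g_values < 0.0, axis=1)
    return failed_by_t.mean(axis=0)                     # fraction failed up to each t


rng = np.random.default_rng(0)
x = rng.normal(10.0, 1.0, size=20_000)                  # hypothetical initial strength
t = np.linspace(0.0, 20.0, 41)                          # time grid in years
g = lambda x, t: x * np.exp(-0.02 * t) - 6.0            # hypothetical degrading margin
pf = time_dependent_failure_probability(g, x, t)
reliability = 1.0 - pf                                  # time-dependent reliability
```

The cumulative "or" along the time axis is what distinguishes this from a point-in-time failure probability: once a trajectory has crossed the limit state, it counts as failed for every later time.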
Resilience Assessment Based on Time-Dependent System Reliability Analysis
In this section, we first propose a new resilience metric for an engineering system. After that, we discuss in detail how to evaluate the resilience based on this metric.
New Definition of Resilience Metric.
Considering the fact that Youn's resilience metric  can effectively represent resilience in terms of probability, we propose a new resilience metric by extending Youn's resilience metric  to incorporate vulnerability and the effect of uncertainty in recoverability.
We begin by examining the resilience of a specific system without considering uncertainty. For a specific system, as shown in Fig. 3, consider a certain quantity of interest (QoI). The QoI can be system performance, the economic value of the system, or another quantity. Suppose the QoI decreases over time from its original state , and at a certain time instant , the QoI suddenly decreases from to (QoI after failure) due to disturbance or failure of the system. The quality loss due to the disturbance is . After the disturbance, recovery begins. Recovery has three elements: (1) whether the system function can be recovered, (2) how much of it can be recovered, and (3) how long recovery takes. If the system can be recovered, the recovery activity is performed immediately, and the system recovers to without taking any time (immediate recovery), then the recovered QoI is . If it takes some time for the system to recover (normal recovery) and the system is recovered at time instant , then the recovered QoI is , where is the average quality loss during recovery, used to account for the effort required for recovery.
where means the system function cannot be recovered, indicates the system function can be recovered, is the QoI at time instant t,,, , and are the remaining performance ratio at before disturbance, recovery ratio, the remaining performance ratio after disturbance, and average performance loss ratio per unit time during recovery process, respectively. corresponds to the situation of immediate recovery. The three elements of recovery are represented as , , and in the above equation.
where is the probability of recovery given that the component is failed, which is a probabilistic form of defined in Eq. (5), is the resilience given that the component is failed and can be recovered, is the expected resilience by considering the uncertainty in and , and is the joint probability density function (PDF) of and . The distribution of can be obtained using the time-dependent system reliability analysis method presented in Sec. 3.2.
in which is the time-dependent reliability of the component.
in which is the expected recovery time.
Since and , we have and . , and may be affected by the redundancy of a system since redundancy will reduce the QoI losses due to failure and during recovery. In this paper, the resilience metric is proposed without considering the effect of redundancy. Redundancy can be considered in the proposed resilience metric by studying its effect on , , and in the future. Analysis of Eq. (9) shows that: (i) when the component is completely reliable (), the resilience is also unity; (ii) when the reliability is zero (), is governed by the recovery probability (), recovery time (), recovery ratio (), and vulnerability, which is represented as the remaining performance ratio after failure (); (iii) when the recovery ratio is unity () and reliability is zero, is mainly affected by the recovery time ().
in which , , and are the recovery ratio to the system initial performance , remaining ratio to , and the expected required recovery time of the ith failure path.
in which is the recovery probability of the kth component, is the vector of failed component indices of the ith system failure path, is the recovery ratio to of the kth component, is the expected required recovery time of the kth component, and is the quality remaining ratio to of the kth component. Note that , , and are used as constants for a component k in this paper for the sake of illustration. They can also be treated as random. and the quality recovery ratios and times can be obtained from the failure modes and effects analysis (FMEA) for the system. Besides, can be expressed as to connect the proposed resilience metric with the metric given in Eq. (2).
Equation (17) indicates that four main elements are required to evaluate the resilience of a system. The four elements are (i) reliability of the system, (ii) probability of having different system failure paths, (iii) probability of recovery of different system failure paths from failure, and (iv) QoI loss due to different system failure paths.
It can be seen from Eq. (10) that the proposed resilience metric has a form similar to Youn's resilience metric as presented in Eq. (2). However, there are three main differences between Eqs. (2) and (17): (i) the resilience is expressed as a time-dependent function in Eq. (17), whereas Eq. (2) is time-independent; (ii) the term given in Eq. (2) is combined into one term in Eq. (17) and is expanded into by investigating the effects of different mutually exclusive system failure paths; and (iii) Eq. (17) has an extra term to include the vulnerability element within resilience assessment. In addition, investigating the effects of different failure paths on the probability of recovery also incorporates vulnerability into resilience evaluation.
The resilience defined in Eq. (17) is bounded in the interval [0,1], with 1 indicating high resilience of the system and 0 indicating low resilience. Before applying the proposed new resilience metric to engineering design, there are three main challenges that need to be solved.
Computationally expensive simulation models are usually used to predict the system response. Since time-dependent system reliability analysis is required in Eq. (17), how to efficiently estimate and , over is the first challenge.
A system with multiple components may have many mutually exclusive failure paths, which are required in the proposed resilience assessment. How to efficiently enumerate these mutually exclusive system failure paths is the second challenge.
Given the resilience metric defined in Eq. (17), how to efficiently perform resilience assessment and how to connect the resilience analysis with design is the third challenge.
In this paper, a new time-dependent system reliability analysis method is developed to address the first challenge. Based on the system reliability analysis method, the second and third challenges are solved as well.
Time-Dependent System Reliability Analysis.
Time-dependent system reliability analysis provides and , , required in Eq. (17). During the past decades, only a few methods have been reported for time-dependent system reliability analysis [20,31,32]. Most of the reported system reliability analysis methods rely on the first-order reliability method (FORM). In this paper, to remove the limitation of FORM and yet be computationally efficient, a recently developed single-loop Kriging (SILK) surrogate modeling method is employed and extended for time-dependent system reliability analysis . Note that failure sequences and brittle failure events  are important issues for time-dependent system reliability analysis. In the case of ductile failures, the overall system limit state is not affected by the sequence of component failures . However, in the case of brittle failures, the failure of a component changes the limit state functions of the other components; as a result, the overall system limit state is dependent on the failure sequence . Consider a two-bar system with brittle failures as shown in Fig. 4. Two failure sequences are possible (as in Fig. 4(c)), and the corresponding reliability block diagram (RBD) is shown in Fig. 4(d) . The time-dependent system reliability method discussed in this paper is applicable when the sequences are identified and the RBD is available. In large systems with multiple components, dominant failure sequences may need to be identified using a branch-and-bound technique  or adaptive sampling .
where “∪” is “union,” “∩” is “intersection,” and is the limit-state function of the ith component.
In the context of surrogate model-based reliability analysis, methods have been proposed to construct a single extreme value surrogate model for system reliability analysis [37,38]. The extreme value surrogate model may be highly nonlinear. In this situation, building surrogate models for individual limit state functions is a promising way. In this paper, we therefore build a surrogate model for each individual limit state function and the SILK method is employed for the surrogate modeling. The original SILK method only focused on the estimation of , which is point estimation. For different time intervals, surrogate models need to be constructed repeatedly to obtain the failure probability up to . In this section, we first briefly review the SILK method. Based on that, we modify the original SILK method to efficiently estimate and , . By doing so, we can evaluate the resilience up to with just one surrogate model.
A Brief Review of SILK.
where stands for normal distribution, and are mean and variance of the prediction, which are obtained from Kriging surrogate model , and is the ith trajectory of at time instant .
A detailed description of SILK is available in Ref. .
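To illustrate why the normal prediction described above matters, the sketch below computes the probability that the sign of a Gaussian prediction is wrong, given its mean and standard deviation. This is a generic property of a normal distribution and is not taken from the SILK reference; the function name is hypothetical, and whether SILK uses exactly this quantity to select training points is not stated in this text.

```python
import math


def sign_misclassification_probability(mean, std):
    """P(true response has the opposite sign of the predicted mean),
    assuming the prediction is Gaussian with the given mean and std."""
    if std <= 0.0:
        return 0.0                                  # no predictive uncertainty
    u = abs(mean) / std                             # distance to zero in std units
    return 0.5 * math.erfc(u / math.sqrt(2.0))      # standard normal CDF at -u


p_near = sign_misclassification_probability(0.1, 1.0)   # close to the limit state
p_far = sign_misclassification_probability(2.0, 1.0)    # two std away from it
```

Points whose prediction lies close to zero relative to its standard deviation are the ones where the failure indicator is least trustworthy, which is why surrogate-based reliability methods typically concentrate new simulations there.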
Time-Dependent System Reliability Analysis Based on SILK.
As discussed above, the original SILK method only focuses on estimating instead of . In order to accurately estimate , the first-passage boundary needs to be accurately modeled in the surrogate model . It also implies that for every trajectory of the response function, the sign of points close to the first-passage point as shown in Fig. 5 needs to be accurately classified.
where is the system failure indicator for the kth realization of the system with indicating failure and indicating success, is the failure indicator of the ith component over time interval , and indicates failure and indicates success.
For a combined series and parallel system, the system Boolean function is defined according to the system topology based on Eqs. (33) and (34). For instance, for the kth realization of a combined system as shown in Fig. 6, the Boolean function is defined as
where , if and .
By implementing a similar procedure, we can also estimate , . However, there is a challenge that the number of failure scenarios (i.e., ) will increase exponentially with the number of components. This makes it almost impossible to get all , . In Sec. 3.3, we will discuss how to perform resilience assessment by overcoming this challenge.
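The mapping from component failure indicators to a system failure indicator can be sketched as follows. The code assumes that each random realization has already been reduced to the first failure time of every component (for example, from surrogate-model trajectories); the exponential failure times and the combined topology (component 0 in series with a parallel pair of components 1 and 2) are hypothetical and are not the system of Fig. 6.

```python
import numpy as np


def system_failure_probability(first_failure_time, t_grid, system_fails):
    """Estimate the system failure probability over [0, t] for each t.

    first_failure_time: (n_samples, n_components), np.inf if no failure.
    system_fails: Boolean function mapping an (n, n_components) bool array
                  of component indicators to an (n,) bool array.
    """
    pf = np.empty(len(t_grid))
    for j, t in enumerate(t_grid):
        comp_failed = first_failure_time <= t       # component indicators over [0, t]
        pf[j] = system_fails(comp_failed).mean()
    return pf


series = lambda c: c.any(axis=1)                    # fails if any component fails
parallel = lambda c: c.all(axis=1)                  # fails only if all components fail
combined = lambda c: c[:, 0] | (c[:, 1] & c[:, 2])  # hypothetical mixed topology

rng = np.random.default_rng(1)
tf = rng.exponential(scale=[50.0, 30.0, 30.0], size=(10_000, 3))  # hypothetical data
t_grid = np.linspace(0.0, 20.0, 21)
pf_series = system_failure_probability(tf, t_grid, series)
pf_combined = system_failure_probability(tf, t_grid, combined)
pf_parallel = system_failure_probability(tf, t_grid, parallel)
```

Because the same set of realizations is reused for every time instant, the whole failure probability curve comes from a single pass over the samples, and only the Boolean function changes with the system topology.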
According to the resilience metric defined in Eq. (17), the first step of resilience assessment is to identify all the mutually exclusive system failure paths. One way to do this is to use the binary decision diagram (BDD)-based method presented in Ref. . From the BDD, the mutually exclusive failure paths can be identified efficiently. However, for some systems, there are still a large number of possible failure paths. In the proposed resilience metric, all the failure paths need to be identified. This is not practical for a system with a large number of components even if the BDD-based method is employed.
where is the number of failed system realizations through the ith system failure path, and is the failure indicator function of the ith failure path at the jth random realization.
in which is the vector of failed component indices and is the number of failed components in the jth failed random realization. Note that failure indicators and failed indices are discussed at the component level in this paper; if a component has multiple failure modes, the failure indicators of its failure modes need to be converted into a single component failure indicator.
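The sampling-based identification of dominant failure paths described above can be sketched as a counting exercise: each failed realization is labeled by the set of components that failed in it, and the relative frequency of each label estimates the probability of that failure path. The function name, the independent component failure probabilities, and the system topology below are hypothetical; component indices are 0-based in the code.

```python
from collections import Counter

import numpy as np


def failure_path_probabilities(comp_failed, system_failed):
    """Estimate probabilities of mutually exclusive system failure paths.

    comp_failed: (n, n_components) bool component failure indicators.
    system_failed: (n,) bool system failure indicators.
    Returns {tuple of failed component indices: probability}, most likely first.
    """
    n = len(system_failed)
    counts = Counter(
        tuple(np.flatnonzero(row).tolist()) for row in comp_failed[system_failed]
    )
    return {path: count / n for path, count in counts.most_common()}


rng = np.random.default_rng(2)
comp_failed = rng.random((5_000, 3)) < np.array([0.05, 0.2, 0.2])   # hypothetical
system_failed = comp_failed[:, 0] | (comp_failed[:, 1] & comp_failed[:, 2])
paths = failure_path_probabilities(comp_failed, system_failed)
```

Only failure paths that actually occur in the samples are returned, so the number of paths to process is bounded by the number of failed realizations rather than growing exponentially with the number of components. Rare paths that never appear are treated as having negligible probability.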
Resilience Sensitivity Analysis and CIM.
In this section, the relationship between design variables and the proposed resilience metric is investigated through resilience sensitivity analysis and CIM.
Resilience Sensitivity Analysis.
where is a realization of random variables , is the system failure indicator over , is the joint PDF of under given , is the failure indicator of the ith system failure path over , and and are failure domains of the system and the ith system failure path, respectively.
in which and are the mean and standard deviation of , , are independent random variables, and are the eigenvalues and eigenvectors of the covariance function of , and is the number of eigenvectors used to represent the stochastic process.
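The eigen-expansion of the stochastic process described above can be sketched numerically by discretizing time and decomposing the resulting covariance matrix. The sketch assumes a stationary Gaussian process with constant mean and standard deviation; the squared-exponential correlation function, its length scale, and the numerical values are hypothetical and are not taken from this paper.

```python
import numpy as np


def kl_sample_paths(t_grid, mean, std, corr, n_terms, n_paths, rng):
    """Sample a Gaussian process on t_grid using a truncated
    eigen-expansion of its discretized covariance matrix."""
    lag = t_grid[:, None] - t_grid[None, :]
    cov = std**2 * corr(lag)                        # (m, m) covariance matrix
    eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_terms]      # keep the n_terms largest
    lam = np.clip(eigval[order], 0.0, None)         # guard against tiny negatives
    phi = eigvec[:, order]                          # (m, n_terms) eigenvectors
    xi = rng.standard_normal((n_paths, n_terms))    # independent N(0, 1) variables
    return mean + (xi * np.sqrt(lam)) @ phi.T       # (n_paths, m) sample paths


corr = lambda tau: np.exp(-(tau / 5.0) ** 2)        # hypothetical correlation function
t = np.linspace(0.0, 20.0, 41)
paths = kl_sample_paths(t, 100.0, 10.0, corr, 8, 20_000, np.random.default_rng(3))
```

Truncating the expansion replaces the stochastic process with a small number of independent standard normal variables, which is what allows the process to be handled alongside the ordinary random variables in the sensitivity derivation.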
in which is the standard deviation of the normal random variable.
where is given in Eq. (40).
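The sensitivity expressions above have the structure of a score-function estimator, in which the derivative of a failure probability with respect to a distribution parameter is obtained by reusing the same samples and failure indicators. The sketch below shows this for the mean of a single normal random variable; the limit state (failure when the variable falls below 8) and all numbers are hypothetical, and the function name is not from this paper.

```python
import numpy as np


def pf_sensitivity_to_normal_mean(x, failed, mu, sigma):
    """Score-function estimate of d p_f / d mu for X ~ N(mu, sigma^2).

    x: (n,) samples of X; failed: (n,) bool failure indicators
    evaluated on the same samples.
    """
    score = (x - mu) / sigma**2                     # d ln f(x) / d mu for a normal PDF
    return np.mean(failed * score)


rng = np.random.default_rng(4)
mu, sigma = 10.0, 1.0
x = rng.normal(mu, sigma, size=200_000)
failed = x < 8.0                                    # hypothetical failure condition
dpf_dmu = pf_sensitivity_to_normal_mean(x, failed, mu, sigma)
```

No additional limit-state evaluations are needed beyond those already used for the reliability analysis. Replacing the system failure indicator with the indicator of an individual failure path gives the corresponding sensitivity of that path's probability in the same way.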
where is the resilience difference given that component i is safe and is the resilience difference given that the recoverability of component i is one.
Based on the resilience CIM, the importance of each component to the resilience of the system can be analyzed. In design for resilience, we could allocate different resilience levels to different components based on the CIM .
In this section, a roller clutch without brittle failure events and a cantilever beam-bar system with brittle failure events are used to demonstrate the proposed resilience assessment method.
A Roller Clutch.
An automotive roller clutch as shown in Fig. 7 is adopted from Ref.  as our first example. For proper operation of the clutch, three performance functions, namely, contact angle, torque capacity, and hoop stress, need to be verified during the clutch design. A proper contact angle ensures that the clutch will not be scraped. A requirement of torque avoids the situation that the clutch is locked. The hoop stress requirement guarantees the fatigue life of the cage . The clutch will fail if any of the three requirements cannot be satisfied.
in which , , , and .
In the above time-dependent failure probability expressions, and are related to the contact angle, is related to the torque capacity, and is related to the cage stress. Table 2 gives the random variables of the roller clutch example. The QoI of the clutch is torque. There are three types of components: roller ( and ), hub (), and cage (). The average quality loss rate () during recovery is assumed to be . Table 3 gives the assumed data of the three types of components for the resilience assessment of the roller clutch.
Following the procedure given in Table 1, we first construct surrogate models for the limit-state functions given in Eqs. (56)–(59) using the modified SILK method. Table 4 gives the number of function evaluations (NOF) required for each limit-state function.
Figure 8 plots the comparison of time-dependent system failure probability obtained from the modified SILK with Kriging surrogate model and MCS. It shows that the modified SILK method can accurately estimate the time-dependent system failure probability. We then perform resilience assessment for the roller clutch. Figure 9 gives the resilience of the roller clutch over 20 years. Along with the resilience curve, we also plot two realizations of the system performance curves with failure events. In each individual realization, the system performance is recovered to a particular value after failure due to the recovery activity. Comparing Figs. 8 and 9, it can be found that considering the recovery activity has increased the resilience of the system.
We also perform resilience sensitivity analysis for the mean values of , , , and and resilience CIM using the method presented in Sec. 3.4. Figures 10 and 11 plot the results of resilience sensitivity analysis and CIM over different time intervals. The results show that the resilience is most sensitive to the mean of . As the time duration increases, the sensitivities of , , and approach each other. The results of CIM analysis indicate that component 1 (roller) is the most important for the clutch resilience.
A Cantilever Beam-Bar System.
A cantilever beam-bar system as shown in Fig. 12 is modified from Refs. [36,40] as our second example. There are three components in the system including (1) bar, (2) beam, and (3) joint at the fixed point. The RBD which defines the failure of the system is also given in Fig. 12. There are brittle failure events in this example. The failure of component 3 (i.e., joint at the fixed point) will change the limit state function of components 1 and 2. Meanwhile, the failure of the bar will trigger the change in limit state function of component 3.
Table 5 gives the random variables and stochastic load process of the cantilever beam-bar system. The QoI of this example is the cost of the system. Table 6 gives the assumed recovery data of the three components for the resilience assessment of the cantilever beam-bar system. The average quality loss rate () during recovery is . In this example, the load is modeled as a stationary Gaussian stochastic process and the correlation of the stochastic process is given by
Equations (61)–(65) show that each component has two-stage failure paths. Based on the relationship between the trigger events and the resulting failure modes, the RBD shown in Fig. 12 is modified as Fig. 13, which is the same as that presented in Refs. [36,40].
Based on the modified RBD, we perform time-dependent system reliability analysis and resilience assessment for the system. Figure 14 gives the resilience of the system over twenty years. Figure 15 presents the CIM analysis results.
The results show that the resilience decreases with time and that components 2 (beam) and 3 (joint) are more important than component 1 (bar) for the system resilience.
A new resilience metric is proposed in this paper in order to connect resilience assessment to engineering design, by investigating the effects of failure, recovery, and the system failure paths on system resilience. The proposed resilience metric is expressed as a function of time-dependent system failure paths, reliability, and recovery probability. This builds a bridge between design and the resilience metric. A new time-dependent system reliability analysis method is presented to efficiently evaluate system resilience based on the proposed resilience metric. Resilience sensitivity analysis and CIM are also discussed based on the proposed metric to study the connection between resilience and design. Two numerical examples illustrate the effectiveness of the proposed method.
In the proposed resilience metric, the recovery probability of a component is assumed to be constant. In reality, the recovery probability may be random as well. How to integrate the health monitoring system into the proposed resilience metric needs to be investigated in the future. Other future needs include considering redundancy  among components in the system resilience assessment, accounting for the interdependency between different components and multiple failure sequences, considering different types of recovery scenarios, and learning the interdependence between components using BNs.
The research reported in this paper was supported by the Air Force Office of Scientific Research (Grant No. FA9550-15-1-0018, Technical Monitor: Dr. David Stargel). The support is gratefully acknowledged.