## Abstract

Complex systems—such as gas turbines, industrial plants, and infrastructure networks—are of paramount importance to modern societies. However, these systems are subject to various threats. Novel research does not only focus on monitoring and improving the robustness and reliability of systems but also focus on their recovery from adverse events. The concept of resilience encompasses these developments. Appropriate quantitative measures of resilience can support decision-makers seeking to improve or to design complex systems. In this paper, we develop comprehensive and widely adaptable instruments for resilience-based decision-making. Integrating an appropriate resilience metric together with a suitable systemic risk measure, we design numerically efficient tools aiding decision-makers in balancing different resilience-enhancing investments. The approach allows for a direct comparison between failure prevention arrangements and recovery improvement procedures, leading to optimal tradeoffs with respect to the resilience of a system. In addition, the method is capable of dealing with the monetary aspects involved in the decision-making process. Finally, a grid search algorithm for systemic risk measures significantly reduces the computational effort. In order to demonstrate its wide applicability, the suggested decision-making procedure is applied to a functional model of a multistage axial compressor, and to the U-Bahn and S-Bahn system of Germany's capital Berlin.

## 1 Introduction

Modern societies rely on the operations of various complex systems, such as gas turbines, industrial plants, or infrastructure networks. These form complex capital goods whose construction, improvement, and regeneration are of paramount importance. However, these systems are subject to various threats. Evidence shows that a wide range of natural, technical, and anthropogenic impacts at all scales can severely affect the functionality of these systems. Due to their high and increasing complexity, it is infeasible to identify all potential adverse impacts and to prevent them accordingly. Novel developments are therefore important that do not only focus on monitoring and improving the robustness and reliability of systems but also focus on their recovery from adverse events [1]. The concept of resilience encompasses these developments: analyzing and optimizing robustness, reliability, and recovery of systems—both from a technical and from an economic perspective [2–4]. Resilience applied to the artificial systems of our modern society leads to a paradigm shift. Secure systems can not only be based on strategies that prevent failures but must include strategies for the efficient recovery in cases of failure.

The concept of resilience in the context of engineering applications has gained growing popularity in recent years [5,6]. The term “resilience” appears in several different domains like ecology, economy, psychology as well as in the context of mechanical and infrastructure systems and is derived from the Latin word “resilire” which means “to bounce back”. The concept of resilience first appeared in the domain of ecological systems by Holling [7]. He defined resilience as “[…] a measure of the persistence of systems and their ability to absorb change and disturbance and still maintain the same relationships between populations or state variables.” Although many different definitions of resilience were introduced in the context of engineering and complex systems (e.g., see Refs. [8–12]), the early definition from Holling [7] captures key aspects of all of them.

Ayyub [13] provides a review of the literature and develops a comprehensive definition of resilience in the context of complex systems which is based on the content of the Presidential Policy Directive (PPD) on critical infrastructure security and resilience [14]: “Resilience notionally means the ability to prepare for and adapt to changing conditions and withstand and recover rapidly from disruptions. Resilience includes the ability to withstand and recover from disturbances of the deliberate attack types, accidents, or naturally occurring threats or incidents. The resilience of a system's function can be measured based on the persistence of a corresponding functional performance under uncertainty in the face of disturbances.” This novel definition embraces the former definitions and provides a solid basis for the quantification of resilience.

Our paper suggests a novel quantitative approach to resilience enabling decision-makers to efficiently design and improve complex systems present all over our modern communities [1,15]. Resources are not unlimited, and resiliences cannot arbitrarily be improved in reality; realistic models must reflect constraints and methods must be developed that support decision-makers in choosing between different resilience-enhancing investments [4,16].

This paper provides an efficient method for identifying the cost-effective allocations of different resilience-enhancing investments by combining the resilience metric of Ouyang et al. [17] and the systemic risk measure of Feinstein et al. [18]. A grid search algorithm for systemic risk measures significantly reduces the computational effort. In order to demonstrate its wide applicability, the suggested decision-making procedure is applied to a functional model of a multistage axial compressor, and to the *U-Bahn* and *S-Bahn* system of Germany's capital Berlin.

The paper is structured as follows: Section 2 describes the theoretical foundations: the quantification of resilience, the systemic risk measure and its adaptation to technical systems, and the grid search algorithm. Section 3 develops on this basis a novel resilience-based decision-making process. In Secs. 4 and 5, the methods are, first, applied to a functional model of an axial compressor and, second, to Berlin's suburban train (S-Bahn) and subway (U-Bahn) network. Section 6 summarizes the results and discusses questions for future research.

## 2 Theoretical Fundamentals

### 2.1 Resilience Quantification.

Applications of resilience to engineering problems rely on the availability of quantitative measures of resilience. Within the last two decades, various methodologies have been developed. Comprehensive discussions of different resilience metrics are provided by Bergström et al. [5], Hosseini et al. [15], and Linkov and Palma-Oliveira [19]. In addition, Hosseini et al. [15] propose a specific classification system for these metrics. Most resilience metrics are performance based, and the majority of performance-based measures of resilience are assigned to the category of “generic resilience metrics.” These determine resilience by comparing the performance of a system before and after a disruptive event. Further subcategories are constructed by distinguishing between time-dependent or time-independent and deterministic or probabilistic metrics, respectively.

Performance-based approaches may be ratio based, integral based, or both. When a system is exposed to a disruptive event and recovers its functionality afterward, it passes through three essential phases: (i) the original stable state whose duration can be interpreted as the reliability of the system forms the first phase. (ii) The second phase is the vulnerability of the system, represented by a loss of performance after the occurrence of a disruptive event; the robustness of the system mitigates the loss of performance. (iii) The disrupted state of the system and its recovery to a new stable state represent the recoverability and the last phase. The three phases are illustrated in Fig. 1, with *Q*(*t*) denoting the system performance at time *t*. The new stable state may differ from the original state, e.g., in terms of its performance which may be higher or lower.

The majority of resilience metrics in the literature is based on system performance, i.e., on these three states and their transitions. A quantitative measure of resilience thus depends on the specific choice and definition of system performance [3].

*t*

_{0}is the time of occurrence of a disruptive event,

*t*

_{1}the time of complete recovery, and

*Q*(

*t*) the quality of the community infrastructure at time

*t*, a specific type of system performance, their metrics can be expressed in the following form:

For systems with random performance, this metric defines a pathwise measure of resilience. Bruneau et al. [21] also introduce the well-known principle of a “resilience triangle” as illustrated in Fig. 2. Their approach was applied in various contexts and forms a strong basis for several, later proposed metrics [22–24]. Further resilience metrics in the context of deterministic models were, e.g., suggested by Refs. [25–29].

Pathwise metrics do not rely on probabilities and do not capture quantities that depend on probabilities—such as the rates of occurrences of disruptive events and the distributions or moments of the random size of disruptions or the random times of their recovery. Such quantities require the existence and knowledge of a probability measure on the scenario space together with probabilistic resilience metrics, e.g., see Refs. [12] and [30–33]. Very informative resilience metrics were introduced by Ouyang et al. [17] and Ayyub [13]; both metrics are probabilistic, time dependent, and universally applicable.

*Res*, it is defined as the expectation of the ratio of the integral of the system performance

*Q*(

*t*) over a time interval [0,

*T*] and the integral of the target system performance $TQ(t)$ during the same time interval:

System performance *Q*(*t*) is a stochastic process. The parameter $TQ(t)$ is generally considered as a stochastic process as well, but for simplicity it is assumed to be a nonrandom constant $TQ$ in this work.

*T*is $N(T)=sup{n:tn\u2264T}$. The impact area AIA

*is the expected area between the reduced system performance curve and the target system performance curve caused by the*

_{n}*n*th failure within the considered time interval. Under the assumption that the system fully recovers before its next failure, one obtains that $AIAn=E(\u222btntn+1[TQ\u2212Q(t)]dt)$. In this case, Eq. (2) can be written as

The resilience metric takes values between 0 and 1. The value *Res* = 1 indicates a system performance corresponding to the target performance, while *Res* = 0 captures that the system is not working during the considered time period.

### 2.2 Systemic Risk Measure.

Feinstein et al. [18] propose a novel approach to measure risk inherent in complex systems. Their methodology is based on two key components: first, a suitable descriptive input–output model; and, second, an acceptance criterion representing the normative safety standards of a regulatory authority. These systemic risk measures were, e.g., considered in finance, see Weber and Weske [34], and applied to power transmission, see Cassidy et al. [35].

Let $(\Omega ,F,P)$ be a probability space, $l\u2208\mathbb{N}$ the number of entities in the considered system, and $k\u2208\mathbb{R}l$ a vector of controls. For each scenario $\omega \u2208\Omega $ and a control vector *k*, we denote by $Yk(\omega )$ the relevant stochastic outcome of the system; for each $k\u2208\mathbb{R}l$, *Y _{k}* is a random variable.

In the context of financial systems, the vector *k* is the “endowment” and describes the capital allocation to the entities of the system. The underlying input–output model of the system is given by $Y=(Yk)k\u2208\mathbb{R}l$, a nondecreasing random field taking values in some vector space $X$ of random variables. The monotonicity property encodes that a larger capital allocation, $ki\u2264mi\u2009\u2200\u2009\u2009i=1,\u2026,l$, increases the random outcome, i.e., $Yk\u2264Ym$. The acceptance criterion is described by the set $A\u2286X$ of random variables meeting the requirements of a decision-maker; for a survey on acceptance sets and monetary risk measures, we refer to Föllmer and Weber [36].

### 2.3 Adapted Systemic Risk Measure.

The systemic risk measure introduced in Sec. 2.2 can be applied to engineering systems with components of multiple types with several endowment properties. We consider technical systems for which a meaningful system performance *Q*(*t*) can be determined. It is assumed that the system consists of *l* system components each characterized by their type and *n* properties that influence the system performance. For convenience, we replace the vector notation of Sec. 2.2 by matrix notation.

*n*relevant properties and $ji\u2208{1,2,\u2026,b}\u2286\mathbb{N}$ is its type. Once all components are specified, the system is described by a pair consisting of a matrix $A\u2208\mathbb{R}(l\xd7n)$ and a column vector $z\u2208\mathbb{N}l$ that captures the types of the components:

The input–output model $Y=(Y(A;z))$ is enumerated by these pairs. In our case studies, we will typically assume that vector *z* of types is fixed and investigate the impact of a varying matrix *A*.

which is the set of all allocations of modifications of the system properties *A* such that the altered system characterized by (*K *+* A*; *z*) possesses a resilience greater than or equal to *α*. In order to keep the notation simple and without loss of generality, we set *K *=* *0, and *R*(*Y*;0) is written as *R*(*Y*).

For practical applications, it is often necessary to impose restrictions on the structure of the matrix in Eq. (6). For example, it might be required that any component of a specific type is configured in the same way, meaning that the corresponding row vectors *a _{i}* must be equal. As described in Ref. [18], such constraints can be captured by monotonously increasing functions $gz:\mathbb{R}p\u2192\mathbb{R}(l\xd7n),\u2009a\u2032\u21a6(A;z)$ where $z\u2208\mathbb{R}l$ denotes the types of the components; these functions map a lower-dimensional set of parameters $a\u2032\u2208\mathbb{R}p$ to the description of the system.

*l*=

*5 components of*

*b*=

*2 types. Each component is characterized by its two endowment properties and its type, i.e., $(\eta i1,\eta i2;ji)$, and we assume that*

*η*

_{i}_{2}is a function of the type

*j*of the component

_{i}*i*. More specifically, we suppose that

*η*

_{i}_{2}= 3 for type 1 and

*η*

_{i}_{2}= 5 for type 2. We choose

*p*=

*5 and consider as an example the types $z=(1,1,1,2,2)\u22a4$. This leads to the following characterization of the system:*

In this example, the constraint reduces the dimension from 10 = 5 × 2 to 5.

The dimension of the space of parameters can be further reduced, if more constraints are introduced. Consider, e.g., the additional condition that the first endowment property *η _{i}*

_{1}is a function of the type of the components, but that it can otherwise freely be chosen. This implies for the given types $z=(1,1,1,2,2)\u22a4$ that $q1=q2=q3$ and

*q*

_{4}=

*q*

_{5}. In this case,

*p*=

*2 becomes the appropriate dimension of the parameter space.*

### 2.4 Grid Search Algorithm.

Set-valued systemic risk measures can be computed via a combination of a grid search algorithm and stochastic simulation, see Ref. [18]. To employ this algorithm, a box-shaped subset of endowments which are of interest is subdivided by a grid of equally spaced points.

The grid search algorithm proceeds as follows. In a first step, the search is started at the origin of the considered box; we assume that the origin is outside of *R*(*Y*); from here, the acceptance criterion is successively evaluated for each adjacent grid point, lying on the diagonal of the grid identified by the direction $(1,1,\u2026,1)\u22a4$. Each evaluation typically requires stochastic simulation. The search along the diagonal direction is interrupted as soon as a point satisfying the acceptance criterion is identified. Due to the monotonicity of the input–output model and the properties of the acceptance criterion (cf., Ref. [18]), all grid points representing superior endowments are acceptable as well and belong to *R*(*Y*). Analogously, all endowments that are worse than the first identified point are rejected, thus belonging to *R*(*Y*)* ^{c}*, the complement of the systemic risk measurement. It is precisely this monotonicity property that makes the algorithm efficient.

Each pair of diagonally adjacent points, one meeting the requirements and the other not, defines a sub-box. The algorithm checks the remaining corners of this sub-box and can quickly assign an acceptance status to dominating, respectively, dominated endowments. Subsequently, new pairs of points can be determined, one in *R*(*Y*) and one outside. The successively resulting sub-boxes are checked in the same way as before. The algorithm terminates when all points on the grid are assigned to an acceptance status. It finally determines a discrete grid-approximation of *R*(*Y*).

For a more detailed description of the grid search algorithm, we refer to Ref. [18].

## 3 Resilience Decision-Making

The decision-making process for resilience-enhancing endowments in complex systems, developed in this work, integrates resilience metrics and systemic risk measures. As discussed in Zuev et al. [37], complex systems are often described as networks: nodes and edges represent systems as well as the connections between their components. System components may be represented as network edges or nodes — whatever representation is more appropriate.

In order to illustrate our method, we consider a specific flow network as shown in Fig. 3. This network consists of seven nodes and eight edges, as well as a source node denoted by *s* characterized by an initial flow *w* and a target node denoted by *t* with a destination flow *v*, respectively. The network edges represent the essential components of the network. Each component is assigned to one of the two types, i.e., *b *=* *2. We set *n *=* *2, i.e., two endowment properties are associated with each component: a capacity *c* and a recovery improvement *r*^{*}. Each component $i\u2208{1,\u2026,8}$ is characterized by $(ai;ji)=(ci,ri*;ji)\u2208\mathbb{R}(1\xd72)\xd7{1,2}$.

*T*]. The interval is partitioned into

*u*parts by the time points $0=t0<\u2026<tu\u22121<tu=T$. System performance

*Q*(

*t*) is defined as a piecewise constant stochastic process that evaluates the ratio of the destination flow and the initial flow at each time point, i.e.,

We assume that partition is equidistant, i.e., $\Delta t=th+1\u2212th=(T/u)$$\u2200h$. The specification of a notion of system performance is, of course, not uniquely determined by the system; instead, alternative choices may be analyzed simultaneously and should thereby be carefully selected to enable suitable resilience analysis for the intended decision-making process.

The flow for a given endowment (*A*; *z*) is simulated as follows: at each time point *t _{h}*, the flow of the entire network is computed as follows. The flow starts at the source node and runs iteratively by means of a node-by-node breadth-first search through the entire network up to the destination node. Each node receives the partial flows from all edges leading into it and returns them to all subsequent edges, obeying the following allocation rules: (i) the incoming flow is allocated to all subsequent edges such that 30% runs into edges of type 1 and 70% runs into edges of type 2. Among subsequent edges of the same type, the relevant flow is uniformly allocated; (ii) if the capacity of a subsequent edge is exceeded, this edge is destroyed immediately and the flow is instead reallocated to the remaining edges according to (i); (iii) if a node has no subsequent edge, the flow emanating from this node is lost, i.e., the node becomes a sink.

After the flow has been computed at the time point *t _{h}*, the simulation proceeds to time-step $th+1=th+\Delta t$: edges that have been destroyed at time

*t*are removed from the network in consecutive time-steps unless they are recovered; the process of recovery will be described below. In addition, each edge can fail at random after the flow has been computed at time

_{h}*t*and before time

_{h}*t*

_{h}_{+1}. At time

*t*

_{h}_{+1}, the algorithm (i)–(iii) described earlier is then applied to the remaining network.

*i*, $ci$ its capacity, and $\beta >0$ a mitigation factor, we set

As discussed by Ayyub [13], multiple causes and processes can lead to failures. In this illustrative example and in applications in later sections, we consider only immediate failures due to overload or random impacts; failure might also occur due to a loss of performance in time, e.g., by aging.

where *r*_{max} is an upper bound for number of time-steps for recovery and *r*^{*} is a reduction specific to the component. Since each time-step has a length of $\Delta t=(T/u)$, the duration of the recovery process is $r\xb7(T/u)$. This recovery model corresponds to a one-step recovery profile; as discussed before in the context of failure profiles, various characteristic profiles of recovery in time are possible as well, cf., Refs. [3] and [13].

In the context of our model, the simulation procedure is executed consecutively for all time points, resulting in a single path of the performance process $t\u21a6Q(t)$ over the time interval [0, *T*]. A sample average of the system performance as a function of time *T* obtained from a Monte Carlo simulation is exemplarily illustrated in Fig. 4. The probabilistic resilience metric given in Eq. (2) can, of course, also be computed as a suitable average of Monte Carlo samples.

When analyzing the resilience of the system, an important task consists in determining the set of all endowment configurations (*A*; *z*) that lead to a prescribed acceptable level of system resilience. The numerical procedure is computationally expensive, but tractable due to the grid search algorithm by Feinstein et al. [18]. In addition, the problem is also simplified if restrictions are imposed on the matrix *A* via a suitable function *g _{z}*, where

*z*denotes the vector of types; this was discussed in Sec. 2.3.

To illustrate this procedure in the context of a flow network model, we fix the vector of types $z\u2208{1,2}8$ for the eight edges. Figure 3 provides, for example, $z=(2,1,2,1,2,2,1,2)\u22a4$. We assume that the constraint function *g _{z}* captures the following restrictions: (a) recovery improvements $ri*$ are fixed and equal for all components

*i*. (b) Capacities

*c*are a function of the type

_{i}*j*of the components

_{i}*i*, i.e., if two components are of the same type, they possess the same capacity. We explore a range of capacities in order to separate acceptable and inacceptable pairs. Figure 5 provides an example how the results of the grid search algorithm could look like. The blue dots signify the acceptable pairs of capacities of the two types of components, whereas red dots are inacceptable pairs. Acceptable pairs satisfy the desired resilience criterion, while inacceptable pairs do not. Obviously, the computation of the systemic risk measure significantly facilitates decision-making.

Additionally, the procedure allows the integration of monetary aspects into any decision process that is focusing on the resilience of the system. An important question concerns the identification of a least expensive configuration that is acceptable with respect to the chosen resilience criterion. If increasing the endowment values is costly, a least expensive solution will always be at the boundary between the red and the blue area. If the price of the endowments is linearly increasing, prices define a normal vector to this boundary that characterizes the least expensive acceptable configurations on the boundary, as illustrated by the green points in Fig. 5. Finding the least expensive configurations corresponds to efficient allocation rules as introduced by Feinstein et al. [18].

## 4 Multistage High-Speed Axial Compressor

Gas turbines are a highly important technology employed in industrial application, e.g., for electricity production, as well as in the military and transportation sector, e.g., as component of aircraft propulsion systems. In particular, axial compressors are one of the key components of gas turbines. For economic and safety reasons, it is of the utmost importance that they are as resilient as possible. In order to illustrate how the decision-making method developed in this paper allows for an analysis of the financial burden of increasing resilience and for an optimal choice between different instruments to enhance resilience, our method is applied to a functional model of an axial compressor.

### 4.1 Model.

In a previous work by one of the authors of this paper, developed within the Collaborative Research Centre 871, funded by the German Research Foundation [38], a functional model of an axial compressor was created as the foundation for a reliability analysis. This model has been developed to represent the reliability characteristic and functionality of the four-stage high-speed axial compressor of the Institute for Turbomachinery and Fluid Dynamics at Leibniz Universität Hannover. Detailed information about this axial compressor is provided in Refs. [39–41].

The model captures the influence of the roughness of the blades in the individual stator and rotor rows, alternately connected in series, on the performance of the axial compressor, namely, on the total-to-total pressure ratio and on the total-to-total isentropic efficiency. This functional model of the axial compressor has been assembled by applying a sensitivity analysis and identifying the relative important indices from an aerodynamic model of the compressor. The network representing the functional model is shown in Fig. 6. Each component of the reliability-based model represents one of the rotor blade rows (R1–R4) or stator blade rows (S1–S4). The arrangement of the components was chosen according to the effect of blade roughness on the two performance parameters of the axial compressor. More specifically, an interruption between start and end means a performance variation of at least 25%, corresponding to a nonfunctional compressor. This defines the system performance *Q*(*t*) of the functional model for the subsequent application of the resilience decision-making method. The system performance is determined at each time point *t _{h}* and is 1 if there is a connection from start to end and 0 if this connection is interrupted. More detailed information on the functional model and its formulation can be obtained from Ref. [38].

For the analysis, as components, we do not distinguish between the stator blade rows and the rotor blade rows and enumerate them by $i\u2208{1,\u2026,8}$. Further, each of them is assigned to the same component type, i.e., it is *j _{i}* = 1 $\u2200i\u2208{1,\u2026,8}$, and we therefore simplify the notation by $(ai;ji)=(ai;1)=ai$$\u2200i\u2208{1,\u2026,8}$. Each row, i.e., each component of the functional model, is assumed to be characterized by two endowment properties, a roughness resistance

*re*and a recovery improvement

*r**, so that a component is fully described by $ai=(rei,ri*)$. Both the roughness resistance

*re*

*and the recovery improvement $ri*$ of each row*

_{i}*i*are assumed to be functions of the type

*j*, i.e., $rei=rei\u2032,\u2009ri*=ri\u2032*$ if $ji=ji\u2032$ and are therefore in this case study equal for all components. This restriction can be captured by a suitable constraint function

_{i}*g*, cf., Sec. 2.3.

_{z}It should be noted that, in order to improve the roughness resistance of a blade, techniques that counteract the roughening of the surface are required. However, such techniques are not clearly identifiable and readily available at the moment. Within the scope of this example, the application of methodologies leading to an improvement of the resistance, e.g., by applying coating techniques, can nevertheless still be envisioned for the scope of the analysis. As an example, in areas not inherent with the mechanical resistance, the principle of blade coatings is already extensively employed, e.g., in the reduction of heat transfer from the gas flow into the blades by means of thermal coatings [42].

*t*. A failed component is treated as no longer present in the model and does not contribute to the overall system performance at time

_{h}*t*

_{h}_{+1}and all subsequent time points anymore until it is fully recovered. The failure probability of a component

*i*in the time interval $(th,th+1)$ is assumed to be constant in time, cf., Ref. [38], and is given by

where *λ _{i}* is the time-independent failure rate. An increase of the roughness resistance of a row of blades will reduce the degradation of the surface, and thus, the corresponding failure rate

*λ*. In contrast to the flow model in Sec. 3, in the functional model of the axial compressor, a component can fail exclusively at random.

_{i}### 4.2 Costs of Endowment Properties.

The optimal endowment properties are related to the quality of the components, and an increase in their production quality is associated with large costs. This should be taken in to account in the decision-making process. As discussed in Ref. [43], an increase of the reliability of components in complex networks might be associated with an exponential increase in their costs, and in our analysis, we will make such an assumption.

*re*is the “roughness resistance” value of component

_{i}*i*and

*price*a common basic price that does not depend on

^{re}*i*in this case study. In a similar way, an exponential relationship is assumed for the cost associated with recovery improvement:

*cost*” of an endowment is the sum of these costs:

### 4.3 Scenario.

In order to apply the decision-making method for resilience-enhancing endowments to the multistage high-speed axial compressor, the model parameter and simulation parameter values, shown in Table 1, are considered. A resilience acceptance threshold of $\alpha =0.8$, an arbitrarily selected number of *u *=* *200 time-steps as well as an arbitrarily selected time-step length of $\Delta t=0.05$ are assumed. We first determine the set of all acceptable endowments corresponding to a resilience value of at least $Res=0.8$ over the considered time period. Second, in practice, any improvement of the axial compressor blades is associated with costs; thus, the least expensive acceptable endowment is characterized as well, denoted by $A\u0302$. The roughness resistance *re* and the recovery improvement *r*^{*} are explored over $rei\u2208{1,\u2026,20}\u2009and\u2009\u2009ri*\u2208{1,\u2026,20}\u2009\u2200i\u2208{1,\u2026,l}$. These values can be interpreted as increasing quality levels. In terms of recovery, this leads to a recovery time for the components of maximum 20 time-steps to a minimum of one time-step, depending on the recovery improvement value $ri*$ of each component.

The scenario was simulated on the basis of the functional model, following the procedure described in Secs. 2.3 and 2.4. Figure 7 shows the results of the grid search algorithm. The blue, filled dots are the acceptable pairs of roughness resistance and recovery improvement. In terms of system resilience, the impact of the quality of recovery improvement and the quality of the blade coatings can be compared. For example, for recovery improvement values of $ri*\u226515$ time-steps, only the minimum roughness resistance value of *re** _{i}* = 1 is necessary in order to achieve the desired level of system resilience.

By applying the grid search algorithm [18], only about 10% of the possible pairs of roughness resistance and recovery improvement values had to be tested to determine *R*(*Y*). As described in Sec. 3, the least expensive endowment is an element of the boundary of *R*(*Y*). Taking into account the base prices in Table 1, the least expensive endowment is characterized by a roughness resistance of *re** _{i}* = 8 and a recovery improvement of $ri*=13$ for each component. In Fig. 7, the corresponding pair is highlighted in green. According to Eq. (17), its cost is 136,930€.

## 5 Berlin's U-Bahn and S-Bahn System

Berlin's subway U-Bahn, suburban train S-Bahn, trams, and buses carry more than 1.5 billion passengers each year. Approximately, two-thirds of these passengers are transported via the S-Bahn and U-Bahn rails [44,45]. These are the most used public transport systems in Berlin and of utmost importance for Germany's capital. Obviously, key infrastructures of high social and economic relevance require a large degree of resilience. The method developed in this paper will be applied to a model of the Berlin subway and suburban train system, with the aim of examining suitable resilience-enhancing modifications. Our methodology could also be applied to assess the resilience of new systems that are still in their design phase. This provides an opportunity to characterize ex ante adequate system requirements in terms of reliability, robustness, and regenerative capacity.

### 5.1 Model.

The U-Bahn and S-Bahn public transportation systems in Berlin are interconnected via multiple train stations. As described in Ref. [37], they may thus be considered as one single public transport network, in the following called “metro network”. Zhang et al. [46] explain how a metro network can be mapped into a topological graph: train stations correspond to the nodes and the connecting railway lines to the edges of the graph. For simplicity, we map parallel railway lines between two stations to one single undirected edge. In this way, the complexity of the metro network can be significantly reduced. In the case of the Berlin metro network, this procedure leads to a topological graph with 306 nodes and 350 edges. This representation of the U-Bahn and S-Bahn system is shown in Fig. 8.

We begin our analysis with the definition of a suitable metric of the system performance of the network, as explained in Sec. 2.3. Zhang et al. [47] present resilience assessments for large-scale metro networks and apply their approach to the Shanghai metro network. They suggest that the connectivity between the individual stations is an essential criterion for assessing metro operations. Their approach employs the characteristics of topographic networks in order to capture resilience, e.g., the characteristic path length, the network-clustering coefficient, the average node degree, and the network efficiency.

with *N* being the number of nodes in the network and *d _{ij}* being the path length between node

*i*and node

*j*, i.e., the shortest distance between these nodes. We use the network efficiency

*E*as system performance of the Berlin metro network in each time point

_{f}*t*, previously denoted by

_{h}*Q*(

*t*). Zhan and Noon [49] and Dreyfus [50] provide a good overview of tools for efficiently determining the path length

_{h}*d*between nodes, e.g., the algorithms of Floyd, Dijkstra's, or Bellman-Ford.

_{ij}The node degree represents the number of nodes in the graph that have a direct connection to the *i*th node. In many useful random graphs, the distribution of node degrees follows a power distribution, see, e.g., Barabási and Albert [51]. Figure 9 shows the relative frequencies of the node degrees in case of the Berlin metro network which could be approximated by a power distribution.

Metro stations are modeled as nodes of the network. We assume that each metro station *i* is characterized by two endowment properties, (a) robustness *ro** _{i}* and (b) recovery improvement $ri*$; a component

*i*of type

*j*is described by a tuple $(ai;ji)=(roi,ri*;ji)$. Again, both endowment properties are assumed to be functions of their component type

_{i}*j*only, such that $roi=roi\u2032$ and $ri*=ri\u2032*$ if $ji=ji\u2032$. These restrictions can again formally be captured by the constraint function

_{i}*g*that explicitly describes the reduction of the dimension of the problem, cf., Sec. 2.3.

_{z}*i*is

where $\lambda i(th)$ is the failure rate at time *t _{h}*, $ki(th)$ is the time-dependent number of direct neighbors of the

*i*th metro station that are in a failed state at time

*t*,

_{h}*ro*

*is the robustness of the*

_{i}*i*th metro station, and

*ro*is the maximum value of the robustness. As the robustness of a metro station

_{max}*i*increases, its failure rate $\lambda i(th)$ decreases. In the event of failure of a directly adjacent metro station, its probability of failure increases; a rational for this assumption is that the load on the considered station becomes larger which increases the likelihood of failure. This phenomenon might potentially lead to cascading failures, cf., Refs. [52–54].

The second endowment property, the recovery improvement $ri*$, determines the time to recovery after failure according to Eq. (13). Failed metro stations are not removed, but remain in the set of metro stations; however, their node degree becomes 0 in the evolving network structure. This assumption is important, since the computation and interpretation of network efficiency *E _{f}* depends on the number of nodes; our case study relies on the fact that the number of nodes is preserved. After recovery, all previous connections to other metro stations are restored, unless these are in a state of failure. Our analysis focuses on determining the optimal endowments in terms of resilience. This is in contrast to Zhang et al. [47] whose focus is — among other things — the optimal order in which connections should be recovered.

### 5.2 Endowment Property Costs.

*j*is the type of station

_{i}*i*, its robustness is

*ro*

*, and its basic price of the endowment property “robustness” is $price(roi;ji)ro$. The total cost of “recovery improvement” is given by*

_{i}*j*is the type of station

_{i}*i*, its recovery improvement is $ri*$, and its basic price of the endowment property “recovery improvement” is $price(ri*;ji)*$. The total cost $cost(A;z)$ of an endowment (

*A*;

*z*) is obtained as

### 5.3 Scenarios.

We consider two different scenarios characterized by the simulation and model parameters in Table 2. In both scenarios, we assume a resilience threshold of $\alpha =0.8$, a number of *u *=* *100 time-steps, and a time-step length of $\Delta t=0.01$. The objective of the analysis is to characterize suitable endowment allocations such that the metro network's resilience is at least *Res* = 0.8. We will also identify the least expensive acceptable endowments, denoted by $(A\u0302;z)$ in both cases.

#### 5.3.1 Scenario 1.

In the first scenario, each metro station is assigned to one of the two station types, namely, “small” and “large.” The “small” metro stations *i* (type *j _{i}* = 1) have only one or two direct neighboring stations, i.e., a node degree of 1, 2, while the “large” metro stations

*i*(type

*j*= 2) have more than two direct neighboring stations, i.e., a node degree > 2, and are highlighted in red in Fig. 8. Out of all stations, a total of 245 are of type 1 and 61 of type 2.

_{i}The endowment properties of all components depend on their type only. We vary robustness with $roi\u2208{1,\u2026,20}$ for all *i*. The recovery improvements of both component types are fixed and equal $ri*=15$ for all components *i* independent of their types.

Figure 10 shows the results of the grid search algorithm for scenario 1. The blue, filled dots signify the pairs of robustness values that lead to acceptable endowments. In terms of system resilience, the robustness of the “small” metro stations is more important than the one of the “large” stations in this case study. For example, with large robustness values for the “small” stations, i.e., $roi\u226518$ for *j _{i}* = 1, acceptability may be achieved even with a minimal robustness value of

*ro*

*= 1 of “large” stations, i.e.,*

_{i}*j*= 2. The slope of the almost linear boundary indicates that in order to compensate for the reduction of one robustness unit of the endowments of the “small” stations, an increase of approximately 1.7 robustness units of the endowments of the “large” stations is necessary.

_{i}This observation can be explained as follows: first, the number of “small” stations (245) is significantly larger than the number of “large” stations (61), and thus, their overall influence is large. Second, “small” stations are often arranged in chains. If a station within such a chain fails, all stations further out of town are automatically cut off from the main network, which has a major impact on network efficiency.

Thanks to the grid search algorithm only about 10% of all pairs of robustness values had to be tested to determine the set of all accepted endowments. As described in Sec. 3, the least expensive endowment is an element of the boundary of the acceptable endowment allocations. For the parameters in Table 2, the least expensive endowment corresponds to a robustness of *ro** _{i}* = 10 for all “small” stations

*i*of type

*j*= 1 and a robustness of

_{i}*ro*

*= 13 for all “large” stations of type*

_{i}*j*= 2. The corresponding pair of parameters is highlighted in green in Fig. 10. The total endowment cost, computed according to Eq. (24), equals 6,673,579€.

_{i}#### 5.3.2 Scenario 2.

In the second scenario, all metro stations are assigned to the same station type, and $(ai;ji)$ can simply be written as *a _{i}*. We vary both recovery improvement and robustness, i.e., $ri*,roi\u2208{1,\u2026,20}$$\u2200i$.

Figure 11 shows the results of the grid search algorithm for scenario 2. The blue, filled dots signify the pairs of robustness and recovery improvement values that lead to acceptable endowments. The application of the grid search algorithm leads to a similar reduction of the computing effort as in scenario 1. For the parameters in Table 2, the least expensive endowment corresponds to robustness *ro** _{i}* = 14 and recovery improvement $ri*=11$ for all stations

*i*; this is highlighted in green in Fig. 11. Its cost is computed according to Eq. (24) and equals 6,995,127€.

## 6 Conclusion

This paper introduces a procedure for decision-making in complex systems that enables the optimal allocation of scarce resources to resilience-enhancing endowments. The methodology integrates systemic risk measures with time-dependent and probabilistic resilience metrics. Our approach is not limited to controls of the same type, but allows for a direct comparison of the impact of heterogeneous controls on the resilience of the system, e.g., failure prevention and recovery improvement arrangements, over any period of time. The system behavior itself may depend on a wide variety of stochastic variables that influence its performance. Our method characterizes, in a first step, all acceptable endowments of system components that lead to a desired level of resilience. In a second step, it is capable of incorporating monetary aspects into the decision-making process and admits the identification of the least expensive controls. In addition, we explain a grid search algorithm for systemic risk measures that significantly reduces the required computational effort.

The suggested methodology is not limited to a certain type of networks. This paper illustrates that the approach is easily adaptable and universally applicable. Examples in this paper include technical systems such as axial compressors as well as infrastructure networks such as the Berlin metro system. Many other applications are possible, thereby supporting decision-makers in improving the complex systems of our modern society and increasing their resilience.

Future research may apply the suggested methodology to highly complex systems. In this paper, many simplifying assumptions were made in order to be able to focus on the basic concepts and to demonstrate the versatility of the approach in concrete examples. More challenging problems, e.g., higher dimensions of the parameter space, are left to future developments. From a conceptional point of view, real-world problems involve multiple objectives and are not limited to a finite time horizon. Future work should not only focus on system resilience and the costs of the controls but also on long-term effects, such as different expected profits under a modified system resilience. Comprehensive decisions require a deep understanding of the tradeoff between the costs and the gains of resilience. Further work will also have to address the balancing between monetary and safety-related criteria for decision-making. The proposed method provides powerful instruments for one important aspect: the characterization of acceptable resilience-enhancing endowments and the identification of the most cost-efficient allocation. This tool box will be a prerequisite for future answers to many challenging questions.

## Acknowledgment

The authors kindly thank the German Research Foundation (DFG) for the financial support to accomplish the research project D5 “Resilience Based Decision Criteria for Optimal Regeneration” within the Collaborative Research Center (CRC) 871—Regeneration of Complex Capital Goods.