Abstract
Including resilience in an overall systems optimization process is challenging because the space of hazard-mitigating features is complex, involving both inherent and active prevention and recovery measures. Many resilience optimization approaches have thus been put forward to optimize a system’s resilience while systematically managing these complexities. However, there has been little study of when to apply or how to adapt architectures (or their underlying decomposition strategies) to new problems, which may be formulated differently. To resolve this problem, this article first reviews the literature to understand how choice of optimization architecture flows out of problem type and, based on this review, creates a conceptual framework for understanding these architectures in terms of their underlying decomposition strategies. To then better understand the applicability of alternating and bilevel decomposition strategies for resilience optimization, their performance is compared over two demonstration problems. These comparisons show that while both strategies can solve resilience optimization problems effectively, the alternating strategy is prone to adverse coupling relationships between design and resilience models, while the bilevel strategy is prone to increased computational costs from the use of gradient-based methods in the upper level. Thus, when considering how to solve a novel resilience optimization problem, the choice of decomposition strategy should flow out of problem coupling and efficiency characteristics.
1 Introduction
Complex, large-scale, or safety-critical engineered systems will inevitably encounter hazardous scenarios. In these scenarios, it is important to minimize potential safety and/or performance losses and maintain or restore critical operations [1]. This is accomplished by incorporating resilience in the system’s dynamic hazard response, which can include resistance, absorption, restoration, and recovery [2], as well as active prevention [3,4] attributes or features. Starting this process in the early design stage provides the best opportunity to shape the overall system design to be resilient to hazards [5–7] (e.g., by incorporating flexibility, redundancy, sensing and reconfiguration technology in the design).
Many frameworks for incorporating resilience in early system design have thus been put forward [1,8–13]. A key challenge in incorporating resilience in the early design process is trading the benefits of hazard-mitigating features to system resilience with their corresponding design and operational costs and inefficiencies. Value modeling [14,15], multiobjective decision analysis [12,16], and expected cost modelling [17–21] frameworks have thus been put forward to resolve these trade-offs and enable design decision-making. However, even when the trade-offs among design, operational, and resilience objectives have been resolved, it can be difficult to incorporate resilience because the space of potential features is large and complex [12], since it comprises many different variables which can interact in unintuitive ways.
To enable the systematic exploration of these design spaces, resilience optimization frameworks have been developed, which leverage mathematical optimization techniques to find the optimally resilient design [12]. The most commonly used approach is the two-stage approach [22–24], which uses a bilevel optimization strategy to (in the upper-level) design the system before the event and then (in the lower level) optimize the system response after each hazardous scenario occurs [25–29]. Other general resilience optimization formulations have presented it as a sequential problem: first allocating resilience to subsystems and then optimizing the reliability and health management of those systems to achieve the required resilience [30,31]. In addition, there have been a number of other dedicated resilience optimization architectures and applications that organize the problem differently—as a sequential problem, as a multidisciplinary design optimization problem [32,33], a multiagent problem [34,35], a bilevel problem [36,37], and as a monolithic problem [21,38–40].
Given the variety of resilience optimization formulations and frameworks, it can be difficult to understand how best to use these frameworks to approach a new problem. Thus, there is a need to understand and differentiate the types of resilience optimization problems one might encounter and to understand how best to select and tailor an optimization framework to these problem types, similar to what has been done in the related fields of multidisciplinary design optimization [41,42] and codesign [43]. To approach this problem, the authors previously developed a general framework for combined design, operational, and resilience optimization and compared the use of all-at-once, sequential, and bilevel architectures within this framework [44]. While this comparison constituted a first step toward understanding the comparative advantages of resilience optimization approaches, it was limited to a very simple algorithm (exhaustive search) on a single problem. Other authors have compared nested (bilevel) with simultaneous solution architectures in similarly formulated reliability-based codesign problems [45], finding that a nested approach could converge to a similar solution to that of a simultaneous (all-at-once) optimization architecture with fewer function evaluations. However, this work was limited to reliability-based codesign problems and thus may not apply to the broader set of resilience optimization problem formulations. Furthermore, neither of these studies included alternating architectures in the comparison.
1.1 Contributions.
Given these gaps in the research, it may be difficult to understand when to select one of the myriad of existing approaches or how to formulate a new approach to a new resilience optimization problem of interest. The aim of this article is thus to develop an overall theory for understanding resilience optimization problem formulations and solution architectures. It advances this aim by pursuing two major contributions: First, it provides a comprehensive review of existing resilience optimization approaches and categorizes them in an overall framework showing how optimization architectures are used in different problem formulations (Sec. 2). Based on this review, it then identifies overall multilevel decomposition strategies used in resilience optimization approaches, as well as an alternating multilevel decomposition approach, which has not been studied in the field. To understand the comparative performance and applicability of these strategies, it then compares them over a simple notional resilience optimization problem and a cooling tank problem in Sec. 3 to evaluate decomposition strategy performance given coupling and alignment problem characteristics. These comparisons are then used to understand how best to apply the identified multilevel decomposition strategies to new resilience optimization problems (Secs. 4 and 5).
2 Resilience Optimization
Resilience optimization is the use of mathematical optimization techniques to increase the resilience of the system to hazardous scenarios. In this framework, resilience is the valued quality of the system’s dynamic response to hazards, which minimizes their undesirable consequences (e.g., downtime, safety consequences, repairs) [17]. This definition is shared by a number of widely used resilience frameworks, including the resilience triangle [46] and similar variants presented in the literature (e.g., Refs. [31,47–49]), although there remains ongoing debate about resilience definitions [50–52]. The variety of different approaches to model, consider, and change the system’s hazard response has resulted in a number of different formulations of the resilience optimization problem. The remainder of this section presents a general framework for understanding resilience optimization problem formulations, which it then uses to classify previous examples of resilience optimization in the literature. It then describes decomposition strategies used in resilience optimization architectures, as well as the alternating architecture, which will be used in the comparison in Sec. 3.
2.1 Resilience Optimization Problem Formulations.
2.1.1 Integrated Resilience Optimization.
A large number of formulations and solution approaches have been presented for integrated resilience optimization problems, as presented in Table 1. While some of these approaches use a monolithic formulation, the use of decomposition architectures is much more common because of the inherent structure of the problem, where some variables (design/operational) will apply to all resilience scenarios and others will only apply to some or individual scenarios. As a result, two-stage (and n-stage, a variant used where there are multiple sequential decisions) approaches are used most commonly. This approach leverages both a bilevel and scenario-based decomposition to find the optimal response of a system in each scenario given a set of design/operational variables that apply to all scenarios. However, it is not the only approach used, because the resilience variables can be coupled between scenarios (leading to the use of bilevel architectures) and because (in robust optimization formulations) the resilience objective is based on the worst-case scenario (rather than the entire set). This leads to the use of trilevel optimization architectures where the optimization of the resilience model is the optimization of the resilience policy nested within the optimization of the worst-case scenario. While there has been some exploration of sequential multilevel decomposition approaches, these are used much less commonly.
| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [55] | Mitigation (D/O) and restoration (R) strategies for a natural gas distribution network and power grid. |
| | [56] | Preparedness (D/O) and recovery (R) actions made links of a transportation network. |
| | [33] | Power plant condenser design parameters (D/O) and PHM maintenance/repair policy (R). |
| | [44] | Multirotor architecture, flight-plan, (D/O), and in-flight contingency management (R). |
| | [57] | Aircraft control actuator reliability, detectability, and reconfigurability (D/O) and detection/recovery rapidity (R). |
| | [58] | Transmission line protection level (D/O) and power generation and supply response for an electricity distribution system. |
| | [59,60] | Freight transportation shipment allocation (D/O) and recovery activity (R). |
| Bilevel | [36] | Size and capacity of logistics service centers (D/O) and customer demand allocation (R). |
| | [44] | Multirotor architecture, flight-plan, (D/O), and in-flight contingency management (R). Compared monolithic and scenario-set resilience model decomposition approaches. |
| Trilevel | [61] | Network configuration and capacity (D/O) and processes operating level (R) in the worst-case scenario. |
| | [29,62] | Investment planning (D/O) and hazard response (D/O) in worst-case scenario. |
| | [63] | Electricity and natural gas system line reinforcement (R) and flow response in worst-case attack scenario (R). |
| | [64] | Protected nodes (D/O) and restored nodes (R) in worst-case interdiction scenario in water, gas, and power network. |
| | [65] | Natural Gas Unit commitments (D/O) and contingency actions (R) in worst-case scenario. |
| | [66] | Road network defenses (D/O) and flow of traffic (R) in worst-case attack scenario. |
| Two-stage | [26] | Airport preparedness (D/O) and recovery (R) actions. |
| | [27] | Freight network preparedness (D/O) and recovery (R) actions. |
| | [28] | Power commitment and reserves of distributed generators and microgrids (D/O) and reserve deployment (R). |
| | [67] | Distribution center capacity, and nominal customer allocation and order size (D/O) and customer allocation and order size in disruptive scenarios (R). |
| N-stage | [68] | Infrastructure mitigation and preparedness actions (D/O) and repair, recovery, and transfer plans (R) for healthcare systems. |
| Sequential | [69] | Train braking profile (D/O) and maintenance policy (R). |
| | [30] | System-level resource allocation and redundancy (D/O). Component reliability and PHM efficiency are co-optimized in the lower level (R). Used a custom lower-level strategy. |
2.1.2 Resilience-Based Design Optimization.
In RDO, the design and operations of the system xD/O are optimized as decision variables, resulting in a design and/or mission profile, which is inherently resilient to faults. Following the notation in Eq. (1), this problem may be generically stated as follows:
Existing formulations and solution approaches for RDO problems are presented in Table 2. As shown, the majority of the existing RDO formulations use a monolithic architecture, and a very common type of problem is sensor allocation, where sensors are distributed in a system to enable detection and reconfiguration of component faults. However, there have been a few additional problem formulation and solution architecture variants. Two approaches [35,83] incorporate a coupled control problem that finds a corresponding control policy given the design policy by solving a resilience constraint satisfaction problem. In addition, the reliability-based codesign formulation in Ref. [45] uses a bilevel structure (in a comparison with all-at-once and other approaches) to simultaneously explore system designs and corresponding control policies. Finally, while these problems are nearly all solved using monolithic solution strategies, there is one example using the scenario-based decomposition described in Sec. 2.2.1, where each design variable is mapped to a set of scenarios originating from a function in the system.
| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [38,39,70–75] | PHM sensor allocation and network design. |
| | [76–79] | Optimization of (generic, power, rail, etc.) network topology. |
| | [40] | Sensor locations, inspection interval, detection probability of motor controller PHM system. |
| | [16] | Optimization of supply chain connectivity and production. |
| | [18,31,80] | Component resilience attributes (e.g., redundancy, robustness, rapidity, reliability, restoration). |
| | [81] | Retrofit improvements (materials, thicknesses) in a bridge. |
| | [82] | Transportation path through an earthquake-prone area. |
| | [83] | Power production planning (D/O) and load not served during extreme weather events (coupled R). |
| | [35] | Agent (operator) placement and number (D/O) in a power grid and power equilibrium (coupled R). |
| Scenario-set | [84] | EPS system redundancy architecture. |
| Bilevel | [45] | Wind turbine (and notional problem) design (upper level) and operations (lower level). |
2.1.3 Resilience Policy Optimization.
| Architecture | Ref. | Problem description/variables |
|---|---|---|
| Monolithic | [89] | Urban rail repair sequence and duration. |
| | [85–88] | Aircraft off-nominal control policies. |
| | [90] | Electric distribution system recovery policy. |
| | [91] | Optimal bridge recovery delay and rate. |
| | [92] | Optimal power and water system postearthquake recovery actions. |
| | [19] | Monopropellant system recovery policy (R) and enabling flexibility (coupled D/O). |
| Two-stage | [25] | Modes to repair over set of scenarios and dispatch of energy resources and repair crews in each scenario. |
| Bilevel | [93] | Optimal policy for reconfiguring traffic intersections given optimal user behavior to intersection reconfiguration in each scenario. |
| | [37] | Reservoir flowrate (upper level) and flexibility allocation (lower level) over scenarios. |
| | [94] | Road network recovery plans (upper level) given equilibrium traffic assignment (lower level). |
| | [95] | Wildfire response operations (lower level) in the worst-case scenario (upper level). |
2.2 Decomposition Strategies.
The use of a specialized optimization architecture is motivated by the potential to reduce the solution time and complexity of a problem by decomposing it into reduced-dimensionality subproblems that map better to known algorithms. What differentiates the resilience optimization problem from a traditional multidisciplinary design optimization (MDO) problem is that it often has inherent structural characteristics that may be leveraged for more efficient solution due to there being reduced coupling between analyses. MDO problems, on the other hand, are often tightly coupled, having a large number of shared and coupling variables that must be kept consistent in the final solution. As a result, existing MDO architectures may not be suitable for the task of leveraging the weak coupling relationships between resilience optimization subproblems to achieve a more efficient and effective solution process.
2.2.1 Scenario Decomposition.
Scenario decomposition approaches decompose the resilience model into independent subproblems for each scenario or group of scenarios. This is advantageous because the size of the resilience problem increases with the number of scenarios considered, and each scenario may itself be computationally costly to simulate. However, its usage depends on the coupling of the scenarios in the resilience problem, as shown in Fig. 4. If the scenarios are fully uncoupled, meaning that the resilience variables for each scenario do not interact, each problem can essentially be solved independently. This is the case in the two-stage [25], n-stage, and scenario-set approaches in Tables 1–3, where the variables optimized are how the system responds to each situation individually (rather than in aggregate).
However, there may be resilience problems of interest where the variables being optimized are coupled between the scenarios, for example, when a control component must take actions in hazardous scenarios given sensor readings. In cases like this, it may still be possible to map these variables to subsets of scenarios, which is the lower-level decomposition approach in Ref. [44]. However, in cases where the scenarios are entirely coupled, the problem must remain monolithic. Note that these decomposition approaches can be performed not only in the context of an IRO formulation (resulting in the approaches shown in Fig. 4) but also in an RPO and an RDO formulation where design variables can be decomposed to fault scenarios [84]. While these approaches are not compared here, we include them in this discussion because they can be a key reason to choose a multilevel decomposition strategy instead of using an all-at-once approach: multilevel decomposition strategies enable one to further decompose the resilience model into separate optimization problems, which can greatly reduce the computational cost of the overall problem.
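To illustrate the fully uncoupled case, the following is a minimal sketch in Python. The scenario list, the cost function, and all names (`SCENARIOS`, `scenario_cost`, and so on) are hypothetical and are not taken from the article or its references. Under the assumption that each scenario has its own response variable, solving one small problem per scenario gives the same expected resilience cost as solving for all responses at once.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Hypothetical scenarios as (rate, severity) pairs; illustrative values only.
SCENARIOS = [(0.2, 1.0), (0.5, 2.0), (0.3, 4.0)]


def scenario_cost(x_r, severity, x_d):
    """Toy cost of a response effort x_r in one scenario, given design x_d."""
    return (x_r - severity / x_d) ** 2 + 0.1 * x_r


def resilience_cost_decomposed(x_d):
    """Scenario decomposition: one independent 1-D problem per scenario."""
    total = 0.0
    for rate, severity in SCENARIOS:
        res = minimize_scalar(scenario_cost, args=(severity, x_d))
        total += rate * res.fun
    return total


def resilience_cost_monolithic(x_d):
    """Monolithic resilience model: all scenario responses in one problem."""
    def expected_cost(x_r):
        return sum(rate * scenario_cost(x, severity, x_d)
                   for x, (rate, severity) in zip(x_r, SCENARIOS))
    return minimize(expected_cost, x0=np.zeros(len(SCENARIOS))).fun


if __name__ == "__main__":
    print(resilience_cost_decomposed(2.0), resilience_cost_monolithic(2.0))
```

If a response variable were shared between scenarios, the per-scenario loop would no longer be valid and the variables would need to be grouped by scenario subset or kept in a single problem, as described above.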
2.2.2 Multilevel Decomposition: Bilevel Architecture.
2.2.3 Multilevel Decomposition: Alternating Architecture.
Alternating architectures can be adapted in a number of ways depending on the type of problem. As shown in Fig. 6, the sequential architecture explored in the previous work [44] is a variant of the alternating architecture, which only solves the optimization at each level once. One could additionally use several different exit conditions in the implementation of an alternating architecture (e.g., number of iterations, tolerances on variables), which could be adapted to a particular problem. Finally, alternating architectures may include the resilience model (or a surrogate) in the design/operations optimization loop, resulting in the “with CR” variant shown (see overlay/highlighting) in Fig. 6. This enables the design/operational optimization to take into account some of the costs of resilience without running a full optimization at every iteration.
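The variants described above can be sketched on a toy problem. This is an illustrative sketch only: the quadratic cost functions, the function names, and the default tolerances are assumptions made for the example, not the models or settings used in this work. The loop alternates between a design/operations optimization and a resilience optimization, exits on an iteration limit or an improvement tolerance, and optionally includes the resilience cost CR in the upper-level objective. Setting `max_iter=1` gives the sequential variant.

```python
from scipy.optimize import minimize_scalar


def design_cost(x_d):
    """Hypothetical design cost C_D, cheapest at x_d = 1."""
    return (x_d - 1.0) ** 2


def resilience_cost(x_d, x_r):
    """Hypothetical resilience cost C_R, loosely coupling x_d and x_r."""
    return (x_r - 2.0) ** 2 + 0.5 * (x_d - x_r) ** 2


def total_cost(x_d, x_r):
    return design_cost(x_d) + resilience_cost(x_d, x_r)


def alternating(include_cr=True, max_iter=50, ftol=1e-6, x_r0=0.0):
    """Alternate upper- and lower-level optimizations until improvement < ftol."""
    x_r, previous = x_r0, float("inf")
    for _ in range(max_iter):
        # Upper level: optimize the design with the resilience variables fixed.
        if include_cr:
            x_d = minimize_scalar(lambda v: total_cost(v, x_r)).x
        else:
            x_d = minimize_scalar(design_cost).x
        # Lower level: optimize the resilience variables with the design fixed.
        x_r = minimize_scalar(lambda v: resilience_cost(x_d, v)).x
        cost = total_cost(x_d, x_r)
        if abs(previous - cost) < ftol:
            break
        previous = cost
    return x_d, x_r, cost


if __name__ == "__main__":
    print("with CR:", alternating(include_cr=True))
    print("no CR:  ", alternating(include_cr=False))
    print("seq.:   ", alternating(include_cr=True, max_iter=1))
```

On this toy problem, the variant without CR stays at the minimum-design-cost point because the upper level never sees the resilience cost, while the variant with CR moves toward the joint optimum.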
3 Demonstration: Comparing Multilevel Decomposition Strategies
As stated in Sec. 2.2, while the bilevel multilevel decomposition strategy is used widely in resilience optimization, the use of the alternating strategy for these problems has not yet been studied. As a result, less is known about this strategy, and it may not be clear when it might be most applicable to a problem of interest. We posit that the effectiveness of this strategy compared with other strategies (such as a bilevel or all-at-once strategy) hinges on two inherent problem properties: alignment and coupling. Alignment refers to whether the upper-level and lower-level objectives oppose, support, or are invariant to each other (i.e., if dCD/dxD· dC/dxD ≈ ‖dCD/dxD‖*‖dC/dxD‖, where dC/dxD = dCD/dxD + dCR/dxD). Coupling, on the other hand, refers to the degree to which upper-level variables are constrained with lower-level variables. For the purpose of this demonstration, we define three levels of coupling:
– In an uncoupled problem, the lower-level optimization is merely a refinement of the upper-level optimization, meaning that there is a direct path from .
– In a loosely coupled problem, the optimal choice of design variables xD* may depend on the choice of resilience variables; however, the choices do not directly depend on each other and the following relationship holds:
– In a fully coupled problem, this relationship does not hold, and as a result, the design and resilience variables must be jointly explored.
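The alignment property defined before this list can be checked numerically at a given design point. The following is a rough sketch, assuming the design and resilience costs are available as functions of the design variables only (that is, with the resilience variables held fixed); the function name `alignment` and the example cost functions are hypothetical. It returns the cosine between dCD/dxD and dC/dxD, which is close to +1 when the dot product approaches the product of the norms and close to −1 when the objectives oppose each other.

```python
import numpy as np
from scipy.optimize import approx_fprime


def alignment(design_cost, resilience_cost, x_d, eps=1e-6):
    """Cosine between dC_D/dx_D and dC/dx_D, with C = C_D + C_R."""
    x_d = np.asarray(x_d, dtype=float)
    g_design = approx_fprime(x_d, design_cost, eps)
    g_total = g_design + approx_fprime(x_d, resilience_cost, eps)
    norms = np.linalg.norm(g_design) * np.linalg.norm(g_total)
    if norms == 0.0:
        return 0.0  # a gradient vanishes, so alignment is undefined here
    return float(g_design @ g_total / norms)


if __name__ == "__main__":
    point = np.array([1.0, 2.0])
    c_design = lambda x: np.sum(x ** 2)
    c_supporting = lambda x: 2.0 * np.sum(x ** 2)   # same descent direction
    c_opposing = lambda x: -3.0 * np.sum(x ** 2)    # opposite descent direction
    print(alignment(c_design, c_supporting, point))
    print(alignment(c_design, c_opposing, point))
```

Because the measure is local, it would need to be sampled at several design points to characterize a problem as a whole.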
3.1 Notional System.
3.1.1 Optimization.
This is a nonlinear programming problem. In this work, this problem is solved using the trust-region algorithm in Python’s SciPy package (see Ref. [100]) using all-at-once, bilevel, alternating, and sequential strategies. While a full description of the implementation is out of the scope of this section, it is important to know that the bilevel method was run for 50 upper-level iterations with 20 corresponding lower-level iterations (since other convergence criteria were not met during optimization at either level) and the alternating approaches were set to terminate when the improvement between upper and lower-level optimizations was below a tolerance of ftol = 10^−4. The starting point used was x = [1, 0.5, 10^−4, 0.5, 1, 1]. Figure 8 shows the progress of the bilevel, all-at-once, and alternating strategies over the computational time used for the optimization. As shown, both the all-at-once and alternating (with CR) strategies complete the optimization in reasonable computational time, whereas the bilevel strategy takes an order of magnitude longer to approach the same solution. The alternating and sequential strategies without CR converge to a poor design, since they have no ability to account for resilience in the design optimization problem. These results are also reflected in the final results comparison in Table 4, which additionally shows the performance of sequential strategies with and without the resilience cost CR in the upper level. As shown, the sequential strategy with the resilience cost in the upper level performs nearly as well as the all-at-once strategy at reduced computational cost because of the reduced space of the problems (and lack of subsequent iterations present in alternating architectures).
| Strategy | xp | xa | xr | xs | xb | xc | f* | time |
|---|---|---|---|---|---|---|---|---|
| All-at-once | 1.5 | 1.1 | 0.0022 | 0.41 | 0.62 | 10 | −1.3 × 10^6 | 0.51 |
| Bilevel | 1.5 | 0.8 | 0.0024 | 0.7 | 0.71 | 10 | −1.2 × 10^6 | 14 |
| Alternating (with CR) | 1.5 | 1.1 | 0.0022 | 0.41 | 0.62 | 10 | −1.3 × 10^6 | 0.96 |
| Alternating (no CR) | 1 × 10^2 | 80 | 1 × 10^2 | 20 | 0.071 | 10 | 4.1 × 10^11 | 0.53 |
| Seq. (with CR) | 1.3 | 0.78 | 0.00083 | 0.51 | 0.72 | 10 | −1.2 × 10^6 | 0.37 |
| Seq. (no CR) | 1 × 10^2 | 80 | 1 × 10^2 | 20 | 0.071 | 10 | 4.1 × 10^11 | 0.26 |
The comparative performance of these strategies flows out of the characteristics of the problem and optimization methods. Because the design and resilience optimization problems are only loosely coupled (resilience variables do not significantly impact the upper-level cost), the sequential and alternating approaches perform nearly as well as a monolithic approach in terms of solution found and computational time. However, this is only the case when the resilience cost is included in the upper-level model, which is consistent with Ref. [44]. The bilevel strategy performs poorly for two reasons: first, establishing a gradient in the upper level is unnecessarily costly because each point evaluated to find the gradient using the finite difference method in the algorithm corresponds to a full optimization of the lower level; second, many of the iterations used in solving the lower-level optimization are wasted because they are unrelated to establishing feasibility in the upper level. To summarize, the alternating strategy can perform comparably to the all-at-once strategy because of the loose coupling in the problem, while the bilevel strategy performs poorly because of the inherent computational expense of re-optimizing the lower level to calculate the gradient in the upper level.
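The first of these cost drivers can be made concrete by counting evaluations in a nested setup. The sketch below is illustrative only: it uses hypothetical quadratic costs and SciPy's BFGS method with its default finite-difference gradient, not the trust-region setup or the notional system model used in this work. Every upper-level objective evaluation, including the extra points used for the finite-difference gradient, triggers a complete lower-level optimization.

```python
from scipy.optimize import minimize, minimize_scalar

# Counters for lower-level work; names are illustrative.
counts = {"lower_solves": 0, "lower_evals": 0}


def design_cost(x_d):
    """Hypothetical design cost C_D."""
    return (x_d - 1.0) ** 2


def resilience_cost(x_d, x_r):
    """Hypothetical resilience cost C_R; each call is one model evaluation."""
    counts["lower_evals"] += 1
    return (x_r - 2.0) ** 2 + 0.5 * (x_d - x_r) ** 2


def upper_objective(x):
    """Bilevel objective: design cost plus the optimized resilience cost."""
    x_d = float(x[0])
    counts["lower_solves"] += 1
    lower = minimize_scalar(lambda x_r: resilience_cost(x_d, x_r))
    return design_cost(x_d) + lower.fun


# No analytic gradient is supplied, so the upper-level gradient is estimated
# by finite differences, each of which requires its own lower-level solve.
result = minimize(upper_objective, x0=[0.0], method="BFGS")

if __name__ == "__main__":
    print("upper-level iterations:  ", result.nit)
    print("lower-level optimizations:", counts["lower_solves"])
    print("resilience model calls:   ", counts["lower_evals"])
```

Even in this one-variable case, the number of lower-level optimizations exceeds the number of upper-level iterations, and the gap grows with the number of design variables because a two-point finite-difference gradient needs one additional solve per variable.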
3.2 Cooling Tank Problem.
| Scenario | Rate | Cost | Expected cost |
|---|---|---|---|
| Import coolant leak | 1.7 × 10^−6 | 2.1 × 10^6 | 3.5 × 10^5 |
| Import coolant blockage | 1.7 × 10^−6 | 2.1 × 10^6 | 3.5 × 10^5 |
| Store coolant leak | 1.7 × 10^−6 | 1 × 10^6 | 1.7 × 10^5 |
| Export coolant leak | 1.7 × 10^−6 | 1 × 10^6 | 1.7 × 10^5 |
| Export coolant blockage | 1.7 × 10^−6 | 1 × 10^5 | 1.7 × 10^4 |
3.2.1 Optimization.
This problem is difficult to solve in an all-at-once strategy because of the high resilience model dimensionality (54 variables) and the mix of variable types (continuous in the design model and discrete in the resilience model). In addition, because variables are state based (and not scenario based), some variables may be coupled (e.g., raising a level when it is too low may cause the level to become too high, resulting in a new set of actions). Thus, this work uses a custom evolutionary algorithm in a monolithic resilience model to generate and refine solutions. This makes it difficult to solve design and resilience models in tandem since the result of the lower-level optimization may not necessarily be continuous (or act like a continuous function to an upper-level solver). To solve this problem, the design model in this work is searched using the Nelder–Mead method in SciPy [100], a gradient-free direct search method, which maintains a simplex of candidate points and iteratively replaces points based on the objective values at the current points.
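The pairing of a gradient-free upper level with a discrete lower level can be sketched as follows. This is a hypothetical toy, not the cooling tank model: the cost functions, the single discrete inlet action, and all numeric values are invented for illustration, and the lower level uses exhaustive enumeration as a stand-in for the custom evolutionary algorithm. The upper-level objective is piecewise and non-smooth in the tank size, which is the situation that motivates a direct search method.

```python
from scipy.optimize import minimize

# Hypothetical discrete inlet actions: decrease (-1), hold (0), increase (+1).
ACTIONS = (-1, 0, 1)


def design_cost(tank_size):
    """Hypothetical design cost: larger tanks cost more."""
    return tank_size


def resilience_cost(tank_size, action):
    """Hypothetical cost of a leak scenario for a given tank size and action."""
    shortfall = max(0.0, 20.0 - tank_size - 5.0 * action)
    return 10.0 * shortfall + abs(action)


def best_policy(tank_size):
    """Lower level: enumerate the discrete policy space (stand-in for the EA)."""
    return min(ACTIONS, key=lambda a: resilience_cost(tank_size, a))


def upper_objective(x):
    """Upper level: design cost plus the lower-level optimal resilience cost."""
    tank_size = float(x[0])
    return design_cost(tank_size) + resilience_cost(tank_size, best_policy(tank_size))


# Nelder-Mead needs only objective values, so the kinks introduced by the
# discrete lower-level choice do not require gradient information.
result = minimize(upper_objective, x0=[10.0], method="Nelder-Mead")

if __name__ == "__main__":
    print("tank size:", result.x[0], "policy:", best_policy(result.x[0]))
```

In a realistic setting the lower level would be stochastic (as with an evolutionary algorithm), so repeated upper-level evaluations at the same design could return different values; the sketch avoids this by using a deterministic enumeration.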
The alternating strategy was implemented on this problem using a population size of 50 and 100 iterations, while a population size of 20 and 20 iterations were used in the bilevel strategy. This was done to mitigate computational costs in the bilevel strategy and to take as much advantage of each successive optimization step in the alternating strategy as possible, since it was observed to converge quickly. Both strategies used a seeding method in the evolutionary algorithm, which enabled the best populations found at the end of each lower-level optimization to be saved and carried over to the next optimization. Since the lower-level optimization was performed using an evolutionary algorithm, these optimizations were run over 20 replicates to avoid potential issues with solution and performance variability. The progression of these strategies over the computational time of the optimization is shown in Fig. 10. As shown, the sequential strategies and alternating strategy (without the cost of resilience) follow the trends shown for the notional problem in Sec. 3.1, either proceeding to the minimum-design cost solution (10,0) (which does not incorporate any resilience) or terminating prematurely. However, unlike the previous comparison, the alternating strategy (with CR) very quickly reaches a plateau where each individual optimization does not improve the design significantly, while the bilevel strategy searches the space more effectively, ultimately converging to a lower-cost design.
This is further demonstrated in Table 6, which shows the optimization performance (over 20 runs), an optimal design output from a single run, and a summary of the chosen policy for that design. As shown, the alternating strategy (with CR) converges to a tank size of 20, the size that by design mitigates leak faults by making it impossible for the tank to drain completely, while the bilevel strategy converges to a tank size of 18.6. While the optimal designs found by all approaches include corrective actions in the lower level for when states go off-nominal (i.e., |xip, xop ≠ 0| > 0), the design found by the bilevel strategy increases the inlet flow more often (xip > 0) in the resilience policy, since the inlet has a buffer that it can leverage (i.e., xl > 0). As a result, it does not need as large a tank, since it can increase the input flow in the corresponding faulty scenarios when it might otherwise rely solely on tank buffer.
Approach | Design: xt* | Design: xl* | Policy (summary): \|xip, xop ≠ 0\| | Policy (summary): \|xip > 0\| | Optimization performance (20 runs): Cost | std. | Time (s) | std. |
---|---|---|---|---|---|---|---|---|
Bilevel | 18.6 | 0.62 | 22 | 12 | 287,604 | 2,785 | 1,062 | 586 |
Alt. (with CR) | 20 | 0.0 | 21 | 8 | 453,567 | 831 | 373 | 83 |
Alt. (no CR) | 10 | 0.0 | 23 | 5 | 893,333 | 0 | 181 | 6 |
Seq. (with CR) | 22 | 0.0 | 20 | 7 | 467,133 | 980 | 65 | 2 |
Seq. (no CR) | 10 | 0.0 | 24 | 6 | 893,333 | 0 | 60 | 2 |
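The seeding method described above for the evolutionary algorithm can be sketched as follows. This is a minimal illustration under stated assumptions, not the custom algorithm used in this work: the binary policy encoding, the mismatch-count cost, the single-bit mutation, and the names `policy_cost` and `evolve` are all hypothetical. In the actual problem the cost also changes with the design between optimizations, so a carried-over population is a warm start rather than a guaranteed optimum.

```python
import random


def policy_cost(policy):
    # Hypothetical resilience cost: number of mismatches against an arbitrary target policy.
    target = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    return sum(p != t for p, t in zip(policy, target))


def evolve(population, generations, rng):
    """Elitist evolutionary search: keep the better half and mutate it to refill."""
    pop = [list(ind) for ind in population]
    for _ in range(generations):
        pop.sort(key=policy_cost)
        survivors = pop[: len(pop) // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            flip = rng.randrange(len(child))
            child[flip] = 1 - child[flip]  # single-bit mutation
            children.append(child)
        pop = survivors + children
    pop.sort(key=policy_cost)
    return pop


rng = random.Random(0)
n_vars, pop_size = 10, 20
initial = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]

# First lower-level optimization starts from a random population.
first_pop = evolve(initial, generations=5, rng=rng)
# The next lower-level optimization is seeded with the saved population.
second_pop = evolve(first_pop, generations=5, rng=rng)

print(policy_cost(first_pop[0]), policy_cost(second_pop[0]))
```

Because the survivors are always retained, the seeded run in this sketch can never end with a worse best policy than the population it started from, which is why few iterations per lower-level optimization can still make progress.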
The superior performance of the bilevel strategy in this instance is a result of the coupling relationship between the upper- and lower-level problems. Since the pipe buffer has no intrinsic value outside its ability to be leveraged by a lower-level policy, which has been correspondingly optimized, the alternating strategy (which optimizes each separately) reduces pipe margin to 0, while the bilevel strategy (which optimizes each in conjunction) finds an optimal pipe margin of 0.62, which it leverages in the resilience policy by increasing the input flow when appropriate to mitigate the fault scenario. This demonstrates how tightly coupled design and resilience problems can necessitate a bilevel strategy, since in these cases the resilience provided by the design variables depends on an optimal resilience policy that leverages them, and that policy may itself be sensitive to changes in the design.
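This coupling effect can be reproduced in a deliberately small sketch. The cost terms, the grids, and the function names below are hypothetical and are not the cooling tank model; they only encode the assumption that a design margin has value solely when the lower-level policy uses it. The alternating loop is started from a policy that does not use the margin.

```python
# Hypothetical toy costs (not the cooling tank model).
MARGINS = [i / 10 for i in range(11)]  # upper-level design variable: buffer margin
POLICIES = [0, 1]                      # lower-level variable: 1 = use the margin in a fault


def total_cost(margin, policy):
    design_cost = 2.0 * margin                   # margin costs money to build
    action_cost = 0.1 * policy                   # small cost of taking the action
    fault_cost = 10.0 * (1.0 - policy * margin)  # margin only helps if the policy uses it
    return design_cost + action_cost + fault_cost


def best_policy(margin):
    return min(POLICIES, key=lambda p: total_cost(margin, p))


def alternating(margin=0.0, policy=0, sweeps=10):
    """Optimize the design with the policy fixed, then the policy with the design fixed."""
    for _ in range(sweeps):
        margin = min(MARGINS, key=lambda m: total_cost(m, policy))
        policy = best_policy(margin)
    return margin, policy


def bilevel():
    """Re-optimize the policy for every candidate design before comparing designs."""
    margin = min(MARGINS, key=lambda m: total_cost(m, best_policy(m)))
    return margin, best_policy(margin)


alt_margin, alt_policy = alternating()
bi_margin, bi_policy = bilevel()
print("alternating:", alt_margin, alt_policy, total_cost(alt_margin, alt_policy))
print("bilevel:    ", bi_margin, bi_policy, total_cost(bi_margin, bi_policy))
```

In this sketch the alternating loop stays at zero margin, because with the policy fixed the margin only adds cost, and with zero margin the action only adds cost. The bilevel search sees the policy respond to each candidate margin and therefore selects a positive margin at lower total cost.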
4 Discussion and Theoretical Implications
Using multilevel decomposition strategies on integrated resilience optimization problems can improve the computational efficiency of the optimization process by reducing the complexity of the design and resilience optimization problems. It is particularly desirable to reduce the number of evaluations of the resilience model, which can be computationally expensive because evaluation time increases by O(E × T × S), where E is the number of equations, T is the number of time-steps, and S is the number of scenarios. Multilevel strategies can reduce this complexity by enabling scenario-based decomposition of the resilience optimization problem (as in the commonly used two-stage architecture), which greatly decreases the computational complexity of the resilience optimization problem by breaking up the lower-level optimization over scenarios or sets of scenarios. Sequential or alternating strategies additionally have the potential to reduce the space of the problems and proceed more efficiently through the design space (as was the case in the notional system problem). Finally, a monolithic formulation may not be able to readily solve a given problem when the design and resilience models have differing variable types (as was the case in the cooling tank problem). In this context, multilevel strategies can improve the ability to optimize the problem by enabling the design and resilience problems to be solved by methods applicable to each problem type.
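The benefit of breaking up the lower-level optimization over scenarios can be sketched with an exhaustive search. The action set, scenario weights, and cost function below are hypothetical, and the sketch assumes that each policy variable affects only its own scenario; this does not hold for state-based variables such as those in the cooling tank problem.

```python
from itertools import product

ACTIONS = [0, 1, 2]                     # hypothetical responses available in each scenario
SCENARIO_WEIGHTS = [0.5, 0.3, 0.2, 0.1]  # hypothetical scenario weights
N_SCENARIOS = len(SCENARIO_WEIGHTS)


def scenario_cost(scenario, action):
    # Hypothetical cost: each scenario has exactly one best-matching action.
    return SCENARIO_WEIGHTS[scenario] * (1 + (action - scenario % 3) ** 2)


def joint_search():
    """Search all action combinations at once, simulating every scenario each time."""
    simulations = 0
    best_cost, best_policy = float("inf"), None
    for policy in product(ACTIONS, repeat=N_SCENARIOS):
        cost = sum(scenario_cost(s, a) for s, a in enumerate(policy))
        simulations += N_SCENARIOS
        if cost < best_cost:
            best_cost, best_policy = cost, policy
    return best_policy, simulations


def decomposed_search():
    """Optimize each scenario's action independently of the others."""
    simulations = 0
    policy = []
    for s in range(N_SCENARIOS):
        costs = []
        for a in ACTIONS:
            costs.append(scenario_cost(s, a))
            simulations += 1
        policy.append(ACTIONS[costs.index(min(costs))])
    return tuple(policy), simulations


joint_policy, joint_sims = joint_search()
dec_policy, dec_sims = decomposed_search()
print(joint_policy, joint_sims)
print(dec_policy, dec_sims)
```

Both searches return the same policy here, but the joint search grows exponentially with the number of scenarios while the decomposed search grows linearly.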
To better understand how (and when) to apply different multilevel decomposition strategies, Sec. 3 compared alternating and bilevel strategies (and their variants) on two different problems with different levels of coupling. In this comparison (summarized in Fig. 11), it was shown that the alignment and coupling of the design/operational and resilience problems drive the relative effectiveness of multilevel decomposition strategies. When the design/operational and resilience problems are not aligned (which constitutes all problems where there is a trade-off between design cost and resilience), the resilience model (or a surrogate) must be included in alternating and sequential strategies for them to perform adequately. Coupling additionally drives the choice of multilevel decomposition strategy: while alternating strategies can be used when the design and resilience problems are uncoupled or loosely coupled, a fully coupled problem requires a bilevel or all-at-once strategy.
Finally, the relative efficiency and effectiveness of given decomposition strategies are heavily dependent on the algorithm used. In the exhaustive search used in the drone model in Ref. [44], for example, the bilevel architecture was able to reduce (some) computational cost by shrinking the search space, which reduced the number of iterations, and by enabling a lower-level decomposition strategy. However, as shown in the notional example presented here, the large number of lower-level optimizations necessary to approximate a gradient in the upper-level problem can increase the computational cost by orders of magnitude when using a gradient-based solver. This was not the case in the tank problem, in part because the design/operations model was optimized using the Nelder–Mead method, which does not have a gradient-finding step at each iteration. Thus, to perform efficiently, the choice of strategy must be connected to the solution algorithm, whether because a strategy is inherently quicker with a given algorithm or because upper- and lower-level problems require different methods to solve efficiently. In particular, bilevel formulations are significantly hindered by the use of gradient-based approaches in the upper level, since each point in the upper-level design/operations optimization results in a re-optimization of the lower level. There is some potential to mitigate the efficiency issues of the bilevel architecture by adapting the number of scenarios and iterations of the resilience optimization steps depending on the stage of optimization (e.g., to speed gradient approximation steps). Thus, future work should develop specialized architectures for resilience optimization that help manage the computational costs specific to this type of problem.
5 Conclusions
Effective application of resilience optimization architectures requires knowledge of the underlying optimization problem formulation. Because resilience optimization problems can be formulated in a number of different ways, characterizing the applicability of given architectures requires considering their ability to optimize each type of formulation. As presented in Sec. 2, existing architectures use a variety of scenario-based and multilevel decomposition strategies to effectively solve different formulations of the resilience optimization problem. To understand the applicability of the multilevel decomposition strategies in the literature, as well as the alternating strategy (which has not been used or studied rigorously in this application), this work then applied these architectures to two example problems. There were two main lessons learned from these two applications. First, the ability of alternating architectures to optimize effectively depends on the alignment and coupling of the design and resilience problems. Second, the performance of the bilevel strategy varies widely based on the algorithms used at each level, since it can enable different solution strategies at each level but also requires a full optimization of the lower level at each upper-level design point. More broadly, knowing when to apply decomposition architectures requires an understanding of the formulation of the problem (in terms of variable types), the alignment and coupling properties of the underlying models, and the efficiency characteristics of the algorithms used.
There are a few limitations of the insights presented here, which should be resolved in future work. First, while the comparison here can help one understand when to use an alternating or bilevel architecture, it may be difficult to identify the coupling relationships prior to optimization. Future work should develop methods for identifying coupling relationships in a given resilience optimization problem before selecting an architecture. Second, while gradient-based optimization methods are shown to lead to high computational costs in bilevel approaches, these costs can be reduced by using gradient-free optimization algorithms, using a surrogate of the resilience model in the upper level, or limiting the amount of re-optimization of the resilience model during the gradient-finding steps. Future work should investigate these approaches to best understand how to solve integrated resilience optimization problems. Finally, a large number of possible formulations and architectures were presented in Sec. 2.2, which demonstrates the large space of possible resilience optimization approaches. In addition, methods in the larger field of multidisciplinary design optimization (e.g., analytic target cascading) have largely been neglected in the field of resilience optimization despite the mature and readily available tools that exist for this purpose (e.g., OpenMDAO, AGILE). Future work should continue to explore and compare architectures that have not seen substantial use for resilience optimization so that they can be understood and used appropriately in practice.
Acknowledgment
This research was partially conducted at NASA Ames Research Center. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.