## Abstract

Multifidelity (MF) models abound in simulation-based engineering. Many MF strategies have been proposed to improve efficiency in engineering processes, especially in design optimization. When it comes to assessing the performance of MF optimization techniques, existing practice usually relies on test cases involving contrived MF models of seemingly random math functions, due to limited access to real-world MF models. While it is acceptable to use contrived MF models, these models are often manually written up rather than created in a systematic manner. This gives rise to the potential pitfall that the test MF models may not be representative of general scenarios. We propose a framework to generate test MF models systematically and characterize MF optimization techniques' performances comprehensively. In our framework, the MF models are generated based on given high-fidelity (HF) models and come with two parameters to control their fidelity levels and allow model randomization. In our testing process, MF case problems are systematically formulated using our model creation method. Running the given MF optimization technique on these problems produces what we call “savings curves” that characterize the technique's performance similarly to how receiver operating characteristic (ROC) curves characterize machine learning classifiers. Our test results also allow plotting “optimality curves” that serve a similar function to savings curves in certain types of problems. The flexibility of our MF model creation facilitates the development of testing processes for general MF problem scenarios, and our framework can be easily extended to MF applications other than optimization.

## 1 Introduction

The idea of leveraging multifidelity (MF) models has been exploited in most engineering fields where simulations of physical systems are heavily used in the engineering process. The simulations typically come with configuration settings such as the choice of underlying physical models and discretization resolutions [1–7]. By varying these settings, engineers can obtain simulation models of varying fidelities and running costs. The cost can come in multiple forms such as computation time, financial cost, and even carbon emissions. While not always the case, high-fidelity (HF) models typically have relatively high costs and low-fidelity (LF) models have low costs. In application scenarios such as engineering design optimization, where simulation models need to be run repeatedly with different inputs, models of varying fidelities can often be used for each run. Potential savings can be achieved by running LF models instead of HF ones when the information gained through certain LF model runs outweighs the associated cost. Researchers in various fields have proposed MF techniques and algorithms to manage the LF and HF model runs strategically to achieve cost savings [8–17]. Some approaches are more problem-specific, whereas others are more general-purpose. While researchers have generally provided some MF problems to test and benchmark their techniques' performance, we observe that most contrived problems are prone to inadequate or even unfair selection, and that more systematic testbeds can be developed to better characterize the techniques and compare them more rigorously.

In this work, we focus on the field of general-purpose MF model-based design optimization to facilitate concrete discussions of methodologies and case studies [18–20], although most of our contributions can be easily extended to other fields using MF models for other types of applications. Examples of such applications include model calibration, model uncertainty quantification, and sensitivity analysis. The existing MF optimization techniques can be roughly classified into two categories, namely, global and local optimization techniques. Among global techniques, many follow the paradigm of Bayesian optimization (BO), where, during the optimization process, all the model evaluation data are used to conduct posterior inference across the design space and construct a so-called acquisition function (AF) to determine the next model-evaluation decisions. The AF is meant to quantify the “merit” of evaluating any fidelity model at any input location, and the next sample point is almost always chosen as its maximum point. Most research interest has focused on the formulation of effective and efficient AFs, exemplified by variable-fidelity expected improvement [21], multifidelity Gaussian process (GP) upper confidence bound [19], multifidelity max-value entropy search [22], and multifidelity mutual information greedy [23] techniques. Local optimization techniques utilize MF models only to find better design points close to the current best one in the design space. Thus, local techniques seek to achieve local optimality rather than the global optimality that global techniques target directly. While fewer local optimization techniques have been proposed in the literature than global ones, some techniques in this category are worth attention. In Ref. [24], the authors modified the archetypal Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm for gradient-based optimization [25] with MF model evaluations.
The line-search step in the algorithm is performed with the LF model that is corrected locally against the HF model. A recent work tackles multifidelity multidisciplinary design optimization problems with a seemingly simple yet effective approach [26]. It drops the MF models that are not Pareto-efficient in terms of fidelity and evaluation cost, and sorts the MF models based on their fidelities. Then, it sequentially uses lower- to higher-fidelity models to improve the current design with tolerances matching the model's fidelity level. While these local techniques only guarantee local optimality, they can be plugged into multistart optimization architectures, resulting in effectively global techniques.

When researchers try to benchmark the performance of their proposed MF techniques using case studies, contrived case problems are frequently used to compensate for the lack of real-world problems or to speed up the testing process. These problems typically consist of analytical models with hypothetical evaluation costs. Due to the vast space of possible analytical models, it is widely accepted that researchers only need to test their techniques against two or more analytical models with seemingly random formulas. However, when it comes to constructing MF models, using seemingly random formulas becomes more questionable. After all, given a fixed HF model formula, attaching another random formula to it to create an associated LF model would not work, as the additional random formula may render the “LF” model completely different from the HF model. It follows that constructing the LF model requires adjusting the formula so that the LF model resembles the HF model to a certain degree. The question is then how to ensure the adjustment is still “random” enough. In most practice, authors simply manually craft formulas for MF models that can pass their own and reviewers' judgements.

We propose a new framework for creating MF testbeds and testing MF optimization techniques that is much more rigorous than the aforementioned practice. Our framework includes two components, i.e., a new method for generating MF models based on a given HF model, and a process for generating test cases and characterizing the performance of tested MF techniques. Our method of generating MF models is based on characterizing the HF model through GP modeling [27] and comes with three desirable features. First, the MF model generation process is highly standardized given any base HF model, so that manual adjustment of the model formula is completely avoided. Second, the MF models have a tunable parameter that continuously controls the fidelity, i.e., resemblance to the HF model. Third, the MF models can be highly randomized to further avoid skewing of the generated case models. With this powerful MF model generation tool, we are able to systematically generate a suite of test cases for the given MF optimization technique to be tested. The results of the tests can yield what we call a “savings curve” that characterizes the performance of the technique in a way similar to how receiver operating characteristic (ROC) curves characterize binary classifiers in the machine learning field [28]. Specifically, savings curves plot the costs for running the MF optimization technique across the generated suite of test cases representing different MF problem scenarios. With savings curves, one can better understand the cost characteristics of a given technique in a wide range of problem scenarios and compare different techniques more comprehensively. In addition, the test results can produce an “optimality curve” in a similar fashion that complements the savings curve to give the tester a complete picture of the technique's performance.
Instead of plotting the costs for running the optimization like savings curves do, optimality curves plot the values of the optima found by the optimization techniques across the test cases. Using our framework, one can convert any existing optimization benchmark problem, such as those collected in Ref. [29], into MF benchmark problems, which greatly saves researchers' time and effort in coming up with appropriate benchmark problems for testing their MF optimization techniques.

Our framework is motivated by the pursuit of rigor in validation; therefore, we want to bring to the readers' attention broader validation principles such as the validation square (VS) [30,31], which provides a rigorous approach for testing and validating general design methods. Our testing framework is largely a substantiation of the second step in VS, as it facilitates systematic testing of the structural validity of any proposed MF design technique. If one adopts our framework when following the steps in VS to test certain MF techniques, one needs to check whether the use of our framework is consistent with how one conducts the third step in VS. This will be better understood and addressed after we lay out more details of our framework in Sec. 2.

The paper is organized as follows. Section 2 elaborates our testing framework, detailing our methodology for generating MF models and suites of test cases. Section 3 uses a concrete HF model and MF techniques to demonstrate examples of our testing process and results. Section 4 summarizes our contributions in this work.

## 2 Methodology

### 2.1 Multifidelity Model Generation Based on Given High-Fidelity Model.

$$y_{LF\text{-}\lambda}(x;\varphi)=\lambda\,y_{HF}(x)+(1-\lambda)\Big[A\sum_{i=1}^{d}\sin\big(f_i^{(0)}x_i+\varphi_i\big)+B\Big]\qquad(1)$$

where $\lambda \in [0,1]$ is the parameter controlling the level of fidelity of the LF model, $A$ and $B$ are a fitted magnitude factor and mean value that ensure the first and second terms without the $\lambda$ factors have similar global variation ranges, $f_i^{(0)}$ is a problem-independent normalization constant to be explained shortly, and $\varphi_i \in [0,2\pi]$ is a phase parameter that randomizes the correlation of the first and second terms along the $i$th input dimension.

More explanations of each factor/parameter introduced above are as follows. $\lambda$ is a control parameter functioning like a signal-to-noise ratio: $\lambda = 0$ means the LF model is completely different from the HF model, i.e., contains no “signal” of the HF model, and $\lambda = 1$ means the LF model is identical to the HF model. $A$ and $B$ are constants fitted by matching the standard deviations and means of globally sampled outputs of $y_{HF}(x)$ with those of the summed sine functions. $f_i^{(0)}$ is a constant equal to the fitted roughness parameter for each dimension after fitting an accurate GP model to the prototype model $y_0(x)=\sum_{i=1}^{d}\sin(x_i)$. This ensures that the first and second terms in $y_{LF\text{-}\lambda}(x;\varphi)$ fluctuate at similar frequencies, from the GP modeling perspective, along each input dimension. Finally, $\varphi_i$ is a control parameter whose value can be sampled uniformly from $[0,2\pi]$ to generate randomized MF models for any fixed fidelity level $\lambda$. To sum up, once all the other factors have been fitted, the MF models $y_{LF\text{-}\lambda}(x;\varphi)$ are controlled by only $\lambda$ and $\varphi$, with $\lambda$ linearly controlling the model fidelity and $\varphi$ allowing model randomization.
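To make the construction above concrete, the following is a minimal numpy sketch of an MF model generator under our reading of Eq. (1). The helper name `make_lf_model`, the unit-box probe sampling, and the passed-in roughness constants `f0` are illustrative assumptions for this sketch, not the exact implementation.

```python
import numpy as np

def make_lf_model(y_hf, d, lam, phi, f0, n_probe=2000, seed=0):
    """Build a randomized LF model from an HF model, per our reading of
    Eq. (1): lam * y_HF(x) + (1 - lam) * (A * sum_i sin(f0_i*x_i + phi_i) + B).
    A and B are fitted by matching the mean and standard deviation of the
    sine term to those of globally sampled HF outputs. The probe sample is
    drawn from a unit box, an assumption made for this sketch."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_probe, d))
    y = np.array([y_hf(x) for x in X])
    s = np.array([np.sum(np.sin(f0 * x + phi)) for x in X])
    A = y.std() / s.std()           # match global variation ranges
    B = y.mean() - A * s.mean()     # match global means

    def y_lf(x):
        x = np.asarray(x, dtype=float)
        return lam * y_hf(x) + (1.0 - lam) * (A * np.sum(np.sin(f0 * x + phi)) + B)

    return y_lf
```

Consistent with the interpretation above, $\lambda = 1$ reproduces the HF model exactly, while $\lambda = 0$ leaves no HF “signal” in the generated model.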

We use the standard Branin function as an example HF model to illustrate the effectiveness of our MF model generation method. The Branin function and MF models generated using our method are plotted in Fig. 1. Two instances of relatively high-fidelity models ($\lambda =0.8$) and two instances of relatively low-fidelity models ($\lambda =0.2$) are generated and plotted in Figs. 1(b)–1(e). Clearly, instances at each fidelity level show a corresponding degree of resemblance to the original HF model. At the same time, different instances within the same fidelity level still exhibit relatively random deviations from the HF model.

While the second term in our formulation in Eq. (1) may appear to be just one of many possible constructs, we provide the following justifications for our choice. Following the popular framework of Gaussian-process-based Bayesian calibration [34], the most theoretically sound formulation of the second term would arguably be a realization of a Gaussian process, as this takes the most general and flexible form. However, this choice suffers from two drawbacks. First, the randomization space may be too large to be practical, especially for high-dimensional problems; the large randomization space may also affect the stability of the generated models. Second, a realization of a Gaussian process would lack the model transparency that is important in practical applications. Our choice of a sinusoidal function strikes a good balance between generality and practicality: the wavy behavior of sinusoidal functions is quite similar to that of Gaussian processes with commonly used Gaussian kernels, and randomization based on a single parameter $\varphi$ is quite manageable in practice. Still, we acknowledge that a realization of a Gaussian process could be an alternative formulation when model transparency is not a concern and a practical randomization scheme can be developed for that case.

### 2.2 Testing Process for Given Multifidelity Technique.

Note that such problems contain $\lambda$, $\varphi$, $c_{HF}$, $c_{LF}$, and $c_b$ as setting parameters, which we specify as follows. The values for $\lambda$ and $\varphi$ come from the Cartesian product of the previously generated sets of values $\{\lambda^{(i)}\}_{i=1}^{m}$ and $\{\varphi^{(j)}\}_{j=1}^{n}$, resulting in $m\times n$ problems in total. For ease of normalization, we set $c_{HF}=1.0$ in all cases and let $c_{LF}$ be a fraction of $c_{HF}$. We propose two standardized ways to determine the value of $c_{LF}$. One is to fix $c_{LF}$ at a small value such as $0.01$ for all cases. The other is to let $c_{LF}=\lambda c_{HF}$, which means the LF model cost is proportional to its fidelity level. In the case studies in Sec. 3, we will see that both kinds of $c_{LF}$ values can produce interesting test results. The value of $c_b$ depends on the nature of the MF technique to be tested and the intended characteristic of the test problem. If the tested MF technique does not consider a total budget for running optimization, the value of $c_b$ has no influence on the test whatsoever. If the tested MF technique considers both the total budget and traditional convergence criteria when solving optimization problems, we can imagine two extreme cases. For one, if the value of $c_b$ is very large, then the technique is very likely to solve the problem to convergence, and the total cost incurred will be the key performance indicator. For the other, if the value of $c_b$ is very small, then the technique is very likely to run out of budget before it finds the true optimum, and the quality of the best solution it finds will be the key performance indicator. In the literature, we find that the total cost is the performance indicator of most interest. Even though this is partly because many proposed techniques do not consider a budget at all, techniques that perform well in one of the two extreme cases mentioned above tend to perform well in the other case anyway.
Therefore, we recommend setting $cb$ to a large value for standard testing, unless the tester particularly wants to use a small $cb$ value for some reason, e.g., to speed up the testing process.
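As a concrete illustration of the setup above, the following sketch assembles the $m\times n$ problem grid from the Cartesian product of $\lambda$ and $\varphi$ values. The dictionary container and the specific $m$, $n$, and dimension values are illustrative choices, not prescribed by the framework.

```python
import itertools
import numpy as np

# Assemble the m x n test-problem grid from the Cartesian product of
# lambda and phi values; d = 2 input dimensions assumed for illustration.
m, n, d = 21, 20, 2
lams = np.linspace(0.0, 1.0, m)                                   # lambda^(i)
rng = np.random.default_rng(42)
phis = [rng.uniform(0.0, 2.0 * np.pi, size=d) for _ in range(n)]  # phi^(j)

c_hf = 1.0
problems = [
    {
        "lam": lam,
        "phi": phi,
        "c_hf": c_hf,
        "c_lf": lam * c_hf,   # scenario 2; use 0.01 * c_hf for scenario 1
        "c_b": 1e6,           # large budget, per the recommendation above
    }
    for lam, phi in itertools.product(lams, phis)
]
```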

Once the $m\times n$ test problems (with parameters $\lambda^{(i)}$ and $\varphi^{(j)}$, $i=1,\dots,m$; $j=1,\dots,n$) are fully defined, they are fed to the MF technique, and the resultant total costs $c(i,j)$ and obtained optimum values $y_{opt}(i,j)$ for each test case are collected. For normalization purposes, we also run a benchmark optimization with a default optimization technique using a single-fidelity model (the HF model) and record the total cost $c_{bench}$ and the optimum value $y_{bench}$. The default technique can be the same as the tested MF technique if it can handle single-fidelity problems; otherwise, another benchmark optimization technique needs to be provided by the tester. One last piece of data is the scale of variation of $y$ across the design space of $x$, which we measure by the standard deviation $y_{std}$ of a sufficiently large set of uniformly sampled values of $y_{HF}$ in the design space. Once all the data mentioned above are collected, they are organized as shown in Tables 1 and 2. The summary statistics are then plotted to characterize the performance of the tested technique. Specifically, we calculate the mean $\mu_c(i)$ and standard deviation $\sigma_c(i)$ of the costs incurred in all the test problems with parameter $\lambda^{(i)}$ ($i=1,\dots,m$), and likewise the mean $\mu_y(i)$ and standard deviation $\sigma_y(i)$ of the $y$ values of the optima obtained in the test problems.

|  | $\lambda^{(1)}$ | … | $\lambda^{(m)}$ |
|---|---|---|---|
| $\varphi^{(1)}$ | $c(1,1)/c_{bench}$ | … | $c(m,1)/c_{bench}$ |
| … | … | … | … |
| $\varphi^{(n)}$ | $c(1,n)/c_{bench}$ | … | $c(m,n)/c_{bench}$ |
| Summary stats | $\mu_c(1),\ \sigma_c(1)$ | … | $\mu_c(m),\ \sigma_c(m)$ |


The data are grouped by $\lambda^{(i)}$ values, and summary statistics, exemplified by the mean $\mu$ and standard deviation $\sigma$, are calculated.

|  | $\lambda^{(1)}$ | … | $\lambda^{(m)}$ |
|---|---|---|---|
| $\varphi^{(1)}$ | $(y_{opt}(1,1)-y_{bench})/y_{std}$ | … | $(y_{opt}(m,1)-y_{bench})/y_{std}$ |
| … | … | … | … |
| $\varphi^{(n)}$ | $(y_{opt}(1,n)-y_{bench})/y_{std}$ | … | $(y_{opt}(m,n)-y_{bench})/y_{std}$ |
| Summary stats | $\mu_y(1),\ \sigma_y(1)$ | … | $\mu_y(m),\ \sigma_y(m)$ |


The data are grouped by $\lambda^{(i)}$ values, and summary statistics are calculated as in Table 1.
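The normalization and grouping in Tables 1 and 2 can be sketched as follows; `summarize` is our own illustrative helper, with `c` and `y_opt` as $m\times n$ arrays whose rows index $\lambda^{(i)}$ and columns index $\varphi^{(j)}$.

```python
import numpy as np

def summarize(c, y_opt, c_bench, y_bench, y_std):
    """Compute the normalized entries of Tables 1 and 2 and the per-lambda
    summary statistics; rows index lambda^(i), columns index phi^(j)."""
    c_norm = c / c_bench                   # Table 1 entries
    y_norm = (y_opt - y_bench) / y_std     # Table 2 entries
    return {
        "mu_c": c_norm.mean(axis=1), "sigma_c": c_norm.std(axis=1),
        "mu_y": y_norm.mean(axis=1), "sigma_y": y_norm.std(axis=1),
    }
```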

Plotting $(\mu_c(i),\ \sigma_c(i))$ against $\lambda^{(i)}$ yields what we call the savings curve, which characterizes the given technique's performance in terms of how much cost it can save in comparison with running a single-fidelity optimization with only the HF model. Since the plotted values are normalized costs, values below $1.0$ mean the MF optimization saves cost compared with single-fidelity optimization, and vice versa. Since the $x$-axis is the $\lambda$ value that indicates the actual fidelity of the LF model, the left part of the curve characterizes the technique's performance when the LF model has relatively low fidelity, and the right part characterizes its performance when the LF model has relatively high fidelity. If one wants a single value to summarize the technique's performance, one option is to mimic how the area under the curve (AUC) is computed from an ROC curve in the machine learning field and calculate a similar area under the savings curve. This area represents the technique's average savings performance across the different scenarios represented by different $\lambda$ values.
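The single-number summary described above can be computed by trapezoidal integration of the savings curve over $\lambda$, analogous to the AUC of an ROC curve; the following is a minimal sketch of that calculation.

```python
import numpy as np

def savings_auc(lams, mu_c):
    # Area under the savings curve over the lambda axis; values below 1.0
    # indicate average cost savings relative to the benchmark.
    return float(np.trapz(mu_c, lams))

lams = np.linspace(0.0, 1.0, 21)
mu_c = np.ones_like(lams)        # a technique that always matches the benchmark
auc = savings_auc(lams, mu_c)    # ~1.0: no saving on average
```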

Plotting $(\mu_y(i),\ \sigma_y(i))$ against $\lambda^{(i)}$ yields what we call the optimality curve, which characterizes the given technique's performance in terms of the quality of the optimum point it can find during the optimization. This curve is constructed in a similar way to the savings curve, so its interpretation is also similar. As discussed before, in most cases the optimization budget parameter $c_b$ is set to a large value, so that the tested MF technique is expected to converge at a global or local optimum before it runs out of budget during the optimization. In these cases, the optimum found by the MF technique should be close to the benchmark optimum, provided the MF technique has decent convergence properties. As a result, the optimality curves in these cases should be relatively flat with values close to zero, and the savings curve is more informative about the technique's performance. In other cases where the technique is tightly constrained by the budget $c_b$ and runs out of budget in most test runs, the optimality curve is unlikely to stay flat and will better characterize the performance of the technique.

Note that the above testing process is based on the common prototypical situation with bilevel fidelity models and no prior knowledge about the fidelity level of the LF model. This process is good for general-purpose testing because it has a relatively standard structure, but one can easily tailor it to other situations, such as those where multiple LF models are available or the LF model's actual fidelity is known to a certain degree. Furthermore, this testing framework can be extended to test MF-model-based techniques intended for application types other than optimization. The savings curve can still be applied for performance characterization, and one only needs to modify the definition of the optimality curve according to the relevant performance metric in the application.

While our testing framework is mathematically sound, one needs to be careful when using it as the second step in VS, as mentioned before. This is because the third step of VS involves empirical testing of the given (MF) design technique on a practical problem that should be consistent with the test in the second step. While our framework has the parameter $\lambda$ that allows the tester to freely tune the fidelity of the LF model, not all $\lambda$ values correspond to a real-world MF design problem. Conversely, for any specific real-world bilevel MF problem, it may not be easy to precisely identify the corresponding $\lambda$ value in our formulation. We therefore have two recommendations for using our framework in the context of VS. First, our framework is best suited to situations where the tester does not have detailed information about the specific real-world MF problems that the MF techniques will be applied to. In this case, perhaps the best the tester can do is to come up with some contrived test problems, as in the current common practice; our framework stands out by providing a more structured and principled way of formulating such test problems. Second, in situations where the tester does have some specific sample MF problems and wants to generate more similar test problems, one way to estimate the $\lambda$ values corresponding to those problems is as follows. Taking bilevel problems as an example, one first draws sizable samples from the actual LF model. Then, one draws parameterized samples from the model in Eq. (1) with the same $x$ values, keeping $\lambda$ as a parameter while randomizing the $\varphi$ value. That is, one picks a certain number of random $\varphi$ values, plugs those values and the $x$ values into Eq. (1), and obtains $y$ values as functions of $\lambda$. At this point, each sample from the actual LF model corresponds to a certain number (equal to the number of $\varphi$ values) of $\lambda$-parameterized samples from Eq. (1). Then, one can estimate $\lambda$ by minimizing the difference between corresponding samples under a certain metric. This is one basic idea for estimating $\lambda$, and the topic deserves more thorough study in future work.
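The estimation idea above can be sketched as follows. At fixed $x$ and $\varphi$, Eq. (1) is linear in $\lambda$, so each random $\varphi$ draw yields a closed-form least-squares estimate, which we then average. Here `g(x, phi)` stands for the fitted sine part ($A\sum_i \sin(\cdot)+B$) and is assumed given; all names are our own illustrative choices.

```python
import numpy as np

def estimate_lambda(x_samples, y_actual, y_hf, g, n_phi=20, seed=0):
    """Estimate lambda for an actual LF model sampled at x_samples.
    For each random phi, solve min_lam sum_k (y_k - lam*yh_k - (1-lam)*g_k)^2,
    which has the closed form lam = <y-g, yh-g> / <yh-g, yh-g>."""
    rng = np.random.default_rng(seed)
    d = x_samples.shape[1]
    yh = np.array([y_hf(x) for x in x_samples])
    ests = []
    for _ in range(n_phi):
        phi = rng.uniform(0.0, 2.0 * np.pi, size=d)
        gv = np.array([g(x, phi) for x in x_samples])
        num = np.dot(y_actual - gv, yh - gv)
        den = np.dot(yh - gv, yh - gv)
        ests.append(num / den)
    return float(np.clip(np.mean(ests), 0.0, 1.0))
```

Averaging the per-$\varphi$ estimates is one simple choice of metric; minimizing a joint discrepancy over all $\varphi$ draws at once would be an equally valid alternative.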

## 3 Case Studies

In this section, we demonstrate our testing framework with two case studies, where we test some state-of-the-art MF optimization techniques on test problems generated based on given HF models.

### 3.1 Testing Two Gradient-Based Optimization Techniques.

#### 3.1.1 Technique Description and Test Setup.

In the first case study, we test two gradient-based MF optimization techniques that solve problems without considering any given total budget. The first technique, which we call the MF-BFGS technique, modifies the BFGS algorithm by using locally corrected LF models as surrogates for the HF model in the line-search steps of the algorithm [24]. An overview of the MF-BFGS algorithm is shown in Fig. 2. In the MF-BFGS technique, the highlighted line-search step in Fig. 2 is performed using the LF model locally corrected around the current design point. There are multiple ways to do the correction; we use a simple yet effective method that adds a linear corrective term to the LF model and fits the coefficients in the term by matching the LF model and the HF model at the current design point up to first-order derivatives. Note that this technique in its current form can only solve problems with bilevel MF models.
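The first-order additive correction described above can be sketched as follows; the corrected model matches the HF model's value and gradient at the current point $x_0$, and all function names here are our own.

```python
import numpy as np

def corrected_lf(y_lf, y_hf, grad_lf, grad_hf, x0):
    """Return the LF model plus a linear corrective term whose coefficients
    are fitted so the corrected model matches the HF model's value and
    gradient at x0 (zeroth- and first-order matching)."""
    x0 = np.asarray(x0, dtype=float)
    a = y_hf(x0) - y_lf(x0)          # zeroth-order mismatch at x0
    b = grad_hf(x0) - grad_lf(x0)    # first-order mismatch at x0

    def y_corr(x):
        x = np.asarray(x, dtype=float)
        return y_lf(x) + a + np.dot(b, x - x0)

    return y_corr
```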

The second technique was developed for solving multifidelity multidisciplinary design optimization problems, but it can be readily used to solve our single-disciplinary problems [26]. It involves two stages when solving a given problem. In stage 1, the tester quantifies model errors for all the available MF models by comparing each model's output value against that of the HF model. Then, all the models are assessed based on their evaluation costs and errors, as illustrated in Fig. 3. The models that are not Pareto-efficient are dropped, and the remaining models are sorted by their errors. In stage 2, the sorted models are used sequentially across iterations to optimize the current design point, starting with the one with the lowest fidelity, i.e., the largest error. When each model is used for optimization, a vanilla gradient-based algorithm is applied, and the convergence tolerance is proportional to the model's error. In the following, we call this technique the “sequential” technique for simplicity.
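The two-stage procedure can be sketched as follows, with scipy's BFGS standing in for the vanilla gradient-based optimizer and a gradient tolerance proportional to model error as our reading of the error-matched tolerance; `models` is a hypothetical list of (objective, cost, error) triples.

```python
import numpy as np
from scipy.optimize import minimize

def pareto_filter(models):
    """Stage 1: drop models dominated in (cost, error); sort survivors by
    decreasing error so lower-fidelity models are used first."""
    keep = [
        (f, c, e) for f, c, e in models
        if not any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                   for _, c2, e2 in models)
    ]
    return sorted(keep, key=lambda t: -t[2])

def sequential_optimize(models, x0, tol_scale=1.0):
    """Stage 2: optimize with each surviving model in turn, with a
    convergence tolerance proportional to that model's error."""
    x = np.asarray(x0, dtype=float)
    for f, _, e in pareto_filter(models):
        tol = max(tol_scale * e, 1e-8)
        x = minimize(f, x, method="BFGS", options={"gtol": tol}).x
    return x
```

Because the HF model has zero error, it always survives the Pareto filter and is used last with the tightest tolerance, which is what guarantees the final point's local optimality with respect to the HF model.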

Both techniques start the optimization from a single design point, and the final point is only guaranteed to satisfy local optimality. In our case study, we run these techniques with multiple randomly generated starting points so that both techniques can roughly achieve global optimality.

Regarding the test problem setup, we use the Branin function as the HF model. We use 21 $\lambda$ values that linearly span $[0,1]$ and 20 $\varphi$ values randomly sampled in $[0,2\pi]^2$. The value of the budget parameter $c_b$ is irrelevant, as neither technique considers the budget. Regarding the specification of the LF model cost, we test both standard scenarios introduced in Sec. 2.2: (1) the cost is fixed at $1\%$ of the HF model cost, and (2) the cost is $\lambda$ times the HF model cost. We refer to these as scenarios 1 and 2 in the following. The benchmark technique is a single-fidelity (HF model only) version of the sequential technique.

#### 3.1.2 Test Results and Discussion.

The results of the scenario 1 tests are shown in Figs. 4–6. Figure 4 shows the savings curves plotted for both techniques as described in Sec. 2.2. The two curves are overall close to each other, which means both techniques have similar performances in terms of the cost of solving the given optimization problems. Figure 5 shows the optimality curves for both techniques. Both curves are very flat and hover close to the dashed benchmark line (note the $y$-axis range is very small), which means that in most test cases both techniques are able to find an optimum close to that found by the benchmark single-fidelity optimization. Figure 6 provides the additional information of the number of model evaluations incurred over the test cases. This information is generally not essential for characterizing the performance of tested techniques; we plot it just to provide a bit more insight into the behaviors of the techniques. It can be seen that the MF-BFGS technique evaluates the LF model more when the $\lambda$ value is smaller, that is, when the fidelity of the LF model is lower. While this does not affect the savings curves much due to the low cost of the LF model, we will see that this behavior can have a significant effect on the savings curve when the LF model's cost is specified differently in the other test scenario. In contrast, with a bit of care, one can observe that the sequential technique evaluates the LF model more when the $\lambda$ value is larger and the LF model has higher fidelity. This is somewhat more intuitive, as the LF model is more useful when its fidelity is higher, given that its cost remains fixed.

It is worth noting that the savings curve results in this case illustrate the pitfall of the traditional way of testing MF techniques. In Fig. 4, if we look at the cases where $\lambda$ is $0.7$, the cost values of the MF-BFGS technique are roughly $10\%$ smaller than those of the sequential technique, which suggests MF-BFGS is the better performing technique. On the other hand, if we look at the cases where $\lambda$ is close to $0.0$ or $1.0$, the cost values of MF-BFGS are significantly larger than those of the sequential technique, which suggests the sequential technique is better. Based on these observations, one could argue that either technique outperforms the other by focusing on the test cases that yield results in favor of the argument and ignoring the other cases. In our test framework, by testing case problems covering the full range of $\lambda$ values and visualizing the results with the savings curve, one obtains a much more comprehensive picture of each technique's performance and is much less vulnerable to the pitfall observed above.

The results of the scenario 2 tests are shown in Figs. 7–9, analogous to Figs. 4–6. Although in scenario 2 the costs of the LF models differ from those in scenario 1, this has no influence on how either tested technique solves the test problems, due to how the costs are used in the techniques' algorithms. Therefore, the results in Figs. 8 and 9 are almost identical to those in Figs. 5 and 6. Only Fig. 7 shows different results from Fig. 4. It is clear in Fig. 7 that the MF-BFGS technique's savings curve is substantially above that of the sequential technique, which means MF-BFGS consistently incurred more cost than the sequential technique and thus performed worse. This is caused by the fact that in scenario 2, for most case problems, the LF model's cost is much larger than in the corresponding cases in scenario 1. As discussed for Fig. 6, the MF-BFGS technique evaluates the LF model more than the sequential technique does. Therefore, given that both techniques have similar cost performances in scenario 1, the MF-BFGS technique is expected to have worse cost performance in scenario 2.

Another interesting observation from Fig. 7 is that the sequential technique's savings curve aligns closely with the dashed line, which indicates the cost of the benchmark single-fidelity optimization. This means that under the LF model cost assumption in scenario 2, regardless of the fidelity level of the LF model, the sequential technique cannot save any cost by utilizing the LF model during the optimization, although it does not incur much extra cost, either. In comparison, in Fig. 4, the right tail of the sequential technique's savings curve falls below the dashed line, which means that when the fidelity level of the LF model is high enough, the sequential technique can indeed save cost compared with the benchmark single-fidelity technique. While we are not sure whether the sequential technique is the best possible technique for solving these test problems, we do know that any LF model becomes too costly to be worth using for saving cost once its cost exceeds a certain threshold. The fact that the sequential technique's savings curve aligns so well with the benchmark cost line prompts us to conjecture that the cost assumption in scenario 2 may be the critical case in which it is generally impossible for any technique to save cost on these case problems. It would be interesting to see future theoretical or experimental studies of this conjecture. If it holds, this cost assumption deserves to become a standard setting in our test framework.

### 3.2 Testing Two Bayesian Optimization Techniques.

In the second case study, we test two Bayesian optimization techniques that consider total budget when solving MF optimization problems.

#### 3.2.1 Technique Description and Test Setup.

Both techniques in this case study come from a recently proposed technique called multifidelity output space entropy search for multi-objective optimization (MF-OSEMO) [35]. MF-OSEMO belongs to the MF BO technique family and was developed to solve multi-objective problems. Our test problems are single-objective, but MF-OSEMO can be directly applied to solve them. In the following, we describe the architecture of MF BO techniques for solving single-objective problems to give readers an idea of how MF-OSEMO works.

Figure 10 illustrates the essential architecture of MF BO techniques for problems with two fidelity models. First, initial sample data are drawn from each fidelity model, and a GP model is fitted to each model's data. Then, optimization iterations are run until some convergence criterion is met. Each iteration contains two steps. Step 1 constructs an AF for each fidelity model that quantifies the merit of drawing a new sample at any input location of that model. The construction of these AFs is usually the focus of research interest, and MF-OSEMO constructs its AFs based on output space entropy theory. Step 2 finds the global optimum of the AFs to decide which model to sample and at what input location, assuming the samples are drawn one at a time. The new data are then used to update the GP models and check convergence.
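The two-step loop above can be sketched as follows. This is an illustrative skeleton only: `models[0]` is taken as the HF model, the candidate set is a simple random sample, and the AF is a crude hypothetical cost-penalized lower-confidence bound stand-in, not MF-OSEMO's entropy-based AF; all function and parameter names are our own.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def mf_bo(models, costs, budget, bounds, n_init=3, seed=0):
    """Skeleton of a two-fidelity BO loop: fit a GP per fidelity,
    build an AF per fidelity, sample where the AF is best, repeat.
    The AF is an illustrative stand-in, not MF-OSEMO's entropy AF."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    d = len(bounds)
    # Initial samples drawn from each fidelity model
    X = [rng.uniform(lo, hi, size=(n_init, d)) for _ in models]
    Y = [np.array([m(x) for x in Xf]) for m, Xf in zip(models, X)]
    spent = n_init * sum(costs)
    while spent + min(costs) <= budget:
        # Refit one GP per fidelity on the accumulated data
        gps = [GaussianProcessRegressor(RBF(), normalize_y=True).fit(Xf, Yf)
               for Xf, Yf in zip(X, Y)]
        # Steps 1-2: evaluate AFs on random candidates, pick best (model, x)
        cand = rng.uniform(lo, hi, size=(256, d))
        best_af, choice = np.inf, None
        for f, gp in enumerate(gps):
            mu, sd = gp.predict(cand, return_std=True)
            af = mu - 2.0 * sd + costs[f]  # cost-penalized LCB (assumption)
            j = int(np.argmin(af))
            if af[j] < best_af:
                best_af, choice = af[j], (f, cand[j])
        f, x = choice
        X[f] = np.vstack([X[f], x])
        Y[f] = np.append(Y[f], models[f](x))
        spent += costs[f]
    # Report the best HF sample observed
    i = int(np.argmin(Y[0]))
    return X[0][i], float(Y[0][i])
```

The loop stops once the remaining budget cannot afford even the cheapest model evaluation, mirroring the budget-aware behavior of the tested techniques.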

While we do not expand on the details of the AF used in the MF-OSEMO technique, we note at a high level that this AF requires approximations to compute certain quantities efficiently, and the technique offers two approximation approaches: one named "truncated Gaussian" (TG) and the other named "numerical integration" (NI). We therefore treat MF-OSEMO with these two approximation approaches as two distinct optimization techniques and test them comparatively in this case study, referring to them as the NI and TG techniques for brevity.

The test problem setup in this case study is similar to that in the previous one. The HF model, $\lambda$ values, and $\varphi$ values are exactly the same as before. The budget parameter $c_b$ is now set to $15.0$, i.e., 15 times the cost of the HF model. Regarding the LF model cost, we test only one standard scenario: the cost is fixed at $10\%$ of the HF model cost. The benchmark technique uses a standard gradient-based optimization algorithm with multiple starting points; the optimization runs at the starting points are allocated equal maximum numbers of model evaluations so that the total cost is at most the total budget.
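The benchmark's budget-splitting rule can be sketched as follows. The optimizer choice (L-BFGS-B), the number of starts, and all names are assumptions for illustration; the paper does not specify the benchmark's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def benchmark_multistart(y_hf, bounds, c_hf=1.0, c_b=15.0, n_starts=3, seed=0):
    """Multi-start gradient-based benchmark sketch: the HF-evaluation
    budget c_b / c_hf is split equally among the starting points by
    capping each run's function evaluations via maxfun."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    max_evals = int(c_b / c_hf) // n_starts  # equal evaluation quota per start
    best = np.inf
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)  # random starting point within bounds
        res = minimize(y_hf, x0, method="L-BFGS-B", bounds=bounds,
                       options={"maxfun": max_evals})
        best = min(best, float(res.fun))
    return best
```

With $c_b = 15$, $c_{HF} = 1$, and three starts, each run is capped at five HF evaluations, which illustrates why such a tight budget can exhaust before convergence.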

#### 3.2.2 Test Results and Discussion.

The results are shown in Figs. 11–13. Figure 11 plots the savings curves for the two techniques. It is clear that both curves are very close to the dashed benchmark line, which means in all cases the techniques incurred roughly the same amount of cost as the benchmark technique. In fact, our setting of the total budget is quite tight for the given problem, so all the techniques, including the benchmark one, ran out of budget before finding a convergent solution. For this reason, this case study exemplifies situations where the savings curves cannot differentiate the performances of tested techniques and one needs to rely more on other plots for performance characterization. Figure 12 plots the optimality curves for the tested techniques, and we are able to differentiate the two techniques in this plot. Both optimality curves in Fig. 12 are relatively flat, but the TG technique's curve is overall lower than the other technique's, which means the TG technique found better solutions in most case problems regardless of the fidelity of the LF model. Considering that TG's optimality curve is also close to the benchmark dashed line, one can further argue that TG performs as well as the benchmark technique. Synthesizing the savings curves and the optimality curves, one can easily see that the TG technique outperforms the NI technique in this case study.
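The aggregation behind the plotted curves can be sketched as follows: each row of the result arrays corresponds to one $\lambda$ value (index $i$) and each column to one $\varphi$ value (index $j$), and the curves plot the per-row means with standard-deviation bands, matching the $\mu_c$, $\sigma_c$, $\mu_y$, $\sigma_y$ definitions in the Nomenclature. The 2-D array layout is an assumption of this minimal sketch.

```python
import numpy as np

def aggregate_curves(c, y_opt):
    """Collapse per-problem results into curve points. Row i of each
    m-by-n array corresponds to one lambda value, column j to one phi
    value. Returns (mu_c, sigma_c) for the savings curve and
    (mu_y, sigma_y) for the optimality curve."""
    c = np.asarray(c, dtype=float)
    y_opt = np.asarray(y_opt, dtype=float)
    mu_c, sigma_c = c.mean(axis=1), c.std(axis=1)      # cost stats per lambda
    mu_y, sigma_y = y_opt.mean(axis=1), y_opt.std(axis=1)  # optimum stats
    return (mu_c, sigma_c), (mu_y, sigma_y)
```

Plotting $\mu_c$ against $\lambda$ alongside the benchmark cost line gives the savings curve; plotting $\mu_y$ against $\lambda$ alongside the benchmark optimum gives the optimality curve.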

Like before, we also plotted the numbers of model evaluations for both techniques in Fig. 13 to make deeper sense of the results. One may not be surprised to see that the numbers of model evaluations are relatively flat for the underperforming NI technique, as its inability to detect the actual fidelity of the LF model aligns with its poor optimization performance indicated by the optimality curves. However, it is somewhat counterintuitive that the well-performing TG technique evaluated the LF model more when the fidelity of the LF model is lower ($\lambda$ is smaller). Given that the cost of the LF model is fixed across the case problems, if TG could better detect the actual fidelity of the LF model, one would expect it to allocate fewer evaluations to the LF model when its fidelity is low, and vice versa. The fact that the results are opposite to this expectation strongly suggests one possibility: the technique overestimated the information gain of sampling the LF model when its fidelity is low. Verifying this conjecture and fully understanding why this happened, if it did, requires detailed analysis of the technique itself and is hence beyond the scope of this paper. We do note, however, that when the LF model's fidelity is low, due to our formulation in Eq. (1), the LF model's profile is very close to a multivariate sinusoidal function. This may affect the efficacy of using a GP model to quantify the uncertainty of the LF model given limited data. Ideally, when the LF model's fidelity is low, it should have a more random profile than a sinusoid with random phase as in our formulation. To our knowledge, however, pursuing that case may be too complicated to be practical, and we leave it for future work.
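The low-fidelity limit discussed above can be illustrated with one plausible form of the LF-generating formula: a $\lambda$-blend of the HF model and a random-phase sinusoid. This is an assumption consistent with the Nomenclature ($A$: magnitude factor, $B$: mean value factor, $\varphi_i$: per-dimension phases), not a reproduction of the paper's exact Eq. (1).

```python
import numpy as np

def make_lf_model(y_hf, lam, phi, A=1.0, B=0.0):
    """Plausible sketch of the LF-generating formula (NOT the paper's
    exact Eq. (1)): blend the HF model with a random-phase sinusoid.
    lam controls fidelity; phi holds per-dimension random phases."""
    phi = np.asarray(phi, dtype=float)

    def y_lf(x):
        x = np.asarray(x, dtype=float)
        # Random-phase sinusoidal "error" term, scaled and shifted
        sinusoid = A * np.sum(np.sin(x + phi)) + B
        return lam * y_hf(x) + (1.0 - lam) * sinusoid

    return y_lf
```

At $\lambda = 1$ the LF model reproduces the HF model exactly; at $\lambda = 0$ it degenerates to the pure sinusoid, the low-fidelity regime where, as noted above, a GP model may misjudge the LF model's uncertainty.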

## 4 Conclusion

In this work, we propose a new framework for testing MF design optimization techniques more comprehensively and rigorously than current practice. The framework consists of two novel components. The first is a method to generate MF models leveraging GP modeling. The model-generating formula has two control parameters that allow continuous control of the fidelity of the generated MF models and flexible model randomization. Compared with the existing practice of generating MF models, the new method enables testers to avoid the pitfall of generating unrepresentative or biased models. The second component is a testing process involving systematic test case generation and result organization. The testing process produces savings curves and optimality curves that characterize a tested MF optimization technique's performance much more comprehensively than traditional testing practice does. The advantages of the new framework are demonstrated with two case studies, each testing two state-of-the-art MF optimization techniques.

At least three avenues of future work can be pursued based on this new framework. First, in our MF model generation formula, a sinusoidal function is used as the random "noise" or "error" of the LF model; other formulation choices for this term may be explored to enhance the efficacy or generality of the model generation method. Second, we proposed two relatively standard ways of structuring the LF model cost in the case problems generated during the testing process, but the justification for these cost structures versus other possible ones warrants further study. Third, while the current framework is developed for testing design optimization techniques, most parts of it, particularly the MF model generation method, are largely independent of the specific techniques being tested. Hence, the framework has the potential to be extended to other MF application scenarios to fully exploit the MF model generation method that underlies it.

## Data Availability Statement

The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.

## Nomenclature

- $A$ =
magnitude factor for balancing the terms in the model-generating formula

- $B$ =
mean value factor for balancing the terms in the model-generating formula

- $c_b$ =
total budget for the evaluation cost in the formulated test problem

- $c_{bench}$ =
benchmark total cost for solving a single-fidelity optimization problem using a default optimization technique

- $c_{HF}$ =
evaluation cost of the high-fidelity model in the formulated test problem

- $c_{LF}$ =
evaluation cost of the low-fidelity model in the formulated test problem

- $c^{(i,j)}$ =
the total cost incurred by the tested technique for solving the test problem indexed by $(i,j)$

- $d$ =
number of input variables for general models

- $f_i$ =
length-scale parameter for the $i$th input dimension in the Gaussian process model fitted to the sample points drawn from the high-fidelity model

- $f_i^{(0)}$ =
roughness parameter along each dimension of the Gaussian process model fitted to the prototype model $y_0(x)=\sum_{i=1}^{d}\sin(x_i)$

- $m$ =
number of $\lambda$ parameter values used for formulating the test problem suite

- $n$ =
number of $\varphi$ parameter values used for formulating the test problem suite

- $x$ =
vector of input variables for general models

- $y$ =
response variable for general models

- $y_{bench}$ =
benchmark optimum response value obtained by the default optimization technique when solving a single-fidelity optimization problem

- $y_{HF}(\cdot)$ =
the scalar-output high-fidelity model in a given problem

- $y_{LF-\lambda}(\cdot;\varphi)$ =
generated low-fidelity model with parameter $\lambda$ and vector $\varphi$

- $y_{std}$ =
standard deviation of the high-fidelity $y$ values of a large set of uniformly sampled data in the design space

- $y_{opt}^{(i,j)}$ =
the optimum response value obtained by the tested technique in the test problem indexed by $(i,j)$

- $\lambda$ =
parameter controlling the fidelity of the generated low-fidelity model

- $\lambda^{(i)}$ =
the $i$th $\lambda$ value used for formulating the test problem suite

- $\mu_c^{(i)}$ =
the mean of the cost data $c^{(i,j)}$ aggregated by index $i$ (i.e., $i$ is fixed)

- $\mu_y^{(i)}$ =
the mean of the response data $y_{opt}^{(i,j)}$ aggregated by index $i$

- $\sigma_c^{(i)}$ =
the standard deviation of the cost data $c^{(i,j)}$ aggregated by index $i$

- $\sigma_y^{(i)}$ =
the standard deviation of the response data $y_{opt}^{(i,j)}$ aggregated by index $i$

- $\varphi_i$ =
phase parameter for the $i$th input dimension for randomization purposes

- $\varphi^{(j)}$ =
the $j$th $\varphi$ value used for formulating the test problem suite

## References

*Decision Making in Engineering Design*, ASME Press, New York, pp.