## Abstract

Bayesian optimization (BO) is an effective surrogate-based method that has been widely used to optimize simulation-based applications. While the traditional Bayesian optimization approach only applies to single-fidelity models, many realistic applications provide multiple levels of fidelity with varying computational complexity and predictive capability. In this work, we propose a multi-fidelity Bayesian optimization method for design applications with both known and unknown constraints. The proposed framework, called sMF-BO-2CoGP, is built on a multi-level CoKriging method to predict the objective function. An external binary classifier, which we approximate using a separate CoKriging model, is used to distinguish between feasible and infeasible regions. The sMF-BO-2CoGP method is demonstrated using a series of analytical examples and a flip-chip design optimization application, where the objective is to minimize the deformation due to warping under thermal loading conditions.

## 1 Introduction

High-fidelity engineering models are frequently utilized to predict quantities of interest, such as properties or performance, with respect to a specific design. These predictions are then fed back into the design process to find a better design that outperforms the previous ones by changing the design parameters. This process, often called design optimization, is ubiquitous in industrial settings. Simulation-based optimization is challenging due to the tremendous computational cost associated with high-fidelity models. However, for many practical applications, multiple models of varying fidelity can be developed, and a multi-fidelity optimization framework, such as Bayesian optimization (BO), can then be applied to optimize the objective function at the highest level of fidelity at a reduced computational cost by leveraging lower-fidelity data.

Multi-fidelity methods provide an effective framework to reduce the computational cost in optimization and uncertainty quantification by leveraging the correlations with lower-fidelity models to reduce the computational burden on high-fidelity models. Multi-fidelity frameworks are particularly practical for engineering simulation-based applications, such as computational fluid dynamics and solid mechanics problems, because most of these involve discretizations (spatial and/or temporal), where a finer discretization typically corresponds to a higher level of fidelity and a coarser discretization corresponds to a lower level of fidelity.

Incorporating physical and practical constraints into the optimization formulation is also a critical task. Digabel and Wild [1] proposed the QRAK taxonomy to classify constrained optimization problems. In engineering settings, constraints arise from multiple sources and many problems require both known and unknown constraints to be incorporated into the formulation. Constraints are *known* if the feasibility of the input can be determined directly from the input sampling location, i.e., without actually evaluating the model. Such known constraints are often formulated as a set of inequalities. On the other hand, constraints are *unknown* if the feasibility of the input must be evaluated indirectly by evaluating the model. Common examples of unknown constraints are ill-conditioning induced by the parameters, singularity in the design, and mesh generation problems. These constraints are implicitly imposed and feasibility cannot be determined without evaluating the computational model.

Gaussian process (GP) methods provide an efficient framework to model a response surface that approximates the objective function for single-fidelity formulations. In the traditional BO approach, an acquisition function *a*(** x**) is constructed based on a utility function, which rewards the BO method if the new sampling location outperforms the rest. The acquisition function is typically constructed based on the posterior mean and posterior variance of the GP. Because of its flexibility, many extensions of the traditional BO framework have been proposed to solve other optimization problems, including both constrained and multi-fidelity problems. Incorporating constraints is a well-studied subject in the context of BO methods and typically involves adopting a penalty scheme to penalize the infeasible sampling locations that do not satisfy all of the constraints. Multi-fidelity BO problems are more complicated to deal with. To generalize to multiple levels of fidelity, one needs to consider the correlation between the levels of fidelity of the objective function and fuse the predictions across all levels of fidelity. For example, Kennedy and O'Hagan [2] proposed an auto-regressive approach that links each lower-fidelity level to the next higher-fidelity level via a linear regression between the two levels. The terms CoGP and CoKriging are used interchangeably in this work to describe the recursive auto-regressive GP model. Because constrained problems have been relatively well studied, we focus the literature review in Sec. 2 on multi-fidelity GP.

In this work, we develop a sequential constrained multi-fidelity method, sMF-BO-2CoGP, as an extension of sBF-BO-2CoGP [3], using a CoKriging approach to approximate the objective function at the highest level of fidelity. The known constraints are implemented by directly penalizing the acquisition function at infeasible input sampling locations. The unknown constraints are adaptively learned using another CoKriging model, which acts as a probabilistic binary classifier. The unknown-constrained acquisition function is conditioned on this predicted probability mass function, in addition to the penalty scheme for known constraints. The optimal location for the next sample is determined by maximizing the constrained acquisition function. Next, an uncertainty reduction scheme, where uncertainty is measured by the integrated mean-square error, is proposed to determine the appropriate level of fidelity to evaluate. Compared to the maximum mean-square error criterion, the integrated mean-square error is demonstrated to be more robust and efficient.

This paper is an invited extension of the conference paper presented at the ASME IDETC-CIE 2019 conference (August 18–21, 2019) in Anaheim, CA [3]. The main difference is that this paper generalizes our previous work [3] from bi-fidelity to multi-fidelity problems. The remainder of this paper is organized as follows. Section 2 provides a brief introduction to the BO method. Section 3 describes the multi-fidelity sMF-BO-2CoGP method proposed in this paper, including the constrained acquisition function and the fidelity selection criterion. Section 4 demonstrates the application of the proposed sMF-BO-2CoGP methodology using several analytical examples and an engineering application in designing a flip-chip package based on a finite element model. Section 5 discusses the results, and Sec. 6 concludes the paper.

## 2 Related Work

Let *f* denote a function of ** x**, where $x \in \mathcal{X}$ is a *d*-dimensional input and *y* is the observation. The optimization considered in this paper is formulated as

$$x^* = \operatorname*{argmin}_{x \in \mathcal{X}} f(x) \quad \text{subject to} \quad g_j(x) \leq 0, \quad j = 1, \ldots, J,$$

where *J* is the number of inequality constraints.

### 2.1 Gaussian Process.

Assume a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where *n* is the number of observations and $x \in \mathcal{X}$ is the *d*-dimensional input. A GP regression approach assumes that ** f** = *f*_{1:n} is jointly Gaussian, and that the observation *y* is normally distributed given *f*,

$$f \mid X \sim \mathcal{N}(m, K), \qquad y \mid f, \sigma^2 \sim \mathcal{N}(f, \sigma^2 I),$$

where *m*_{i} ≔ *μ*(*x*_{i}) and *K*_{i,j} ≔ *k*(*x*_{i}, *x*_{j}). The covariance matrix ** K** depends on the covariance between inputs. At an unknown sampling location ** x***, the predicted response is described by a posterior Gaussian distribution, conditioned on the observations at *x*_{1:n}, with posterior mean and variance

$$\mu(x^*) = m(x^*) + k(x^*)^{\top} (K + \sigma^2 I)^{-1} (y - m), \qquad \sigma^2(x^*) = k(x^*, x^*) - k(x^*)^{\top} (K + \sigma^2 I)^{-1} k(x^*).$$

The classical GP formulation assumes a stationary covariance matrix, which only depends on the distance $r = \lVert x - x' \rVert$, where $\lVert \cdot \rVert$ is usually the classical L^{2}-norm, but other choices, e.g., the L^{1}-norm and weighted variants, have also been explored. Common kernels for GP include the squared exponential (RBF) and the Matérn family [5].
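As a concrete illustration of the posterior formulas above, the following minimal NumPy sketch (our own illustration, not the paper's implementation) computes the zero-mean GP posterior with a squared exponential kernel in one dimension; the training points and hyper-parameters are arbitrary choices:

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=0.2, variance=1.0):
    """Squared exponential (RBF) kernel k(x, x') = s^2 * exp(-r^2 / (2 l^2))."""
    r2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * r2 / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-8):
    """Posterior mean and variance of a zero-mean GP at the test locations."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mu = K_s.T @ alpha                  # posterior mean at X_test
    v = np.linalg.solve(K, K_s)
    var = np.diag(K_ss - K_s.T @ v)     # posterior variance at X_test
    return mu, var

# toy data: with negligible noise, the posterior interpolates the observations
X = np.array([0.0, 0.4, 0.6, 1.0])
y = np.sin(2.0 * np.pi * X)
mu, var = gp_posterior(X, y, np.array([0.4, 0.5]))
```

At a training point the posterior mean matches the observation and the variance collapses toward the noise level, while between training points the variance remains strictly positive.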

### 2.2 Multi-Fidelity CoKriging.

One of the advantages of CoKriging is that it can exploit the correlation between the low- and high-fidelity models and improve the prediction at the high-fidelity level by adding more low-fidelity training data points. If the computational costs of evaluation at the high- and low-fidelity levels differ significantly, this advantage offers a reduction in the number of training data points, thus increasing the efficiency of the optimization. Kennedy and O'Hagan [2] proposed an auto-regressive model that couples all levels of fidelity together. Le Gratiet and Garnier [8] proposed a nested scheme $D_1 \subset D_2 \subseteq \cdots \subseteq D_s$ to decouple *s* levels of fidelity into *s* standard GP regressions, where a GP is used to model the discrepancy between two consecutive levels of fidelity. Karniadakis et al. [9–11] employed the same method to approximate the highest level of fidelity and extended it to noisy evaluations. Perdikaris et al. [12] proposed a generalized framework that can model nonlinear and space-dependent cross-correlations between models of variable fidelity. The multi-fidelity Bayesian optimization approach is sometimes known as multi-information source optimization [13] or multi-task Bayesian optimization [14]; these are all closely related to each other. For example, Ghoreishi and Allaire have proposed several approaches to solve the multi-information source optimization problem in the context of constraints [15], the knowledge-gradient acquisition function [16], a Monte Carlo-based approach [17], and applications to computational micromechanics [18]. In this paper, we follow the formulation of Xiao et al. [19] in developing a multi-fidelity CoKriging framework due to its simplicity and its relaxation of the nested requirement, compared to Le Gratiet and Garnier [8] and Perdikaris et al. [12].

The multi-fidelity model with *s* levels of fidelity can be written as an auto-regressive model [19],

$$f_t(x) = \rho_{t-1} f_{t-1}(x) + \delta_t(x), \quad t = 2, \ldots, s,$$

where *s* is the high-fidelity level, the remaining (*s* − 1) levels correspond to the low-fidelity levels, and the ρ_{t}'s are the scaling factors. Two important assumptions are typically made. First, the discrepancy δ_{t}(** x**) is assumed to be independent of *f*_{t−1}(** x**). Second, the discrepancies across the (*s* − 1) low-fidelity levels are uncorrelated. The joint covariance matrix across all *s* levels of fidelity is then assembled blockwise from the kernels of the individual levels, where σ_{t} is the intrinsic variance of noisy observations at the *t*th level of fidelity. The hyper-parameters $\{\theta_t\}_{t=1}^{s}$ are obtained by maximizing the likelihood function at each fidelity level, and the hyper-parameters θ_{δ} that correspond to the discrepancy are obtained by maximizing the likelihood function of the discrepancy GP.

If the number of fidelity levels is two (*s* = 2), then the conventional bi-fidelity CoKriging framework is conveniently recovered from the multi-fidelity CoKriging. The dataset $D$ is divided into $D_c$ and $D_e$, corresponding to the cheap and expensive datasets, respectively. The bi-fidelity formulation is closely related to Couckuyt et al. [20–22] and Forrester et al. [23]. Following the auto-regressive scheme described above, the first GP models the low-fidelity response {*x*_{c}, *y*_{c}}, whereas the second GP models the discrepancy δ(** x**) between the high- and low-fidelity models.
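A minimal sketch of this two-level auto-regressive structure, under simplifying assumptions not made in the paper (zero-mean GPs, fixed hyper-parameters, and a scalar ρ estimated by least squares instead of maximum likelihood as in [19,21]):

```python
import numpy as np

def rbf(X1, X2, ls=0.2):
    """Squared exponential kernel (unit variance, fixed length scale)."""
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ls**2)

def gp_fit_predict(X, y, Xq, noise=1e-8):
    """Zero-mean GP posterior mean at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(X, Xq).T @ np.linalg.solve(K, y)

def cokriging_predict(Xc, yc, Xe, ye, Xq):
    """Two-level auto-regressive prediction f_H(x) ~= rho * f_L(x) + delta(x)."""
    mu_c_at_e = gp_fit_predict(Xc, yc, Xe)            # low-fidelity mean at expensive points
    rho = (mu_c_at_e @ ye) / (mu_c_at_e @ mu_c_at_e)  # least-squares scaling factor
    delta = ye - rho * mu_c_at_e                      # discrepancy observations
    mu_delta = gp_fit_predict(Xe, delta, Xq)          # GP on the discrepancy
    mu_c = gp_fit_predict(Xc, yc, Xq)
    return rho * mu_c + mu_delta                      # fused high-fidelity prediction

# toy example: the cheap model is a scaled, shifted version of the expensive one
Xc = np.linspace(0.0, 1.0, 9)
Xe = np.array([0.1, 0.5, 0.9])
fH = lambda x: np.sin(2.0 * np.pi * x)
fL = lambda x: 0.5 * np.sin(2.0 * np.pi * x) + 0.2
pred = cokriging_predict(Xc, fL(Xc), Xe, fH(Xe), Xe)
```

At the expensive training points the fused prediction reproduces the high-fidelity observations, since the discrepancy GP interpolates the residuals there.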

The kernel *k*(** x**) and the covariance matrix ** K**(** x**) are then updated [19,21] accordingly. The hyper-parameters θ_{c} are obtained by maximizing the likelihood function at the lower-fidelity level, and the hyper-parameters θ_{e} are obtained along with ρ, again by maximizing the likelihood function, this time at the higher-fidelity level.

### 2.3 Acquisition Function.

The next sampling location ** x*** is obtained by maximizing the acquisition function,

$$x^* = \operatorname*{argmax}_{x \in \mathcal{X}} a(x),$$

where *a*(** x**) denotes the acquisition function. The acquisition function is deeply connected to the utility function, which corresponds to the rewarding scheme for BO methods: the method is rewarded if the next sampling point outperforms the other sampling locations.

Three acquisition functions are widely used: the probability of improvement (PI), the expected improvement (EI), and the upper confidence bound (UCB), but other forms also exist, for example, entropy-based approaches such as GP-PES [24–26], GP-ES [27], GP-EST [28], and GP-EPS [29].
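The three standard acquisition functions admit closed forms in terms of the posterior mean μ(x) and standard deviation σ(x). A sketch for the minimization setting (the trade-off parameters κ and ξ are illustrative defaults, not values from this paper):

```python
import math

def _phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def acquisitions(mu, sigma, y_best, kappa=2.0, xi=0.01):
    """PI, EI, and a confidence-bound score at one candidate (minimization)."""
    sigma = max(sigma, 1e-12)                   # guard against zero posterior variance
    z = (y_best - mu - xi) / sigma
    pi = _Phi(z)                                # probability of improvement
    ei = (y_best - mu - xi) * _Phi(z) + sigma * _phi(z)   # expected improvement
    lcb = mu - kappa * sigma                    # lower confidence bound (to be minimized)
    return pi, ei, lcb

pi0, ei0, lcb0 = acquisitions(mu=0.0, sigma=1.0, y_best=0.0, xi=0.0)
```

When the posterior mean equals the incumbent best, PI is exactly 0.5 and EI reduces to σ times the standard normal density at zero.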

## 3 Methodology

In this section, we describe the sMF-BO-2CoGP method for solving the multi-fidelity optimization problem formulated in Sec. 2.

### 3.1 Constraints.

The known constraints are handled by multiplying the acquisition function *a*(** x**) with an indicator function $I(x)$, where $I(x) = 1$ if ** x** satisfies all of the known constraints and $I(x) = 0$ otherwise.

To handle the unknown-constrained problem, an external binary probabilistic classifier is employed to predict the probability of feasibility. Practically speaking, the choice of classifier is up to the user. Some examples are *k*-NN [41], AdaBoost [42], RandomForest [43], support vector machine [44], least squares support vector machine (LSSVM) [45], GP [46], and convolutional neural network [47]. One notable choice is the GP classifier, which performs relatively well compared with other binary classifiers. In sMF-BO-2CoGP, another CoGP is adopted as a binary classifier to predict the probability of feasibility of the sampling location under consideration.

At a sampling location ** x**, the coupled binary classifier predicts a probability of feasibility based on the training dataset, where the probability of being feasible is *Pr*(clf(** x**) = 1), whereas the probability of being infeasible is *Pr*(clf(** x**) = 0) = 1 − *Pr*(clf(** x**) = 1). Again, we condition the sampling point on this predicted probability mass function by assigning zero value to the probability of being infeasible. Taking the expectation of the acquisition function conditioned on this probability mass function results in a new acquisition function, which can be written in a product form as

$$a^*(x) = a(x) \cdot I(x) \cdot Pr(\mathrm{clf}(x) = 1).$$

Maximizing *a**(** x**) yields the next sampling location of sMF-BO-2CoGP.
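The constrained acquisition reduces to a simple gating computation; a minimal sketch with hypothetical candidate values:

```python
def constrained_acquisition(a, known_feasible, p_feasible):
    """a*(x) = a(x) * I(x) * Pr(clf(x) = 1): the indicator I(x) encodes the
    known constraints, and p_feasible is the classifier's prediction for the
    unknown constraints."""
    return a * (1.0 if known_feasible else 0.0) * p_feasible

# hypothetical candidates: (a(x), I(x) satisfied?, Pr(clf(x) = 1))
candidates = [
    (0.8, True, 0.9),
    (1.5, False, 0.95),   # violates a known constraint -> score forced to zero
    (1.0, True, 0.5),     # likely infeasible -> score halved by the classifier
]
scores = [constrained_acquisition(*c) for c in candidates]
best = max(range(len(scores)), key=scores.__getitem__)
```

Note that the candidate with the largest unconstrained acquisition value is rejected outright because it violates a known constraint.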

### 3.2 Fidelity Selection Criteria.

In this section, we propose the fidelity selection criterion for the multi-fidelity framework. The computational cost and the reduction of uncertainty are the two factors used to determine the fidelity level at which the evaluation will be performed.

To determine the level of fidelity at which to evaluate the new sampling location, a fidelity selection criterion balancing the computational cost and the integrated mean-squared error (IMSE) reduction is utilized, based on a one-step hallucination. The process of hallucination is adopted from our previous work [38]. For the sake of completeness, the process is summarized here.

The CoKriging model is hallucinated at a point ** x*** if the observation, i.e., the training datum, is *temporarily* assumed to be exactly the same as the GP posterior mean prediction. The CoKriging posterior distribution is then updated accordingly based on this assumption. The posterior variance σ^{2}(** x***) at ** x*** then reduces to the intrinsic noise variance of the posterior prediction (in particular, σ^{2}(** x***) at ** x*** is 0 for deterministic functional evaluations). Then, if the sampling point ** x*** is feasible, the GP is updated with the true observation, instead of the posterior GP prediction. If the sampling point ** x*** is infeasible with respect to the unknown constraints, then the hallucination process takes place at ** x*** for every subsequent iteration.
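A minimal sketch of the hallucination step for a single-level GP (the paper hallucinates the full CoKriging model; the kernel and points here are illustrative): the posterior mean at x* is temporarily appended as a fake observation, which collapses the posterior variance at x* toward the noise level.

```python
import numpy as np

def rbf(X1, X2, ls=0.25):
    """Squared exponential kernel (unit variance, fixed length scale)."""
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / ls**2)

def posterior(X, y, Xq, noise=1e-6):
    """Zero-mean GP posterior mean and variance at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xq, Xq) - Ks.T @ np.linalg.solve(K, Ks))
    return mu, var

def hallucinate(X, y, x_star, noise=1e-6):
    """Temporarily treat the posterior mean at x_star as if it were observed."""
    mu, _ = posterior(X, y, np.array([x_star]), noise)
    return np.append(X, x_star), np.append(y, mu[0])

X = np.array([0.0, 0.5, 1.0])
y = np.sin(2.0 * np.pi * X)
_, var_before = posterior(X, y, np.array([0.25]))
X_h, y_h = hallucinate(X, y, 0.25)
_, var_after = posterior(X_h, y_h, np.array([0.25]))
# the hallucinated observation collapses the posterior variance at x* = 0.25
```

The hallucinated dataset changes only the posterior variance field, not the posterior mean, which is exactly what the one-step IMSE lookahead below requires.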

If *t* = 1, …, *s* are the *s* levels of fidelity, then the optimal fidelity level *t** at which to perform the functional evaluation is determined by

$$t^* = \operatorname*{argmin}_{t \in \{1, \ldots, s\}} \left( \mathrm{IMSE}_{t,\text{hallucinated}} \cdot C_t \right),$$

where the computational cost at level *t* is denoted as *C*_{t}. The term $(\mathrm{IMSE}_{t,\text{hallucinated}} \cdot C_t)$ quantifies the performance of querying at level *t* of fidelity, measured as a product between the estimated IMSE and the computational cost. The integrated mean-square error, IMSE, is calculated as

$$\mathrm{IMSE} = \int_{\mathcal{X}} \sigma^2(x) \, dx,$$

where σ^{2}(** x**) is hallucinated at the sampling location ** x***, i.e., assuming that *y*(** x***) = *μ*(** x***). The optimal level *t** corresponding to the optimal product $(\mathrm{IMSE}_{t,\text{hallucinated}} \cdot C_t)$, as a measure of cost and effectiveness, is selected to query the model.

Additionally, to promote functional evaluations at the highest fidelity level, i.e., level *t* = *s*, we choose the highest-fidelity level (instead of *t* = *t**) if the cost-weighted number of training data points at *t** exceeds that at *s*, since the goal is to optimize at the highest level of fidelity *t* = *s*. In particular, let *t** be the optimal level of fidelity to query for the next sampling point, let $|D(t)|$ be the cardinality of the training dataset at level *t* of fidelity (i.e., the number of training data points at level *t*), and let *C*_{t} be the computational cost at level *t*. We compare two quantities, $(C_{t^*} \cdot |D(t^*)|)$ and $(C_s \cdot |D(s)|)$. If $C_{t^*} \cdot |D(t^*)| \geq C_s \cdot |D(s)|$, which means some of the computational cost of building $D(t^*)$ could be traded for building $D(s)$ (consistent with the policy of promoting evaluation at the highest fidelity level *s*), then level *s*, i.e., the highest level of fidelity, is chosen instead of *t**.
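This promotion rule amounts to a single cost-times-cardinality comparison; a sketch with hypothetical costs and dataset sizes:

```python
def promote_to_highest(t_star, costs, counts):
    """Override t* with the highest level s when C_{t*} * |D(t*)| >= C_s * |D(s)|."""
    s = len(costs) - 1                 # index of the highest-fidelity level
    if costs[t_star] * counts[t_star] >= costs[s] * counts[s]:
        return s
    return t_star

# 20 cheap points at unit cost vs. 4 expensive points at cost 4: 20 >= 16, so promote
lvl_a = promote_to_highest(0, costs=[1.0, 4.0], counts=[20, 4])
# 10 cheap points: 10 < 16, so keep the nominated level t*
lvl_b = promote_to_highest(0, costs=[1.0, 4.0], counts=[10, 4])
```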

In the bi-fidelity case, we define *a*_{fidelity} as the ratio of this measure at the high-fidelity level to that at the low-fidelity level,

$$a_{\text{fidelity}} = \frac{\mathrm{IMSE}_{h,\text{hallucinated}} \cdot C_h}{\mathrm{IMSE}_{l,\text{hallucinated}} \cdot C_l},$$

where *C*_{h} and *C*_{l} are the computational costs at the high- and low-fidelity levels, respectively. If *a*_{fidelity} ≤ 1, then the function evaluator is called at the high-fidelity level, whereas if *a*_{fidelity} > 1, then the function is evaluated at the low-fidelity level. The proposed fidelity selection criterion defined in Eq. (29) determines the trade-off between running at the low- and high-fidelity levels. If the high-fidelity return is higher than the low-fidelity return, then the high-fidelity level is chosen, and vice versa.

In practice, to promote high-fidelity evaluations, a hard condition is proposed to prevent an imbalance between the low- and high-fidelity datasets, based on a comparison between the number of data points available and the relative computational cost between high- and low-fidelity data. If the ratio of low-to-high-fidelity data points is higher than the relative computational cost, then the high-fidelity level is chosen to evaluate the sampling location. The IMSE is computed by Monte Carlo sampling in the high-dimensional space. It is noted that if the relative computational cost between the high- and low-fidelity levels is 1, then the fidelity selection criterion always promotes evaluating the sampling point at the high-fidelity level.

## 4 Applications

In this section, we demonstrate the proposed sMF-BO-2CoGP on several analytical functions, including the 1d Forrester function [23] and a subset of benchmark functions from Kandasamy et al. [50] and Xiong et al. [51]: the 2d Currin exponential function (two levels of fidelity), the 8d borehole function (two and three levels of fidelity), and the welded beam design problem (four levels of fidelity), as well as an 11d real-world engineering application (two levels of fidelity) in designing a flip-chip package (Sec. 4.6). Some implementations are adapted from Surjanovic and Bingham [52]. In all the benchmarks of different acquisition functions, the computational cost ratio between the high- and low-fidelity models is fixed at 2.50.

### 4.1 Forrester Function (1d) With Two Fidelity Levels.

For *x* ∈ [0, 1], the high-fidelity Forrester function is $f_H(x) = (6x - 2)^2 \sin(12x - 4)$, and the low-fidelity function is a linear transformation of it, $f_L(x) = A f_H(x) + B(x - 0.5) - C$, with constants *A*, *B*, and *C* as given in [23,52].
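A sketch of the Forrester pair, using the constants A = 0.5, B = 10, C = −5 that are commonly quoted for the low-fidelity variant [52] (treat the exact constants as an assumption here):

```python
import numpy as np

def forrester_high(x):
    """High-fidelity Forrester function on [0, 1]."""
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def forrester_low(x, A=0.5, B=10.0, C=-5.0):
    """Low-fidelity variant f_L = A*f_H + B*(x - 0.5) - C; the constants are
    the commonly quoted values (an assumption, see [23,52])."""
    return A * forrester_high(x) + B * (x - 0.5) - C

x = np.linspace(0.0, 1.0, 101)
y_high, y_low = forrester_high(x), forrester_low(x)
```

The global minimum of the high-fidelity function lies near x ≈ 0.757, which is the target the multi-fidelity optimizer must recover.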

First, consider a baseline set of four low-fidelity and two high-fidelity data points. We compare the effects of adding low- and high-fidelity observations on the prediction of CoKriging. Figure 2 shows the comparison of the posterior mean μ(** x**) and posterior variance σ^{2}(** x**) between adding four more low-fidelity and two more high-fidelity data points. The common low-fidelity data points are denoted as blue squares, the common high-fidelity data points are denoted as black diamonds, and the added data points are denoted as red circles (low-fidelity for Figs. 1(a) and 2(a); high-fidelity for Figs. 1(b) and 2(b)).

For the low-fidelity level, Figs. 1(a) and 2(a) show the updated posterior mean μ(** x**) and posterior variance σ^{2}(** x**) after four more low-fidelity data points are added, respectively. The posterior mean μ(** x**) prediction slightly improves near the end of the domain at *x* = 1, but does not improve significantly near the other end of the domain at *x* = 0 (Fig. 1(a)). The posterior variance σ^{2}(** x**) is slightly reduced at the locations where the low-fidelity data points are added.

For the high-fidelity level, Figs. 1(b) and 2(b) show the updated posterior mean μ(** x**) and posterior variance σ^{2}(** x**) after two more high-fidelity data points are added, respectively. The posterior mean μ(** x**) improves as expected, as shown in Fig. 1(b). The posterior variance σ^{2}(** x**) reduces to zero at the two added sampling locations, since the evaluations are noiseless.

Next, we test the numerical implementation of the sMF-BO-2CoGP method by considering the minimization problem $\operatorname*{argmin}_x f_H(x)$ with no constraint and various relative computational cost ratios between the high- and low-fidelity levels. Figures 3(a) and 3(b) show the convergence plot with respect to iterations and total computational cost, respectively. The case where the relative cost ratio is 1.0 serves as a benchmark for traditional sequential BO using only high-fidelity evaluations. We verified that when the relative cost ratio is 1.0, all the evaluations are performed at the high-fidelity level. When the relative cost ratio is higher than 1.0, sMF-BO-2CoGP selects the fidelity on-the-fly, using the fidelity selection criterion described above. It is worth noting that Fig. 3(a) only shows the convergence plot at the high-fidelity level; that is, the convergence plot only updates when a better high-fidelity result is available. The numerical performance at the high-fidelity level of the multi-fidelity sMF-BO-2CoGP framework degrades as the computational cost ratio increases, because more low-fidelity points are selected at high computational cost ratios, according to Eq. (29).

As shown in Fig. 3(a), when the computational cost ratio is 1.0, sMF-BO-2CoGP reduces to sequential high-fidelity BO and is the fastest with respect to the number of iterations. Figure 3(b) shows comparable performance between the cases of ratio 1.0 and 2.5, where the performance degrades as the computational cost ratio increases. However, all cases converge after approximately seven iterations.

In this example, we consider an initial sampling dataset comprising four low-fidelity and two high-fidelity data points. The numerical performance is expected to change with different initial samples, as well as with the behavior of the high- and low-fidelity models.

Figure 4 shows the convergence plot of the 1d Forrester function. Five trial runs are performed with different initial samples. The bands span the lower and upper bounds at a fixed iteration across the different trial runs. Solid lines denote the mean objective function at a fixed iteration. The EI acquisition function is denoted with blue circles and a blue band, the UCB acquisition function with red crosses and a red band, and the PI acquisition function with green squares and a green band. The UCB acquisition function converges slowly at the beginning, but outperforms the EI acquisition function later on. The PI acquisition function does not perform very well. In the case of UCB, five high-fidelity and three low-fidelity evaluations are performed to achieve convergence.

### 4.2 Currin Exponential Function (2d) With Two Fidelity Levels.

Figure 5 shows the convergence plot of the 2d Currin function. The explanation of the plots is similar to the case of the 1d Forrester function. In this example, the UCB acquisition function outperforms both the EI and PI acquisition functions. The PI acquisition function performs poorly. In the case of EI, 16 high-fidelity and 14 low-fidelity evaluations are performed to achieve convergence.

### 4.3 Welded Beam Design Problem (4d) With Four Fidelity Levels.

The objective function *f* is calculated from the welding and material costs, where *C*_{1}(*m*), *C*_{2}(*m*), σ_{d}(*m*), *E*(*m*), and *G*(*m*) are parameters that depend on the bulk material *m*, as listed in Table 1. The lower and upper bounds of the problem are 0.0625 ≤ *h* ≤ 2.0, 0.1 ≤ *l* ≤ 10.0, 2.0 ≤ *t* ≤ 20.0, and 0.0625 ≤ *b* ≤ 2.0. The discrete variable *m* ∈ {1, 2, 3, 4} encodes the bulk materials as steel, cast iron, aluminum, and brass, respectively.

| Constants | Description | Steel | Cast iron | Aluminum | Brass |
|---|---|---|---|---|---|
| C_{1} | Cost per volume of the welded material ($/in^{3}) | 0.1047 | 0.0489 | 0.5235 | 0.5584 |
| C_{2} | Cost per volume of the bar stock ($/in^{3}) | 0.0481 | 0.0224 | 0.2405 | 0.2566 |
| σ_{d} | Design normal stress of the bar material (psi) | 30 · 10^{3} | 8 · 10^{3} | 5 · 10^{3} | 8 · 10^{3} |
| E | Young's modulus of bar stock (psi) | 30 · 10^{6} | 14 · 10^{6} | 10 · 10^{6} | 16 · 10^{6} |
| G | Shear modulus of bar stock (psi) | 12 · 10^{6} | 6 · 10^{6} | 4 · 10^{6} | 6 · 10^{6} |


The goal is to minimize the objective function *f*(*w*, *m*, *h*, *l*, *t*, *b*), which is the estimated cost. The physical meaning of the input parameters is as follows: *h* is the thickness of the weld; *l* is the length of the welded joint; *t* is the width of the beam; *b* is the thickness of the beam; *m* is a discrete variable enumerating the bulk material of the beam, which can be steel, cast iron, aluminum, or brass; and *w* is a binary variable modeling the type of weld, with *w* = 0 for two-sided welding and *w* = 1 for four-sided welding. Different bulk materials are associated with different material properties, as described in Table 1. The fixed design parameters of the beam are *L* = 14 in., δ_{max} = 0.25 in., and *F* = 6,000 lb.

We fix *w* = 0 and consider four levels of fidelity, comprising the high-fidelity function and three low-fidelity functions.

Figure 7 shows the convergence plot of the welded beam design problem with four levels of fidelity. In this example, the UCB and EI acquisition functions perform on par with each other and outperform the PI acquisition function. In the case of EI, six high-fidelity and 14 low-fidelity evaluations are performed to achieve convergence.

### 4.4 Borehole Function (8d) With Two Fidelity Levels.

The input domain of the borehole function is *x*_{1} ∈ [0.05, 0.15], *x*_{2} ∈ [100, 50,000], *x*_{3} ∈ [63,070, 115,600], *x*_{4} ∈ [990, 1110], *x*_{5} ∈ [63.1, 116], *x*_{6} ∈ [700, 820], *x*_{7} ∈ [1120, 1680], and *x*_{8} ∈ [9855, 12,045].
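The borehole function models the water flow rate through a borehole; a sketch using its standard form, with the inputs ordered as above (the physical interpretation of x_1, …, x_8 as r_w, r, T_u, H_u, T_l, H_l, L, K_w follows the usual convention [52]):

```python
import numpy as np

def borehole(x):
    """Borehole function: water flow rate through a borehole.
    Inputs ordered as x = [rw, r, Tu, Hu, Tl, Hl, L, Kw]."""
    rw, r, Tu, Hu, Tl, Hl, L, Kw = x
    log_rrw = np.log(r / rw)
    numer = 2.0 * np.pi * Tu * (Hu - Hl)
    denom = log_rrw * (1.0 + 2.0 * L * Tu / (log_rrw * rw**2 * Kw) + Tu / Tl)
    return numer / denom

# evaluate at the midpoint of the domain listed above
x_mid = [0.10, 25050.0, 89335.0, 1050.0, 89.55, 760.0, 1400.0, 10950.0]
flow = borehole(x_mid)
```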

Figure 8 shows the convergence plot of the 8d borehole function. The explanation of the plots is similar to the case of the 1d Forrester function. In this example, the EI acquisition function outperforms both the UCB and PI acquisition functions, and the PI and UCB acquisition functions perform on par with each other. In the case of EI, 22 high-fidelity and 53 low-fidelity evaluations are performed to achieve convergence.

### 4.5 Borehole Function (8d) With Three Fidelity Levels.

It is easy to see that $f_{L1}(x) \leq f_H(x) \leq f_{L2}(x)$; thus, the high-fidelity function is bounded between the two low-fidelity functions. Figure 9 shows the convergence plot of the 8d borehole function with three levels of fidelity. In this example, all the acquisition functions, including EI, UCB, and PI, perform on par with each other almost throughout the optimization process. In the case of PI, 28 high-fidelity and 82 low-fidelity evaluations are performed to achieve convergence.

### 4.6 Flip-Chip Package Design (11d) With Two Fidelity Levels.

In this section, we demonstrate the design application of a flip-chip package using the proposed sMF-BO-2CoGP, where the details of the development and implementation are fully described in our previous work [54]. A lidless flip-chip package with a monolithic silicon die (FCBGA) mounted on a printed circuit board (PCB) with a stiffener ring is considered in this example. The computational model is constructed in 2.5D with half symmetry to reduce the computational time.

Figure 10 shows the geometric model of the thermomechanical finite element model (FEM), where the mesh density varies for different levels of fidelity. Two design variables are associated with the die, three with the substrate, three with the stiffener ring, two with the underfill, and the last one with the PCB board. Only two levels of fidelity are considered in this example. Table 2 shows the design variables, the associated design parts, and their lower and upper bounds in this case study.

| Variable | Design part | Lower bound | Upper bound | Optimal value |
|---|---|---|---|---|
| x_{1} | Die | 20,000 | 30,000 | 20,702 |
| x_{2} | Die | 300 | 750 | 320 |
| x_{3} | Substrate | 30,000 | 40,000 | 35,539 |
| x_{4} | Substrate | 100 | 1800 | 1614 |
| x_{5} | Substrate | 10 · 10^{−6} | 17 · 10^{−6} | 17 · 10^{−6} |
| x_{6} | Stiffener ring | 2000 | 6000 | 4126 |
| x_{7} | Stiffener ring | 100 | 2500 | 1646 |
| x_{8} | Stiffener ring | 8 · 10^{−6} | 25 · 10^{−6} | 8.94 · 10^{−6} |
| x_{9} | Underfill | 1.0 | 3.0 | 1.52 |
| x_{10} | Underfill | 0.5 | 1.0 | 0.804 |
| x_{11} | PCB board | 12.0 · 10^{−6} | 16.7 · 10^{−6} | 16.7 · 10^{−6} |


After the numerical solution is obtained, the component warpage at 20 °C and 200 °C and the strain energy density of the furthest solder joint are calculated. The strain energy density is one of the more accurate predictors for estimating the fatigue life of solder joints during thermal cycling [55].

A vectorized 11-dimensional input is used to parameterize the design. Nine low-fidelity and three high-fidelity data points are used as initial samples. It is noted that not all of the initial samples are feasible. There are some unknown constraints, but no known constraint is imposed in this example. We consider the sampling locations where the FEM solutions diverge to be infeasible. This condition can be regarded as an unknown constraint, because no prior knowledge regarding divergence is available beforehand, only after the simulation has finished. The ANSYS Parametric Design Language (APDL) is used to evaluate the model in batch mode with no graphical user interface. The sMF-BO-2CoGP is implemented in MATLAB, where an interface using Python is devised to communicate with the APDL FEM model. The average computational time for one iteration is approximately 0.4 CPU h.

Figure 11 presents the convergence plot of the FCBGA design optimization, where the feasible sampling points are plotted as blue circles and the infeasible sampling points are plotted as red squares. The EI acquisition function is used in this example to locate the next sampling point. It is observed that the predicted warpage converges steadily, even though the numerical solver fails to converge in many cases. This demonstrates that the proposed sMF-BO-2CoGP is robust against diverging simulations, as its objective converges despite numerous failed cases.

The optimization results are relatively close to designs commonly used in the microelectronics packaging industry. A thin, small die and a thick substrate are suggested in order to minimize the component warpage.

## 5 Discussion

Here, *m* is an arbitrary level of fidelity and *l* is the highest level of fidelity. In this scheme, after each point is nominated at a level of fidelity, a unique sampling point is chosen by looping over all the levels. The uncertainty reduction is measured by the second term of the above equation, $\left(1 - \sigma_{\epsilon,l}\big/\sqrt{s_l^2(x) + \sigma_{\epsilon,l}^2}\right)$. In our scheme, the uncertainty is measured by $\mathrm{IMSE}_{h,\mathrm{hallucinated}}/\mathrm{IMSE}_{l,\mathrm{hallucinated}}$ in Eq. (29). One advantage of the proposed criterion is that it truly estimates the reduction of uncertainty at a particular level of fidelity. While the uncertainty could instead be measured by the maximum of $\sigma^2(x)$ for $x \in \mathcal{X}$, the maximal location is often found on the border of the bounded domain. Another advantage of the proposed criterion is that it removes the restriction of using the EI acquisition function and generalizes to any arbitrary acquisition function; the choice of acquisition function is left to the user. Previous work by Gauthier et al. [57,58] and Silvestrini et al. [59] has shown that the performance of IMSE supersedes that of maximal MSE. The scheme proposed by Huang et al. [56] in Eq. (57) can be further generalized to other commonly used acquisition functions, such as PI and UCB. Furthermore, multiple acquisition functions can be considered simultaneously based on their performance, as in the GP-Hedge scheme [60].

In the implementation, the CMA-ES framework is adopted to maximize the acquisition function $a(x)$. For computationally expensive high-fidelity simulations, the CMA-ES parameters must be tuned to search carefully with multiple restarts to avoid local optima. In practice, optimizing the acquisition function takes a non-negligible amount of time and thus reduces the efficiency of the method. However, this cost has rarely been discussed in the literature, and little work has been dedicated to benchmarking and quantifying it. For batch-sequential parallel BO approaches, the computational cost is much more severe, particularly for simulations associated with a large infeasible space.
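The inner-loop structure (multiple restarts over a bounded domain, each refining a candidate maximizer of the acquisition) can be sketched as below. This is a lightweight stand-in that substitutes a shrinking random search for a full CMA-ES implementation; the restart and step-size parameters are illustrative, not the paper's settings.

```python
import random

def maximize_acquisition(acq, bounds, n_restarts=20, n_steps=200, seed=0):
    """Multi-start stochastic search over a box domain [lo, hi]^d.

    Each restart draws a random start and performs hill climbing with a
    step size that shrinks on rejected moves, mimicking the role of the
    adapted covariance in CMA-ES at a fraction of the complexity.
    """
    rng = random.Random(seed)
    best_x, best_a = None, -float("inf")
    for _ in range(n_restarts):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        step = [(hi - lo) * 0.25 for lo, hi in bounds]
        fx = acq(x)
        for _ in range(n_steps):
            cand = [min(max(xi + rng.gauss(0.0, s), lo), hi)
                    for xi, s, (lo, hi) in zip(x, step, bounds)]
            fc = acq(cand)
            if fc > fx:
                x, fx = cand, fc
            else:
                step = [0.97 * s for s in step]  # contract on failure
        if fx > best_a:
            best_x, best_a = x, fx
    return best_x, best_a
```

The multiple restarts are what guard against the local optima mentioned above; in high dimensions their count dominates the per-iteration overhead of the acquisition optimization.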

The performance of the probabilistic binary classifier in learning and distinguishing the feasible and infeasible regions also depends on many factors of the problem. Essentially, the classifier needs to accurately predict feasibility before the optimal point is obtained, which depends largely on the dimensionality of the problem considered. However, once the feasibility is accurately predicted through Eq. (25), convergence to the global optimum is guaranteed through the classical BO framework. The analytical convergence rate can be found in the seminal work of Rasmussen [46].
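A common way to fold such a classifier into the search, sketched here as a general illustration rather than the paper's exact Eq. (25), is to discount the acquisition value by the predicted probability of feasibility, assuming a latent-GP classifier where a positive latent value encodes feasibility.

```python
import math

def p_feasible(mu_c, sigma_c):
    """P(latent classifier output > 0) for a Gaussian latent posterior.

    mu_c, sigma_c: posterior mean and std of the classifier's latent
    function (an assumed encoding; the paper's thresholding may differ).
    """
    if sigma_c <= 0.0:
        return 1.0 if mu_c > 0.0 else 0.0
    return 0.5 * (1.0 + math.erf(mu_c / (sigma_c * math.sqrt(2.0))))

def weighted_acquisition(acq_value, mu_c, sigma_c):
    """Discount the acquisition by the predicted feasibility probability."""
    return acq_value * p_feasible(mu_c, sigma_c)
```

With this weighting, candidates deep in the predicted-infeasible region are suppressed automatically, while uncertain regions (probability near 0.5) remain explorable.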

BO is indeed a flexible framework that allows for numerous possible extensions in engineering domains [61]. One of these is multi-fidelity modeling, which is studied in this paper. Other extensions include batch-sequential parallel sampling [38], asynchronously parallel sampling [40], mixed-integer optimization with a small number of discrete/categorical variables [39], and latent variable models [62]. More sophisticated GP models, including local GPs [63,64], sparse GPs [65], heteroscedastic GPs [66], and even deep learning surrogates [67], have been developed to widen the range of multi-disciplinary applications. This area remains active for further research.

While the proposed sequential multi-fidelity sMF-BO-2CoGP aims at improving efficiency compared to sequential high-fidelity BO, the efficiency can be further improved by performing parallel optimization, that is, sampling multiple locations concurrently (i.e., at the same time) and asynchronously (i.e., sampling points do not have to wait for others to complete). The proposed multi-fidelity framework serves as a foundation for tackling the constrained multi-fidelity problem in an asynchronously parallel manner. This research question remains open and poses a potential direction for future work.

## 6 Conclusion

In this paper, a sequential multi-fidelity BO method, called sMF-BO-2CoGP, is proposed to solve the constrained simulation-based optimization problem. A fidelity selection criterion is proposed to determine the level of fidelity for evaluating the objective function. Another CoKriging model is coupled into the method to classify the next sampling point and distinguish between feasible and infeasible regions.

The proposed sMF-BO-2CoGP method is demonstrated using a simple analytic 1D example, as well as an engineering thermomechanical FEM model for flip-chip package design optimization. The preliminary results provided in this study demonstrate the applicability of the proposed sMF-BO-2CoGP method.

## Acknowledgment

This research was supported by the U.S. Department of Energy, Office of Science, Early Career Research Program, under award 17020246 and the Sandia LDRD program. The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.