Abstract
Bayesian optimization (BO) has become a powerful tool for solving simulation-based engineering optimization problems thanks to its ability to integrate physical and mathematical understanding, account for uncertainty, and address the exploitation–exploration dilemma. Thompson sampling (TS) is a preferred solution for BO to handle the exploitation–exploration tradeoff. While it prioritizes exploration by generating and minimizing random sample paths from probabilistic models, a fundamental ingredient of BO, TS weakly manages exploitation by gathering information about the true objective function only after it obtains new observations. In this work, we improve the exploitation of TS by incorporating the ε-greedy policy, a well-established selection strategy in reinforcement learning. We first delineate two extremes of TS, namely the generic TS and the sample-average TS. The former promotes exploration, while the latter favors exploitation. We then adopt the ε-greedy policy to randomly switch between these two extremes. Small and large values of ε govern exploitation and exploration, respectively. By minimizing two benchmark functions and solving an inverse problem of a steel cantilever beam, we empirically show that ε-greedy TS equipped with an appropriate ε is more robust than either of its extremes, matching or outperforming the better of the generic TS and the sample-average TS.
1 Introduction
Bayesian optimization (BO) [1–6] is a global sequential optimization technique well suited for solving small- and moderate-dimensional optimization problems with costly or black-box objective functions. It finds application in diverse domains of science and engineering, including hyperparameter tuning of machine learning algorithms [1], identification of material parameters [7,8], material design [9,10], airfoil design [11], adaptive experimental design [12], and accelerator physics [13]. For a recent comprehensive review of its applications in engineering design, see Ref. [5]. At its core, BO guides the optimization process using a probabilistic surrogate model of the objective function, usually a Gaussian process (GP), coupled with an optimization policy [4]. Given a dataset of several observations of the input variables and the corresponding objective function values, a GP posterior built from this dataset often serves as the probabilistic model encapsulating our beliefs about the black-box objective function. The optimization policy, which specifies what we value in the dataset, is defined through an acquisition function. This acquisition function can be deterministic or stochastic, and it is cost-effective to evaluate given the GP posterior, making it convenient for guiding the optimization. Considerations that can be taken into account when formulating the acquisition function from the GP posterior include, for example, the value of the objective function [14], information about the minimum location [15–17], and information about the minimum value [18].
Various algorithms have been developed to address the classic exploitation–exploration dilemma in BO, where exploitation involves selecting new solutions anticipated to immediately improve the objective function value, and exploration focuses on sampling new solutions that reduce uncertainty in predictions of the objective function for long-term improvement of the solution. An effective BO algorithm is expected to balance these opposing concerns [4]. Notable deterministic acquisition functions, such as expected improvement (EI) [19], weighted EI [20], the GP upper confidence bound [21], the knowledge gradient [22], and likelihood weighting [23], can manage the exploitation–exploration tradeoff. However, they are considered myopic policies and therefore tend to favor exploitation [14].
Thompson sampling (TS) is a preferred algorithm for addressing the exploitation–exploration dilemma in multi-armed bandit problems [24–27]. More specifically, TS selects arms (or actions) from a finite set of possible arms over a limited number of iterations. Each arm corresponds to a stochastic reward drawn from an unknown distribution. The goal is to craft a sequence of arms that maximizes the cumulative reward, assuming that the rewards are independent of time and conditionally independent given the selected arms. When applied to GP-based BO, TS (or GP-TS [4]) generates a sequence of new solution points by randomly sampling from the unknown posterior distribution of the global minimum location [28], which arises from our imperfect knowledge of the objective function as described by the probabilistic model. Rather than constructing this posterior distribution explicitly, TS generates functions from the model posterior and minimizes them to obtain samples of the minimum location. Thus, we can consider TS a BO method that selects new solution points by minimizing stochastic acquisition functions generated from the model posterior. It is also worth noting that several information-theoretic optimization policies of BO, such as predictive entropy search [17] and max-value entropy search [18], compute their acquisition functions based on a set of samples generated by TS.
While TS naturally manages the exploitation–exploration tradeoff, its randomness can reduce the role of exploitation in optimization. When the GP posterior exhibits high uncertainty at the beginning of the optimization process, TS prioritizes exploration, diversifying the selection of new solutions while limited information has been gained about the optimal solution. As the number of observations increases and the GP posterior becomes concentrated, the algorithm transitions to exploiting its knowledge about the true objective function. This exploitation, however, remains weak due to the randomness of TS, motivating an intervention to improve it.
In this work, we incorporate the ε-greedy policy into TS to improve its exploitation. This policy is a selection strategy of reinforcement learning to address the tension between exploitation and exploration [29]. Given ε ∈ [0, 1], the policy chooses an action either by maximizing an average reward function with probability 1 − ε (i.e., exploitation or greedy) or by selecting it randomly with probability ε (i.e., exploration). The selection strategy is pure exploitation when ε = 0, and pure exploration when ε = 1. Similarly, our proposed approach implements the generic TS (with probability ε) for exploration and a new variant of TS, called the sample-average TS (with probability 1 − ε), for exploitation.
Several works have explored the ε-greedy policy in BO and multi-armed bandit problems. De Ath et al. [30] proposed two schemes for applying the policy to BO. The first scheme performs exploration (with probability ε) by randomly selecting a point on the Pareto frontier obtained by simultaneously minimizing the posterior mean and maximizing the posterior standard deviation. The second scheme performs exploration by randomly selecting a point in the input variable space. Both schemes implement exploitation (with probability 1 − ε) by minimizing the posterior mean function. Jin et al. [31] introduced the so-called ε-exploring TS (ε-TS) for multi-armed bandit problems. Given the posterior distributions of the arms, exploration (with probability ε) sets the estimated reward function for each arm as a random sample drawn from the associated posterior distribution, while exploitation (with probability 1 − ε) sets the estimated reward function as the sample mean function. Yet the performance of ε-greedy TS for BO remains unexplored.
Our contributions are as follows: (1) ε-greedy TS for BO, a simple, effective method to improve the exploitation of TS; and (2) empirical evaluations demonstrating that ε-greedy TS with an appropriate ε is more robust than its two extremes (i.e., the generic TS and the sample-average TS), matching or outperforming the better of the two across various problems.
The rest of this paper progresses as follows. Section 2 provides the background essential for the development of ε-greedy TS. Section 3 describes the method in detail. Section 4 presents the empirical performance of ε-greedy TS on minimizing two benchmark functions and finding constitutive parameters for a steel cantilever beam. Finally, Sec. 5 concludes this paper.
2 Background
2.1 Overview of Gaussian Processes.
Let $\mathcal{D} = \{(\mathbf{x}^i, y^i)\}_{i=1}^{N}$ denote a training dataset, where $\mathbf{x}^i$ are observations of a $d$-dimensional vector of input variables $\mathbf{x}$ and $y^i$ the corresponding observations of the objective function. We wish to build from $\mathcal{D}$ an observation model $y = f(\mathbf{x}) + \varepsilon_n$, where $\varepsilon_n$ is additive zero-mean Gaussian noise with variance $\sigma_n^2$. This observation noise is assumed to be independent and identically distributed.
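For reference, conditioning a zero-mean GP prior with covariance function $k(\mathbf{x}, \mathbf{x}')$ on $\mathcal{D}$ gives the standard posterior mean and variance below; the notation is generic and is assumed to correspond to the posterior mean referenced later as Eq. (3).

$$\mu(\mathbf{x}) = \mathbf{k}^{\mathsf{T}}(\mathbf{x})\left[\mathbf{K} + \sigma_n^2 \mathbf{I}\right]^{-1}\mathbf{y}, \qquad \sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^{\mathsf{T}}(\mathbf{x})\left[\mathbf{K} + \sigma_n^2 \mathbf{I}\right]^{-1}\mathbf{k}(\mathbf{x})$$

where $\mathbf{K}_{ij} = k(\mathbf{x}^i, \mathbf{x}^j)$, $\mathbf{k}(\mathbf{x}) = [k(\mathbf{x}, \mathbf{x}^1), \dots, k(\mathbf{x}, \mathbf{x}^N)]^{\mathsf{T}}$, and $\mathbf{y} = [y^1, \dots, y^N]^{\mathsf{T}}$.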
Training GP models involves finding an optimal set of hyperparameters that ensures robust predictive performance of the trained GPs on unseen input variables. Exact training methods that enumerate all possible hyperparameter hypotheses are often impractical, leading to the prevalent use of approximate training methods. These approximate training methods fall into two main categories: deterministic approximations and Monte Carlo methods [34]. Deterministic approximations include the maximum likelihood method, which identifies the set of hyperparameters that maximizes the likelihood of the observed data, and Laplace's method, which approximates the hyperparameter posterior as a multivariate Gaussian whose mean equals the maximum likelihood estimate and whose covariance is estimated from the Laplace approximation. Monte Carlo methods, such as importance sampling, the Metropolis algorithm, Gibbs sampling, and slice sampling, generate approximate samples of the hyperparameters from their posterior given the likelihood and a prior. Several of these approximate training methods have been implemented in reliable GP toolboxes such as DACE [35], GPML [36], GPstuff [37], pyGPs [38], and GPflow [39].
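As a concrete illustration of maximum-likelihood training with one of the cited toolboxes, the sketch below fits a GP with an ARD SE kernel in GPflow; the toy data, dimensions, and names are illustrative assumptions, and the experiments in this paper use the DACE toolbox instead.

```python
# Minimal sketch: maximum (marginal) likelihood training of a GP with GPflow.
# The data and hyperparameter initializations here are illustrative only.
import numpy as np
import gpflow

X = np.random.rand(20, 2)                        # 20 observations of a 2d input
y = np.sin(X.sum(axis=1, keepdims=True))         # toy objective values, shape (20, 1)

model = gpflow.models.GPR(
    (X, y),
    kernel=gpflow.kernels.SquaredExponential(lengthscales=[1.0, 1.0]),  # ARD SE kernel
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

mean, var = model.predict_f(np.array([[0.5, 0.5]]))  # posterior mean and variance at a test point
```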
Algorithm 1 Bayesian optimization via the generic TS

1: Input: input variable domain $\mathcal{X}$, number of initial observations $N_0$, threshold for number of BO iterations $K$, standard deviation of observation noise $\sigma_n$
2: Generate $N_0$ initial samples $\mathbf{x}^i \in \mathcal{X}$
3: for $i = 1, \dots, N_0$ do
4:    $y^i \leftarrow f(\mathbf{x}^i) + \varepsilon_n$
5: end for
6: $\mathcal{D}_0 \leftarrow \{(\mathbf{x}^i, y^i)\}_{i=1}^{N_0}$
7: $k \leftarrow 1$
8: for $k = 1, \dots, K$ do
9:    Build a GP posterior from $\mathcal{D}_{k-1}$
10:   Generate a sample path $\hat{f}_k(\mathbf{x})$ from the GP posterior
11:   $\mathbf{x}^{N_0+k} \leftarrow \arg\min_{\mathbf{x}} \hat{f}_k(\mathbf{x})$ s.t. $\mathbf{x} \in \mathcal{X}$
12:   $y^{N_0+k} \leftarrow f(\mathbf{x}^{N_0+k}) + \varepsilon_n$
13:   $\mathcal{D}_k \leftarrow \mathcal{D}_{k-1} \cup \{(\mathbf{x}^{N_0+k}, y^{N_0+k})\}$
14:   $k \leftarrow k + 1$
15: end for
16: return the best observation in $\mathcal{D}_K$ and its location
2.2 Bayesian Optimization Via Thompson Sampling.
As briefly described in Sec. 1, the generic TS generates a sequence of new solution points by randomly sampling from the unknown posterior distribution of the global minimum location, where the sequence length is limited by a finite budget $K$ on the number of BO iterations. Leveraging the fact that the minimum location is fully determined by the objective function, the generic TS follows two simple steps in each iteration. It first generates a sample path from the GP posterior, the detailed implementation of which is described in Sec. 2.3. It then minimizes the generated sample path to find its minimum location. By doing so, this minimizer, assigned as the new solution point, is a sample from the posterior distribution of the global minimum location. Algorithm 1 summarizes the implementation of the generic TS.
Figure 1 shows sample paths generated from the GP posterior of a univariate function and an approximate distribution of the minimum location derived from their minimizers. Specifically, we build from ten observations a GP model with zero mean and SE covariance function. From this GP model, we generate 50 sample paths, five of which are shown in Fig. 1(a). Minimizing the generated sample paths and applying a kernel density estimation to the obtained solutions results in an approximate distribution of the minimum location, as shown in Fig. 1(b). Note that the generic TS in Algorithm 1 generates and minimizes only one sample path in each iteration and does not attempt to approximate the posterior distribution of the minimum location. The following section describes how we can randomly generate the sample path in Line 10 of Algorithm 1.
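To make the construction of Fig. 1 concrete, the following sketch draws posterior sample paths of a GP on a dense grid, takes their minimizers, and applies a kernel density estimate; the one-dimensional test function, kernel hyperparameters, and names are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch of Fig. 1: sample paths from a GP posterior and a KDE of their minimizers.
import numpy as np
from scipy.stats import gaussian_kde

def se_kernel(a, b, ell=0.2, sf=1.0):
    """SE covariance between two sets of 1d points."""
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.5 * x            # stand-in objective
X = rng.uniform(0, 2, 10)                        # ten observations (noise-free here)
y = f(X)
xs = np.linspace(0, 2, 400)                      # dense prediction grid

K = se_kernel(X, X) + 1e-8 * np.eye(10)          # jitter for numerical stability
Ks, Kss = se_kernel(xs, X), se_kernel(xs, xs)
mu = Ks @ np.linalg.solve(K, y)                  # posterior mean on the grid
cov = Kss - Ks @ np.linalg.solve(K, Ks.T) + 1e-8 * np.eye(400)

paths = rng.multivariate_normal(mu, cov, size=50)   # 50 posterior sample paths
x_min = xs[np.argmin(paths, axis=1)]                # their minimizers
p_xmin = gaussian_kde(x_min)                        # approximate density of the minimum location
```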
Algorithm 2 Generation of sample paths using random features
1: Input: dataset $\mathcal{D}$, type of stationary covariance function, number of spectral points $N_s$, standard deviation of observation noise $\sigma_n$
2: Build a GP model from $\mathcal{D}$
3: Formulate the normalized spectral density $p(\mathbf{s})$ from the posterior covariance function, see Table 1
4: for $i = 1, \dots, N_s$ do
5:    Sample a spectral point $\mathbf{s}^i \sim p(\mathbf{s})$
6:    Sample a phase $b^i \sim \mathcal{U}[0, 2\pi]$
7: end for
8: Formulate the random feature map $\boldsymbol{\phi}(\mathbf{x})$, see Eq. (7)
9: Assemble the feature matrix at the observed inputs
10: Compute the posterior mean and covariance of the feature weights, see Eq. (9)
11: Sample a weight vector $\boldsymbol{\theta}$ from this posterior and set $\hat{f}(\mathbf{x}) = \boldsymbol{\phi}^{\mathsf{T}}(\mathbf{x})\,\boldsymbol{\theta}$
12: return the sample path $\hat{f}(\mathbf{x})$
Table 1 Stationary covariance functions and their normalized spectral densities

Covariance function | Normalized spectral density
---|---
SE | Zero-mean Gaussian in the spectral variable, with covariance determined by the length scale
ARD SE | Zero-mean Gaussian with a diagonal covariance determined by the per-dimension length scales
Matérn 3/2 | Multivariate Student's t with 3 degrees of freedom, scaled by the length scale
Matérn 5/2 | Multivariate Student's t with 5 degrees of freedom, scaled by the length scale
In Table 1, $\ell$ and $\sigma_f$ denote the length scale and marginal standard deviation of the covariance function, respectively, and $\ell_i$ is the length scale for the $i$th input variable; the Gamma function appears in the normalization of the Matérn spectral densities.
2.3 Sampling From Gaussian Process Posteriors.
Here the spectral matrix stacks $N_s$ spectral points $\mathbf{s}^i$ generated from the normalized spectral density $p(\mathbf{s})$, and the phase vector stacks $N_s$ points drawn from the uniform distribution $\mathcal{U}[0, 2\pi]$. The normalized spectral density is well defined because the covariance function is stationary, so that, by Bochner's theorem, its spectral density exists and can be normalized into a probability density.
Figure 2 shows approximate SE and Matérn 5/2 covariance functions computed from Eq. (6) for different numbers of spectral point samples. We see that the approximate covariance functions converge to the true ones as the number of spectral points increases.
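As an illustration of this random-feature construction (written here in the Rahimi–Recht convention, which may differ from the paper's normalization in Eq. (6)), the sketch below approximates an SE kernel and shows that the approximation improves as the number of spectral points grows; all values are illustrative.

```python
# Sketch: random Fourier features approximating an SE kernel.
import numpy as np

rng = np.random.default_rng(1)
d, n_s, ell, sf = 2, 1000, 0.5, 1.0
W = rng.normal(0.0, 1.0 / ell, size=(n_s, d))     # spectral points for the SE kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=n_s)       # random phases

def phi(x):
    """Random feature map such that k(x, x') ~= phi(x) @ phi(x')."""
    return sf * np.sqrt(2.0 / n_s) * np.cos(x @ W.T + b)

x1, x2 = np.zeros(d), 0.3 * np.ones(d)
k_true = sf**2 * np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell**2))
k_approx = phi(x1) @ phi(x2)
print(k_true, k_approx)   # the two values agree more closely as n_s grows
```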
The generation of sample paths using random features is detailed in Algorithm 2. Note that Line 10 of Algorithm 2 involves a Cholesky decomposition to facilitate the matrix inversion required by Eq. (9). Alternative posterior sampling techniques, such as the decoupled method [43], which generates GP posterior samples by updating the corresponding GP prior samples, can be adopted to improve the accuracy of the sample paths.
Algorithm 3 ε-greedy TS

1: Input: input variable domain $\mathcal{X}$, number of initial observations $N_0$, threshold for number of BO iterations $K$, number of spectral points $N_s$, number of sample paths $M$, value of $\varepsilon$, standard deviation of observation noise $\sigma_n$
2: Generate $N_0$ samples $\mathbf{x}^i \in \mathcal{X}$
3: for $i = 1, \dots, N_0$ do
4:    $y^i \leftarrow f(\mathbf{x}^i) + \varepsilon_n$
5: end for
6: $\mathcal{D}_0 \leftarrow \{(\mathbf{x}^i, y^i)\}_{i=1}^{N_0}$
7: $k \leftarrow 1$
8: for $k = 1, \dots, K$ do
9:    Build a GP posterior from $\mathcal{D}_{k-1}$
10:   Generate $u \sim \mathcal{U}[0, 1]$
11:   if $u \leq \varepsilon$ then
12:      Sample one sample path $\hat{f}_k(\mathbf{x})$ using Algorithm 2 (generic TS)
13:   else
14:      for $m = 1, \dots, M$ do
15:         Sample a sample path $\hat{f}_k^m(\mathbf{x})$ using Algorithm 2
16:      end for
17:      Set $\hat{f}_k(\mathbf{x})$ as the average of the $M$ sample paths, see Eq. (10) (sample-average TS)
18:   end if
19:   $\mathbf{x}^{N_0+k} \leftarrow \arg\min_{\mathbf{x}} \hat{f}_k(\mathbf{x})$ s.t. $\mathbf{x} \in \mathcal{X}$
20:   $y^{N_0+k} \leftarrow f(\mathbf{x}^{N_0+k}) + \varepsilon_n$
21:   $\mathcal{D}_k \leftarrow \mathcal{D}_{k-1} \cup \{(\mathbf{x}^{N_0+k}, y^{N_0+k})\}$
22:   $k \leftarrow k + 1$
23: end for
24: return the best observation in $\mathcal{D}_K$ and its location
3 ε-Greedy Thompson Sampling
3.1 Sample-Average Thompson Sampling.
Another approach to generating an average sample path is to use a decoupled representation per "Matheron's rule" [45], which can be computed at the same cost as generating one sample path, regardless of the nominal number of sample paths being averaged.
We then minimize the average sample path to find a new solution in each iteration of BO. This approach differs from that of Balandat et al. [44], which generates sample paths using randomized quasi-Monte Carlo techniques.
The sample-average TS and the generic TS can be considered two extremes of TS. If the number of averaged sample paths approaches infinity, the average sample path in Eq. (10) approaches the GP posterior mean in Eq. (3), whose minimizer promotes exploitation. If only one sample path is used, the sample-average TS recovers the generic TS, which favors exploration. Thus, there exists a state between these two extremes, corresponding to an unknown number of sample paths, at which TS balances exploitation and exploration. However, we do not attempt to find such an optimal state in this work. We instead set the number of sample paths at a sufficiently large value to enforce the exploitation of TS.
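To illustrate one convenient way to realize the average sample path in Eq. (10), note that a random-feature sample path is linear in its weight vector, so averaging M paths that share one feature map reduces to averaging the sampled weights. The sketch below assumes such a shared feature map; `phi` and `sample_weights` are hypothetical placeholders, not the paper's code.

```python
# Sketch of the sample-average TS acquisition: average M random-feature sample paths.
import numpy as np

def average_sample_path(phi, sample_weights, M, rng):
    """Return g_bar(x) = (1/M) * sum_m phi(x) @ theta_m, where each theta_m is a
    weight vector drawn from the Gaussian posterior over feature weights."""
    thetas = np.stack([sample_weights(rng) for _ in range(M)])   # (M, n_features)
    theta_bar = thetas.mean(axis=0)                              # average once in weight space...
    return lambda x: phi(x) @ theta_bar                          # ...then evaluate like a single path

# toy usage with a 5-feature map in 1d and a stand-in weight posterior
phi = lambda x: np.cos(np.outer(np.atleast_1d(x), np.arange(1, 6)))
sample_weights = lambda rng: rng.normal(size=5)
g_bar = average_sample_path(phi, sample_weights, M=50, rng=np.random.default_rng(0))
g_bar(np.array([0.2, 0.7]))
```

Because the average is formed once in weight space, evaluating and differentiating the averaged path costs no more than evaluating a single sample path.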
3.2 ε-Greedy Thompson Sampling.
The two distinct extremes of TS motivate the use of the ε-greedy policy to randomly switch between them. Specifically, we implement the generic TS with probability ε to explore the input variable space. We invoke the sample-average TS with probability 1 − ε to guide the search toward exploitation. We do not use the posterior mean in Eq. (3) for exploitation because we observe that computing its derivatives is more expensive than computing the derivatives of the average sample path in Eq. (10), which adds computational cost when the minimization is performed with a gradient-based algorithm. It is also straightforward to recover the generic TS from ε-greedy TS by simply setting ε = 1 or by using a single sample path in the exploitation step. Algorithm 3 details the proposed ε-greedy TS.
The proposed algorithm addresses the exploitation–exploration dilemma through two factors: (1) ε, where a small value promotes exploitation, and (2) the number of averaged sample paths, where a sufficiently large value encourages exploitation.
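A minimal sketch of this switching rule is given below; `draw_sample_path`, `draw_average_path`, and `minimize_path` are placeholders for the path-generation and minimization routines of Algorithms 2 and 3.

```python
# Sketch of the epsilon switch: with probability eps run the generic TS (one path),
# otherwise the sample-average TS (M averaged paths).
import numpy as np

def select_next_point(eps, M, draw_sample_path, draw_average_path, minimize_path, rng):
    if rng.uniform() < eps:
        path = draw_sample_path()       # exploration: generic TS
    else:
        path = draw_average_path(M)     # exploitation: sample-average TS
    return minimize_path(path)          # new solution point for the next BO iteration
```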
4 Numerical Examples
4.1 Benchmark Functions.
Consider minimization problems of the 2d Ackley and 6d Rosenbrock functions [47]. The analytical expressions for these functions and their global minima are given below.
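In their standard parameterizations [47] (assumed here, since variants differ in constants), the functions read

$$f_{\text{Ackley}}(\mathbf{x}) = -20\exp\!\left(-0.2\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2}\right) - \exp\!\left(\tfrac{1}{d}\sum_{i=1}^{d}\cos(2\pi x_i)\right) + 20 + e,$$

$$f_{\text{Rosenbrock}}(\mathbf{x}) = \sum_{i=1}^{d-1}\left[100\,(x_{i+1} - x_i^2)^2 + (x_i - 1)^2\right],$$

with global minimum value zero attained at the origin for the Ackley function and at the all-ones point for the Rosenbrock function.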
While ε-greedy TS can be applied with any stationary covariance function listed in Table 1, we restrict our experiments to the ARD SE covariance function, assuming no specific prior knowledge about the characteristics of the objective functions. We use the DACE toolbox [35] to find optimal GP hyperparameters for each minimization problem. We set the standard deviation of observation noise for the z-score standardized output observations and generate spectral points to approximate the sample paths of GP posteriors. To minimize the generated sample paths, we use the DIRECT algorithm [48], whose output solution is set as the initial point of an interior-point optimizer. The parameters specified for the DIRECT algorithm are the function tolerance, the maximum number of function evaluations, the maximum number of iterations, and the maximum number of rectangle divisions. The parameters specified for the interior-point optimizer are the function tolerance, the optimality tolerance, the maximum number of iterations, the maximum number of function evaluations, and the termination tolerance on the solution vector. We carry out all experiments for the benchmark functions using the Carya Cluster of the University of Houston. We compare the optimization results obtained from ε-greedy TS with those from other BO methods, including EI, the lower confidence bound (LCB), the sample-average TS, and the generic TS.
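For readers working in an open-source stack, a rough SciPy analogue of this two-stage minimization of a sample path might look as follows; SciPy's DIRECT implementation and its trust-constr solver stand in for the MATLAB DIRECT and interior-point solvers used in the paper, and the tolerances shown are placeholders rather than the reported settings.

```python
# Sketch: global DIRECT search followed by local gradient-based refinement.
import numpy as np
from scipy.optimize import direct, minimize

def minimize_sample_path(g, bounds):
    """g: callable sample path; bounds: list of (lower, upper) pairs per dimension."""
    coarse = direct(g, bounds, maxfun=2000, maxiter=200)                  # global stage
    fine = minimize(g, x0=coarse.x, method="trust-constr", bounds=bounds) # local refinement
    return fine.x, fine.fun

# toy usage on a quadratic stand-in for a sample path
g = lambda x: float(np.sum((np.asarray(x) - 0.3) ** 2))
x_star, g_star = minimize_sample_path(g, bounds=[(0.0, 1.0), (0.0, 1.0)])
```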
To determine an appropriate number of sample paths in Eq. (10), which trades off exploitation against computational cost, our preliminary experiments investigate how changes in the number of sample paths affect the optimization results for a fixed ε. We observe that once the number of sample paths is sufficiently large, the solutions exhibit little sensitivity to further increases.
To ensure fair comparisons, we randomly generate 100 initial datasets for each problem using Latin hypercube sampling [49] and start each BO method from each of these datasets. In each BO iteration, we record the best-found observation error, defined as the difference between the best observation of the objective function found so far and its true minimum value, together with the corresponding solution vector. The numbers of initial observations and BO iterations are set separately for the 2d Ackley and 6d Rosenbrock functions.
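A sketch of how such Latin hypercube initial designs can be generated is given below; the paper's exact sample sizes, domain bounds, and seeds are not reproduced, so the values shown are illustrative.

```python
# Sketch: repeated Latin hypercube designs scaled to the problem domain.
import numpy as np
from scipy.stats import qmc

def initial_designs(n_designs, n_points, lower, upper, seed=0):
    lower, upper = np.asarray(lower), np.asarray(upper)
    sampler = qmc.LatinHypercube(d=lower.size, seed=seed)
    return [qmc.scale(sampler.random(n_points), lower, upper)
            for _ in range(n_designs)]

# e.g., 100 designs of 20 points on the standard 2d Ackley domain (illustrative sizes)
designs = initial_designs(100, 20, [-32.768, -32.768], [32.768, 32.768])
```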
Figures 3(a) and 3(b) show the medians and interquartile ranges obtained from 100 runs of each BO method for the 2d Ackley and 6d Rosenbrock functions, respectively. Figures 3(c) and 3(d) show the medians and interquartile ranges of the final solutions. In both cases, the optimization results from ε-greedy TS with an appropriate ε are better than those from one of its two extremes and competitive with those from the other. In addition, ε-greedy TS can provide the best objective values among the considered BO methods.
While the optimization results in Fig. 3 suggest that a moderate ε safely balances exploitation and exploration, an optimal ε may depend on the nature of the problem of interest. A problem that requires more exploitation for its optimal solution (e.g., the 2d Ackley function) may benefit from a smaller ε. A problem that requires more exploration (e.g., the 6d Rosenbrock function) might be better addressed with a larger ε. Yet determining whether a black-box objective function leans toward exploitation or exploration for its extrema is often elusive. This, ironically, underscores the rationale behind employing ε-greedy TS to prevent our search from overly favoring either exploitation or exploration. In ε-TS for multi-armed bandits, ε can be set as a function of the number of arms to select in order to achieve simultaneous minimax and asymptotic optimality [31]. This, however, does not apply to BO because the number of candidate solution points, which play the role of arms in multi-armed bandits, is uncountable. It is also worth noting that any endeavor to find an optimal value of ε for a particular problem should correspond to a specific accuracy level of the initial GP model because lowering the model fidelity always encourages exploration, irrespective of the selection of ε.
As discussed in Sec. 1, increasing ε results in more exploration of the ε-greedy strategy, and this holds true for ε-greedy TS. An increase in ε encourages the algorithm to explore more unseen regions of the input variable space, which is confirmed by observing how the algorithm, with varying ε values, selects new solution points for the Ackley function, as shown in Fig. 4.
We also investigate the effect of varying ε on the runtime for selecting a new solution point. Figure 5 shows approximate distributions of the runtime for selecting a new solution point by the sample-average TS, ε-greedy TS with different ε values, and the generic TS for the 2d Ackley and 6d Rosenbrock functions. There are two distinct clusters of runtime values for ε-greedy TS: one aligns with the runtime of the generic TS and the other corresponds to the runtime of the sample-average TS. In addition, the average runtimes per BO trial of the sample-average TS, ε-greedy TS with three increasing values of ε, and the generic TS are 124, 102, 64, 18, and 7 s for the 2d Ackley function, respectively. Those for the 6d Rosenbrock function are 1377, 1188, 809, 229, and 111 s, respectively. We see that increasing ε reduces the runtime of each BO trial. This can be attributed to the fact that a larger ε more frequently invokes the generic TS, which is cheaper than the sample-average TS.
4.2 Cantilever Beam.
In this example, we use ε-greedy TS to identify the constitutive parameters for simulating the behavior of a steel cantilever beam under a static cyclic loading history RH1, as shown in Fig. 6. The web and flange of the beam have different material parameters. In the experiment conducted by Yamada and Jiao [50], a forced vertical displacement was applied at the free end of the beam. The deflection angle was defined as the ratio of the vertical tip displacement to the beam length. Values of the moment reaction at the beam support were measured.
To evaluate the misfit between measured and simulated responses, we model the beam using Abaqus 2022. We reduce the sensitivity of the simulation estimates to the finite element (FE) mesh density by generating a fine mesh of 4960 nodes and 3510 linear hexahedral elements of type C3D8, see Fig. 6. The maximum increment size for each loading history is set to 0.01 s.
4.2.1 Nonlinear Combined Isotropic/Kinematic Hardening.
The nonlinear combined isotropic/kinematic hardening model makes use of Eqs. (15)–(17) for checking the material status at each time instant. If we use one back-stress component for this model, its underlying parameters consist of the elastic constants and the isotropic and kinematic hardening parameters, where Young's modulus and Poisson's ratio enter through the isotropic elastic tensor.
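For reference, a commonly used one-backstress form of the model, as implemented in Abaqus and assumed here to match Eqs. (15)–(17), evolves the backstress $\boldsymbol{\alpha}$ and the yield surface size $\sigma^0$ with the equivalent plastic strain $\bar{\varepsilon}^{\,pl}$ as

$$\dot{\boldsymbol{\alpha}} = C_1 \frac{1}{\sigma^0}\left(\boldsymbol{\sigma} - \boldsymbol{\alpha}\right)\dot{\bar{\varepsilon}}^{\,pl} - \gamma_1\,\boldsymbol{\alpha}\,\dot{\bar{\varepsilon}}^{\,pl}, \qquad \sigma^0 = \sigma|_0 + Q_\infty\left(1 - e^{-b\,\bar{\varepsilon}^{\,pl}}\right),$$

where $\sigma|_0$ is the initial yield stress, $Q_\infty$ and $b$ are the isotropic hardening parameters, and $C_1$ and $\gamma_1$ are the kinematic hardening parameters; this naming of the parameters per component is an assumption consistent with the one-backstress model.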
4.2.2 Constitutive Parameters by Bayesian Optimization.
During the minimization of the misfit function, we keep Young's modulus and Poisson's ratio for the web and flange of the beam constant at the values reported in Ref. [50]. As a result, we identify ten parameters for the beam, of which five are associated with the web and the remaining five with the flange. The permissible range for each parameter is taken from Ref. [46] and listed in Table 2. To initiate BO, we generate ten distinct initial datasets, half containing 50 data points each and the remainder 100 data points each. With the planned budget of BO iterations, we execute the optimization process using the Carya Cluster housed at the University of Houston. This cluster hosts 9984 Intel CPU cores and 327,680 Nvidia GPU cores integrated within 188 compute and 20 GPU nodes. This infrastructure allows us to complete a costly beam simulation in about 3 min.
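A schematic of the misfit evaluation is sketched below; `run_beam_simulation` is a hypothetical wrapper around the Abaqus model, and the sum-of-squares form is an assumption about how the discrepancy between measured and simulated responses is aggregated.

```python
# Sketch: misfit between measured and simulated moment histories for one parameter set.
import numpy as np

def misfit(params, theta_exp, m_exp, run_beam_simulation):
    """Sum-of-squares discrepancy between the measured moments m_exp and the
    simulated moments evaluated at the experimental deflection angles theta_exp."""
    m_sim = run_beam_simulation(params, theta_exp)   # placeholder for the FE model call
    return float(np.sum((m_sim - m_exp) ** 2))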
Table 2 Permissible ranges of the constitutive parameters for the web and flange of the beam

Component | Parameter | Lower bound | Upper bound
---|---|---|---
Web | (GPa) | — |
 | | — |
 | (MPa) | |
 | (MPa) | |
 | (MPa) | |
Flange | 175.05 | — |
 | | — |
 | (MPa) | |
 | (MPa) | |
 | (MPa) | |
Figure 7 shows the optimization histories and variations in the final solutions obtained from different BO methods for the cantilever beam, where the plotted quantity represents the best value of the misfit function found in each BO iteration. The identification results obtained from ε-greedy TS with the tested ε values are comparable to or better than that from the sample-average TS, and are better than that from the generic TS. More notably, ε-greedy TS with one of the tested ε values provides the best set of identified parameters.
We use the best set of parameters identified by ε-greedy TS to predict the moment–deflection angle curves of the beam under RH1 and two other cyclic loading histories, RH2 and RH3. The experimental data associated with RH2 and RH3 were not used in the parameter identification process. The agreement observed between the measured and simulated curves for each loading history, as shown in Fig. 8, confirms the reliable predictive performance achieved with the identified parameters.
5 Conclusions
We introduced ε-greedy TS for optimizing costly simulation-based objective functions. The method addresses the classic exploitation–exploration dilemma by randomly switching between its two extremes, namely the generic TS and the sample-average TS. Our empirical findings reveal that ε-greedy TS with an appropriate ε value robustly achieves the better performance of its two extremes across the benchmark problems, and can provide the best result in a practical engineering application.
While several ε-greedy algorithms and the generic TS are guaranteed to converge eventually [4,30], we look forward to a theoretical analysis that elucidates the convergence properties of ε-greedy TS. The regret bound analysis of ε-TS [31] for multi-armed bandits can serve as a starting point for this endeavor. We are also interested in extensions of ε-greedy TS to high-dimensional settings, where good approximations of the GP posteriors and efficient optimization of TS acquisition functions in high dimensions are of interest. For this, subspace-based approaches [56,57] or trust-region methods [58] present viable strategies. Additionally, the application of ε-greedy TS to non-Gaussian likelihood settings and the evolution of ε during the optimization process remain open problems. The former may be addressed using a Markov chain Monte Carlo algorithm (see, e.g., Refs. [59,60]) to sample a large set of function values or to sample the weight values in Eq. (8) from a non-Gaussian posterior.
Acknowledgment
We thank the University of Houston for providing a startup fund to support this research. We thank Professor Satoshi Yamada at the University of Tokyo and Professor Makoto Ohsaki at Kyoto University for providing the experimental results of the cantilever beam used in Sec. 4.2.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The data were provided by a third party, as noted in the Acknowledgment.