1 Introduction

In Ref. [1], the authors developed a sequential Bayesian optimal design framework to estimate the statistical expectation of a black-box function $f(\mathbf{x}): \mathbb{R}^d \rightarrow \mathbb{R}$. Let $\mathbf{x} \sim p(\mathbf{x})$, with $p(\mathbf{x})$ the probability distribution of the input $\mathbf{x}$; the statistical expectation is then defined as
$$q = \int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{1}$$
The function $f(\mathbf{x})$ is not known a priori but can be evaluated at arbitrary $\mathbf{x}$ with Gaussian noise of variance $\sigma^2$:
$$y = f(\mathbf{x}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) \tag{2}$$
Based on the Gaussian process surrogate learned from the available samples $\mathcal{D}_n = \{X_n, Y_n\}$, i.e., $f(\mathbf{x})\,|\,\mathcal{D}_n \sim \mathcal{GP}\big(m_n(\mathbf{x}), k_n(\mathbf{x},\mathbf{x}')\big)$, the next-best sample is chosen by maximizing the information-based acquisition $G(\tilde{\mathbf{x}})$:
$$\mathbf{x}_{n+1} = \operatorname*{arg\,max}_{\tilde{\mathbf{x}}}\; G(\tilde{\mathbf{x}}) \tag{3}$$
where $G(\tilde{\mathbf{x}})$ computes the information gain of adding a sample $\tilde{y}$ at $\tilde{\mathbf{x}}$, i.e., the expected KL divergence between the current estimation $p(q\,|\,\mathcal{D}_n)$ and the hypothetical next-step estimation $p(q\,|\,\mathcal{D}_n, \tilde{\mathbf{x}}, \tilde{y})$:
$$G(\tilde{\mathbf{x}}) = \mathbb{E}_{\tilde{y}}\Big[\mathrm{KL}\big(p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y})\,\big\|\,p(q\,|\,\mathcal{D}_n)\big)\Big] = \int\!\!\int p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}) \log\frac{p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y})}{p(q\,|\,\mathcal{D}_n)}\, dq\; p(\tilde{y}\,|\,\tilde{\mathbf{x}},\mathcal{D}_n)\, d\tilde{y} \tag{4}$$
In Eq. (4), $\tilde{y}$ is distributed according to the current surrogate $f(\mathbf{x})\,|\,\mathcal{D}_n$, i.e., $\tilde{y} \sim \mathcal{N}\big(m_n(\tilde{\mathbf{x}}),\, k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2\big)$; $q$ is considered as a random variable with uncertainties coming from the (current and next-step) surrogates. It is noted that $G(\tilde{\mathbf{x}})$ also depends on the hyperparameters $\boldsymbol{\theta}$ of the learned Gaussian process $f(\mathbf{x})\,|\,\mathcal{D}_n$. We neglect this dependence for simplicity, which does not affect the main derivation.
As a major contribution of the discussed paper, the authors simplified the information-based acquisition to Eq. (30) in Ref. [1]:
$$G(\tilde{\mathbf{x}}) = \log\!\left(\frac{\sigma_1}{\sigma_2(\tilde{\mathbf{x}})}\right) + \frac{\sigma_2^2(\tilde{\mathbf{x}})}{2\sigma_1^2} - \frac{1}{2} + \frac{v(\tilde{\mathbf{x}})^2}{2\sigma_1^2\big(\sigma_n^2(\tilde{\mathbf{x}}) + \sigma^2\big)} \tag{5}$$
where $\sigma_1^2$ and $\sigma_2^2(\tilde{\mathbf{x}})$ are, respectively, the variances of the current and the hypothetical next-step estimations of $q$; $v(\tilde{\mathbf{x}}) = \int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$ and $\sigma_n^2(\tilde{\mathbf{x}}) = k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}})$. Furthermore, for the numerical computation of Eq. (5), the authors developed analytical formulas for each involved quantity (important for high-dimensional computation) under a uniform distribution of $\mathbf{x}$.
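For reference, a minimal sketch of evaluating Eq. (5) is given below (a direct transcription of the formula, not the authors' implementation); the inputs `sigma1`, `sigma2`, `v`, `sigma_n2`, and `noise_var` are hypothetical placeholders for $\sigma_1$, $\sigma_2(\tilde{\mathbf{x}})$, $v(\tilde{\mathbf{x}})$, $\sigma_n^2(\tilde{\mathbf{x}})$, and $\sigma^2$, which would come from the learned surrogate:

```python
import numpy as np

def acquisition_eq5(sigma1, sigma2, v, sigma_n2, noise_var):
    """Evaluate the acquisition of Eq. (5) from its constituent quantities
    (sigma1 and sigma2 are standard deviations, not variances)."""
    return (np.log(sigma1 / sigma2)
            + sigma2**2 / (2.0 * sigma1**2)
            - 0.5
            + v**2 / (2.0 * sigma1**2 * (sigma_n2 + noise_var)))
```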

The purpose of our discussion is to show the following two critical points:

  1. The last three terms of Eq. (5) always add up to zero, leaving a concise form with a much more intuitive interpretation of the acquisition.

  2. The analytical computation of Eq. (5) can be generalized to an arbitrary input distribution of $\mathbf{x}$, greatly broadening the application of the developed framework.

These two points are discussed, respectively, in Secs. 2 and 3.

2 Derivation of the Simplified Acquisition $G(\tilde{\mathbf{x}})$

To simplify Eq. (4), we first notice that $q\,|\,\mathcal{D}_n$ follows a Gaussian distribution with mean $\mu_1$ and variance $\sigma_1^2$:
$$p(q\,|\,\mathcal{D}_n) = \mathcal{N}(q;\, \mu_1, \sigma_1^2) \tag{6}$$
$$\mu_1 = \mathbb{E}\!\left[\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\, \Big|\, \mathcal{D}_n\right] = \int m_n(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{7}$$
$$\sigma_1^2 = \mathbb{E}\!\left[\Big(\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\Big)^{\!2}\, \Big|\, \mathcal{D}_n\right] - \left(\mathbb{E}\!\left[\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\, \Big|\, \mathcal{D}_n\right]\right)^{\!2} = \int\!\!\int k_n(\mathbf{x},\mathbf{x}')\, p(\mathbf{x})\, p(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \tag{8}$$
After adding one hypothetical sample $\{\tilde{\mathbf{x}}, \tilde{y}\}$, the function follows an updated surrogate $f(\mathbf{x})\,|\,\mathcal{D}_n, \tilde{\mathbf{x}}, \tilde{y} \sim \mathcal{GP}\big(m_{n+1}(\mathbf{x}), k_{n+1}(\mathbf{x},\mathbf{x}')\big)$ with
$$m_{n+1}(\mathbf{x}) = m_n(\mathbf{x}) + \frac{k_n(\tilde{\mathbf{x}},\mathbf{x})\,\big(\tilde{y} - m_n(\tilde{\mathbf{x}})\big)}{k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2} \tag{9}$$
$$k_{n+1}(\mathbf{x},\mathbf{x}') = k_n(\mathbf{x},\mathbf{x}') - \frac{k_n(\tilde{\mathbf{x}},\mathbf{x})\, k_n(\mathbf{x}',\tilde{\mathbf{x}})}{k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2} \tag{10}$$
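For illustration, a minimal sketch of this one-sample update is given below, assuming the current posterior mean $m_n$ and covariance $k_n$ are available as callables; all names are hypothetical:

```python
def updated_surrogate(m_n, k_n, x_tilde, y_tilde, noise_var):
    """Rank-one GP update of Eqs. (9)-(10) for a hypothetical sample (x_tilde, y_tilde)."""
    denom = k_n(x_tilde, x_tilde) + noise_var

    def m_np1(x):
        # Eq. (9): updated posterior mean
        return m_n(x) + k_n(x_tilde, x) * (y_tilde - m_n(x_tilde)) / denom

    def k_np1(x, x_prime):
        # Eq. (10): updated posterior covariance
        return k_n(x, x_prime) - k_n(x_tilde, x) * k_n(x_prime, x_tilde) / denom

    return m_np1, k_np1
```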
The quantity $q\,|\,\mathcal{D}_n, \tilde{\mathbf{x}}, \tilde{y}$ can then be represented by another Gaussian with mean $\mu_2$ and variance $\sigma_2^2$:
$$p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}) = \mathcal{N}\big(q;\, \mu_2(\tilde{\mathbf{x}},\tilde{y}),\, \sigma_2^2(\tilde{\mathbf{x}})\big) \tag{11}$$
$$\mu_2(\tilde{\mathbf{x}},\tilde{y}) = \mathbb{E}\!\left[\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\, \Big|\, \mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}\right] = \int m_{n+1}(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \mu_1 + \frac{\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}}{k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2}\,\big(\tilde{y} - m_n(\tilde{\mathbf{x}})\big) \tag{12}$$
$$\sigma_2^2(\tilde{\mathbf{x}}) = \mathbb{E}\!\left[\Big(\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\Big)^{\!2}\, \Big|\, \mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}\right] - \left(\mathbb{E}\!\left[\int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\, \Big|\, \mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}\right]\right)^{\!2} = \int\!\!\int k_{n+1}(\mathbf{x},\mathbf{x}')\, p(\mathbf{x})\, p(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' = \sigma_1^2 - \frac{\big(\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\big)^2}{k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2} \tag{13}$$
We note that Eqs. (7), (8), (12), and (13) are, respectively, intermediate steps of Eqs. (19), (21), (26), and (28) in the discussed paper. Substituting Eqs. (6) and (11) into Eq. (4) and using the closed-form KL divergence between two Gaussian distributions, one obtains:
$$\begin{aligned} G(\tilde{\mathbf{x}}) &= \int\!\!\int p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y}) \log\frac{p(q\,|\,\mathcal{D}_n,\tilde{\mathbf{x}},\tilde{y})}{p(q\,|\,\mathcal{D}_n)}\, dq\; p(\tilde{y}\,|\,\tilde{\mathbf{x}},\mathcal{D}_n)\, d\tilde{y} \\ &= \int \left(\log\frac{\sigma_1}{\sigma_2(\tilde{\mathbf{x}})} + \frac{\sigma_2^2(\tilde{\mathbf{x}})}{2\sigma_1^2} + \frac{\big(\mu_2(\tilde{\mathbf{x}},\tilde{y}) - \mu_1\big)^2}{2\sigma_1^2} - \frac{1}{2}\right) p(\tilde{y}\,|\,\tilde{\mathbf{x}},\mathcal{D}_n)\, d\tilde{y} \\ &= \log\frac{\sigma_1}{\sigma_2(\tilde{\mathbf{x}})} + \frac{1}{2\sigma_1^2}\left(\int \big(\mu_2(\tilde{\mathbf{x}},\tilde{y}) - \mu_1\big)^2\, p(\tilde{y}\,|\,\tilde{\mathbf{x}},\mathcal{D}_n)\, d\tilde{y} + \sigma_2^2(\tilde{\mathbf{x}}) - \sigma_1^2\right) \\ &= \log\frac{\sigma_1}{\sigma_2(\tilde{\mathbf{x}})} + \frac{1}{2\sigma_1^2}\left(\frac{\big(\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}\big)^2}{k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2} + \sigma_2^2(\tilde{\mathbf{x}}) - \sigma_1^2\right) \end{aligned} \tag{14}$$
$$= \log\frac{\sigma_1}{\sigma_2(\tilde{\mathbf{x}})} \tag{15}$$
where Eq. (14) is exactly Eq. (5) (or Eq. (30) in the discussed paper). The fact that the last three terms of Eq. (14) sum up to zero is a direct result of Eq. (13).
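This cancellation is easy to verify numerically; the sketch below uses arbitrary (hypothetical) values for the quantities entering Eq. (5), with $\sigma_2^2(\tilde{\mathbf{x}})$ obtained from Eq. (13):

```python
import numpy as np

# Hypothetical values of the quantities entering Eq. (5)
sigma1_sq = 2.0       # sigma_1^2, Eq. (8)
v = 0.7               # v(x~) = \int k_n(x~, x) p(x) dx
sigma_n2 = 1.3        # sigma_n^2(x~) = k_n(x~, x~)
noise_var = 0.1       # sigma^2

# Eq. (13): hypothetical next-step variance of q
sigma2_sq = sigma1_sq - v**2 / (sigma_n2 + noise_var)

# Eq. (5)/(14): full form of the acquisition
G_full = (0.5 * np.log(sigma1_sq / sigma2_sq)
          + sigma2_sq / (2 * sigma1_sq) - 0.5
          + v**2 / (2 * sigma1_sq * (sigma_n2 + noise_var)))

# Eq. (15): simplified form
G_simple = 0.5 * np.log(sigma1_sq / sigma2_sq)

assert np.isclose(G_full, G_simple)   # the last three terms of Eq. (14) cancel
```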
The advantage of the simplified form (15) is that the optimization (3) admits a much more intuitive physical interpretation. Since $\sigma_1$ does not depend on $\tilde{\mathbf{x}}$, Eq. (3) can be reformulated as
$$\mathbf{x}_{n+1} = \operatorname*{arg\,min}_{\tilde{\mathbf{x}}}\; \sigma_2^2(\tilde{\mathbf{x}}) \tag{16}$$
which selects the next-best sample as the one minimizing the expected variance of $q$. A similar optimization criterion is also used in Refs. [2,3] for the purpose of computing extreme-event probabilities.
An alternative interpretation can be obtained by writing (3) as
$$\mathbf{x}_{n+1} = \operatorname*{arg\,max}_{\tilde{\mathbf{x}}}\; \big(\sigma_1^2 - \sigma_2^2(\tilde{\mathbf{x}})\big) = \operatorname*{arg\,max}_{\tilde{\mathbf{x}}} \left(\frac{\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}}{\big(k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2\big)^{1/2}}\right)^{\!2} \tag{17}$$
$$= \operatorname*{arg\,max}_{\tilde{\mathbf{x}}} \left(\int \rho_{y|\mathcal{D}_n}(\tilde{\mathbf{x}},\mathbf{x})\, \big(k_n(\mathbf{x},\mathbf{x}) + \sigma^2\big)^{1/2}\, p(\mathbf{x})\, d\mathbf{x}\right)^{\!2} \tag{18}$$
where Eq. (17) is a result of Eq. (13), and $\rho_{y|\mathcal{D}_n}(\tilde{\mathbf{x}},\mathbf{x}) = k_n(\tilde{\mathbf{x}},\mathbf{x}) / \big((k_n(\tilde{\mathbf{x}},\tilde{\mathbf{x}}) + \sigma^2)(k_n(\mathbf{x},\mathbf{x}) + \sigma^2)\big)^{1/2}$ is the correlation of $y$ for two inputs $\tilde{\mathbf{x}}$ and $\mathbf{x}$. Equation (18) can be interpreted as selecting the next sample that has the overall largest (weighted) correlation with all $\mathbf{x}$.
We finally remark that the above derivation is for given hyperparameter values $\boldsymbol{\theta}$ in $f(\mathbf{x})\,|\,\mathcal{D}_n$. This is consistent with the Bayesian approach where the optimal values of $\boldsymbol{\theta}$ are chosen by maximizing the likelihood function. However, the discussed paper used a different approach, sampling a distribution of $\boldsymbol{\theta}$ and computing $G(\tilde{\mathbf{x}})$ as an average over the samples. In the latter case, the above analysis carries over in a slightly different form, i.e., Eq. (16) should be considered as minimization of the product of $\sigma_2^2$ over all samples $\{\boldsymbol{\theta}^{(i)}\}_{i=1}^{s}$:
$$\mathbf{x}_{n+1} = \operatorname*{arg\,max}_{\tilde{\mathbf{x}}}\; \frac{1}{s}\sum_{i=1}^{s} G\big(\tilde{\mathbf{x}}, \boldsymbol{\theta}^{(i)}\big) = \operatorname*{arg\,max}_{\tilde{\mathbf{x}}}\; \frac{1}{s}\sum_{i=1}^{s} \log\frac{\sigma_1(\boldsymbol{\theta}^{(i)})}{\sigma_2(\tilde{\mathbf{x}}, \boldsymbol{\theta}^{(i)})} = \operatorname*{arg\,min}_{\tilde{\mathbf{x}}}\; \frac{1}{s}\sum_{i=1}^{s} \log \sigma_2\big(\tilde{\mathbf{x}}, \boldsymbol{\theta}^{(i)}\big) = \operatorname*{arg\,min}_{\tilde{\mathbf{x}}}\; \prod_{i=1}^{s} \sigma_2^2\big(\tilde{\mathbf{x}}, \boldsymbol{\theta}^{(i)}\big) \tag{19}$$
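In a discrete-candidate setting, Eq. (19) can be carried out with a few lines of code; the sketch below is a hypothetical illustration in which `sigma2_sq_fn(x, theta)` is assumed to return $\sigma_2^2(\tilde{\mathbf{x}}, \boldsymbol{\theta})$ computed from Eq. (13):

```python
import numpy as np

def next_sample_multi_theta(candidates, sigma2_sq_fn, thetas):
    """Eq. (19): pick the candidate minimizing sum_i log sigma_2^2(x~, theta_i),
    which is equivalent to minimizing the product of sigma_2^2 over theta samples."""
    scores = [sum(np.log(sigma2_sq_fn(x, th)) for th in thetas) for x in candidates]
    return candidates[int(np.argmin(scores))]
```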

By using the GitHub code from the authors of Ref. [1] for their test cases, we have confirmed that the new results based on Eq. (15) are identical to the original results based on Eq. (5).

3 Analytical Computation of $G(\tilde{\mathbf{x}})$ for Arbitrary Input Distribution $p(\mathbf{x})$

In the computation of $G(\tilde{\mathbf{x}})$ in the form of Eq. (17), the heaviest computation involved is the integral $\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$ (which is prohibitive in high-dimensional problems if direct numerical integration is performed). Following the discussed paper, the integral can be reformulated as
$$\int k_n(\tilde{\mathbf{x}},\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = K(\tilde{\mathbf{x}}) - k(\tilde{\mathbf{x}}, X_n)\big(K(X_n,X_n) + \sigma^2 I_n\big)^{-1} K(X_n) \tag{20}$$
where
$$K(\mathbf{x}) = \int k(\mathbf{x},\mathbf{x}')\, p(\mathbf{x}')\, d\mathbf{x}' \tag{21}$$
$$k(\mathbf{x},\mathbf{x}') = s^2 \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\mathbf{x}')^{\mathrm{T}} \Lambda^{-1} (\mathbf{x}-\mathbf{x}')\right) \tag{22}$$
with $s$ and $\Lambda$ the hyperparameters of the kernel function (with either optimized values from training or selected values as in Ref. [1]). In Eq. (20), $k(\tilde{\mathbf{x}}, X_n)$ and $K(X_n,X_n)$ denote, respectively, the kernel vector and kernel matrix evaluated at the training inputs, and $K(X_n)$ is the column vector with entries $K(\mathbf{x}^{(i)})$, $i = 1, \dots, n$.
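As a sketch of how Eq. (20) can be evaluated, the snippet below assumes the squared-exponential kernel of Eq. (22) with hypothetical hyperparameters `s2` ($s^2$) and `Lam` ($\Lambda$), and a callable `K_fn` implementing $K(\cdot)$ of Eq. (21), e.g., via the Gaussian-mixture formula of Eq. (24) below:

```python
import numpy as np

def rbf_kernel(X, Y, s2, Lam):
    """Squared-exponential kernel of Eq. (22); Lam is the d x d matrix Lambda."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    diff = X[:, None, :] - Y[None, :, :]
    quad = np.einsum('nmi,ij,nmj->nm', diff, np.linalg.inv(Lam), diff)
    return s2 * np.exp(-0.5 * quad)

def v_integral(x_tilde, X_n, K_fn, s2, Lam, noise_var):
    """Eq. (20): int k_n(x~,x) p(x) dx = K(x~) - k(x~,X_n)(K(X_n,X_n)+sigma^2 I)^{-1} K(X_n)."""
    K_nn = rbf_kernel(X_n, X_n, s2, Lam) + noise_var * np.eye(len(X_n))
    k_vec = rbf_kernel(x_tilde, X_n, s2, Lam).ravel()        # k(x~, X_n)
    K_Xn = np.array([K_fn(x) for x in X_n])                  # vector of K(x_i), Eq. (21)
    return K_fn(x_tilde) - k_vec @ np.linalg.solve(K_nn, K_Xn)
```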
The main computation is then Eq. (21), for which the authors of the discussed paper addressed the situation of uniform $p(\mathbf{x})$. To generalize the formulation to arbitrary $p(\mathbf{x})$, we can approximate $p(\mathbf{x})$ by a Gaussian mixture model (a universal approximator of distributions [4]):
$$p(\mathbf{x}) \approx \sum_{i=1}^{n_{\mathrm{GMM}}} \alpha_i\, \mathcal{N}\big(\mathbf{x};\, \mathbf{w}_i, \Sigma_i\big) \tag{23}$$
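The mixture parameters $\{\alpha_i, \mathbf{w}_i, \Sigma_i\}$ can be obtained by standard expectation-maximization fitting of samples of $\mathbf{x}$; as one possible (hypothetical) choice, not necessarily that of Ref. [1], a scikit-learn sketch is:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# x_samples: (N, d) array of samples drawn from (or representative of) p(x);
# the gamma draws below are only a hypothetical stand-in for a non-Gaussian input.
x_samples = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=(5000, 2))

gmm = GaussianMixture(n_components=4, covariance_type="full").fit(x_samples)
alphas, ws, Sigmas = gmm.weights_, gmm.means_, gmm.covariances_   # Eq. (23) parameters
```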
Equation (21) can then be formulated as:
$$K(\mathbf{x}) \approx \sum_{i=1}^{n_{\mathrm{GMM}}} \alpha_i \int k(\mathbf{x},\mathbf{x}')\, \mathcal{N}\big(\mathbf{x}';\, \mathbf{w}_i, \Sigma_i\big)\, d\mathbf{x}' = \sum_{i=1}^{n_{\mathrm{GMM}}} \alpha_i\, \big|\Sigma_i \Lambda^{-1} + I\big|^{-1/2}\, k\big(\mathbf{x}, \mathbf{w}_i;\, \Sigma_i + \Lambda\big) \tag{24}$$
which yields an analytical computation, where $k(\mathbf{x}, \mathbf{w}_i;\, \Sigma_i + \Lambda)$ denotes the kernel (22) with $\Lambda$ replaced by $\Sigma_i + \Lambda$. In practice, the number of mixtures $n_{\mathrm{GMM}}$ is determined by the complexity of the input distribution, but any distribution $p(\mathbf{x})$ can be approximated in this way.
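A minimal sketch of Eq. (24), under the same assumptions as above (squared-exponential kernel with hypothetical hyperparameters `s2` and `Lam`, and GMM parameters `alphas`, `ws`, `Sigmas`), is:

```python
import numpy as np

def K_gmm(x, alphas, ws, Sigmas, s2, Lam):
    """Eq. (24): K(x) = sum_i alpha_i |Sigma_i Lam^{-1} + I|^{-1/2} k(x, w_i; Sigma_i + Lam)."""
    x = np.ravel(x)
    d = x.size
    total = 0.0
    for alpha, w, Sig in zip(alphas, ws, Sigmas):
        det_factor = np.linalg.det(Sig @ np.linalg.inv(Lam) + np.eye(d)) ** (-0.5)
        diff = x - w
        quad = diff @ np.linalg.solve(Sig + Lam, diff)   # kernel with Lambda -> Sigma_i + Lambda
        total += alpha * det_factor * s2 * np.exp(-0.5 * quad)
    return total
```

Such a function can serve as the `K_fn` callable in the Eq. (20) sketch above.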
Finally, in computing $G(\tilde{\mathbf{x}})$ in the form of Eq. (19), the computation of $\sigma_1^2$ in Eq. (8) is necessary. This can also be generalized to arbitrary $p(\mathbf{x})$ using the Gaussian mixture model as follows:
$$\begin{aligned} \sigma_1^2 &= \int\!\!\int k_n(\mathbf{x},\mathbf{x}')\, p(\mathbf{x})\, p(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' \\ &= \int\!\!\int k(\mathbf{x},\mathbf{x}')\, p(\mathbf{x})\, p(\mathbf{x}')\, d\mathbf{x}\, d\mathbf{x}' - \int k(\mathbf{x}, X_n)\, p(\mathbf{x})\, d\mathbf{x}\, \big(K(X_n,X_n) + \sigma^2 I_n\big)^{-1} \int k(X_n, \mathbf{x}')\, p(\mathbf{x}')\, d\mathbf{x}' \\ &= \sum_{i=1}^{n_{\mathrm{GMM}}} \sum_{j=1}^{n_{\mathrm{GMM}}} \alpha_i \alpha_j\, |\Lambda|^{1/2}\, |\Lambda + \Sigma_i + \Sigma_j|^{-1/2}\, k\big(\mathbf{w}_i, \mathbf{w}_j;\, \Lambda + \Sigma_i + \Sigma_j\big) - K(X_n)^{\mathrm{T}} \big(K(X_n,X_n) + \sigma^2 I_n\big)^{-1} K(X_n) \end{aligned} \tag{25}$$
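A corresponding sketch of Eq. (25) is given below; `K_Xn` is the vector of $K(\mathbf{x}^{(i)})$ values at the training inputs (e.g., computed with the Eq. (24) sketch) and `K_nn_noisy` is the matrix $K(X_n,X_n) + \sigma^2 I_n$; all names are hypothetical:

```python
import numpy as np

def sigma1_sq_gmm(alphas, ws, Sigmas, s2, Lam, K_Xn, K_nn_noisy):
    """Eq. (25): sigma_1^2 under a Gaussian-mixture approximation of p(x)."""
    # First term: double sum over mixture components
    first = 0.0
    for ai, wi, Si in zip(alphas, ws, Sigmas):
        for aj, wj, Sj in zip(alphas, ws, Sigmas):
            C = Lam + Si + Sj
            det_factor = np.sqrt(np.linalg.det(Lam) / np.linalg.det(C))
            diff = wi - wj
            first += ai * aj * det_factor * s2 * np.exp(-0.5 * diff @ np.linalg.solve(C, diff))
    # Second term: K(X_n)^T (K(X_n,X_n) + sigma^2 I)^{-1} K(X_n)
    second = K_Xn @ np.linalg.solve(K_nn_noisy, K_Xn)
    return first - second
```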

Conflicts of Interest

There are no conflicts of interest.

Data Availability Statement

No data, models, or code were generated or used for this paper.

References

1. Pandita, P., Bilionis, I., and Panchal, J., 2019, "Bayesian Optimal Design of Experiments for Inferring the Statistical Expectation of Expensive Black-Box Functions," ASME J. Mech. Des., 141(10), p. 101404.
2. Hu, Z., and Mahadevan, S., 2016, "Global Sensitivity Analysis-Enhanced Surrogate (GSAS) Modeling for Reliability Analysis," Struct. Multidiscipl. Optim., 53(3), pp. 501–521.
3. Blanchard, A., and Sapsis, T., 2021, "Output-Weighted Optimal Sampling for Bayesian Experimental Design and Uncertainty Quantification," SIAM/ASA J. Uncertainty Quantif., 9(2), pp. 564–592.
4. Goodfellow, I., Bengio, Y., and Courville, A., 2016, Deep Learning, MIT Press, Cambridge, MA.