We describe a framework for the verification of Bayesian model calibration routines. The framework is based on linear regression and can be configured to verify calibration to data with a range of observation error characteristics. The framework is designed for efficient implementation and is suitable for verifying code intended for large-scale problems. We propose an approach for using the framework to verify Markov chain Monte Carlo (MCMC) software by combining it with a nonparametric test for distribution equality based on the energy statistic. Our matlab-based reference implementation of the framework is shown to correctly distinguish between output obtained from correctly and incorrectly implemented MCMC routines. Since the correctness of MCMC output depends on choosing settings appropriate for the problem of interest, our framework can potentially also be used to verify such settings.

## Introduction

The growth of interest in uncertainty quantification research is motivated by the potential to improve the practical relevance of computational mathematics to applications in engineering, science, and industry. The breadth of applications found as examples in a small sample of the literature supports this notion. Applications exist for optimal control theory [1,2], acoustic and electromagnetic scattering [3–5], smart material systems [6,7], fluid dynamics [8–12], aircraft design [13–15], physiology and disease modeling [16–19], geology [20–22], and nondestructive evaluation [23–25]. Broadly speaking, uncertainty quantification is concerned with characterizing variability in the outputs of mathematical models arising from uncertainties in the inputs, typically through statistical approaches which facilitate comparison to measurement. Improved understanding of model variability and its relationship to measurement uncertainty translates to improved predictability when applying a computational model to a real-world phenomenon.

Although the concern for understanding variability in mathematical models is a unifying theme, the field of uncertainty quantification encompasses a diverse range of topics, as the recent collection of papers in Ref. [26] demonstrates. For broad introductions to the field as a whole, we refer the reader to Refs. [27,28]. In this paper, our focus is on Bayesian formulations for model calibration. A general introduction to Bayesian methods in scientific computing can be found in Ref. [29]. A useful introduction to Bayesian methods in the context of model calibration is available in Ref. [30]. For a treatment of the mathematical foundations underlying these methods, see Ref. [31]. Accepting the premise that interest in uncertainty quantification is largely motivated by its benefits to practical applications of computational models, then the subfield of model calibration is particularly important. Model calibration quantifies the relationship between observations of a phenomenon and the parameters of the chosen mathematical model. Bayesian techniques provide a rigorous, data-driven approach to parameter inference. They are a natural and robust methodology for uncertainty quantification, as they express calibration results as a probability distribution instead of a point-estimate. Bayesian inference therefore forms a key component of the link between measurement and abstract mathematical model. The reliability of these computational techniques is critical to ensuring the methods of uncertainty quantification fulfill their potential to bolster the contributions of computational engineering in practice.

However useful an algorithm may be, it is only as reliable as its implementation. This naturally leads to the computational engineering research topic of verification. For a general introduction to verification, see Ref. [32]. A review of verification research for general computational simulation is available in Ref. [33] and Ref. [34] reviews verification in computational fluid dynamics context. Verification is a necessary first step for uncertainty quantification. Whereas uncertainty quantification generally deals with the identification of variabilities in models and how they propagate, verification involves approaches to testing the implementation of a computational algorithm to ensure the implementation correctly performs the intended computations. In particular, verification is designed to ensure that the outputs of a computational model accurately reflect the solution of the underlying mathematical model. Tests are designed to estimate the errors from the numerical discretization scheme, as well as to check that the convergence rate (as a function of mesh size or timestep) observed in the software output matches that of the implemented numerical scheme.

One of the simplest methods of code verification is to test the numerical output of a code against an exact, analytical solution. This approach is of limited utility, as analytical solutions typically do not exist for practical problems. Another method is to compare simulation output from the implementation being verified against simulation output from a reference solution. This method is also of limited use, as it requires the availability of an already verified implementation.

A particularly flexible approach to verification is the method of manufactured solutions (MMS), which applies to numerical algorithms for solving nearly all deterministic mathematical equations. MMS assumes a parameterized functional form for the solution that leaves an analytically computable residual when inserted into the equation to be solved. The residual is used to derive inputs for which the functional form does solve the equation of interest. This is a powerful approach because it provides the same level of convergence testing available from an analytical solution, while not requiring an analytical solution to exist. MMS typically requires the support of computer algebra systems, due to the complexity of evaluating the residual terms by hand. For a thorough treatment of MMS, including a study of the types of errors that MMS can and cannot find, see Ref. [35]. The manufactured analytical solution abstraction library [36] is based on MMS and acts as a repository of solutions generated by this approach.
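As a concrete sketch of the idea (using a hypothetical model problem of our own choosing, not one from the paper), the following Python snippet manufactures the solution $u(x) = \sin(\pi x)$ for the Poisson problem $-u'' = f$ on $(0,1)$, derives the matching source term analytically, and checks that a second-order finite-difference solver reproduces its expected convergence rate:

```python
import numpy as np

def solve_poisson(f, N):
    # Second-order finite differences for -u'' = f on (0,1) with u(0) = u(1) = 0.
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)[1:-1]      # interior grid points
    A = (np.diag(2.0 * np.ones(N - 1))
         - np.diag(np.ones(N - 2), 1)
         - np.diag(np.ones(N - 2), -1)) / h**2
    return x, np.linalg.solve(A, f(x))

u_exact = lambda x: np.sin(np.pi * x)           # manufactured solution
f = lambda x: np.pi**2 * np.sin(np.pi * x)      # analytically derived source term

errs = []
for N in (32, 64):
    x, u = solve_poisson(f, N)
    errs.append(np.max(np.abs(u - u_exact(x))))
rate = np.log2(errs[0] / errs[1])               # observed convergence rate, near 2
```

Halving the mesh size should reduce the error by roughly a factor of four; an observed rate far from 2 would flag a defect in the solver.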

One more verification approach is the method of nearby problems described in Ref. [37]. Here, analytical curves are fit to numerical solutions of an equation of interest. These curves are used with the original equation to derive analytical input terms, similar to the approach of MMS. These input terms are used to define a problem that is “nearby” the original equation but whose discretization error can be calculated precisely due to the analytical input terms. The advantage of this over MMS is that it provides a means of defining physically realistic verification routines.

Regrettably, all of the approaches outlined are formulated for deterministic problems. Stochastic algorithms present several additional difficulties. Most significant is the fact that the error introduced by random inputs to the system no longer guarantees monotonic convergence under refinement. Furthermore, the nonreproducibility of results across different system architectures and random number generators adds complexity to the verification process. For these reasons, there are very few existing frameworks for the robust verification of statistical software. At the same time, due to the inherent randomness in the output of stochastic algorithms, verification is arguably even more important than for deterministic systems. For example, Bayesian model calibration involves computing the density weighting the parameterization of a large ensemble of solutions to an equation. Such complicated results are difficult to assess by inspection.

Very few algorithms currently exist for the verification of stochastic algorithms in general, and the authors are not aware of any designed for use in a Bayesian calibration framework. The objective of this paper is therefore to present a step toward the verification of Bayesian model calibration algorithms. We outline a verification framework based on linear regression, a version of which was first described in Ref. [38]. This framework provides analytically derived formulas for the parameter densities obtained via Bayes' rule that can be compared to numerical results. Although for concreteness we employ a linear model described by a matrix with a specific class of components, the results do not depend on this specific form, so the regression matrix can be designed to approximate some types of nonlinear functions. The framework is based on widely known statistical concepts, is relatively simple to implement, and is computationally efficient, yet it retains flexibility in the choice of error statistics to ensure the calibration algorithm is exercised over a range of possible inputs.

A major motivation for the framework is the verification of Markov chain Monte Carlo (MCMC) routines for Bayesian inference. A brief introduction to MCMC methods, the mathematical theory underlying them, and their use in a variety of scientific fields can be found in Ref. [39]. Rather than directly computing posterior distributions for the calibration parameters, MCMC works by iteratively computing random values in a way that guarantees the limiting distribution of the collective samples will converge to that of the posterior. Properties of the posterior can be inferred from the resulting sample. Several types of MCMC algorithms exist. See Ref. [40] for an overview of the classic Metropolis-Hastings algorithm. Variants of the Metropolis-Hastings algorithm designed to improve sampling efficiency include the delayed rejection adaptive Metropolis (DRAM) algorithm [41], which automatically tunes the way the random values of the Markov chain are chosen, and the DiffeRential Evolution Adaptive Metropolis algorithm [42], which is designed for parallelization. Hybrid Monte Carlo or Hamiltonian Monte Carlo methods are another family of algorithms that employ concepts inspired by Hamiltonian dynamics to improve sampling efficiency [43].

When the posterior is a known parametric distribution, there are standard hypothesis tests which can be used to verify the expected distribution from data up to a given confidence level. Although for many configurations of our framework the posteriors are well-known distributions, some are not standard and must be evaluated numerically. For this reason, we employ a nonparametric hypothesis test for equal distributions based on the energy statistic, developed in Ref. [44]. This method provides a way to verify any random sample (in our case generated by the MCMC algorithm), provided that sample satisfies certain reasonable assumptions and a separate sample drawn from the true distribution is available (in our case drawn from the known posterior distribution using either standard routines or with the inverse cumulative distribution map).

An alternative approach to verifying that the MCMC output matches a known distribution would be to employ hypothesis tests for comparing moments. One might consider using the classical Hotelling *T*^{2} test to compare the means of two sample populations, one drawn from the MCMC algorithm and the other from the known distribution. Comparison tests designed to be highly scalable, like Ref. [45] for testing the equality of means of a large number of regression parameters or Ref. [46] for comparing covariance matrices in high dimensions, could also be considered. These would apply to the subset of the problems defined by our framework which are normal or approximately normal due to the central limit theorem. These approaches might require or benefit from transformations designed to better approximate normality (e.g., taking the logarithm of the precision parameter *λ* defined in the model descriptions below). Another alternative would be to employ kernel density estimation to approximate the distribution of the MCMC output and compare it to the known distribution with some metric (e.g., the norm of the difference). Such an approach would not require normally or approximately normally distributed variables, but it would require a way of quantifying the errors in the density estimate and incorporating them into thresholds for the comparison metric used to assess similarity in distribution.

The rest of the paper is organized as follows: In Sec. 2, we describe the linear model calibration verification framework, including a description of the model, its parameters, and its statistical properties. In Sec. 3, we discuss the numerical implementation of the framework, describing some computationally efficient approaches to calculating the posterior densities and generating samples from the error processes used. Section 4 provides formulas for the exact posterior densities which are the heart of the verification framework. Derivations for the simplest case are included to make the paper somewhat more self-contained. Section 5 describes the energy statistic, the hypothesis test for equal distributions based on it, and details of how to use it with the posterior densities provided by the framework to verify MCMC output. In Sec. 6, we apply our matlab-based reference implementation of the framework to the verification of MCMC output computed using the DRAM matlab package in three separate studies. Section 7 concludes the paper with some additional suggestions of how to apply the framework.

## Linear Statistical Model Calibration Verification Framework

is known and the $N_\beta$ components of the vector $\beta$ are unknown parameters to be estimated. Each $(N_\beta - 1)$-dimensional covariate vector $x_j = [x_{j,1}\ x_{j,2}\ \cdots\ x_{j,N_\beta - 1}]^T$ is a sample from a zero-mean multivariate normal distribution with known $(N_\beta - 1) \times (N_\beta - 1)$ covariance matrix $C$. The observation error $\varepsilon$ is an $N$-dimensional, mean-zero, normally distributed random variable with $N \times N$ covariance matrix $R(\varphi)/\lambda$. The notation $\varepsilon = \varepsilon(\lambda, \varphi)$ signifies the dependence of the distribution of the observation error on the hyperparameters $\lambda$ and $\varphi$. The error precision parameter $\lambda$ is a positive scalar, and the admissible values of the correlation structural parameter $\varphi$ depend on the observation error correlation type discussed in Sec. 2.4.

The choice of covariates in $G$ is an example that admits a well-known formula for calculating a predictive distribution for new inputs (see, e.g., the review of regression in Ref. [50] for a derivation), but it is not a constraint the framework relies on: any correctly sized matrix will do. It is possible to replace the rows of $G$ with regression functions $g(x)$ which map a $p$-dimensional vector of inputs $x \in \mathbb{R}^p$ to an $N_\beta$-dimensional row vector. For example, $g(x)$ could be a polynomial, a sinusoidal function, a spline function, or any other basis function commonly used in numerical methods. This would enable the approximation of more complicated functions, including nonlinear functions, with no change to the remainder of the framework.
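As a minimal sketch of the data-generating side of this setup (our own illustration, assuming the standard linear form $y = G\beta + \varepsilon$, with illustrative sizes and true parameter values, and an intercept column as one possible convention for $G$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_beta = 200, 3                              # illustrative sizes

# Covariate vectors: samples from a zero-mean multivariate normal with a
# known covariance matrix C, collected into the regression matrix G
# (here augmented with an intercept column).
C = np.eye(n_beta - 1)
X = rng.multivariate_normal(np.zeros(n_beta - 1), C, size=N)
G = np.column_stack([np.ones(N), X])

# Observation error with covariance R(phi)/lambda; uncorrelated case, R = I.
beta_true = np.array([1.0, -2.0, 0.5])
lam_true = 4.0
eps = rng.standard_normal(N) / np.sqrt(lam_true)
y = G @ beta_true + eps                         # synthetic calibration data
```

The calibration task is then to recover the distribution of the unknown parameters from `y` and `G`.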

### Calibration Cases.

To generate data to be used for calibration, $\beta$, $\lambda$, and $\varphi$ are set to some true values and $N$ samples of data are generated from the model in Eq. (1). The associated calibration problem is to treat some choice of parameters as unknown and determine the distribution of the parameter values from this generated data using Bayesian inference. Three cases of unknown variables are considered in this framework, as summarized in Table 1. Each of these cases is a very common statistical inference problem covered in standard texts on regression such as Ref. [51], and each is very generally applicable. They are ordered by the difficulty of the inference task, with subsequent cases inferring more properties of the model from the observation data. Case 1 estimates the linear parameters assuming the statistics of the observation error are perfectly characterized. This is the simplest case of linear regression, applicable to situations where the observation or measurement error is very well-characterized both in terms of its structure and its magnitude. Case 2 removes explicit knowledge of the amount of variance in the observation error samples. This is a more complex model where the structure of the observation error is known ahead of time, but its magnitude is inferred from data. For the normally distributed observation error we use, this assumption results in posterior distributions that are no longer Gaussian. Case 3 assumes in addition that a model of the correlation structure in the observation error is known but a parameter characterizing the amount of correlation is unknown.

### Likelihood Function.

which is a consequence of the assumption of normally distributed observation error. For cases where the true value of $\lambda$ is known, $\hat{\lambda} = \lambda$ is used for computing the likelihood, and similarly for $\varphi$.

### Prior Cases.

Two cases of prior distribution for $\beta$ are considered in the verification framework: the noninformative prior case and the Gaussian prior case. A noninformative prior indicates a belief that the unknown parameters are equally likely to lie anywhere in their respective parameter domains. This is a reasonable assumption for models of poorly understood problems where little is known about the expected character of the solution. A Gaussian prior indicates a belief that the true parameters for $\beta$ are likely to be near some expected value in the domain, with the prior density decaying exponentially as a function of the distance from this expected value. This is a reasonable prior for problems where some information is known about the solution ahead of time, for example, from historical data. Note that for calibration cases where $\lambda$ and $\varphi$ are unknown, their prior distributions are treated as noninformative, although, as will be clear from the expressions for the exact posterior, more general choices are possible.

where $\sigma_1, \ldots, \sigma_{N_\beta}$ are positive parameters. Note that when $\lambda$ is unknown and calibrated, the $\lambda$ in Eq. (2) is replaced with the calibrated variable $\hat{\lambda}$ rather than the true (and unknown) value $\lambda$. The noninformative prior for $\beta$ is simply a uniform density $\pi(\beta) \propto 1$. The use of a diagonal $\Sigma_0$ is for convenience, as it requires the selection of fewer parameters; any symmetric positive definite matrix will work just as well.

The noninformative priors for $\lambda$ and $\varphi$ are, respectively, the Jeffreys noninformative prior $\pi(\lambda) \propto 1/\lambda$ and the uniform prior $\pi(\varphi) \propto 1$. When $\hat{\varphi}$ is calibrated, it is restricted to a closed interval that is a proper subset of its domain. The priors for each calibration case are listed in Table 2.

### Correlation Functions.

The covariance matrices $R(\varphi)/\lambda$ for the observation error are determined from a choice of correlation function $R(\varphi)$. The three correlation functions considered, and the corresponding domains of $\varphi$, are as follows:

- (1) no correlation, no $\varphi$ dependence;
- (2) equicorrelation, $0 < \varphi < 1$; and
- (3) AR(1), i.e., order 1 autoregressive correlation, $-1 < \varphi < 1$.

No correlation represents the basic assumption of the simplest form of ordinary least squares, in which the phenomenon is assumed to follow a linear trend with independent, identically distributed errors. AR(1) models are widely used in time series analysis and are common in, e.g., econometrics and climate modeling [52]. Equicorrelation is another common and general model with applications in finance and econometrics; see, e.g., Refs. [53,54].

where $R_1$ is uncorrelated error, $R_2$ is equally correlated error, and $R_3$ is AR(1) correlated error. The uncorrelated case has no correlation between output observations and no dependence on $\varphi$. Equicorrelation makes the correlation between all observations equal, with $\varphi$ determining the amount of correlation. For AR(1) correlation, any two output observations $y_i$, $y_j$ become less correlated the further apart they are in index, i.e., as $|i - j|$ increases. From a numerical perspective, the role of the equicorrelation and AR(1) correlation functions in this framework is to make the inference task more challenging, as values of $\varphi$ nearer to 1 are more poorly conditioned.
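Under the standard parameterizations of these models (equicorrelation with all off-diagonal entries equal to $\varphi$, and AR(1) with entries $\varphi^{|i-j|}$, which we take to be the intended forms), the three correlation matrices can be built as in this Python sketch:

```python
import numpy as np

def correlation_matrix(kind, N, phi=0.0):
    # Observation-error correlation matrices for the three cases above.
    if kind == "none":                          # R1: identity, no phi dependence
        return np.eye(N)
    if kind == "equi":                          # R2: unit diagonal, off-diagonals phi
        return (1.0 - phi) * np.eye(N) + phi * np.ones((N, N))
    if kind == "ar1":                           # R3: entries phi**|i - j|
        idx = np.arange(N)
        return phi ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(f"unknown correlation type: {kind}")
```

Dividing any of these matrices by $\lambda$ gives the corresponding observation error covariance $R(\varphi)/\lambda$.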

## Numerical Implementation

While it is possible to use any correlation function for the observation errors, a major advantage of the correlation functions described in Sec. 2.4 is that they allow the calibration and the exact posteriors to be computed more efficiently. The discussion in Secs. 3.1–3.3 summarizes these benefits.

### Realizations of Observation Error.

To generate a realization of an $N$-dimensional mean-zero normally distributed random variable with covariance matrix $C$, the Cholesky factorization $C = LL^T$ of $C$ is computed, and the resulting Cholesky factor is applied to a vector $x$ of $N$ samples from a standard normal distribution. The vector $y = Lx$ is then the desired realization. For a general covariance matrix $C$, the Cholesky factorization requires $O(N^3)$ arithmetic operations, and storing $C$ in memory requires $O(N^2)$ units of memory. These constraints are prohibitive for large sample sizes $N$. For the covariance matrices corresponding to the correlation functions in Sec. 2.4, both constraints can be reduced to $O(N)$. To simplify notation, the algorithms below do not include the scaling by the precision $\lambda$ (a multiplicative factor of $1/\sqrt{\lambda}$ applied to the output realization, so that its covariance is $R(\varphi)/\lambda$).

$L_2$ by an $N$-vector $x = [x_1\ \cdots\ x_N]$ of samples from a standard normal distribution, giving

Similar recursions can be found for the inverse matrices of $L_2$, $L_3$, which suggests an approach for computing the term $\tilde{y}^T R^{-1}(\varphi) \tilde{y}$ needed in the likelihood. Since this term is computed repeatedly in the calibration routines, there is potential to save a great deal of computational effort. Caution should be exercised in implementing these strategies, however, to avoid loss of numerical precision due to cancellation and similar effects.
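For the AR(1) case, for example, applying the Cholesky factor of $R_3$ reduces to the familiar stationary AR(1) recursion, which needs only $O(N)$ work and storage. A Python sketch of this well-known realization (our own illustration, not the paper's code):

```python
import numpy as np

def sample_ar1_error(N, phi, rng):
    # O(N) realization of a mean-zero normal vector whose correlation matrix
    # has entries phi**|i - j|; equivalent to applying the Cholesky factor of
    # R_3 without ever forming the N x N matrix.
    x = rng.standard_normal(N)                  # iid standard normal draws
    y = np.empty(N)
    y[0] = x[0]
    scale = np.sqrt(1.0 - phi**2)
    for i in range(1, N):
        y[i] = phi * y[i - 1] + scale * x[i]
    return y  # scale by 1/sqrt(lambda) to obtain covariance R_3(phi)/lambda
```

Each entry has unit variance and the lag-$k$ correlation is $\varphi^k$, matching the AR(1) correlation function.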

### Matrix Inversion.

### Determinants.

## Exact Posterior Densities

where $\theta$ is the vector of unknown parameters, $y$ is the observed data, $\pi(\theta|y)$ is the probability density of $\theta$ given $y$, $\pi(y|\theta)$ is the likelihood of $y$ given $\theta$, $\pi_0(\theta)$ is the prior probability density of the parameters, and

$$Z(y) = \int_{D(\theta)} \pi(y|\theta)\,\pi_0(\theta)\,d\theta$$

is an integral taken over $D(\theta)$, the domain of $\theta$. Division by $Z(y)$ ensures the posterior integrates to 1, as required for a probability density. The verification framework outlined in Sec. 2 has three cases, each with different choices of unknown parameters. Throughout the remainder of the paper, the model residual is denoted $\tilde{y}(\hat{\beta}) = y - G\hat{\beta}$. Expressions for the posteriors and likelihoods of each case are

As more calibrated parameters are added, the form of the likelihood becomes more complicated due to the need to explicitly include the functional dependence on all unknown parameters. The verification framework described in Sec. 2 provides exact expressions for the posterior densities of the calibrated parameters. These are presented in Secs. 4.1–4.3. We provide derivations of these expressions for the first case and omit the similar (but more tedious) details for the others. For the sake of notational clarity, the dependence of $R$ on $\varphi$ and on the choice of correlation type is suppressed unless needed; i.e., $R_j(\varphi)$ is written $R$.

### Case 1.

where $Z(y)$ is a normalizing constant which makes the posterior integrate to 1; i.e., dividing it into the Gaussian function results in a Gaussian probability density. Using $\tilde{y}(\hat{\beta}) = y - G\hat{\beta}$, the numerator can be rewritten as

where the integral in $Z(y)$ is taken with respect to $\hat{\beta}$ only. The remaining terms making up the argument of the exponential can be simplified by completing the square to obtain

where $\beta_{\mathrm{mle}}$ is the maximum likelihood estimator of $\beta$. Notice that the term $\beta_{\mathrm{mle}}^T G^T R^{-1} y$ has no dependence on $\hat{\beta}$, so it can be factored out of the exponential like the $y^T R^{-1} y$ term was. Setting $\Sigma_1 = (G^T R^{-1} G)^{-1}$, this implies that

and mean vector $\beta_{\mathrm{mle}}$; i.e., $\pi(\hat{\beta}|y) = N(\beta_{\mathrm{mle}}, \Sigma_1/\lambda)$.
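Numerically, the resulting Case 1 noninformative-prior posterior is cheap to evaluate. A Python sketch (ours, for illustration) computes $\beta_{\mathrm{mle}} = (G^T R^{-1} G)^{-1} G^T R^{-1} y$ and the posterior covariance $\Sigma_1/\lambda$:

```python
import numpy as np

def case1_posterior(G, y, R, lam):
    # Exact Case 1 posterior with a noninformative prior:
    # pi(beta_hat | y) = N(beta_mle, Sigma1 / lambda), Sigma1 = (G^T R^-1 G)^-1.
    Rinv_G = np.linalg.solve(R, G)              # R^-1 G without forming R^-1
    Sigma1 = np.linalg.inv(G.T @ Rinv_G)
    beta_mle = Sigma1 @ (Rinv_G.T @ y)          # generalized least squares estimate
    return beta_mle, Sigma1 / lam
```

With $R = I$, $\beta_{\mathrm{mle}}$ reduces to the ordinary least squares solution, which provides a quick sanity check on an implementation.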

*Gaussian prior***.** The Gaussian prior case with prior mean $\beta_0$ and prior covariance matrix $\Sigma_0/\lambda$ is derived similarly to the noninformative case. Recall that $\tilde{y} = y - G\hat{\beta}$ and $\tilde{\beta} = \hat{\beta} - \beta_0$. The only difference is that instead of being able to ignore the $\hat{\beta}$-independent term for the prior density, the terms from the prior combine with the likelihood as in

### Case 2.

*Noninformative prior***.** The true parameter $\varphi$ is known and the calibrated variables are $\hat{\beta}$, $\hat{\lambda}$. A noninformative prior is specified for $\hat{\beta}$ and $\hat{\lambda}$, namely, the Jeffreys prior $\pi_0(\hat{\beta}, \hat{\lambda}) \propto 1/\hat{\lambda}$.

### Case 3.

Note that $\pi_0(\hat{\varphi}) = A = 1/(\hat{\varphi}_{\max} - \hat{\varphi}_{\min})$ is the prior of $\hat{\varphi}$, where $(\hat{\varphi}_{\min}, \hat{\varphi}_{\max})$ is the domain of $\hat{\varphi}$. These integrals should be evaluated numerically, and the analytical formulas for the determinant of $R(\hat{\varphi})$ should be used for computational efficiency. The denominator of $\pi(\hat{\varphi}|y)$ goes to zero as a monomial term with exponent $N$, the number of observations used for calibration, so care must be taken in the evaluation.

## Energy Statistic Test for Equal Distributions

Section 4 detailed exact expressions which can be used to compute the posterior probability densities of the calibrated parameters for each case of the regression framework. We are interested in applying these expressions to the verification of output from MCMC algorithms for Bayesian inversion. These methods do not directly compute the posterior density and hence cannot be directly compared with the expressions described earlier. While density estimation methods can be used to approximate a posterior density from the output sample for comparison with the true posterior, it is difficult to use this approach for anything other than qualitative verification due to uncertainties introduced by the density estimation algorithms. Our proposed method for obtaining quantitative results which are better suited for rigorous verification is to apply a statistical test for comparing the distribution of two random samples. This test is based on the energy distance, a class of statistics described in Ref. [55]. A test for distribution equality is developed in Ref. [44], which we outline here and employ in our framework. The test provides a rigorous way of evaluating whether the differences between a sample drawn from the software being verified and a sample drawn from the true posterior are statistically significant, up to a given level of confidence in the test results. The confidence metrics involved in this approach are readily interpretable statistically, making them well-suited as a means of verifying software whose outputs are inherently random.

### Hypothesis Test for Distribution Equality.

where the norm $\Vert \cdot \Vert$ is the Euclidean norm on $\mathbb{R}^d$, the indices $i$, $j$ range between 1 and $n_X$, and the indices $l$, $m$ range between 1 and $n_Y$. Note that the $\Sigma$'s in the equation are double summations over each of the indices. The name of the statistic is motivated by its analogy to physical formulas for energy or distance [55].

Let $M_0 = [X_1, \ldots, X_{n_X}, Y_1, \ldots, Y_{n_Y}]$ denote the $d \times (n_X + n_Y)$ matrix of samples, where the first $n_X$ columns designate the first sample and the last $n_Y$ columns the second sample. Roughly speaking, $E_{n_X, n_Y}$ is large when $\{X_i\}$ and $\{Y_l\}$ are drawn from different distributions and small when they are not. For the equal distribution case, one would expect $E_{n_X, n_Y}$ to be approximately the same value if computed after the columns of $M_0$ are randomly permuted and the resulting first $n_X$ columns are taken as the first sample and the last $n_Y$ columns as the second sample.

If $\{X_i\}$ and $\{Y_l\}$ are drawn from different distributions, then the same permutation would result in sets that are more similarly distributed, so that the resulting $E_{n_X, n_Y}$ would be smaller than that obtained from the original two data sets. The test for distribution equality based on the energy statistic is a formalization of this intuitive idea. It is a permutation test with the following steps:

- (1) Choose a significance level $\alpha$ and initialize the sample counter $b$ to 1.
- (2) Apply a random permutation to the columns of $M_0$ and take the first $n_X$ and last $n_Y$ columns of the resulting matrix to be the first and second samples, respectively.
- (3) Evaluate the energy test statistic for this permutation and label it $E_b$.
- (4) Increment $b$ and repeat the previous two steps until a set $E_1, \ldots, E_B$ of $B$ energy test statistic values is obtained.
- (5) Compute $E$ on the original, unpermuted sets $\{X_i\}$ and $\{Y_l\}$.
- (6) Compute the empirical $p$-value ($p$) as the fraction of $E_b$ values at least as large as $E$. If $p < \alpha$, the data $\{X_i\}$ and $\{Y_l\}$ are statistically unlikely to have been drawn from the same distribution.

This test is shown to be statistically consistent in Ref. [44]. The paper also reports several empirical results which suggest this test may perform better for small sample sizes than some alternative methods for distribution comparison. These properties make it a good choice as an approach for the verification tasks we are interested in.
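The statistic and the permutation test above can be sketched in a few lines of Python (our own illustrative implementation; samples are stored as rows rather than columns for convenience, and the $n_X n_Y/(n_X + n_Y)$ scale factor follows a common convention and does not affect the permutation $p$-value):

```python
import numpy as np

def energy_statistic(X, Y):
    # Two-sample energy statistic for samples stored as rows of X (n_X x d)
    # and Y (n_Y x d): scaled combination of mean pairwise Euclidean distances.
    nX, nY = len(X), len(Y)
    dXY = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2).mean()
    dXX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2).mean()
    dYY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2).mean()
    return nX * nY / (nX + nY) * (2.0 * dXY - dXX - dYY)

def energy_test(X, Y, B=199, rng=None):
    # Permutation test, steps (1)-(6): permute the pooled sample B times and
    # report the fraction of permuted statistics at least as large as the
    # statistic of the original split (the empirical p-value).
    if rng is None:
        rng = np.random.default_rng()
    nX = len(X)
    M0 = np.vstack([X, Y])                      # pooled samples
    E_obs = energy_statistic(X, Y)
    Eb = np.empty(B)
    for b in range(B):
        perm = rng.permutation(len(M0))
        Eb[b] = energy_statistic(M0[perm[:nX]], M0[perm[nX:]])
    return E_obs, float(np.mean(Eb >= E_obs))
```

An empirical $p$-value below the chosen $\alpha$ flags a statistically significant difference between the two samples' distributions.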

### Random Sampling From Posterior Density Approximation.

The hypothesis test outlined in Sec. 5.1 can be used for the verification of MCMC output by comparing a sample drawn from the MCMC algorithm with a sample drawn from the exact posterior. This requires a way of drawing samples from the posteriors. Conditioned on knowledge of $\varphi$, the parameters $\hat{\beta}$ and $\hat{\lambda}$ have normal and Gamma posterior distributions, respectively. These are well-known parametric distributions which can be sampled with standard routines. When $\hat{\varphi}$ is calibrated, its posterior does not belong to a widely used class of parametric distributions. Instead, its density is expressed as an integral which is approximated numerically. To use the energy statistic test above for the third calibration case, it is necessary to draw a random sample from this approximation.

Our approach is to use the well-known property that if $u$ is distributed as a uniform variable on $[0,1]$ and $\varphi$ is a random variable with cumulative distribution function $F(\varphi)$, then the transformed variable $F^{-1}(u)$ has the same distribution as $\varphi$. For a small sample, one can simply draw a uniform sample $u_1, \ldots, u_n$ and solve the equation $F(\varphi_i) - u_i = 0$ for each $\varphi_i$ using a numerical routine for solving nonlinear equations. The resulting sample $\{\varphi_i\}$ will have the desired distribution. If a large sample is desired, it may be more efficient to evaluate a function to directly approximate $F^{-1}(u)$ using interpolation, in order to avoid repeated solution of nonlinear equations. We use this approach in our reference implementation. The equation $F(\varphi_i) - u_i = 0$ is still solved for $\varphi_i$, but the $u_i$ are an evenly spaced grid on $[0,1]$ rather than draws from a uniform distribution. The resulting solutions are used as interpolation nodes to directly approximate $F^{-1}(u)$ for arbitrary $u$. The specific algorithm used in our code involves two steps:

- (1)
Divide the interval $[0,1]$ into

*n*uniformly spaced subintervals with grid-points $F0=0,F1=1/n,\u2026,Fn\u22121=1\u22121/n,Fn=1$, and obtain values $\varphi 1,\u2026,\varphi n\u22121$ by solving the equation $F(\varphi i)\u2212Fi=0$ for $i=1,\u2026,n\u22121$ on the interior points of the grid. - (2)
Interpolate a uniformly distributed random sample using

*F*as grid points and $\varphi i$ as function values._{i}

The final result uses linear interpolation over the resulting function-value pairs, where again, the interpolation grid values are *F _{i}* and the function values are $\varphi i$. Given a sample

*u*from the uniform distribution on $[0,1]$, determine the index

*i*for which $u\u2208[Fi\u22121,Fi)$. Approximate the associated sample $\varphi $ as follows: $\varphi \u2248\varphi i\u22121+n(u\u2212Fi\u22121)(\varphi i\u2212\varphi i\u22121)$. This results in better approximation to the desired random sample and more uniform decrease in its error as

*n*is increased.
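The two-step procedure above can be sketched in Python as follows, assuming a continuous, strictly increasing CDF that can be bracketed on an interval $[lo, hi]$. The simple bisection solver stands in for whatever nonlinear solver an implementation actually uses, and all names are ours.

```python
import numpy as np

def build_inverse_cdf(F, lo, hi, n=200, tol=1e-10):
    """Tabulate the inverse of a CDF F at the uniform grid F_i = i/n,
    i = 1..n-1, by solving F(phi_i) - F_i = 0 with bisection, for later
    linear interpolation. Assumes F is continuous and strictly
    increasing on [lo, hi] with F(lo) near 0 and F(hi) near 1."""
    grid_F = np.arange(1, n) / n
    grid_phi = np.empty(n - 1)
    for i, target in enumerate(grid_F):
        a, b = lo, hi
        while b - a > tol:          # bisection on F(phi) - F_i = 0
            mid = 0.5 * (a + b)
            if F(mid) < target:
                a = mid
            else:
                b = mid
        grid_phi[i] = 0.5 * (a + b)
    # Close the table with the endpoints F_0 = 0 and F_n = 1.
    xs = np.concatenate([[0.0], grid_F, [1.0]])
    ys = np.concatenate([[lo], grid_phi, [hi]])
    def inverse(u):
        # Piecewise-linear interpolation of F^{-1} at arbitrary u in [0, 1].
        return np.interp(u, xs, ys)
    return inverse

def sample_from_cdf(F, lo, hi, size, n=200, seed=None):
    """Draw `size` samples via the inverse-CDF transform u -> F^{-1}(u)."""
    rng = np.random.default_rng(seed)
    return build_inverse_cdf(F, lo, hi, n)(rng.uniform(size=size))
```

Note that for distributions with unbounded support the endpoints $lo$ and $hi$ truncate the extreme tails, so $n$ and the bracketing interval must be chosen with the tail accuracy requirements in mind.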

### Differences From Verification of Deterministic Output.

There is a distinction in the meaning of “verification” using the proposed approach as compared with its use for deterministic problems. Algorithms for deterministic problems will produce the same results for the same inputs no matter how many times they are run. Approaches like the method of manufactured solutions provide a means of verifying a software routine produces correct output, up to some level of error tolerance. A single test result can determine whether the output satisfies that error tolerance constraint.

The situation is somewhat different for the proposed approach. The result of a single energy statistic test depends on the particular random realization of the output. Whether a test fails or not can change from run-to-run due to random variations. When a test does not fail, it does not imply the software has no errors or produces the correct output, up to some tolerance. Rather it implies there is no “statistically significant” difference between the computed output and the expected output, where statistically significant is defined by the user through the *α* parameter. Whereas the tolerance in deterministic verification quantifies an acceptable level of error in the output, tolerance with the proposed method for stochastic verification quantifies an acceptable level of risk that the test result produces a false positive by chance. This distinction should be kept in mind when interpreting results and incorporating such tests into procedures for ensuring software correctness.

## Numerical Studies

We developed a reference matlab implementation of the framework described in this paper called LinVer, which is available on GitHub [56]. This code was developed as a reference for those interested in implementing the framework in their own software and was used to produce the results in this section. It makes use of routines in the matlab Statistics Toolbox for convenience. The tests in this section require the DRAM package for matlab [41], but the core functionality of the framework implementation is independent of this package.

The remainder of this section reports the results of three numerical studies designed to test the framework and examine some of its capabilities and limitations. The first study introduces a single error into the likelihood function used by the MCMC algorithm and compares the approximate probability of the energy test reporting an error from the resulting output to that from an error-free implementation. The second study varies the size of an error in the likelihood function used by the MCMC algorithm as well as the sample size used by the energy test and reports the ratio of reported failures in a fixed number of repeated tests. While the first two studies involve errors in the likelihood function used by the MCMC algorithm, the final study introduces two separate errors directly into the MCMC implementation itself and applies the energy test a fixed number of times to samples drawn from the outputs of each of these erroneous implementations.

### Multiple Configurations.

Several examples were run using MCMC output generated for a range of parameters to exercise various configurations of the framework. The MCMC output was generated with code employing the DRAM matlab package [41]. The code was written so that all calibrated parameters were sampled from a Gaussian proposal distribution in DRAM. Output was generated from two sets of code. The first (“Good”) was implemented correctly, while the second (“Bad”) used an incorrectly implemented Gaussian log-likelihood which left out the factor of 1/2; i.e., $-\lambda \tilde{y}^T R^{-1} \tilde{y}$ was used rather than the correct $-(1/2)\lambda \tilde{y}^T R^{-1} \tilde{y}$.

### Simulation Parameters.

For all tests, the MCMC output generated a chain of 100,000 iterates using both the delayed rejection and proposal adaptation features of DRAM. The sample to be used in the energy statistic tests was gathered from the chain by choosing the 20,001st iterate, then the 20,501st iterate, and subsequent iterates in increments of 500. This resulted in a sample of 160 values. This was done partly to reduce the computational effort required by the unoptimized implementation of the energy statistic test in our reference code, which grows with the size of the sample. It also helped satisfy the assumption that the samples used in the energy test comprise independent realizations from their respective distributions, since neighboring iterates in an MCMC chain are correlated. We note that Sec. 7.4 of Ref. [55] mentions that the independence assumption is stronger than necessary, so this consideration may not be critical for some applications. Starting from the 20,001st iterate allows time for “burn-in” of the chain.
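The thinning schedule just described amounts to simple index arithmetic, sketched below (variable names are ours):

```python
# Thinning schedule described above: from a 100,000-iterate chain, take
# the 20,001st iterate and every 500th iterate thereafter.
chain_length = 100_000
burn_in = 20_000   # iterates 1 through 20,000 are discarded
stride = 500

# 0-based indices into the stored chain; index 20,000 is the 20,001st iterate.
indices = list(range(burn_in, chain_length, stride))
print(len(indices))  # 160, matching the sample size quoted above
```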

All energy statistic tests were performed with significance level $\alpha =0.01$. This means that for output from a correct implementation, we would expect no more than 1% of the tests to fail. Each sample drawn from the MCMC code was computed once and tested 500 times against a sample drawn from the exact posterior (using the method described in Sec. 5.2 whenever $\hat{\varphi}$ was calibrated). For each of the 500 tests, the same MCMC sample was used and a new sample from the exact posterior was drawn. A ratio of failed tests was computed from these. Using the binomial distribution, a *p*-value was computed for these ratios, calculating the probability that 500 random tests with a 1% probability of failing would result in a ratio higher than what was observed. We expect the *p*-value to be very small when the experiment is run on the incorrectly implemented code and relatively large otherwise.
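The *p*-value computation described here is a binomial tail probability, which can be sketched as follows (a Python illustration; the helper name is ours):

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability that n repeated
    tests, each failing independently with probability p, produce k or
    more failures."""
    return sum(comb(n, j) * (p ** j) * ((1 - p) ** (n - j))
               for j in range(k, n + 1))

# With alpha = 0.01 and 500 repetitions, about five failures are expected
# from correct code, so observing a handful more is not unusual.
p_val = binom_sf(7, 500, 0.01)
```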

Sampling from the exact joint posterior distribution of $(\hat{\beta}, \hat{\lambda}, \hat{\varphi})$ for case 3 is accomplished as follows:

- (1)
Sample $\hat{\varphi}$ from the relevant marginal posterior distribution given in Sec. 4.3.

- (2)
Sample $\hat{\lambda}$ from the relevant conditional posterior distribution given in Sec. 4.2, where $\varphi$ is set equal to the $\hat{\varphi}$ sampled in the first step.

- (3)
Sample $\hat{\beta}$ from the relevant conditional posterior distribution given in Sec. 4.1, where $\varphi$ is set equal to the $\hat{\varphi}$ sampled in the first step and $\lambda$ is set equal to the $\hat{\lambda}$ sampled in the second step.

- (4)
Repeat the first three steps to obtain a set of samples from the desired exact joint posterior distribution.

For case 2, only the second and third steps are executed with $\varphi $ set to its fixed value. Similarly, for case 1, the third step is executed with $\varphi $ and *λ* set to their fixed values.
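The four-step procedure can be organized as a simple loop. The Python sketch below uses callables standing in for the samplers of Secs. 4.1–4.3, whose exact forms depend on the prior and error-correlation configuration; it illustrates only the structure, not the specific distributions.

```python
import numpy as np

def sample_joint_posterior(n_samples, sample_phi, sample_lambda_given_phi,
                           sample_beta_given_phi_lambda, seed=None):
    """Sketch of the case-3 sampling loop above. The three callables
    stand in for samplers of the marginal posterior of phi (Sec. 4.3)
    and the conditional posteriors of lambda (Sec. 4.2) and beta
    (Sec. 4.1)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_samples):
        phi = sample_phi(rng)                               # step (1)
        lam = sample_lambda_given_phi(phi, rng)             # step (2)
        beta = sample_beta_given_phi_lambda(phi, lam, rng)  # step (3)
        draws.append((beta, lam, phi))
    return draws                                            # step (4)
```

For case 2, `sample_phi` would simply return the fixed $\varphi$; for case 1, both $\varphi$ and $\lambda$ are fixed and only the final sampler is exercised.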

One test was performed calibrating only $\hat{\beta}$ using a noninformative prior and no correlation in the observation error. Tests calibrating $\hat{\beta}$ alone and $\hat{\beta},\hat{\lambda}$ for both noninformative and Gaussian priors were performed with equicorrelated observation error. Finally, calibration of $\hat{\beta}, \hat{\lambda},$ and $\hat{\varphi}$ was performed with AR(1) observation error for both noninformative and Gaussian priors.

The results of the energy statistic tests for each examined configuration are reported in Table 3. The results are as expected, with the ratio of failures for the correct code generally falling below the $\alpha =0.01$ significance level and the ratio of failures for the incorrect code lying well above it. Of the correct-code tests, only one resulted in a ratio greater than 0.01. Even though the failure ratio for this case fell above the significance threshold, the *p*-value for the result is 0.2371, meaning the probability of observing a run with a higher failure ratio is about 23%. This is a plausible figure for correct code, in contrast to the *p*-values for all of the tests involving the incorrect code, which were zero to at least four decimal places. A second run of this particular test case resulted in a failure ratio of 0.004, which supports the conclusion that the first result was simply a chance occurrence.

In addition to the table of energy test results, density estimates were computed from the parameter chains using the matlab command ksdensity. All points of the chain past the initial burn-in iterate (i.e., past the 20,001st chain value) were used. These densities were compared to those computed using the expressions for the exact posteriors. The results of calibrating $\hat{\beta}, \hat{\lambda},$ and $\hat{\varphi}$ with the noninformative prior are shown in Figs. 1 and 2. The same calibration case with Gaussian prior is shown in Figs. 3 and 4. These plots demonstrate the close agreement between the correct MCMC code and the exact posterior, contrasted with the results from the incorrect MCMC code.

### Error Sensitivity Study.

The proposed verification framework can be viewed as two fairly independent subsystems. The first subsystem is the family of linear regression problems that can be configured to provide a diverse collection of baseline challenges to the MCMC routine under test that all have analytical or semi-analytical exact solutions to compare the results against. The second subsystem is the energy statistic test that assesses the statistical significance of any difference in distribution between a sample drawn from the MCMC code under test and a sample drawn from the known analytical solution the MCMC code should draw from. The previous numerical experiment tested for proper behavior of the first subsystem by checking that deliberately introduced errors were likely to be distinguished from those of a correct implementation as the configuration was varied over a subset of the available problem space. The experiment in this section examines the second subsystem by fixing the problem configuration and varying the size of the error introduced into the implementation and the number of samples used in the energy statistic test.

Whether or not a given verification methodology reports an issue for a particular error is a function of how sensitive the output of the algorithm under test is to the error as well as how sensitive the error-detecting metric used is to the output changes resulting from the error. This experiment examines the latter for a family of output errors obtained by scaling the log-likelihood supplied to the MCMC algorithm. This is a generalization of the implementation error used in the previous experiment, as leaving off a factor of 1/2 is equivalent to scaling by 2. This is applied to the case 1 problem of inferring *β* with other parameters known, making the scales interpretable as errors in the known observation error variance used in the problem. A noninformative prior is assumed and the observation error is uncorrelated. The remaining calibration parameters used are the same as in the previous experiment.

A collection of samples of size 20, 40, 60, 80, and 100 is drawn from each of the erroneous implementations and tested against a sample of the same size drawn from the true distribution using the energy statistic test. The significance level of the energy statistic test is set to $\alpha =0.05$ and 200 tests are performed for each combination of sample size and error effect size. Unlike the previous experiment, each replication of the energy statistic test resamples from the MCMC code as well as the true posterior. Table 4 demonstrates how the reliability of error detection generally increases with sample size and error effect size. The scales used were 1.2, 1.3225, 1.440, 1.5625, 1.69, 1.8225, 2.0, and 3.0. The table also reports the KL-divergence between the posterior obtained under each scaling and the true posterior.
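As a concrete illustration of how such scales map to divergences, consider the one-dimensional Gaussian case under a flat prior: scaling the log-likelihood by $c$ multiplies the posterior precision by $c$, and the KL-divergence of the perturbed posterior from the true one then has a simple closed form. This sketch shows one plausible convention; the paper's table may use the opposite direction or a multivariate expression.

```python
from math import log

def kl_scaled_gaussian(c):
    """KL( N(mu, sigma^2 / c) || N(mu, sigma^2) ): the divergence of a
    one-dimensional Gaussian posterior whose log-likelihood was scaled
    by c from the unscaled posterior. Under a flat prior the value is
    independent of mu and sigma."""
    return 0.5 * (log(c) + 1.0 / c - 1.0)

# The missing-1/2 likelihood error corresponds to c = 2:
kl_missing_half = kl_scaled_gaussian(2.0)
```

Note the divergence vanishes at $c=1$ and grows monotonically with the error size on either side, which is why it serves as a natural effect-size measure for Table 4.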

It is noted that this error model is a convenient way of continuously varying the error effect size and we do not claim that it directly corresponds to a realistic set of code errors. This experiment can be considered an examination of the statistical power of our proposed methodology for a set of error effects which are contrived but easily interpretable. This provides some intuition for the relationship between the severity of a hypothetical error and the quantity of sample data needed to reliably detect it.

### Direct Algorithm Errors.

Briefly, the subscript $t$ indexes the iteration of the MCMC chain, $X_t$ is the vector representing the $d$-dimensional chain value at iterate $t$, $\bar{X}_t$ is the mean of the $X_t$ at iterate $t$, $C_t$ is the empirical covariance of the chain at iterate $t$ taken over its history, and $s_d$ is a scaling parameter. The error changes $t-1$ and $t+1$ to $t$, which is plausible as a transcription error when translating the formula to code.
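The covariance recursion in question, as used in adaptive-Metropolis-style proposal adaptation, can be sketched as follows. This Python illustration omits the scaling $s_d$ and the usual regularization term $\epsilon I_d$ for clarity, and reproduces the correct $t-1$ and $t+1$ coefficients whose corruption to $t$ constitutes the second error.

```python
import numpy as np

def recursive_cov(X):
    """Recursively update the sample covariance of the rows of X:

      C_{t+1} = ((t-1)/t) C_t
                + (1/t) ( t m_t m_t^T - (t+1) m_{t+1} m_{t+1}^T
                          + x_{t+1} x_{t+1}^T )

    where m_t is the mean of the first t rows. The transcription error
    discussed in the text replaces the t-1 and t+1 coefficients with t."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = X[0].copy()           # running mean of the first point
    C = np.zeros((d, d))      # covariance of one point is degenerate
    for t in range(1, n):     # incorporate row t (0-based), i.e. x_{t+1}
        x = X[t]
        m_new = m + (x - m) / (t + 1)
        C = ((t - 1) / t) * C + (1.0 / t) * (
            t * np.outer(m, m)
            - (t + 1) * np.outer(m_new, m_new)
            + np.outer(x, x)
        )
        m = m_new
    return C
```

One can check the recursion against a batch computation such as `np.cov`; the corrupted-coefficient variant instead drifts from the batch value at every step, though in the adaptive setting it still converges to the same limit.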

The results in Table 5 show the energy statistic test fails reliably for the first error but is fairly insensitive to the second. This observation is consistent with the fact that the first error leads to an incorrect MCMC algorithm, while the second error leads to a slightly different adapting proposal covariance matrix at each iteration but with ultimate convergence to the same matrix as obtained from the correct recursion. These errors are similar in the sense that each of them could arise from a subtle textual error in the algorithm source code. While this insensitivity is certainly a limitation of the approach, it is not unique to it. Furthermore, if the change in output due to an error is too small to be statistically significant, and the problem configuration used for verification is close enough to that of a target application, then the presence of that error may not matter in practice.

We note that while this experiment indicates some potential of the proposed framework as a verification tool, this limited study cannot determine the types of errors that may be reliably detected with it. One approach to understanding the capabilities of the framework in practice would be a large-scale study of implementation errors, similar to what was done in Ref. [35] for the method of manufactured solutions. Unfortunately, as the size of that report shows, such a study is a substantial effort beyond the scope of our current investigation, though it is an interesting possibility for future research.

## Conclusion

We have presented a linear-regression-based framework for the verification of Bayesian model calibration routines. The framework provides a flexible model problem with formulas for the exact densities which can be compared with the results of the calibration code applied to the model problem. As highlighted earlier, the choice of framework has several advantages from the point-of-view of computational efficiency. We demonstrated the use of the framework by comparing an MCMC-based Bayesian parameter estimation code we developed using the DRAM matlab package with the results computed from the exact formulas and found good agreement. The energy statistic test was shown to distinguish between a sample drawn from code with a deliberate mistake added to its computation of the posterior and a sample drawn from correctly implemented code. Although our test examples focused on DRAM, the framework is applicable to other MCMC methods such as DiffeRential Evolution Adaptive Metropolis and Hybrid Monte Carlo. We also note that while the emphasis of the paper was verification of MCMC routines for Bayesian calibration, the framework itself is also applicable to other approaches.

The framework was useful even while developing the code for these tests. Early results produced a significant ratio of failed tests for the “correct” code whenever variables other than $\hat{\beta}$ were calibrated (although the ratio was still noticeably lower than that of the deliberately incorrect code). This turned out to be due to a bug in the function for calculating the prior, where a factor of $1/\hat{\lambda}$ was left out. Correcting this yielded the results reported here.

Another observation from testing is that the test results can depend on the configuration of the MCMC algorithm. For instance, earlier testing used a shorter MCMC chain, which mostly worked well but resulted in a statistically significant number of failures for the more difficult inference cases involving $\hat{\varphi}$ (although again, the ratio was much smaller than for the deliberately incorrect code). Using a longer chain to give the routine more time to reach convergence resulted in consistently reasonable failure ratios. This suggests the framework may be useful not only for detecting errors in software, but also for determining MCMC configurations which are suitable for a given inference problem. For example, if an MCMC algorithm were to be used with a nonlinear model, one might solve for the maximum likelihood estimate, linearize the problem, and then use the framework to verify that the particular MCMC configuration used converges. This would provide at least some level of confidence in obtaining meaningful results.

While the framework is limited to verifying the correct calibration of a set of linear problems, this is an important and general class of models whose verification provides some confidence that the algorithms perform as expected. Its ease of implementation makes it a good choice when limited resources are available for verification development. The models employed are flexible enough that they can be adapted to many application scenarios typical in uncertainty quantification, e.g., approximating nonlinear input–output relationships via the linear model, even though this may be somewhat inefficient and require some application-specific tuning. Furthermore, many components of the framework we have outlined are independent of each other and can be easily substituted with alternatives. For example, sampling from the known analytical solution could be replaced with sampling from an assumed-correct reference implementation. We reiterate that our results demonstrated code errors that have a reasonable chance of going unobserved by the framework. Practical application of our approach to provide high confidence in an implementation will require additional investigation into the scope of errors that the framework is likely to find. Additionally, we acknowledge that our approach will not scale well for the non-Gaussian cases of our framework or for other non-Gaussian cases in general, due to the need to sample from a true posterior which can be determined analytically or numerically. Hence verification of samplers for high-dimensional, strongly non-Gaussian posteriors is not feasible in our current framework and remains a topic of future research.

## Acknowledgment

This research was supported by the Consortium for Advanced Simulation of Light Water Reactors, an Energy Innovation Hub for Modeling and Simulation of Nuclear Reactors, under U.S. Department of Energy Contract No. DE-AC05-00OR22725.

## Funding Data

U.S. Department of Energy (Grant No. DE-AC05-00OR22725).