Hierarchical Bayesian models (HBMs) have been increasingly used for various engineering applications. We classify two types of HBM found in the literature: the hierarchical prior model (HPM) and the hierarchical stochastic model (HSM). We then focus on studying the theoretical implications of the HSM. Using examples of polynomial functions, we show that the HSM is capable of separating different types of uncertainties in a system and quantifying the uncertainty of reduced order models under the Bayesian model class selection framework. To tackle the huge computational cost of analyzing the HSM, we propose an efficient approximation scheme based on importance sampling (IS) and the empirical interpolation method (EIM). We illustrate our method using two engineering examples: a molecular dynamics simulation for Krypton and a pharmacokinetic/pharmacodynamic (PKPD) model for a cancer drug.

## Introduction

The hierarchical Bayesian model (HBM) is a powerful modeling tool extended from the classical Bayesian framework. It has been used to solve many difficult practical problems, such as modeling heterogeneous data, calibrating systems with multiple objectives, and inducing sparsity in a model. Its applications span various fields, such as social science, economics, physics, medical science, and civil engineering [1–4].

In a classical Bayesian setting, users define a stochastic model with parameter $\theta$ for a forward problem that predicts a quantity of interest, together with a prior distribution of $\theta$. When data are available, Bayes' theorem is used to solve the inverse problem by finding the posterior distribution of $\theta$. For a complex system in reality, we further parameterize the stochastic model and the prior with hyperparameters. This extra level of parameters gives a model extra flexibility, but generally requires more data to reach a well-posed problem. HBM refers to model classes with such a hierarchy (multiple levels) of parameters. Here, we define two types of HBM commonly found in the literature: a hierarchical prior model (HPM), which further parameterizes the prior, and a hierarchical stochastic model (HSM), which further parameterizes the stochastic model (also known as the likelihood function when evaluated at the given data).
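As a concrete illustration of this classical (nonhierarchical) setting, the following sketch infers $\theta$ for a toy linear model $y = \theta x + \epsilon$ on a one-dimensional grid; the model, data, and grid choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy classical Bayesian inference: stochastic model y = theta*x + eps,
# eps ~ N(0, sigma_y^2), uniform prior on theta, posterior on a 1-D grid.
rng = np.random.default_rng(0)
theta_true, sigma_y = 1.0, 0.2
x = rng.uniform(0.0, 1.0, size=50)
y = theta_true * x + rng.normal(0.0, sigma_y, size=50)

theta_grid = np.linspace(-1.0, 3.0, 2001)            # uniform prior support
loglik = np.array([-0.5 * np.sum((y - t * x) ** 2) / sigma_y**2
                   for t in theta_grid])             # Gaussian log-likelihood
post = np.exp(loglik - loglik.max())                 # unnormalized posterior
post /= np.trapz(post, theta_grid)                   # normalize on the grid
theta_mean = np.trapz(theta_grid * post, theta_grid) # posterior mean of theta
```

With more hierarchy levels, the same grid idea becomes impractical, which motivates the sampling-based schemes discussed later in the paper.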

The hierarchical prior model is a well-studied subject in the machine learning and compressive sensing communities in the context of sparse Bayesian learning (SBL). Its applications include imaging [5,6], gene selection [7], and Bayesian compressive sensing [8,9]. It is an effective tool for solving ill-posed regression problems by adding extra constraints to the prior distribution under the Bayesian framework. The well-known Bayesian optimization method can also be interpreted using this model structure. On the other hand, the origin of the HSM can be traced back to the multilevel model in the statistics community. One classic application is the analysis of test results collected from multiple schools [10]. Tiao and Tan [11] and Hill [12] are early publications on Bayesian analysis for multilevel regression, and Congdon [13] summarizes recent developments on this topic. Since most of the computational demand in Bayesian analysis comes from evaluating the stochastic model, the HSM has a significantly larger computational cost than the HPM. Although the number of HSM applications has been increasing recently, the diversity of HSM uses is not yet comparable to that of the HPM. However, recent developments in parallel computing open up new opportunities to apply the HSM to more complicated problems. In this paper, we discuss important theoretical implications of the HSM and demonstrate an efficient approximation scheme with practical applications.

Despite the frequent use of the terminology “HBM” in the literature, the distinction between what we call the HSM and the HPM is often omitted. This overlapping usage of the term HBM has even caused confusion. For example, Guha et al. [14] studied the HPM but cited Ref. [15], in which the type of HBM discussed is actually the HSM. Hence, in this paper, we begin with a comparison between the HPM and the HSM in Sec. 2. Then, we turn our focus back to the HSM and study its theoretical implications, illustrated by a simple polynomial regression, in Sec. 3. To tackle the high computational cost of analyzing the HSM in practice, we propose an efficient approximation based on the ideas of importance sampling (IS) and the empirical interpolation method (EIM) [16] in Sec. 4. Section 5 includes two realistic examples using our approximation methods. Finally, we give some concluding remarks in Sec. 6.

## Classification of Hierarchical Bayesian Model: Hierarchical Prior Model Versus Hierarchical Stochastic Model

Given a quantity of interest *y*, one may have a stochastic forward model $F(x, \theta, \epsilon)$ with input *x*, model parameters $\theta$, and stochasticity part $\epsilon$. This function defines the distribution $p(y|x,\theta)$. One of the most common examples of $F(\cdot)$ is the additive error model

$$y = f(x, \theta) + \epsilon$$

where $\epsilon$ is chosen to follow a zero-mean Gaussian distribution with standard deviation $\sigma_y$, $N(0, \sigma_y^2)$, for computational convenience. Given a set of input and output data pairs $D = \{(\hat{x}_i, \hat{y}_i)\,|\,i = 1, \ldots, N_D\}$, the posterior distribution of $\theta$ is inferred by Bayes' theorem:

$$p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)}$$

After inference, the posterior robust prediction for new data $(x_0, y_0)$ can be obtained by

$$p(y_0|x_0, D) = \int p(y_0|x_0, \theta)\,p(\theta|D)\,d\theta$$

The hierarchical Bayesian model adds an extra level of hyperparameters $\psi$ to this classical Bayesian model. This new hierarchy can be added either to the prior (HPM) or to the stochastic model/likelihood (HSM), resulting in completely different model classes suitable for different problems. The major distinction between the two models comes from the different structures of information dependencies between the stochastic variables. The graph representations shown in Fig. 1 are very useful for displaying and studying such relations [19]. The arrows denote the directions of the forward models, which implies that Bayes' theorem is needed for inference in the opposite direction. Any node without an incoming arrow requires a prior distribution.

If we integrate out $\psi$ in the analysis, we recover the classical Bayesian setting from the HPM by choosing $p(\theta) = \int p(\theta|\psi)\,p(\psi)\,d\psi$, because the HPM does not affect the stochastic model part. Hence, the hyperparameters in an HPM behave as latent variables. In most applications of the HPM, we optimize $\psi$ by maximizing $p(D|\psi)$, instead of treating $\psi$ robustly. For example, in the context of SBL, zero-mean Gaussian priors with different variances are chosen for the coefficients in a linear regression problem. This prior is called the automatic relevance determination prior, which is proven to induce sparsity [20,21]. A commonly used algorithm in SBL, called the relevance vector machine, is based on this optimization setup [22,23]. One can interpret this procedure as a Bayesian model selection problem in a continuous space.
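To make the evidence-maximization idea concrete, here is a minimal sketch in the spirit of SBL: each weight of a linear model gets a zero-mean Gaussian prior with its own precision $\alpha_j$, and $\alpha$ is tuned by approximately maximizing $p(D|\alpha)$ with MacKay-style fixed-point updates. The design matrix, noise level, and clipping are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

# HPM-style hyperparameter optimization (type-II maximum likelihood):
# weight w_j has prior N(0, 1/alpha_j); alpha is tuned by (approximately)
# maximizing the evidence p(D | alpha), which prunes irrelevant weights.
rng = np.random.default_rng(1)
N, M = 100, 5
Phi = rng.normal(size=(N, M))                   # design matrix
w_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])   # sparse ground truth
beta = 1.0 / 0.1**2                             # known noise precision
t = Phi @ w_true + rng.normal(0.0, 0.1, size=N)

alpha = np.ones(M)                              # ARD hyperparameters
for _ in range(50):
    # Gaussian posterior of the weights given the current alpha
    S = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    mu = beta * S @ Phi.T @ t
    gamma = 1.0 - alpha * np.diag(S)            # effective dof per weight
    alpha = np.clip(gamma / (mu**2 + 1e-12), 1e-6, 1e8)
# Irrelevant weights are pruned: their alpha_j grows large and mu_j -> 0.
```

Note how the hyperparameters are optimized rather than marginalized, which is exactly the non-robust treatment of $\psi$ described above.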

The posterior distribution of $\psi$ is

$$p(\psi|D) \propto p(D|\psi)\,p(\psi) = p(\psi)\prod_{i=1}^{N_D} \int p(D_i|\theta_i)\,p(\theta_i|\psi)\,d\theta_i$$

Here, we denote the full data set as $D = \{D_i\,|\,i = 1, \ldots, N_D\}$, where $D_i$ represents the observed data values for $y$. The choice of $p(\theta_i|\psi)$ is flexible. For example, a statistical model is chosen in Refs. [15] and [13], and $\psi$ is called the hyperparameter vector. The statistical model can be seen as a common prior for all $\theta_i$. This explicit modeling of the prior separates different uncertainties by isolating the uncertainty of $\theta_i$ across multiple groups of predictions. It can give better predictions and can also be related to causal inference [10]. Another example of such a hierarchical structure is found in a state-space model, where $\theta_i$ is the state variable and $p(\theta_i|\psi)$ represents the underlying stochastic process. Therefore, $p(\theta_i|\psi)$ can also be a theoretically informed model, instead of a purely empirical or statistical model.

In the current literature, the HPM is mainly used as a tool for optimal prior selection, while the HSM is used for explicit modeling of the uncertainty of model parameters across multiple groups of predictions for the data. However, the HSM has the potential for more sophisticated modeling depending on the choice of $p(\theta_i|\psi)$, though it comes with a significantly larger computational cost due to the extra integral shown in Eq. (5). We note that although the HPM and the HSM are two distinct types of HBM, in theory they can co-exist in a complex HBM, which may also have more than one level of hyperparameters. In practice, it is the limit of computational power that constrains the usage of HBM to a single level of hyperparameters. In this paper, we study the important theoretical implications of the HSM and present an efficient and flexible approximation method for performing Bayesian analysis with the HSM. We note that a similar comparison can be found in Ref. [24], but it is presented purely from the perspective of a likelihood function of $\psi$, rather than as an HBM.

## Theoretical Implications of Hierarchical Stochastic Model

Different assumptions on the uncertainties in a system can be represented using different hierarchical and nonhierarchical models. In this section, we verify that the effect of Occam's razor in the Bayesian model selection framework applies to both types of models. Moreover, we study the theoretical implications of using an HSM in three important aspects, each addressed in one of the following subsections.

For computational accuracy, we demonstrate our results using polynomial regression, because many analytical solutions are available during the Bayesian analyses.

### Separating Different Types of Uncertainties Using Hierarchical Stochastic Model.

In this study, we demonstrate how the HSM can capture different types of uncertainties using models that are embedded with different sources of stochasticity. We generate synthetic data based on two uncertainty models: (1) zero-mean Gaussian error $\epsilon_y$ added to the function $f(x, \theta)$, which can represent measurement noise in practice, and (2) zero-mean Gaussian error $\epsilon_\theta$ added to the model parameters $\theta$, which can represent inherent model uncertainty due to, for example, insufficient knowledge of theoretically informed models or environmental variations across multiple experiments. We use a linear function $f(x, \theta) = \theta x$ ($\theta$ and $x$ are scalars), and all data points are generated with independent $\epsilon_y$ and $\epsilon_\theta$. Three types of synthetic data are considered based on $\hat{\theta}$, a fixed value of the model parameter $\theta$:

- (1) Additive error data, $D_1$: $y = f(x, \hat{\theta}) + \epsilon_y$, $\epsilon_y \sim N(\epsilon_y|0, \hat{\sigma}_y^2)$.
- (2) Embedded error data, $D_{2a}$: $y = f(x, \hat{\theta} + \epsilon_\theta)$, $\epsilon_\theta \sim N(\epsilon_\theta|0, \hat{\sigma}_\theta^2)$.
- (3) Mixed error data, $D_{2b}$: $y = f(x, \hat{\theta} + \epsilon_\theta) + \epsilon_y$, $\epsilon_y \sim N(\epsilon_y|0, \hat{\sigma}_y^2)$ and $\epsilon_\theta \sim N(\epsilon_\theta|0, \hat{\sigma}_\theta^2)$.

where $N(z|\mu, \sigma^2)$ denotes a Gaussian distribution for $z$ with mean $\mu$ and variance $\sigma^2$. We set $\hat{\theta} = 1$, $\hat{\sigma}_\theta = 0.5$, and $\hat{\sigma}_y = 0.2$. Each data set contains 1000 data points, each generated independently from a uniformly distributed $x$ value between preset bounds and random errors $\epsilon_y$ and $\epsilon_\theta$ drawn from the corresponding Gaussian distributions. We generate all three types of data twice, once with $x$ between 0 and 1 and once with $x$ between 0.4 and 1 (see Figs. 2 and 3, respectively). Note that for the case with $x$ between 0.4 and 1, the three data sets look much more similar to each other visually, which is reflected in the difficulty of model selection, as explained later.
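The three generating mechanisms can be sketched as follows, using $f(x,\theta) = \theta x$ and the values $\hat{\theta} = 1$, $\hat{\sigma}_\theta = 0.5$, and $\hat{\sigma}_y = 0.2$ quoted above; the seed is arbitrary.

```python
import numpy as np

# Generate the three synthetic data types for f(x, theta) = theta * x.
rng = np.random.default_rng(42)
theta_hat, sig_theta, sig_y, n = 1.0, 0.5, 0.2, 1000
x = rng.uniform(0.0, 1.0, size=n)

# D1: additive (measurement-like) error only
y1 = theta_hat * x + rng.normal(0.0, sig_y, size=n)
# D2a: embedded error -- each point has its own perturbed parameter
y2a = (theta_hat + rng.normal(0.0, sig_theta, size=n)) * x
# D2b: mixed -- embedded error in theta plus additive error
y2b = (theta_hat + rng.normal(0.0, sig_theta, size=n)) * x \
      + rng.normal(0.0, sig_y, size=n)
```

Note the different residual structures: for $D_1$ the spread around $\hat{\theta}x$ is constant in $x$, while for $D_{2a}$ it scales with $|x|$. Restricting $x$ to [0.4, 1] removes the small-$x$ region where this difference is most visible, which is why the data sets become harder to distinguish.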

Given the three types of data, we perform Bayesian model class selection on four different stochastic model classes:

- (1) Nonhierarchical model, $M_{1a}$: This model class includes a uniform prior for $\theta$ between −1 and 3, and a Gaussian likelihood, i.e., $p(\{x_i, y_i\}|\theta, \sigma_y) = N(y_i|f(x_i, \theta), \sigma_y^2)$ for any $i$. Here, $\sigma_y$ is also treated as a parameter to be inferred, with a uniform prior between 0.001 and 1.
- (2) Hierarchical prior model, $M_{1b}$: This model class has the same likelihood as $M_{1a}$ and a hierarchical prior for $\theta$ distributed as $N(\theta|\mu_\theta, \sigma_\theta^2)$, i.e., the hyperparameters are $\psi = \{\mu_\theta, \sigma_\theta\}$. We use uniform priors for $\mu_\theta$ between −1 and 3 and for $\sigma_\theta$ between 0.001 and 1. The prior for $\sigma_y$ follows the one in $M_{1a}$.
- (3) Zero noise HSM, $M_{2a}$: This is an HSM with $p(\theta_i|\psi)$ modeled as the same Gaussian prior used for $\theta$ in $M_{1b}$, for all $i$. No additive error is assumed in this model, which implies that the likelihood function of $\theta_i$ is a delta function centered at the given data, i.e., $p(\{x_i, y_i\}|\theta_i) = \delta(y_i - f(x_i, \theta_i))$. The priors of the hyperparameters are the same as in $M_{1b}$.
- (4) Full HSM, $M_{2b}$: This is the same HSM as $M_{2a}$ except that a Gaussian additive error is assumed, so the likelihood function is $p(\{x_i, y_i\}|\theta_i, \sigma_y) = N(y_i|f(x_i, \theta_i), \sigma_y^2)$. We use the same uniform priors as in $M_{1b}$ for $\sigma_y$, $\mu_\theta$, and $\sigma_\theta$.

The Supplementary material, which is available under the “Supplemental Data” tab for this paper on the ASME Digital Collection, includes all derivations of the analytical expressions used for estimating the evidence $p(D|M_k)$, the marginal posterior probability density function (PDF) of the model parameters $p(\theta|D, M_k)$, and the posterior PDF of the hyperparameters $p(\psi|D, M_k)$, for a given model class $M_k$. We use Monte Carlo simulation with 10,000 samples to draw posterior samples of the hyperparameters for the “naive” estimation of the evidence, which yields accurate estimates with small variances.
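A “naive” Monte Carlo evidence estimate of this kind can be sketched as follows for $M_{1a}$: draw $(\theta, \sigma_y)$ from their priors and average the likelihood. The synthetic data and seed are illustrative, not the paper's.

```python
import numpy as np

# Naive Monte Carlo evidence estimate for M1a:
#   p(D|M) ~= (1/N) sum_k p(D | theta^(k), sigma_y^(k)),
# with (theta, sigma_y) drawn from the priors quoted in the text.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=100)
y = 1.0 * x + rng.normal(0.0, 0.2, size=100)       # synthetic D1-type data

N = 10_000
theta_s = rng.uniform(-1.0, 3.0, size=N)           # prior for theta
sigma_s = rng.uniform(0.001, 1.0, size=N)          # prior for sigma_y

# Gaussian log-likelihood of each prior sample
r2 = ((y[None, :] - theta_s[:, None] * x[None, :]) ** 2).sum(axis=1)
loglik = -0.5 * r2 / sigma_s**2 \
         - len(y) * np.log(sigma_s * np.sqrt(2.0 * np.pi))

m = loglik.max()                                   # log-sum-exp for stability
log_evidence = m + np.log(np.mean(np.exp(loglik - m)))
```

The estimator is simple but wasteful: only prior samples that land in the high-likelihood region contribute, which is one motivation for the IS-based scheme proposed in Sec. 4.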

#### Results and Discussion.

Tables 1 and 2 summarize the results of the Bayesian inference and model selection for all six data sets (three with $x$ between 0 and 1, and three with $x$ between 0.4 and 1). The Bayesian model selection framework is capable of selecting the model actually used to generate the data. All models give a relatively good estimate of $\hat{\theta}$ through $E[\theta|D]$. The major difference appears in the accuracy of estimating $\hat{\sigma}_\theta$ and $\hat{\sigma}_y$, given by $\mathrm{Std}[\theta|D]$ and $E[\sigma_y|D]$, respectively. In this regard, $M_{1a}$ and $M_{1b}$ always prefer putting uncertainty into $\sigma_y$ and thus perform well for $D_1$ only; $M_{2a}$ cannot handle additive error and thus performs well for $D_{2a}$ only; $M_{2b}$ is the most flexible model and performs relatively well for all data sets $D_1$, $D_{2a}$, and $D_{2b}$.

We note that the results of $M_{1a}$ and $M_{1b}$ are extremely similar. This is because their only difference is the prior of $\theta$, as explained in Sec. 2, and in this case the marginalized prior $p(\theta)$ in $M_{1b}$ is only slightly larger than the prior in $M_{1a}$. Hence, the evidence of the two models is almost the same in all cases.

Furthermore, we observe that in the case of $D_{2b}$ with $x$ between 0.4 and 1, $M_{2a}$ and $M_{2b}$ have similar posterior model probabilities. Intuitively, the data with small $x$ values are more sensitive to additive noise because the noise-to-signal ratio is much higher than for larger $x$ values. When such data are not available, it is challenging to distinguish between $D_{2a}$ and $D_{2b}$. This is visually apparent from the greater similarity between $D_{2a}$ and $D_{2b}$ in Fig. 3 compared with Fig. 2.

Our results suggest using $M2b$ for model calibration, when no prior knowledge indicates that additive error is irrelevant. This HSM can appropriately separate the two types of uncertainties in our case study into the additive and nonadditive parts based on the observed data. If computational power allows, Bayesian model class selection should always be performed among a set of different candidate models, as illustrated in this study.

#### Effect of Grouping.

In Sec. 3.1, all synthetic data are generated independently based on different stochastic models. In practice, we may group some of the predictions to be modeled under one set of parameter values $\theta_i$, where $i$ is the index for each group of predictions. For example, if an experiment is repeated in different laboratories, we may model the data from the same laboratory using the same set of model parameters, assuming that the data share the same environmental factors during the experiment. When the prediction grouping is not known, we may want to find the most plausible grouping; each candidate grouping defines a unique HSM.

We use the same linear function $f(x, \theta) = \theta x$ as before, but generate a new set of data $D$. First, five random samples $\theta^{(i)}$, $i = 1, \ldots, 5$, are drawn from $N(\theta|\hat{\mu}_\theta, \hat{\sigma}_\theta^2)$. Then, 11 data points are generated for each $\theta^{(i)}$ to form a data set $D_i$ using the stochastic forward model $y = f(x, \theta^{(i)}) + \epsilon_y$, where $\epsilon_y$ is a random error drawn from $N(\epsilon_y|0, \hat{\sigma}_y^2)$ independently for each data point. In this study, $\hat{\mu}_\theta = 1$, $\hat{\sigma}_\theta = 0.5$, and $\hat{\sigma}_y = 0.1$. Figure 4(a) shows the five data sets used in this section. The sample mean and standard deviation of $\theta^{(i)}$ are 1.0191 and 0.4079, respectively.

To study the effect of grouping predictions in the HSM, we set up six different HSMs based on different information dependencies between the predictions. We perform Bayesian model class selection to find the most plausible HSM to describe the data. The evidence is calculated similarly to $M_{2b}$ in Sec. 3.1. Details of the calculation can be found in the Supplementary material, which is available under the “Supplemental Data” tab for this paper on the ASME Digital Collection. The six data groupings are the following:

- (1) Actual grouping, $M'_1$: group predictions according to Fig. 4(a).
- (2) Constant $x$ grouping, $M'_2$: predictions with the same $x$ value are grouped together. This case represents grouping data according to the wrong variable. Figure 4(b) shows the resulting grouping.
- (3) One-half error grouping, $M'_3$: starting from $M'_1$, the predictions in each data set are further divided into two sets by the centerline of the data points in the set. This centerline is defined by the mean of $\theta$ inferred from each data point in the set. Figure 4(c) shows the resulting data grouping.
- (4) One-fourth error grouping, $M'_4$: starting from $M'_3$, the predictions in each data set are again further divided into two sets by the centerline of the data points in the set, as done in $M'_3$.
- (5) Single prediction, $M'_5$: each prediction is treated as an independent data set (same as Sec. 3.1.1).
- (6) Random grouping, $M'_6$: predictions for the data points are randomly collected into five groups. Figure 4(d) shows the resulting grouping.

Table 3 summarizes the results of the Bayesian inference and model class selection for all six HSMs. All models give a good estimate of $\hat{\mu}_\theta$ (corresponding to $E[\theta|D]$). However, $M'_2$ and $M'_6$, representing two completely wrong groupings, have significantly lower estimates of $\hat{\sigma}_\theta$ (given by $\mathrm{Std}[\theta|D]$) but larger estimates of $\hat{\sigma}_y$ (given by $E[\sigma_y|D]$). Hence, they have very low log-evidence values. $M'_1$ has the closest estimates of $\hat{\mu}_\theta$, $\hat{\sigma}_\theta$, and $\hat{\sigma}_y$, but it is not the most probable model class among the six. $M'_3$ is the most probable model class even though it underestimates both $\hat{\sigma}_\theta$ and $\hat{\sigma}_y$.

We note that when the total number of data points is fixed, the more groups there are, the smaller the average number of predictions in a group is. This affects the likelihood of $\psi$, which equals the product of $p(D_i|\psi)$ over the data sets $D_i$ (see Eq. (5)). When the number of factors $p(D_i|\psi)$ increases, the value of each evidence term may decrease, because $p(D_i|\theta_i)$ becomes less peaked as the number of data points in the set decreases. The final evidence $p(D|M'_k)$ for a given data grouping $M'_k$ is a tradeoff between these two factors. If we repeatedly apply the splitting that turned $M'_1$ into $M'_3$, we eventually reach $M'_5$; hence, $M'_1$, $M'_3$, $M'_4$, and $M'_5$ can be considered a sequence of similar data grouping methods. In the end, the tradeoff favors $M'_3$ as the most probable grouping. This implies that model class selection based on the evidence may not necessarily recover the actual grouping if one exists; overall, it tries to minimize uncertainties in all parameters. On the other hand, if we choose the actual grouping, the uncertainty quantification will be accurate.

### Uncertainty Quantification for Reduced Order Models Using Hierarchical Stochastic Model.

A reduced order model simplifies the information contained in the original model. This loss of information can be treated as a source of uncertainty in the Bayesian framework. The structure of this type of uncertainty can be defined if the mapping between the original model and the reduced order model is completely known. In most cases, however, the mapping is not well-defined or is too complicated to be studied directly, so there is no clear way to model this loss of information. In this section, we compare the performance of the model classes from Sec. 3.1 in representing reduced order model uncertainty. This study is based on fitting second and third degree polynomial data with linear functions.

#### Problem Setup for Polynomials.

We consider a total of 16 sets of data: eight sets from a quadratic function *f*(*x*) = *x*^{2} and eight sets from a cubic function *f*(*x*) = *x*^{3}. For each type of function, we generate 20, 50, 100, and 200 data points twice (with and without noise), i.e., a total of eight data sets. The *x* values of the data points are generated randomly in the interval [−1, 1]. Additive Gaussian error with standard deviation $\sigma \u0302y=0.1$ is chosen for the noise. Figure 5 shows all 16 data sets used in this study.

We perform Bayesian model class selection and posterior robust prediction using three models described in Sec. 3.1: $M1a$ (denoted as $M1$ in this section), $M2a$, and $M2b$. Supplementary material, which is available under “Supplemental Data” tab for this paper on the ASME Digital Collection, includes all derivations for the analytical expressions used to estimate the log-evidence values and posterior robust predictions $p(y|D,Mk)$ for a given model $Mk$. We evaluate the posterior probability of observing each grid point $(x\u0302,y\u0302)$ on a two-dimensional (2D) fine grid of (*x*, *y*) using the analytical expressions in order to construct the distribution of the posterior robust prediction.
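The grid evaluation of a posterior robust prediction can be sketched as follows for the linear model. The posterior samples below are synthetic placeholders (not values from the study), and $p(y|x,D)$ is approximated by averaging the Gaussian likelihood over posterior samples rather than by the analytical expressions used in the paper.

```python
import numpy as np

# Posterior robust prediction on a 2-D (x, y) grid:
#   p(y | x, D) ~= (1/N) sum_k N(y | f(x, theta_k), sigma_k^2)
rng = np.random.default_rng(5)
theta_s = rng.normal(1.0, 0.05, size=200)        # placeholder posterior draws
sigma_s = rng.normal(0.2, 0.01, size=200)

x_grid = np.linspace(-1.0, 1.0, 101)
y_grid = np.linspace(-2.0, 2.0, 161)
f = theta_s[:, None] * x_grid[None, :]           # f(x, theta) = theta * x

diff = y_grid[None, :, None] - f[:, None, :]     # shape (sample, y, x)
dens = np.exp(-0.5 * (diff / sigma_s[:, None, None]) ** 2) \
       / (sigma_s[:, None, None] * np.sqrt(2.0 * np.pi))
p_y_given_x = dens.mean(axis=0)                  # p(y | x, D) on the grid
```

Each column of `p_y_given_x` is a predictive density over $y$ at one $x$ value, which is exactly what the contour plots of the robust prediction visualize.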

#### Results and Discussion.

Table 4 summarizes the results of model class selection for this study. $M_{2b}$ is the most probable model class for data with additive noise, and $M_{2a}$ is the most probable model class for data without additive noise. $M_1$ is the least probable model class in most cases because the error caused by the reduced order model is highly nonlinear, and we do not expect additive noise alone to be sufficient to explain it. In this study, the sample size (number of data points) does not play a significant role in the results.

Figure 6 shows one set of the prediction results (the results are not sensitive to the sample size). We observe that for cases without additive noise, the predictions from $M_{2a}$ and $M_{2b}$ are almost the same. Because $M_{2a}$ is a simpler model than $M_{2b}$ (fewer parameters), Bayesian model class selection prefers $M_{2a}$ and gives it a higher log-evidence value. For cases with additive noise, the prediction uncertainty from $M_{2a}$ is significantly larger than that from $M_{2b}$. This is because $\theta$ is very sensitive to noise for $x$ values close to zero when a model has no additive error component. As a result, Bayesian model class selection prefers $M_{2b}$ for its ability to better fit the data on average. These results suggest the HSM with additive error as the first candidate for uncertainty quantification of a reduced order model. The flexibility of this model class to handle two types of uncertainty simultaneously (additive error and embedded error in the model parameters) gives it a higher chance of discovering the underlying uncertainty structure of a reduced order model.

## Efficient Approximation of Hierarchical Stochastic Model

The posterior distribution $p(\psi|D)$ plays an important role in the HSM. Equation (5) shows that the calculation of the likelihood $p(D|\psi)$ involves evaluating multiple integrals, which correspond to the evidences for each data set $D_i$ conditional on the hyperparameters $\psi$. This leads to an extremely large computational cost for sampling from $p(\psi|D)$, as well as for estimating the evidence of the model, $p(D)$. Current approaches include using conjugate pairs for analytical results [13], approximating the integrals with the Laplace asymptotic approximation (LAA) [25], or using advanced Markov chain Monte Carlo techniques [26]. These methods remain restrictive: they either cannot handle many complex systems, or their efficiency does not scale well to extra data sets.

In this section, we present an efficient approximation method based on a special use of IS. Most current research efforts focus on the HSM that adds only one extra level to the classical Bayesian model. Our method can be intuitively and efficiently extended to any complex HSMs with more levels of parameters. Moreover, we can lay out a standard procedure for fully analyzing any HSMs based on this method.

For ease of illustration, we demonstrate our method on the commonly used stochastic model shown in Eq. (1). In practice, due to different assumptions on the additive noise, there may be more complicated HSMs than the one shown in Fig. 1(b). Here, we consider two possible alternatives (see Fig. 7): (1) independent additive error parameters, applicable to heterogeneous data with different likelihoods [27], and (2) common additive error parameters, applicable to data sets that are expected to have the same measurement errors [25].

Here, we introduce our method for the basic HSM shown in Fig. 1(b), which represents the cases in which $\sigma_y$ is either known, or uncertain and included with the other uncertain parameters $\theta$. We denote this model class as $M_{HS1}$. In the Supplementary material, which is available under the “Supplemental Data” tab for this paper on the ASME Digital Collection, we present the extension of the idea to the two alternative HSMs shown in Fig. 7, which we denote as $M_{HS2}$ and $M_{HS3}$.

Sampling from $p(\psi|D, M_{HS1})$ requires repeated evaluations of the likelihood $p(D|\psi, M_{HS1})$, and thus evaluations of the integrals for different values of $\psi$. Our idea is to approximate the integrals by IS with a special choice of the proposal PDF that can significantly reduce the total computational cost. In many cases, the likelihood $p(D_i|\theta_i, M_{HS1})$, which involves evaluating $f(x, \theta)$, dominates the computational effort.

First, we perform a classical Bayesian inference for each individual data set $D_i$ using the same likelihood $p(D_i|\theta_i, M_{HS1})$ but a different prior (the choice of this prior is discussed later). To be more specific, we draw samples $\{\theta_i^{(j)}\,|\,j = 1, \ldots, N_{s,i}\}$ from the posterior distribution $p(\theta_i|D_i, M_i)$, where $M_i$ denotes this specific stochastic model class for data set $D_i$ (we draw posterior samples and estimate the evidence $p(D_i|M_i)$). Then, the hierarchical analysis comes as a postprocessing of these results. A very good tool for such a Bayesian inference is the transitional Markov chain Monte Carlo method, where the evidence comes as a by-product of drawing posterior samples and the algorithm is inherently parallel [28]. The likelihood $p(D|\psi, M_{HS1})$ can then be estimated by

$$p(D|\psi, M_{HS1}) \approx \prod_{i=1}^{N_D} p(D_i|M_i)\,\frac{1}{N_{s,i}} \sum_{j=1}^{N_{s,i}} \frac{p(\theta_i^{(j)}|\psi, M_{HS1})}{p(\theta_i^{(j)}|M_i)}$$

In other words, the evidence ratio between $M_{HS1}$ with given hyperparameters and $M_i$ can be approximated by the mean of the prior ratio between the two models over the posterior $\theta_i$ samples of $M_i$.
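Assuming a one-dimensional $\theta_i$, a flat prior $p(\theta_i|M_i)$ on $[-1, 3]$, and a Gaussian $p(\theta_i|\psi)$ with $\psi = (\mu_\theta, \sigma_\theta)$, the postprocessing estimator can be sketched as below; the per-group posterior samples and evidences are synthetic placeholders, not from the paper.

```python
import numpy as np
from scipy.stats import norm

# IS postprocessing: given posterior samples theta_i^(j) ~ p(theta|D_i, M_i)
# (flat prior on [lo, hi]) and log-evidences log p(D_i|M_i), approximate
#   log p(D|psi) ~= sum_i [ log p(D_i|M_i)
#                    + log mean_j p(theta_i^(j)|psi) / p(theta_i^(j)|M_i) ]
def log_likelihood_psi(mu, tau, posterior_samples, log_evidences,
                       lo=-1.0, hi=3.0):
    flat = 1.0 / (hi - lo)                 # flat prior density p(theta|M_i)
    log_p = 0.0
    for theta_j, log_ev in zip(posterior_samples, log_evidences):
        ratio = norm.pdf(theta_j, mu, tau) / flat   # prior ratio per sample
        log_p += log_ev + np.log(np.mean(ratio) + 1e-300)
    return log_p

# Synthetic per-group posteriors (placeholders, not from the paper)
rng = np.random.default_rng(4)
post = [rng.normal(t, 0.05, size=500) for t in (0.8, 1.1, 1.3)]
log_ev = [0.0, 0.0, 0.0]                   # illustrative values of p(D_i|M_i)

ll_good = log_likelihood_psi(1.0, 0.5, post, log_ev)   # psi near the spread
ll_bad = log_likelihood_psi(2.5, 0.1, post, log_ev)    # psi far away
```

These cheap per-$\psi$ likelihood evaluations can then be fed to any sampler over $\psi$; adding a new data set only requires one extra classical inference and one extra factor in the product.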

Viewing this as standard IS with target distribution $p(x)$ and proposal distribution $q(x)$, this implies the choice $q(\theta_i) = p(\theta_i|D_i, M_i)$.

A large number of posterior samples from $p(\theta_i|D_i, M_i)$ can also reduce the variance of the IS estimate. If it is computationally feasible, an alternative to $p(\theta_i|D_i, M_i)$ would be a uniform distribution that covers the significant regions of $p(\theta_i|M_{HS1}) = \int p(\theta_i|\psi, M_{HS1})\,p(\psi|M_{HS1})\,d\psi$. We note that the evidence will be insignificantly small for $\psi$ values that lead to a prior $p(\theta_i|\psi, M_{HS1})$ whose mode is far from the maximum of the likelihood $p(D_i|\theta_i, M_{HS1})$; hence, accuracy is not important in this case. In the opposite case, IS is a good approximation. The inaccuracy problem matters only for the intermediate case, i.e., when the prior mode is neither too far from nor too close to the maximum of the likelihood. The range of $\psi$ leading to this important case depends on the posterior sample size $N_{s,i}$ for each data set $D_i$.

We note that this approximation method provides information about the Bayesian inference for each individual data set first. This information can be compared with the HSM analysis to give further insight about the system of interest. Furthermore, our method is very efficient when introducing extra data sets, because we do not need to rerun the whole problem: the overhead is only the classical Bayesian inferences for the extra data sets and a postprocessing step that is not computationally intensive.

## Illustrative Examples

We test our method using two examples: a molecular dynamics simulation of Krypton and pharmacokinetic/pharmacodynamic (PKPD) modeling of a cancer drug. Both examples are taken from previous studies, whose results are used to validate the efficiency and accuracy of our method.

### Molecular Dynamics: Krypton.

Wu et al. [25] proposed using the HSM (called the HBM in that paper) to calibrate the Lennard-Jones (LJ) potential parameters ($\epsilon_{LJ}$ and $\sigma_{LJ}$) of Krypton based on viscosity data $D$ reported from nine different laboratories. In that paper, the authors approximated the integrals in Eq. (7) by LAA. Since the joint posterior distribution of $\epsilon_{LJ}$ and $\sigma_{LJ}$ for each data set is close to a Gaussian distribution, LAA is a good approximation in this particular case. We revisit the problem with the same data and the same model class setup as described in Ref. [25]. Instead of the LAA, we use our proposed method to estimate the joint posterior distribution of $\epsilon_{LJ}$ and $\sigma_{LJ}$ and the posterior robust prediction.

_{LJ}Following the assumptions in Ref. [25], we use the HSM with common additive error. First, we build the EIM basis functions based on the method shown in the Supplementary material, which is available under “Supplemental Data” tab for this paper on the ASME Digital Collection. We begin with a training set of *ε _{LJ}* and

*σ*from a coarse grid of 256 points within the boundaries 100 ≤

_{LJ}*ε*≤ 400 and 0.2 ≤

_{LJ}*σ*≤ 0.5. A fine grid of 172 points for

_{LJ}*σ*is used between 0.0001 and 0.5, uniformly distributed in log scale. We obtain a total of 50 basis functions for each data set to achieve an error less than 0.001%. Figure 8 shows the results of the EIM approximation. We observe that the bases of

_{y}*ε*and

_{LJ}*σ*coincide with the high likelihood values in the domain. This is consistent with our intuition that important regions of the likelihood should be included in the online estimation for better accuracy.

Next, posterior samples of $\epsilon_{LJ}$ and $\sigma_{LJ}$ are drawn for each data set and recorded to be used in the postprocessing step of our HSM analysis. We use BASIS to draw the samples because it is highly parallel and the evidence is a by-product of the algorithm [29]. Then, we draw the posterior samples of the hyperparameter $\psi$ with the postprocessing step in our method. BASIS is once again used to draw 1000 posterior samples of $\psi$. The posterior distribution of $\epsilon_{LJ}$ and $\sigma_{LJ}$ is approximated using the $N$ posterior samples of $\psi$.
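The final approximation step, averaging the conditional $p(\theta|\psi)$ over the $N$ posterior samples of $\psi$, can be sketched as follows, assuming for illustration a Gaussian conditional with $\psi = (\mu, \sigma)$ (the function name and the Gaussian form are illustrative choices):

```python
import numpy as np

def marginal_posterior_pdf(theta_grid, psi_samples):
    """Sketch of p(theta | D) ≈ (1/N) sum_i p(theta | psi_i), assuming a
    Gaussian conditional p(theta | psi) with psi = (mu, sigma)."""
    mus, sigmas = psi_samples[:, 0], psi_samples[:, 1]
    # evaluate each Gaussian component on the grid (broadcasting over samples)
    z = (theta_grid[:, None] - mus[None, :]) / sigmas[None, :]
    comp = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigmas[None, :])
    return comp.mean(axis=1)  # average over the N hyperparameter samples
```

Because each term is a normalized density, the Monte Carlo average is itself a valid density on the grid.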

Figure 9 shows the results of the HSM analysis using our proposed methods. We observe that our results are consistent with the results reported in Ref. [25] both in terms of the posterior distributions of the parameters and the posterior robust prediction. This provides a verification of the accuracy of our method.
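The evidence-as-a-by-product property of tempering samplers such as BASIS can be shown exactly on a parameter grid: the log evidence telescopes into a sum of stage-wise normalizing-constant ratios. The sketch below is a didactic grid version of this identity, not the BASIS algorithm itself:

```python
import numpy as np

def logsumexp(a):
    # numerically stable log(sum(exp(a)))
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def staged_log_evidence(prior_w, log_like, betas):
    """Log evidence via telescoping tempering ratios on a grid:
    log Z = sum_j log E_{pi_j}[ L^(b_{j+1} - b_j) ],
    where pi_j is the tempered posterior at inverse temperature b_j.
    prior_w sums to 1; betas increases from 0 to 1."""
    log_pi = np.log(prior_w)              # unnormalized tempered log weights
    log_z = 0.0
    for b0, b1 in zip(betas[:-1], betas[1:]):
        inc = (b1 - b0) * log_like
        lw = log_pi - logsumexp(log_pi)   # normalize the current stage
        log_z += logsumexp(lw + inc)      # stage ratio Z_{j+1} / Z_j
        log_pi = log_pi + inc
    return log_z
```

On a grid this telescoping sum reproduces the directly computed evidence exactly; in a sampler, each stage ratio is estimated by the mean importance weight of the current population.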

### Pharmacokinetic/Pharmacodynamic Model: Cancer Drug.

Finley et al. [30] used the classical Bayesian model to study a PKPD model for the anti-vascular endothelial growth factor (anti-VEGF) cancer therapeutic agent, aflibercept. Multiple models with different numbers of parameters were calibrated using two types of clinical data: the plasma concentrations of free aflibercept and bound aflibercept. The data can be separated into six groups, each corresponding to a different dosage of the drug. We apply the HSM structure in Fig. 1(b) to the basic model used in Ref. [30], which has three PK model parameters and five prediction error parameters. The prediction of the two types of data using these parameters is calculated by first running the system to reach steady-state and then imposing the corresponding drug dosages. The three model parameters $\{k_{EC}, k_N, k_T\}$ represent the secretion rates of anti-VEGF in the blood, normal tissue, and tumor compartments, respectively. The five prediction error parameters $\{\sigma_{SS}, \sigma_{BAs}, \sigma_{BAf}, \sigma_{FAs}, \sigma_{FAf}\}$ represent the standard deviations of zero-mean Gaussian distributions for five different types of prediction errors: $\sigma_{SS}$, the error on the steady-state prediction; $\sigma_{BAs}$, the error on the bound aflibercept prediction scaled by time; $\sigma_{BAf}$, the error on the bound aflibercept prediction without any scaling; $\sigma_{FAs}$, the error on the free aflibercept prediction scaled by time; and $\sigma_{FAf}$, the error on the free aflibercept prediction without any scaling. We choose independent Gaussian prior distributions for all of the eight parameters in log-10 scale.

Figure 10 and Table 5 compare the posterior distribution of the classical Bayesian model used in Ref. [30] with the posterior distribution of the basic HSM presented in this paper. We observe that fitting all data at once (the classical Bayesian model) leads to a significantly larger posterior variance for the model parameters, while the HSM leads to a larger posterior variance for the prediction error parameters. This implies that the knowledge about the model parameters is more transferable across different dosages than the knowledge about the prediction error parameters. We note that the steady-state prediction is not affected by the change of dosage. Indeed, we observe that the inferred value of $\sigma_{SS}$ in the HSM is similar to the value for the classical Bayesian model. The slight increase of the coefficient of variation (CV) of $\sigma_{SS}$ for the HSM case is expected because the data used for inference in the HSM are divided into groups of smaller data sets. Moreover, the scaled standard deviations $\sigma_{BAs}$ and $\sigma_{FAs}$ are larger, and the fixed standard deviations $\sigma_{BAf}$ and $\sigma_{FAf}$ are smaller in the HSM. This is also expected because in the classical Bayesian model, predictions for different dosages may have different levels of sensitivity to the errors scaled by time. Hence, the classical Bayesian model tends to have a larger $\sigma_{BAf}$ and $\sigma_{FAf}$ than $\sigma_{BAs}$ and $\sigma_{FAs}$. As a result, the posterior robust prediction based on the HSM (Fig. 11) has completely different prediction errors for both bound aflibercept and free aflibercept compared to the one in Ref. [30].
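The exact way the time-scaled and fixed error terms enter the likelihood is specified in Ref. [30] and is not restated here; as a minimal sketch, the code below assumes, hypothetically, that the two terms combine additively in variance for one data type:

```python
import numpy as np

def log_likelihood(y, g, t, sigma_s, sigma_f):
    """Gaussian log-likelihood sketch for one data type (e.g., bound
    aflibercept). Hypothetical error model: the time-scaled term
    sigma_s * t and the fixed term sigma_f combine additively in
    variance, var(t) = (sigma_s * t)^2 + sigma_f^2."""
    var = (sigma_s * t) ** 2 + sigma_f ** 2
    # zero-mean Gaussian errors with time-dependent variance
    return -0.5 * np.sum((y - g) ** 2 / var + np.log(2.0 * np.pi * var))
```

Under this reading, the scaled term dominates at late times while the fixed term sets a floor on the prediction error at early times, which is consistent with the trade-off between the two parameter groups discussed above.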

## Conclusion

We demonstrate the benefits of using the hierarchical Bayesian framework in complex systems. Because of the double usage of the term HBM in the literature, we first explain the distinction between two very different HBMs: the HPM and the HSM. Then, we focus on studying the HSM, which has many interesting theoretical implications, as well as many computational challenges in practice. Based on examples of polynomial functions, we suggest that the HSM is capable of explicitly separating different types of uncertainties in a system and can be an effective tool for modeling uncertainty of reduced order models. In order to apply the HSM to practical problems, we propose an efficient approximation method to tackle the high computational cost associated with using HSMs. Our method is a “bottom-up” approach that begins by drawing posterior samples of the model parameters for each data set. Then, we use a postprocessing step to move up the hierarchy of the parameters for drawing posterior samples. Building on the basic HSM, we demonstrate our method using two alternative HSMs that have a more complicated hierarchical structure. Finally, we validate our method using two illustrative examples based on previous studies: a molecular dynamics simulation of Krypton and a PKPD model of a cancer drug.

The HSM is a convenient and effective tool for building the stochastic model/likelihood of a complicated system. It also opens up new types of analyses and can lead to conclusions that differ from those of classical Bayesian inference. Our approximation method provides a standard procedure for analyzing hierarchical models. Beginning with an analysis for each individual data set, our method allows us to move up the hierarchy of the parameters efficiently to potentially extract more in-depth information about the system of interest. Moreover, it is very efficient for sequentially received data sets because our estimation scheme is a computationally fast postprocessing step. So far, we have mainly used a purely statistical (empirical) model for $p(\theta |\psi )$. In future work, we plan to employ other choices for modeling $p(\theta |\psi )$, which may motivate the development of new modeling tools.

## Acknowledgment

We acknowledge support from ETH Zurich and computational time at the Swiss National Supercomputing Center CSCS under Project No. s448. P. K. and P. A. acknowledge support from the European Research Council (Advanced Investigator Award No. 341117).