Solving optimal design problems through crowdsourcing faces a dilemma: On the one hand, human beings have been shown to be more effective than algorithms at searching for good solutions of certain real-world problems with high-dimensional or discrete solution spaces; on the other hand, the cost of setting up crowdsourcing environments, the uncertainty in the crowd's domain-specific competence, and the lack of commitment of the crowd contribute to the lack of real-world application of design crowdsourcing. We are thus motivated to investigate a solution-searching mechanism where an optimization algorithm is tuned based on human demonstrations on solution searching, so that the search can be continued after human participants abandon the problem. To do so, we model the iterative search process as a Bayesian optimization (BO) algorithm and propose an inverse BO (IBO) algorithm to find the maximum likelihood estimators (MLEs) of the BO parameters based on human solutions. We show through a vehicle design and control problem that the search performance of BO can be improved by recovering its parameters based on an effective human search. Thus, IBO has the potential to improve the success rate of design crowdsourcing activities, by requiring only good search strategies instead of good solutions from the crowd.

## Introduction

### Challenges and Opportunities for Design Crowdsourcing.

Optimal design problems often have large solution spaces and highly nonconvex objectives and constraints, inhibiting effective solution searching through existing optimization algorithms. Some of these problems, however, have been quite successfully (yet heuristically) solved by human beings. Notable examples include protein folding [1,2], RNA synthesis [3,4], genome sequence alignment [5], robot trajectory planning [6], and others [7–9]. The superior performance of *some* human beings at solving these problems demonstrates the advantages of human intelligence, which are supported by cognitive science and neuroscience findings [10] (see discussion in Sec. 5.1). However, despite a handful of success stories, applications of crowdsourcing to real-world design problems have yet to overcome several practical barriers. The cost of setting up problem-dependent crowdsourcing environments, the lack of commitment from crowd members, and uncertainty in domain-specific crowd competence have all contributed to its lack of adoption, while the growing availability of computation resources often makes straight-forward optimization or brute-force search a more convenient approach.

Our earlier study [8] highlighted these challenges for design crowdsourcing: We gamified a vehicle design and control problem (called the “ecoRacer” problem in what follows) where the objective is to complete a track with the minimal energy consumption within a time limit, by finding the optimal final drive ratio of the vehicle and the control policy for acceleration and regenerative braking. The game was broadcast on social media and received more than 2000 plays from 124 unique players within the first month. Results showed that (1) the marginal improvement in average game score of the crowd over an algorithm does not necessarily justify the high cost for developing crowdsourcing games and (2) only a few players were committed to the search for more than 50 iterations, and still fewer can outperform the computer-found solution at all (see summary in Fig. 1).

Nonetheless, human search results displayed a significantly different *search pattern* than that of the algorithm. In particular, quite a few players showed rapid early improvement in performance, beyond the average performance of the computer, before they quit the game without reaching a solution close to the theoretical optimum. This observation is consistent with existing research (see, for example, Khatib et al. [2] on a human-designed protein folding algorithm having a short-term advantage over a standard algorithm) and suggests that while few people care to actually find the “best solution,” their early demonstrations on *how they search* for a better solution may still be valuable. Specifically, we hypothesize that if a computer algorithm can be tuned to mimic these demonstrations, it can serve as a replacement to human solvers in their absence, to search in an effective way without ever abandoning the problem.

### Learning to Search.

This paper aims to test the previously mentioned hypothesis. We model a human solver's search behavior through a Bayesian optimization (BO) algorithm (BO, also known as efficient global optimization) [12,13]. The algorithm iterates between two steps: (1) Estimating the shape of the problem space, based on previous solutions and the corresponding performances, using a Gaussian process (GP) model [14] and (2) creating a new solution based on this estimate (details in Sec. 2). While BO is not provably the underlying mechanism humans use, we hypothesize that the algorithm can be tuned to mimic the results of successful human search strategies, specifically in comparison with other popular gradient- and nongradient-based optimization algorithms. The key assumption in modeling human search behavior through BO is the use of a GP to account for human beings' learning of input–output relationships (or called “function learning” in psychology). This assumption is supported by various findings: In a recent review of function-learning models, Lucas et al. [15] showed that the two major schools of models, i.e., rule- and similarity-based, can be unified through a Gaussian process.^{1} As discussed in Wilson et al. [16], the evidence that Occam's Razor plays an important role in human prediction also suggests that GP is an appropriate model for function learning, as GP reduces model complexity by construction [17]. Empirically, Borji and Itti [18] showed that BO, with the use of GP, has the closest convergence performance to human searches when applied to one-dimensional (1D) optimization problems. In fact, many higher-dimensional problems that human beings naturally solve, such as locomotion planning, have also been successfully solved through the use of GP [19–22].

Under this modeling assumption, we investigate how BO parameters can be estimated for the algorithm to best match human solver's search trajectory, i.e., the sequence of solution-performance pairs. To this end, we introduce an inverse BO (IBO) algorithm to derive the maximum likelihood estimators (MLEs) for BO parameters and discuss challenges in its implementation (see Sec. 3). Validation of the IBO algorithm takes two steps. We first use a simulation study to show that IBO can successfully estimate BO parameters used in generating a search trajectory (Sec. 3.2). We then show through the ecoRacer problem that the search performance of BO can be improved when its parameters are modified based on observing an effective human search and implementing IBO (Sec. 4). The results provide evidence that IBO can accelerate a search using only good search *strategies* without needing a large number of good human solutions. Thus, incorporating IBO in design crowdsourcing may lower the requirement on crowd commitment and so increase its chance of success. Limitations and their potential relaxations of the current IBO implementation will be discussed in depth in Sec. 5.

### Related Work.

It is important to note that the focus of this paper is on the design of optimization algorithms aided by human demonstrations, rather than the derivation of qualitative explanations of the strengths and limitations of human design strategies. There have been numerous studies from the latter category in recent years (see Refs. [23–28] for example). This paper is also distinguished from studies that propose human-inspired optimization algorithms (see Refs. [29–31] for example), in that the learning of the optimization algorithm in our case is conducted by another algorithm, rather than by human researchers. From this aspect, our study is related to studies in learning-to-learn [32] where algorithms (e.g., for gradient-based optimization [33] and optimal control [34]) are tuned and controlled by a higher-level algorithm. In such work, however, the algorithms are often improved purely computationally through reinforcement learning (RL) by solving similar problems repeatedly. Due to the use of human demonstrations, our paper is also related to inverse reinforcement learning (IRL) (see discussion in Sec. 5.2), where human control strategies are used for defining and finding optimal control strategies.

## Preliminaries on Bayesian Optimization

This section provides some background knowledge on BO to facilitate the discussion on IBO in Sec. 3.

### Terminologies and Notations.

Let an optimization problem be $minx\u2208Xf(x)$ where $X\u2286\mathbb{R}p$ is the *solution space*. A search trajectory with *K* iterations can be represented by $hK:=<XK,fK>$, where **X*** _{K}* and $fK$ represent the collection of

*K*samples in $X$ and their objective values, respectively. $h0:=<X0,f0>$ represents an initial exploration set with

*K*

_{0}samples. Human strategy is represented by algorithmic parameters $\lambda $ that govern the search behavior: During the search, each new solution $xk+1$ (for $k=0,\u2026,K\u22121$) is determined by $hk:=<Xk,fk>$ and $\lambda $ through maximizing a merit function with respect to $x$: $xk+1=argmaxx\u2208XQ(x;hk,\lambda )$. The functional forms of the merit function $Q(x)$ will be introduced in Secs. 2.2 and 3. We also define $\Lambda :=diag(\lambda )$ and its estimator as $\Lambda \u0302:=diag(\lambda \u0302)$.

### The BO Algorithm.

We briefly review the BO algorithm, to explain how each new sample $x$ is drawn based on the merit function $Q(x)$, itself defined by previous samples. Knowing this procedure is necessary for understanding the inverse BO algorithm, where we estimate the most likely BO parameters for a given trajectory of samples.

*Model update:*It first updates a GP model to predict objective values, based on current observations

*h*and Gaussian parameters $\lambda $. Without considering random noise in evaluating the objective, the GP model can be derived as $f\u0302(x;hk,\lambda )=b+rTR\u22121(fk\u2212b)$, where $b=((1TR\u22121fk)/(1TR\u221211)),\u2009r$ is a column vector with elements $ri=exp\u2009(\u2212(x\u2212xi)T\Lambda (x\u2212xi))$ for $i=1,\u2026,k,\u2009R$ is a symmetric matrix with $Rij=exp\u2009(\u2212(xi\u2212xj)T\Lambda (xi\u2212xj))$ for $i,j=1,\u2026,k$, and $1$ is a column vector with ones. Without prior knowledge, the MLE of $\lambda $ for the GP model can be derived by solving

_{k}*Sampling the solution space*: The second step is to determine the next sample using the GP model. A common sampling strategy is to pick the new solution in $X$ that maximizes the

*expected improvement*from the current best objective value $fmin:=minfk$ (assuming a minimization problem): $QEI(x;hk,\lambda )=(fmin\u2212f\u0302)\Phi ((fmin\u2212f\u0302)/\sigma )+\sigma \varphi ((fmin\u2212f\u0302)/\sigma )$. Here, $\Phi (\xb7)$ and $\varphi (\xb7)$ are the cumulative distribution function and probability density function of the standard normal distribution, respectively. The new sample is thus obtained by solving

Figure 2 demonstrates four iterations of BO in optimizing a 1D function, with the GP model and the expected improvement function updated in each iteration. Note that similar to human searching behavior, BO is a stochastic process: First, the choice of the new design is stochastic, with better designs being more probable to be chosen;^{2} and second, the initial exploration *h*_{0} can be stochastic when it is modeled by a random sampling scheme, e.g., Latin hypercube sampling (LHS, see Ref. [12] for details).

## Inverse Bayesian Optimization

We consider human solution search to consist two stages: A few exploratory searches are first conducted to acquire a preliminary understanding of the problem, before the execution of BO follows. For example, a player may spend a few trials to get familiar with a new game, before thinking about strategies to improve his score. IBO minimizes the sum of two costs corresponding to the exploration and BO stages, respectively. By doing so, it finds the most likely explanation of the underlying search strategy.

Specifically, IBO estimates $\lambda $, along with the size of the initial exploration set *K*_{0}, given the trajectory *h _{K}*. To do so, we introduce and minimize a cost function consisting of the exploration cost for

*h*

_{0}, denoted as

*L*

_{INI}, and the BO cost for the rest of

*h*, denoted as

_{K}*L*

_{BO}. We define $LINI:=\u2212log\u2009(Dp(X0))$, where $p(X0)$ is the joint probability of the exploration set and $D:=|X|$ is the size of the solution space; and $LBO:=\u2212log\u2009(Dp(hK\u2212h0|h0))=\u2212\u2211k=0K\u22121\u2009log\u2009(Dp(xk+1|hk))$ where $p(xk+1|hk)$ is the density for choosing $xk+1$ conditioned on

*h*. Here, $log(\xb7)$ stands for natural logarithm.

_{k}The derivation of *L*_{INI} and *L*_{BO} is as follows: To calculate *L*_{INI}, we assume that each new sample during the exploration phase, **x*** _{i}* for $i=1,\u2026,K0$, tends to maximize its minimum Euclidean distance $d(xi,X<i)$ to previous samples $X<i$, this is referred to as the

*max–min sampling scheme*in what follows. Let the joint probability of the exploration set be $p(X0)=p(x1)p(x2|x1)\cdots p(xK0|X<K0)$ and each conditional probability follows a Boltzmann distribution: $p(xi|X<i)=exp\u2009(\alpha INId(xi,X<i))/ZINI(xi,\alpha INI)$. Here, the scalar

*α*

_{INI}represents how strictly each sample from $X0$ follows the max–min sampling scheme, and $ZINI(xi,\alpha INI)=\u222bx\u2208X\u2009exp\u2009(\alpha INId(x,X<i))dx$ is a partition function that ensures that $\u222bXp(xi|X<i)dx=1$. Note that the first sample in the exploration set is considered to be uniformly drawn, and thus, its contribution to the cost (a constant) can be omitted.

where $ZBO(hk,\lambda ,\alpha BO)=\u222bx\u2208X\u2009exp(\alpha BOQEI(x;hk,\lambda ))dx$ is also a partition function. The parameter *α*_{BO} plays a similar role to *α*_{INI}. For simplicity, we define $l\u0303i:=\u2212log(Dp(xi|X<i))$ and $lk:=\u2212log(Dp(xk+1|hk))$, so that $LINI=\u2211i=1K0l\u0303i$ and $LBO=\u2211k=0K\u22121lk$. A lower value of $l\u0303$ or *l* represents higher probability density of the current sample to be drawn by max–min sampling or BO, respectively, and a zero indicates that the sample can be considered as uniformly drawn.

Note that to find the optimal *K*_{0} for any given *α*_{INI}, *α*_{BO}, and $\lambda $, one can first calculate the optimal $l\u0303i$ and *l _{k}* for $i,k=2,\u2026,K$, with respect to

*α*

_{INI},

*α*

_{BO}, and $\lambda $, and then scan $K0=2,\u2026,K$ to find the lowest value of $LINI+LBO$. The scan starts at $K0=2$ because it is not meaningful to initialize BO with a single sample.

### Numerical Integration for *Z*_{BO}.

*l*requires an approximation of the integral $ZBO(hk,\lambda ,\alpha BO)$, where the integrand $QEI(x;hk,\lambda )$ is usually a highly nonconvex function with respect to $x$, with function values dropping significantly around local maxima. See Fig. 2 for example. Thus, we propose to approximate

*Z*

_{BO}with importance sampling using a customized proposal density function that combines a uniform distribution with density $p(x)=1/D$ and a multivariate normal distribution with density $q(x)=(2\pi \sigma Ip)\u22121\u2009exp(\u2212||x\u2212\mu ||2/2\sigma I2)$, where

*σ*and $\mu $ are the parameters of $q(x)$. The uniform distribution is used to sample over $X$, while the normal distribution helps to improve the approximation by capturing the potential peak at the current sample $xk+1$. Thus, we set $\mu :=xk+1$. Let $xiu\u2208U$ for $i=1,...,I$ and $xjn\u2208N$ for $j=1,...,J$ be samples from $p(x)$ and $q(x)$, respectively. The approximation $Z\u0302BO$ can be calculated by

_{I}with arguments of *Q*_{EI} omitted for simplicity. The derivation of Eq. (5) is presented in the Appendix. Note that this approximation works under the assumption that $\u222bx\u2208Xq(x)dx\u22481$, which is plausible as the normal distribution is designed to have a narrow spread to match the local peak at $xk+1$. In this paper, the shape of this normal distribution is set by $\sigma I=0.01$ universally. While the setting of *σ _{I}* affects the variance of the approximation of

*Z*

_{BO}, we found this setting to perform well in practice. For

*Z*

_{INI}, since the minimum Euclidean distance function in a high-dimensional space with limited samples is a relatively smooth function, we use Monte Carlo sampling for its approximation.

### Simulation Studies.

As a validation step, we show that IBO can recover the parameters of a general BO given only an observed search trajectory. If IBO can determine the correct parameters (1) after a few number of iterations, (2) in a high-dimensional problem space, and (3) from a wide range of trajectory/parameter settings, then it could be used to recover parameters for matching a BO algorithm to an observed human search.

We use a simulation study to show that, for a given search trajectory, IBO can correctly identify the true $\lambda $ provided the trajectory is sufficiently different from a random search. In addition, the simulation indicates that learning from already-efficient search behavior (i.e., estimating $\lambda $ through IBO of an observed effective search trajectory) can lead to better BO convergence than the more common self-improvement methods (i.e., updating $\lambda \u0302$ by maximizing the likelihood of the observations according to the GP model).

#### Simulation Settings and Results.

The simulation study is detailed as follows: We apply BO to a 30-dimensional Rosenbrock function constrained by $X:=[\u22122,2]30$. To initialize BO, we use LHS to draw ten samples from $X$. BO terminates when the expected improvement for the next iteration is less than $10\u22123$. At each iteration, the expected improvement is maximized using a multistart gradient descent algorithm [38] with 100 LHS initial guesses. A set of BO parameters, $\Lambda =0.01I,0.1I,1.0I$, and $10.0I$, are used to perform the search, where $I$ is the identity matrix. For each of the four settings, 30 independent trials are recorded.

For each BO setting $\Lambda $, each candidate estimator $\Lambda \u0302$, and each trajectory of length $K=5,...,20$, we solve Eq. (4) using a grid search with $G\alpha BO:={0.01,0.1,1.0,10.0}$ and $GK0:={2,\u2026,K}$. We fix *α*_{INI} to 1.0 and 10.0 and will discuss its influence on the estimation. Figure 3 presents the resulting minimal *L* for all the four cases and under all guesses. Each curve in each subplot shows how the minimal *L* (with respect to *α*_{BO} and *K*_{0}) changes as the search continues. The means and standard deviations of *L* are calculated using the 30 trials. *Z*_{INI} is approximated using a sample size of 10,000. In approximating *Z*_{BO}, samples from the normal and the uniform distributions are of equal sizes ($I=J=5000$).

#### Analysis of the Results.

Based on the results from this simulation, as summarized in Fig. 3, the major finding from this simulation study is that IBO can successfully recover the BO parameters in cases where BO does not resemble uniform random sampling of the design space. In the cases of $\Lambda =0.01I,0.1I,1.0I$, we see that the correct choices of $\Lambda \u0302$ consistently lead to the lowest cost along the search process. After only one or two iterations, in nearly all the cases, the correct parameter has the highest likelihood of all the four propositions, and this remains the case along the search. However, under large BO parameters such as $\Lambda =10.0I$, the similarity between any two points in the design space becomes close to zero, leading to (almost) uniform uncertainty and expected improvement. Therefore, this setting reduces BO to a uniform random sampling scheme. Figure 3(d) shows that IBO does not perform well in this situation. To better understand the behavior of IBO under near-random searches, a curious reader may find a discussion on the properties of the costs *l* and $l\u0303$ in the Appendix.

#### Learning From Others Versus Self-Adaptation.

The study mentioned earlier showed that the correct BO setting $\lambda $ can be learned through IBO. This subsection further demonstrates the advantage of “learning from others” (i.e., updating $\lambda $ through IBO), over “self-adaptation” (i.e., finding the MLE of $\lambda $ using *h _{k}*). The settings follow the study mentioned earlier and the results are shown in Fig. 4. First, to show the significant influence of $\lambda $ on search effectiveness, we show the convergence of two fixed search strategies with $\Lambda =0.01$ and 10.0. Note that while neither converges to the optimal solution within 50 iterations, the former is significantly more effective than the latter. For “self-adaptive BO,” we use a grid search ($G\Lambda ={0.01I,0.1I,1.0I,10.0I}$) to find $\Lambda \u0302GP$ that maximizes Eq. (1) at each iteration and use $\Lambda \u0302GP$ to find the next sample. We show in Fig. 4(b) the percentages of the four guesses being $\Lambda \u0302GP$ along the search, using $G\Lambda $ as the initial guesses for BO. The learning from others case starts with $\Lambda =10.0I$ and uses IBO to derive $\Lambda \u0302$ from the trajectory produced by $\Lambda =0.01I$. From Figs. 3 and 4(b), we see that $\Lambda \u0302GP$ does not converge to $\Lambda =0.01I$ as quickly as IBO, which explains why learning from others outperforms self-adaptation in Fig. 4(a). It is worth noting that this difference in performance may be relatively dependant on the dimensionality of the problem, as the two strategies were found to have similar convergence performance when applied to two-dimensional functions. One potential explanation for this is that, in a lower dimensional space, an effective $\Lambda \u0302GP$ can be learned with a smaller number of samples.

## Case Study

We now investigate how IBO may improve the performance of BO when applied to a vehicle design and control problem.

### Dimension Reduction for Player's Control Signals.

The solution data from each game play consist of (1) the final gear ratio, (2) the recorded acceleration and braking signals, and (3) the corresponding game score. The length of a raw control signal matches that of the track, which has 18,160 distance steps. Encoding control signals to a low-dimensional space is feasible since common acceleration and braking patterns exist across all plays. In Ref. [11], this was done by introducing manually defined state-dependent basis functions (i.e., polynomials of the velocity of the car, slope of the track, distance to the terminal, remaining battery energy, and time spent) to parameterize the control signals. The underlying assumption that human players are aware of all the state-dependent bases is untested.

In this paper, we perform dimension reduction based on evidence that human beings often solve high-dimensional problem by performing problem abstraction and using a hierarchical search [39–43]. In the context of the ecoRacer game, we hypothesize that players segment the track into *m* discrete sections and make separate control decisions in each segment. Mathematically, this is equivalent to projecting observed signals onto *m* independent basis, which can be elegantly addressed by ICA [44]. Compared with principal component analysis, where the bases minimize the covariance of the data, our ICA implementation (see results in Fig. 5) maximizes the Kullback–Leibler divergence between all bases pairs and is more suitable for non-Gaussian signals, such as the control data from this game (i.e., the acceleration/braking signals across players at each step along the track are unlikely to follow a Gaussian distribution).

Much like principal component analysis the choice of the number of ICA bases requires a balance between fidelity and practicality. While it is theoretically possible to find the “most likely” number of bases using information-theoretic criteria for model selection [45],^{3} we chose to use 30 bases because (1) over 95% of the variance is explained and (2) the resultant solution space (30 control variables and one design variable) is small enough for BO to be effective.

### Derivation of λ̂ and λ̂GP.

We apply IBO to two players, referred to as “P2” and “P3,” who achieved the second and third highest score within 31 and 73 plays, respectively, much less than the 150 plays from the achiever of the highest score. To do so, we first encode all the control solutions from the two players using the learned ICA bases. Together with the final drive ratios, all the solutions are then normalized to be within $[\u22121,1]31$. IBO is performed separately on P2 and P3. We found that the probability for either player to have followed the max–min sampling scheme is lower than that of following BO, as the minimal values of $l\u0303(xk,\alpha INI)$ for $k=2,...,31$ (with respect to *α*_{INI}) are dominated by those of $l(xk,\alpha BO)$. This means that the players were not likely to have performed an exploration before they started trying to improve their performance. This finding is reasonable, as the scoring mechanism in ecoRacer game, just like in other racer games with fairly predictable vehicle dynamics, can be understood by the player early on. Therefore, the search for $\lambda \u0302$ is performed by solving Eq. (4) with $\lambda \u2208[0.01,10.0]31$, $\alpha BO\u2208G\alpha BO$, and a minimal number of initial samples ($K0=2$) required for BO. For comparison purpose, we obtain $\lambda \u0302GP$ using plays from P2, which represents a case where BO parameters are fine-tuned by the observed game plays, without trying to explain why these solutions were searched by the player.

Due to the nonconvexity of Eqs. (4) and (1), gradient-based searches using a series of ten initial guesses are conducted to avoid inferior local solutions. Finite difference is used for gradient approximation. Both $\lambda \u0302$ and $\lambda \u0302GP$ are calculated offline and fixed during the execution of BO.

### Comparison of BO Performance.

Figure 6 compares the BO performance under $\lambda \u0302$ (for P2 and P3), $\lambda \u0302GP$, and $\Lambda =I$. In each case, we start with the first two plays from the players and run 180 BO iterations. Similar to the simulation study, results are reported using 20 trials due to the stochastic nature of BO. Due to the small trial number, bootstrap variance estimators are reported as the shades around the average in the figure. $\lambda \u0302$ outperforms the other two settings consistently along the search with statistical significance. The BO performance by mimicking P2 is slightly better than that of P3.

The result shows that BO can be improved noticeably by learning from P2 and P3. However, the players' search is not fully mimicked by IBO, as they improved much faster than the modified BO does, indicating that the proposed model still has room for improvement. Nevertheless, the IBO implementation still achieves the closest performance to the players' among all the BO instances, and it is the only algorithm that achieved better performance than the players' best play within 100 iterations. This result demonstrates the potential of IBO to continue an effective human search after the player quits, with an improved search performance from a standard BO.

For completeness, we also note that in all the cases, the BO identifies the true optimal final drive ratio at the end of the search. We also qualitatively compare the best human solution with one BO solution with high score, along with the theoretically optimal solution in Fig. 7. The result indicates that while these control strategies yield similar scores, they are quantitatively different, although braking toward the end is observed as a common strategy. Human search data are documented in the webpage,^{4} where the best players' solution strategies are published.

## Discussion

The study mentioned earlier provided a starting point for learning optimization algorithms based on human solution-search data. Yet, many pressing questions remain unanswered. This section will address a few notable ones. Some potential answers to these questions will rely on readers' familiarity with inverse reinforcement learning [19,47,48] (IRL, also called apprenticeship learning [49,50] and inverse optimal control [51]). To familiarize readers with this topic, a discussion on the connection between IBO and IRL is provided in Sec. 5.2.

### Limitations and Potential Values of IBO.

From the case study, a strategy learning through IBO outperformed default algorithms, but is yet to reach the performance of the best human solver. This indicates potential room to further improve the algorithm. In the following, we discuss notable limitations of IBO. We shall also note that these also apply to the general problem of designing optimization algorithms through human demonstrations (called DO in what follows).

*Model of human search strategies*: Studies in cognitive science have put forth several core ingredients of human intelligence, including intuitive physics [52–55], problem decomposition skills [42,56,57], ability in learning-to-learn [58], and others [10]. While evidence has shown the connection between BO and human search [18], suitable models for human search strategies can be problem dependent. For example, for low-dimensional design problems, Egan et al. [59] showed that people adopting univariate search are more likely to achieve effective search. This result is supported by earlier psychological studies on how children perform scientific reasoning and thus may be useful to explain how people identify unfamiliar systems. However, univariate search may not reflect how people search for solutions in a familiar context (such as car driving) and with a large number of control and design variables to tune, as is the situation of the ecoRacer game. For such high-dimensional and physics-based design and control problems, a potentially reasonable human search model could be to incorporate human intuitive physics models into the evaluation of the expected improvement. Thus instead of estimating GP parameters, one could estimate a statistical model of the state-space equations of the dynamical system, which influences the expected improvement. At a more abstract level, the fundamental challenge in understanding how a human search strategy should be modeled is the lack of knowledge about the functional form of the local objective (i.e., the Q-function) that governs the generation of new solutions during the search based on the current state (cumulative knowledge learned by the human solver). As we will discuss later in this section, this challenge is also a key topic in IRL. Not surprisingly, one notable solution from IRL to this problem is in fact to use nonparametric models such as GP [19,60].

*Uncertainty in estimation*: A limited amount of demonstrations could be insufficient to provide a good estimation of the BO parameters, even though the underlying parameters are the effective ones. One potential solution to this could be to create a reward mechanism in the crowdsourcing setting, where the reward is determined by both the observed search effectiveness of each human solver and the uncertainty in the estimation of their search strategy. In the context of BO, this uncertainty can be measured by the covariance of the estimator, i.e., the Hessian of the cost function in Eq. (4). For people with effective search yet high estimation uncertainty, we can solicit more solutions from them by offering rewards. It would also be interesting to understand the influence of the properties of the problem, e.g., the size of the solution space, on the convergence of the estimation.

*Knowledge transferability*: The third limitation concerns the transferability of knowledge (search strategies) learned from one task (an optimization problem) to others. This limitation also leads to the question of how “effectiveness” of searches shall be measured, as we are not yet able to tell in what condition a strategy that has high rate of improvement (such as P2 in ecoRacer) will continue to produce better solutions than other strategies in a long term. The same issue, however, exists in IRL: e.g., a control policy learned for pancake flipping does not guarantee optimal egg flipping due to the differences in physical properties between pancakes and eggs. One solution to this in IRL is to allow the policy to adjust to new problem settings, by correcting the state transition model according to the new observations. This solution may also be applied to IBO. In the context of ecoRacer, knowledge such as “starting acceleration at the beginning of the track” could be considered as a universal strategy and requires less exploration, while the actual duration for executing this strategy may differ across problem settings. Therefore, it could be more effective for BO to adjust its parameters based on the ones that are learned from human demonstrations on a similar problem, rather than learning from scratch.

To summarize, IBO could be a valuable tool for machines to mimic human search behavior when (1) the underlying human search mechanism follows BO, (2) the demonstration is sufficient for estimating the true BO parameters with low variances, and (3) the true optimal BO parameters for a long-term search can be estimated based on an effective short-term search.

### The Difference Between Learning to Search and Learning a Solution.

The proposed IBO approach can be considered as a way to design optimization algorithms with human guidance and is mathematically similar to IRL. In order to explain the similarities and differences between the two, we first introduce Markov decision process (MDP) and RL and make an analogy between MDP and an optimization algorithm.

#### Preliminaries on MDP and RL.

An MDP is defined by a tuple $<S,A,T,R,\gamma ,b0>$, where $S$ is a set of states, $A$ is a set of actions, the state transition function $T(s,a,s\u2032)$ determines the probability of changing from state $s$ to $s\u2032$ when action $a$ is taken, $R(s,a)$ is the instantaneous reward of taking action $a$ at state $s$, $\gamma \u2208[0,1)$ is the discount factor of future reward, and $b0(s)$ specifies the probability of starting the process at state $s$. In RL, a control policy *π* is a mapping from a state to an action, i.e., $\pi :S\u2192A$. The long-term *value* of *π* for a starting state $s$ can be calculated by $V\pi (s)=R(s,\pi (s))+\gamma \u2211s\u2032\u2208ST(s,\pi (s),s\u2032)V\pi (s\u2032)$, and thus, the value of *π* over all possible starting states is the expectation $V\pi =\u2211s\u2208Sb0(s)V\pi (s)$. A common way to represent a control policy is to introduce a Q-function $Q(s,a;\lambda )$ with unknown control parameters $\lambda $, and let the policy be $a(s)=argmaxAQ(s,a;\lambda )$. RL identifies the optimal $\lambda $ that maximizes $V\pi $.

#### MDP Versus Optimization Algorithm.

An optimization algorithm defines a decision process: Its instantaneous reward is the improvement in the objective value achieved by each new sample, and the cumulative reward represents the total improvement in the objective within a finite number of iterations; its state contains the current solution (in $X$), the corresponding objective value, and potentially the gradient and higher-order derivatives of the objective function at the current solution; its action is the next solution to evaluate; and its state transition is governed by the optimization algorithm and its parameters. This is similar to MDP where the state transition is affected by the control parameters. The decision process defined by an optimization algorithm, however, is usually non-Markovian, as the new solutions rely on the entire search trajectory. Note that it is still possible to consider the optimization process as an MDP, by redefining the state as the continuously growing search trajectory, i.e., elements in the state set $S$ shall represent all possible search trajectories, rather than samples in $X$.

#### IRL Versus IBO.

RL algorithms identify an optimal control policy for an MDP with a given reward function. However, real-world applications hardly have explicit definitions of rewards, e.g., the reward for “driving a car” cannot be explicitly defined, although people form control policies based on their inherent reward (preference). Therefore, control policy for such applications can be learned more effectively through demonstrations of human beings, which are assumed to be optimal according to the inherent reward of the demonstrator. IRL techniques have thus been developed to identify the reward (and consequently, the Q-function and the optimal control policy) that explains human demonstrations, either by estimating the reward parameters so that the demonstrated policy has a higher value than any other policies by a margin [47,49,61,62] or by finding the maximum likelihood control parameters directly [48,63].

*h*:

where $Zi(\lambda )$ is a partition function for the visited state **s*** _{i}*. One can notice the similarities between Eqs. (6) and (4): (1) Both are maximum likelihood parameter estimations related to an instantaneous cost, i.e., the reward in Eq. (6) and the expected improvement in IBO. (2) Both involves partition functions that are computationally expensive and dependent on the parameters $\lambda $. Due to this dependency, a direct Markov-Chain Monte Carlo sampling in the space of $\lambda $ (e.g., as in Ref. [63]) cannot be applied to optimize the likelihood function since the partition values for two different samples of $\lambda $ do not cancel. Ziebart et al. discussed on alternative approach to address this computational challenge, by using the “expected edge frequency calculation” algorithm that has a complexity of $O(N|S||A|)$ for each gradient calculation of the objective in Eq. (6), where

*N*is a large number [48]. However, this approach can be infeasible for the IBO estimation problem in Eq. (4) since (1) the space $X$ is usually continuous and (2) even with a discretization of $X$, the enormous size of $S$ and $A$ can easily make the calculation intractable, based on the discussion in Sec. 5.2.2.

Further, one shall notice that IRL and IBO use different assumptions about human demonstrations: Demonstrations in IRL are assumed to be *near-optimal*. Thus, learning from them leads to an optimal control policy for an MDP. Demonstrations in IBO, on the other hand, are assumed to be from an effect search strategy, yet are not necessarily optimal. Thus, learning from them leads to an optimization algorithm, rather than a solution. This difference affects the application of the two: IRL can be used when the machine is told to mimic existing solutions, by understanding why these solutions are considered good, e.g., it answers the question “why do people flip pancakes this way?”; IBO can be used when the machine is meant to mimic the process of searching for good solutions, by understanding how to evaluate the expected improvement of solutions, e.g., it answers the question “how did people figure out this way of pancake flipping?”

## Conclusions

In this paper, we attempted to address a dilemma in design crowdsourcing: While human beings acquire more advanced intelligence than machines in solving certain types of optimal design problems, soliciting valuable solutions through existing crowdsourcing mechanisms is not cost-effective due to the lack of control over crowd participation and the problem-specific qualification of the crowd. Based on the previous finding that more people acquire good searching strategies than good solutions, we proposed in this paper to mimic human search demonstrations by inversely learning a Bayesian optimization algorithm, so that long-term search can be executed more effectively by the computer even when human solvers abandon the problem. Through simulation and case studies, we showed improved performance of BO when it is equipped with parameters learned through an effective human search. However, the significant performance gap between a human demonstrator and the proposed algorithm in the case study suggested room for improvement of the algorithm. Future investigation will focus on closing this gap by exploring more suitable cognitive models of human solution searching for specific types of optimal design problems.

## Funding Data

National Science Foundation (Grant No. CMMI-1266184).

### Appendix

##### Derivation of the Partition Function (ẐBO).

*q*(

*x*) be a uniform and a normal density function, respectively,

*D*be the size of $X$, and

*f*(

*x*) be the function to be integrated. Also, let $I$ and $J$ be the sample sets drawn from these two distributions, with sizes $I:=|I|$ and $J:=|J|$. We have

##### IBO Behavior Under Near-Random Search

###### Properties of l and l̃.

From Sec. 3.1, the unbiased estimation of $l(x,\alpha BO)$ through importance sampling is

$l\u0302(x,\alpha BO)$ has the following properties.

*Property 1*. $\alpha BO=0$ leads to $l\u0302(x,0)=0$, indicating that $x$ is uniformly sampled. One can see that the optimal cost of *L*_{BO} is nonpositive, as one can always achieve *L*_{BO} = 0 by considering samples to be uniformly drawn.

*Property 2*. When the expected improvement function is constant almost everywhere, i.e., $Pr(QEI(x)=C)=1$, we have $Pr(l\u0302(x,\alpha BO)=0)=1$. This is because a uniformly drawn initial guess will almost surely satisfy the optimality condition for maximizing a constant function.

*Property 3*. Note that $1+Dq(xi)\u22481$ for $xi\u2208U$ due to the small

*σ*(see Sec. 3.1) and $(exp(\alpha BOQEI(xi)))/(1+Dq(xi))\u22480$ for large

_{I}*D*and small

*α*

_{BO}. The partial derivative of $l\u0302(x,0)$ with respect to

*α*

_{BO}can be approximated as

where $c(\alpha BO)>0$ and $\Delta ai:=QEI(xi)\u2212QEI(x)$. Here, we need to introduce a conjecture: Let $Q\xafEI:=\u222bXQEI(x)dx/D$ be the average expected improvement, and $A:=\u222bX\u22ae(QEI(x)>Q\xafEI)dx$ be the measure of a subspace where the sampled expected improvement value is higher than $Q\xafEI$. *A* decreases from above to below $D/2$ along the increase of the BO sample size. In other words, a uniformly drawn sample has more than 50% of chance to have an expected improvement value higher than $Q\xafEI$ at the early stage of BO and less than 50% at the late stage.

One evidence of the conjecture is illustrated in Fig. 2: In the first iteration, $Q\xafEI$ is slightly lower than 0.5 while the majority of $X$ has $QEI>Q\xafEI$; in the fourth iteration, however, only a small region around the peak has $QEI>Q\xafEI$. Using this conjecture, we can show that $\u2211U\Delta ai<0$ when the sample size is small, thus $((\u2202l\u0302(x,0))/(\u2202\alpha BO))<0$. Together with Property 1, we have $l\u0302(x,\alpha BO)<0$ for a small *α*_{BO} and a small sample size.

*Property 4*. We notice that in this experiment, the discrepancy between LHS and the modeled max–min sampling scheme leads to overall high (positive) $l\u0303$ values, indicating that the samples are not likely follow this scheme. This is consistent with the fact that LHS is not exactly the same as max–min sampling, at least until all of *h*_{0} have been considered. We also see that negative $l\u0303$ values can be observed when *α*_{INI} is low, suggesting that the LHS samples can be better explained by a loosely executed max–min sampling scheme than a strict one.

###### Discussion on Findings From Fig. 3.

We now summarize a complete list of findings based on these properties. *Finding 1*. A comparison between $\alpha INI=1.0$ and 10.0 leads to a finding consistent with Property 4. Since the samples are not likely to be drawn from a strictly executed max–min sampling scheme, the entire search trajectory is considered to be created from BO in the case of $\alpha INI=10.0$. While the early samples (less than 10) can be considered as from max–min sampling when $\alpha INI=1.0$ ($l\u0303<0$), the low magnitude of $l\u0303$ causes this difference to be only visible in the case of $\Lambda =10.0I$, where the magnitude of *l* is also low.

*Finding 2.* IBO correctly identifies the true $\Lambda $ within a few iterations after the initial exploration, except for the case of $\Lambda =10.0I$. To explain this exception, we first note that $\Lambda =10.0I$ leads to an expected improvement function that is constant almost everywhere (except for the sampled locations where *Q*_{EI} = 0), and thus, BO reduces to uniform sampling. From Property 2, *L*_{BO} = 0 almost surely when we have the correct guess on $\Lambda $. Also, recall from Property 4 that *L*_{INI} > 0 when *α*_{INI} is high. The above two together explain why with the correct guess of $\Lambda =10.0I$, we have *L* close to zero when $\alpha INI=10.0$ and slightly negative when $\alpha INI=1.0$.^{5}

To explain the negative *L* values for the incorrect guesses of $\Lambda $, we use Property 3 to show that when the sample size is small and the expected improvement function is not flat, *L*_{BO} < 0 for a small *α*_{BO}, and thus *L* < 0. To summarize, Finding 2 suggests that for a search trajectory with a limited length that resembles a random search, the proposed IBO approach will consider it being derived from a BO that loosely solves Eq. (4). However, this caveat is of little practical concern, since (1) a random search rarely outperforms BO with nontrivial settings and (2) a BO with low *α*_{BO} (instead of high $\Lambda $) can equally simulate a random search.

For completeness, we used 1000 principal component analysis as preprocessing to obtain the most likely number of ICA components under three suitable criteria: minimum description length, Akaike information criterion, and Kullback information criterion, as 187, 464, and 373, respectively, using the method from Ref. [45]. While these dimensionalities could make sense from a neurological perspective (e.g., given that the game takes 36 s, a decision interval of $36\u2009s/187=192\u2009ms$ is close to the range for the time-frame of attentional blink, which is 200–500 ms [46]), the resultant high-dimensional solution spaces are unfavorable for BO.

But why does the guess of $\Lambda =10.0I$ lead to significantly decreasing *L* in the other three cases? This is because in those cases, BO does not resemble random sampling, i.e., the sequences of samples are more clustered. When a new sample is among this cluster, its similarities to existing ones are nonzero even when a large $\Lambda $ is assumed, due to the small Euclidean distance among the pairs. And in turn, the expected improvement function has peaks within the clusters and remains constant far away from them, rather than being a constant almost everywhere. As a result, the optimal value of $l\u0302(x,\alpha BO)$ with respect to *α*_{BO} becomes negative, even when $\Lambda $ is incorrectly guessed as $10.0I$.