Researchers have used the calculation of phase diagram (CALPHAD) method to solve the forward phase stability problem of mapping from specific thermodynamic conditions (material composition, temperature, pressure, etc.) to the associated phase constitution. Recently, optimization has been used to solve the inverse problem: mapping specific phase constitutions to the thermodynamic conditions that give rise to them. These pointwise results, however, are of limited value since they do not provide information about the forces driving the point to equilibrium. In this paper, we investigate the problem of mapping a desirable region in the phase constitution space to corresponding regions in the space of thermodynamic conditions. We term this problem the generalized inverse phase stability problem (GIPSP) and model it as a continuous constraint satisfaction problem (CCSP). We propose a new CCSP algorithm tailored for the GIPSP and compare its performance against a related algorithm on the Fe–Ti binary alloy system using ThermoCalc with the TCFE7 database. The proposed algorithm generates high-precision approximations of the solution set with high recall and a low misclassification rate.

## Introduction

The designers of novel materials require an understanding of phase stability in order to assess the feasibility of a material and how it changes during processing. The calculation of phase diagram (CALPHAD) method has enabled researchers to develop databases that contain pertinent thermodynamic information on specific alloys and associated phases [1]. In this method, the thermodynamics of phases are described through mathematical models fitted to experimental data.

Many CALPHAD software packages developed to date are well-suited for solving the forward phase stability problem: calculating the phase stability state of a multicomponent, multiphase system given a set of thermodynamic conditions (processing conditions such as composition, temperature, and pressure). However, because design is inherently an inverse process, we are interested instead in the inverse problem: determining the thermodynamic conditions that give rise to a desired phase stability state.

where the set $\Omega \subset \mathbb{R}^N$ is the feasible region in the decision space (thermodynamic conditions), and $f$ is a function (e.g., a CALPHAD model) from $\mathbb{R}^N$ to $\mathbb{R}$ that maps thermodynamic conditions to the thermodynamic degree-of-freedom of interest. The optimization problem in Eq. (1) can be solved using conventional techniques. In Ref. [4], mesh-adaptive direct search (MADS) [9] algorithms are used to find extrema in phase stability. More recently, the Arroyave group has used genetic algorithms (GAs) to solve similar problems [6].

Although the location of extremal points, or any isolated point(s) in the phase constitution space, is sometimes of interest to materials scientists [10,11], more generalized knowledge is desirable for the materials discovery and design process. Specifically, we are interested in identifying the region or set in the design space (the space of thermodynamic conditions) that gives rise to a desirable range of phase constitutions. We refer to this problem as the generalized inverse phase stability problem (GIPSP).

For example, a designer may want to know what thermodynamic conditions (a set of designs) result in the presence of *σ* phase in steels, since any amount of *σ* phase is undesirable due to its embrittling effects (as little as 3 wt.% reduces impact toughness by more than half [12]) and its drastic deterioration of corrosion resistance. A compact representation of this set would be useful to materials designers as a constraint on the search space that is easy to evaluate. In cases when uncertainty in the appropriate CALPHAD model is high, materials designers may be interested in finding a set of potentially desirable solutions that they can refine and prune through experimentation. In general, materials designers are interested in identifying the set of all the thermodynamic conditions that could produce an arbitrary set of phase constitutions. In the language of engineering design, materials designers would like to apply a set-based approach [13] to the materials discovery process.

In the context of set-based design, Shahan and Seepersad suggested the use of Bayesian network classifiers to model the satisfactory region [14]. In Ref. [15], the technique is extended to multilevel design problems (multilevel models are those that are hierarchically nested) and applied to materials design. An extension developed by Rosen [16] enables the incorporation of process knowledge. However, the focus of these works is on the design methodology concerning multiscale design rather than on an efficient algorithmic approach for approximating the satisfactory set, which is the aim of this paper.

We model the GIPSP as a continuous constraint satisfaction problem (CCSP) [17], a type of constraint satisfaction problem (CSP) where all the variable domains are continuous real intervals. The GIPSP is to map specific regions in multidimensional phase constitution spaces to ranges of values in the thermodynamic conditions space. The solution set to this problem consists of ranges of values (enclosures) rather than discrete points. The GIPSP is particularly challenging because the search space is highly nonlinear and discontinuous, the CALPHAD model is a black box since its details are not available, and the problems of interest may be high-dimensional (*n* > 10).

Typical algorithmic approaches for CCSPs (Fig. 1), such as those based on interval arithmetic [18], require accessible analytical problem formulations. However, the GIPSP involves a nonanalytical CALPHAD model. Methods such as design of experiments (DOE) [19] or the inductive design exploration method (IDEM) [20,21] are not tractable for high-dimensional problems since they sample the search space. Another approach is to use adaptive sampling schemes to replace the expensive-to-evaluate constraint with a surrogate model [22–25]. However, these approaches have difficulty representing discontinuities in the search space and suffer from the curse of dimensionality [26]. To address this limitation, Basudhar and Missoum developed the explicit design space decomposition (EDSD) method. The EDSD method relies on a support vector machine (SVM) classifier to model the design space constraint boundary, rather than approximate the constraint function. The EDSD method has been shown to accurately model the constraint boundary on several test problems within only a few iterations.

However, as we will show, the SVM technique used in EDSD has difficulty representing the solution to a typical GIPSP. In a GIPSP, it is undesirable to densely sample the unsatisfactory region since it comprises the vast majority of the space. In the case where the samples are imbalanced (many more from one class than another), the SVM will be unreliable in the undersampled region. In such cases, the support vector domain description (SVDD) technique has been shown to be more appropriate [27]. In this paper, we present an algorithm that instead relies on the SVDD classifier, which leads to improvements in both computation time and solution quality for the GIPSP.

## Generalized Inverse Phase Stability Problem as a Continuous Constraint Satisfaction Problem

A CCSP is the triple $P = (X, D, C)$, where $X = (x_1, x_2, \ldots, x_n)$ is an *n*-tuple of variables defined in the thermodynamic conditions space, and $D = I_1 \times I_2 \times \cdots \times I_n$ is the Cartesian product of the corresponding domains, where $I_i$ is a real interval for $i = 1, 2, \ldots, n$. The set of constraints that must be satisfied is denoted as $C = (C_1, C_2, \ldots, C_t)$, which is a *t*-tuple of constraints. Each constraint $C_j$ is a pair $(R_j, S_j)$, where $R_j$ is a relation on the variables in $S_j = \mathrm{scope}(C_j)$. The relation $R_j$ defines the consistent value combinations. Specifically, in the context of the GIPSP, $R_j$ is an inequality or equality in terms of the function mapping between thermodynamic conditions and phase constitutions (i.e., constraints defined on the CALPHAD model). The set $S_j \subset X$ is an unordered *k*-tuple of distinct variables, where *k* is the arity of $R_j$. In other words, $S_j$ is simply the tuple of variables that participate in the constraint $R_j$.

A solution to the CCSP is an *n*-tuple $A = (a_1, a_2, \ldots, a_n)$, where $A \in D$ and each $C_j$ is satisfied, in that $R_j$ holds on the projection of $A$ onto the scope $S_j$. The problem is to find the set of all the solutions to the problem, denoted $\mathrm{sol}(P)$. In the GIPSP, the user defines each constraint $C_j$ in the phase constitution space. For example, one may be interested in the thermodynamic conditions $X = (x_1, x_2)$ that produce materials consisting of between 40 wt.% and 60 wt.% of a specific phase. This constraint is expressed as $C \equiv 0.4 \le f(X) \le 0.6$, where $f(\cdot)$ maps thermodynamic conditions to the phase composition of interest. The CCSP is expressed as

$$P = (X, D, C), \quad X = (x_1, x_2), \quad D = [x_1^{lb}, x_1^{ub}] \times [x_2^{lb}, x_2^{ub}], \quad C \equiv 0.4 \le f(X) \le 0.6$$

where the superscripts lb and ub denote the variable lower and upper bounds, respectively.

Since the mapping from the thermodynamic conditions space to the phase constitution space is highly complex, discontinuous, and nonanalytical, satisfying the user-defined constraints $C$ is nontrivial. In the motivating problem, we are interested in the set of all the solutions, $\mathrm{sol}(P)$, to the CCSP where all the variable domains are continuous real intervals and all the numeric relations are equalities and/or inequalities. The constraints are, in general, highly nonlinear.
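As a minimal sketch of how a constraint such as $0.4 \le f(X) \le 0.6$ turns a black-box model into a satisfactory/unsatisfactory label, the Python snippet below uses a made-up stand-in, `toy_phase_fraction`, in place of a CALPHAD evaluation. The function, its formula, and the domain bounds are assumptions for illustration only and do not come from any thermodynamic database.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]


def toy_phase_fraction(x: Sequence[float]) -> float:
    """Hypothetical stand-in for a black-box CALPHAD evaluation f(X)."""
    composition, temperature = x
    return max(0.0, min(1.0, composition * (1.0 - temperature / 2000.0)))


def class_label(
    x: Sequence[float],
    domain: List[Interval],
    f: Callable[[Sequence[float]], float] = toy_phase_fraction,
    lower: float = 0.4,
    upper: float = 0.6,
) -> int:
    """Return +1 if x lies in D and satisfies lower <= f(x) <= upper, else -1."""
    in_domain = all(lb <= xi <= ub for xi, (lb, ub) in zip(x, domain))
    return 1 if in_domain and lower <= f(x) <= upper else -1


# Assumed [lb, ub] intervals for x1 (composition) and x2 (temperature).
domain = [(0.0, 1.0), (300.0, 1800.0)]
print(class_label((0.8, 800.0), domain))  # f is about 0.48, inside the band
print(class_label((0.2, 800.0), domain))  # f is about 0.12, below the band
```

The only information the search ever receives from the model is this label, which is why methods that need an analytical expression for the constraint do not apply.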

## Related Work

### Existing CSP Algorithms.

Classical methods for solving CSPs such as backtracking [29], iterative broadening search [30], and limited discrepancy search [31] are intended for problems with discrete domains and are not efficient for solving CCSPs. To apply these to a CCSP, one must discretize the variable space to an enumerable set [32]. Such an approach would be intractable for most GIPSPs because of the high dimensionality (often *n* > 10) [33] of many materials design scenarios. Most techniques developed specifically for solving CCSPs are based on interval arithmetic, branch and bound, or the root inclusion test. In one of the first examples of set-based design, Ward proposed the use of interval arithmetic to search for the feasible set of designs efficiently [18]. Subsequently, this work was extended to more general set-based representations [34]. Hu et al. proposed a method that uses generalized intervals to solve for the feasible set [35,36]. The NUMERICA [37] modeling language in particular guarantees correctness, completeness, and certainty. Devanathan and Ramani presented a polytope-based method, where the constraints are transformed into ternary constraints [38]. Then, the so-called consistency method is used to prune the search space. These techniques require an analytical or closed-form expression to determine whether a subsection of the search space contains a feasible solution [28,39,40]. Since the CALPHAD model is a “black box” in the sense that the details are not accessible, methods that rely on interval arithmetic or constraint decomposition cannot be used for the GIPSP.

### Design of Experiments.

Identifying the feasible set (solving a CSP) is a key challenge in many set-based design methods. Design of experiments (DOE) can be used to test the constraint model at a finite set of points in the design domain. A limitation is that the points will not lie on the boundary of the constraints. The inductive design exploration method (IDEM) addresses this limitation [20,21]. The IDEM consists of discrete point evaluations (discrete sampling) of the design process, performing inductive, top-down feasible space exploration based on metamodels. The aim is to obtain a robust solution with respect to model uncertainty. To approximate the constraint boundary, the design space is sampled using a typical DOE. The true constraint boundary will lie between the satisfactory and unsatisfactory discrete points. A root-finding technique is used to find the location of the boundary between these points. The constraint is then represented as a set of boundary points and points inside the feasible space. Although this method generates points on the boundary of the constraint, it does little to reduce the computational burden of the initial DOE, which suffers from the curse of dimensionality.

### Constraint Modeling.

In reliability-based design optimization (RBDO), a design constraint is commonly expressed through a limit state function $g(x)$, where the product fails if $g(x) < 0$. The reliability of the product is defined as

$$R = \int_{g(x) \ge 0} f_X(x)\,dx \tag{2}$$

where $f_X(x)$ is the joint probability density function of all the random parameters. This integral may be approximated using Monte Carlo simulation (MCS), but this approach can be computationally expensive when a large number of samples are required or the constraints are expensive to evaluate. A principal focus in RBDO is the efficient approximation of the solutions to the constraint $g(x)$. An approach common in the literature is to replace the expensive-to-evaluate constraint $g(x)$ with an inexpensive surrogate model generated from sampled data. The most straightforward methods sample the space globally, which becomes prohibitively expensive at high dimensionality. An approach for reducing the number of samples required to construct the surrogate model is adaptive sampling. Adaptive sampling approaches incrementally improve the surrogate model by sampling points that most effectively improve the model [22–24]. These approaches attempt to efficiently “train” a surrogate model over the entire search space. However, to approximate Eq. (2), it is only necessary to know where the constraint is satisfied. This insight led to the efficient global reliability analysis (EGRA) method. In EGRA, samples are generated by maximizing the expected feasibility function, which provides an indication of how well the true value of the response is expected to satisfy the constraint [25]. Several other criteria can be found in the literature [41–43].
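To make the cost argument concrete, the sketch below estimates the reliability, i.e., the probability that $g(x) \ge 0$, by plain Monte Carlo sampling. The one-dimensional limit state `limit_state` and the standard normal input distribution are hypothetical choices for illustration; every sample costs one evaluation of $g$, which is what becomes prohibitive when $g$ is an expensive model.

```python
import random
from typing import Callable


def limit_state(x: float) -> float:
    """Hypothetical limit state function: the product fails when g(x) < 0."""
    return 1.0 - x


def mc_reliability(g: Callable[[float], float], n_samples: int, seed: int = 0) -> float:
    """Estimate P[g(x) >= 0] for x ~ N(0, 1) by plain Monte Carlo sampling."""
    rng = random.Random(seed)
    safe = sum(1 for _ in range(n_samples) if g(rng.gauss(0.0, 1.0)) >= 0.0)
    return safe / n_samples


estimate = mc_reliability(limit_state, 100_000)
# For this toy case the exact value is the standard normal CDF at 1, about 0.841.
print(f"estimated reliability: {estimate:.3f}")
```

Halving the statistical error of such an estimate requires roughly four times as many evaluations of $g$, which motivates the surrogate and classifier approaches discussed in this section.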

As argued by Basudhar and Missoum, the principal limitations of surrogate-based approaches are response discontinuities and the curse of dimensionality [26]. The discontinuities in the search space are problematic for gradient-based techniques. To address this limitation, they proposed a method referred to as explicit design space decomposition (EDSD), which does not approximate the response of the limit state function but instead constructs an explicit constraint boundary around the design variables [44]. Thus, the EDSD method is unaffected by discontinuities and other irregularities in the search space. The EDSD method relies on the use of support vector machines to construct an explicit boundary of the constraint in the design space. Support vector machines (SVMs) are a machine-learning technique for supervised learning associated with *two-class* classification. That is, given a set of training examples, each example is marked as belonging to one of the two classes. The SVM uses the training examples to build a model that assigns new (unobserved) examples to one of the categories. The SVM is initially trained on a sample of points. Then, an adaptive strategy is used to generate points that likely lie on the constraint boundary.

The EDSD method has been shown to approximate the limit state function successfully for a number of test problems [45]. However, because the EDSD method relies on the SVM method, its performance deteriorates when an SVM is not an appropriate representation of the boundary. The SVM treats evaluated samples as a *two-class* data set: feasible and infeasible observations. For good performance, the SVM technique requires both classes to be well sampled [46,47].

This limitation is significant in the GIPSP, where, intuitively, few process parameter combinations will yield a desirable material. Thus, the solution to a GIPSP is typically small relative to the search space. In this scenario, it is undesirable to evenly sample both satisfactory and unsatisfactory classes, especially for high-dimensional problems where a balanced sampling of the entire search space may be computationally expensive. If, to reduce computational expense, we undersample the unsatisfactory space, the resulting SVM will be unreliable in the undersampled unsatisfactory space and will tend to have high rates of false positives. In cases with imbalanced sampling, the support vector domain description (SVDD) technique has been shown to be more appropriate [27]. Whereas the SVM scheme is a two-class classifier, the SVDD scheme is a one-class classifier intended to address the case where the training data are mainly from a single category.

## Support Vector Domain Description

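Let $\mathbf{x}_i$ denote a vector in the design variable space $X$, the input space. Given *n* data points, $X = \{\mathbf{x}_i \mid i = 1, 2, \ldots, n\}$, the minimum radius hypersphere containing every data point, with centroid $\mathbf{a}$ and radius $r$, is found by solving

$$\min_{r, \mathbf{a}, \xi} \; r^2 + c \sum_i \xi_i$$

subject to

$$\lVert \mathbf{x}_i - \mathbf{a} \rVert^2 \le r^2 + \xi_i, \quad \xi_i \ge 0 \quad \forall i \tag{5}$$

where $\xi_i$ are the slack variables that allow for the possibility of outliers in the training set. The parameter $c$ defines how to trade off between hypersphere volume and errors. Any point at a distance equal to or less than $r$ from the hypersphere center is inside of the domain description. However, because a hypersphere is typically a poor representation of the domain, a kernel function is used to nonlinearly remap the training data into a higher-dimensional feature space where a hypersphere is a good model. Through the so-called kernel trick, the data points are mapped to the feature space without computing the mapping explicitly. The result is an implicit mapping to a feature space of unknown, possibly infinite, dimensionality. There are several valid kernel functions common in the literature [51]. The proposed algorithm uses the Gaussian kernel function

$$K_G(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-q \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right) \tag{6}$$

where the $q$ parameter determines how “tightly” or “loosely” the domain description is fit around the training data. The constraint in Eq. (5) then becomes

$$\lVert \phi(\mathbf{x}_i) - \mathbf{b} \rVert^2 \le r^2 + \xi_i \quad \forall i \tag{7}$$

where $\phi(\cdot)$ denotes the implicit feature space mapping and $\mathbf{b}$ is the centroid of the feature space hypersphere. Rewriting in terms of the kernel function, the Wolfe dual problem can be developed from Eq. (7) as

$$\max_{\beta} \; \sum_i \beta_i K_G(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j} \beta_i \beta_j K_G(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad 0 \le \beta_i \le c, \quad \sum_i \beta_i = 1 \tag{8}$$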

For a detailed description of the method for formulating the Wolfe dual problem, see Ref. [52]. For each data point $\mathbf{x}_i$, $i = 1, 2, \ldots, n$, there are three possible classifications:

- (1) It is inside the hypersphere, which is indicated by $\beta_i = 0$.
- (2) It is on the boundary of the hypersphere, which is indicated by $0 < \beta_i < c$.
- (3) It is an outlier outside of the hypersphere, which is indicated by $\beta_i = c$.
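The three cases above translate directly into a lookup on the dual coefficients. The helper below is an illustrative sketch: the function name and the numerical tolerance `tol` are assumptions, introduced because a numerical solver returns coefficients that only approximately equal 0 or $c$.

```python
from typing import List


def classify_training_point(beta_i: float, c: float, tol: float = 1e-8) -> str:
    """Map a dual coefficient beta_i to 'inside', 'boundary', or 'outlier'."""
    if beta_i <= tol:
        return "inside"      # beta_i = 0
    if beta_i >= c - tol:
        return "outlier"     # beta_i = c
    return "boundary"        # 0 < beta_i < c, i.e., a support vector


# Illustrative coefficients that sum to 1, with c = 0.5 (assumed values).
betas: List[float] = [0.0, 0.3, 0.5, 0.2]
labels = [classify_training_point(b, c=0.5) for b in betas]
print(labels)  # ['inside', 'boundary', 'outlier', 'boundary']
```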

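Note that $c \ge 1$ yields no outliers, since $\sum_i \beta_i = 1$ and $0 < \beta_i < c$ for some $i$, and therefore $\beta_i \ne c \;\, \forall i$. The squared distance of the feature space image of a point, $\mathbf{z}$, to the centroid of the hypersphere is

$$r^2(\mathbf{z}) = K_G(\mathbf{z}, \mathbf{z}) - 2 \sum_i \beta_i K_G(\mathbf{x}_i, \mathbf{z}) + \sum_{i,j} \beta_i \beta_j K_G(\mathbf{x}_i, \mathbf{x}_j) \tag{9}$$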

A new test point, **z**, is inside the domain description if the distance from the feature space image of the test point to the hypersphere centroid is less than or equal to the radius of the hypersphere. The expression for classification, Eq. (9), is a simple algebraic expression that is fast to evaluate. In fact, for the Gaussian kernel function, the first term is equal to 1, and the last term can be precomputed since it is independent of **z**.
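The classification rule can be sketched in a few lines of Python using the standard kernel form of the squared distance, $r^2(\mathbf{z}) = K_G(\mathbf{z},\mathbf{z}) - 2\sum_i \beta_i K_G(\mathbf{x}_i,\mathbf{z}) + \sum_{i,j}\beta_i\beta_j K_G(\mathbf{x}_i,\mathbf{x}_j)$, with $K_G(\mathbf{u},\mathbf{v}) = \exp(-q\lVert\mathbf{u}-\mathbf{v}\rVert^2)$. The class name `SvddDescription`, the two training points, the coefficients, and the value of $q$ are assumptions chosen for illustration; the coefficients are supplied by hand rather than obtained by solving the dual problem.

```python
import math
from typing import List, Sequence


def gaussian_kernel(u: Sequence[float], v: Sequence[float], q: float) -> float:
    """Gaussian kernel exp(-q * ||u - v||^2)."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(u, v)))


class SvddDescription:
    """Evaluates r^2(z) for fixed training points and dual coefficients."""

    def __init__(self, points: List[Sequence[float]], betas: List[float], q: float):
        self.points = list(points)
        self.betas = list(betas)
        self.q = q
        # Last term of r^2(z): independent of z, so it is computed once.
        self.const = sum(
            bi * bj * gaussian_kernel(xi, xj, q)
            for bi, xi in zip(self.betas, self.points)
            for bj, xj in zip(self.betas, self.points)
        )

    def squared_distance(self, z: Sequence[float]) -> float:
        cross = sum(
            b * gaussian_kernel(x, z, self.q)
            for b, x in zip(self.betas, self.points)
        )
        # The first term K(z, z) equals 1 for the Gaussian kernel.
        return 1.0 - 2.0 * cross + self.const

    def contains(self, z: Sequence[float], r_squared: float) -> bool:
        return self.squared_distance(z) <= r_squared


model = SvddDescription(points=[(-1.0, 0.0), (1.0, 0.0)], betas=[0.5, 0.5], q=0.1)
# Both points have 0 < beta < c here, so either one gives the squared radius.
r_squared = model.squared_distance(model.points[0])
print(model.contains((0.0, 0.0), r_squared))  # midpoint between the two points
print(model.contains((5.0, 0.0), r_squared))  # far from both points
```

Each query costs one kernel evaluation per training point with a nonzero coefficient, so classifying a new point is far cheaper than evaluating the underlying model.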

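When negative examples (outlier-data) are available, they can be incorporated into the description. The target-data are enumerated by indices $i, j$ and the outlier-data by $l, m$. Further, the target-data are labeled $y_i = 1$ and the outlier-data are labeled $y_l = -1$. The search problem becomes

$$\max_{\beta'} \; \sum_i \beta'_i K_G(\mathbf{x}_i, \mathbf{x}_i) - \sum_{i,j} \beta'_i \beta'_j K_G(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad \sum_i \beta'_i = 1$$

where $\beta'_i = y_i \beta_i$ (the index $i$ again enumerates both target and outlier-data). See Ref. [46] for a detailed exposition of the SVDD method with negative examples.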

To prevent variables with large magnitudes from being weighted more heavily than those with small magnitudes, the training data are normalized (all data are scaled to a −1 to 1 range), which improves the SVDD model. An important benefit of the SVDD method is that it can be constructed incrementally and decrementally [53]. This allows for a relatively inexpensive update procedure to be used when members are added to or removed from the SVDD. Figure 2 is an illustration of the SVDD method on a two-dimensional data set in the thermodynamic conditions space.
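A minimal sketch of the scaling step follows. It assumes that each variable is scaled linearly using the minimum and maximum observed in the training data; scaling by the bounds of the search domain would be an equally plausible reading. The function name and sample values are illustrative.

```python
from typing import List, Sequence


def scale_to_unit_box(samples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each variable (column) linearly so its min maps to -1 and its max to 1."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in samples:
        scaled.append([
            # A constant column carries no information; map it to 0.
            0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Illustrative samples: (mass fraction, temperature in K) on very different scales.
samples = [[0.10, 800.0], [0.30, 1200.0], [0.50, 1600.0]]
for row in scale_to_unit_box(samples):
    print(row)
```

Without this step, the Euclidean distance inside the Gaussian kernel would be dominated by the temperature axis in this example.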

To underscore the difference in performance between the SVM and SVDD classifiers, we develop two test problems where one class (the satisfactory region) is (*a*) large and (*b*) small relative to the domain (see Fig. 3). In each case, we created a training data set with 50 satisfactory and 50 unsatisfactory random examples. In example (*a*), where the regions are proportional, the samples are relatively balanced, while in (*b*), they are imbalanced (the unsatisfactory region is undersampled).

We use both examples to train an SVM and an SVDD classification model. We used the SVM algorithm in MATLAB's Statistics and Machine Learning Toolbox [54] and the SVDD model in DD_Tools [55], developed in the Pattern Recognition Laboratory at Delft University of Technology, Delft, Netherlands. The results are illustrated in Fig. 3. In scenario (*a*), the SVM model outperforms the SVDD model. This is expected since the SVDD tends to generate a conservative model; recall that the SVDD finds the minimum radius hypersphere around the target-data. In contrast, in scenario (*b*), where the data are not balanced, the SVM classifier results in a significant overestimation of the satisfactory region. In the latter case, the more conservative SVDD scheme has better performance. This basic insight motivates an adaptive sampling scheme based on SVDD rather than SVM for the GIPSP.

## Proposed Algorithm

Our aim is to develop an adaptive sampling scheme based on the SVDD technique for approximating the constraint boundary. The basic idea of the proposed algorithm is that the true boundary of the satisfactory region is approximately parallel to our current best guess. In the proposed algorithm, we search along the direction perpendicular to the SVDD boundary (our current best guess) for a point on the *true* boundary of the satisfactory region using a root-finding method. Figure 4 is an illustration of this idea. An initial point is selected along the SVDD boundary along with an initial step size. The initial step size is selected using information from the SVDD. If the end point is outside of the satisfactory region, a root-finding method is used to find a point on the boundary. The proposed method is described in detail in Algorithm 1. We assume that the designer has available a small number of designs that satisfy the specified phase state properties, i.e., the constraints. The initial samples may be obtained from prior equilibrium experimental data or found using conventional optimization techniques. In the case studies presented in this paper, we use random sampling to generate *n* data points $X = \{\mathbf{x}_i \mid i = 1, 2, \ldots, n\}$ in the design variable space. The randomly generated samples are assigned labels $Y = \{y_i \mid i = 1, 2, \ldots, n\}$, such that $y_i = 1$ or $y_i = -1$ according to whether or not they satisfy the constraints, respectively. The set of indices $I_b$ is initialized to the empty set.

The SVDD parameters $q$ and $c$ are selected such that the so-called $F_1$ measure is minimized; see Ref. [56] for details. Ultimately, our goal is to search along the direction perpendicular to the boundary of $M$. First, we must select a suitable starting point for this search. An intuitive initial point is some support vector (SV), since the SVs are those samples that lie on the boundary of the domain description. However, in practice, there tend to be few SVs relative to the size of the training data. In only a few iterations, all the SVs of the SVDD model may lie on the true boundary of the satisfactory region, yet our approximation may still be poor. Instead, the algorithm selects the point on the boundary of $M$ that is farthest from the points already found on the true boundary. Let the indices corresponding to the samples that lie on the boundary of the true solutions be $I_b$. The initial sample $\mathbf{x}_a$ is found by solving Eq. (12), where $t$ is a dummy variable, $X = \{\mathbf{x}_i, i = 1, \ldots, n\}$ are the training data, and $r = \sum_i \beta'_i K_G(\mathbf{x}_i, \mathbf{x}_k)$ for any $k \in SV \subset \{1, \ldots, n\}$, the set of support vectors. To prevent the case where the algorithm attempts to “reuse” a previous initial point, $I_b$ should also include the indices corresponding to previous initial sample points. Because the Gaussian kernel is an inexpensive function, Eq. (12) is fast to evaluate. It is important to note that the initial point $\mathbf{x}_a$ is not guaranteed to be satisfactory. This can occur when the SVDD modeling parameter $q$ (see Sec. 4) is set too “loose.” As a result, it is necessary to evaluate the initial point $\mathbf{x}_a$ against the constraints $C$ to determine its label, denoted as $y_a$.

It is desirable to choose a step size $\gamma$ large enough to cross the boundary of the satisfactory region. If the label $y_a = 1$, we should step outside of $M$ to find additional satisfactory points, “growing” the model. On the other hand, $y_a = -1$ indicates that $M$ is optimistic, and we should step inside of $M$ to find additional unsatisfactory points, “tightening” the model. Choosing an appropriate initial step size has a significant impact on algorithmic performance. One consideration is that the initial step should not take us too far from the current SVDD, $M$. To address this, we limit the step size to $\gamma \le \min_{\gamma} \{r^2(\mathbf{x}_a - \gamma d)\}/2$, which limits the step size according to the size and shape of $M$ along the direction of $d$. Another consideration is that during search, we should expect disjoint “clusters” in $M$. In this situation, we should take a step size that is at the feature space midpoint of the clusters. If disjoint clusters exist along the direction $d$, their feature space midpoint is at $\gamma = \max_{\gamma} \{r^2(\mathbf{x}_a + \gamma d)\}$. We use these concepts to determine the initial step size according to Algorithm 2.
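The idea behind the starting-point selection (choose the point on the boundary of $M$ that is farthest from the true-boundary points already found) can be illustrated with a discretized sketch. The paper solves a continuous problem, Eq. (12); the version below instead assumes a finite list of candidate points already known to lie on the SVDD boundary, which is a simplification introduced here. The function name and the unit-circle example are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def select_initial_point(candidates: Sequence[Point], known_boundary: Sequence[Point]) -> Point:
    """Pick the candidate whose nearest known boundary point is farthest away."""
    if not known_boundary:
        return candidates[0]  # nothing to avoid yet; any candidate will do
    return max(
        candidates,
        key=lambda p: min(math.dist(p, b) for b in known_boundary),
    )


# Assumed candidates: eight points on a unit circle standing in for the boundary of M.
candidates: List[Point] = [
    (math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(0, 360, 45)
]
# Points already located on the true boundary, plus previously used start points.
known_boundary: List[Point] = [(1.0, 0.0), (0.0, 1.0)]

x_a = select_initial_point(candidates, known_boundary)
print(x_a)  # the candidate at 225 degrees, opposite the explored arc
```

Adding previously used starting points to `known_boundary` mirrors the remark above about including their indices in $I_b$ so that a start point is not reused.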

The next step is to search for a boundary point using a line search (specifically, bisection search); see Algorithm 3 for a description. If the initial step to $\mathbf{x}_b$ did not cross the boundary of the true solution, that is, $y_a = y_b$, the bisection search algorithm terminates and returns the training set $X$ and $Y$ containing only the samples $\mathbf{x}_a$ and $\mathbf{x}_b$. Otherwise, the algorithm uses bisection search to reduce the size of the interval between $\mathbf{x}_a$ and $\mathbf{x}_b$ until it is less than some user-defined error tolerance, $\epsilon$.

Finally, the training sets $X$ and $Y$ are updated to contain the sample points that were evaluated against the constraints, i.e., points for which a label $y$ was generated. The set of indices $I_b$ is also updated to contain the indices corresponding to the true boundary points found (within tolerance $\epsilon$). This process is repeated a user-defined $N$ times, which is a termination criterion rather than a convergence criterion. It would be possible to develop a convergence criterion based on the change in $M$ at each generation; this is left as future work. Without a convergence criterion, analysis of computational complexity is less meaningful but still worth considering. In the case of the GIPSP, evaluating a design against the constraints (**ClassLabel** in Algorithms 1–3) is the elementary operation; all others are lower order. Let $\epsilon_0$ and $\epsilon$ be the maximum interval length and error tolerance for the bisection search, respectively. The time complexity is $O(N \log_2(\epsilon_0/\epsilon))$, where $N$ is the number of iterations to be performed.
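As a worked illustration of this bound, the sketch below counts constraint evaluations for assumed values of $N$, $\epsilon_0$, and $\epsilon$. The per-iteration overhead of two evaluations (one label each for $\mathbf{x}_a$ and $\mathbf{x}_b$) is read off Algorithms 1 and 2; the function names and the numbers are illustrative.

```python
import math


def bisection_evaluations(eps0: float, eps: float) -> int:
    """Midpoint evaluations needed to shrink an interval of length eps0 to at most eps."""
    if eps0 <= eps:
        return 0
    return math.ceil(math.log2(eps0 / eps))


def total_evaluations(n_iterations: int, eps0: float, eps: float, overhead: int = 2) -> int:
    """Upper bound on ClassLabel calls: labels for x_a and x_b plus the bisection steps."""
    return n_iterations * (overhead + bisection_evaluations(eps0, eps))


print(bisection_evaluations(1.0, 1e-3))   # 10, since 2**10 = 1024 >= 1000
print(total_evaluations(100, 1.0, 1e-3))  # 1200
```

Tightening the tolerance by a factor of 1000 adds only about ten evaluations per iteration, so the iteration count $N$ dominates the cost.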

**Algorithm 1** SVDD-Based Sampling Algorithm

1. **procedure** SVDDsample($X, D, C$)

2. $X, n \leftarrow \mathrm{RandomSample}(X, D)$

3. $Y \leftarrow \mathrm{ClassLabel}(X, C)$

4. $I_b \leftarrow \emptyset$

5. **for** $i \leftarrow 1$ to $N$ **do**

6. $M \leftarrow$ **train** SVDD model with $X$, $Y$ ▷ Eq. (8)

7. $\mathbf{x}_a, y_a \leftarrow$ select initial point ▷ Eq. (12)

8. $\mathbf{x}_b, y_b \leftarrow \mathrm{TakeInitStep}(M, C, \mathbf{x}_a, y_a)$

9. $X_i, Y_i, I_i \leftarrow \mathrm{BisectionSearch}(\mathbf{x}_a, y_a, \mathbf{x}_b, y_b, C)$

10. $X, Y \leftarrow (X, X_i), (Y, Y_i)$ ▷ Concatenate

11. $I_b \leftarrow (I_b, I_i + n)$

12. $n \leftarrow n + I_i$

13. **return** $M$

**Algorithm 2** Take Initial Step

1. **procedure** TakeInitStep($M, C, \mathbf{x}_a, y_a$)

2. $d \leftarrow$ **gradient** of $r(\mathbf{x}_a)$ ▷ Eq. (13)

3. $\gamma_{\max} \leftarrow \max_{\gamma} \{r^2(\mathbf{x}_a + \gamma d)\}$ ▷ $r^2$ from Eq. (9)

4. $\gamma_{\min} \leftarrow \min_{\gamma} \{r^2(\mathbf{x}_a - \gamma d)\}$

5. $\gamma \leftarrow \min\{\gamma_{\max}, \gamma_{\min}\}$

6. **if** $y_a = -1$ **then** ▷ If $\mathbf{x}_a$ is not satisfactory

7. $\gamma \leftarrow -\gamma$

8. $\mathbf{x}_b \leftarrow \mathbf{x}_a + \gamma d$

9. $y_b \leftarrow \mathrm{ClassLabel}(\mathbf{x}_b, C)$

10. **return** $\mathbf{x}_b$, $y_b$

**Algorithm 3** Bisection Search

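1. **procedure** BisectionSearch($\mathbf{x}_a, y_a, \mathbf{x}_b, y_b, C$)

2. $X, Y \leftarrow (\mathbf{x}_b, \mathbf{x}_a), (y_b, y_a)$ ▷ Concatenate

3. $I \leftarrow 2$ ▷ Counter

4. **if** $y_a \ne y_b$ **then**

5. **while** $\lVert \mathbf{x}_a - \mathbf{x}_b \rVert > \epsilon$ **do**

6. $\mathbf{x}_c \leftarrow \mathbf{x}_b + \tfrac{1}{2}(\mathbf{x}_a - \mathbf{x}_b)$ ▷ Midpoint

7. $y_c \leftarrow \mathrm{ClassLabel}(\mathbf{x}_c, C)$

8. $X, Y \leftarrow (X, \mathbf{x}_c), (Y, y_c)$

9. **if** $y_a = y_c$ **then** ▷ Update interval

10. $\mathbf{x}_a, y_a \leftarrow \mathbf{x}_c, y_c$

11. **else**

12. $\mathbf{x}_b, y_b \leftarrow \mathbf{x}_c, y_c$

13. $I \leftarrow I + 1$ ▷ Update counter

14. **return** $X$, $Y$, $I$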
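The bisection search of Algorithm 3 can be sketched in Python as follows. The loop runs while the interval is longer than the tolerance, matching the description in the text that the interval is reduced until it is less than $\epsilon$. The unit-disk constraint `inside_unit_disk` is a hypothetical stand-in for **ClassLabel**, and the function signature is an assumption made for this sketch.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


def bisection_search(
    x_a: Point, y_a: int, x_b: Point, y_b: int,
    class_label: Callable[[Point], int], tol: float = 1e-3,
) -> Tuple[List[Point], List[int], int]:
    """Shrink [x_a, x_b] around a label change; return evaluated points, labels, count."""
    X: List[Point] = [x_b, x_a]
    Y: List[int] = [y_b, y_a]
    if y_a != y_b:  # the initial step crossed the true boundary
        while math.dist(x_a, x_b) > tol:
            x_c = tuple(b + 0.5 * (a - b) for a, b in zip(x_a, x_b))  # midpoint
            y_c = class_label(x_c)  # the expensive model evaluation
            X.append(x_c)
            Y.append(y_c)
            if y_a == y_c:
                x_a, y_a = x_c, y_c  # boundary lies between x_c and x_b
            else:
                x_b, y_b = x_c, y_c  # boundary lies between x_a and x_c
    return X, Y, len(X)


def inside_unit_disk(x: Point) -> int:
    """Hypothetical constraint: satisfactory (+1) inside the unit disk, else -1."""
    return 1 if sum(v * v for v in x) <= 1.0 else -1


X, Y, count = bisection_search((0.0, 0.0), 1, (2.0, 0.0), -1, inside_unit_disk)
print(count, X[-1], Y[-1])
```

In this example the interval of length 2 is halved 11 times before it falls below the tolerance of $10^{-3}$, so 13 labeled points are returned, and the last evaluated point lies within the tolerance of the true boundary at $x_1 = 1$.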

## Case Study

### Test Problems.

We evaluate the performance of the proposed algorithm on the Fe–Ti binary alloy system; see the phase diagram in Fig. 5. Given the thermodynamic conditions, the phase compositions are computed using ThermoCalc with the TCFE7 database. Recall that the GIPSP is the triple $P = (X, D, C)$. The thermodynamic conditions (design variables) are $X = (x_1, x_2, x_3)$, where $x_1$ is the mass percent of Ti, $x_2$ is the mass percent of Fe, and $x_3$ is the temperature (Kelvin). The search domain $D$ is defined by Eq. (15), and the constraints $C$ are given by Eq. (16) for test case 1 and Eq. (17) for test case 2.

### Results and Analysis.

The proposed algorithm is motivated by the GIPSP, which is typically multidimensional and features response discontinuities that are problematic for gradient-based search techniques. To evaluate the performance of the proposed algorithm on this class of problem, we compare it to EDSD, which is intended for constraint satisfaction problems with similar characteristics. The principal difference is that the proposed algorithm is intended to address the case where the unsatisfactory region is undersampled (as we expect is the case with most GIPSPs).

In both algorithms, performance is dependent on the initial training data. To account for this, each case was tested for 30 trials with random initial training data. For each trial (in either test case), we randomly sampled the domain $D$ to find ten feasible and ten infeasible sites. The same data are used to initialize each algorithm. These initializing function evaluations are not counted toward the overall function count of either algorithm.

Since both EDSD and the GIPSP algorithm use binary classifiers, we evaluate solution quality using the *precision* and *recall* metrics commonly used in pattern recognition [57]. We generate a set $X$ of $10^6$ random samples in the design space domain $D$ defined by Eq. (15). We then find $X' \subset X$, the subset that satisfies the user-specified conditions, $C$, in Eq. (16) for test case 1 and Eq. (17) for test case 2. Next, we find $X'' \subset X$, the subset that is classified as belonging to the satisfactory set according to the classifier (SVM for EDSD and SVDD for the GIPSP algorithm). We compute the *true positives*, *true negatives*, *false positives*, and *false negatives* as

| | Positive | Negative |
|---|---|---|
| True | $N_{tp} = \lvert X' \cap X'' \rvert$ | $N_{tn} = \lvert (X \setminus X') \cap (X \setminus X'') \rvert$ |
| False | $N_{fp} = \lvert (X \setminus X') \cap X'' \rvert$ | $N_{fn} = \lvert X' \cap (X \setminus X'') \rvert$ |

Here, *positive* and *negative* refer to the classifier's prediction, and *true* and *false* refer to how that prediction corresponds to the *actual* classification. Taking these terms into account, we can compute the precision and recall measures as

$$\mathrm{precision} = \frac{N_{tp}}{N_{tp} + N_{fp}}, \qquad \mathrm{recall} = \frac{N_{tp}}{N_{tp} + N_{fn}}$$

We also report the misclassification rate, $(N_{fp} + N_{fn})/\lvert X \rvert$. Lower values of misclassification rate are preferred; however, one must be cautious when interpreting this measure. For example, in a case where the satisfactory region is very small, classifying the entire region as “unsatisfactory” will have a low misclassification rate.
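The three measures are simple to compute with set operations, and a small example makes the caveat about the misclassification rate concrete. The sketch below uses the standard definitions (precision is the fraction of predicted-satisfactory samples that are truly satisfactory; recall is the fraction of truly satisfactory samples that are predicted satisfactory). The function names, the integer sample identifiers, and the convention of returning 0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Set, Tuple


def confusion_counts(samples: Set[int], actual: Set[int], predicted: Set[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) given the truly satisfactory and predicted-satisfactory sets."""
    tp = len(actual & predicted)            # satisfactory and classified satisfactory
    fp = len(predicted - actual)            # classified satisfactory but not satisfactory
    fn = len(actual - predicted)            # satisfactory but classified unsatisfactory
    tn = len(samples - actual - predicted)  # neither
    return tp, tn, fp, fn


def quality_metrics(samples: Set[int], actual: Set[int], predicted: Set[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, misclassification rate)."""
    tp, tn, fp, fn = confusion_counts(samples, actual, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, (fp + fn) / len(samples)


samples = set(range(1000))
actual = set(range(10))  # a very small satisfactory region (1% of the samples)

# A classifier that labels everything "unsatisfactory": low error, but useless.
print(quality_metrics(samples, actual, set()))               # (0.0, 0.0, 0.01)
# A classifier that finds half of the region with some false positives.
print(quality_metrics(samples, actual, set(range(5, 20))))   # precision 1/3, recall 0.5, rate 0.015
```

The first classifier achieves a 1% misclassification rate while recovering none of the satisfactory region, which is why precision and recall are reported alongside it.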

We calculate each performance metric at several intervals for each algorithm. We report the mean values and 95% confidence intervals over the 30 trials for test cases 1 and 2 in Figs. 6 and 7, respectively. In test case 1, the GIPSP algorithm produces a higher-precision approximation for any given number of function evaluations. This is not unexpected, since the SVDD is more conservative than the SVM technique. Both algorithms converge to a similar measure of recall and a low misclassification rate. In test case 2, the GIPSP algorithm generates high-precision solutions and converges to a solution with high recall and low misclassification rate. However, the precision of the EDSD solution does not improve significantly after 100 iterations, and the recall measure becomes worse. Further, the solution quality is highly variable at each iteration, resulting in large confidence intervals.
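One common way to summarize a metric over 30 trials is a Student-t interval on the mean. The sketch below assumes that construction, since the interval method is not specified here. The constant `T_CRIT_95_DF29` and the example precision values are illustrative assumptions.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 29 degrees of freedom (30 trials).
T_CRIT_95_DF29 = 2.045


def mean_and_ci(values, t_crit=T_CRIT_95_DF29):
    """Return (mean, (lower, upper)) for a t-based confidence interval on the mean.

    The default `t_crit` is only appropriate for 30 values; pass a different
    critical value for other sample sizes.
    """
    n = len(values)
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)


# Hypothetical precision values from 30 trials at one checkpoint.
trial_precision = [0.9] * 15 + [0.7] * 15
mean_precision, (ci_low, ci_high) = mean_and_ci(trial_precision)
```

A wide interval at a given checkpoint indicates that the metric differs substantially from trial to trial at that point in the run.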

For illustrative purposes, we include Figs. 8 and 9, which depict the progression of each algorithm at (*a*) 100, (*b*) 250, and (*c*) 400 function evaluations for each test case. The shaded region represents the true solution to each CCSP, found through an exhaustive search. An attempt was made to illustrate trials that are representative of the mean values in Figs. 6 and 7. However, we should note that in both test cases (but especially in test case 2), the performance of the EDSD algorithm varied significantly. Therefore, no single result can be truly representative of the typical results. Taking these limitations into account, the illustrations still provide some valuable insight into the performance of each algorithm.

As can be seen in Fig. 8, the GIPSP algorithm maintains high precision throughout its progression, with few false positives. The EDSD, on the other hand, initially produces more optimistic estimates of the satisfactory region. The higher precision of the GIPSP solution is reflected in the “tighter” classifier boundary. The precision at (*c*) 400 function evaluations for the EDSD algorithm is considerably higher than the med.

Note that Fig. 9 is focused on the solution space (which is quite small); the search space for this test problem is defined by Eq. (15) and depicted in Fig. 5. Test case 2 is intended to highlight the limitations of the EDSD method on problems where the satisfactory region is small relative to the search space. In such cases, the SVM technique used in EDSD tends to overestimate the satisfactory region. In Fig. 9, the overestimation occurs near the true solution (shaded portions); however, this is not always the case. Figure 10 illustrates the results from another trial of the EDSD algorithm in test case 2.

## Discussion and Summary

In this paper, we have presented a novel algorithm for approximating all the solutions to a CCSP with nonisolated solutions where the satisfactory region is small relative to the search space. The algorithm uses the SVDD technique combined with a sampling strategy to gradually develop the solution. The motivation for the algorithm is the generalized inverse phase stability problem of mapping user-specified regions in multidimensional phase constitution space to ranges of values of thermodynamic conditions, which we term the GIPSP. In the GIPSP, one class (the satisfactory region) is small relative to the other (the unsatisfactory region). For scalability, it is desirable to undersample the unsatisfactory region, since it comprises the vast majority of the space. This motivates the use of the SVDD method in the algorithm, since it models scenarios with imbalanced training data more accurately (in terms of precision and recall).

We investigated the performance of the algorithm on the Fe–Ti binary alloy system using ThermoCalc with the TCFE7 database. Using this system, we formulated two test cases. In the first test case, the solution set is nonconvex; in the second, the solution set is small relative to the search space. We compared the performance of the GIPSP algorithm to that of the EDSD algorithm, which uses a related classification scheme, namely, SVM. The performance of each algorithm on the test problems was measured by precision, recall, and misclassification rate. In both test problems, the GIPSP algorithm converged to a solution with high precision and recall. The EDSD algorithm, however, had significant difficulty in approximating the solution to test case 2. This is likely the result of the limitations of the SVM technique used in EDSD. The SVM technique is known to underperform in cases with imbalanced training data sets. In test case 2, the satisfactory region is small relative to the search space, resulting in an imbalanced training data set.

Future work should investigate the performance of the algorithm on problems of higher dimensionality that are more representative of real-world materials design problems.

## Acknowledgment

This work was supported by the National Science Foundation and the Air Force under Grant No. EFRI-1240483. The authors would like to thank Paul Mason from ThermoCalc for providing critical thermodynamic data, which made this research possible.

## Nomenclature

**a** = centroid of hypersphere

**b** = centroid of feature space hypersphere

*c* = SVDD parameter

*C*_{j} = constraint

*d* = direction perpendicular to SVDD boundary

*f* = thermodynamic conditions → phase constitution

$f_X$ = random parameter joint pdf

*g* = performance function

*I*_{b} = set of indices corresponding to boundary points

*I*_{i} = real interval

*K* = kernel function

*n* = number of data points

*N* = dimensionality of thermodynamic conditions space

*N*_{fn} = number of false negatives

*N*_{fp} = number of false positives

*N*_{tn} = number of true negatives

*N*_{tp} = number of true positives

*q* = Gaussian kernel parameter

*r* = hypersphere radius

*R* = reliability

*R*_{j} = relation on the variables involved in constraint *C*_{j}

*S*_{j} = scope of the constraint *C*_{j}

SV = set of support vectors

*t* = dummy variable

*X* = training data set

*x*_{i} = design variable

**x**_{i} = a vector in the design variable space $X$

*Y* = set of training data labels

*y*_{i} = training data label

**z** = test point

*β*_{i} = Lagrangian multiplier

*γ* = step size

*ε* = error tolerance

*ε*_{0} = maximum interval length

*ξ*_{i} = slack variable

Φ = data space → feature space

$A$ = *n*-tuple solution to $P$

$C$ = *t*-tuple of constraints

$D$ = search space

$M$ = classification model

$P$ = constraint satisfaction problem

$X$ = *n*-tuple of variables