## Abstract

This paper examines and compares the commonly used machine learning algorithms in their performance in interpolation and extrapolation of flame describing function (FDFs), based on experimental and simulation data. Algorithm performance is evaluated by interpolating and extrapolating FDFs and then the impact of errors on the limit cycle amplitudes are evaluated using the extended FDF (xFDF) framework. The best algorithms in interpolation and extrapolation were found to be the widely used cubic spline interpolation, as well as the Gaussian processes (GPs) regressor. The data itself were found to be an important factor in defining the predictive performance of a model; therefore, a method of optimally selecting data points at test time using Gaussian processes was demonstrated. The aim of this is to allow a minimal amount of data points to be collected while still providing enough information to model the FDF accurately. The extrapolation performance was shown to decay very quickly with distance from the domain and so emphasis should be put on selecting measurement points in order to expand the covered domain. Gaussian processes also give an indication of confidence on its predictions and are used to carry out uncertainty quantification, in order to understand model sensitivities. This was demonstrated through application to the xFDF framework.

## Introduction

Increasing regulation on NOx emissions in aero and power turbines is necessitating a move to lean premixed combustion. With this comes increased exposure to thermoacoustic instabilities requiring better tools for their modeling and prediction. A very powerful tool in creating low-order models for linearized systems is the flame transfer function (FTF). The FTF describes the heat release fluctuations from the flame resulting from incoming velocity fluctuation as a function of frequency. Noiray et al. [1] showed that the FTF can be extended to weakly nonlinear systems by including a dependence on amplitude, which gives the flame describing function (FDF). In order to determine FDFs, either experimental measurements are carried out by forcing a flame using a loudspeaker or numerical measurements are made of how the flame responds to different frequencies and levels of forcing in a simulation. Collecting data from both of these methods can be costly and is often subject to a restriction on the domain (range of frequencies and amplitudes) that can be tested, e.g., the maximum amplitude of forcing in an experimental rig can be limited by the loudspeaker. This results in a sparse grid of measurements over a limited domain, meaning practitioners utilizing FDFs often have to make assumptions of how the data behave between data points or outside of the domain. This paper compares different machine learning algorithms for the interpolation and extrapolation of some examples of FDFs. It investigates some of the main sources of errors in both of these regimes and aims to show which algorithms are best for both interpolation and extrapolation. The importance of intelligently selecting data points is demonstrated, and finally, the impact of interpolation errors on a system model is evaluated, wherein the limit cycle oscillation of a laminar slit burned is calculated utilizing the extended FDF (xFDF) framework.

### Interpolation Versus Extrapolation.

In regression tasks, we use data to generalize a function that maps a set of input variables X to output variables y. Using this function mapping, a y value can be predicted for any combination of input variables. This process is referred to as interpolation when the input variables lie in between the training data, whereas if the point of estimation lies outside this region it is referred to as extrapolation. In the univariate example in Fig. 1, the gray and white regions represent the extrapolation and interpolation regimes, respectively. The black lines represent a sample of polynomial models used to create predictions in and outside the training data domain. In the interpolation regime, the models are well constrained, causing them to collapse in a small region. Outside of the domain however, the models diverge, giving wildly different predictions. This dramatic divergence of predictions (despite being the same model with slightly differing hyperparameters and trained on the same set of data) is due to the lack of information available to the model during training that would constrain the model to predictions with a smaller variance. This is the danger of carrying out extrapolation: model predictions outside of the training domain are extremely sensitive to the training data and model parameters, resulting in unpredictable behavior, unless there are implicit assumptions in the model definition or explicit assumptions embedded in its definition or training. Part of this study will therefore highlight the assumptions that make some algorithms better at carrying out predictions on FDFs in the extrapolation regime and to quantify the limits of these assumptions.

Fig. 1
Fig. 1

### Data Point Reduction.

A problem that goes hand in hand with interpolation methods is that of data point reduction, i.e., reducing the number of measurements that must be taken. The balance between minimizing the number of measurements without compromising the quality of predictions is very difficult to establish ahead of time and the solution is not clear until more measurements than are necessary have been taken. The better a model is at inferring the shape of the data, the fewer data points will be required to make good predictions. However, the data points that are provided to the model are also of significant importance to the model performance. This is a problem that is associated with the complexity of the underlying data: the more complex the data, the more data points that are required. For example, a linear equation requires two data points to be fully described, and n order polynomials require n +1 data points. Real problems are somewhat more difficult as the complexity and thus required number of measurements changes region to region. Knowing which regions will require more data points ahead of time in general is impossible; however, domain knowledge can guide us for the problems of FDFs. It is well known, for example, that for the fundamental response in laminar flames the data vary more strongly in the direction of frequency than amplitude and that the response tends to die down at higher frequencies. We can take this into account by adjusting our measurement grid to take measurements at smaller steps in frequency and reduce the number of measurements in amplitude, as was the case for some of the datasets used in this study. There are also techniques that adjust the traditional grid sampling method based purely on the shape of the data; these will be explored in more detail in the test methodology.

## Training Data

The data used in this study come from two sources. One is from Palies et al. [2] and is an experimental evaluation of the gain and phase at different amplitudes and frequencies for two different flames (referred to as FDF A and B). The FDF for flame A is shown in Fig. 2. The second set of data is an extended FDF (xFDF) from Haeringer et al. [3] and consists of FDFs for individual harmonics, which were determined by monofrequent acoustic excitation of a high-resolution computational fluid dynamics (CFD) simulation of a laminar slit burner.

Fig. 2
Fig. 2

These datasets are used to compare the interpolation and extrapolation performance of a selection of algorithms and to understand the reasons for their ability (or lack thereof) to give an accurate estimation. The xFDF data are then used to evaluate the impact of errors in the FDF prediction on a parameter of interest, namely the limit cycle amplitudes calculated using the xFDF framework.

### Experimental Data.

The experimental data used for the first analysis are two FDFs from Palies et al. [2], of which the FDF for flame A is shown in the contour plot in Fig. 2. The gain (a) and phase (b) response of heat release rate to velocity fluctuations are shown as a function of signal amplitude and frequency. Data are given at eight amplitudes, from 0.01 to 0.71 (0.04 to 0.6 for flame B) and at 14 frequencies from 10 Hz to 400 Hz. Each rectangle in the plot represents one data point, of which there are 210 (198 for flame B).

The experimental setup for this data is shown in Fig. 3. The pressure fluctuations were induced using a loudspeaker to force the mean flow and the unsteady heat release rate is deduced from the emission intensity of OH* radicals.

Fig. 3
Fig. 3

### Numerical Data.

The data used for the numerical case study come from the analysis carried out by Haeringer et al. on the inclusion of higher harmonics in flame describing functions [3]. Haeringer et al. propose an extension of the standard FDF (denoted as xFDF) that includes additional describing functions, accounting for higher harmonics in the flame response. The authors determine the xFDF of a laminar slit burner numerically by monofrequently forcing the flame with various amplitudes and frequencies. This numerical data originate from Jaensch et al. [4] who used a reactive flow simulation coupled with a low-order model acoustic network to model a laminar slip burner [5]. A structured mesh of 122,300 cells was used and the schematic of the setup can be seen in Fig. 4.

Fig. 4
Fig. 4

In the original study, an acoustic network that integrated this xFDF was created and used to estimate the limit cycle oscillation amplitude of the system as a function of plenum length, which was then compared to that found using a high-resolution CFD simulation. The estimation of limit cycle oscillations using the xFDF could not be made beyond a plenum length of 0.33 m due to the amplitude of the oscillations exceeding the maximum amplitude of the xFDF. The extrapolation methods will therefore be used to evaluate the xFDF at these higher amplitudes as well as in the interpolation regime, with the results compared to those found by Haeringer et al.

## Machine Learning Algorithms

In this paper, several well-established machine learning algorithms that are commonly used for regression analyses are fit to the training datasets described above and their performance compared. It is outside of the scope of this paper to give a rigorous derivation of all of them; however, this section gives a brief introduction to each group algorithms. Interested readers are directed to Refs. [6] and [7] to gain a deeper understanding of their behavior.

### Linear Regression.

Linear regression is a very well-known algorithm that, in the simplest univariate case, simply adjusts coefficients a and b in y = ax + b to minimize the mean squared error (MSE), given in the following equation:
$MSE=∑i=0n(yi−ŷi)2$
(1)
where yi is the ith training data point, $ŷi$ is the prediction at the ith training point, and n is the number of training points. It can also be shown that modeling the error in the prediction as Gaussian and maximizing the log marginal likelihood equates to minimizing the MSE [6]. The linear case above can be extended to fit multivariate inputs and higher order polynomials by optimizing the weights, wn in a linear combination of parameters and parameter combinations. For a two input polynomial of order n, this results in the optimization of the following equation:
$y=w0+w1x1+w2x2+w3x1x2⋯+wm−1x1n+wmx2n$
(2)

where n is the order of the polynomial and m is the total number of weights, in this case 2n. The order of the polynomial, n, is a hyperparameter that can be adjusted to find a model that best fits the data [7].

#### ElasticNet.

With high-order polynomials, the least squares optimization can over-fit the data, resulting in a predictor that goes through every training point exactly, however does so by assigning large values to the weights, which causes the function to vary significantly between the training points. Regularization is used to compensate for this tendency by introducing a penalty in the MSE equation for excessively large weights. The ElasticNet algorithm employs an absolute penalty, which drives weight to zero and a squared weight penalty, that drives weight from large values. This results in a loss function shown in the following equation:
$Loss=∑i=0n(yi−ŷi)2+∑k=0mα|wk|+βwk2$
(3)

### Gaussian Processes.

Gaussian processes (GPs) are based on Bayes' theorem of conditional probability and use a prior assumption with the training data to generate a Gaussian output that is continuous with respect to the input variables. This results in a Gaussian distribution for any prediction point, the width of which can be interpreted as the model uncertainty in the prediction. The hyperparameters in this algorithm are the prior distribution, the covariance function, and the mean function. The prior distribution is the assumption of the probability distribution before taking observations into account. In this study, the prior is assumed to be a Gaussian with variance 0.01 and a mean defined by the mean function. The covariance function describes the variance in the output between different input values and contains parameters that are tuned during the training of the GP. The covariance function itself is made up of a combination of kernels; in this study, only the radial basis function or Gaussian kernel is used. The assumption of how the algorithm behaves outside of the training domain can be incorporated into this algorithm through the mean function. However, the mean function is generally set to be zero, causing the model prediction to drop to zero when making predictions outside of the domain. A nonzero mean function has been used in the tests involving extrapolation where the mean function is constructed as described in the Splines section.

The quality of the fit can be estimated by evaluating the log marginal likelihood, which, unlike for many other algorithms, is tractable due to the use of pure Gaussian integrals. This is one of the advantages of this algorithm that makes finding the best covariance function very straightforward and does not require carrying out an extensive hyperparameter tuning process, as is the case for other algorithms.

For a more detailed explanation of GPs, the reader is referred to Ref. [8].

### Neural Networks.

Neural networks (NNs) are a very popular machine learning algorithm due to their success in deep learning and big data applications. They are a layered network architecture with each layer being made up of m neurons. Each neuron computes a nonlinear transformation on a weighted sum of the outputs of all the neurons in the previous layer (or the training data, if it is the first layer) [6]. For example, a two layer network can be represented as
$ŷ=ϕ(2)(W(2)Tϕ(1)(W(1)TX+b(1))+b(2))$
(4)

where X is the matrix of input variables, $ϕ(n)(x)$ is the nth layer transformation (usually a sigmoid or tanh function), and Wn and bn are the weights and biases of the nth layer. The hyperparameters for the NN are the number of layers, the number of neurons in each layer, and the hyperparameters associated with any regularization as described in the case of ElasticNet.

One of the main advantages delivered by NNs is that their performance can almost always be improved by including more data in the training dataset and increasing the size of the data. Unfortunately in the application of FDFs, generating datasets is a costly process which means the main advantages of NNs will not prove much use here.

### Gaussian Kernel Methods.

These methods generate predictions by fitting a set of Gaussian kernels in the predictor space and summing their contribution to predict a value. The nature of these methods creates a smooth prediction surface, and two methods for training these structures (positioning of the mean, variance and number of the kernels) are included here.

#### Radial Basis Function Neural Networks.

Radial basis function neural networks (RBFNNs) are a specialized format of NNs that effectively use gradient descent to fit the heights of Gaussian kernels, which are then summed to create a smooth surface. It is a two-layer network with a “radial basis layer” feeding into a “linear layer.” Unlike most NN architectures that carry out a dot product between the inputs and the weights in the input to a layer, the radial basis layer calculates the Euclidean distance between the inputs, X, and the first layer weights, W1. The vector of distances is then multiplied by the bias vector, b2, and passed through the radial basis activation function, $f(n)=e−n2$ [9]. This architecture then maps a surface made up of Gaussian kernels with centers (mean) defined by W1 and a variance defined by b1. The heights of these kernels are defined in weights the linear layer, W2, where they are multiplied with the outputs of the radial basis layer through a dot product and then summed together. This process is summarized in the following equation:
$ŷ=∑(W(2)Tf(||W(1)·X||b(1))+b(2))$
(5)

Gradient descent is used to set the weights in the second layer in order to minimize the MSE. Each row in the weight matrix of the first layer describes the position of a neuron and can be set using several different training algorithms:

1. A neuron positioned under each training point. This results in zero error in training and has the tendency to over-fit the training data.

2. Neurons positioned under n training points. This process evaluates the impact of fitting a neuron under all available training points on the loss function (MSE) and repeats this process until n neurons have been fit or a minimum error has been satisfied on the training set. This process will generalize better than the first algorithm however adds a hyperparameter, n, that needs to be set.

3. Neurons under n cluster centroids [10]. In this algorithm, rather than fitting neurons under training points, a clustering algorithm is used to find n cluster centroids, and these centroids are then used as the weights. This makes the algorithm much better at generalizing, however, brings in hyperparameters in the form n, and the clustering algorithm used.

The RBFNN has been reviewed for its performance on FDFs before and found to be good in interpolation [11]. However, they are quite unfit for extrapolation as their surface will tend to zero away from the data points.

#### Support Vector Regression.

Support vector regression has a very different origin to RBFNNs; however, the resulting definition, when used with a Gaussian kernel, is very similar
$ŷ=∑(W2·f(||W1−X||b1))+b$
(6)

A significant difference from the RBFNN is that only predictions that have an error greater than ϵ are used to influence the MSE loss function during training [12]. This results in the training process only adjusting to improve predictions in areas where performance is very poor and results in good generalization. This algorithm has three hyperparameters: an error penalty term C, ϵ and the width of the Gaussian kernels, γ.

### Random Forest Regression.

This algorithm is an ensemble method that makes its predictions from an ensemble of decision trees trained on subsets of the training data [7]. This allows it to generalize surprisingly well, and far better than that can be achieved using the decision trees alone. Unlike the other algorithms described here, it does not result in a smooth surface, due to it being comprised of decision trees that break up the predictor space and create constant value predictions. This algorithm implicitly makes an assumption that all data outside the training domain is equal to that at its edge, as made in constant value extrapolation. The hyperparameters tuned in this study were the number of estimators in each tree and the maximum number of features used in each tree.

### Symbolic Regression (Equation Learner).

The equation learner (EQL) is an algorithm developed by Martius and Lampert [13] that effectively carries out a symbolic regression using gradient descent. It does this by utilizing a NN architecture with activation functions corresponding to the choice of basis functions (e.g., $sin(x), cos(x)$, and multiplication). This allows it to come up with a nonlinear combination of basis functions that can be interpreted as an equation. The training process is classical gradient descent but with three different regularization phases. The first gives no regularization, allowing the algorithm to explore the loss space, then a penalty on the absolute value of the weights is applied to drive them toward zero. Finally, all weights that are zero are kept at zero and all other regularization is turned off. This encourages the algorithm to find solutions that are simple in terms of its basis functions. Symbolic regression has been suggested to be suitable for extrapolation applications as most dynamical systems can be described using combinations of sine and cosine functions [14]. Compared to many of the other algorithms, it takes a long time to train and suffers from stability issues due to local minima generated by the use of periodic functions.

### Splines.

Splines are a piecewise interpolation method where the interpolation can be carried out at different orders of polynomial. They are by far the most widely used interpolation method and come as built in functions with any program that can perform interpolation. The most basic form is linear interpolation, which effectively plots a straight line between each data point. The most commonly used spline is the cubic spline, which fits a cubic polynomial in between each pair of points, while restricting the value and first derivative at each connecting point along the spline to be equal [15]. Aside from the order of the interpolation, there are no hyperparameters to be tuned and the fitting methods give an exact solution.

#### Constant Value Extrapolation.

Since the cubic spline is by definition an interpolation method, it requires an explicitly defined extrapolation method. The simplest and most commonly used method of extrapolation is that of constant value extrapolation. This method simply assumes that the gain and phase values outside of the domain are the same as that found at the edge of the domain. This results in a flat surface outside of the domain. It can also be combined with any of the above methods for interpolation to give a prediction across the whole of the extended domain.

Another very simple algorithm, this method assumes that the gradient that exists at the edge of the domain applies to all of the outside of the domain. This is obviously a poor assumption for extrapolation far from the domain when applied to FDFs, as it is likely to result in unphysical predictions.

## Test Methodology

### Interpolation and Extrapolation Performance Evaluation.

In order to evaluate the performance of the algorithms in different regimes, the algorithms were trained on different subsets of the FDF data and then their error in predicting the remaining data was used to evaluate their performance. In the interpolation study, two amplitudes and alternating frequencies are removed from the original data and then the algorithm error in predicting the values at these points is used as a measure of the interpolation performance.

This effectively simulates the performance of the algorithm when increasing the measurement resolution of an FDF dataset. Care was taken in the selection of points to ensure that they were in the interpolation regime as discussed in the introduction.

In the extrapolation study, subsets of the data, generated by removing the maximum amplitudes, are used for training the algorithms. This results in test/train datasets with increasing levels of extrapolation (defined as the ratio between the test amplitude and the maximum amplitude in the training dataset) and allows a simulation of extrapolating an FDF dataset to higher amplitudes. An example of the training and sets points for FDF A is shown in Fig. 5.

Fig. 5
Fig. 5

The code for training the algorithms is widely available in open source libraries and not normally up to the user to implement [16]. For each algorithm however, there is a set of tunable parameters (hyperparameters) that have significant impact on the performance of the resulting algorithm. The hyperparameters considered in this study are included in the algorithm descriptions in the Machine Learning Algorithms section. The hyperparameters are often a way of controlling the complexity of a model and as thus whether a model is over- or under-fitting the data. With the exception of Gaussian processes and cubic splines (for reasons given in their descriptions), all of the algorithms require their hyperparameters to be explicitly tuned to the dataset they are being trained on. The aim of tuning the hyperparameters find a balance between making the model complex enough to capture the shape of the data, yet keep it simple in order to improve its ability to generalize to unseen data. In this study, this is achieved using a grid search K-fold cross-validation, which evaluates the ability of every hyperparameter combination to generalize on a subset of the training data. For a more detailed explanation of K-fold cross-validation, the reader is referred to Ref [8].

### Performance Evaluation Metrics.

In this study, the coefficient of determination (R2 score) will be used as an evaluation of the performance of the machine learning model. The R2 score is a measure of correlation or “goodness of fit” and is equal to unity minus the ratio of the residual squared error of the model versus the residual square error from making a constant value prediction equal to the mean of the training data
$R2=1−1n∑(yi−ŷi)21n∑(yi−y¯)2$
(7)

An R2 score of 1 refers to a perfect model with zero error and a score of 0 represents a model that is no better than making all predictions equal to the mean of the data. This metric was chosen above others as the score is relative to a very basic model and is not an absolute metric, which allows the algorithm performance to be compared across datasets with different ranges (e.g., in the gain and phase data) with the score remaining on the same scale. This is in comparison with an absolute metric like mean squared error where the magnitude of the metric, and what can be considered to be a good score, depends on the data on which the model is fit [16].

### Minimization of the Number of Data Points.

As discussed in the introduction, the minimization of data points in a problem of interest in any scenario where acquiring data points is expensive. In this study, a Bayesian design of experiments method based on Gaussian processes is used [17,18]. The procedure is simple and consists of five steps:

1. Select points on the edge of the domain.

2. Fit a Gaussian process.

3. Evaluate the GP at all points where a measurement could be taken.

4. Select the n points with the largest variance and carry out measurements there.

5. Repeat steps 2–4 until the development score has exceed the threshold, t, k consecutive times.

where n refers to the number of data points added to the GP at each step, t is a threshold for the development score, and k is the number of times the development score must exceed t to reach a satisfactory error. The development score is a score of the desirable performance metric, evaluated on the next data points. These three parameters are selectable by the user, and respective values of 5, 0.9, and 3 have been chosen for this study. A test of the effectiveness of this method for minimizing the number of measurements is carried out by comparing the point at which the procedure predicts enough data have been collected to the actual error on the dataset.

### Impact of Flame Describing Function Interpolation and Extrapolation Errors.

The analyses so far are focused on the errors in the FDF that arise from using different interpolation or extrapolation methods. However, to the practitioner, these errors are of little interest and their main concern is on how this will impact the system as a whole, for example, in terms of limit cycle amplitudes, mode frequencies, or growth rates. A full evaluation of this impact would require an in depth sensitivity study; however, we can gain an intuition of effects on these parameters by artificially introducing errors and seeing how the system outputs behave.

This is done in the interpolation regime for the xFDF datasets, by reducing the number of data points available for training, which will in turn increase the error in the model predictions. As the performance is strongly affected by the selection of data points, the above methodology for minimizing the number of data points is used to select the data points in the reduced datasets.

The numerical xFDFs are be used in a one-dimensional code developed by Haeringer et al. [3] to find the limit cycle amplitude estimation as a function of the model plenum length, for different harmonics. The one-dimensional code is used to find the amplitude and frequency that satisfy the following equation:
$Fn(ω,An)=H(ω)$
(8)

where Fn is the FDF of the nth harmonic and H is the low-order model representation of the system. This allows the frequency of the unstable mode to be identified and hence it's limit cycle amplitude. Throughout this process, evaluation of the xFDFs will be made at points that are not colocated with the measurement points, thus requiring some form of interpolation. The effect of extrapolation is only evaluated on the xFDF data following the same methodology as for the interpolation. All of these predictions are then compared to the CFD results [4].

## Results and Discussion

This section presents and discusses the results of each of the tests described in the Training Data section. The interpolation results on all of the FDF datasets are shown first, followed by the extrapolation test. The results for minimizing the number data points are then presented and their results used in the analysis of the impact of interpolation error which follows. Finally, a brief demonstration of the impact of extrapolation errors is shown before a discussion on aspects of algorithms that are also useful, outside of just interpolation and extrapolation performance.

### Interpolation.

The interpolation test results for each algorithm are shown in Fig. 6 for datasets where a good R2 score (>0.95) was achieved. No algorithm achieved a score above 0.95 for the higher harmonic xFDFs and so only the fundamental is shown. The linear and cubic spline, GP, and EQL algorithms performed the best, with R2 scores above 0.95 for both the gain and phase. The other algorithms showed similar performance on the phase data, with only polynomial regression not achieving a score over 0.95 for all the datasets that are shown.

Fig. 6
Fig. 6

There are two main factors that can be seen affecting the performance of the algorithms here. The first is the amount of information captured by the training points compared to the complexity of the quantity being measured and the second is the ability of the algorithm to model the complexities of the surface. The first issue, namely that the data provided does not contain enough points to describe peaks and troughs in the data, was touched upon in the introduction and accounts for the poor performance on the xFDF higher harmonic datasets. This can be seen in the prediction made by the cubic spline on the gain of the fourth harmonic, shown in Fig. 7. Here, the test points are shown by large spheres and the training points with small spheres. The amplitudes with no training points available are shown in black. This type of error occurs twice in this plot, at the test point with maximum gain and maximum amplitude and for most of the test points along the amplitude of $0.7u′/u$. In both of these cases, the training points around the test points do not infer the existence of any peaks, resulting in the spline significantly under-predicting the gain with an relative error of 50% in places. We know that this example in particular is of this error type as the surface for the second harmonic gain is very similar and yet the cubic spline algorithm was able to capture the missing peaks.

Fig. 7
Fig. 7

The second type of error is a limitation of the algorithm and arises from the algorithm's lack of ability to model nonlinear data. This type of error will often arise as the cross-validation process tries to keep models simple (as this prevents over-fitting and improves generalization) while being complex enough to model the features in the data. An example of this type of error is shown in Fig. 8. Here, the RBFNN model has chosen a wide enough kernel so that it can describe the general trend; however, because of this, it has ignored the faster changing regions of the data at low frequencies and thus has under-fit the data. The implementation of the RBFNN algorithm in this paper fails to fit that dataset well, as the rate of variation in the directions of amplitude and frequency are very different while there is only one parameter controlling the width of the kernels, i.e., it cannot change in two directions. This will result in it either under-fitting the data, as seen here, or picking a narrow kernel to capture the peaks, causing it to go to zero where there is not enough data, i.e., over-fit the data.

Fig. 8
Fig. 8

The cubic spline, GP, and EQL algorithms were able to capture the surface complexities with the given data. The cubic spline fits the training data exactly and assuming a smooth function has an interpolation error bound on proportional to $f″″h4$ where $h=xi+1−xi$ and f is the underlying function [15]. This essentially means that the limit of the error will be smaller with closer data points or with more slowly varying functions, which agrees with what we would intuitively expect. The GP algorithm achieved good performance by applying a covariance function that had a short length scale in the direction of frequency and a long length scale in the direction of amplitude, allowing it to capture the peaks in the data. The EQL has a broad range of basis functions that it can combine, allowing it to fit very complex shapes, while enforcing a strong regularization allows it to better generalize the underlying behavior.

These algorithms were able to extract the most information from the available data while remaining complex enough to model the data surface. The other algorithms suffered from some combination of the two types of error discussed.

### Extrapolation.

Figure 9 shows plots of R2 scores for increasing extrapolation distance from the domain. The data points correspond to the evaluation points described in the test methodology and are plotted against the ratio between the test amplitude and the max amplitude used in their training, effectively giving a distance in terms of the size of the training domain. As would be expected, for all algorithms the R2 score tends to decrease with increasing distance from the domain. At the largest level of extrapolation, algorithms that employ a constant value extrapolation have a good R2 score on the phase data (>0.9) but a poor score on the gain dataset. The best performing algorithms are the EQL, as well as the cubic spline and GP algorithms. The polynomial regressor struggled to fit the smallest dataset well, resulting in wildly inaccurate predictions and so was not included here. All Gaussian Kernel Methods showed comparatively poor performance, even in near-domain extrapolation.

Fig. 9
Fig. 9

As described in the introduction, the behavior of algorithms in the extrapolation regime is either chaotic or dependent on implicit assumptions. This is reflected in the extrapolation results of the linear regressors and NN that give large errors and tend off in different directions, regardless of the actual values. The Gaussian kernel methods also give large errors due to their implicit assumption that everything away from the domain tends to zero at a rate defined by the width of the kernels. This is in contrast to the results found by Oh et al. [11]; however, they were extrapolating over a much smaller distance.

In the phase data, the algorithms employing constant value or linear assumptions in the extrapolation regime give very good fits at very high levels of extrapolation. This is likely due to the data being simple and not varying by much, rather than the algorithms. In the gain dataset, predictive performance significantly deteriorates at extrapolation around 0.4 time the domain. The best performing algorithms in this region are those that have assumed constant value or linear behavior, suggesting that these assumptions can give reasonable predictions surprisingly far outside the domain. It should be noted that constant value extrapolation is a special case of linear extrapolation (where the gradient is zero), which is why they achieved similar performance.

### Data Point Minimization Method Validation.

Figure 10 shows the test R2 scores and the point where the development R2 score has exceeded the 0.9 threshold for three consecutive steps (the vertical line). The highest harmonic xFDFs are not shown as they do not exceed the threshold in the test or development scores, suggesting there is not enough data even in the full dataset to model the surface using GPs. For all the cases, the threshold method succeeds in predicting a good R2 score before the full dataset is tested, bar the gain of the first xFDF harmonic. In this case, the metric stops the measurements when the R2 core is just 0.82. This shows that the metric is a good qualitative indicator of the fit generated by the measured points and could be used to minimize the number of required data measurements.

Fig. 10
Fig. 10

### Impact of Interpolation and Extrapolation Errors.

Figure 11 shows the predictions of limit cycle amplitudes versus plenum length made by a Gaussian process incorporated into the xFDF framework for different proportions of the full dataset. The data points were selected using the GP sampling method outlined in the methodology. Table 1 shows the R2 scores for xFDF datasets and limit cycle calculations featured in Fig. 11.

Fig. 11
Fig. 11
Table 1

R2 scores for xFDF predictions and limit cycle calculations of reduced datasets evaluated against the full datasets

R2 score
Proportion of data14%24%50%
xFDF gain prediction fundamental0.590.960.99
xFDF phase prediction fundamental0.991.001.00
xFDF gain prediction first harmonic−0.04−0.190.84
xFDF phase prediction first harmonic0.620.540.92
Limit cycle calculation fundamental0.040.690.99
Limit cycle calculation first harmonic−1.38−0.410.98
R2 score
Proportion of data14%24%50%
xFDF gain prediction fundamental0.590.960.99
xFDF phase prediction fundamental0.991.001.00
xFDF gain prediction first harmonic−0.04−0.190.84
xFDF phase prediction first harmonic0.620.540.92
Limit cycle calculation fundamental0.040.690.99
Limit cycle calculation first harmonic−1.38−0.410.98

In Fig. 11 (a), 14% of the data is used to train the GP. The limit cycle calculations for the fundamental and first harmonic do not follow the trend of the full data at all, which is not surprising given the poor R2 scores in the interpolation of the xFDFs (a negative score means that the model prediction is worse than a predictor that just predicts the mean value). The second plot, with 24% of the data, shows that both the fundamental and first harmonic follow the trend of the full data, however their values differ significantly from the full data, as indicated by their poor R2 scores in 1. The final plot shows the limit cycle amplitudes with 50% of the data. The fit here is very good and the fundamental and first harmonic limit cycle calculations give an R2 score of 0.99 and 0.98, indicating an extremely good fit.

Plot (b) demonstrates that a good score in the interpolation of an FDF is not a sufficient criterion to ensure that there will be very little error introduced in the system of interest. This is shown as the fundamental xFDF has a very high score but the score for the limit cycle is quite poor. This is due to the metric being an indicator for the global error across the training domain, while the calculation of the limit cycle amplitude only uses a part of this data, the frequency along which the unstable mode exists, where there might be large errors locally. This, in turn, means that the limit cycle amplitudes can be very well reproduced using an FDF that has a poor R2 score measured over the whole FDF. This is seen in the reproduction of the first harmonic of Plot (c), where the R2 score on the gain of the first harmonic is relatively poor, yet the limit cycle calculation is very good. This suggests that the development error is a better indicator of the performance of system parameters, as it is more conservative and also looks at smaller samples of the training domain making it more strongly affected by large local errors.

Figure 12 shows the impact of the error from the extrapolation of the xFDF, relative to a CFD result. In the extrapolation regime, the cubic spline predicts an amplitude of 2. The GP value remains close to the CFD value up to an amplitude of 2.1; however, it shows a different trend to the CFD as soon as it leaves the interpolation regime. The EQL follows the trend of the CFD the furthest, up to an amplitude of 2.08; after this point, it jumps to a different solution that does not match the CFD. The linear cubic spline and the GPs both show good qualitative agreement with the CFD result to plenum lengths of 0.33 m. After a length of 0.33 m however, the xFDF interpolation regime is exceeded and the prediction departs from the trend exhibited by the CFD. In the case of the cubic spline, the resulting value is constant due to its constant value extrapolation, which prevents a solution to Eq. (8) from being found. The GP also begins to deviate from the CFD result at this point. The EQL performed the best in the extrapolation regime but only to an amplitude exceeding the domain by 4%. The relatively poor extrapolation performance compared to the algorithm comparison is likely due to the shape of the data. The FDF from Silva et al. [19] showed small gradients at high amplitudes, favoring simple constant value models; however, the shape of higher harmonic xFDF is more complicated than this, as was observed by Haeringer et al. [3]. The more complex data shape makes it difficult for the algorithms to predict outside of the training regime. This confirms the well documented result that extrapolation should be treated with caution, however suggests that extrapolation very slightly outside of the domain can give reasonable results.

Fig. 12
Fig. 12

### Other Algorithm Attributes.

Splines and GPs stand out from the other algorithms in that they do not require an exhaustive search over parameters using cross-validation in order to find their best hyperparameters, as discussed in their descriptions. This makes creating the predictions a more straightforward process for practitioners, requiring little knowledge of the inner workings of the algorithm to make it work effectively. This is a very important aspect that should be taken into account when choosing an algorithm.

One advantage of the Gaussian process modeling that has been touched on previously is that every prediction it makes is accompanied by a confidence interval (CI). This CI incorporates information representing the smoothness relative to the density of data points, the noise associated with the data, and the distance from the prediction domain. The GP after training actually represents a multivariate Gaussian distribution at each evaluation point. This means that samples can be drawn from the distribution to model sensitivity of the system to changes in the FDF. As means of demonstration, was done for 10,000 samples of the first and second harmonics of the xFDF limit cycle calculation and is shown in Fig. 13. The size of the confidence interval shown is made up of a contribution from the GP model CI as well as the model sensitivity. Since there exist other uncertainties that the CI here does not account for (FDF measurements, xFDF or low-order network model assumptions, etc.), it does not cover the CFD solution. In the case of the fundamental, the CI increases as the GP enters the extrapolation regime, largely due to the increase in GP CI. However, in the case of the first harmonic the GP operates solely in the interpolation regime, so the changes in CI are largely due to the changing sensitivity of the amplitudes to changes in the xFDF. Being able to carry out this analysis in a statistically robust manner is not possible with traditional regression methods, which makes the GP a very powerful tool.

Fig. 13
Fig. 13

Although not implemented in this study, the GP also allows practitioners to incorporate known information about a system in the form of a prior. As mentioned in the introduction, there are some assumptions that tend to hold in general for fundamental mode FDFs. These assumptions can be encoded into the GP before training in order to help it converge faster to a solution, requiring even fewer measurement points.

## Conclusion

This paper has compared several commonly used machine learning algorithms in their ability to interpolate and extrapolate FDFs. It has been found that the cubic and linear spline interpolation is matched only by Gaussian processes in interpolating FDFs, while the latter is preferred due to its inherent ability to provide an uncertainty with its prediction. The two main causes for errors in interpolation were discussed, and a method using GPs for designing an experiment while minimizing the number of data points was demonstrated. This method was used to simulate increasing levels of interpolation error, the impact of which on the calculated limit cycle amplitudes was evaluated. It was demonstrated that satisfying the metric for enough data points was sufficient to give good agreement between the full dataset and the reduced dataset.

In extrapolation, predictions made far outside the domain were very poor, as they require inherent assumptions in the behavior of the data which are not general. The assumption of linearity gave reasonable R2 scores up to amplitudes of 1.2–1.4 times the domain on the experimental dataset, but this was due to the simplicity of the data. Despite these levels of performance in the extrapolation test, all algorithms performed poorly when extrapolating in the applied case, with only the EQL showing qualitative agreement at amplitudes exceeding the domain, and then only by 4%. This demonstrates that the assumptions of linearity outside the domain do not hold well. In addition to indicating distance from the domain, the GP was used to demonstrate the sensitivity of the amplitudes to changes in the (interpolated and extrapolated) xFDF, which is a very useful property for conducting uncertainty analysis.

The well-known methods of linear and cubic interpolation, along with Gaussian processes, have been shown to be the best performing algorithms for the problem of interpolating data. It has also been presented that the algorithm is not the only important factor and that the data used for interpolation is just as important, the selection of which can be improved by GPs. These methods are therefore recommended for the interpolation of FDFs, while extrapolation outside of the domain is not. Practitioners should, therefore, allocate resources to extending the boundaries of the training domain in order to avoid extrapolation errors and then fill it using GP data selection methods.

## Funding Data

• European Union Horizons 2020 Marie Sklodowska-Curie (Grant No. 766264; Funder ID: 10.13039/100010665).

• Research Association for Combustion Engines (Forschungsvereinigung Verbrennungskraftmaschinen e.V. FVV, Project No. 6012700; Funder ID: 10.13039/501100003162).

## Nomenclature

• CFD =

computational fluid dynamics

•
• CI =

confidence interval

•
• EQL =

the equation learner

•
• FDF =

flame describing function

•
• FTF =

flame transfer function

•
• GP =

Gaussian processes

•
• MSE =

mean squared error

•
• NN =

neural network

•
• R2 =

coefficient of determination

•
• RBFNN =

•
• W =

algorithm weight matrix

•
• X =

input matrix

•
• y =

output vector

•
• $ŷ$ =

algorithm prediction

•
• $y¯$ =

output mean

## References

1.
Noiray
,
N.
,
Durox
,
D.
,
Schuller
,
T.
, and
Candel
,
S.
,
2008
, “
A Unified Framework for Nonlinear Combustion Instability Analysis Based on the Flame Describing Function
,”
J. Fluid Mech.
,
615
, pp.
139
167
.10.1017/S0022112008003613
2.
Palies
,
P.
,
Durox
,
D.
,
Schuller
,
T.
, and
Candel
,
S.
,
2010
, “
The Combined Dynamics of Swirler and Turbulent Premixed Swirling Flames
,”
Combust. Flame
,
157
(
9
), pp.
1698
1717
.10.1016/j.combustflame.2010.02.011
3.
Haeringer
,
M.
,
Merk
,
M.
, and
Polifke
,
W.
,
2018
, “
Inclusion of Higher Harmonics in the Flame Describing Function for Predicting Limit Cycles of Self-Excited Combustion Instabilities
,”
Proc. Combust. Inst.
, 37(4), pp.
5255
5262
.10.1016/j.proci.2018.06.150
4.
Jaensch
,
S.
,
Merk
,
M.
,
Gopalakrishnan
,
E.
,
Bomberg
,
S.
,
Emmert
,
T.
,
Sujith
,
R.
, and
Polifke
,
W.
,
2017
, “
Hybrid CFD/Low-Order Modeling of Nonlinear Thermoacoustic Oscillations
,”
Proc. Combust. Inst.
,
36
(
3
), pp.
3827
3834
.10.1016/j.proci.2016.08.006
5.
Kornilov
,
V. N.
,
Rook
,
R.
,
ten Thije Boonkkamp
,
J. H. M.
, and
de Goey
,
L. P. H.
,
2009
, “
Experimental and Numerical Investigation of the Acoustic Response of Multi-Slit Bunsen Burners
,”
Combust. Flame
,
156
(
10
), pp.
1957
1970
.10.1016/j.combustflame.2009.07.017
6.
Goodfellow
,
I.
,
Bengio
,
Y.
, and
Courville
,
A.
,
2016
,
Deep Learning
,
MIT Press
, Cambridge, MA.
7.
Kuhn
,
M.
,
Bengio
,
Y.
, and
Johnson
,
K.
,
2013
,
Applied Predictive Modeling
,
Springer-Verlag
,
New York
.
8.
Rasmussen
,
C. E.
,
2004
, “
Gaussian Processes in Machine Learning
,”
In Advanced Lectures on Machine Learning
,
Springer,
Berlin, pp.
63
71
.
9.
Wasserman
,
P. D.
,
1989
,
Neural Computing: Theory and Practice
,
Van Nostrand Reinhold
,
New York
.
10.
Wu
,
Y.
,
Wang
,
H.
,
Zhang
,
B.
, and
Du
,
K.
,
2012
, “
Using Radial Basis Function Networks for Function Approximation and Classification
,”
ISRN Appl. Math.
,
2012
, p.
1
.10.5402/2012/324194
11.
Oh
,
S.
,
Ji
,
H.
, and
Kim
,
Y.
,
2017
, “
FDF-Based Combustion Instability Analysis for Stabilization Effects of a Slotted Plate in a Multiple Flame Combustor
,”
Aerosp. Sci. Technol.
,
70
, pp.
95
107
.10.1016/j.ast.2017.07.045
12.
Boser
,
B. E.
,
Guyon
,
I. M.
, and
Vapnik
,
V. N.
,
1992
, “
A Training Algorithm for Optimal Margin Classifiers
,”
Proceedings of the Fifth Annual Workshop on Computational Learning Theory
, COLT ‘92, ACM, New York, pp.
144
152
.
13.

Martius
,
G.
, and
Lampert
,
C. H.
,
2016
, “
Extrapolation and Learning Equations
,” Fifth International Conference on Learning Representations Workshop Track Proceedings (ICLR), Toulon, France.
14.
Schmidt
,
M.
, and
Lipson
,
H.
,
2009
, “
Distilling Free-Form Natural Laws From Experimental Data
,”
Science
,
324
(
5923
), pp.
81
85
.10.1126/science.1165893
15.
Hall
,
C. A.
, and
Meyer
,
W.
,
1976
, “
Optimal Error Bounds for Cubic Spline Interpolation
,”
J. Approximation Theory
,
16
(
2
), pp.
105
122
.10.1016/0021-9045(76)90040-X
16.
Pedregosa
,
F.
,
Varoquaux
,
G.
,
Gramfort
,
A.
,
Michel
,
V.
,
Thirion
,
B.
,
Grisel
,
O.
,
Blondel
,
M.
,
Prettenhofer
,
P.
,
Weiss
,
R.
,
Dubourg
,
V.
,
Vanderplas
,
J.
,
Passos
,
A.
,
Cournapeau
,
D.
,
Brucher
,
M.
,
Perrot
,
M.
, and
Duchesnay
,
E.
,
2011
, “
Scikit-Learn: Machine Learning in Python
,”
J. Mach. Learn. Res.
,
12
, pp.
2825
2830
.10.5555/1953048.2078195
17.
Apley
,
D. W.
,
Liu
,
J.
, and
Chen
,
W.
,
2006
, “
Understanding the Effects of Model Uncertainty in Robust Design With Computer Experiments
,”
ASME J. Mech. Des.
,
128
(
4
), pp.
945
958
.10.1115/1.2204974
18.
O'Hagan
,
A.
,
2006
, “
Bayesian Analysis of Computer Code Outputs: A Tutorial
,”
Reliab. Eng. Syst. Saf.
,
91
(
10
), pp.
1290
1300
.10.1016/j.ress.2005.11.025
19.
Silva
,
C.
,
Nicoud
,
F.
,
Schuller
,
T.
,
Durox
,
D.
, and
Candel
,
S.
,
2013
, “
Combining a Helmholtz Solver With the Flame Describing Function to Assess Combustion Instability in a Premixed Swirled Combustor
,”
Combust. Flame
,
160
(
9
), pp.
1743
1754
.10.1016/j.combustflame.2013.03.020