Abstract
A traditional ensemble approach to predicting the remaining useful life (RUL) of equipment and other assets has been to construct data-driven and model-based ensembles from identical predictors. This ensemble approach may perform well on high-quality data collected from laboratory tests but may ultimately fail when deployed in the field because of higher-than-expected noise, missing measurements, and differing degradation trends. In such field environments, the high similarity of the predictors can lead to large under- or overestimates of RUL, where the ensemble is only as accurate as the predictor that erred the least. In response, we investigate whether an ensemble of diverse predictors can predict RUL consistently and accurately by dynamically aggregating the predictions of algorithms that perform differently under the same conditions. We propose improving ensemble model performance by (1) using a combination of diverse learning algorithms that are found to perform differently under the same conditions and (2) training a data-driven model to adaptively estimate the prediction weight each predictor receives. The proposed methods are compared to three existing ensemble prognostics methods on open-source run-to-failure datasets from two systems popular in prognostics research: lithium-ion batteries and rolling element bearings. Results indicate the proposed ensemble method provides the most consistent prediction accuracy and uncertainty estimation quality across multiple test cases, whereas the individual predictors and ensembles of identical predictors tend to produce overconfident predictions.
1 Introduction
Accurately predicting the remaining useful life (RUL) of equipment and other assets deployed in the field is critically important, especially in settings where failure can lead to costly downtime and threaten the safety of operators [1–3]. Determining when maintenance may be needed requires first defining (1) a health index (HI), typically derived from direct measurements (voltage, vibration, sound, power, etc.) of the equipment in question, and (2) a suitable HI threshold, that once exceeded, the equipment is deemed to have failed [4–6]. Note that we changed the term “health indicator” used in the conference paper [7] to “health index” in this paper for better consistency with earlier relevant studies on ensemble prognostics. The health of the asset is then tracked with respect to the end-of-life (EOL) threshold, and the RUL is generally estimated in units of time for easy interpretation of when maintenance may need to be performed [8,9].
Various model-based and data-driven methods have been proposed to accurately estimate the RUL of equipment operating in the field. However, accurately estimating RUL is challenging because the recorded measurements are often clouded with noise, and the degradation trajectory is highly influenced by the operating conditions. The traditional model-based approach to estimating RUL has been to update and extrapolate a mathematical model, such as a stochastic process model [10,11], empirical model [12,13], or physics-based model [14–16], which represents the observed HI trajectory (or degradation trajectory). This is usually done using a filtering algorithm such as a Kalman filter (KF) [17–19], a particle filter [8,20,21], or one of their variants [22–25]. In this approach, the optimal model hyperparameters (variance terms, empirical model parameters) are inferred from earlier measurements and are updated as more data are collected. Data-driven methods are another popular category of methods for predicting RUL. In this approach, readily available offline run-to-failure data are used to train a machine learning model, which is then used to predict the RUL of an online test unit by leveraging the correlations learned from the offline training data. Machine learning techniques are able to predict RUL by directly mapping from input features to RUL or by extrapolating the historical HI/degradation parameter measurements into the future. Popular data-driven machine learning models used for RUL prediction include artificial neural networks [26–28], support/relevance vector machines (S/RVMs) [29–31], convolution neural networks [32–34], and recurrent neural networks (RNNs) [35–37]. Each of these models has been shown to provide accurate RUL predictions, and some models have even been trained to estimate RUL uncertainty as well. 
However, the limited overall accuracy and adaptability of single-model RUL prediction methods, whether model-based or data-driven, remain a key weakness of these methods. In many cases, the observed degradation trends of equipment and other assets are highly nonlinear and time-varying, making it difficult for a single model to remain accurate over the entire lifespan of the asset.
To address the issues associated with single-model prognostics methods, researchers have proposed forming ensemble predictors, where the final RUL prediction is a combination of the predictions from individual models. Researchers in Refs. [35,36] proposed using an ensemble of long short-term memory (LSTM) RNN models where the final prediction was aggregated using a Bayesian method. Similar to Refs. [35,36], our earlier work in Ref. [38] used two separate ensembles of LSTM RNNs. In this method, the first ensemble predicted the RUL, and the second ensemble applied a correction based on the current value of the HI. The two-step prediction process was found to be more accurate than a single ensemble because the error correction could be learned effectively from available offline run-to-failure data.
Despite the promise shown in our earlier work, we observed that under certain degradation conditions, the use of an ensemble did not improve accuracy, as all the predictors in the ensemble would either under- or overestimate the RUL (depending on conditions). The lack of diversity among the predictors in the ensemble and the simple model-averaging scheme used to combine the predictions sometimes led to errant RUL predictions. To improve model robustness and prevent the aforementioned scenarios, researchers have proposed increasing ensemble diversity by including multiple different types of model-based and data-driven models. Likewise, new adaptive weighting schemes have been proposed to replace the more traditional equal-weight averaging methods. In Ref. [39], researchers tested three different weighting schemes for combining predictors in a prognostic ensemble. The three weighting schemes, namely accuracy-based, diversity-based, and optimization-based, were used to pre-compute a set of weights for each predictor in an ensemble. During online prediction, the weights remain constant, a scheme known as "degradation-independent weighting." These degradation-independent weighting schemes can improve prognostic accuracy when (1) each predictor in an ensemble performs similarly across the lifetime of a training/test unit, and (2) the training and test data have similar and uniform degradation trajectories. In a later study, researchers in Ref. [40] proposed a "degradation-dependent" weighting scheme, designed to update the weights for each ensemble predictor depending on the system's health status. The proposed method categorized the system's health index into three regions and precomputed an optimal set of weights for combining the predictors' predictions in an ensemble. The adaptive three-stage weighting scheme outperformed a traditional ensemble method because the weights were optimized considering the additional dimension of health status.
A limitation of this degradation-dependent weighting scheme, or more precisely, degradation stage-dependent weighting scheme, is that it requires dividing the lifetime of a training/test unit into a finite number of discrete degradation stages, which inevitably involves subjectivity in choosing the number of degradation stages and picking the HI thresholds for defining these stages. Additionally, the weight of each predictor in an ensemble is kept as a constant within each degradation stage, and due to the use of an often small number (≤ 5) of degradation stages, the weight versus HI relationship is limited to a piecewise constant function, which may not be optimal in complex engineering applications.
Most similar to the methods and ideas proposed in this paper are those from Refs. [41,42]. In Ref. [41], researchers investigated an ensemble of LSTM RNN models where each model was trained using a different-length sliding window of historical degradation measurements to increase prediction diversity and improve ensemble accuracy. The method was shown to outperform single models and ensembles of RNNs with an identical window size of input data. In Ref. [42], the researchers investigated whether better RUL prediction performance can be achieved by combining predictions from diverse individual models that overestimate and underestimate the RUL. The authors also implemented a degradation-dependent weighting scheme by assigning model weight based on the current perceived level of degradation.
To further improve prognostic model adaptability and RUL prediction accuracy, we propose using an ensemble of diverse predictors where the final RUL prediction is adaptively combined using a set of weights determined by a data-driven model. Similar to Refs. [40,42], our method aims to consider the effects of degradation on the accuracy of the individual prognostic models in the ensemble. At any given time during prediction, the predictors in an ensemble are designed to produce different RUL predictions, either consistently greater than or less than the true RUL. By producing RUL predictions that are both greater and less than the true RUL, there exists an optimal weighted sum of the predictors’ predictions that exactly equals the true RUL. To exploit this, we propose offline training of a Gaussian process regression (GPR) surrogate model to learn the optimal weight each predictor should receive throughout the duration of the RUL prediction. The use of a GPR surrogate model is one minor difference from the conference paper [7] where we had previously used a feedforward neural network. Then, during online operation, the trained surrogate model is used to estimate the weight each predictor should receive, effectively improving the overall accuracy of the ensemble. We consider three base RUL predictors to study the interaction between model diversity and the health index, where the GPR surrogate model is used to learn the dependence. The three models, an exponential unscented Kalman filter (EUKF), a GPR forecasting model, and an LSTM RNN, differ in how they predict RUL. Note that the GPR forecasting model is different from the GPR surrogate model (from this point on we refer to the GPR surrogate model as just a “surrogate model” to avoid confusion). We would like to note that in the conference version of this paper [7], we proposed the dynamic weighting scheme for two diverse models: EUKF and GPR. 
In the journal version, we further developed the conference paper [7] by (1) enhancing the generality of the model by including an LSTM RNN as a third diverse model, (2) comparing our dynamic degradation-aware weighting scheme to other weighting schemes from the literature, and (3) providing source code (GitHub link) for the literature-based weighting schemes evaluated on an open-source battery dataset.
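To make the dynamic weighting idea concrete, the sketch below maps the current health index to normalized weights for the three predictors and forms the weighted RUL. A minimal kernel regressor stands in for the paper's GPR surrogate; the kernel choice, length scale, jitter, and clip-and-normalize step are our illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel between 1-D arrays of HI values."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

class WeightSurrogate:
    """Toy GP-style regressor mapping health index -> per-model weights.

    Stands in for the paper's GPR surrogate model; hyperparameters
    here are illustrative assumptions."""

    def fit(self, hi_train, w_train, noise=1e-4):
        self.hi_train = np.asarray(hi_train, dtype=float)
        K = rbf_kernel(self.hi_train, self.hi_train)
        # One weight column per predictor (EUKF, GPR, LSTM)
        self.alpha = np.linalg.solve(K + noise * np.eye(len(K)),
                                     np.asarray(w_train, dtype=float))
        return self

    def predict_weights(self, hi):
        k = rbf_kernel(np.atleast_1d(np.asarray(hi, dtype=float)), self.hi_train)
        w = np.clip(k @ self.alpha, 0.0, None)   # keep weights nonnegative
        return w / w.sum(axis=1, keepdims=True)  # normalize to sum to 1

def ensemble_rul(hi, rul_preds, surrogate):
    """Weighted-sum RUL from the three diverse predictors."""
    w = surrogate.predict_weights(hi)[0]
    return float(w @ np.asarray(rul_preds, dtype=float))
```

Because the weights are nonnegative and sum to one, the ensemble RUL is always a convex combination of the individual predictions.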
The remainder of the paper is organized as follows. Section 2 covers the methods and implementation of the proposed dynamically weighted ensemble and other literature-based weighting schemes. Section 3 outlines the two open-source datasets used to evaluate the performance of the proposed method. Section 4 discusses the results, and Sec. 5 presents concluding remarks.
2 Methods
In this section, we formulate the proposed dynamic degradation-dependent ensemble (DDDEn) along with three other contemporary ensemble methods.
2.1 Proposed Approach (DDDEn).
The effective RUL at each time instance is obtained by weighting the individual model predictions. Finally, a surrogate model is built to predict the predictor weights at each time instance.
Although the calculation of RMSE depends only on the mean RUL, we also evaluate the uncertainty quantification capabilities of all the models. Further, the stochastic nature of the EUKF provides a good estimate of epistemic uncertainty. Therefore, during weighting, we consider an ensemble of EUKF, GPR, and LSTM predictors.
2.2 Degradation-Independent Ensemble.
2.3 Degradation Stage-Dependent Ensemble.
In the two case studies presented later, we divide the HI into a total of S = 3 stages, with stage 3 being the closest to EOL.
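A degradation stage-dependent lookup can be sketched as below. The HI thresholds and per-stage weight vectors are hypothetical placeholders; in practice they are precomputed per application, which is exactly the subjectivity noted above.

```python
import numpy as np

# Hypothetical normalized-HI cutoffs splitting life into S = 3 stages
# (stage 3 nearest EOL); real thresholds are application-specific.
STAGE_THRESHOLDS = [0.93, 0.86]

# One precomputed weight vector (EUKF, GPR, LSTM) per stage; the weights
# are piecewise constant within each stage. Values are illustrative.
STAGE_WEIGHTS = np.array([
    [0.2, 0.3, 0.5],    # stage 1: early life
    [0.1, 0.45, 0.45],  # stage 2
    [0.1, 0.7, 0.2],    # stage 3: near EOL
])

def stage_of(hi):
    """Map a health-index value to its degradation stage (1-indexed)."""
    for s, t in enumerate(STAGE_THRESHOLDS):
        if hi > t:
            return s + 1
    return len(STAGE_THRESHOLDS) + 1

def dsden_predict(hi, rul_preds):
    """Stage-dependent ensemble: constant weights within the current stage."""
    w = STAGE_WEIGHTS[stage_of(hi) - 1]
    return float(w @ np.asarray(rul_preds, dtype=float))
```

The piecewise-constant weight-versus-HI relationship this produces is the limitation the dynamic scheme is designed to remove.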
2.4 Diverse Predictors.
The ensemble techniques described in Secs. 2.1–2.3 require diverse individual models for RUL prediction. Although the selection of individual predictors can depend on the type of dataset, appropriate model selection is nevertheless required to ensure model diversity and prediction accuracy. Model diversity can stem from (1) prediction bias (some models often either under- or over-predict RUL) or (2) generalization performance (different models generalize better to different test samples that fall outside a training distribution), either consistently throughout the lifetime or consistently during one or multiple degradation stages. When model diversity comes from prediction bias, it is desirable to have a combination of both over- and under-predicting models, for which the weighting schemes could theoretically recover the true RUL, thereby increasing accuracy. As noted earlier, individual models may not be consistently biased towards under- or over-prediction during the entire lifetime. However, if there is such a bias in one or multiple local degradation stages, the DDDEn and degradation stage-dependent ensemble (DSDEn) models, which take the HI as an input, can assign weights to those local regions and achieve better overall performance. In this section, we briefly present three diverse predictors, namely EUKF, GPR, and LSTM, which we use for the two case studies. However, the applicability of this study's findings is not limited to these three diverse predictors.
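When one model consistently under-predicts and another consistently over-predicts, the existence of an exact convex combination is easy to verify; a small worked example with illustrative values:

```python
def exact_weight(r_under, r_over, r_true):
    """Convex weight on the under-predictor that recovers the true RUL.

    Valid whenever r_under <= r_true <= r_over, i.e., when the true RUL
    is bracketed by an under- and an over-predicting model."""
    return (r_over - r_true) / (r_over - r_under)

# Illustrative predictions: one model under-predicts, one over-predicts.
r_under, r_over, r_true = 80.0, 130.0, 100.0
w = exact_weight(r_under, r_over, r_true)          # 0.6
combined = w * r_under + (1 - w) * r_over          # recovers 100.0
```

This bracketing argument is what the surrogate-learned weights exploit; without it, no weighted average can reach the true RUL.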
2.4.1 EUKF for Remaining Useful Life Prediction.
2.4.2 Model-Based Remaining Useful Life Prediction Using Gaussian Process.
Above, denotes the parameters of an empirical degradation model (e.g., an empirical capacity fade model for the battery case study) which are to be determined from the training dataset , σ2 is the noise variance of the training data, and are the HI measurements and their corresponding time index from the dataset , and and are the predicted mean and variance of the HI at future time index , respectively.
2.4.3 LSTM for Remaining Useful Life Prediction.
LSTM RNN models excel at time series forecasting because their cell architecture consists of an internal memory gate that stores time-dependent information relevant to future predictions. Additionally, LSTM models can be flexibly scaled to process big data or run locally on a microcontroller [45]. For these reasons, LSTM models have been extensively studied for RUL prediction of lithium-ion batteries [36,46], rolling element bearings [38,47], and other engineered systems [48–50].
3 Datasets
3.1 Battery Dataset.
The dataset published in Ref. [51] consists of 124 commercial lithium iron phosphate/graphite (LFP) cells manufactured by A123. The complementary study published in Ref. [52] included an additional 45 LFP cells as part of the same experiments. The authors divided the cumulative dataset into four batches of roughly 45 cells each. We adopted the same data partitioning, and the four batches are denoted: training (41 cells), primary test (43 cells), secondary test (40 cells), and tertiary test (45 cells). One primary test cell experienced extremely fast degradation and was removed from the dataset following a recommendation by the authors in Ref. [51]. In this study, we treated the cell discharge capacity as the HI and tracked RUL with respect to an HI threshold. Prior to use, the discharge capacities of all cells in the dataset were normalized by dividing every measurement by the value of the first measurement. This treatment ensures that every cell starts at a normalized capacity of 1. Prediction of cell RUL begins when the normalized capacity falls below 97% of its initial value and ends when the normalized capacity reaches 80%. Some of the cells do not reach the normalized capacity threshold of 80%. For these cells, we linearly extrapolated the last 50 normalized discharge capacity measurements until they reached the EOL threshold (roughly 50 additional cycles). Since many of the cells in this dataset exceed 1000 cycles before EOL, we subsampled the dataset by a factor of five, which effectively equates to performing RUL prediction every fifth cycle. The cells from the dataset are shown in Fig. 2.
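The preprocessing steps above can be sketched as follows. The function and parameter names are ours, and the extrapolation is a simplified reading of the procedure (a least-squares line through the last 50 points, extended until the EOL threshold is crossed):

```python
import numpy as np

def preprocess_capacity(capacity, start_frac=0.97, eol_frac=0.80,
                        extrap_window=50, subsample=5):
    """Normalize, extrapolate to EOL if needed, and subsample one cell's
    discharge-capacity trajectory (sketch of the procedure in the text)."""
    q = np.asarray(capacity, dtype=float)
    q = q / q[0]                                  # normalized capacity starts at 1
    if q[-1] > eol_frac:                          # cell never reached 80%: extrapolate
        n = min(extrap_window, len(q))
        x = np.arange(len(q) - n, len(q))
        slope, intercept = np.polyfit(x, q[-n:], 1)
        ext, cyc = [], len(q)
        while slope < 0 and (intercept + slope * cyc) > eol_frac:
            ext.append(intercept + slope * cyc)
            cyc += 1
        if slope < 0:
            ext.append(intercept + slope * cyc)   # first point at/below EOL
        q = np.concatenate([q, ext])
    start = int(np.argmax(q <= start_frac))       # first prediction cycle (97%)
    return q[start::subsample]                    # predict every 5th cycle
```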
3.2 Bearing Dataset.
The Xi'an Jiaotong University and Changxing Sumyoung Technology Co., Ltd. (XJTU-SY) bearing dataset consists of run-to-failure vibration data of 15 rolling element bearings [53] (Table 1). The failure of these bearings was accelerated by applying large radial loads. The 15 bearings were divided into three groups of five bearings where each group was subject to a specific radial load and rotational speed. Two accelerometers mounted in the vertical (y) and horizontal (x) directions were used to gather vibration data for each bearing. Data were collected for 1.28 s every minute at a sampling frequency of 25.6 kHz. The first prediction time and EOL of each bearing were calculated using a threshold-based method described in Ref. [38].
4 Results and Discussion
In this study, we compare the RUL prediction accuracy and uncertainty quantification of the following models: (1) single EUKF, (2) single GPR with quadratic trend function, (3) ensemble of EUKF (En-EUKF), (4) standard ensemble of EUKF and GPR (En-EUKF + GPR) with equal importance to both En-EUKF and GPR, (5) DDDEn of En-EUKF and GPR (DDDEn-EUKF + GPR), (6) single LSTM, (7) simple, equally-weighted ensemble of all models, viz. En-EUKF, GPR, and En-LSTM, named as En-all, (8) degradation-independent ensemble (DIEn), (9) DSDEn, and (10) DDDEn. The first five models and results are exactly the same as those in our conference paper [7]. The additional models and results are an extension of the conference paper by adding an LSTM model and comparing our weighting scheme with other optimization-based weighting schemes from the prognostics literature. For all these models, hyperparameter optimization has been carried out as described in Sec. 2.1. These models are compared by evaluating the metrics described in Sec. 4.1 on the test dataset.
4.1 Prognostic Metrics.
We briefly introduce a few other metrics which are useful in assessing the performance of an RUL prediction algorithm. We first define an accuracy zone around the true RUL using a threshold value, α. Briefly, any prediction within the accuracy zone is considered accurate and would be acceptable in the field. To assess a model's ability to closely predict the true RUL, we define α-accuracy as the percentage of RUL predictions that are within the accuracy zone. Likewise, to assess a model's uncertainty quantification performance, we calculate β as the average probability mass of the predicted RUL probability density function (PDF) which covers the α-accuracy zone. Ideal scores for α-accuracy and β are 100% and 1.0, respectively. These metrics are shown graphically in Fig. 3.
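Under the convention that the accuracy zone spans ±α times the true RUL and that each predicted RUL PDF is Gaussian (both are our assumptions for this sketch), the two metrics can be computed as:

```python
import numpy as np
from math import erf, sqrt

def alpha_accuracy(rul_true, rul_pred, alpha=0.2):
    """Percentage of RUL predictions falling inside the accuracy zone."""
    rul_true = np.asarray(rul_true, dtype=float)
    rul_pred = np.asarray(rul_pred, dtype=float)
    inside = np.abs(rul_pred - rul_true) <= alpha * rul_true
    return 100.0 * float(np.mean(inside))

def _gauss_cdf(x, mean, std):
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

def beta_probability(rul_true, mu, sigma, alpha=0.2):
    """Average probability mass of each Gaussian RUL PDF inside the zone."""
    rul_true = np.asarray(rul_true, dtype=float)
    lo, hi = rul_true * (1 - alpha), rul_true * (1 + alpha)
    mass = [_gauss_cdf(h, m, s) - _gauss_cdf(l, m, s)
            for l, h, m, s in zip(lo, hi, mu, sigma)]
    return float(np.mean(mass))
```

With these definitions, a model scores well on β only when its predicted PDFs concentrate their mass inside the accuracy zone, not merely when their means land there.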
Although NLL is defined at each prediction instance n, in comparing the models, we show the median of NLL over the entire test set.
4.2 Remaining Useful Life Prediction of Lithium-Ion Batteries.
Results for the proposed method on the open-source battery dataset were obtained via five repeated runs where the model hyperparameters were reoptimized each run. The results were averaged over the repetitions and the test cells in each repetition.
Summary of bearings and testing conditions in XJTU-SY dataset
| Operating condition | Bearing IDs | Speed (rpm) | Radial force (kN) |
|---|---|---|---|
| 1 | 1_1, 1_2, 1_3, 1_4, 1_5 | 2100 | 12 |
| 2 | 2_1, 2_2, 2_3, 2_4, 2_5 | 2250 | 11 |
| 3 | 3_1, 3_2, 3_3, 3_4, 3_5 | 2400 | 10 |
Figure 4 shows a snapshot of the capacity trajectory prediction(s) for primary test cell #6 by each model considered in this study. We also visualize each model's predicted RUL PDF(s). The single models (GPR, EUKF) tend to produce narrow RUL PDFs, while the ensemble models (En-EUKF, En-EUKF + GPR) produce wide PDFs. In particular, the RUL PDF prediction for the proposed method (En-EUKF + GPR) is observed to span the entirety of the two individual predictors’ PDFs. This is because the ensemble performs a weighted sum of the Gaussian PDFs, where the resulting Gaussian mixture maintains the absolute span in the RUL PDF of each model in the mixture. This improves the model's ability to estimate uncertainty, as discussed further below.
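The span-preserving behavior of the weighted Gaussian mixture can be checked from its first two moments; a small sketch, assuming Gaussian components:

```python
import numpy as np

def mixture_moments(weights, mus, sigmas):
    """Mean and std of a weighted Gaussian mixture of per-model RUL PDFs.

    The mixture variance includes the spread of the component means, which
    is why the ensemble PDF spans the individual predictors' PDFs."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = np.asarray(mus, dtype=float)
    s = np.asarray(sigmas, dtype=float)
    mean = float(w @ mu)
    var = float(w @ (s ** 2 + mu ** 2) - mean ** 2)  # law of total variance
    return mean, float(np.sqrt(var))
```

For two equally weighted components centered at different RULs, the mixture standard deviation exceeds either component's, reflecting the wider, less overconfident ensemble PDF discussed above.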
Detailed results for the proposed method and comparable methods are shown in Table 2. The results for the proposed method are shown in the far-right column in each grouping (conference paper and journal extension). Each performance metric is shaded for better comparison—the darker the shade, the better the model's performance for that metric. It is immediately evident that the proposed method performed exceptionally well at predicting the RUL of lithium-ion batteries. The use of diverse predictors in the ensemble, combined with the adaptive weighting methodology, effectively reduced the RMSE relative to the other methods, particularly the individual models. The LSTM model has significantly better performance than the other individual models; therefore, adding LSTM into the ensemble further improves the performance. Although the RMSEs of the optimization-based weighting models like the DIEn and DSDEn are similar to DDDEn, DDDEn has superior performance in metrics like α-accuracy and NLL. This indicates that the more complex dynamic weighting method significantly improves other aspects of the RUL prediction, like timeliness (α-accuracy) and prediction uncertainty (NLL).
Evaluation metrics for various models on training and three test datasets
The first five model columns (EUKF through DDDEn-EUKF + GPR) correspond to the conference paper; the remaining five (LSTM through DDDEn) to the journal extension.

| | EUKF | GPR | En-EUKF | En-EUKF + GPR | DDDEn-EUKF + GPR | LSTM | En-all | DIEn | DSDEn | DDDEn |
|---|---|---|---|---|---|---|---|---|---|---|
| **RMSE (cycles)** | | | | | | | | | | |
| Training | 30.0 | 29.4 | 28.4 | 27.7 | 25.1 | 33.7 | 26.6 | 28.1 | 27.7 | 24.2 |
| Test 1 | 43.1 | 43.9 | 39.2 | 39.1 | 32.3 | 51.1 | 38.5 | 41.1 | 40.7 | 39.0 |
| Test 2 | 30.7 | 65.2 | 30.4 | 37.8 | 31.2 | 36.9 | 27.4 | 25.5 | 25.1 | 26.2 |
| Test 3 | 20.1 | 37.5 | 24.2 | 26.8 | 22.3 | 8.7 | 19.6 | 15.9 | 15.4 | 15.3 |
| **α-accuracy (%)** | | | | | | | | | | |
| Training | 27.0 | 36.6 | 19.0 | 29.1 | 47.8 | 54.3 | 35.2 | 47.1 | 49.9 | 73.0 |
| Test 1 | 24.7 | 36.8 | 19.5 | 26.0 | 45.3 | 45.8 | 31.0 | 44.1 | 49.7 | 60.8 |
| Test 2 | 40.1 | 17.6 | 32.5 | 18.6 | 42.0 | 42.7 | 28.0 | 46.9 | 51.9 | 56.4 |
| Test 3 | 12.6 | 5.8 | 9.5 | 6.1 | 36.0 | 51.9 | 9.7 | 13.7 | 16.8 | 40.5 |
| **β-probability** | | | | | | | | | | |
| Training | 0.26 | 0.38 | 0.22 | 0.28 | 0.38 | 0.43 | 0.24 | 0.25 | 0.27 | 0.34 |
| Test 1 | 0.24 | 0.36 | 0.22 | 0.26 | 0.37 | 0.38 | 0.24 | 0.24 | 0.26 | 0.30 |
| Test 2 | 0.39 | 0.17 | 0.30 | 0.22 | 0.34 | 0.35 | 0.23 | 0.24 | 0.26 | 0.24 |
| Test 3 | 0.12 | 0.10 | 0.11 | 0.11 | 0.28 | 0.41 | 0.17 | 0.20 | 0.21 | 0.23 |
| **NLL** | | | | | | | | | | |
| Training | 28.9 | 3.2 | 4.7 | 3.0 | 2.5 | 0.7 | 2.0 | 1.8 | 1.6 | 1.3 |
| Test 1 | 34.8 | 3.6 | 4.9 | 3.3 | 2.8 | 0.9 | 2.2 | 2.0 | 1.7 | 1.4 |
| Test 2 | 29.2 | 9.6 | 3.6 | 4.0 | 3.1 | 0.9 | 2.4 | 2.2 | 1.9 | 2.1 |
| Test 3 | 71.9 | 9.7 | 6.3 | 4.2 | 3.2 | 1.3 | 2.5 | 2.1 | 1.9 | 1.6 |
Figure 5 shows the confidence level calibration curves of five select models. The proposed method was slightly underconfident in its predictions, indicated by the majority of its calibration curve falling above the y = x line. In general, under-confidence is preferred, as it builds a margin of safety into reliability and manufacturing engineers' maintenance practice. In contrast, the rest of the models were largely overconfident in their RUL predictions. Another observation from Fig. 5 is that En-EUKF is much less overconfident than a single EUKF, suggesting that the concept of an ensemble, in general, improves model uncertainty quantification and the reliability of RUL prediction.

Confidence level calibration curves comparing the uncertainty estimation performance of five models on the battery dataset. Perfect uncertainty quantification follows y = x.
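A calibration curve of this kind can be computed by comparing the empirical coverage of central credible intervals against each nominal confidence level. The sketch below assumes Gaussian RUL PDFs (an assumption on our part) and uses the standard-library `NormalDist` for the inverse CDF:

```python
import numpy as np
from statistics import NormalDist

def calibration_curve(rul_true, mu, sigma, levels=None):
    """Empirical coverage of central Gaussian credible intervals.

    Points above y = x indicate under-confidence (intervals wider than
    needed); points below y = x indicate over-confidence."""
    if levels is None:
        levels = np.linspace(0.05, 0.95, 19)
    rul_true = np.asarray(rul_true, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    coverage = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2.0)   # interval half-width in std units
        inside = np.abs(rul_true - mu) <= z * sigma
        coverage.append(float(np.mean(inside)))
    return levels, np.array(coverage)
```

For a perfectly calibrated model, the returned coverage matches the nominal levels, i.e., the curve lies on y = x.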
Another notable observation is the consistent prediction accuracy and predictive uncertainty quantification across all three test datasets. The open-source battery dataset used in this study is extremely diverse in cell lifetimes, as shown in Fig. 2, resulting from the diverse conditions under which the cells were charged during repeated charge-discharge cycling. It is challenging to build prognostic models that perform well on such diverse degradation trajectories. The proposed method performed well on such a diverse dataset because of the diverse model-based predictors in the ensemble. Additionally, we observe (not depicted in any figure) that the EUKF and LSTM models would underestimate RUL most of the time, while the GPR model would overestimate RUL. This scenario is ideal because an optimal set of weights exists to accurately predict the true RUL, consistent with what we observe. The surrogate model was able to learn the general trend of the optimal model weights over the cells' lifetimes and accurately predict the optimal weights on the three test datasets. This is confirmed by comparing the RMSEs of the DDDEn to the En-all in Table 2.
Figure 6 shows the variation of model weights for the DIEn and DSDEn models. During the early degradation stage (stage 1), DSDEn assigns larger weights to the LSTM and GPR models. In stage 2, the EUKF model receives almost zero importance, and LSTM and GPR have almost equal importance. However, close to EOL, the GPR model gets the highest importance because its underlying trend function can accurately model each cell's capacity trajectory, thus accurately predicting its RUL.

Variation of the three model weights for DIEn and DSDEn for the battery dataset. The vertical black lines indicate the cutoffs separating one stage from another.
On the other hand, the DIEn model does not adaptively vary the weights over the course of degradation, which is why the weights of the three predictors remain different but constant. Interestingly, the LSTM and GPR weights determined by the DIEn model for stage 1 seem very close to those determined by the DSDEn model. This is because predicting battery RUL in the early stages (stages 1 and 2) is more difficult, and the overall prediction error can be drastically improved by optimizing the ensemble weights for this region. While the RMSEs of both the DIEn and DSDEn methods are similar, the extra flexibility of the DSDEn model to change the model weights across the three regions led to the slightly lower RMSEs and better performance.
Last, we would like to briefly discuss the performance of the proposed dynamically weighted ensemble when asked to predict on a set of test units that are very different from the units used for training. In general, machine learning models perform poorly when a data distribution shift occurs, i.e., when the distribution of the unseen test data differs from that of the training data. Table 3 presents the summary statistics calculated for each of the four datasets that comprise the 169-cell LFP dataset [54]. We report the mean initial capacity and the mean slope of the capacity fade trajectory over the first 200 cycles.
Summary statistics for the four datasets comprising 169 LFP battery dataset
| Dataset | Initial capacity (Ah) | Capacity trajectory slope, first 200 cycles (Ah/cycle) |
|---|---|---|
| Training | 1.074 | −9.9 × 10⁻⁵ |
| Primary test | 1.074 | −6.1 × 10⁻⁵ |
| Secondary test | 1.063 | −3.8 × 10⁻⁵ |
| Tertiary test | 1.051 | −2.4 × 10⁻⁵ |
Both the initial capacity and the slope of the capacity trajectory in the initial 200 cycles greatly affect the performance of the prognostic algorithms used in this work because they significantly alter the forecasted capacity trajectories and, thus, the predicted RUL. The statistics in Table 3 show that there is a significant data distribution shift from the training dataset to the secondary and tertiary test datasets. The effect of this distribution shift on model accuracy and uncertainty is most visible when comparing the NLL of a single model to that of an ensemble model from the results in Table 2. The NLL values for the EUKF and GPR models are significantly higher for the secondary and tertiary test datasets than for the training and primary test datasets. This is due largely to the distribution shift causing the models to predict inaccurately and with greater uncertainty. These results align with previously reported results where single machine learning models were found to be less accurate on the secondary and tertiary test datasets because of the distribution shift [54]. However, if we look at the NLL values of any of the ensemble models, for example, DDDEn, we see that its NLL values are consistent across all four datasets. The consistent accuracy and uncertainty quantification of the ensemble models even in the presence of distribution shift is due to the aggregated effect of combining the predictions from multiple predictors. So, while the accuracy of an ensemble might decrease in the presence of a distribution shift, the power of an ensemble is that the uncertainty is captured accordingly.
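The two shift statistics reported in Table 3 are straightforward to compute from the capacity trajectories. A sketch, assuming "mean slope" refers to the least-squares slope over the first 200 cycles (our reading):

```python
import numpy as np

def shift_statistics(capacity_trajectories, n_cycles=200):
    """Mean initial capacity (Ah) and mean least-squares capacity-fade
    slope (Ah/cycle) over the first n_cycles, per Table 3's statistics."""
    inits, slopes = [], []
    for q in capacity_trajectories:
        q = np.asarray(q, dtype=float)[:n_cycles]
        inits.append(q[0])
        # Slope of the best-fit line through (cycle index, capacity)
        slopes.append(np.polyfit(np.arange(len(q)), q, 1)[0])
    return float(np.mean(inits)), float(np.mean(slopes))
```

Comparing these two numbers across batches is a quick, model-free way to detect the distribution shift discussed above before any prognostic model is trained.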
4.3 Remaining Useful Life Prediction of Rolling Element Bearings.
The 15 bearings from the XJTU-SY bearing dataset were split into five folds for cross-validation, as shown in Table 4. The root-mean-square (RMS) of the vibration signal was used as the HI, where a denotes the accelerometer measurement. With a 1.28 s acquisition window sampled at 25.6 kHz, the total number of datapoints used to compute the RMS at each vibration measurement is N = 32,768. The RMS represents the vibration energy and is an HI commonly used by the bearing prognostics community.
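The RMS health index is simple to compute from one acquisition window; a minimal sketch:

```python
import numpy as np

def rms_health_index(a):
    """RMS of one vibration snapshot. A 1.28 s window at 25.6 kHz gives
    N = 32,768 samples per measurement, as in the XJTU-SY dataset."""
    a = np.asarray(a, dtype=float)
    return float(np.sqrt(np.mean(a ** 2)))
```

As the bearing degrades, impacts from surface defects raise the vibration energy, so the RMS trends upward toward the EOL threshold.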
Test bearing IDs and number of test samples for each cross-validation fold
Fold # | Test bearing IDs | # Test samples |
---|---|---|
1 | 1_1, 2_1, 3_1 | 221 |
2 | 1_2, 2_2, 3_2 | 506 |
3 | 1_3, 2_3, 3_3 | 334 |
4a | 1_4*, 2_4, 3_4 | 63 |
5 | 1_5, 2_5, 3_5 | 168 |
Bearing 1_4 undergoes sudden catastrophic failure and is therefore not considered in this study.
During the cross-validation study, a single fold was chosen as the test set while the other four folds were used for hyperparameter optimization and surrogate model construction. The cross-validation study was repeated five times, and the average of the performance metrics over those five independent runs is shown in Table 5. The entries for each fold and metric have been color-coded for easy comparison, where a darker shade indicates better model performance. The overall comparison in Table 5 was obtained by combining all the folds, weighting each fold in proportion to its number of test samples. The proposed DDDEn-EUKF + GPR model performed the best on most metrics and in nearly all test scenarios, closely followed by DDDEn (with LSTM). Unlike the battery dataset results, the LSTM model performed much worse than EUKF and GPR, most likely because of the vastly smaller amount of training data available in the bearing dataset and the overall higher level of data noise. As a result, DDDEn without LSTM performs slightly better than the ensemble that includes LSTM. This highlights the importance of selecting the right type of predictor by properly balancing accuracy and diversity, particularly in cases where the individual predictors vastly disagree. Ultimately, the benefit of using an ensemble is the consistency of prediction accuracy and uncertainty quantification across the test datasets. Occasionally, a single GPR or EUKF model outperformed the more complex models, but this happened only for certain folds and was not consistent across all the tests. The solid overall performance of the proposed method is due to the dynamic weighting scheme, which effectively combines the strengths of each model in the ensemble by anticipating which model may provide the best RUL estimate at a given time.
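The sample-weighted combination of fold metrics described above can be sketched as follows. As a check, applying it to the per-fold EUKF RMSE values and test-sample counts reproduces the Net value of 69.1 reported in Table 5 (for RMSE the weighted average of per-fold values happens to match; strictly pooling squared errors could differ slightly).

```python
import numpy as np

def net_metric(fold_values, fold_sizes):
    """Combine per-fold metric values into an overall score, weighting
    each fold in proportion to its number of test samples (as in the
    'Net' rows of Table 5)."""
    v = np.asarray(fold_values, dtype=float)
    n = np.asarray(fold_sizes, dtype=float)
    return float(np.sum(v * n) / np.sum(n))
```

Weighting by fold size prevents a small fold (e.g., fold 4 with 63 samples) from dominating the overall comparison.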
We would also like to note that, in this case study, the other ensemble methods perform close to the proposed model, suggesting that the proposed dynamic weighting scheme may need further tuning on this dataset.
Evaluation metrics of all the models on the five-fold cross-validation study
Conference paper | Journal extension | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
EUKF | GPR | En-EUKF | En-EUKF + GPR | DDDEn-EUKF + GPR | LSTM | En-all | DIEn | DSDEn | DDDEn | |
RMSE | ||||||||||
Fold 1 | 48.0 | 40.2 | 57.6 | 45.2 | 40.3 | 44.9 | 30.0 | 33.2 | 42.1 | 26.8 |
Fold 2 | 82.3 | 89.9 | 82.0 | 84.6 | 88.5 | 116.3 | 94.9 | 104.0 | 114.0 | 102.4 |
Fold 3 | 80.3 | 75.4 | 81.1 | 75.0 | 72.9 | 83.1 | 75.9 | 76.2 | 81.0 | 69.9 |
Fold 4 | 40.9 | 56.7 | 48.8 | 40.5 | 36.0 | 10.7 | 38.1 | 36.7 | 10.5 | 19.8 |
Fold 5 | 45.6 | 35.2 | 50.6 | 38.9 | 38.1 | 28.1 | 28.9 | 30.8 | 26.4 | 21.7 |
Net | 69.1 | 68.9 | 71.9 | 67.3 | 67.1 | 84.8 | 71.5 | 77.5 | 84.4 | 75.5 |
α—Accuracy (%) | ||||||||||
Fold 1 | 11.8 | 21.7 | 7.2 | 22.2 | 33.0 | 17.2 | 26.7 | 34.7 | 19.9 | 26.7 |
Fold 2 | 10.1 | 24.3 | 19.0 | 21.1 | 25.9 | 4.2 | 21.3 | 14.9 | 5.3 | 18.2 |
Fold 3 | 14.1 | 2.1 | 15.0 | 7.8 | 2.7 | 7.5 | 8.1 | 7.8 | 8.4 | 15.3 |
Fold 4 | 23.8 | 4.8 | 7.9 | 4.9 | 9.5 | 34.9 | 4.8 | 6.9 | 47.6 | 19.0 |
Fold 5 | 15.5 | 10.1 | 9.5 | 7.7 | 17.2 | 14.3 | 10.1 | 15.3 | 19.3 | 28.0 |
Net | 12.8 | 15.3 | 14.2 | 15.3 | 19.2 | 12.4 | 18.2 | 17.6 | 12.7 | 16.9 |
β—Probability | ||||||||||
Fold 1 | 0.14 | 0.22 | 0.15 | 0.29 | 0.31 | 0.16 | 0.25 | 0.26 | 0.16 | 0.26 |
Fold 2 | 0.14 | 0.22 | 0.19 | 0.20 | 0.25 | 0.04 | 0.18 | 0.14 | 0.05 | 0.17 |
Fold 3 | 0.13 | 0.02 | 0.14 | 0.09 | 0.10 | 0.07 | 0.09 | 0.06 | 0.09 | 0.12 |
Fold 4 | 0.09 | 0.03 | 0.10 | 0.04 | 0.09 | 0.31 | 0.07 | 0.09 | 0.39 | 0.17 |
Fold 5 | 0.15 | 0.11 | 0.08 | 0.10 | 0.15 | 0.11 | 0.11 | 0.14 | 0.15 | 0.20 |
Net | 0.14 | 0.14 | 0.15 | 0.17 | 0.20 | 0.11 | 0.17 | 0.15 | 0.11 | 0.16 |
NLL | ||||||||||
Fold 1 | 22.7 | 3.1 | 3.7 | 2.7 | 2.7 | 1.7 | 2.9 | 2.8 | 1.5 | 2.6 |
Fold 2 | 26.2 | 2.2 | 2.9 | 2.7 | 2.4 | 13.7 | 2.6 | 2.7 | 7.8 | 2.6 |
Fold 3 | 6.2 | 11.7 | 3.1 | 2.9 | 3.3 | 5.5 | 2.9 | 11.9 | 3.7 | 2.8 |
Fold 4 | 2.6 | 1.9 | 2.5 | 2.5 | 1.8 | 1.2 | 2.3 | 2.5 | 1.1 | 1.8 |
Fold 5 | 6.9 | 3.4 | 2.8 | 2.7 | 3.2 | 1.5 | 2.6 | 2.7 | 1.4 | 2.3 |
Net | 16.8 | 5.0 | 3.1 | 2.8 | 2.8 | 2.1 | 2.7 | 2.7 | 2.0 | 2.6 |
Unlike the battery dataset explored in Sec. 4.2, the bearing dataset is much noisier and less monotonic, making it challenging to forecast the HI accurately. The non-monotonic nature of the HI caused issues for the single models: almost 10% of the test samples had no RUL prediction from a single GPR or EUKF model because the predicted HI trajectories were found not to cross the EOL threshold. In contrast, the ensemble model provided an RUL prediction for most test samples, since it requires only that at least one of EUKF, GPR, or LSTM produce a prediction. This difference has to be accounted for when comparing models in Table 5.
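One simple way to realize this fallback behavior is to renormalize the weights over whichever members did produce a prediction. This is a sketch under that assumption, not the paper's exact implementation:

```python
import numpy as np

def ensemble_rul(preds, weights):
    """Weighted ensemble RUL that tolerates missing member predictions
    (NaN where a forecasted HI never crossed the EOL threshold).
    Weights of the available members are renormalized; NaN is returned
    only if no member produced a prediction."""
    preds = np.asarray(preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    ok = ~np.isnan(preds)
    if not ok.any():
        return float('nan')
    w = w[ok] / w[ok].sum()
    return float(np.sum(w * preds[ok]))
```

With this scheme the ensemble fails to predict only when every member fails, which explains why it covers many more test samples than any single model.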
Figure 7 shows the confidence level calibration curves for select models trained and tested on the bearing dataset. Similar to results for the battery dataset, the individual models are found to be overconfident in their predictions while the ensemble models produce more reliable uncertainty estimates, indicated by their closeness to the ideal calibration line. The proposed DDDEn model is the least overconfident of the group.

Confidence level calibration curves comparing the uncertainty estimation performance of each model on the bearing dataset. Perfect uncertainty quantification follows y = x.
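A calibration curve of this kind can be computed by checking, for each nominal confidence level, how often the true RUL falls inside the corresponding central prediction interval. The sketch below assumes Gaussian predictive distributions; the paper's exact construction may differ.

```python
import numpy as np
from statistics import NormalDist

def calibration_curve(y_true, mu, sigma, levels=np.linspace(0.05, 0.95, 19)):
    """Empirical coverage of central Gaussian prediction intervals.
    For each nominal confidence level p, returns the fraction of true
    RUL values falling inside the central p-interval of N(mu, sigma^2).
    A well-calibrated model tracks y = x; an overconfident model
    falls below the ideal line."""
    y, m, s = (np.asarray(v, dtype=float) for v in (y_true, mu, sigma))
    coverage = []
    for p in levels:
        z = NormalDist().inv_cdf(0.5 + p / 2.0)  # interval half-width in sigmas
        coverage.append(float(np.mean(np.abs(y - m) <= z * s)))
    return np.asarray(levels), np.asarray(coverage)
```

Plotting empirical coverage against the nominal levels reproduces the style of curve shown in the figure: points under the diagonal indicate intervals that capture the truth less often than advertised.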
As stated previously, the bearing dataset is very noisy and non-monotonic, blurring the lines between the different degradation stages. In Fig. 8, we show the ensemble weights from the DIEn and DSDEn models over the range of the HI. Because of the non-monotonic nature of the dataset, the HI can decrease back into the range of a lower degradation stage while still being classified as a higher stage; this is why some symbols from a higher degradation stage appear in a lower stage. As shown in Fig. 8, the LSTM model is given very low priority in stages 2 and 3 due to its poor performance, as the model-based prediction methods (EUKF and GPR) are much more accurate at estimating RUL in the late aging stage thanks to their underlying mathematical models (exponential and quadratic), which accurately capture the observed degradation trajectories. Essentially, the comparison of individual models (EUKF versus GPR versus LSTM) in Tables 2 and 5 and their relative importance obtained from the DIEn and DSDEn models (Figs. 6 and 8) can serve as a precursor to model selection when creating an ensemble. Although adding a vanilla LSTM RNN seems to negatively affect the ensemble's performance in the bearing case study, this does not rule out the potential benefits of other curated LSTM models that are established to work on the same dataset, such as those in Ref. [38].

Variation of the three model weights for the DIEn and DSDEn models on the bearing dataset. The vertical black lines indicate the cutoffs separating one stage from another.
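The stage-dependent weighting idea behind DSDEn can be illustrated with a simplified sketch: classify the current HI into a degradation stage by fixed cutoffs, keep the stage from regressing when the HI dips (matching the paper's observation that a sample can stay classified in a higher stage), and look up that stage's model weights. The cutoff values and weight table below are hypothetical placeholders, not the optimized values from the paper.

```python
import numpy as np

def stage_weights(hi, prev_stage, cutoffs, weight_table):
    """DSDEn-style stage-dependent weight lookup (simplified sketch).
    cutoffs: ascending HI thresholds separating the degradation stages.
    weight_table: one row of model weights per stage (e.g., for
    [EUKF, GPR, LSTM]; rows sum to 1). The stage index never decreases,
    so a non-monotonic HI cannot demote a sample to an earlier stage."""
    stage = max(int(prev_stage), int(np.searchsorted(cutoffs, hi)))
    return stage, weight_table[stage]
```

For example, with hypothetical cutoffs [1.0, 2.0] and a weight table that downweights the LSTM in later stages, an HI of 0.8 arriving after stage 1 has already been entered still receives the stage-1 weights.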
5 Conclusion
In this paper, we have explored an ensemble of models with diverse architectures to provide robust and consistent RUL predictions. The models are combined using a dynamic weighting scheme that assigns model importance based on the predictions of the individual models and the health index at the current time instance. Using two open-source datasets, one pertaining to battery capacity fade and the other to rolling element bearing failure, we show that the proposed ensemble model both reduces RUL prediction error and estimates RUL predictive uncertainty more accurately than comparable methods. The dynamic weighting algorithm helps to reduce model uncertainty and overconfidence. We also compared our method to other state-of-the-art optimization-based ensemble weighting techniques that estimate either degradation-independent model weights (DIEn) or degradation stage-dependent model weights (DSDEn). The ensemble models show superior performance when the health index is less noisy and monotonic; a noisy and non-monotonic health index, in contrast, leads to strong disagreements among the diverse predictors, causing the ensemble methods to perform similarly to individual predictors. In either case, however, ensembles of diverse predictors were found to be reliable and consistent across test cases. Although we use EUKF, LSTM, and GPR to form the ensemble, the concept of choosing models that are found to predict RUL differently under the same conditions can be extended to other model-based and data-driven methods. In future work, we aim to investigate the proper selection of diverse models depending on the dataset and to improve the dynamic weighting surrogate model.
Replication of Results
The individual model predictors described in Sec. 2.4 (EUKF, GPR, and LSTM) have been implemented in MATLAB and Python (TensorFlow/Keras) on the battery dataset. These diverse models and the implementations of the three weighted ensemble methods described in Secs. 2.2 and 2.3 (En-all, DIEn, and DSDEn) are available for download on our GitHub page.5
Acknowledgment
This work was supported in part by Vermeer Corporation. Any opinions, findings, or conclusions in this paper are those of the authors and do not necessarily reflect the sponsor's views.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The data and information that support the findings of this article are freely available.6