Abstract
In this work, a Variational Autoencoder (VAE)-based data-driven modeling framework is developed with the overarching goal of enabling fuel design. The VAE model is trained on a large dataset of chemical species to learn a compressed latent space molecular representation. The chemical structure, in the form of a Simplified Molecular Input Line Entry System (SMILES) string, is fed as input, encoded into the VAE latent space, and decoded back to a SMILES string using Long Short-Term Memory (LSTM) networks. The complexities of the VAE training loss function are examined thoroughly by varying the weight (the beta (𝜷) parameter) of the latent space regularization term, thereby assessing the trade-off between accurate molecular structure reconstruction, validity of the decoded SMILES strings, and latent space consistency. Two strategies for varying 𝜷 are evaluated: linear annealing and cyclic annealing. In addition, the impact of total correlation adjustment and hierarchical priors is studied with regard to the balance between reconstruction fidelity and latent space regularization, as well as potential issues such as posterior collapse, over-regularization, and poor disentanglement of latent variables. Overall, the best model performance is achieved with hierarchical priors and by incrementally increasing 𝜷 from 0 to a threshold value of 0.25 over 75 epochs. The generative VAE model can be readily coupled with Quantitative Structure–Property Relationship (QSPR) analysis to form an integrated end-to-end framework for fuel property prediction and the molecular design of promising novel fuels.
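
For illustration only, the minimal sketch below shows a 𝜷-weighted VAE objective together with the two annealing schedules mentioned above. The warm-up values (𝜷 rising from 0 to 0.25 over 75 epochs) follow the abstract; the cycle length and all function names are hypothetical placeholders, not the implementation used in this work.

```python
def linear_beta(epoch: int, beta_max: float = 0.25, warmup_epochs: int = 75) -> float:
    """Linear annealing: beta grows from 0 to beta_max over warmup_epochs, then stays fixed."""
    return beta_max * min(1.0, epoch / warmup_epochs)

def cyclic_beta(epoch: int, beta_max: float = 0.25, cycle_length: int = 25) -> float:
    """Cyclic annealing: beta ramps from 0 to beta_max within each cycle, then resets.
    The cycle_length value here is an assumed example, not taken from the paper."""
    phase = (epoch % cycle_length) / cycle_length
    return beta_max * phase

def vae_loss(recon_loss: float, kl_div: float, beta: float) -> float:
    """Beta-weighted objective: SMILES reconstruction term plus weighted KL regularizer."""
    return recon_loss + beta * kl_div

# Example: compare the two schedules at selected epochs.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(linear_beta(epoch), 3), round(cyclic_beta(epoch), 3))
```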