Abstract
Over the past decade, machine learning models have enabled significant technical achievements across a variety of fields; however, their application in conventional and regulated industries, which are often more cautious in adopting new technologies, remains an area of active research and development. In this paper, a case study is presented in which several statistical and machine learning models, including logistic regression, random forest, gradient-boosted decision trees, and artificial neural networks, are trained and validated on a historical incident record dataset to quantify the probability of pipe failure on a distribution pipeline system. The relative performance of each model type is compared against a held-out test dataset using an evaluation framework based on lift charts. Observed strengths and limitations of the different model types are discussed with respect to predictive performance, interpretability, and ease of incorporating additional data, along with key considerations for fitting and evaluating models. Additional case studies are presented to illustrate how model performance depends on the quantity of training data and the available predictor features. These cases illustrate the benefit of continually collecting and leveraging asset data, as well as of augmenting existing asset data with external sources, such as public geospatial datasets. The results of this study provide operators with additional insights and guidance for developing and evaluating machine learning models for pipeline risk assessment and integrity management.