Mid-America Transportation Center


Date of this Version


Document Type



Farooq, M.U., & Khattak, A. J. (2023). Exploring Statistical and Machine Learning-Based Missing Data Imputation Methods to Improve Crash Frequency Prediction Models for Highway-Rail Grade Crossings. Presented at the International Road Federation (IRF) Global R2T Conference & Exhibition, Tempe, AZ.


Copyright © 2023 by the authors.


Highway-rail grade crossings (HRGCs) are critical locations for transportation safety because crashes at HRGCs are often catastrophic and can cause multiple injuries and fatalities. Every year in the United States, a significant number of crashes occur at these crossings, prompting local and state organizations to conduct safety analyses and estimate crash frequency prediction models to guide resource allocation. These models provide valuable insights into safety and risk mitigation strategies for HRGCs. They are estimated from HRGC inventory details, so the quality of those details is crucial for reliable crash predictions. However, many of these models exclude crossings with missing inventory details, which can adversely affect their precision. In this study, a random sample of inventory details for 2,000 HRGCs was taken from the Federal Railroad Administration’s HRGC inventory database. Data filters were applied to retain only crossings that were at-grade, public, and operational (N = 1,096). Missing values were imputed using various statistical and machine learning methods: Mean, Median, and Mode (MMM) imputation; Last Observation Carried Forward (LOCF) imputation; K-Nearest Neighbors (KNN) imputation; Expectation-Maximization (EM) imputation; Support Vector Machine (SVM) imputation; and Random Forest (RF) imputation. The results indicated that crash frequency models estimated on data imputed with machine learning methods fit better, as shown by lower AIC and BIC values. The findings underscore the importance of obtaining complete inventory data through machine learning imputation methods when developing crash frequency models for HRGCs. This approach can substantially enhance the precision of these models, improve their predictive capabilities, and ultimately help save lives.
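To make three of the imputation methods named in the abstract more concrete, the following is a minimal sketch in plain Python of MMM, LOCF, and KNN imputation, together with the standard AIC and BIC formulas used to compare fitted models. It is not the authors' implementation. The function names, the choice of `k`, the distance measure, and the example values are all assumptions made for illustration, and a real analysis would typically standardize variables before computing distances.

```python
import math
from statistics import mean, median, mode


def impute_simple(values, strategy="mean"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


def impute_locf(values):
    """Last Observation Carried Forward: reuse the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # remains None if nothing was observed earlier
    return filled


def impute_knn(rows, k=2):
    """Fill each missing cell with the column mean over the k nearest rows.

    Distance is a root-mean-square difference over the columns that are
    observed in both rows (an assumption; other distances are possible).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell is not None:
                continue
            # Candidate donors: other rows with an observed value in column j.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [value for _, value in donors[:k]]
            if nearest:
                result[i][j] = mean(nearest)
    return result


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: lower values indicate a better fit."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical inventory rows: [trains per day, traffic volume in thousands].
crossings = [[10, 2.0], [12, None], [40, 8.0], [11, 2.4]]
completed = impute_knn(crossings, k=2)
```

In this sketch, the second crossing's missing traffic volume is filled from the two crossings whose train counts are closest to its own, whereas `impute_simple` would use a single value for every missing cell in a column. Candidate crash frequency models fitted to each completed data set could then be compared by passing their log-likelihoods to `aic` and `bic`.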