Department of Computer Science & Engineering, University of Nebraska-Lincoln, Technical Report, TR-UNL-CSE-2011-0003
Missing values are very common in real-world datasets for a variety of reasons. Deleting data points with missing values can degrade the performance of data analysis methods (e.g., machine learning, data mining). Using a human expert to restore the missing values is expensive and time-consuming. The alternative is to impute the missing values from the known values during data preprocessing. This improves the performance of data analysis, provided the imputed values are accurate. Unfortunately, imputation algorithms that use all the known values (e.g., mean imputation) often produce imputed values that deviate considerably from the real values. More complex imputation algorithms (e.g., deck-based and model-based) choose a suitable subset of the data points for imputation. However, a weakness of these algorithms is that they use all the variables (i.e., attributes) for imputation, even if some of those variables are uncorrelated with the variable being imputed. Here, we propose ClustFrame, a framework for imputation algorithms that chooses suitable subsets of both data points and variables. We also present ClustImpute, an algorithm based on our framework that uses single imputation with (1) hierarchical clustering, (2) dynamic tree cut, and (3) a regression model to impute all missing values. Using nine datasets from the UCI repository and an empirically collected complex dataset, we evaluate our algorithm against several existing algorithms, including state-of-the-art model-based algorithms that use multiple imputation. Results show that ClustImpute achieves significantly higher imputation accuracy on many of the datasets. We conclude with suggestions for improving our algorithm.
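The contrast between imputing from all known values and imputing from a chosen subset of data points and variables can be illustrated with a small sketch. This is not the ClustImpute algorithm from the report: a nearest-neighbour selection stands in for hierarchical clustering with dynamic tree cut, and a subset mean stands in for the regression model. The function names, the `k` and `min_corr` parameters, and the toy data are all assumptions made for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def mean_impute(rows, target):
    """Baseline: average every known value of the target variable."""
    known = [r[target] for r in rows if r[target] is not None]
    return sum(known) / len(known)

def subset_impute(rows, query, target, k=2, min_corr=0.5):
    """Hypothetical stand-in for subset-based imputation.

    Variables: keep only those whose |correlation| with the target
    is at least min_corr. Data points: keep the k complete rows
    closest to the query on the kept variables.
    """
    complete = [r for r in rows if None not in r]
    ys = [r[target] for r in complete]
    predictors = [
        j for j in range(len(query))
        if j != target and query[j] is not None
        and abs(pearson([r[j] for r in complete], ys)) >= min_corr
    ]
    def dist(r):
        return sum((r[j] - query[j]) ** 2 for j in predictors)
    nearest = sorted(complete, key=dist)[:k]
    return sum(r[target] for r in nearest) / len(nearest)

# Toy data (assumed): column 0 tracks the target, column 1 is noise,
# column 2 is the target variable.
rows = [
    (1.0, 5.0, 10.0), (1.2, 1.0, 11.0), (0.9, 9.0, 9.0),
    (5.0, 1.5, 50.0), (5.2, 8.5, 52.0), (4.8, 5.5, 48.0),
]
query = (5.1, 1.0, None)  # target value is missing

global_estimate = mean_impute(rows, target=2)
subset_estimate = subset_impute(rows, query, target=2)
print(global_estimate, subset_estimate)
```

On this toy data the global mean averages over both groups of data points, while the subset-based estimate ignores the noise variable and draws only on the rows that resemble the query.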