IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 27, NO. 6, JUNE 2015.
Boosting is an iterative process that improves the predictive accuracy of supervised (machine) learning algorithms. Boosting operates by learning multiple functions, with each subsequent function focusing on instances that the previous functions mislabeled. Despite considerable success, boosting still has difficulty on data sets with certain types of problematic training data (e.g., label noise) and when complex functions overfit the training data. We propose a novel cluster-based boosting (CBB) approach to address these limitations in boosting for supervised learning systems. Our CBB approach partitions the training data into clusters containing highly similar member data and integrates these clusters directly into the boosting process. CBB boosts selectively (using a high learning rate, a low learning rate, or no boosting) on each cluster, based on both the additional structure provided by the cluster and the accuracy of previous functions on the member data. Selective boosting allows CBB to improve predictive accuracy on problematic training data. In addition, boosting separately on clusters reduces function complexity, which mitigates overfitting. We provide comprehensive experimental results on 20 UCI benchmark data sets with three different kinds of supervised learning systems. These results demonstrate the effectiveness of our CBB approach compared to a popular boosting algorithm, an algorithm that uses clusters to improve boosting, and two algorithms that use selective boosting without clustering.
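The selective-boosting idea above can be illustrated with a minimal, self-contained sketch: cluster the training data, fit an initial function per cluster, and then apply a high learning rate, a low learning rate, or no boosting depending on how well that function already fits the cluster. This is a simplification for illustration only, not the authors' exact CBB algorithm; the clustering method (k-means), base learner (decision stumps), and accuracy thresholds are all assumptions.

```python
# Hypothetical sketch of cluster-based selective boosting (pure NumPy).
# Assumed, not from the paper: k-means clustering, decision-stump base
# learners, and the 0.95 / 0.7 accuracy thresholds for rate selection.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: partition the rows of X into k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def fit_stump(X, y, w):
    """Best weighted decision stump for labels y in {-1, +1}."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (j, t, pol), err
    return best, best_err

def stump_predict(stump, X):
    j, t, pol = stump
    return np.where(pol * (X[:, j] - t) >= 0, 1, -1)

def adaboost(X, y, rounds=20, lr=1.0):
    """AdaBoost over decision stumps with a tunable learning rate."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(X, y, w)
        if err >= 0.5:                 # stump no better than chance: stop
            break
        err = max(err, 1e-10)
        alpha = lr * 0.5 * np.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w *= np.exp(-alpha * y * stump_predict(stump, X))
        w /= w.sum()
        if err <= 1e-10:               # cluster fit perfectly: stop early
            break
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(s, X) for a, s in ensemble)
    return np.where(score >= 0, 1, -1)

def cluster_based_boosting(X, y, k=3):
    """Cluster the data, then boost selectively per cluster: skip
    boosting on well-fit clusters, use a low learning rate on
    moderately fit ones, and a high rate on poorly fit ones."""
    labels = kmeans(X, k)
    models = {}
    for c in range(k):
        Xc, yc = X[labels == c], y[labels == c]
        if len(yc) == 0:
            continue
        init = adaboost(Xc, yc, rounds=1)        # initial single function
        acc = (predict(init, Xc) == yc).mean()
        if acc >= 0.95:
            models[c] = init                     # already accurate: no boosting
        else:
            lr = 0.5 if acc >= 0.7 else 1.0      # low vs. high learning rate
            models[c] = adaboost(Xc, yc, rounds=20, lr=lr)
    return models, labels

# Toy two-blob data set for demonstration.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.concatenate([-np.ones(60), np.ones(60)]).astype(int)
models, labels = cluster_based_boosting(X, y, k=2)
```

Boosting each cluster separately keeps the per-cluster ensembles small and simple, which reflects the paper's claim that cluster-wise boosting reduces function complexity and thereby mitigates overfitting.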