Statistics, Department of
Document Type
Article
Date of this Version
2015
Citation
PLOS 1-7
Abstract
Changes in performance with prior feature selection
Random forest (RF) is designed to create uncorrelated trees by using random subsets of features at each node of each tree. RF by itself is an effective tool for selecting features from a high-dimensional feature set. However, we observed that prediction accuracy improves when a prior feature selection step (RELIEFF) [1] is applied. Table A shows the performance of RF, VMRF and CMRF with and without RELIEFF feature selection on 2 drug sets from GDSC.
Performance Analysis for drug sets consisting of more than two drugs
We generated empirical copulas for the bivariate cases because they can capture all forms of dependency structure. However, generating empirical copulas is computationally expensive and requires a significant number of training samples at each node. For more than two drug responses, we therefore used parametric copulas: in place of the integral difference between empirical copulas, we used the difference between the Gaussian copula parameters estimated from the root-node samples and those estimated from the split-node samples. To test our hypothesis that VMRF and CMRF perform better than RF, we considered a drug set of 4 different drugs from CCLE that share a single common target and a drug set of 3 different drugs from GDSC that share a common target. The CCLE set has 482 cell lines and the GDSC set has 308 cell lines. RELIEFF was used to reduce the feature space before random forest was applied. For simplicity, in this case we used 30% of the cell lines as training data and the remaining 70% as testing data.
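The comparison of Gaussian copula parameters between root-node and split-node samples can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the Gaussian copula parameter is taken to be the correlation matrix of rank-based normal scores, the "difference" is taken to be the sum of absolute differences over the distinct drug pairs (the text does not specify the measure), and the function names `normal_scores`, `gaussian_copula_corr` and `copula_param_distance` are hypothetical.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map one drug's responses to standard-normal scores via their ranks.

    Assumption: ties are broken by sample order, which is adequate for a sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the pseudo-observation strictly inside (0, 1)
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores


def pearson(x, y):
    """Plain Pearson correlation of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def gaussian_copula_corr(responses):
    """Estimate the Gaussian copula correlation matrix.

    `responses` is a list of columns, one per drug, each holding the
    responses of the cell lines that reached the node.
    """
    z = [normal_scores(col) for col in responses]
    d = len(z)
    return [[1.0 if i == j else pearson(z[i], z[j]) for j in range(d)]
            for i in range(d)]


def copula_param_distance(root_responses, node_responses):
    """Assumed distance: sum of absolute differences over distinct drug pairs."""
    root = gaussian_copula_corr(root_responses)
    node = gaussian_copula_corr(node_responses)
    d = len(root)
    return sum(abs(root[i][j] - node[i][j])
               for i in range(d) for j in range(i + 1, d))
```

For a drug set with 3 drugs, `root_responses` would hold three columns of responses for all training cell lines and `node_responses` the same three columns restricted to the cell lines in a candidate child node. A distance of zero would indicate that the split leaves the estimated dependency structure unchanged.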