Statistics, Department of

 

The R Journal

Accessibility Remediation

If you are unable to use this item in its current form due to accessibility barriers, you may request remediation through our remediation request form.

Date of this Version

12-2015

Document Type

Article

Citation

The R Journal (December 2015) 7(2); Editor: Bettina Grün

Comments

Copyright 2015, The R Foundation. Open access material. License: CC BY 3.0 Unported

Abstract

This paper describes the R package VSURF. Based on random forests, and for both regression and classification problems, it returns two subsets of variables. The first is a subset of important variables including some redundancy which can be relevant for interpretation, and the second one is a smaller subset corresponding to a model trying to avoid redundancy focusing more closely on the prediction objective. The two-stage strategy is based on a preliminary ranking of the explanatory variables using the random forests permutation-based score of importance and proceeds using a stepwise forward strategy for variable introduction. The two proposals can be obtained automatically using data-driven default values, good enough to provide interesting results, but strategy can also be tuned by the user. The algorithm is illustrated on a simulated example and its applications to real datasets are presented.

Share

COinS