R: Variable selection using the "importance spectrum"
varSelImpSpecRF
R Documentation
Description
Perform variable selection with a simple heuristic that compares the
importance spectrum of the original data with the importance spectra
obtained from the same data after randomly permuting the class labels.
Arguments
A previously fitted random forest (see randomForest).
xdata
A data frame or matrix, with subjects/cases in rows and
variables in columns. NAs not allowed.
Class
The dependent variable; must be a factor.
randomImps
A list with the same structure as the object
returned by randomVarImpsRF.
threshold
The threshold for the selection of variables. See details.
numrandom
The number of random permutations of the class labels.
whichImp
One of impsUnscaled, impsScaled, or impsGini, which correspond,
respectively, to the (unscaled) mean decrease in accuracy, the scaled
mean decrease in accuracy, and the Gini index. See below and
randomForest, importance, and the references for further explanation
of these measures of variable importance.
usingCluster
If TRUE, use a cluster to parallelize the calculations.
TheCluster
The name of the cluster, if one is used.
...
Not used.
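As a rough illustration of the three whichImp options, the sketch below obtains the corresponding measures directly from a fitted forest with importance() from the randomForest package. The mapping of each option to a particular importance() call is an assumption made for illustration, and the simulated data and object names are hypothetical.

```r
# Simulated data (hypothetical): 45 cases, 30 variables, two classes.
set.seed(1)
x <- matrix(rnorm(45 * 30), ncol = 30)
colnames(x) <- paste0("v", 1:30)
x[1:20, 1:2] <- x[1:20, 1:2] + 2
cl <- factor(c(rep("A", 20), rep("B", 25)))

# Run only if randomForest is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)

  # importance = TRUE is needed for the permutation-based measures.
  rf <- randomForest(x, cl, ntree = 200, importance = TRUE)

  # Assumed correspondence with the whichImp options:
  imps_unscaled <- importance(rf, type = 1, scale = FALSE) # impsUnscaled
  imps_scaled   <- importance(rf, type = 1, scale = TRUE)  # impsScaled
  imps_gini     <- importance(rf, type = 2)                # impsGini
}
```

Each call returns a one-column matrix with one row per variable.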
Details
You can either pass a valid object for randomImps, obtained from a
previous call to randomVarImpsRF, or pass a covariate data frame
(xdata) and a dependent variable (Class), which will be used to
compute the random importances. The former is preferred for normal
use because the computation can be lengthy and this function does not
return the computed random variable importances, so they cannot be
reused. If you pass randomImps together with xdata and Class,
randomImps will be used.
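The two ways of calling the function might look as follows. This is a minimal sketch on simulated data, not an authoritative example: the data and object names are hypothetical, and the positional argument order used for randomVarImpsRF (data, class, fitted forest) is an assumption.

```r
# Simulated data (hypothetical): 45 cases, 30 variables; the first two
# variables separate the classes.
set.seed(1)
x <- matrix(rnorm(45 * 30), ncol = 30)
colnames(x) <- paste0("v", 1:30)
x[1:20, 1:2] <- x[1:20, 1:2] + 2
cl <- factor(c(rep("A", 20), rep("B", 25)))

# Run only if both packages are installed.
if (requireNamespace("randomForest", quietly = TRUE) &&
    requireNamespace("varSelRF", quietly = TRUE)) {
  library(randomForest)
  library(varSelRF)

  rf <- randomForest(x, cl, ntree = 200, importance = TRUE)

  # Preferred: compute the random importances once, keep them, reuse them.
  # Assumption: arguments are given in the order data, class, forest.
  rf.rvi <- randomVarImpsRF(x, cl, rf, numrandom = 20, usingCluster = FALSE)
  sel1 <- varSelImpSpecRF(rf, randomImps = rf.rvi)

  # Alternative: pass the data; the random importances are recomputed
  # internally and are not returned.
  sel2 <- varSelImpSpecRF(rf, xdata = x, Class = cl,
                          numrandom = 20, usingCluster = FALSE)
}
```

Because the class labels are permuted at random, sel1 and sel2 need not be identical.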
To select variables, start by ordering the variable importances from
largest (i = 1) to smallest (i = p, where p is the number of
variables), both for the original data and for each of the data sets
with permuted class labels. (The ordering is done in each data set
independently.) For each ordered position i, compute q_i, the
1 - threshold quantile of the ordered variable importances from the
permuted data at that position. Then, starting from i = 1, let i_a be
the first i for which the variable importance from the original data
is smaller than q_i. Select all variables from i = 1 to i = i_a - 1.
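The selection rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's code: select_by_spectrum is a hypothetical helper, the importances are made-up numbers, and computing q_i with quantile() and its default interpolation is an assumption.

```r
# Hypothetical helper (not part of varSelRF): applies the rule described in
# Details to importances that have already been computed.
#   orig: named numeric vector, importances from the original data (length p)
#   perm: p x numrandom matrix, one column per permuted data set (no NAs)
select_by_spectrum <- function(orig, perm, threshold = 0.10) {
  stopifnot(!is.null(names(orig)), nrow(perm) == length(orig))
  p <- length(orig)
  # Order the original importances from largest (i = 1) to smallest (i = p).
  orig_sorted <- sort(orig, decreasing = TRUE)
  # Order each permuted data set independently, largest first.
  perm_sorted <- matrix(apply(perm, 2, sort, decreasing = TRUE), nrow = p)
  # q_i: the (1 - threshold) quantile across permutations at position i.
  # Assumption: quantile() with its default interpolation.
  q <- apply(perm_sorted, 1, quantile, probs = 1 - threshold, names = FALSE)
  # i_a: first position where the original importance is smaller than q_i.
  below <- which(orig_sorted < q)
  i_a <- if (length(below) > 0) below[1] else p + 1
  # Select positions 1 to i_a - 1 (possibly none).
  names(orig_sorted)[seq_len(i_a - 1)]
}

# Made-up importances for four variables and three permuted data sets.
orig <- c(v3 = 0.2, v1 = 5, v4 = 0.1, v2 = 3)
perm <- cbind(c(0.4, 1.0, 0.6, 0.8),
              c(0.5, 0.9, 0.7, 0.3),
              c(1.1, 0.2, 0.6, 0.5))
selected <- select_by_spectrum(orig, perm, threshold = 0.10)
print(selected)
```

With these numbers the two largest original importances (5 and 3) exceed the permutation quantiles at positions 1 and 2, while the third (0.2) does not, so v1 and v2 are selected.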
Value
A vector with the names of the selected variables, ordered by
decreasing importance.
Note
The name of this function refers to the "importance spectrum plot",
the term that Friedman & Meulman (2005) use in their paper.