R: Bootstrap the variable selection procedure in varSelRF
varSelRFBoot
R Documentation
Bootstrap the variable selection procedure in varSelRF
Description
Use the bootstrap to estimate the prediction error rate (with the
.632+ rule) and the
stability of the variable selection procedure implemented in varSelRF.
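The .632+ rule combines the resubstitution error with the leave-one-out bootstrap error, weighting them by an estimate of the amount of overfitting. The following is a minimal sketch of the estimator as defined by Efron & Tibshirani (1997). The function names `no_info_rate` and `err632plus` and the toy inputs are illustrative assumptions, and this is not necessarily how varSelRFBoot computes the estimate internally.

```r
# No-information error rate: expected error if predictions were
# independent of the true classes (gamma in Efron & Tibshirani).
no_info_rate <- function(obs, pred) {
  obs <- factor(obs)
  p <- as.numeric(table(obs)) / length(obs)
  q <- as.numeric(table(factor(pred, levels = levels(obs)))) / length(pred)
  sum(p * (1 - q))
}

# .632+ estimate from the resubstitution error, the leave-one-out
# bootstrap error, and the no-information rate.
err632plus <- function(err_resub, err_loo, gamma) {
  err_loo_c <- min(err_loo, gamma)            # cap at the no-information rate
  if (err_loo_c > err_resub && gamma > err_resub) {
    R <- (err_loo_c - err_resub) / (gamma - err_resub)  # relative overfitting
  } else {
    R <- 0
  }
  w <- 0.632 / (1 - 0.368 * R)                # weight lies between .632 and 1
  (1 - w) * err_resub + w * err_loo_c
}

# Toy inputs (made up for illustration only)
obs  <- factor(c(rep("A", 10), rep("B", 15)))
pred <- factor(c(rep("A", 12), rep("B", 13)))
gamma <- no_info_rate(obs, pred)
est <- err632plus(err_resub = 0.05, err_loo = 0.20, gamma = gamma)
print(est)
```

With no overfitting (R = 0) the weight is .632 and the result is the plain .632 estimate; as R approaches 1 the estimate moves toward the leave-one-out bootstrap error.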
Arguments
xdata
A data frame or matrix, with subjects/cases in rows and
variables in columns. NAs are not allowed.
Class
The dependent variable; must be a factor.
c.sd
The factor that multiplies the standard deviation (sd) to decide on stopping
the iterations or choosing the final solution. See the references for details.
mtryFactor
The multiplication factor of
sqrt(number.of.variables) for the number of variables to use for
the mtry argument of randomForest.
ntree
The number of trees to use for the first forest;
same as ntree for randomForest.
ntreeIterat
The number of trees to use (ntree of randomForest)
for all additional forests.
vars.drop.frac
The fraction of variables, from those
in the previous forest, to exclude at each iteration.
whole.range
If TRUE, continue dropping variables until a forest
with only two variables is built, and choose the best model from the
complete series of models. If
FALSE, stop the iterations if the current OOB error becomes larger
than the initial OOB error (plus c.sd*OOB standard error) or
if the current OOB error becomes larger than the
previous OOB error (plus c.sd*OOB standard error).
recompute.var.imp
If TRUE recompute variable importances at
each new iteration.
bootnumber
The number of bootstrap samples to draw.
usingCluster
If TRUE use a cluster to parallelize the calculations.
TheCluster
The name of the cluster, if one is used.
srf
An object of class varSelRF. If supplied, the ntree and
mtryFactor parameters are taken from this object, not from the
arguments to this function, and the initial step of fitting
a random forest to the complete,
original data set is skipped.
verbose
Give more information about what is being done.
...
Not used.
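To make the roles of mtryFactor and vars.drop.frac concrete, the sketch below computes the mtry value and the number of variables left in each successive forest for a data set with 30 variables. It is only an illustration: the helper name `drop_schedule`, the use of `floor()` for mtry, and the rule of rounding the number of dropped variables while always dropping at least one are assumptions, not a transcription of the varSelRF code.

```r
# Number of candidate variables per split, as a multiple of sqrt(p).
# floor() is an assumption about how the value is truncated.
p <- 30
mtryFactor <- 1
mtry <- floor(sqrt(p) * mtryFactor)

# Hypothetical helper: sizes of the successive variable sets when a
# fraction vars.drop.frac of the remaining variables is dropped at each
# iteration, down to a forest with two variables (whole.range = TRUE).
drop_schedule <- function(p, vars.drop.frac = 0.2) {
  sizes <- p
  while (p > 2) {
    n_drop <- max(1, round(p * vars.drop.frac))  # assume at least one is dropped
    p <- max(2, p - n_drop)
    sizes <- c(sizes, p)
  }
  sizes
}

sizes <- drop_schedule(30, vars.drop.frac = 0.2)
print(mtry)
print(sizes)
```

Larger values of vars.drop.frac shorten the series of forests (and the run time) at the cost of a coarser search over the number of variables.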
Details
If a cluster is used for the calculations, it will be used for the
embarrassingly parallelizable task of building as many random forests
as bootstrap samples.
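The pattern described above can be sketched with the parallel package that ships with R: the bootstrap indices are drawn once on the master process and each replicate is then handled independently by a worker. The per-replicate function here (`oob_size`, a hypothetical stand-in that only counts the out-of-bag cases) takes the place of the much more expensive variable selection run; this is an illustration of the idea, not the code used by varSelRFBoot.

```r
library(parallel)

set.seed(1)
n <- 25           # number of cases
bootnumber <- 10  # number of bootstrap samples

# Draw all bootstrap samples up front so results do not depend on
# the random number streams of the workers.
boot_idx <- replicate(bootnumber,
                      sample.int(n, n, replace = TRUE),
                      simplify = FALSE)

# Stand-in for the per-replicate work: count the cases left out of the
# bootstrap sample. A real run would fit the forests here instead.
oob_size <- function(idx, n) length(setdiff(seq_len(n), idx))

cl <- makeCluster(2)  # small local cluster, for illustration
oob_sizes <- unlist(parLapply(cl, boot_idx, oob_size, n = n))
stopCluster(cl)

print(oob_sizes)
```

Because the replicates do not communicate with each other, the speed-up is close to linear in the number of workers until there are more workers than bootstrap samples.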
Value
An object of class varSelRFBoot, which is a list with components:
number.of.bootsamples
The number of bootstrap replicates.
bootstrap.pred.error
The .632+ estimate of the prediction
error.
leave.one.out.bootstrap
The leave-one-out estimate of the error
rate (used when computing the .632+ estimate).
all.data.randomForest
A random forest built from all the data,
but after the variable selection. Beware: the OOB error
rate of this forest is severely biased downward.
all.data.vars
The variables selected in the run with all the
data.
all.data.run
An object of class varSelRF; the one obtained from
a run of varSelRF on the original, complete, data set. See
varSelRF.
class.predictions
The out-of-bag predictions from the
bootstrap, of type "response". See
predict.randomForest. This is an array, with
dimensions number of cases by number of bootstrap replicates.
prob.predictions
The out-of-bag predictions from the bootstrap,
of type "class probability". See
predict.randomForest. This is a 3-way array; the last
dimension is the bootstrap replication; for each bootstrap
replication, the 2D array
has dimensions case by number of classes, and each value is the
probability of belonging to that class.
number.of.vars
A vector with the number of variables selected
for each bootstrap sample.
overlap
The "overlap" between the variables selected from the
run on the original sample and the variables returned from a bootstrap
sample. Overlap between the sets of variables A and B is defined as
\frac{|variables.in.A \cap variables.in.B|}{\sqrt{|variables.in.A| |variables.in.B|}}
or size (cardinality) of the
intersection between the two sets / sqrt(product of the size of each
set).
all.vars.in.solutions
A vector with all the genes selected in
the runs on all the bootstrap samples. If the same gene is selected in several
bootstrap runs, it appears multiple times in this vector.
all.solutions
Each solution is a character string with all
the variables in a particular solution concatenated by a "+". Thus,
all.solutions is a character vector, with length equal to
number.of.bootsamples, holding the solution from each bootstrap
run.
Class
The original Class argument.
allBootRuns
A list of length number.of.bootsamples. Each
component of this list is an object of class varSelRF
and stores the results from the run on the corresponding bootstrap sample.
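As an illustration of the overlap and all.vars.in.solutions components, the sketch below computes the overlap measure exactly as defined above and tabulates how often each variable is selected. The function name `overlap` and the variable names ("g1", "g2", ...) are made up for the example; they are not output from a real run.

```r
# Overlap between two sets of selected variables:
# |A intersect B| / sqrt(|A| * |B|)
overlap <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  length(intersect(a, b)) / sqrt(length(a) * length(b))
}

# Hypothetical selections: one from the run on all the data,
# two from bootstrap samples.
all_data_vars <- c("g1", "g2", "g3", "g4")
boot_vars <- list(c("g2", "g3", "g5"),
                  c("g1", "g2", "g3", "g4"))

overlaps <- sapply(boot_vars, overlap, b = all_data_vars)
print(overlaps)

# Selection frequency across bootstrap runs, in the spirit of
# tabulating all.vars.in.solutions.
all_vars_in_solutions <- unlist(boot_vars)
freq <- sort(table(all_vars_in_solutions), decreasing = TRUE)
print(freq)
```

The measure equals 1 when the two sets are identical and 0 when they share no variables, so values near 1 across bootstrap samples indicate a stable selection.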
Note
The out-of-bag predictions stored in class.predictions and
prob.predictions are NOT the OOB votes from random
forest itself for a given run. These are predictions from the
out-of-bag samples for each bootstrap replication. Thus, these are
samples that have not been used at all in any of the variable selection
procedures in the given bootstrap replication.
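The distinction can be made concrete with a small base R sketch. For each bootstrap replicate, the cases that were not drawn form the out-of-bag set, and only those cases receive a prediction in that replicate. The classifier used here (predict the majority class of the in-bag cases) is a deliberately trivial stand-in for the full variable selection plus random forest, and filling the in-bag entries with NA is an assumption made for the illustration.

```r
set.seed(42)
n <- 25   # number of cases
B <- 50   # number of bootstrap replicates
Class <- factor(c(rep("A", 10), rep("B", 15)))

# Cases by replicates; entries stay NA for cases that were in the
# bootstrap sample of that replicate.
class.predictions <- matrix(NA_character_, nrow = n, ncol = B)

for (b in seq_len(B)) {
  in_bag <- sample.int(n, n, replace = TRUE)
  oob <- setdiff(seq_len(n), in_bag)   # never seen in this replicate
  # Stand-in model: majority class among the in-bag cases.
  majority <- names(which.max(table(Class[in_bag])))
  class.predictions[oob, b] <- majority
}

# Leave-one-out bootstrap error: per-case error rate over the replicates
# in which the case was out-of-bag, averaged over cases.
wrong <- class.predictions != as.character(Class)
per_case <- rowMeans(wrong, na.rm = TRUE)
loo_boot_error <- mean(per_case, na.rm = TRUE)
print(loo_boot_error)
```

Since every step of model building in a replicate uses only the in-bag cases, the error computed from these predictions is not affected by the selection bias that distorts the OOB error of a forest fitted after variable selection.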
References
Efron, B. & Tibshirani, R. J. (1997) Improvements on cross-validation: the .632+ bootstrap method.
J. American Statistical Association, 92, 548–560.
Svetnik, V., Liaw, A., Tong, C. & Wang, T. (2004) Application of
Breiman's random forest to modeling structure-activity relationships of
pharmaceutical molecules. Pp. 334-343 in F. Roli, J. Kittler, and T. Windeatt
(eds.). Multiple Classifier Systems, Fifth International Workshop, MCS
2004, Proceedings, 9-11 June 2004, Cagliari, Italy. Lecture Notes in
Computer Science, vol. 3077. Berlin: Springer.