AUCRF
R Documentation
Variable Selection with Random Forest and the Area Under the Curve
Description
AUCRF is an algorithm for variable selection with Random Forest that
optimizes the area under the ROC curve (AUC) of the Random Forest. The
strategy is a backward elimination process based on an initial ranking
of the variables.
Arguments
formula
an object of class formula: a symbolic description of the model
to be fitted. The details of model specification are given in Details.
data
a data frame containing the variables in the model. The dependent variable must be a
binary variable stored as a factor and coded as 1 for
positives (e.g. cases) and 0 for negatives (e.g. controls).
k0
number of remaining variables at which the backward elimination process stops.
By default k0=1.
pdel
fraction of remaining variables to be removed in each step. By default pdel=0.2.
If pdel=0, only one variable is removed each time.
ranking
specifies the importance measure provided by randomForest that is used to rank the variables.
There are two options: MDG (the default) for MeanDecreaseGini, and MDA for MeanDecreaseAccuracy.
...
optional parameters to be passed to the randomForest function. If none
are specified, the default arguments of the randomForest function are used.
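The coding requirement on the dependent variable can be checked before fitting. The following is a minimal base-R sketch; the data frame raw and its columns status, snp1 and snp2 are hypothetical and serve only to illustrate the recoding.

```r
# Hypothetical raw data: outcome stored as character labels, two predictors.
raw <- data.frame(
  status = c("case", "control", "case", "control", "case", "control"),
  snp1   = c(0, 1, 2, 1, 0, 2),
  snp2   = c(1, 1, 0, 2, 2, 0)
)

# Recode the outcome as a factor with levels "0" (negatives) and "1" (positives).
raw$Y <- factor(ifelse(raw$status == "case", 1, 0), levels = c(0, 1))
dat <- raw[, c("Y", "snp1", "snp2")]

# Sanity checks to run before a call such as AUCRF(Y ~ ., data = dat).
stopifnot(is.factor(dat$Y), identical(levels(dat$Y), c("0", "1")))
table(dat$Y)
```

Fixing the levels explicitly keeps the 0/1 coding stable even if one class happens to be absent from a subset of the data.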
Details
The AUC-RF algorithm is described in detail in Calle et al. (2011). The following is
a summary:
Ranking and AUC of the initial set:
Perform a random forest using all predictor variables and the response, as specified in the formula
argument, and compute the AUC of the random forest. Based on the selected measure of importance (by default MDG),
obtain a ranking of predictors.
Elimination process:
Based on the ranking of the variables, remove the least important ones (the fraction of variables specified in
the pdel argument). Fit a new random forest with the remaining variables and compute its AUC.
This step is repeated until the number of remaining variables is less than or equal to k0.
Optimal set:
The optimal set of predictive variables is the one that gives the Random Forest with the
highest OOB-AUC, denoted OOB-AUCopt. The number of selected predictors is denoted Kopt.
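The three steps above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the helper names auc, elimination_sizes and pick_optimal are hypothetical, and the rule for how many variables are dropped per step (ceiling(pdel * k), at least one) is an assumption, so the package's exact rounding may differ.

```r
# AUC as the probability that a positive outscores a negative (ties count one
# half). `score` would be, for example, the OOB vote fraction for class 1.
auc <- function(score, y) {
  pos <- score[y == 1]
  neg <- score[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

# Sizes of the variable sets visited by the elimination process.
# ASSUMPTION: ceiling(pdel * k) variables are dropped per step, at least one.
elimination_sizes <- function(p, k0 = 1, pdel = 0.2) {
  sizes <- p
  k <- p
  while (k > k0) {
    k <- k - max(1, ceiling(pdel * k))
    sizes <- c(sizes, k)
  }
  sizes
}

# The optimal set is the one with the highest AUC along the elimination path.
pick_optimal <- function(sizes, aucs) {
  best <- which.max(aucs)
  list(Kopt = sizes[best], AUCopt = aucs[best])
}

sizes <- elimination_sizes(10, k0 = 1, pdel = 0.2)
sizes
# Made-up AUC values, one per set size, used only to show the selection step.
aucs <- c(0.70, 0.72, 0.75, 0.74, 0.71, 0.66, 0.60)
pick_optimal(sizes, aucs)$Kopt
```

With pdel=0 the schedule removes a single variable per step, which evaluates every set size at the cost of many more random forest fits.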
Value
An object of class AUCRF, which is a list with the following components:
call
the original call to AUCRF.
data
the data argument.
ranking
the ranking of predictors based on the importance measure.
Xopt
optimal set of predictors obtained.
OOB-AUCopt
AUC obtained for the optimal set of predictors.
Kopt
size of the optimal set of predictors obtained.
AUCcurve
values of AUC obtained for each set of predictors evaluated in the elimination process.
RFopt
the randomForest object fitted with the optimal set of predictors.
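Because the component name OOB-AUCopt contains a hyphen, it is not a syntactic R name and has to be quoted when it is extracted. The sketch below uses a stand-in list, fit_like (a hypothetical name), that carries a few of the documented component names; its values are copied from the summary output of the packaged fit object. With the package loaded, the same accessors would apply to a real AUCRF object.

```r
# Stand-in list with documented component names (not a real AUCRF object).
fit_like <- list(
  Kopt = 32,
  `OOB-AUCopt` = 0.7787711,
  Xopt = c("SNP9", "SNP4", "SNP3")  # first three selected variables only
)

# Syntactic names work with `$` directly.
fit_like$Kopt

# Non-syntactic names need backticks or `[[` with a string.
fit_like$`OOB-AUCopt`
fit_like[["OOB-AUCopt"]]
```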
References
Calle ML, Urrea V, Boulesteix A-L, Malats N (2011) "AUC-RF: A new strategy for genomic
profiling with Random Forest". Human Heredity. (In press)
See Also
OptimalSet, AUCRFcv, randomForest.
Examples
# load the included example dataset. This is a simulated case/control study
# data set with 4000 patients (2000 cases / 2000 controls) and 1000 SNPs,
# where the first 10 SNPs have a direct association with the outcome:
data(exampleData)
# call AUCRF process: (it may take some time)
# fit <- AUCRF(Y~., data=exampleData)
# The result of this example is included for illustration purpose:
data(fit)
summary(fit)
plot(fit)
# Additional randomForest parameters can be included, otherwise default
# parameters of randomForest function will be used:
# fit <- AUCRF(Y~., data=exampleData, ntree=1000, nodesize=20)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(AUCRF)
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
AUCRF 1.1
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/AUCRF/AUCRF.Rd_%03d_medium.png", width=480, height=480)
> ### Name: AUCRF
> ### Title: Variable Selection with Random Forest and the Area Under the
> ### Curve
> ### Aliases: AUCRF print.AUCRF summary.AUCRF exampleData fit
>
> ### ** Examples
>
>
> # load the included example dataset. This is a simulated case/control study
> # data set with 4000 patients (2000 cases / 2000 controls) and 1000 SNPs,
> # where the first 10 SNPs have a direct association with the outcome:
> data(exampleData)
>
> # call AUCRF process: (it may take some time)
> # fit <- AUCRF(Y~., data=exampleData)
>
> # The result of this example is included for illustration purpose:
>
> data(fit)
> summary(fit)
Number of selected variables: Kopt= 32
AUC of selected variables: OOB-AUCopt= 0.7787711
Importance Measure: MDG
   Selected.Variables Importance
1                SNP9  15.047305
2                SNP4  12.912120
3                SNP3  10.486599
4                SNP7   9.767075
5                SNP8   9.283819
6                SNP2   9.043039
7                SNP6   8.743129
8               SNP10   8.465736
9                SNP5   7.844703
10               SNP1   7.533021
11             SNP369   2.677609
12             SNP584   2.565316
13             SNP747   2.504847
14              SNP47   2.469360
15              SNP55   2.469196
16             SNP674   2.445041
17             SNP354   2.441501
18             SNP993   2.424503
19             SNP661   2.423057
20              SNP73   2.399690
21             SNP690   2.398267
22              SNP14   2.390978
23             SNP878   2.387848
24             SNP651   2.353301
25             SNP191   2.349521
26             SNP684   2.346010
27             SNP278   2.341461
28             SNP771   2.336632
29             SNP575   2.318485
30             SNP544   2.307716
31             SNP726   2.299561
32             SNP336   2.279044
> plot(fit)
>
> # Additional randomForest parameters can be included, otherwise default
> # parameters of randomForest function will be used:
> # fit <- AUCRF(Y~., data=exampleData, ntree=1000, nodesize=20)
> dev.off()
null device
1
>
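The summary output can be compared with the simulation design, in which the first 10 SNPs are the ones associated with the outcome. The following base-R sketch does this from the printed variable names alone; the vector names selected and causal are chosen here for illustration.

```r
# Selected variables in ranked order, copied from the summary(fit) output.
selected <- paste0("SNP", c(9, 4, 3, 7, 8, 2, 6, 10, 5, 1,
                            369, 584, 747, 47, 55, 674, 354, 993, 661, 73,
                            690, 14, 878, 651, 191, 684, 278, 771, 575, 544,
                            726, 336))
causal <- paste0("SNP", 1:10)      # simulated as associated with the outcome

length(selected)                   # size of the selected set (Kopt)
all(causal %in% selected)          # are all associated SNPs retained?
setequal(selected[1:10], causal)   # do they occupy the top ten ranks?
sum(!selected %in% causal)         # selected SNPs with no simulated effect
```

In this run all ten associated SNPs are retained and fill the top ten ranks, while the remaining 22 selected SNPs have much lower importance values (about 2.3 to 2.7).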