An object of class (rfsrc, grow) or (rfsrc,
forest).
newdata
Test data. If missing, the original grow
(training) data is used.
outcome.target
Character vector for multivariate families
specifying the target outcomes to be used. The default is to use all
coordinates.
importance
Method for computing variable importance (VIMP).
See rfsrc for details. Only applies when the test data
contains y-outcome values.
na.action
Missing value action. The default
na.omit removes the entire record if
even one of its entries is NA. Selecting na.impute
imputes the test data.
outcome
Determines whether the y-outcomes from the training
data or the test data are used to calculate the predicted value.
The default and natural choice is train which uses the
original training data. Option is ignored when newdata is
missing as the training data is used for the test data in such
settings. The option is also ignored whenever the test data is
devoid of y-outcomes. See the details and examples below for more
information.
proximity
Should proximity between test observations
be calculated? Possible choices are "inbag", "oob",
"all", TRUE, or FALSE — but some options may
not be valid and will depend on the context of the predict call.
The safest choice is TRUE if proximity is desired.
var.used
Record the number of times a variable is split?
split.depth
Return minimal depth for each variable for each case?
seed
Negative integer specifying seed for the random number
generator.
do.trace
Number of seconds between updates to the user on
approximate time to completion.
membership
Should terminal node membership and inbag
information be returned?
statistics
Should split statistics be returned? Values can be
parsed using stat.split.
...
Further arguments passed to or from other methods.
Details
Predicted values are obtained by dropping test data down the grow
forest (the forest grown using the training data). The overall error
rate and VIMP are also returned if the test data contains y-outcome
values. Single as well as joint VIMP measures can be requested. Note
that calculating VIMP can be computationally expensive (especially
when the dimension is high), thus if VIMP is not needed, computational
times can be significantly improved by setting
importance="none" which turns VIMP off.
Setting na.action="na.impute" imputes missing test data
(x-variables and/or y-outcomes). Imputation uses the grow-forest and
only training data is used to impute test data to avoid biasing error
rates and VIMP (Ishwaran et al. 2008). See the rfsrc
help file for details.
If no test data is provided, then the original training data is used
and the code reverts to restore mode allowing the user to restore the
original grow forest. This is useful, because it gives the user the
ability to extract outputs from the forest that were not asked for in
the original grow call.
If outcome="test", the predictor is calculated by using
y-outcomes from the test data (outcome information must be present).
In this case, the terminal nodes from the grow-forest are recalculated
using the y-outcomes from the test set. This yields a modified
predictor in which the topology of the forest is based solely on the
training data, but where the predicted value is based on the test
data. Error rates and VIMP are calculated by bootstrapping the test
data and using out-of-bagging to ensure unbiased estimates. See the
examples for illustration.
Value
An object of class (rfsrc, predict), which is a list with the
following components:
call
The original grow call to rfsrc.
family
The family used in the analysis.
n
Sample size of test data (depends upon NA values).
ntree
Number of trees in the grow forest.
yvar
Test set y-outcomes or original grow y-outcomes if none.
yvar.names
A character vector of the y-outcome names.
xvar
Data frame of test set x-variables.
xvar.names
A character vector of the x-variable names.
leaf.count
Number of terminal nodes for each tree in the
grow forest. Vector of length ntree.
proximity
Symmetric proximity matrix of the test data.
forest
The grow forest.
membership
Matrix recording terminal node membership for the
test data where each column contains the node number that a
case falls in for that tree.
inbag
Matrix recording inbag membership for the test data
where each column contains the number of times that a case
appears in the bootstrap sample for that tree.
var.used
Count of the number of times a variable was used in
growing the forest.
imputed.indv
Vector of indices of records in test data with
missing values.
imputed.data
Data frame comprising imputed test data. The first
columns are the y-outcomes followed by the x-variables.
split.depth
Matrix [i][j] or array [i][j][k] recording the
minimal depth for variable [j] for case [i], either averaged over
the forest, or by tree [k].
node.stats
Split statistics returned when
statistics=TRUE which can be parsed using stat.split.
err.rate
Cumulative OOB error rate for the test data if
y-outcomes are present.
importance
Test set variable importance (VIMP). Can be
NULL.
predicted
Test set predicted value.
predicted.oob
OOB predicted value (NULL unless
outcome="test").
++++++++
for classification settings, additionally ++++++++
class
In-bag predicted class labels.
class.oob
OOB predicted class labels (NULL unless outcome="test").
++++++++
for multivariate settings, additionally ++++++++
regrOutput
List containing performance values for
test regression responses (if such values are present).
clasOutput
List containing performance values for
test categorical (factor) responses (if such values are present).
++++++++
for survival settings, additionally ++++++++
chf
Cumulative hazard function (CHF).
chf.oob
OOB CHF (NULL unless outcome="test").
survival
Survival function.
survival.oob
OOB survival function (NULL unless outcome="test").
time.interest
Ordered unique death times.
ndead
Number of deaths.
++++++++
for competing risks, additionally ++++++++
chf
Cause-specific cumulative hazard function (CSCHF)
for each event.
chf.oob
OOB CSCHF for each event (NULL unless outcome="test").
cif
Cumulative incidence function (CIF) for each event.
cif.oob
OOB CIF for each event (NULL unless outcome="test").
time.interest
Ordered unique event times.
ndead
Number of events.
Note
The dimensions and values of returned objects depend heavily on the
underlying family and whether y-outcomes are present in the test data.
In particular, items related to performance will be NULL when
y-outcomes are not present. For multivariate families, predicted
values, VIMP, error rate, and performance values are stored in the
lists regrOutput and clasOutput.
For detailed definitions of returned values (such as predicted)
see the rfsrc help file.
Author(s)
Hemant Ishwaran and Udaya B. Kogalur
References
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S.
(2008). Random survival forests, Ann. App.
Statist., 2:841-860.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R,
Rnews, 7(2):25-31.
## ------------------------------------------------------------
## typical train/testing scenario
## ------------------------------------------------------------
data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
veteran.grow <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100)
veteran.pred <- predict(veteran.grow, veteran[-train , ])
print(veteran.grow)
print(veteran.pred)
## Not run:
## ------------------------------------------------------------
## predicted probability and predicted class labels are returned
## in the predict object for classification analyses
## ------------------------------------------------------------
data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast[(1:100), ], nsplit = 10)
breast.pred <- predict(breast.obj, breast[-(1:100), ])
print(head(breast.pred$predicted))
print(breast.pred$class)
## ------------------------------------------------------------
## example illustrating restore mode
## if predict is called without specifying the test data
## the original training data is used and the forest is restored
## ------------------------------------------------------------
# first we make the grow call
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
# now we restore it and compare it to the original call
# they are identical
predict(airq.obj)
print(airq.obj)
# we can retrieve various outputs that were not asked for in
# in the original call
#here we extract the proximity matrix
prox <- predict(airq.obj, proximity = TRUE)$proximity
print(prox[1:10,1:10])
#here we extract the number of times a variable was used to grow
#the grow forest
var.used <- predict(airq.obj, var.used = "by.tree")$var.used
print(head(var.used))
## ------------------------------------------------------------
## unique feature of randomForestSRC
## cross-validation can be used when factor labels differ over
## training and test data
## ------------------------------------------------------------
# first we convert all x-variables to factors
data(veteran, package = "randomForestSRC")
veteran.factor <- data.frame(lapply(veteran, factor))
veteran.factor$time <- veteran$time
veteran.factor$status <- veteran$status
# split the data into unbalanced train/test data (5/95)
# the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran), round(nrow(veteran) * .05))
summary(veteran.factor[train,])
summary(veteran.factor[-train,])
# grow the forest on the training data and predict on the test data
veteran.f.grow <- rfsrc(Surv(time, status) ~ ., veteran.factor[train, ])
veteran.f.pred <- predict(veteran.f.grow, veteran.factor[-train , ])
print(veteran.f.grow)
print(veteran.f.pred)
## ------------------------------------------------------------
## example illustrating the flexibility of outcome = "test"
## illustrates restoration of forest via outcome = "test"
## ------------------------------------------------------------
# first we make the grow call
data(pbc, package = "randomForestSRC")
pbc.grow <- rfsrc(Surv(days, status) ~ ., pbc, nsplit = 10)
# now use predict with outcome = TEST
pbc.pred <- predict(pbc.grow, pbc, outcome = "test")
# notice that error rates are the same!!
print(pbc.grow)
print(pbc.pred)
# note this is equivalent to restoring the forest
pbc.pred2 <- predict(pbc.grow)
print(pbc.grow)
print(pbc.pred)
print(pbc.pred2)
# similar example, but with na.action = "na.impute"
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
print(airq.obj)
print(predict(airq.obj))
# ... also equivalent to outcome="test" but na.action = "na.impute" required
print(predict(airq.obj, airquality, outcome = "test", na.action = "na.impute"))
# classification example
iris.obj <- rfsrc(Species ~., data = iris)
print(iris.obj)
print(predict.rfsrc(iris.obj, iris, outcome = "test"))
## ------------------------------------------------------------
## another example illustrating outcome = "test"
## unique way to check reproducibility of the forest
## ------------------------------------------------------------
# primary call
set.seed(542899)
data(pbc, package = "randomForestSRC")
train <- sample(1:nrow(pbc), round(nrow(pbc) * 0.50))
pbc.out <- rfsrc(Surv(days, status) ~ ., data=pbc[train, ],
nsplit = 10)
# standard predict call
pbc.train <- predict(pbc.out, pbc[-train, ], outcome = "train")
#non-standard predict call: overlays the test data on the grow forest
pbc.test <- predict(pbc.out, pbc[-train, ], outcome = "test")
# check forest reproducibilility by comparing "test" predicted survival
# curves to "train" predicted survival curves for the first 3 individuals
Time <- pbc.out$time.interest
matplot(Time, t(exp(-pbc.train$chf)[1:3,]), ylab = "Survival", col = 1, type = "l")
matlines(Time, t(exp(-pbc.test$chf)[1:3,]), col = 2)
## ------------------------------------------------------------
## survival analysis using mixed multivariate outcome analysis
## compare the predicted value to RSF
## ------------------------------------------------------------
# fit the pbc data using RSF
data(pbc, package = "randomForestSRC")
rsf.obj <- rfsrc(Surv(days, status) ~ ., pbc, nsplit = 10)
yvar <- rsf.obj$yvar
# fit a mixed outcome forest using days and status as y-variables
pbc.mod <- pbc
pbc.mod$status <- factor(pbc.mod$status)
mix.obj <- rfsrc(Multivar(days, status) ~., pbc.mod, nsplit = 10)
# compare oob predicted values
rsf.pred <- rsf.obj$predicted.oob
mix.pred <- mix.obj$regrOutput$days$predicted.oob
plot(rsf.pred, mix.pred)
# compare C-index error rate
if (library("Hmisc", logical.return = TRUE))
{
rsf.err <- rcorr.cens(rsf.pred, Surv(yvar[, 1], yvar[, 2]))[1]
mix.err <- 1 - rcorr.cens(mix.pred, Surv(yvar[, 1], yvar[, 2]))[1]
cat("RSF :", rsf.err, "\n")
cat("multivariate forest:", mix.err, "\n")
}
## End(Not run)