A symbolic description of the model to be fit.
Must be specified unless object is given.
data
Data frame containing the y-outcome and x-variables in
the model. Must be specified unless object is given.
object
An object of class (rfsrc, synthetic).
Not required when formula and data are supplied.
newdata
Test data used for prediction (optional).
ntree
Number of trees.
mtry
mtry value for synthetic forest.
mtrySeq
Sequence of mtry values used for fitting the
collection of RF machines. If NULL, set to the default
value p/3.
nodesize
Nodesize value for the synthetic forest.
nodesizeSeq
Sequence of nodesize values used for fitting the
collection of RF machines.
nsplit
If non-zero, nsplit-randomized splitting is used, which can
significantly increase speed.
min.node
Minimum forest averaged number of nodes a RF machine
must exceed in order to be used as a synthetic feature.
use.org.features
In addition to synthetic features, should
the original features be used when fitting synthetic forests?
na.action
Missing value action. The default na.omit
removes the entire record if even one of its entries is NA.
The action na.impute pre-imputes the data using fast
imputation via impute.rfsrc.
verbose
Set to TRUE for verbose output.
...
Further arguments to be passed to the rfsrc
function used for fitting the synthetic forest.
Details
A collection of random forests (called RF machines) is fit using
different nodesize values. The predicted values from these machines
are then used as synthetic features to fit a synthetic random
forest (the original features are also used when fitting the synthetic
forest). Currently only implemented for regression and classification
settings (univariate and multivariate).
Note that synthetic features are constructed using out-of-bag (OOB)
data in order to avoid double dipping into training data.
Nevertheless, the internal OOB error rate for the synthetic forest
will be biased and thus cross-validation must be used for determining
performance.
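As an illustration of the cross-validation advice above, the following is a minimal sketch of a K-fold estimate of test error. The helper names cv_mse and syn_fit_predict are hypothetical and not part of the package. The wrapper assumes that rfsrcSyn is available and that, for regression, the test-set predictions are stored in rfSynPred$predicted.

```r
## Hypothetical helper: K-fold cross-validated MSE for any fit/predict routine.
## fit_predict(train, test) must return numeric predictions for the rows of test.
cv_mse <- function(data, yname, fit_predict, K = 5, seed = 1) {
  set.seed(seed)
  ## randomly assign each row to one of K folds of (nearly) equal size
  fold <- sample(rep(seq_len(K), length.out = nrow(data)))
  err <- numeric(K)
  for (k in seq_len(K)) {
    train <- data[fold != k, , drop = FALSE]
    test  <- data[fold == k, , drop = FALSE]
    pred  <- fit_predict(train, test)
    ## mean squared error on the held-out fold
    err[k] <- mean((test[[yname]] - pred)^2)
  }
  mean(err)
}

## Hypothetical wrapper around a synthetic forest (regression on mtcars$mpg).
## Assumption: predictions for newdata are found in rfSynPred$predicted.
syn_fit_predict <- function(train, test) {
  fit <- rfsrcSyn(mpg ~ ., data = train, newdata = test, verbose = FALSE)
  fit$rfSynPred$predicted
}

## Usage (requires the package providing rfsrcSyn to be loaded):
## cv_mse(mtcars, "mpg", syn_fit_predict, K = 5)
```

Because each fold's synthetic forest is grown without the held-out rows, the resulting error estimate avoids the bias of the internal OOB error rate.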
If mtrySeq is set, RF machines are constructed for each
combination of nodesize and mtry values specified by
nodesizeSeq and mtrySeq. However, a sequence of values for
mtrySeq generally does not work as well as using a fixed value.
Generally, performance gains are observed when one of the two
sequences is fixed: mtrySeq is fixed and nodesizeSeq is
varied, or nodesizeSeq is fixed and mtrySeq is varied.
However, see the examples below where this is not the case.
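To make the cost of varying both sequences concrete, the sketch below enumerates every combination of nodesize and mtry values. The sequences shown are illustrative assumptions, not necessarily the defaults used by rfsrcSyn.

```r
## Illustrative sequences only; not necessarily the defaults used by rfsrcSyn
nodesizeSeq <- c(1:10, 20, 30, 50, 100)
mtrySeq <- c(1, 10, 20)

## One row per (nodesize, mtry) pair, i.e. one candidate RF machine per row
machines <- expand.grid(nodesize = nodesizeSeq, mtry = mtrySeq)
n_machines <- nrow(machines)
print(n_machines)
```

With 14 nodesize values and 3 mtry values this gives 42 candidate machines, compared with 14 when mtry is held fixed. This is an upper bound on the number of synthetic features, since machines that do not exceed min.node are not used.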
Value
A list with the following components:
rfMachines
RF machines used to construct the synthetic
features.
rfSyn
The (grow) synthetic RF built over training data.
rfSynPred
The (predict) synthetic RF built over test data (if available).
synthetic
List containing the synthetic features.
opt.machine
Optimal machine: RF machine with smallest
OOB error rate.
Author(s)
Hemant Ishwaran and Udaya B. Kogalur
References
Ishwaran H. and Malley J.D. (2014). Synthetic learning machines.
BioData Mining, 7:28.
See Also
rfsrc,
impute.rfsrc
Examples
## Not run:
## ------------------------------------------------------------
## compare synthetic forests to regular forest (classification)
## ------------------------------------------------------------
## rfsrc and rfsrcSyn calls
if (library("mlbench", logical.return = TRUE)) {
  ## simulate the data
  ring <- data.frame(mlbench.ringnorm(250, 20))
  ## classification forests
  ringRF <- rfsrc(classes ~ ., data = ring)
  ## synthetic forests:
  ## 1 = nodesize varied
  ## 2 = nodesize/mtry varied
  ringSyn1 <- rfsrcSyn(classes ~ ., data = ring)
  ringSyn2 <- rfsrcSyn(classes ~ ., data = ring, mtrySeq = c(1, 10, 20))
  ## test-set performance
  ring.test <- data.frame(mlbench.ringnorm(500, 20))
  pred.ringRF <- predict(ringRF, newdata = ring.test)
  pred.ringSyn1 <- rfsrcSyn(object = ringSyn1, newdata = ring.test)$rfSynPred
  pred.ringSyn2 <- rfsrcSyn(object = ringSyn2, newdata = ring.test)$rfSynPred
  print(pred.ringRF)
  print(pred.ringSyn1)
  print(pred.ringSyn2)
}
## ------------------------------------------------------------
## compare synthetic forest to regular forest (regression)
## ------------------------------------------------------------
## simulate the data
n <- 250
ntest <- 1000
N <- n + ntest
d <- 50
std <- 0.1
x <- matrix(runif(N * d, -1, 1), ncol = d)
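## binary 0/1 outcome, treated here as a numeric (regression) response;
## noise is drawn for all N rows rather than recycled from n draws
y <- 1 * (x[,1] + x[,4]^3 + x[,9] + sin(x[,12]*x[,18]) + rnorm(N, sd = std) > .38)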
dat <- data.frame(x = x, y = y)
test <- (n+1):N
## regression forests
regF <- rfsrc(y ~ ., data = dat[-test, ])
pred.regF <- predict(regF, dat[test, ], importance = "none")
## synthetic forests
## we pass both the training and testing data
## but this can be split into separate commands as in the
## previous classification example
synF1 <- rfsrcSyn(y ~ ., data = dat[-test, ],
                  newdata = dat[test, ])
synF2 <- rfsrcSyn(y ~ ., data = dat[-test, ],
                  newdata = dat[test, ], mtrySeq = c(1, 10, 20, 30, 40, 50))
## standardized MSE performance
mse <- c(tail(pred.regF$err.rate, 1),
         tail(synF1$rfSynPred$err.rate, 1),
         tail(synF2$rfSynPred$err.rate, 1)) / var(y[-test])
names(mse) <- c("forest", "synthetic1", "synthetic2")
print(mse)
## ------------------------------------------------------------
## multivariate synthetic forests
## ------------------------------------------------------------
mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
trn <- sample(1:nrow(mtcars.new), nrow(mtcars.new)/2)
mvSyn <- rfsrcSyn(cbind(carb, mpg, cyl) ~., data = mtcars.new[trn,])
mvSyn.pred <- rfsrcSyn(object = mvSyn, newdata = mtcars.new[-trn,])
## End(Not run)