R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Synthetic Random Forests

rfsrcSyn

R Documentation

Synthetic Random Forests

Description

Grows a synthetic random forest (RF) using RF machines as synthetic features. Applies only to regression and classification settings.

Usage

## S3 method for class 'rfsrc'
rfsrcSyn(formula, data, object, newdata,
  ntree = 1000,
  mtry = NULL,
  mtrySeq = NULL,
  nodesize = 5,
  nodesizeSeq = c(1:10,20,30,50,100),
  nsplit = 0,
  min.node = 3,
  use.org.features = TRUE,
  na.action = c("na.omit", "na.impute"),
  verbose = TRUE,
  ...)

Arguments

`formula`	A symbolic description of the model to be fit. Must be specified unless `object` is given.
`data`	Data frame containing the y-outcome and x-variables in the model. Must be specified unless `object` is given.
`object`	An object of class `(rfsrc, synthetic)`. Not required when `formula` and `data` are supplied.
`newdata`	Test data used for prediction (optional).
`ntree`	Number of trees.
`mtry`	mtry value for synthetic forest.
`mtrySeq`	Sequence of mtry values used for fitting the collection of RF machines. If `NULL`, set to the default value `p`/3.
`nodesize`	Nodesize value for the synthetic forest.
`nodesizeSeq`	Sequence of nodesize values used for the fitting the collection of RF machines.
`nsplit`	If non-zero, nsplit-randomized splitting is used which can significantly increase speed.
`min.node`	Minimum forest averaged number of nodes a RF machine must exceed in order to be used as a synthetic feature.
`use.org.features`	In addition to synthetic features, should the original features be used when fitting synthetic forests?
`na.action`	Missing value action. The default `na.omit` removes the entire record if even one of its entries is `NA`. The action `na.impute` pre-imputes the data using fast imputation via `impute.rfsrc`.
`verbose`	Set to `TRUE` for verbose output.
`...`	Further arguments to be passed to the `rfsrc` function used for fitting the synthetic forest.

Details

A collection of random forests are fit using different nodesize values. The predicted values from these machines are then used as synthetic features (called RF machines) to fit a synthetic random forest (the original features are also used when fitting the synthetic forest). Currently only implemented for regression and classification settings (univariate and multivariate).

Note that synthetic features are constructed using out-of-bag (OOB) data in order to avoid double dipping into training data. Nevertheless, the internal OOB error rate for the synthetic forest will be biased and thus cross-validation must be used for determining performance.

If mtrySeq is set, RF machines are constructed for each combination of nodesize and mtry values specified by nodesizeSeq mtrySeq. However, a sequence of values for mtrySeq generally does not work as well as using a fixed value. Generally, performance gains are observed when one of the two sequences is fixed: mtrySeq is fixed and nodesizeSeq is varied, or nodesizeSeq is fixed and mtrySeq is varied. However, see the examples below where this is not the case.

Value

A list with the following components:

`rfMachines`	RF machines used to construct the synthetic features.
`rfSyn`	The (grow) synthetic RF built over training data.
`rfSynPred`	The predict synthetic RF built over test data (if available).
`synthetic`	List containing the synthetic features.
`opt.machine`	Optimal machine: RF machine with smallest OOB error rate.

Author(s)

Hemant Ishwaran and Udaya B. Kogalur

References

Ishwaran H. and Malley J.D. (2014). Synthetic learning machines. BioData Mining, 7:28.

Examples

## Not run: 
## ------------------------------------------------------------
## compare synthetic forests to regular forest (classification)
## ------------------------------------------------------------

## rfsrc and rfsrcSyn calls
if (library("mlbench", logical.return = TRUE)) {

  ## simulate the data 
  ring <- data.frame(mlbench.ringnorm(250, 20))

  ## classification forests
  ringRF <- rfsrc(classes ~., data = ring)

  ## synthetic forests:
  ## 1 = nodesize varied
  ## 2 = nodesize/mtry varied
  ringSyn1 <- rfsrcSyn(classes ~., data = ring)
  ringSyn2 <- rfsrcSyn(classes ~., data = ring, mtrySeq = c(1, 10, 20))

  ## test-set performance
  ring.test <- data.frame(mlbench.ringnorm(500, 20))
  pred.ringRF <- predict(ringRF, newdata = ring.test)
  pred.ringSyn1 <- rfsrcSyn(object = ringSyn1, newdata = ring.test)$rfSynPred
  pred.ringSyn2 <- rfsrcSyn(object = ringSyn2, newdata = ring.test)$rfSynPred


  print(pred.ringRF)
  print(pred.ringSyn1)
  print(pred.ringSyn2)

}

## ------------------------------------------------------------
## compare synthetic forest to regular forest (regression)
## ------------------------------------------------------------

## simulate the data
n <- 250
ntest <- 1000
N <- n + ntest
d <- 50
std <- 0.1
x <- matrix(runif(N * d, -1, 1), ncol = d)
y <- 1 * (x[,1] + x[,4]^3 + x[,9] + sin(x[,12]*x[,18]) + rnorm(n, sd = std)>.38)
dat <- data.frame(x = x, y = y)
test <- (n+1):N

## regression forests
regF <- rfsrc(y ~ ., data = dat[-test, ], )
pred.regF <- predict(regF, dat[test, ], importance = "none")

## synthetic forests
## we pass both the training and testing data
## but this can be split into separate commands as in the
## previous classification example
synF1 <- rfsrcSyn(y ~ ., data = dat[-test, ],
  newdata = dat[test, ])
synF2 <- rfsrcSyn(y ~ ., data = dat[-test, ],
  newdata = dat[test, ], mtrySeq = c(1, 10, 20, 30, 40, 50))

## standardized MSE performance
mse <- c(tail(pred.regF$err.rate, 1),
         tail(synF1$rfSynPred$err.rate, 1),
         tail(synF2$rfSynPred$err.rate, 1)) / var(y[-test])
names(mse) <- c("forest", "synthetic1", "synthetic2")
print(mse)

## ------------------------------------------------------------
## multivariate synthetic forests
## ------------------------------------------------------------

mtcars.new <- mtcars
mtcars.new$cyl <- factor(mtcars.new$cyl)
mtcars.new$carb <- factor(mtcars.new$carb, ordered = TRUE)
trn <- sample(1:nrow(mtcars.new), nrow(mtcars.new)/2)
mvSyn <- rfsrcSyn(cbind(carb, mpg, cyl) ~., data = mtcars.new[trn,])
mvSyn.pred <- rfsrcSyn(object = mvSyn, newdata = mtcars.new[-trn,])

## End(Not run)