Last data update: 2014.03.03

R: Parallel random forest generation
prandomForest    R Documentation

Parallel random forest generation

Description

The machine learning function prandomForest() is an ensemble tree classifier that constructs a forest of decision trees from bootstrap samples of a dataset in parallel. The random forest algorithm can be used for both classification (categorical response) and regression (continuous response). This function provides a parallel equivalent to the serial randomForest() function from the randomForest package. Note that the randomForest library must be loaded before calling the prandomForest function:

library("randomForest")

N.B. Please see the SPRINT User Guide for how to run the code in parallel using the mpiexec command.

Usage

prandomForest(x, ...)
## Default S3 method:
prandomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, 
                      mtry = if (!is.null(y) && !is.factor(y))
                                 max(floor(ncol(x)/3), 1) 
                             else floor(sqrt(ncol(x))),
                      replace=TRUE, classwt=NULL, cutoff, strata,
                      sampsize = if (replace) nrow(x) 
                                 else ceiling(.632*nrow(x)),
                      nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                      maxnodes=NULL, importance=FALSE, localImp=FALSE, 
                      nPerm=1, proximity, oob.prox=proximity, norm.votes=TRUE,
                      do.trace=FALSE, 
                      keep.forest = !is.null(y) && is.null(xtest), 
                      corr.bias=FALSE, keep.inbag=FALSE, ...)

Arguments

x

array of data

...

optional parameters to be passed to the low level function randomForest.default.

y

response vector. If y is a factor, classification is assumed; otherwise regression is assumed. If omitted, prandomForest() will run in unsupervised mode.
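The way the type of y selects the mode can be sketched in base R. The helper name forest_mode is hypothetical and is not part of SPRINT or randomForest; it only mirrors the rule described for the y argument.

```r
# Hypothetical helper (not part of SPRINT): reports which mode
# the description of y implies for a given response.
forest_mode <- function(y = NULL) {
  if (is.null(y)) {
    "unsupervised"      # no response supplied
  } else if (is.factor(y)) {
    "classification"    # factor response
  } else {
    "regression"        # any other (e.g. numeric) response
  }
}

forest_mode(iris$Species)       # factor response
forest_mode(iris$Sepal.Length)  # numeric response
forest_mode()                   # y omitted
```

A character vector is not a factor, so under this rule it would need to be converted with factor() before being used as a classification response.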

xtest

data array of predictors for the test set

ytest

response for the test set

ntree

integer, the number of trees to grow

mtry

integer, the number of variables randomly sampled as candidates at each split. The default value is sqrt(p) for classification and p/3 for regression, where p is the number of variables in the data matrix x.
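As an illustration, the default mtry expression from the Usage section can be evaluated on its own. The helper name default_mtry is hypothetical; the body is copied from the default shown in Usage.

```r
# Hypothetical helper reproducing the default mtry expression from Usage.
default_mtry <- function(x, y = NULL) {
  if (!is.null(y) && !is.factor(y)) {
    max(floor(ncol(x) / 3), 1)   # regression: p/3, but at least 1
  } else {
    floor(sqrt(ncol(x)))         # classification or unsupervised: sqrt(p)
  }
}

x <- matrix(rnorm(20 * 16), nrow = 20, ncol = 16)   # p = 16 predictors
default_mtry(x, factor(rep(c("a", "b"), 10)))       # 4
default_mtry(x, rnorm(20))                          # 5
```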

replace

boolean, whether the sampling of cases is done with or without replacement. The default value is TRUE.

classwt

vector of priors of the classes. The default value is NULL.

cutoff

vector of k elements where k is the number of classes. The winning class for an observation is the one with the maximum ratio of proportion of votes to cutoff. The default value is 1/k.
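The cutoff rule can be worked through with a small base R sketch. The vote proportions below are made-up numbers and the helper name winning_class is hypothetical; the code only applies the "maximum ratio of proportion of votes to cutoff" rule stated above.

```r
# Hypothetical helper: pick the class with the largest votes/cutoff ratio.
winning_class <- function(votes, cutoff = rep(1 / length(votes), length(votes))) {
  names(votes)[which.max(votes / cutoff)]
}

# Vote proportions for one observation across k = 3 classes (made-up numbers).
votes <- c(a = 0.50, b = 0.30, c = 0.20)

# Default cutoff 1/k: the class with the most votes wins.
winning_class(votes)                              # "a"

# A lower cutoff for "b" makes it easier for "b" to win:
# ratios are 0.50/0.6, 0.30/0.2 and 0.20/0.2.
winning_class(votes, cutoff = c(0.6, 0.2, 0.2))   # "b"
```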

strata

variable used for stratified sampling

sampsize

size of sample to draw. For classification, if sampsize is a vector of the length of the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
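The following base R sketch shows the default sampsize expression from Usage and what a stratified draw means in practice. Both helper names (default_sampsize, stratified_draw) are hypothetical, and the sampling code is only an illustration of the idea, not the sampling routine used inside randomForest or SPRINT.

```r
# Hypothetical helper reproducing the default sampsize expression from Usage.
default_sampsize <- function(x, replace = TRUE) {
  if (replace) nrow(x) else ceiling(0.632 * nrow(x))
}

x <- matrix(0, nrow = 150, ncol = 4)
default_sampsize(x, replace = TRUE)    # 150
default_sampsize(x, replace = FALSE)   # 95

# Hypothetical helper: draw sampsize[i] row indices from the i-th stratum.
stratified_draw <- function(strata, sampsize, replace = TRUE) {
  groups <- split(seq_along(strata), strata)
  unlist(Map(function(idx, n) idx[sample.int(length(idx), n, replace = replace)],
             groups, sampsize),
         use.names = FALSE)
}

set.seed(1)
strata <- factor(rep(c("a", "b", "c"), times = c(100, 40, 10)))
idx <- stratified_draw(strata, sampsize = c(10, 10, 10))
table(strata[idx])   # ten rows drawn from each stratum
```

Drawing the same number of rows from each stratum, as here, is one way to balance a sample when the strata have very different sizes.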

nodesize

integer, the minimum size of the terminal nodes. The default value is 1 for classification and 5 for regression.

maxnodes

integer, maximum number of terminal nodes allowed for the trees. The default value is NULL.

importance

boolean, whether the importance of predictors is assessed. The default value is FALSE.

localImp

boolean, whether casewise importance measure is to be computed. The default value is FALSE.

nPerm

integer, the number of times the out-of-bag data are permuted per tree for assessing variable importance. The default value is one. Regression only.

proximity

boolean, whether the proximity measure among the rows is to be calculated.

oob.prox

boolean, whether the proximity is to be calculated for out-of-bag data. The default value is set to be the same as the value of the proximity parameter.

norm.votes

boolean, whether the final votes are expressed as fractions (TRUE) or returned as raw vote counts (FALSE). The default value is TRUE. Classification only.
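The difference between raw counts and fractions can be seen in a small base R sketch. The vote matrix is made up for illustration and does not come from a fitted forest.

```r
# Raw vote counts for 3 observations and 2 classes (made-up numbers).
raw_votes <- matrix(c(7, 3,
                      2, 8,
                      5, 5),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("a", "b")))

# Normalising divides each row by its total, so every row sums to 1.
norm_votes <- raw_votes / rowSums(raw_votes)
norm_votes
```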

do.trace

boolean, whether a verbose output is produced. The default value is FALSE. If set to an integer i then the output is printed for every i trees.

keep.forest

boolean, whether the forest is returned in the output object. As shown in Usage, the default value is TRUE when y is supplied and xtest is not, and FALSE otherwise.

corr.bias

boolean, whether to perform a bias correction. The default value is FALSE. Regression only.

keep.inbag

boolean, whether the matrix which keeps track of which samples are in-bag in which trees should be returned. The default value is FALSE.
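To give an idea of what an in-bag matrix records, the base R sketch below builds one from bootstrap samples. The layout (one row per sample, one column per tree) and the use of draw counts are assumptions made for illustration; the exact encoding returned by prandomForest() is not specified here.

```r
set.seed(42)
n     <- 8   # number of samples (illustrative)
ntree <- 5   # number of trees (illustrative)

# Column t counts how often each sample was drawn into the
# bootstrap sample used for tree t.
inbag <- sapply(seq_len(ntree), function(t) {
  tabulate(sample.int(n, n, replace = TRUE), nbins = n)
})

dim(inbag)          # n rows, ntree columns
colSums(inbag)      # every bootstrap sample has n draws
oob <- inbag == 0   # TRUE where a sample is out-of-bag for a tree
```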

Author(s)

University of Edinburgh SPRINT Team, sprint@ed.ac.uk, www.r-sprint.org

See Also

randomForest, SPRINT

Results