R Graphical Manual

Last data update: 2014.03.03

R: Parallel random forest generation

prandomForest

R Documentation

Parallel random forest generation

Description

The machine learning function prandomForest() is an ensemble tree method that builds a forest of decision trees in parallel, each grown from a bootstrap sample of the dataset. The random forest algorithm can be used for both classification (categorical response) and regression (continuous response). This function provides a parallel equivalent to the serial randomForest() function from the randomForest package. Note that the randomForest package must be loaded before calling prandomForest(), for example with library("randomForest").

N.B. Please see the SPRINT User Guide for how to run the code in parallel using the mpiexec command.
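A minimal sketch of a script that calls prandomForest() on the built-in iris data, following the default-method signature given under Usage. The file name rf_example.R, the process count, the helper name run_prf_example and the exact mpiexec command line are illustrative assumptions; pterminate() is assumed here to be the SPRINT call that shuts down the MPI processes. Check all of these against the SPRINT User Guide before relying on them.

```r
# rf_example.R (hypothetical file name)
# Assumed launch command; confirm it in the SPRINT User Guide:
#   mpiexec -n 4 R --no-save -f rf_example.R

run_prf_example <- function(ntree = 500) {
  library("randomForest")  # must be loaded before prandomForest() is called
  library("sprint")

  x <- iris[, 1:4]    # predictors
  y <- iris$Species   # factor response, so classification is assumed

  fit <- prandomForest(x, y, ntree = ntree)
  print(fit)

  pterminate()        # assumed SPRINT shutdown call for the MPI processes
  invisible(fit)
}

# Run only when both packages are installed; system.file() returns ""
# for a missing package and does not load anything.
if (nzchar(system.file(package = "randomForest")) &&
    nzchar(system.file(package = "sprint"))) {
  run_prf_example()
}
```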

Usage

prandomForest(x, ...)
## Default S3 method:
prandomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500, 
                      mtry = if (!is.null(y) && !is.factor(y))
                                 max(floor(ncol(x)/3), 1) 
                             else floor(sqrt(ncol(x))),
                      replace=TRUE, classwt=NULL, cutoff, strata,
                      sampsize = if (replace) nrow(x) 
                                 else ceiling(.632*nrow(x)),
                      nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
                      maxnodes=NULL, importance=FALSE, localImp=FALSE, 
                      nPerm=1, proximity, oob.prox=proximity, norm.votes=TRUE,
                      do.trace=FALSE, 
                      keep.forest = !is.null(y) && is.null(xtest), 
                      corr.bias=FALSE, keep.inbag=FALSE, ...)
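
The default expressions for mtry, sampsize and nodesize in the signature above depend on whether y is a factor and on the shape of x. The sketch below evaluates the same expressions in base R for one classification and one regression case. prf_defaults is a hypothetical helper written for illustration only; it is not part of SPRINT or randomForest.

```r
# Evaluate the default expressions from the Usage signature for given x and y.
# prf_defaults is a hypothetical helper, not a SPRINT or randomForest function.
prf_defaults <- function(x, y = NULL, replace = TRUE) {
  regression <- !is.null(y) && !is.factor(y)
  list(
    mtry     = if (regression) max(floor(ncol(x) / 3), 1) else floor(sqrt(ncol(x))),
    sampsize = if (replace) nrow(x) else ceiling(0.632 * nrow(x)),
    nodesize = if (regression) 5 else 1
  )
}

# Classification: factor response, 4 predictors, 150 rows.
cls <- prf_defaults(iris[, 1:4], iris$Species)

# Regression: numeric response, 10 predictors, 32 rows, no replacement.
reg <- prf_defaults(mtcars[, -1], mtcars$mpg, replace = FALSE)

str(cls)
str(reg)
```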

Arguments

`x`	array of data
`...`	optional parameters to be passed to the low level function randomForest.default.
`y`	response vector. If y is a factor, classification is assumed; otherwise regression is assumed. If omitted, prandomForest() will run in unsupervised mode.
`xtest`	data array of predictors for the test set
`ytest`	response for the test set
`ntree`	integer, the number of trees to grow
`mtry`	integer, the number of variables randomly sampled as candidates at each split. The default value is sqrt(p) for classification and p/3 for regression, where p is the number of variables in the data matrix x.
`replace`	boolean, whether the sampling of cases is done with or without replacement. The default value is TRUE.
`classwt`	vector of priors of the classes. The default value is NULL.
`cutoff`	vector of k elements where k is the number of classes. The winning class for an observation is the one with the maximum ratio of proportion of votes to cutoff. The default value is 1/k.
`strata`	variable used for stratified sampling
`sampsize`	size of sample to draw. For classification, if sampsize is a vector of the length of the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
`nodesize`	integer, the minimum size of the terminal nodes. The default value is 1 for classification and 5 for regression.
`maxnodes`	integer, maximum number of terminal nodes allowed for the trees. The default value is NULL.
`importance`	boolean, whether the importance of predictors is assessed. The default value is FALSE.
`localImp`	boolean, whether casewise importance measure is to be computed. The default value is FALSE.
`nPerm`	integer, the number of times the out-of-bag data are permuted per tree for assessing variable importance. The default value is one. Regression only.
`proximity`	boolean, whether the proximity measure among the rows is to be calculated.
`oob.prox`	boolean, whether the proximity is to be calculated for out-of-bag data. The default value is set to be the same as the value of the proximity parameter.
`norm.votes`	boolean, whether the final votes are expressed as fractions (TRUE) or returned as raw vote counts (FALSE). The default value is TRUE. Classification only.
`do.trace`	boolean, whether a verbose output is produced. The default value is FALSE. If set to an integer i then the output is printed for every i trees.
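`keep.forest`	boolean, whether the forest is returned in the output object. As shown under Usage, the default is !is.null(y) && is.null(xtest), so the forest is kept only when y is supplied and no test set is given.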
`corr.bias`	boolean, whether to perform a bias correction. The default value is FALSE. Regression only.
`keep.inbag`	boolean, whether the matrix which keeps track of which samples are in-bag in which trees should be returned. The default value is FALSE.
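
The cutoff rule described above (the winning class is the one with the largest ratio of vote proportion to cutoff) can be sketched in base R as follows. pick_class and the vote counts are made up for illustration; this is not the code used inside prandomForest().

```r
# pick_class is a hypothetical helper that illustrates the cutoff rule only.
pick_class <- function(votes, cutoff = rep(1 / length(votes), length(votes))) {
  stopifnot(length(votes) == length(cutoff))
  prop <- votes / sum(votes)               # proportion of votes per class
  names(votes)[which.max(prop / cutoff)]   # largest proportion-to-cutoff ratio
}

votes <- c(a = 250, b = 150, c = 100)      # made-up vote counts from 500 trees

pick_class(votes)                              # default cutoff of 1/k per class
pick_class(votes, cutoff = c(0.6, 0.2, 0.2))   # raises the bar for class "a"
```

With the default cutoff the majority class wins; raising the cutoff for one class makes it harder for that class to be selected even when it holds the most votes.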

Author(s)

University of Edinburgh SPRINT Team sprint@ed.ac.uk www.r-sprint.org
