Last data update: 2014.03.03

impute.rfsrc

R Documentation

Impute Only Mode

Description

Fast imputation mode. A random forest is grown and used to impute missing data. No ensemble estimates or error rates are calculated.

Usage

## S3 method for class 'rfsrc'
impute(formula, data, ntree = 500, mtry = NULL,
  xvar.wt = NULL, nodesize = 1, splitrule = NULL, nsplit = 1,
  na.action = "na.impute", nimpute = 2, mf.q, blocks,
  always.use = NULL, max.iter = 10, eps = 0.01, verbose = TRUE,
  do.trace = FALSE, ...)

Arguments

`formula`	A symbolic description of the model to be fit. Can be left unspecified if there are no outcomes, or if there is no need to distinguish between y-outcomes and x-variables in the imputation.
`data`	Data frame containing the data to be imputed.
`ntree`	Number of trees to grow.
`mtry`	Number of variables randomly sampled at each split.
`nodesize`	Minimum terminal node size.
`splitrule`	Splitting rule used to grow trees.
`nsplit`	Non-negative integer value used to specify random splitting.
`na.action`	Missing value action. See details below.
`nimpute`	Number of iterations of the missing data algorithm. Ignored for multivariate missForest, in which case the algorithm iterates until a convergence criterion is met (users can, however, enforce a maximum number of iterations with the option `max.iter`).
`mf.q`	Fraction of variables (between 0 and 1) used as responses in multivariate missForest imputation. Can also be an integer, in which case it equals the number of multivariate responses. If left unspecified (the default), multivariate missForest imputation is not performed.
`blocks`	Integer value specifying the number of blocks the data should be broken up into (by rows). This can improve computational efficiency when the sample size is large, but imputation efficiency decreases. If left unspecified (the default), the data is not broken up.
`always.use`	Character vector of variable names to always be included as a response in multivariate missForest imputation. Does not apply for other imputation methods.
`xvar.wt`	Weights for selecting variables for splitting on.
`max.iter`	Maximum number of iterations used when implementing multivariate missForest imputation.
`eps`	Tolerance value used to determine convergence of multivariate missForest imputation.
`verbose`	Send verbose output to terminal (only applies to multivariate missForest imputation).
`do.trace`	Number of seconds between updates to the user on approximate time to completion.
`...`	Further arguments passed to or from other methods.
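
The row-wise blocking described for `blocks` can be pictured with a small base R sketch. The helper `split_into_blocks` is hypothetical and exists only for illustration; how the package actually partitions rows is not stated here and may differ.

```r
# Hypothetical helper: assign the rows of a data frame to `blocks`
# contiguous groups of roughly equal size. Illustration only; this is
# not the package's internal code.
split_into_blocks <- function(data, blocks) {
  n <- nrow(data)
  # cut() on the row index yields integer group codes 1..blocks
  grp <- cut(seq_len(n), breaks = blocks, labels = FALSE)
  split(data, grp)
}

parts <- split_into_blocks(airquality, blocks = 4)
sapply(parts, nrow)   # number of rows in each block
```

Each block would then be imputed separately, which is why fewer rows are available to inform each imputed value.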

Details

Grow a forest and use this to impute data. All external calculations such as ensemble calculations, error rates, etc. are turned off. Use this function if your only interest is imputing the data.
By default, prior to splitting a node, if there is missing data for a variable, the missing data is imputed by randomly drawing values from non-missing in-bag data. The purpose of this is to make it possible to assign cases to daughter nodes in the event the node is split on a variable with missing data. However, imputed data is not used to calculate the split statistic, which uses non-missing data only.
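
The random-draw step can be sketched in base R as follows. This is a simplified illustration under the assumption that a single vector stands in for the in-bag values of one variable within a node; `impute_by_random_draw` is a hypothetical helper, not a function exported by the package.

```r
# Sketch of the pre-split imputation idea: missing values of a variable
# are replaced by random draws (with replacement) from its observed values.
impute_by_random_draw <- function(x) {
  miss <- is.na(x)
  observed <- x[!miss]
  if (any(miss) && length(observed) > 0) {
    # sample indices rather than values to avoid sample()'s special
    # handling of a length-one numeric argument
    idx <- sample.int(length(observed), sum(miss), replace = TRUE)
    x[miss] <- observed[idx]
  }
  x
}

set.seed(1)
ozone <- airquality$Ozone                  # contains NAs
ozone.filled <- impute_by_random_draw(ozone)

# The filled values only serve to send cases left or right after a split.
# A split statistic would still be computed from observed values only.
split.input <- ozone[!is.na(ozone)]
```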
If no formula is specified, unsupervised splitting is implemented using a ytry value of sqrt(p) where p equals the number of variables. More precisely, mtry variables are selected at random, and for each of these a random subset of ytry variables is selected and defined as the multivariate pseudo-responses. A multivariate composite splitting rule of dimension ytry is then applied to each of the mtry multivariate regression problems and the node is split on the variable leading to the best split.
If mf.q is specified, then a multivariate version of missForest imputation (Stekhoven and Buhlmann, 2012) is applied. A fraction mf.q of the variables are used as multivariate responses and split on the remaining variables using a multivariate composite splitting rule. Missing data for responses are imputed by prediction. This is repeated with a new set of variables used as responses (mutually exclusive to the previous), until all variables have been imputed. The entire process is repeated, and the algorithm is iterated until a convergence criterion is met (specified using options max.iter and eps). Using an integer value for mf.q is allowed, in which case a total of mf.q variables are used as multivariate responses.
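
The variable selection in unsupervised splitting can be illustrated with a short base R sketch. Several details are assumptions made for the example: the value of `mtry`, rounding sqrt(p) up with `ceiling()`, and drawing the pseudo-responses from the variables other than the candidate itself. The actual split evaluation is not shown.

```r
# Sketch of candidate selection for one node (illustrative only).
set.seed(1)
vars <- names(airquality)
p    <- length(vars)
mtry <- 3                        # assumed value for this example
ytry <- ceiling(sqrt(p))         # rounding rule is an assumption

# Step 1: pick mtry candidate split variables at random
candidates <- sample(vars, mtry)

# Step 2: for each candidate, pick ytry pseudo-responses at random
# (assumed here to exclude the candidate itself)
pseudo <- lapply(candidates, function(v) sample(setdiff(vars, v), ytry))
names(pseudo) <- candidates

# Step 3 (not shown): score each candidate with a multivariate composite
# splitting rule against its pseudo-responses and keep the best split.
pseudo
```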
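
One pass of the multivariate missForest scheme cycles through mutually exclusive groups of response variables. The base R sketch below shows only that partitioning. The rule for converting a fractional `mf.q` into a group size (rounding, with a minimum of one variable) is an assumption, and the forest fitting and convergence check are left as comments.

```r
# Sketch: partition variables into mutually exclusive response groups.
set.seed(1)
vars <- names(airquality)
p    <- length(vars)
mf.q <- 0.5                      # fraction of variables used as responses

# A fraction is converted to a count; an integer mf.q is used as is.
# The rounding rule below is an assumption for illustration.
q <- if (mf.q < 1) max(1, round(mf.q * p)) else mf.q

shuffled <- sample(vars)
groups   <- split(shuffled, ceiling(seq_along(shuffled) / q))

for (responses in groups) {
  predictors <- setdiff(vars, responses)
  # A multivariate forest would be grown here with `responses` as outcomes
  # and `predictors` as x-variables; missing responses are then predicted.
}
# The whole pass is repeated until the stopping rule set by
# max.iter and eps is satisfied.
```

With `mf.q = 1` every group holds a single variable, which corresponds to imputing one variable at a time.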
Prior to imputation, the data is processed and records with all values missing are removed, as are variables having all missing values.
If there is no missing data, either before or after processing of the data, the algorithm returns the processed data and no imputation is performed.
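
The pre-processing step can be mimicked in base R. This is a sketch of the described behavior, not the package's code, and `drop_all_missing` is a hypothetical helper name.

```r
# Sketch: drop records and variables that are entirely missing.
drop_all_missing <- function(data) {
  na <- is.na(data)
  keep.rows <- rowSums(na) < ncol(data)   # rows with at least one value
  keep.cols <- colSums(na) < nrow(data)   # columns with at least one value
  data[keep.rows, keep.cols, drop = FALSE]
}

d <- data.frame(a = c(1, NA, 3),
                b = c(NA, NA, NA),
                c = c("x", NA, "z"))
cleaned <- drop_all_missing(d)

# Row 2 and column b are removed. Nothing is missing in what remains,
# so in this situation no imputation would be performed.
anyNA(cleaned)
```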
The default nimpute=2 is chosen for coherence with the default missing data algorithm implemented in grow mode. Thus, if the user imputes data with nimpute=2 and runs a grow forest using this imputed data, then performance values such as VIMP and error rates will coincide with those obtained by running a grow forest on the original non-imputed data using na.action = "na.impute". The nimpute option is ignored for multivariate missForest.
All options are the same as in rfsrc, and the user should consult the rfsrc help file for details.

Value

Invisibly, the data frame containing the original data with imputed data overlaid.
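
Because the result is returned invisibly, nothing is printed when the function is called at the console without assignment. The base R sketch below shows the general behavior of `invisible()` using a toy function `f`, which is purely illustrative.

```r
# Toy function that returns its result invisibly
f <- function(x) invisible(x * 2)

f(2)                     # nothing is printed
y <- f(2)                # the value is still returned and can be assigned
y                        # prints the stored value
withVisible(f(2))$visible   # reports whether the result would auto-print
```

In practice this means the imputed data frame should be assigned to a name, as the examples in this help page do.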

Author(s)

Hemant Ishwaran and Udaya B. Kogalur

References

Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.

Stekhoven D.J. and Buhlmann P. (2012). MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112-118.

Tang F. and Ishwaran H. (2015). Random forest missing data algorithms.

Examples

## Not run: 
## ------------------------------------------------------------
## example of survival imputation
## ------------------------------------------------------------

#imputation using outcome splitting
data(pbc, package = "randomForestSRC")
pbc.d <- impute.rfsrc(Surv(days, status) ~ ., data = pbc, nsplit = 3)

#when no formula is given we default to unsupervised splitting
pbc2.d <- impute.rfsrc(data = pbc, nodesize = 1, nsplit = 10, nimpute = 5)

#random splitting can be reasonably good
pbc3.d <- impute.rfsrc(Surv(days, status) ~ ., data = pbc,
          splitrule = "random", nodesize = 1, nimpute = 5)

## ------------------------------------------------------------
## example of regression imputation
## ------------------------------------------------------------

air.d <- impute.rfsrc(Ozone ~ ., data = airquality, nimpute = 5)
air2.d <- impute.rfsrc(data = airquality, nimpute = 5, nodesize = 1)
air3.d <- impute.rfsrc(Ozone ~ ., data = airquality, nimpute = 5,
           splitrule = "random", nodesize = 1)

## ------------------------------------------------------------
## multivariate missForest imputation
## ------------------------------------------------------------

data(pbc, package = "randomForestSRC")

## use 10 percent of variables as responses
## i.e. multivariate missForest
pbc.d <- impute.rfsrc(data = pbc, mf.q = .1, nodesize = 1)

## use 1 variable as the response
## i.e. original missForest algorithm
pbc.d <- impute.rfsrc(data = pbc, mf.q = 1, nodesize = 1)

## End(Not run)