Last data update: 2014.03.03

R: Impute categorical variables using Random Forest within MICE
mice.impute.rfcat    R Documentation

Impute categorical variables using Random Forest within MICE

Description

This method can be used to impute factor variables (binary or with more than two levels) in MICE by specifying method = 'rfcat'. It was developed independently of the mice.impute.rf algorithm of Doove et al. and differs from it in some respects.

Usage

mice.impute.rfcat(y, ry, x, ntree_cat = NULL,
    nodesize_cat = NULL, maxnodes_cat = NULL, ntree = NULL, ...)

Arguments

y

a factor vector of observed values and missing values of the variable to be imputed. y must be a factor even if it has only 2 levels; it cannot be logical.

ry

a logical vector stating whether y is observed or not.

x

a matrix of predictors to impute y.

ntree_cat

number of trees, default = 10.

A global option can be set thus: setRFoptions(ntree_cat=10).

nodesize_cat

minimum size of nodes, default = 1.

A global option can be set thus: setRFoptions(nodesize_cat=1). Smaller values of nodesize_cat create finer, more precise trees but increase the computation time.

maxnodes_cat

maximum number of nodes, default NULL. If NULL, the number of nodes is determined by the number of observations and nodesize_cat.

ntree

an alternative argument for specifying the number of trees, overridden by ntree_cat if both are supplied. This is for consistency with the mice.impute.rf function.

...

other arguments to pass to randomForest.
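
The precedence among ntree_cat, ntree and the global option can be pictured with a minimal sketch in base R. The helper name resolve_ntree is hypothetical and not part of the package. The option name CALIBERrfimpute_ntree_cat is the one reported by setRFoptions when it sets the option; the fallback order shown (ntree_cat, then ntree, then the global option, then the documented default of 10) is an assumption for illustration, not a transcription of the package source.

```r
# Hypothetical helper, not part of CALIBERrfimpute: illustrates the
# documented precedence for the number of trees.
resolve_ntree <- function(ntree_cat = NULL, ntree = NULL) {
    # ntree is only consulted when ntree_cat is not supplied
    if (is.null(ntree_cat) && !is.null(ntree)) {
        ntree_cat <- ntree
    }
    # Assumed fallback: the global option, then the documented default of 10
    if (is.null(ntree_cat)) {
        ntree_cat <- getOption('CALIBERrfimpute_ntree_cat', default = 10)
    }
    ntree_cat
}

resolve_ntree(ntree_cat = 5, ntree = 50)  # ntree_cat takes precedence
resolve_ntree(ntree = 50)                 # ntree used as the alternative
resolve_ntree()                           # global option, or 10 if unset
```

In a mice call the same arguments would normally be passed through mice itself, or set once for the session with setRFoptions.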

Details

This Random Forest imputation algorithm has been developed as an alternative to logistic or polytomous regression, and can accommodate non-linear relations and interactions among the predictor variables without requiring them to be specified in the model. The algorithm takes a bootstrap sample of the data to simulate sampling variability, fits a set of classification trees, and chooses each imputed value as the prediction of a randomly chosen tree.
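
The three steps described above (bootstrap, fit trees, draw from a randomly chosen tree) can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name sketch_rfcat is hypothetical, and it fits a single forest with randomForest and reads per-tree predictions via predict(..., predict.all = TRUE), which may differ from how mice.impute.rfcat builds its trees internally.

```r
library(randomForest)

# Hypothetical illustration only; not the code used by mice.impute.rfcat.
sketch_rfcat <- function(y, ry, x, ntree = 10) {
    x <- as.matrix(x)
    # 1. Bootstrap the observed rows to simulate sampling variability
    boot <- sample(sum(ry), replace = TRUE)
    xobs <- x[ry, , drop = FALSE][boot, , drop = FALSE]
    yobs <- droplevels(y[ry][boot])  # randomForest rejects empty classes
    xmis <- x[!ry, , drop = FALSE]
    # Edge case: a single class leaves nothing for a tree to split on
    if (nlevels(yobs) == 1) {
        return(factor(rep(levels(yobs), nrow(xmis)), levels = levels(y)))
    }
    # 2. Fit a set of classification trees
    forest <- randomForest(xobs, yobs, ntree = ntree)
    # 3. Collect every tree's prediction, then pick one tree per missing row
    votes <- predict(forest, xmis, predict.all = TRUE)$individual
    chosen <- sample(ntree, nrow(xmis), replace = TRUE)
    imputed <- votes[cbind(seq_len(nrow(xmis)), chosen)]
    # Return a factor with the original levels of y
    factor(imputed, levels = levels(y))
}

# Usage on made-up data: 50 observed and 10 missing values of y
set.seed(1)
x <- cbind(x1 = rnorm(60), x2 = rnorm(60))
y <- factor(ifelse(x[, 'x1'] + rnorm(60) > 0, 'yes', 'no'))
ry <- rep(c(TRUE, FALSE), c(50, 10))
sketch_rfcat(y, ry, x)
```

Drawing from one randomly chosen tree, rather than taking the majority vote of the forest, is what lets the imputed values reflect uncertainty in the prediction.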

Value

A vector of imputed values of y.

Note

This algorithm has been tested on simulated data and in survival analysis of real data with artificially introduced missingness completely at random. There was slight bias in hazard ratios compared to polytomous regression, but coverage of confidence intervals was correct.

Author(s)

Anoop Shah

References

Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312

See Also

setRFoptions, mice.impute.rfcont, mice, mice.impute.rf, mice.impute.cart, randomForest

Examples

library(CALIBERrfimpute)  # also loads mice
set.seed(1)

# A small dataset with a single row to be imputed
mydata <- data.frame(x1 = as.factor(c('this', 'this', NA, 'that')),
	x2 = 1:4, x3 = as.factor(c('other', 'another', NA, 'another')))
mice(mydata, method = c('logreg', 'norm', 'logreg'), m = 2, maxit = 2)
mice(mydata[, 1:2], method = c('rfcat', 'rfcont'), m = 2, maxit = 2)
mice(mydata, method = c('rfcat', 'rfcont', 'rfcat'), m = 2, maxit = 2)

# A larger simulated dataset
mydata <- simdata(100, x2binary = TRUE)
mymardata <- makemar(mydata)

cat('\nNumber of missing values:\n')
print(sapply(mymardata, function(x){sum(is.na(x))}))

# Test imputation of a single column in a two-column dataset
cat('\nTest imputation of a simple dataset\n')
print(mice(mymardata[, c('y', 'x2')], method = 'rfcat'))

# Analyse data
cat('\nFull data analysis:\n')
print(summary(lm(y ~ x1 + x2 + x3, data = mydata)))

cat('\nMICE normal and logistic:\n')
print(summary(pool(with(mice(mymardata,
    method = c('', 'norm', 'logreg', '', '')), lm(y ~ x1 + x2 + x3)))))

# Set options for Random Forest
setRFoptions(ntree_cat = 10)

cat('\nMICE using Random Forest:\n')
print(summary(pool(with(mice(mymardata,
    method = c('', 'rfcont', 'rfcat', '', '')), lm(y ~ x1 + x2 + x3)))))

cat('\nDataset with unobserved levels of a factor\n')
data3 <- data.frame(x1 = 1:100, x2 = factor(c(rep('A', 25),
    rep('B', 25), rep('C', 25), rep('D', 25))))
data3$x2[data3$x2 == 'D'] <- NA
mice(data3, method = c('', 'rfcat'))
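
Because y must be a factor and cannot be logical, a logical column needs converting before method = 'rfcat' is applied to it. The following is a minimal sketch in base R; the data frame dat and its column names are made up for illustration.

```r
# Hypothetical data: a logical column with one missing value
dat <- data.frame(smoker = c(TRUE, FALSE, NA, TRUE, FALSE),
    age = c(34, 51, 47, 29, 62))

# Convert logical to a two-level factor; NA stays NA
dat$smoker <- factor(dat$smoker, levels = c(FALSE, TRUE),
    labels = c('no', 'yes'))

str(dat)
# The converted column can then be given method 'rfcat' in a mice call,
# in the same way as the factor columns in the examples above.
```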

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(CALIBERrfimpute)
Loading required package: mice
Loading required package: Rcpp
mice 2.25 2015-11-09
> ### Name: mice.impute.rfcat
> ### Title: Impute categorical variables using Random Forest within MICE
> ### Aliases: mice.impute.rfcat
> 
> ### ** Examples
> 
> set.seed(1)
> 
> # A small dataset with a single row to be imputed
> mydata <- data.frame(x1 = as.factor(c('this', 'this', NA, 'that')),
+ 	x2 = 1:4, x3 = as.factor(c('other', 'another', NA, 'another')))
> mice(mydata, method = c('logreg', 'norm', 'logreg'), m = 2, maxit = 2)

 iter imp variable
  1   1  x1  x3
  1   2  x1  x3
  2   1  x1  x3
  2   2  x1  x3
Multiply imputed data set
Call:
mice(data = mydata, m = 2, method = c("logreg", "norm", "logreg"), 
    maxit = 2)
Number of multiple imputations:  2
Missing cells per column:
x1 x2 x3 
 1  0  1 
Imputation methods:
      x1       x2       x3 
"logreg"   "norm" "logreg" 
VisitSequence:
x1 x3 
 1  3 
PredictorMatrix:
   x1 x2 x3
x1  0  1  1
x2  0  0  0
x3  1  1  0
Random generator seed value:  NA 
> mice(mydata[, 1:2], method = c('rfcat', 'rfcont'), m = 2, maxit = 2)

 iter imp variable
  1   1  x1
  1   2  x1
  2   1  x1
  2   2  x1
Multiply imputed data set
Call:
mice(data = mydata[, 1:2], m = 2, method = c("rfcat", "rfcont"), 
    maxit = 2)
Number of multiple imputations:  2
Missing cells per column:
x1 x2 
 1  0 
Imputation methods:
      x1       x2 
 "rfcat" "rfcont" 
VisitSequence:
x1 
 1 
PredictorMatrix:
   x1 x2
x1  0  1
x2  0  0
Random generator seed value:  NA 
> mice(mydata, method = c('rfcat', 'rfcont', 'rfcat'), m = 2, maxit = 2)

 iter imp variable
  1   1  x1  x3
  1   2  x1  x3
  2   1  x1  x3
  2   2  x1  x3
Multiply imputed data set
Call:
mice(data = mydata, m = 2, method = c("rfcat", "rfcont", "rfcat"), 
    maxit = 2)
Number of multiple imputations:  2
Missing cells per column:
x1 x2 x3 
 1  0  1 
Imputation methods:
      x1       x2       x3 
 "rfcat" "rfcont"  "rfcat" 
VisitSequence:
x1 x3 
 1  3 
PredictorMatrix:
   x1 x2 x3
x1  0  1  1
x2  0  0  0
x3  1  1  0
Random generator seed value:  NA 
> 
> # A larger simulated dataset
> mydata <- simdata(100, x2binary = TRUE)
> mymardata <- makemar(mydata)
> 
> cat('\nNumber of missing values:\n')

Number of missing values:
> print(sapply(mymardata, function(x){sum(is.na(x))}))
 y x1 x2 x3 x4 
 0 22 23  0  0 
> 
> # Test imputation of a single column in a two-column dataset
> cat('\nTest imputation of a simple dataset')

Test imputation of a simple dataset
> print(mice(mymardata[, c('y', 'x2')], method = 'rfcat'))

 iter imp variable
  1   1  x2
  1   2  x2
  1   3  x2
  1   4  x2
  1   5  x2
  2   1  x2
  2   2  x2
  2   3  x2
  2   4  x2
  2   5  x2
  3   1  x2
  3   2  x2
  3   3  x2
  3   4  x2
  3   5  x2
  4   1  x2
  4   2  x2
  4   3  x2
  4   4  x2
  4   5  x2
  5   1  x2
  5   2  x2
  5   3  x2
  5   4  x2
  5   5  x2
Multiply imputed data set
Call:
mice(data = mymardata[, c("y", "x2")], method = "rfcat")
Number of multiple imputations:  5
Missing cells per column:
 y x2 
 0 23 
Imputation methods:
      y      x2 
"rfcat" "rfcat" 
VisitSequence:
x2 
 2 
PredictorMatrix:
   y x2
y  0  0
x2 1  0
Random generator seed value:  NA 
> 
> # Analyse data
> cat('\nFull data analysis:\n')

Full data analysis:
> print(summary(lm(y ~ x1 + x2 + x3, data = mydata)))

Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.35823 -0.73137 -0.01909  0.62658  2.90390 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.26013    0.15625  -1.665   0.0992 .  
x1           0.89274    0.09935   8.986 2.24e-14 ***
x22          1.20618    0.20619   5.850 6.80e-08 ***
x3           0.95319    0.11194   8.515 2.29e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.014 on 96 degrees of freedom
Multiple R-squared:  0.6842,	Adjusted R-squared:  0.6743 
F-statistic: 69.31 on 3 and 96 DF,  p-value: < 2.2e-16

> 
> cat('\nMICE normal and logistic:\n')

MICE normal and logistic:
> print(summary(pool(with(mice(mymardata,
+     method = c('', 'norm', 'logreg', '', '')), lm(y ~ x1 + x2 + x3)))))

 iter imp variable
  1   1  x1  x2
  1   2  x1  x2
  1   3  x1  x2
  1   4  x1  x2
  1   5  x1  x2
  2   1  x1  x2
  2   2  x1  x2
  2   3  x1  x2
  2   4  x1  x2
  2   5  x1  x2
  3   1  x1  x2
  3   2  x1  x2
  3   3  x1  x2
  3   4  x1  x2
  3   5  x1  x2
  4   1  x1  x2
  4   2  x1  x2
  4   3  x1  x2
  4   4  x1  x2
  4   5  x1  x2
  5   1  x1  x2
  5   2  x1  x2
  5   3  x1  x2
  5   4  x1  x2
  5   5  x1  x2
                   est         se         t       df     Pr(>|t|)      lo 95
(Intercept) -0.3385370 0.18479858 -1.831924 26.61783 7.817569e-02 -0.7179672
x1           0.8721132 0.09690819  8.999376 75.40334 1.427747e-13  0.6790791
x22          1.2964700 0.23136072  5.603674 34.13948 2.783546e-06  0.8263592
x3           0.9792573 0.12302055  7.960111 38.19214 1.240457e-09  0.7302563
                 hi 95 nmis       fmi     lambda
(Intercept) 0.04089331   NA 0.3466220 0.29930628
x1          1.06514731   22 0.1054159 0.08199849
x22         1.76658078   NA 0.2870209 0.24644099
x3          1.22825824    0 0.2612296 0.22352968
> 
> # Set options for Random Forest
> setRFoptions(ntree_cat = 10)
Setting option CALIBERrfimpute_ntree_cat = 10
> 
> cat('\nMICE using Random Forest:\n')

MICE using Random Forest:
> print(summary(pool(with(mice(mymardata,
+     method = c('', 'rfcont', 'rfcat', '', '')), lm(y ~ x1 + x2 + x3)))))

 iter imp variable
  1   1  x1  x2
  1   2  x1  x2
  1   3  x1  x2
  1   4  x1  x2
  1   5  x1  x2
  2   1  x1  x2
  2   2  x1  x2
  2   3  x1  x2
  2   4  x1  x2
  2   5  x1  x2
  3   1  x1  x2
  3   2  x1  x2
  3   3  x1  x2
  3   4  x1  x2
  3   5  x1  x2
  4   1  x1  x2
  4   2  x1  x2
  4   3  x1  x2
  4   4  x1  x2
  4   5  x1  x2
  5   1  x1  x2
  5   2  x1  x2
  5   3  x1  x2
  5   4  x1  x2
  5   5  x1  x2
                     est        se            t       df     Pr(>|t|)
(Intercept) -0.001791581 0.2533471 -0.007071647 11.11182 9.944831e-01
x1           0.870611687 0.1372274  6.344297108 42.91905 1.162296e-07
x22          0.932141942 0.3512698  2.653635316 10.82735 2.271206e-02
x3           1.032758675 0.1517895  6.803886293 34.85743 6.995807e-08
                 lo 95     hi 95 nmis       fmi    lambda
(Intercept) -0.5587210 0.5551378   NA 0.5887155 0.5208009
x1           0.5938511 1.1473723   22 0.2349290 0.2000889
x22          0.1574956 1.7067883   NA 0.5966501 0.5284438
x3           0.7245645 1.3409529    0 0.2821919 0.2421551
> 
> cat('\nDataset with unobserved levels of a factor\n')

Dataset with unobserved levels of a factor
> data3 <- data.frame(x1 = 1:100, x2 = factor(c(rep('A', 25),
+     rep('B', 25), rep('C', 25), rep('D', 25))))
> data3$x2[data3$x2 == 'D'] <- NA
> mice(data3, method = c('', 'rfcat'))

 iter imp variable
  1   1  x2
  1   2  x2
  1   3  x2
  1   4  x2
  1   5  x2
  2   1  x2
  2   2  x2
  2   3  x2
  2   4  x2
  2   5  x2
  3   1  x2
  3   2  x2
  3   3  x2
  3   4  x2
  3   5  x2
  4   1  x2
  4   2  x2
  4   3  x2
  4   4  x2
  4   5  x2
  5   1  x2
  5   2  x2
  5   3  x2
  5   4  x2
  5   5  x2
Multiply imputed data set
Call:
mice(data = data3, method = c("", "rfcat"))
Number of multiple imputations:  5
Missing cells per column:
x1 x2 
 0 25 
Imputation methods:
     x1      x2 
     "" "rfcat" 
VisitSequence:
x2 
 2 
PredictorMatrix:
   x1 x2
x1  0  0
x2  1  0
Random generator seed value:  NA 