Last data update: 2014.03.03
mice.impute.rfcont R Documentation
Impute continuous variables using Random Forest within MICE
Description
This method can be used to impute continuous variables in MICE by specifying
method = 'rfcont'. It was developed independently of the
mice.impute.rf
algorithm of Doove et al.,
and differs from it in that imputed values are drawn from a normal distribution.
Usage
mice.impute.rfcont(y, ry, x, ntree_cont = NULL,
nodesize_cont = NULL, maxnodes_cont = NULL, ntree = NULL, ...)
Arguments
y
a vector of observed values and missing values of the variable to be imputed.
ry
a logical vector indicating which values of y are observed (TRUE) or missing (FALSE).
x
a matrix of predictors to impute y.
ntree_cont
number of trees in the forest, default = 10.
A global option can be set with setRFoptions(ntree_cont=10).
nodesize_cont
minimum size of terminal nodes, default = 5.
A global option can be set with setRFoptions(nodesize_cont=5).
Smaller values of nodesize create finer, more precise trees but increase the computation time.
maxnodes_cont
maximum number of nodes, default NULL. If NULL, the number of nodes is determined by the number of observations and nodesize_cont.
ntree
an alternative argument for specifying the number of trees, overridden by ntree_cont. This argument is provided for consistency with the mice.impute.rf function.
...
other arguments to pass to randomForest.
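In normal use this function is invoked internally by mice() via method = 'rfcont', but it can also be called directly with arguments of the types described above. A minimal sketch of a direct call, using made-up data (the values of y, ry and x here are purely illustrative):

```r
library(CALIBERrfimpute)

set.seed(1)
x <- cbind(x1 = rnorm(20), x2 = rnorm(20))   # predictor matrix
y <- x[, 1] + rnorm(20)                      # variable to be imputed
ry <- rep(TRUE, 20)
ry[1:5] <- FALSE                             # treat the first 5 values as missing
y[!ry] <- NA

# Returns one imputed value for each missing position of y
imp <- mice.impute.rfcont(y, ry, x, ntree_cont = 10)
```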
Details
This Random Forest imputation algorithm was developed as an alternative to normal-based
linear regression, and can accommodate non-linear relations and interactions among the
predictor variables without requiring them to be specified in the model. The algorithm takes
a bootstrap sample of the data to simulate sampling variability, fits a regression forest
to the bootstrap sample and calculates the out-of-bag mean squared error. Each value is imputed
as a random draw from a normal distribution with mean defined by the Random Forest prediction
and variance equal to the out-of-bag mean squared error.
If only one tree is used (not recommended), a bootstrap sample is not taken in the first stage
because the Random Forest algorithm performs an internal bootstrap sample before fitting the tree.
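The two-stage procedure described above can be sketched using the randomForest package directly. This is an illustrative re-implementation under stated assumptions, not the package's actual code, and the variable names are made up:

```r
library(randomForest)

set.seed(2)
n <- 100
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- x$x1 + x$x2 + rnorm(n)
obs  <- 21:n                  # cases with y observed
miss <- 1:20                  # cases with y missing, to be imputed

# Stage 1: bootstrap the observed cases to simulate sampling variability,
# then fit a regression forest to the bootstrap sample
boot <- sample(obs, length(obs), replace = TRUE)
fit  <- randomForest(x[boot, ], y[boot], ntree = 10, nodesize = 5)

# Stage 2: impute each missing value as a draw from a normal distribution
# centred on the forest prediction, with variance equal to the
# out-of-bag mean squared error of the fitted forest
sigma   <- sqrt(fit$mse[fit$ntree])
imputed <- rnorm(length(miss),
                 mean = predict(fit, x[miss, ]),
                 sd   = sigma)
```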
Value
A vector of imputed values of y.
Note
This algorithm has been tested on simulated data with linear regression,
and in survival analysis of real data with artificially introduced missingness at random.
On the simulated data there was slight bias if the distribution of missing values was
very different from that of the observed values, because imputed values were closer to the
centre of the data than the true missing values. However, in the survival analysis the hazard
ratios were unbiased, coverage of confidence intervals was more conservative than with
normal-based MICE, and the mean confidence interval length was shorter with mice.impute.rfcont.
Author(s)
Anoop Shah
References
Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312
See Also
setRFoptions, mice.impute.rfcat, mice, mice.impute.rf, mice.impute.cart, randomForest
Examples
set.seed(1)
# A small dataset with a single row to be imputed
mydata <- data.frame(x1 = c(2, 2, NA, 4), x2 = 1:4, x3 = c(1, 3, NA, 3))
mice(mydata, method = c('norm', 'norm', 'norm'), m = 2, maxit = 2)
mice(mydata[, 1:2], method = c('rfcont', 'rfcont'), m = 2, maxit = 2)
mice(mydata, method = c('rfcont', 'rfcont', 'rfcont'), m = 2, maxit = 2)
# A larger simulated dataset
mydata <- simdata(100)
cat('\nSimulated multivariate normal data:\n')
print(data.frame(mean = colMeans(mydata), sd = sapply(mydata, sd)))
# Apply missingness pattern
mymardata <- makemar(mydata)
cat('\nNumber of missing values:\n')
print(sapply(mymardata, function(x){sum(is.na(x))}))
# Test imputation of a single column in a two-column dataset
cat('\nTest imputation of a simple dataset')
print(mice(mymardata[, c('y', 'x1')], method = 'rfcont'))
# Analyse data
cat('\nFull data analysis:\n')
print(summary(lm(y ~ x1 + x2 + x3, data=mydata)))
cat('\nMICE using normal-based linear regression:\n')
print(summary(pool(with(mice(mymardata,
method = 'norm'), lm(y ~ x1 + x2 + x3)))))
# Set options for Random Forest
setRFoptions(ntree_cont = 10)
cat('\nMICE using Random Forest:\n')
print(summary(pool(with(mice(mymardata,
method = 'rfcont'), lm(y ~ x1 + x2 + x3)))))
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(CALIBERrfimpute)
Loading required package: mice
Loading required package: Rcpp
mice 2.25 2015-11-09
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/CALIBERrfimpute/mice.impute.rfcont.Rd_%03d_medium.png", width=480, height=480)
> ### Name: mice.impute.rfcont
> ### Title: Impute continuous variables using Random Forest within MICE
> ### Aliases: mice.impute.rfcont
>
> ### ** Examples
>
> set.seed(1)
>
> # A small dataset with a single row to be imputed
> mydata <- data.frame(x1 = c(2, 2, NA, 4), x2 = 1:4, x3 = c(1, 3, NA, 3))
> mice(mydata, method = c('norm', 'norm', 'norm'), m = 2, maxit = 2)
iter imp variable
1 1 x1 x3
1 2 x1 x3
2 1 x1 x3
2 2 x1 x3
Multiply imputed data set
Call:
mice(data = mydata, m = 2, method = c("norm", "norm", "norm"),
maxit = 2)
Number of multiple imputations: 2
Missing cells per column:
x1 x2 x3
1 0 1
Imputation methods:
x1 x2 x3
"norm" "norm" "norm"
VisitSequence:
x1 x3
1 3
PredictorMatrix:
x1 x2 x3
x1 0 1 1
x2 0 0 0
x3 1 1 0
Random generator seed value: NA
> mice(mydata[, 1:2], method = c('rfcont', 'rfcont'), m = 2, maxit = 2)
iter imp variable
1 1 x1
1 2 x1
2 1 x1
2 2 x1
Multiply imputed data set
Call:
mice(data = mydata[, 1:2], m = 2, method = c("rfcont", "rfcont"),
maxit = 2)
Number of multiple imputations: 2
Missing cells per column:
x1 x2
1 0
Imputation methods:
x1 x2
"rfcont" "rfcont"
VisitSequence:
x1
1
PredictorMatrix:
x1 x2
x1 0 1
x2 0 0
Random generator seed value: NA
Warning messages:
1: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
2: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
3: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
4: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
> mice(mydata, method = c('rfcont', 'rfcont', 'rfcont'), m = 2, maxit = 2)
iter imp variable
1 1 x1 x3
1 2 x1 x3
2 1 x1 x3
2 2 x1 x3
Multiply imputed data set
Call:
mice(data = mydata, m = 2, method = c("rfcont", "rfcont", "rfcont"),
maxit = 2)
Number of multiple imputations: 2
Missing cells per column:
x1 x2 x3
1 0 1
Imputation methods:
x1 x2 x3
"rfcont" "rfcont" "rfcont"
VisitSequence:
x1 x3
1 3
PredictorMatrix:
x1 x2 x3
x1 0 1 1
x2 0 0 0
x3 1 1 0
Random generator seed value: NA
Warning messages:
1: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
2: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
3: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
4: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
5: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
6: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
7: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
8: In randomForest.default(xobs, yobs, ntree = ntree_cont, nodesize = nodesize_cont, :
The response has five or fewer unique values. Are you sure you want to do regression?
>
> # A larger simulated dataset
> mydata <- simdata(100)
> cat('\nSimulated multivariate normal data:\n')
Simulated multivariate normal data:
> print(data.frame(mean = colMeans(mydata), sd = sapply(mydata, sd)))
mean sd
y -0.210909405 2.3792123
x1 0.002390471 0.9391735
x2 -0.098139347 1.0883954
x3 -0.059185372 1.1246414
x4 -0.111959955 1.0404145
>
> # Apply missingness pattern
> mymardata <- makemar(mydata)
> cat('\nNumber of missing values:\n')
Number of missing values:
> print(sapply(mymardata, function(x){sum(is.na(x))}))
y x1 x2 x3 x4
0 23 19 0 0
>
> # Test imputation of a single column in a two-column dataset
> cat('\nTest imputation of a simple dataset')
Test imputation of a simple dataset> print(mice(mymardata[, c('y', 'x1')], method = 'rfcont'))
iter imp variable
1 1 x1
1 2 x1
1 3 x1
1 4 x1
1 5 x1
2 1 x1
2 2 x1
2 3 x1
2 4 x1
2 5 x1
3 1 x1
3 2 x1
3 3 x1
3 4 x1
3 5 x1
4 1 x1
4 2 x1
4 3 x1
4 4 x1
4 5 x1
5 1 x1
5 2 x1
5 3 x1
5 4 x1
5 5 x1
Multiply imputed data set
Call:
mice(data = mymardata[, c("y", "x1")], method = "rfcont")
Number of multiple imputations: 5
Missing cells per column:
y x1
0 23
Imputation methods:
y x1
"rfcont" "rfcont"
VisitSequence:
x1
2
PredictorMatrix:
y x1
y 0 0
x1 1 0
Random generator seed value: NA
>
> # Analyse data
> cat('\nFull data analysis:\n')
Full data analysis:
> print(summary(lm(y ~ x1 + x2 + x3, data=mydata)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = mydata)
Residuals:
Min 1Q Median 3Q Max
-2.4955 -0.8516 0.1293 0.8173 2.2906
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06803 0.10761 -0.632 0.529
x1 0.98151 0.12224 8.030 2.46e-12 ***
x2 0.82731 0.11066 7.476 3.59e-11 ***
x3 1.08189 0.10072 10.742 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.071 on 96 degrees of freedom
Multiple R-squared: 0.8036, Adjusted R-squared: 0.7975
F-statistic: 131 on 3 and 96 DF, p-value: < 2.2e-16
>
> cat('\nMICE using normal-based linear regression:\n')
MICE using normal-based linear regression:
> print(summary(pool(with(mice(mymardata,
+ method = 'norm'), lm(y ~ x1 + x2 + x3)))))
iter imp variable
1 1 x1 x2
1 2 x1 x2
1 3 x1 x2
1 4 x1 x2
1 5 x1 x2
2 1 x1 x2
2 2 x1 x2
2 3 x1 x2
2 4 x1 x2
2 5 x1 x2
3 1 x1 x2
3 2 x1 x2
3 3 x1 x2
3 4 x1 x2
3 5 x1 x2
4 1 x1 x2
4 2 x1 x2
4 3 x1 x2
4 4 x1 x2
4 5 x1 x2
5 1 x1 x2
5 2 x1 x2
5 3 x1 x2
5 4 x1 x2
5 5 x1 x2
est se t df Pr(>|t|) lo 95
(Intercept) -0.02577896 0.1408735 -0.1829937 20.97589 8.565602e-01 -0.3187620
x1 1.02462270 0.1564965 6.5472581 32.84183 1.989256e-07 0.7061700
x2 0.77518439 0.1283210 6.0409781 29.43683 1.340919e-06 0.5129075
x3 1.14342099 0.1414431 8.0839618 15.31484 6.560078e-07 0.8424810
hi 95 nmis fmi lambda
(Intercept) 0.267204 NA 0.4073775 0.3534436
x1 1.343075 23 0.2960710 0.2544698
x2 1.037461 19 0.3220758 0.2775295
x3 1.444361 0 0.4937499 0.4316898
>
> # Set options for Random Forest
> setRFoptions(ntree_cont = 10)
Setting option CALIBERrfimpute_ntree_cont = 10
>
> cat('\nMICE using Random Forest:\n')
MICE using Random Forest:
> print(summary(pool(with(mice(mymardata,
+ method = 'rfcont'), lm(y ~ x1 + x2 + x3)))))
iter imp variable
1 1 x1 x2
1 2 x1 x2
1 3 x1 x2
1 4 x1 x2
1 5 x1 x2
2 1 x1 x2
2 2 x1 x2
2 3 x1 x2
2 4 x1 x2
2 5 x1 x2
3 1 x1 x2
3 2 x1 x2
3 3 x1 x2
3 4 x1 x2
3 5 x1 x2
4 1 x1 x2
4 2 x1 x2
4 3 x1 x2
4 4 x1 x2
4 5 x1 x2
5 1 x1 x2
5 2 x1 x2
5 3 x1 x2
5 4 x1 x2
5 5 x1 x2
est se t df Pr(>|t|) lo 95
(Intercept) 0.09767866 0.1387662 0.7039083 74.45803 4.836855e-01 -0.1787907
x1 0.97895961 0.1814181 5.3961527 52.22214 1.680685e-06 0.6149545
x2 0.72689111 0.1453314 5.0016120 53.14369 6.569268e-06 0.4354117
x3 1.26467777 0.1243316 10.1718116 75.91030 8.881784e-16 1.0170452
hi 95 nmis fmi lambda
(Intercept) 0.3741481 NA 0.1087152 0.08509184
x1 1.3429647 23 0.1913576 0.16097019
x2 1.0183705 19 0.1874755 0.15746184
x3 1.5123104 0 0.1036421 0.08033290
>
>
>
>
>
> dev.off()
null device
1
>