aregImpute
R Documentation
Multiple Imputation using Additive Regression, Bootstrapping, and
Predictive Mean Matching
Description
The transcan function creates flexible additive imputation models
but provides only an approximation to true multiple imputation as the
imputation models are fixed before all multiple imputations are
drawn. This ignores variability caused by having to fit the
imputation models. aregImpute takes all aspects of uncertainty in
the imputations into account by using the bootstrap to approximate the
process of drawing predicted values from a full Bayesian predictive
distribution. Different bootstrap resamples are used for each of the
multiple imputations, i.e., for the ith imputation of a sometimes
missing variable, i=1,2,... n.impute, a flexible additive
model is fitted on a sample with replacement from the original data and
this model is used to predict all of the original missing and
non-missing values for the target variable.
areg is used to fit the imputation models. By default, linearity
is assumed for target variables (variables being imputed) and
nk=3 knots are assumed for continuous predictors transformed
using restricted cubic splines. If nk is three or greater and
tlinear is set to FALSE, areg
simultaneously finds transformations of the target variable and of all
of the predictors, to get a good fit assuming additivity, maximizing
R^2, using the same canonical correlation method as
transcan. Flexible transformations may be overridden for
specific variables by specifying the identity transformation for them.
When a categorical variable is being predicted, the flexible
transformation is Fisher's optimum scoring method. Nonlinear
transformations for continuous variables may be nonmonotonic. If
nk is a vector, areg's bootstrap and crossval=10
options will be used to help find the optimum validating value of
nk over values of that vector, at the last imputation iteration.
For the imputations, the minimum value of nk is used.
Instead of defaulting to taking random draws from fitted imputation
models using random residuals as is done by transcan,
aregImpute by default uses predictive mean matching with optional
weighted probability sampling of donors rather than using only the
closest match. Predictive mean matching works for binary, categorical,
and continuous variables without the need for iterative maximum
likelihood fitting for binary and categorical variables, and without the
need for computing residuals or for curtailing imputed values to be in
the range of actual data. Predictive mean matching is especially
attractive when the variable being imputed is also being transformed
automatically. See Details below for more information about the
algorithm. A "regression" method is also available that is
similar to that used in transcan. This option should be used
when mechanistic missingness requires the use of extrapolation during
imputation.
A print method summarizes the results, and a plot method plots
distributions of imputed values. Typically, fit.mult.impute will
be called after aregImpute.
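A minimal sketch of that typical workflow (the simulated data and the
variable names x, z, y here are illustrative, not from the examples
below):

```r
## Hypothetical end-to-end workflow: impute with aregImpute, then fit
## the analysis model across imputations with fit.mult.impute
library(Hmisc)
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- d$x + d$z + rnorm(100)
d$x[1:20] <- NA                       # introduce missingness in x
imp <- aregImpute(~ y + x + z, data = d, n.impute = 5, nk = 0)
fit <- fit.mult.impute(y ~ x + z, lm, imp, data = d)
sqrt(diag(vcov(fit)))                 # SEs reflect imputation uncertainty
```

fit.mult.impute applies Rubin's rules, so the variance-covariance
matrix of the final fit accounts for between-imputation variability.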
If a target variable is transformed nonlinearly (i.e., if nk is
greater than zero and tlinear is set to FALSE) and the
estimated target variable transformation is non-monotonic, imputed
values are not unique. When type='regression', a random choice
of possible inverse values is made.
The reformM function provides two ways of recreating a formula to
give to aregImpute by reordering the variables in the formula.
This is a modified version of a function written by Yong Hao Pua. One
can specify nperm to obtain a list of nperm randomly
permuted variables. The list is converted to a single ordinary formula
if nperm=1. If nperm is omitted, variables are sorted in
descending order of the number of NAs. reformM also
prints a recommended number of multiple imputations to use, which is
the maximum of 5 and the percent of incomplete observations.
Arguments
formula
an S model formula. You can specify restrictions for transformations
of variables. The function automatically determines which variables
are categorical (i.e., factor, category, or character vectors).
Binary variables are automatically restricted to be linear. Force
linear transformations of continuous variables by enclosing variables
by the identity function (I()). It is recommended that
factor() or as.factor() do not appear in the formula but
instead variables be converted to factors as needed and stored in the
data frame. That way imputations for factor variables (done using
impute.transcan for example) will be correct. Currently
reformM does not handle variables that are enclosed in functions
such as I().
x
an object created by aregImpute. For aregImpute, set
x to TRUE to save the data matrix containing the final (number
n.impute) imputations in the result. This
is needed if you want to later do out-of-sample imputation.
Categorical variables are coded as integers in this matrix.
data
input raw data
subset
These may also be specified. You may not specify na.action as
na.retain is always used.
n.impute
number of multiple imputations. n.impute=5 is frequently
recommended but 10 or more doesn't hurt.
group
a character or factor variable the same length as the
number of observations in data and containing no NAs.
When group is present, causes a bootstrap sample of the
observations corresponding to non-NAs of a target variable to
have the same frequency distribution of group as
that in the non-NAs of the original sample. This can handle
k-sample problems as well as lower the chance that a bootstrap sample
will have a missing cell when the original cell frequency was low.
nk
number of knots to use for continuous variables. When optimum
transformations are being estimated for both the target variable and
the predictors, there is more instability than with ordinary
regression, so the complexity of the model should decrease more sharply
as the sample size decreases. Hence set nk to 0 (to force
linearity for non-categorical variables) or 3 (minimum number of knots
possible with a linear tail-restricted cubic spline) for small sample
sizes. Simulated problems as in the examples section can assist in
choosing nk. Set nk to a vector to get bootstrap-validated
and 10-fold cross-validated R^2 and mean and median absolute
prediction errors for imputing each sometimes-missing variable, with
nk ranging over the given vector. The errors are on the
original untransformed scale. The mean absolute error is the
recommended basis for choosing the number of knots (or linearity).
tlinear
set to FALSE to allow a target variable (variable
being imputed) to have a nonlinear left-hand-side transformation when
nk is 3 or greater
type
The default is "pmm" for predictive mean matching,
which is a more nonparametric approach that will work for categorical
as well as continuous predictors. Alternatively, use
"regression" when all variables that are sometimes missing are
continuous and the missingness mechanism is such that entire intervals
of population values are unobserved. See the Details section for more
information. Another method, type="normpmm", only works
when variables containing NAs are continuous and tlinear
is TRUE (the default), meaning that the variable being imputed
is not transformed when it is on the left hand model side.
normpmm assumes that the imputation regression parameter
estimates are multivariately normally distributed and that the
residual variance has a scaled chi-squared distribution. For each
imputation a random draw of the estimates is taken and a random draw
from sigma is combined with those to get a random draw from the
posterior predicted value distribution. Predictive mean matching is
then done matching these predicted values from incomplete observations
with predicted values from complete potential donor observations,
where the latter predictions are based on the imputation model least
squares parameter estimates and not on random draws from the posterior.
For the plot method, specify type="hist"
to draw histograms of imputed values with rug plots at the top, or
type="ecdf" (the default) to draw empirical CDFs with spike
histograms at the bottom.
pmmtype
type of matching to be used for predictive mean
matching when type="pmm". pmmtype=2 means that predicted
values for both target incomplete and complete observations come from
a fit from the same bootstrap sample. pmmtype=1, the default,
means that predicted values for complete observations are based
on additive regression fits on original complete observations (using last
imputations for non-target variables as with the other methods), and using
fits on a bootstrap sample to get predicted values for missing target variables.
See van Buuren (2012) section 3.4.2 where pmmtype=1 is said to
work much better when the number of variables is small.
pmmtype=3 means that complete observation predicted values come
from a bootstrap sample fit whereas target incomplete observation
predicted values come from a sample with replacement from the bootstrap
fit (approximate Bayesian bootstrap).
match
Defaults to match="weighted" to do weighted multinomial
probability sampling using the tricube function (similar to lowess)
as the weights. The argument of the tricube function is the absolute
difference in transformed predicted values of all the donors and of
the target predicted value, divided by a scaling factor.
The scaling factor in the tricube function is fweighted times
the mean absolute difference between the target predicted value and
all the possible donor predicted values. Set match="closest"
to find as the donor the observation having the closest predicted
transformed value, even if that same donor is found repeatedly. Set
match="kclosest" to use a slower implementation that finds,
after jittering the complete case predicted values, the
kclosest complete cases on the target variable being imputed,
then takes a random sample of one of these kclosest cases.
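The weighted sampling just described can be sketched as follows;
pick.donor is a hypothetical stand-alone function for illustration,
not the internal implementation:

```r
## Sketch of tricube-weighted donor selection for one missing value.
## yhat.donors: predicted transformed values for complete observations
## yhat.target: predicted transformed value for the observation imputed
## fweighted:   smoothing multiplier (default 0.2 in aregImpute)
pick.donor <- function(yhat.donors, yhat.target, fweighted = 0.2) {
  d <- abs(yhat.donors - yhat.target)
  s <- fweighted * mean(d)               # scaling factor
  if (s == 0) return(sample(seq_along(yhat.donors), 1))
  w <- pmax(1 - (d / s)^3, 0)^3          # tricube weights; 0 beyond span
  if (all(w == 0)) w[which.min(d)] <- 1  # fall back to closest donor
  sample(seq_along(yhat.donors), 1, prob = w)
}
```

Smaller fweighted shrinks the scaling factor, concentrating the weight
on the nearest donors; match="closest" is the limiting case.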
kclosest
see match
fweighted
Smoothing parameter (multiple of mean absolute difference) used when
match="weighted", with a default value of 0.2. Set
fweighted to a number between 0.02 and 0.2 to force the donor
to have a predicted value closer to the target, and set
fweighted to larger values (but seldom larger than 1.0) to allow
donor values to be less tightly matched. See the examples below to
learn how to study the relationship between fweighted and the
standard deviation of multiple imputations within individuals.
curtail
applies if type='regression', causing imputed
values to be curtailed at the observed range of the target variable.
Set to FALSE to allow extrapolation outside the data range.
boot.method
By default, simple bootstrapping is used in which the
target variable is predicted using a sample with replacement from the
observations with non-missing target variable. Specify
boot.method='approximate bayesian' to build the imputation
models from a sample with replacement from a sample with replacement
of the observations with non-missing targets. Preliminary simulations
have shown this results in good confidence coverage of the final model
parameters when type='regression' is used. Not implemented
when group is used.
burnin
aregImpute does burnin + n.impute iterations of the
entire modeling process. The first burnin imputations are
discarded. More burn-in iterations may be required when multiple
variables are missing on the same observations. When only one
variable is missing, no burn-ins are needed and burnin is set
to zero if unspecified.
pr
set to FALSE to suppress printing of iteration messages
plotTrans
set to TRUE to plot ace or avas transformations
for each variable for each of the multiple imputations. This is
useful for determining whether transformations are reasonable. If
transformations are too noisy or have long flat sections (resulting in
"lumps" in the distribution of imputed values), it may be advisable to
place restrictions on the transformations (monotonicity or linearity).
tolerance
singularity criterion; see the source code of the
lm.fit.qr.bare function for details
B
number of bootstrap resamples to use if nk is a vector
digits
number of digits for printing
nclass
number of bins to use in drawing histogram
datadensity
see Ecdf
diagnostics
Specify diagnostics=TRUE to draw plots of imputed values against
sequential imputation numbers, separately for each missing
observation and variable.
maxn
Maximum number of observations shown for diagnostics. Default is
maxn=10, which limits the number of observations plotted to at most
the first 10.
nperm
number of random formula permutations for reformM;
omit to sort variables by descending missing count.
...
other arguments that are ignored
Details
The sequence of steps used by the aregImpute algorithm is the
following.
(1) For each variable containing m NAs where m > 0, initialize the
NAs to values from a random sample (without replacement if
a sufficient number of non-missing values exist) of size m from the
non-missing values.
(2) For burnin+n.impute iterations do the following steps. The
first burnin iterations provide a burn-in, and imputations are
saved only from the last n.impute iterations.
(3) For each variable containing any NAs, draw a sample with
replacement from the observations in the entire dataset in which the
current variable being imputed is non-missing. Fit a flexible
additive model to predict this target variable while finding the
optimum transformation of it (unless the identity
transformation is forced). Use this fitted flexible model to
predict the target variable in all of the original observations.
Impute each missing value of the target variable with the observed
value whose predicted transformed value is closest to the predicted
transformed value of the missing value (if match="closest" and
type="pmm"),
or use a draw from a multinomial distribution with probabilities derived
from distance weights, if match="weighted" (the default).
(4) After these imputations are computed, use these random draw
imputations the next time the current target variable is used as a
predictor of other sometimes-missing variables.
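Steps (3)-(4) can be sketched for a single numeric target y with one
predictor x. This is a simplified stand-in using lm in place of the
flexible areg fit and closest-match PMM, not the actual aregImpute
internals:

```r
## Simplified sketch of one imputation pass for one target variable
pmm.step <- function(x, y) {
  obs  <- which(!is.na(y))
  boot <- sample(obs, length(obs), replace = TRUE)   # bootstrap resample
  fit  <- lm(y ~ x, data = data.frame(x = x[boot], y = y[boot]))
  pred <- predict(fit, newdata = data.frame(x = x))  # predict everyone
  for (i in which(is.na(y)))                         # match each NA to its
    y[i] <- y[obs][which.min(abs(pred[obs] - pred[i]))]  # closest donor
  y
}
```

Note that donors are drawn only from the originally observed values of
y, so imputations can never fall outside the observed data.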
When match="closest", predictive mean matching does not work well
when fewer than 3 variables are used to predict the target variable,
because many of the multiple imputations for an observation will be
identical. In the extreme case of one right-hand-side variable and
assuming that only monotonic transformations of left and right-side
variables are allowed, every bootstrap resample will give predicted
values of the target variable that are monotonically related to
predicted values from every other bootstrap resample. The same is true
for Bayesian predicted values. This causes predictive mean matching to
always match on the same donor observation.
When the missingness mechanism for a variable is so systematic that the
distribution of observed values is truncated, predictive mean matching
does not work. It will only yield imputed values that are near observed
values, so intervals in which no values are observed will not be
populated by imputed values. For this case, the only hope is to make
regression assumptions and use extrapolation. With
type="regression", aregImpute will use linear
extrapolation to obtain a (hopefully) reasonable distribution of imputed
values. The "regression" option causes aregImpute to
impute missing values by adding a random sample of residuals (with
replacement if there are more NAs than measured values) on the
transformed scale of the target variable. After random residuals are
added, predicted random draws are obtained on the original untransformed
scale using reverse linear interpolation on the table of original and
transformed target values (linear extrapolation when a random residual
is large enough to put the random draw prediction outside the range of
observed values). The bootstrap is used as with type="pmm" to
factor in the uncertainty of the imputation model.
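The back-transformation step for type="regression" can be sketched
with Hmisc's approxExtrap, which does linear interpolation with linear
extrapolation beyond the range; the values here are illustrative:

```r
## A random draw on the transformed scale is mapped back to the
## original scale by reverse linear interpolation, with linear
## extrapolation when the draw falls outside the observed range
library(Hmisc)
y    <- c(1, 2, 4, 8, 16)        # observed target values
ty   <- log(y)                   # estimated transformation (illustrative)
draw <- max(ty) + 0.5            # predicted value + random residual,
                                 # beyond the observed range
approxExtrap(ty, y, xout = draw)$y   # extrapolated imputed value
```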
As model uncertainty is high when the transformation of a target
variable is unknown, tlinear defaults to TRUE to limit the
variance in predicted values when nk is positive.
Value
a list of class "aregImpute" containing the following elements:
call
the function call expression
formula
the formula specified to aregImpute
match
the match argument
fweighted
the fweighted argument
n
total number of observations in input dataset
p
number of variables
na
list of subscripts of observations for which values were originally missing
nna
named vector containing the numbers of missing values in the data
type
vector of types of transformations used for each variable
("s","l","c" for smooth spline, linear, or categorical with dummy
variables)
tlinear
value of tlinear parameter
nk
number of knots used for smooth transformations
cat.levels
list containing character vectors specifying the levels of
categorical variables
df
degrees of freedom (number of parameters estimated) for each
variable
n.impute
number of multiple imputations per missing value
imputed
a list containing matrices of imputed values in the same format as
those created by transcan. Categorical variables are coded using
their integer codes. Variables having no missing values will have
NULL matrices in the list.
x
if x is TRUE, the original data matrix with
integer codes for categorical variables
rsq
for the last round of imputations, a vector containing the R-squares
with which each sometimes-missing variable could be predicted from the
others by ace or avas.
References
van Buuren, Stef. Flexible Imputation of Missing Data. Chapman &
Hall/CRC, Boca Raton FL, 2012.
Little R, An H. Robust likelihood-based analysis of multivariate data
with missing values. Statistica Sinica 14:949-968, 2004.
van Buuren S, Brand JPL, Groothuis-Oudshoorn CGM, Rubin DB. Fully
conditional specifications in multivariate imputation. J Stat Comp
Sim 72:1049-1064, 2006.
de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB.
Multiple imputation to correct for partial verification bias
revisited. Stat Med 27:5880-5889, 2008.
Siddique J. Multiple imputation using an iterative hot-deck with
distance-based donor selection. Stat Med 27:83-102, 2008.
White IR, Royston P, Wood AM. Multiple imputation using chained
equations: Issues and guidance for practice. Stat Med 30:377-399,
2011.
Examples
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 <- rnorm(200)
x2 <- x1^2
x3 <- runif(200)
m <- 30
x2[1:m] <- NA
a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)
x1[1:m]^2
a$imputed$x2
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 <- factor(sample(c('a','b','c'),1000,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 <- rnorm(1000)
y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 <- x1[1:250]
orig.x2 <- x2[251:350]
x1[1:250] <- NA
x2[251:350] <- NA
d <- data.frame(x1,x2,x3,y)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f
## Not run:
# Use 100 imputations to better check against individual true values
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat <- function(u) {
tab <- table(u)
as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2 + x3)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 <- rnorm(100)
y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 <- x1[1:20]
orig.x2 <- x2[18:23]
x1[1:20] <- NA
x2[18:23] <- NA
#x2[21:25] <- NA
d <- data.frame(x1,x2,x3,y)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r <- range(f$imputed$x2, orig.x2)
# for(i in 1:6) { # use 1:2 to mimic maxn=2
# plot(1:100, f$imputed$x2[i,], ylim=r,
# ylab=paste("Imputations for Obs.",i))
# abline(h=orig.x2[i],lty=2)
# }
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi <- fit.mult.impute(y ~ x1 + x2, lm, f,
data=d)
sqrt(diag(vcov(fmi)))
fcc <- lm(y ~ x1 + x2)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations. SDs are computed from average variances
# across subjects. match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 1-1 functions of each
# other)
set.seed(23)
x <- runif(200)
y <- x + runif(200, -.05, .05)
r <- resid(lsfit(x,y))
rmse <- sqrt(sum(r^2)/(200-2)) # sqrt of residual MSE
y[1:20] <- NA
d <- data.frame(x,y)
f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically. In this degenerate
# case changing 3 to 1-2,4-10 will not alter the results.
imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
pr=FALSE, check=FALSE)
sd <- sqrt(mean(apply(f$imputed$y, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
## Not run:
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h <- lm(age ~ sex + pclass + survived, data=titanic3)
rmse <- summary(h)$sigma
set.seed(21)
f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
data=titanic3, match='closest')
sd <- sqrt(mean(apply(f$imputed$age, 1, var)))
ss <- c(0, .01, .02, seq(.05, 1, length=20))
sds <- ss; sds[1] <- sd
for(i in 2:length(ss)) {
f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
n.impute=10, fweighted=ss[i])
sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
d <- data.frame(x1=rnorm(50), x2=c(rep(NA, 10), runif(40)),
x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
> ### Name: aregImpute
> ### Title: Multiple Imputation using Additive Regression, Bootstrapping,
> ### and Predictive Mean Matching
> ### Aliases: aregImpute print.aregImpute plot.aregImpute reformM
> ### Keywords: smooth regression multivariate methods models
>
> ### ** Examples
>
> # Check that aregImpute can almost exactly estimate missing values when
> # there is a perfect nonlinear relationship between two variables
> # Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
> set.seed(3)
> x1 <- rnorm(200)
> x2 <- x1^2
> x3 <- runif(200)
> m <- 30
> x2[1:m] <- NA
> a <- aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5
> a
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~x1 + x2 + I(x3), n.impute = 5, nk = 4,
match = "closest")
n: 200 p: 3 Imputations: 5 nk: 4
Number of NAs:
x1 x2 x3
0 30 0
type d.f.
x1 s 3
x2 s 1
x3 l 1
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
x2
0.99
> matplot(x1[1:m]^2, a$imputed$x2)
> abline(a=0, b=1, lty=2)
>
> x1[1:m]^2
[1] 0.925315897 0.085571299 0.066971341 1.327407883 0.038330915 0.000907452
[7] 0.007296189 1.246818367 1.485613400 1.606223478 0.554699626 1.279655455
[13] 0.513169486 0.063833220 0.023117897 0.094652479 0.908242033 0.420218743
[19] 1.498943851 0.039924679 0.334643416 0.887930672 0.041505171 2.777138392
[25] 0.234696753 0.549188688 1.347028987 1.024279865 0.005195306 1.292273993
> a$imputed$x2
[,1] [,2] [,3] [,4] [,5]
1 9.305771e-01 0.93057707 0.93057707 0.93057707 0.93057707
2 1.238805e-01 0.12399527 0.12399527 0.13530461 0.12399527
3 4.379575e-02 0.10438988 0.04379575 0.04379575 0.07825542
4 1.354657e+00 1.26259793 1.27885563 1.27885563 1.26259793
5 2.631055e-02 0.05674881 0.07133446 0.05108231 0.04379575
6 4.099071e-05 0.04379575 0.04273977 0.01634461 0.03883560
7 4.099071e-05 0.04379575 0.03567195 0.01634461 0.04379575
8 1.337074e+00 1.18470752 1.18470752 1.18470752 1.18470752
9 1.481256e+00 1.35465737 1.37906823 1.47965691 1.37906823
10 1.748349e+00 1.59142672 1.59142672 1.59142672 1.54302202
11 5.813279e-01 0.61859306 0.63005689 0.61859306 0.61859306
12 1.337074e+00 1.18470752 1.26259793 1.26259793 1.18470752
13 5.424354e-01 0.61859306 0.59489008 0.55889449 0.59489008
14 4.379575e-02 0.04379575 0.04379575 0.04379575 0.07825542
15 4.273977e-02 0.02460284 0.05674881 0.04273977 0.04379575
16 1.238805e-01 0.15278564 0.12399527 0.14905297 0.19338911
17 9.305771e-01 0.93057707 0.93057707 0.91224047 0.91224047
18 4.652969e-01 0.46591932 0.45054496 0.42364183 0.49775214
19 1.591427e+00 1.47965691 1.47965691 1.47965691 1.37906823
20 4.273977e-02 0.04379575 0.06568317 0.05502560 0.10438988
21 3.811211e-01 0.44666514 0.42239372 0.38112108 0.42239372
22 9.122405e-01 0.91224047 0.91224047 0.91224047 0.91224047
23 4.273977e-02 0.15498314 0.10438988 0.04379575 0.09638736
24 3.018085e+00 2.23627988 2.66392498 2.58567345 2.22849233
25 2.925818e-01 0.32892309 0.32892309 0.29258175 0.32892309
26 5.948901e-01 0.63090662 0.61859306 0.59489008 0.61859306
27 1.379068e+00 1.33707381 1.27885563 1.27885563 1.26259793
28 1.072242e+00 0.97998959 0.95899670 0.95899670 0.95899670
29 2.049374e-02 0.03883560 0.07148054 0.03466948 0.04379575
30 1.337074e+00 1.18470752 1.26259793 1.27885563 1.26259793
>
>
> # Multiple imputation and estimation of variances and covariances of
> # regression coefficient estimates accounting for imputation
> # Example 1: large sample size, much missing data, no overlap in
> # NAs across variables
> x1 <- factor(sample(c('a','b','c'),1000,TRUE))
> x2 <- (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
> x3 <- rnorm(1000)
> y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
> orig.x1 <- x1[1:250]
> orig.x2 <- x2[251:350]
> x1[1:250] <- NA
> x2[251:350] <- NA
> d <- data.frame(x1,x2,x3,y)
> # Find value of nk that yields best validating imputation models
> # tlinear=FALSE means to not force the target variable to be linear
> f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
+ data=d, B=10) # normally B=75
Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8
> f
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~y + x1 + x2 + x3, data = d, nk = c(0, 3:5),
tlinear = FALSE, B = 10)
n: 1000 p: 4 Imputations: 5 nk: 0
Number of NAs:
y x1 x2 x3
0 250 100 0
type d.f.
y s 1
x1 c 2
x2 s 1
x3 s 1
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
x1 x2
0.337 0.606
Resampling results for determining the complexity of imputation models
Variable being imputed: x1
nk=0 nk=3 nk=4 nk=5
Bootstrap bias-corrected R^2 0.283 0.301 0.301 0.295
10-fold cross-validated R^2 0.311 0.296 0.307 0.298
Bootstrap bias-corrected mean |error| 1.065 1.050 1.057 1.057
10-fold cross-validated mean |error| 0.516 0.515 0.522 0.524
Bootstrap bias-corrected median |error| 1.000 1.000 1.000 1.000
10-fold cross-validated median |error| 0.200 0.100 0.200 0.100
Variable being imputed: x2
nk=0 nk=3 nk=4 nk=5
Bootstrap bias-corrected R^2 0.625 0.624 0.614 0.619
10-fold cross-validated R^2 0.627 0.617 0.624 0.621
Bootstrap bias-corrected mean |error| 1.140 1.206 1.232 1.209
10-fold cross-validated mean |error| 1.751 1.212 1.226 1.235
Bootstrap bias-corrected median |error| 0.935 1.002 1.026 1.024
10-fold cross-validated median |error| 1.538 1.000 1.033 1.048
> # Try forcing target variable (x1, then x2) to be linear while allowing
> # predictors to be nonlinear (could also say tlinear=TRUE)
> f <- aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
Iteration 1 Iteration 2 Iteration 3 Iteration 4 Iteration 5 Iteration 6 Iteration 7 Iteration 8
> f
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~y + x1 + x2 + x3, data = d, nk = c(0, 3:5),
B = 10)
n: 1000 p: 4 Imputations: 5 nk: 0
Number of NAs:
y x1 x2 x3
0 250 100 0
type d.f.
y s 1
x1 c 2
x2 s 1
x3 s 1
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
x1 x2
0.282 0.647
Resampling results for determining the complexity of imputation models

Variable being imputed: x1
                                         nk=0  nk=3  nk=4  nk=5
Bootstrap bias-corrected R^2            0.271 0.278 0.268 0.273
10-fold cross-validated  R^2            0.279 0.276 0.275 0.275
Bootstrap bias-corrected mean   |error| 1.047 1.044 1.050 1.044
10-fold cross-validated  mean   |error| 0.533 0.537 0.541 0.548
Bootstrap bias-corrected median |error| 1.000 1.000 1.000 1.000
10-fold cross-validated  median |error| 0.200 0.200 0.100 0.300

Variable being imputed: x2
                                         nk=0  nk=3  nk=4  nk=5
Bootstrap bias-corrected R^2            0.632 0.613 0.619 0.618
10-fold cross-validated  R^2            0.623 0.623 0.625 0.619
Bootstrap bias-corrected mean   |error| 1.149 1.165 1.155 1.164
10-fold cross-validated  mean   |error| 1.791 1.783 1.788 1.790
Bootstrap bias-corrected median |error| 0.967 0.963 0.972 0.962
10-fold cross-validated  median |error| 1.617 1.587 1.599 1.603
>
> ## Not run:
> ##D # Use 100 imputations to better check against individual true values
> ##D f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
> ##D f
> ##D par(mfrow=c(2,1))
> ##D plot(f)
> ##D modecat <- function(u) {
> ##D   tab <- table(u)
> ##D   as.numeric(names(tab)[tab==max(tab)][1])
> ##D }
> ##D table(orig.x1,apply(f$imputed$x1, 1, modecat))
> ##D par(mfrow=c(1,1))
> ##D plot(orig.x2, apply(f$imputed$x2, 1, mean))
> ##D fmi <- fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
> ##D data=d)
> ##D sqrt(diag(vcov(fmi)))
> ##D fcc <- lm(y ~ x1 + x2 + x3)
> ##D summary(fcc) # SEs are larger than from mult. imputation
> ## End(Not run)
> ## Not run:
> ##D # Example 2: Very discriminating imputation models,
> ##D # x1 and x2 have some NAs on the same rows, smaller n
> ##D set.seed(5)
> ##D x1 <- factor(sample(c('a','b','c'),100,TRUE))
> ##D x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
> ##D x3 <- rnorm(100)
> ##D y <- x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
> ##D orig.x1 <- x1[1:20]
> ##D orig.x2 <- x2[18:23]
> ##D x1[1:20] <- NA
> ##D x2[18:23] <- NA
> ##D #x2[21:25] <- NA
> ##D d <- data.frame(x1,x2,x3,y)
> ##D n <- naclus(d)
> ##D plot(n); naplot(n) # Show patterns of NAs
> ##D # 100 imputations to study them; normally use 5 or 10
> ##D f <- aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
> ##D par(mfrow=c(2,3))
> ##D plot(f, diagnostics=TRUE, maxn=2)
> ##D # Note: diagnostics=TRUE makes graphs similar to those made by:
> ##D # r <- range(f$imputed$x2, orig.x2)
> ##D # for(i in 1:6) { # use 1:2 to mimic maxn=2
> ##D # plot(1:100, f$imputed$x2[i,], ylim=r,
> ##D # ylab=paste("Imputations for Obs.",i))
> ##D # abline(h=orig.x2[i],lty=2)
> ##D # }
> ##D
> ##D table(orig.x1,apply(f$imputed$x1, 1, modecat))
> ##D par(mfrow=c(1,1))
> ##D plot(orig.x2, apply(f$imputed$x2, 1, mean))
> ##D
> ##D
> ##D fmi <- fit.mult.impute(y ~ x1 + x2, lm, f,
> ##D data=d)
> ##D sqrt(diag(vcov(fmi)))
> ##D fcc <- lm(y ~ x1 + x2)
> ##D summary(fcc) # SEs are larger than from mult. imputation
> ## End(Not run)
>
> ## Not run:
> ##D # Study relationship between smoothing parameter for weighting function
> ##D # (multiplier of mean absolute distance of transformed predicted
> ##D # values, used in tricube weighting function) and standard deviation
> ##D # of multiple imputations. SDs are computed from average variances
> ##D # across subjects. match="closest" same as match="weighted" with
> ##D # small value of fweighted.
> ##D # This example also shows problems with predicted mean
> ##D # matching almost always giving the same imputed values when there is
> ##D # only one predictor (regression coefficients change over multiple
> ##D # imputations but predicted values are virtually 1-1 functions of each
> ##D # other)
> ##D
> ##D set.seed(23)
> ##D x <- runif(200)
> ##D y <- x + runif(200, -.05, .05)
> ##D r <- resid(lsfit(x,y))
> ##D rmse <- sqrt(sum(r^2)/(200-2)) # sqrt of residual MSE
> ##D
> ##D y[1:20] <- NA
> ##D d <- data.frame(x,y)
> ##D f <- aregImpute(~ x + y, n.impute=10, match='closest', data=d)
> ##D # As an aside here is how to create a completed dataset for imputation
> ##D # number 3 as fit.mult.impute would do automatically. In this degenerate
> ##D # case changing 3 to 1-2,4-10 will not alter the results.
> ##D imputed <- impute.transcan(f, imputation=3, data=d, list.out=TRUE,
> ##D pr=FALSE, check=FALSE)
> ##D sd <- sqrt(mean(apply(f$imputed$y, 1, var)))
> ##D
> ##D ss <- c(0, .01, .02, seq(.05, 1, length=20))
> ##D sds <- ss; sds[1] <- sd
> ##D
> ##D for(i in 2:length(ss)) {
> ##D   f <- aregImpute(~ x + y, n.impute=10, fweighted=ss[i], data=d)
> ##D   sds[i] <- sqrt(mean(apply(f$imputed$y, 1, var)))
> ##D }
> ##D
> ##D plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
> ##D type='b')
> ##D abline(v=.2, lty=2) # default value of fweighted
> ##D abline(h=rmse, lty=2) # root MSE of residuals from linear regression
> ## End(Not run)
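The tricube weighting function referred to in the comments above can be sketched as follows. This is only an illustration of the weighting scheme under the stated assumptions, not aregImpute's internal code, and the function name `tricube_wt` is hypothetical:

```r
# d: absolute differences between the transformed predicted value of the
#    row being imputed and the predicted values of candidate donors
# s: scale = fweighted * mean absolute distance (fweighted defaults to 0.2)
tricube_wt <- function(d, fweighted = 0.2) {
  s <- fweighted * mean(d)
  w <- (1 - pmin(d / s, 1)^3)^3   # tricube kernel; zero weight beyond s
  w / sum(w)                      # normalize to selection probabilities
}
```

A small `fweighted` concentrates nearly all weight on the closest donors, which is why `match='closest'` behaves like `match='weighted'` with a small `fweighted`.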
>
> ## Not run:
> ##D # Do a similar experiment for the Titanic dataset
> ##D getHdata(titanic3)
> ##D h <- lm(age ~ sex + pclass + survived, data=titanic3)
> ##D rmse <- summary(h)$sigma
> ##D set.seed(21)
> ##D f <- aregImpute(~ age + sex + pclass + survived, n.impute=10,
> ##D data=titanic3, match='closest')
> ##D sd <- sqrt(mean(apply(f$imputed$age, 1, var)))
> ##D
> ##D ss <- c(0, .01, .02, seq(.05, 1, length=20))
> ##D sds <- ss; sds[1] <- sd
> ##D
> ##D for(i in 2:length(ss)) {
> ##D   f <- aregImpute(~ age + sex + pclass + survived, data=titanic3,
> ##D                   n.impute=10, fweighted=ss[i])
> ##D   sds[i] <- sqrt(mean(apply(f$imputed$age, 1, var)))
> ##D }
> ##D
> ##D plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
> ##D type='b')
> ##D abline(v=.2, lty=2) # default value of fweighted
> ##D abline(h=rmse, lty=2) # root MSE of residuals from linear regression
> ## End(Not run)
>
> d <- data.frame(x1=rnorm(50), x2=c(rep(NA, 10), runif(40)),
+ x3=c(runif(4), rep(NA, 11), runif(35)))
> reformM(~ x1 + x2 + x3, data=d)
Recommended number of imputations: 30
~x3 + x2 + x1
<environment: 0x73adec0>
> reformM(~ x1 + x2 + x3, data=d, nperm=2)
Recommended number of imputations: 30
[[1]]
~x3 + x1 + x2
<environment: 0x730ccd8>
[[2]]
~x1 + x3 + x2
<environment: 0x730ccd8>
> # Give result or one of the results as the first argument to aregImpute
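As the comment notes, the formula produced by reformM can be passed directly to aregImpute. A minimal sketch, reusing the variable names from the example above (the object names `fmla` and `f` are illustrative):

```r
# Randomly permuted formula plus the recommended number of imputations
fmla <- reformM(~ x1 + x2 + x3, data=d)   # also prints recommended n.impute
f <- aregImpute(fmla, data=d, n.impute=30)
```

Randomizing the order of variables in the formula removes any dependence of the results on the arbitrary order in which variables are imputed.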
>
> dev.off()
null device
1
>