R: Transformations/Imputations using Canonical Variates
transcan
R Documentation
Transformations/Imputations using Canonical Variates
Description
transcan is a nonlinear additive transformation and imputation
function, and there are several functions for using and operating on
its results. transcan automatically transforms continuous and
categorical variables to have maximum correlation with the best linear
combination of the other variables. There is also an option to use a
substitute criterion - maximum correlation with the first principal
component of the other variables. Continuous variables are expanded
as restricted cubic splines and categorical variables are expanded as
contrasts (e.g., dummy variables). By default, the first canonical
variate is used to find optimum linear combinations of component
columns. This function is similar to ace except that
transformations for continuous variables are fitted using restricted
cubic splines, monotonicity restrictions are not allowed, and
NAs are allowed. When a variable has any NAs,
transformed scores for that variable are imputed using least squares
multiple regression incorporating optimum transformations, or
NAs are optionally set to constants. Shrinkage can be used to
safeguard against overfitting when imputing. Optionally, imputed
values on the original scale are also computed and returned. For this
purpose, recursive partitioning or multinomial logistic models can
optionally be used to impute categorical variables, using what is
predicted to be the most probable category.
By default, transcan imputes NAs with “best
guess” expected values of transformed variables, back transformed to
the original scale. Values thus imputed are most like conditional
medians assuming the transformations make variables' distributions
symmetric (imputed values are similar to conditional modes for
categorical variables). By instead specifying n.impute,
transcan does approximate multiple imputation from the
distribution of each variable conditional on all other variables.
This is done by sampling n.impute residuals from the
transformed variable, with replacement (a la bootstrapping), or by
default, using Rubin's approximate Bayesian bootstrap, where a sample
of size n with replacement is selected from the residuals on
n non-missing values of the target variable, and then a sample
of size m with replacement is chosen from this sample, where
m is the number of missing values needing imputation for the
current multiple imputation repetition. Neither of these bootstrap
procedures assumes normality or even symmetry of residuals. For
sometimes-missing categorical variables, optimal scores are computed
by adding the “best guess” predicted mean score to random
residuals off this score. Then categories having scores closest to
these predicted scores are taken as the random multiple imputations
(impcat = "rpart" is not currently allowed
with n.impute). The literature recommends using n.impute
= 5 or greater. transcan provides only an approximation to
multiple imputation, especially since it “freezes” the
imputation model before drawing the multiple imputations rather than
using different estimates of regression coefficients for each
imputation. For multiple imputation, the aregImpute function
provides a much better approximation to the full Bayesian approach
while still not requiring linearity assumptions.
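The two-stage approximate Bayesian bootstrap described above can be sketched in base R. This is an illustration under stated assumptions, not transcan's internal code: the helper abb_draw, the stand-in residuals res, and the predictions pred are all hypothetical.

```r
# Illustrative sketch of the approximate Bayesian bootstrap (ABB);
# 'abb_draw' is a hypothetical helper, not part of Hmisc.
abb_draw <- function(res, m) {
  n <- length(res)
  # Stage 1: resample n residuals with replacement from the n observed ones
  stage1 <- sample(res, size = n, replace = TRUE)
  # Stage 2: draw m residuals with replacement from the stage-1 sample
  sample(stage1, size = m, replace = TRUE)
}

set.seed(1)
res  <- rnorm(50)             # stand-in residuals on the transformed scale
draw <- abb_draw(res, m = 7)  # one repetition for 7 missing values

# Imputed transformed value = "best guess" prediction + drawn residual
pred    <- rep(0.3, 7)        # hypothetical predicted transformed scores
imputed <- pred + draw
```

Setting boot.method="simple" would correspond to skipping stage 1 and sampling directly from res.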
When you specify n.impute to transcan you can use
fit.mult.impute to re-fit any model n.impute times based
on n.impute completed datasets (if there are any sometimes
missing variables not specified to transcan, some observations
will still be dropped from these fits). After fitting n.impute
models, fit.mult.impute will return the fit object from the
last imputation, with coefficients replaced by the average of
the n.impute coefficient vectors and with a component
var equal to the imputation-corrected variance-covariance
matrix. fit.mult.impute can also use the object created by the
mice function in the mice library to draw the
multiple imputations, as well as objects created by
aregImpute. The following components of fit objects are
also replaced with averages over the n.impute model fits:
linear.predictors, fitted.values, stats,
means, icoef, scale, center,
y.imputed.
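As a rough sketch of what "averaging coefficients" and an "imputation-corrected" covariance matrix mean, the following base R code pools several fits. It assumes the usual Rubin's-rule form (average within-imputation covariance plus (1 + 1/M) times the between-imputation covariance); the text above does not spell out the exact formula, and combine_fits and the crude imputations are hypothetical.

```r
# Illustrative pooling of M fits; 'combine_fits' is a hypothetical helper,
# not the fit.mult.impute implementation.
combine_fits <- function(coefs, vcovs) {
  M    <- length(coefs)
  cmat <- do.call(rbind, coefs)        # M x p matrix of coefficient vectors
  beta <- colMeans(cmat)               # averaged coefficients
  W    <- Reduce(`+`, vcovs) / M       # average within-imputation covariance
  B    <- cov(cmat)                    # between-imputation covariance
  list(coefficients = beta, var = W + (1 + 1 / M) * B)
}

set.seed(2)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- 1:10                           # pretend these x values were missing
fits <- lapply(1:5, function(i) {
  xi <- x
  xi[miss] <- rnorm(length(miss))      # crude stand-in for one set of imputations
  lm(y ~ xi)
})
pooled <- combine_fits(lapply(fits, coef), lapply(fits, vcov))
```

The pooled variances are never smaller than the average within-imputation variances, which is the sense in which they are corrected for imputation.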
The summary method for transcan prints the function
call, R^2 achieved in transforming each variable, and for each
variable the coefficients of all other transformed variables that are
used to estimate the transformation of the initial variable. If
imputed=TRUE was used in the call to transcan, summary also uses the
describe function to print a summary of imputed values. If
long = TRUE, also prints all imputed values with observation
identifiers. There is also a simple function print.transcan
which merely prints the transformation matrix and the function call.
It has an optional argument long, which if set to TRUE
causes detailed parameters to be printed. Instead of plotting while
transcan is running, you can plot the final transformations
after the fact using plot.transcan or ggplot.transcan,
if the option trantab = TRUE was specified to transcan.
If in addition the option
imputed = TRUE was specified to transcan,
plot and ggplot will show the location of imputed values
(including multiples) along the axes. For ggplot, imputed
values are shown as red plus signs.
The impute method for transcan does imputations for a
selected original data variable, on the original scale (if
imputed=TRUE was given to transcan). If you do not
specify a variable to impute, it will do imputations for all
variables given to transcan which had at least one missing
value. This assumes that the original variables are accessible (i.e.,
they have been attached) and that you want the imputed variables to
have the same names as the original variables. If n.impute was
specified to transcan you must tell impute which
imputation to use. Results are stored in .GlobalEnv
when list.out is not specified (it is recommended to use
list.out=TRUE).
The predict method for transcan computes
predicted variables and imputed values from a matrix of new data.
This matrix should have the same column variables as the original
matrix used with transcan, and in the same order (unless a
formula was used with transcan).
The Function function is a generic function
generator. Function.transcan creates R functions to transform
variables using transformations created by transcan. These
functions are useful for getting predicted values with predictors set
to values on the original scale.
The vcov methods are defined here so that
imputation-corrected variance-covariance matrices are readily
extracted from fit.mult.impute objects, and so that
fit.mult.impute can easily compute traditional covariance
matrices for individual completed datasets.
The subscript method for transcan preserves attributes.
The invertTabulated function does either inverse linear
interpolation or uses sampling to sample qualifying x-values having
y-values near the desired values. The latter is used to get inverse
values having a reasonable distribution (e.g., no floor or ceiling
effects) when the transformation has a flat or nearly flat segment,
resulting in a many-to-one transformation in that region. Sampling
weights are a combination of the frequency of occurrence of x-values
that are within tolInverse times the range of y and the
squared distance between the associated y-values and the target
y-value (aty).
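The two inversion strategies can be illustrated with base R. This is a simplified sketch, not the invertTabulated source: invert_interp and invert_sample are hypothetical helpers, and the sampling version weights by frequency only, whereas the real function also uses the squared distance to the target.

```r
# Illustrative sketch only; helper names are hypothetical.
x <- 1:10
y <- sqrt(x)                       # tabulated, strictly increasing transformation

# Inverse linear interpolation: swap the roles of x and y in approx()
invert_interp <- function(x, y, aty) approx(y, x, xout = aty, rule = 2)$y

# Simplified sampling inverse: draw one x among those whose y lies within
# tol * range(y) of the target value, weighting by frequency
invert_sample <- function(x, y, aty, freq = rep(1, length(x)), tol = 0.05) {
  width <- tol * diff(range(y))
  sapply(aty, function(a) {
    ok <- which(abs(y - a) <= width)
    if (length(ok) == 0) ok <- which.min(abs(y - a))  # fall back to nearest
    x[ok[sample.int(length(ok), 1, prob = freq[ok])]]
  })
}

invert_interp(x, y, aty = 2)       # sqrt(4) = 2, so the inverse is 4

# A transformation with a flat segment: x = 3..7 all map to y = 3
y_flat <- c(1, 2, 3, 3, 3, 3, 3, 4, 5, 6)
set.seed(5)
s <- invert_sample(x, y_flat, aty = rep(3, 20))
```

With the flat segment, sampling spreads the inverse values over 3 to 7 instead of piling them on a single x value.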
Usage
transcan(x, method=c("canonical","pc"),
categorical=NULL, asis=NULL, nk, imputed=FALSE, n.impute,
boot.method=c('approximate bayesian', 'simple'),
trantab=FALSE, transformed=FALSE,
impcat=c("score", "multinom", "rpart"),
mincut=40,
inverse=c('linearInterp','sample'), tolInverse=.05,
pr=TRUE, pl=TRUE, allpl=FALSE, show.na=TRUE,
imputed.actual=c('none','datadensity','hist','qq','ecdf'),
iter.max=50, eps=.1, curtail=TRUE,
imp.con=FALSE, shrink=FALSE, init.cat="mode",
nres=if(boot.method=='simple')200 else 400,
data, subset, na.action, treeinfo=FALSE,
rhsImp=c('mean','random'), details.impcat='', ...)
## S3 method for class 'transcan'
summary(object, long=FALSE, digits=6, ...)
## S3 method for class 'transcan'
print(x, long=FALSE, ...)
## S3 method for class 'transcan'
plot(x, ...)
## S3 method for class 'transcan'
ggplot(data, mapping, scale=FALSE, ..., environment)
## S3 method for class 'transcan'
impute(x, var, imputation, name, pos.in, data,
list.out=FALSE, pr=TRUE, check=TRUE, ...)
fit.mult.impute(formula, fitter, xtrans, data, n.impute, fit.reps=FALSE,
dtrans, derived, vcovOpts=NULL, pr=TRUE, subset, ...)
## S3 method for class 'transcan'
predict(object, newdata, iter.max=50, eps=0.01, curtail=TRUE,
type=c("transformed","original"),
inverse, tolInverse, check=FALSE, ...)
Function(object, ...)
## S3 method for class 'transcan'
Function(object, prefix=".", suffix="", pos=-1, ...)
invertTabulated(x, y, freq=rep(1,length(x)),
aty, name='value',
inverse=c('linearInterp','sample'),
tolInverse=0.05, rule=2)
## Default S3 method:
vcov(object, regcoef.only=FALSE, ...)
## S3 method for class 'fit.mult.impute'
vcov(object, regcoef.only=TRUE,
intercepts='mid', ...)
Arguments
x
a matrix containing continuous variable values and codes for
categorical variables. The matrix must have column names
(dimnames). If row names are present, they are used in
forming the names attribute of imputed values if
imputed = TRUE. x may also be a formula, in which
case the model matrix is created automatically, using data in the
calling frame. Advantages of using a formula are that
categorical variables can be determined automatically by a
variable being a factor variable, and variables with
two unique levels are modeled asis. Variables with 3 unique
values are considered to be categorical if a formula is
specified. For a formula you may also specify that a variable is to
remain untransformed by enclosing its name with the identity
function, e.g. I(x3). The user may add other variable names
to the asis and categorical vectors. For
invertTabulated, x is a vector or a list with three
components: the x vector, the corresponding vector of transformed
values, and the corresponding vector of frequencies of the pair of
original and transformed variables. For print, plot,
ggplot, impute, and predict, x is an
object created by transcan.
formula
any R model formula
fitter
any R or rms modeling function (not in quotes) that computes
a vector of coefficients and for which
vcov will return a variance-covariance matrix. E.g.,
fitter = lm, glm,
ols. At present models
involving non-regression parameters (e.g., scale parameters in
parametric survival models) are not handled fully.
xtrans
an object created by transcan, aregImpute, or
mice
method
use method="canonical" or any abbreviation thereof, to use
canonical variates (the default). method="pc" transforms a
variable instead so as to maximize the correlation with the first
principal component of the other variables.
categorical
a character vector of names of variables in x which are
categorical, for which the ordering of re-scored values is not
necessarily preserved. If categorical is omitted, it is
assumed that all variables are continuous (or binary). Set
categorical="*" to treat all variables as categorical.
asis
a character vector of names of variables that are not to be
transformed. For these variables, the guts of
lm.fit with method="qr" is used to impute
missing values. You may want to treat binary variables asis
(this is automatic if using a formula). If imputed = TRUE,
you may want to use "categorical" for binary variables if you
want to force imputed values to be one of the original data
values. Set asis="*" to treat all variables asis.
nk
number of knots to use in expanding each continuous variable (not
listed in asis) in a restricted cubic spline function.
Default is 3 (yielding 2 parameters for a variable) if
n < 30, 4 if
30 <= n < 100, and 5 if
n >= 100 (4 parameters).
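The default rule for nk can be written as a small function. The name default_nk is hypothetical and exists only to restate the rule above; a restricted cubic spline with nk knots contributes nk - 1 parameters.

```r
# Restates the documented default for nk; 'default_nk' is a hypothetical name.
default_nk <- function(n) if (n < 30) 3 else if (n < 100) 4 else 5

n_obs  <- c(20, 30, 99, 100)
knots  <- sapply(n_obs, default_nk)   # 3 4 4 5
params <- knots - 1                   # parameters per continuous variable
```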
imputed
Set to TRUE to return a list containing imputed values on the
original scale. If the transformation for a variable is
non-monotonic, imputed values are not unique. transcan uses
the approx function, which returns the highest value
of the variable with the transformed score equalling the imputed
score. imputed=TRUE also causes original-scale imputed values
to be shown as tick marks on the top margin of each graph when
show.na=TRUE (for the final iteration only). For categorical
predictors, these imputed values are passed through the
jitter function so that their frequencies can be
visualized. When n.impute is used, each NA will have
n.impute tick marks.
n.impute
number of multiple imputations. If omitted, single predicted
expected value imputation is used. n.impute=5 is frequently
recommended.
boot.method
default is to use the approximate Bayesian bootstrap (sample with
replacement from sample with replacement of the vector of residuals).
You can also specify boot.method="simple" to use the usual
bootstrap one-stage sampling with replacement.
trantab
Set to TRUE to add an attribute trantab to the
returned matrix. This contains a vector of lists each with
components x and y containing the unique values and
corresponding transformed values for the columns of x. This
is set up to be used easily with the approx function.
You must specify trantab=TRUE if you want to later use the
predict.transcan function with type = "original".
transformed
set to TRUE to cause transcan to return an object
transformed containing the matrix of transformed variables
impcat
This argument tells how to impute categorical variables on the
original scale. The default is impcat="score" to impute the
category whose canonical variate score is closest to the predicted
score. Use impcat="rpart" to impute categorical variables
using the values of all other transformed predictors in conjunction
with the rpart function. A better but somewhat
slower approach is to
use impcat="multinom" to fit a multinomial logistic model to
the categorical variable, at the last iteration of the
transcan algorithm. This uses the multinom
function in the nnet package (which
is assumed to have been installed by the user) to fit a polytomous
logistic model to the current working transformations of all the
other variables (using conditional mean imputation for missing
predictors). Multiple imputations are made by drawing multinomial
values from the vector of predicted probabilities of category
membership for the missing categorical values.
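Drawing multiple imputations from predicted category probabilities can be sketched as follows. The probability matrix prob and the level names are made up for illustration; in practice the probabilities would come from the fitted multinomial model.

```r
set.seed(3)
# Hypothetical predicted probabilities for 4 observations with a missing category
lev  <- c("a", "b", "c")
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.8, 0.1),
              c(0.3, 0.3, 0.4),
              c(0.0, 0.0, 1.0))
n.impute <- 5

# One column per imputation; each cell is one draw from that row's probabilities
draws <- sapply(seq_len(n.impute), function(i)
  apply(prob, 1, function(p) sample(lev, 1, prob = p)))
```

Rows with a dominant predicted category (such as the last one) are imputed almost identically across imputations, while ambiguous rows vary.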
mincut
If imputed=TRUE, there are categorical variables, and
impcat = "rpart", mincut specifies the lowest node size
that will be allowed to be split. The default is 40.
inverse
By default, imputed values are back-solved on the original scale
using inverse linear interpolation on the fitted tabulated
transformed values. This will cause distorted distributions of
imputed values (e.g., floor and ceiling effects) when the estimated
transformation has a flat or nearly flat section. To instead use
the invertTabulated function (see above) with the
"sample" option, specify inverse="sample".
tolInverse
the multiplier of the range of transformed values, weighted by
freq and by the distance measure, for determining the set of
x values having y values within a tolerance of the value of
aty in invertTabulated. For predict.transcan,
inverse and tolInverse are obtained from options that
were specified to transcan by default. Otherwise, if not
specified by the user, these default to the defaults used to
invertTabulated.
pr
For transcan, set to FALSE to suppress printing
R^2 and shrinkage factors. For impute.transcan, set pr=FALSE
to suppress messages concerning the number of NA values
imputed. For fit.mult.impute, set pr=FALSE to suppress printing
variance inflation factors accounting for imputation, rate of
missing information, and degrees of freedom.
pl
Set to FALSE to suppress plotting the final transformations
with distribution of scores for imputed values (if
show.na=TRUE).
allpl
Set to TRUE to plot transformations for intermediate iterations.
show.na
Set to FALSE to suppress the distribution of scores assigned
to missing values (as tick marks on the right margin of each
graph). See also imputed.
imputed.actual
The default is "none" to suppress plotting of actual
vs. imputed values for all variables having any NA values.
Other choices are "datadensity" to use
datadensity to make a single plot, "hist" to
make a series of back-to-back histograms, "qq" to make a
series of q-q plots, or "ecdf" to make a series of empirical
cdfs. For imputed.actual="datadensity" for example you get a
rug plot of the non-missing values for the variable with beneath it
a rug plot of the imputed values. When imputed.actual is not
"none", imputed is automatically set to TRUE.
iter.max
maximum number of iterations to perform for transcan or
predict. For predict, only one iteration is
used if there are no NA values in the data or if
imp.con was used.
eps
convergence criterion for transcan and predict.
eps is the maximum change in transformed values from one
iteration to the next. If for a given iteration all new
transformations of variables differ by less than eps (with or
without negating the transformation to allow for “flipping”)
from the transformations in the previous iteration, one more
iteration is done for transcan. During this last iteration,
individual transformations are not updated but coefficients of
transformations are. This improves stability of coefficients of
canonical variates on the right-hand-side. eps is ignored
when rhsImp="random".
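The convergence test with allowance for "flipping" can be illustrated for a single variable. This is a sketch of the idea only: converged is a hypothetical helper, and the actual code may scale the differences before comparing them with eps.

```r
# Hypothetical helper illustrating the stopping rule for one variable
converged <- function(old, new, eps = 0.1) {
  same    <- max(abs(new - old))   # largest change if the sign is unchanged
  flipped <- max(abs(new + old))   # largest change if the transformation flipped
  min(same, flipped) < eps
}

old <- c(-1.2, 0.1, 0.9)
c1 <- converged(old,  old + 0.05)  # small change: converged
c2 <- converged(old, -old + 0.05)  # same shape with opposite sign: converged
c3 <- converged(old,  old + 0.50)  # large change: not converged
```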
curtail
for transcan, causes imputed values on the transformed scale
to be truncated so that their ranges are within the ranges of
non-imputed transformed values. For predict,
curtail defaults to TRUE to truncate predicted
transformed values to their ranges in the original fit (xt).
imp.con
for transcan, set to TRUE to impute NA values
on the original scales with constants (medians or most frequent
category codes). Set to a vector of constants to instead always use
these constants for imputation. These imputed values are ignored
when fitting the current working transformation for a single
variable.
shrink
default is FALSE to use ordinary least squares or canonical
variate estimates. For the purposes of imputing NAs, you may
want to set shrink=TRUE to avoid overfitting when developing
a prediction equation to predict each variable from all the others
(see details below).
init.cat
method for initializing scorings of categorical variables. Default
is "mode" to use a dummy variable set to 1 if the value is
the most frequent value. Use "random"
to use a random 0-1 variable. Set to "asis" to use the
original integer codes as starting scores.
nres
number of residuals to store if n.impute is specified. If
the dataset has fewer than nres observations, all residuals
are saved. Otherwise a random sample of the residuals of length
nres without replacement is saved. The default for
nres is higher if boot.method="approximate bayesian".
data
Data frame used to fill the formula. For ggplot, data is the
result of transcan with trantab=TRUE.
subset
an integer or logical vector specifying the subset of observations
to fit
na.action
These may be used if x is a formula. The default
na.action is na.retain (defined by transcan)
which keeps all observations with any NA values. For
impute.transcan, data is a data frame to use as the
source of variables to be imputed, rather than using
pos.in. For fit.mult.impute, data is
mandatory and is a data frame containing the data to be used in
fitting the model but before imputations are applied. Variables
omitted from data are assumed to be available from frame1
and do not need to be imputed.
treeinfo
Set to TRUE to get additional information printed when
impcat="rpart", such as the predicted probabilities of
category membership.
rhsImp
Set to "random" to use random draw imputation when a
sometimes missing variable is moved to be a predictor of other
sometimes missing variables. Default is rhsImp="mean", which
uses conditional mean imputation on the transformed scale.
Residuals used are residuals from the transformed scale. When
"random" is used, transcan runs 5 iterations and
ignores eps.
details.impcat
set to a character scalar that is the name of a category variable to
include in the resulting transcan object an element
details.impcat containing details of how the categorical
variable was multiply imputed.
...
arguments passed to scat1d or to the fitter
function (for fit.mult.impute). For ggplot.transcan,
these arguments are passed to facet_wrap, e.g. ncol=2.
long
for summary, set to TRUE to print all imputed
values. For print, set to TRUE to print details
of transformations/imputations.
digits
number of significant digits for printing values by
summary
scale
for ggplot.transcan set scale=TRUE to
scale transformed values to [0,1] before plotting.
mapping,environment
not used; needed because of rules about generics
var
For impute, is a variable that was originally a column
in x, for which imputed values are to be filled
in. imputed=TRUE must have been used in transcan.
Omit var to impute all variables, creating new variables in
position pos (see assign).
imputation
specifies which of the multiple imputations to use for filling in
NA values
name
name of variable to impute, for impute function.
Default is character string version of the second argument
(var) in the call to impute. For
invertTabulated, is the name of variable being transformed
(used only for warning messages).
pos.in
location as defined by assign to find variables that
need to be
imputed, when all variables are to be imputed automatically by
impute.transcan (i.e., when no input variable name is
specified). Default is position that contains
the first variable to be imputed.
list.out
If var is not specified, you can set list.out=TRUE to
have impute.transcan return a list containing variables with
needed values imputed. This list will contain a single imputation.
Variables not needing imputation are copied to the list as-is. You
can use this list for analysis just like a data frame.
check
set to FALSE to suppress certain warning messages
newdata
a new data matrix for which to compute transformed
variables. Categorical variables must use the same integer codes as
were used in the call to transcan. If a formula was
originally specified to transcan (instead of a data matrix),
newdata is optional and if given must be a data frame; a
model frame is generated automatically from the previous formula.
The na.action is handled automatically, and the levels for
factor variables must be the same and in the same order as were used
in the original variables specified in the formula given to
transcan.
fit.reps
set to TRUE to save all fit objects from the fit for each
imputation in fit.mult.impute. Then the object returned will
have a component fits which is a list whose ith
element is the ith fit object.
dtrans
provides an approach to creating derived variables from a single
filled-in dataset. The function specified as dtrans can even
reshape the imputed dataset. An example of such usage is fitting
time-dependent covariates in a Cox model that are created by
“start,stop” intervals. Imputations may be done on a one
record per subject data frame that is converted by dtrans to
multiple records per subject. The imputation can enforce
consistency of certain variables across records so that for example
a missing value of sex will not be imputed as male for
one of the subject's records and female as another. An
example of how dtrans might be specified is
dtrans=function(w) {w$age <- w$years + w$months/12; w}
where months might have been imputed but years was
never missing.
derived
an expression containing R expressions for computing derived
variables that are used in the model formula. This is useful when
multiple imputations are done for component variables but the actual
model uses combinations of these (e.g., ratios or other
derivations). For a single derived variable you can specify for
example derived=expression(ratio <- weight/height). For
multiple derived variables use the form
derived=expression({ratio <- weight/height; product <-
weight*height}) or put the expression on separate input lines.
To monitor the multiply-imputed derived variables you can add to the
expression a command such as print(describe(ratio)).
See the example below. Note that derived is not yet
implemented.
vcovOpts
a list of named additional arguments to pass to the
vcov method for fitter. Useful for orm models
for retaining all intercepts
(vcovOpts=list(intercepts='all')) instead of just the middle one.
type
By default, the matrix of transformed variables is returned, with
imputed values on the transformed scale. If you had specified
trantab=TRUE to transcan, specifying
type="original" does the table look-ups with linear
interpolation to return the input matrix x but with imputed
values on the original scale inserted for NA values. For
categorical variables, the method used here is to select the
category code having a corresponding scaled value closest to the
predicted transformed value. This corresponds to the default
impcat. Note: imputed values
thus returned when type="original" are single expected value
imputations even if n.impute is given.
object
an object created by transcan, or an object to be converted to
R function code, typically a model fit object of some sort
prefix, suffix
When creating separate R functions for each variable in x,
the name of the new function will be prefix placed in front of
the variable name, and suffix placed in back of the name. The
default is to use names of the form .varname, where
varname is the variable name.
pos
position as in assign at which to store new functions
(for Function). Default is pos=-1.
y
a vector corresponding to x for invertTabulated, if its
first argument x is not a list
freq
a vector of frequencies corresponding to cross-classified x
and y if x is not a list. Default is a vector of ones.
aty
vector of transformed values at which inverses are desired
rule
see approx. transcan assumes rule is
always 2.
regcoef.only
set to TRUE to make vcov.default delete positions in
the covariance matrix for any non-regression coefficients (e.g., log
scale parameter from psm or survreg)
intercepts
this is primarily for orm
objects. Set to "none" to discard all intercepts from the
covariance matrix, or to "all" or "mid" to keep all
elements generated by orm (orm only outputs the
covariance matrix for the intercept corresponding to the median).
You can also set intercepts to a vector of subscripts for
selecting particular intercepts in a multi-intercept model.
Details
The starting approximation to the transformation for each variable is
taken to be the original coding of the variable. The initial
approximation for each missing value is taken to be the median of the
non-missing values for the variable (for continuous ones) or the most
frequent category (for categorical ones). Instead, if imp.con
is a vector, its values are used for imputing NA values. When
using each variable as a dependent variable, NA values on that
variable cause all observations to be temporarily deleted. Once a new
working transformation is found for the variable, along with a model
to predict that transformation from all the other variables, that
latter model is used to impute NA values in the selected
dependent variable if imp.con is not specified.
When that variable is used to predict a new dependent variable, the
current working imputed values are inserted. Transformations are
updated after each variable becomes a dependent variable, so the order
of variables on x could conceivably make a difference in the
final estimates. For obtaining out-of-sample
predictions/transformations, predict uses the same
iterative procedure as transcan for imputation, with the same
starting values for fill-ins as were used by transcan. It also
(by default) uses a conservative approach of curtailing transformed
variables to be within the range of the original ones. Even when
method = "pc" is specified, canonical variables are used for
imputing missing values.
Note that fitted transformations, when evaluated at imputed variable
values (on the original scale), will not precisely match the
transformed imputed values returned in xt. This is because
transcan uses an approximate method based on linear
interpolation to back-solve for imputed values on the original scale.
Shrinkage uses the method of
Van Houwelingen and Le Cessie (1990) (similar to
Copas, 1983). The shrinkage factor is
[1 - (1 - R2)(n - 1)/(n - k - 1)] / R2
where R2 is the apparent R-squared for predicting the
variable, n is the number of non-missing values, and k is
the effective number of degrees of freedom (aside from intercepts). A
heuristic estimate is used for k:
A - 1 + sum(max(0,Bi - 1))/m + m, where
A is the number of d.f. required to represent the variable being
predicted, the Bi are the number of columns required to
represent all the other variables, and m is the number of all
other variables. Division by m is done because the
transformations for the other variables are fixed at their current
transformations the last time they were being predicted. The
+ m term comes from the number of coefficients estimated
on the right hand side, whether by least squares or canonical
variates. If a shrinkage factor is negative, it is set to 0. The
shrinkage factor is the ratio of the adjusted R-squared to
the ordinary R-squared. The adjusted R-squared is
1 - (1 - R2)(n - 1)/(n - k - 1)
which is also set to zero if it is negative. If shrink=FALSE
and the adjusted R-squares are much smaller than the
ordinary R-squares, you may want to run transcan
with shrink=TRUE.
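The shrinkage arithmetic described above can be worked through in a few lines of R. The function shrinkage and its argument names are hypothetical; A is the d.f. for the variable being predicted and B holds the column counts for the other variables, as defined in the text.

```r
# Hypothetical helper restating the formulas in the text
shrinkage <- function(R2, n, A, B) {
  m   <- length(B)                                   # number of other variables
  k   <- A - 1 + sum(pmax(0, B - 1)) / m + m         # heuristic effective d.f.
  adj <- max(0, 1 - (1 - R2) * (n - 1) / (n - k - 1))  # adjusted R-squared
  list(k = k, rsq.adj = adj, shrink = max(0, adj / R2))
}

# Example: a 4-d.f. variable predicted from two 4-column variables and one
# single-column variable, with apparent R-squared 0.4 on 100 observations
s <- shrinkage(R2 = 0.4, n = 100, A = 4, B = c(4, 4, 1))
s$k   # 3 + 6/3 + 3 = 8
```

A shrinkage factor well below 1 here would suggest rerunning transcan with shrink=TRUE.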
Canonical variates are scaled to have variance of 1.0, by multiplying
canonical coefficients from cancor by
sqrt(n - 1).
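The effect of this scaling can be checked directly with cancor from the stats package. The simulated matrices below are arbitrary; the point is only that the first canonical variate, once multiplied by sqrt(n - 1), has unit variance.

```r
set.seed(4)
n <- 200
x <- cbind(rnorm(n), rnorm(n))
y <- cbind(x[, 1] + rnorm(n), rnorm(n))

cc <- cancor(x, y)

# Center x with the same centers cancor used, then form the first
# canonical variate and rescale it by sqrt(n - 1)
xc <- sweep(x, 2, cc$xcenter)
v1 <- drop(xc %*% cc$xcoef[, 1]) * sqrt(n - 1)
var(v1)   # approximately 1
```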
When specifying a non-rms library fitting function to
fit.mult.impute (e.g., lm, glm),
running the result of fit.mult.impute through that fit's
summary method will not use the imputation-adjusted
variances. You may obtain the new variances using fit$var or
vcov(fit).
When you specify a rms function to fit.mult.impute (e.g.
lrm, ols, cph,
psm, bj, Rq,
Gls, Glm), automatically computed
transformation parameters (e.g., knot locations for
rcs) that are estimated for the first imputation are
used for all other imputations. This ensures that knot locations will
not vary, which would change the meaning of the regression
coefficients.
Warning: even though fit.mult.impute takes imputation into
account when estimating variances of regression coefficients, it does
not take into account the variation that results from estimation of
the shapes and regression coefficients of the customized imputation
equations. Specifying shrink=TRUE solves a small part of this
problem. To fully account for all sources of variation you should
consider putting the transcan invocation inside a bootstrap or
loop, if execution time allows. Better still, use
aregImpute or a package such as mice that uses
real Bayesian posterior realizations to multiply impute missing values
correctly.
It is strongly recommended that you use the Hmisc naclus
function to determine if there is a good basis for imputation.
naclus will tell you, for example, if systolic blood
pressure is missing whenever diastolic blood pressure is missing. If
the only variable that is well correlated with diastolic bp is
systolic bp, there is no basis for imputing diastolic bp in this case.
At present, predict does not work with multiple imputation.
When calling fit.mult.impute with glm as the
fitter argument, if you need to pass a family argument
to glm do it by quoting the family, e.g.,
family="binomial".
fit.mult.impute will not work with proportional odds models
when regression imputation was used (as opposed to predictive mean
matching). That's because regression imputation will create values of
the response variable that did not exist in the dataset, altering the
intercept terms in the model.
You should be able to use a variable in the formula given to
fit.mult.impute as a numeric variable in the regression model
even though it was a factor variable in the invocation of
transcan. Use for example fit.mult.impute(y ~ codes(x),
lrm, trans), or as.integer(x) in place of codes(x) where codes is
unavailable (thanks to Trevor Thompson
trevor@hp5.eushc.org).
Value
For transcan, a list of class transcan with elements
call
(with the function call)
iter
(number of iterations done)
rsq, rsq.adj
containing the R-squares and adjusted
R-squares achieved in predicting each variable from all
the others
categorical
the values supplied for categorical
asis
the values supplied for asis
coef
the within-variable coefficients used to compute the first
canonical variate
xcoef
the (possibly shrunk) across-variables coefficients of the first
canonical variate that predicts each variable in turn.
parms
the parameters of the transformation (knots for splines, contrast
matrix for categorical variables)
fillin
the initial estimates for missing values (NA if variable
never missing)
ranges
the matrix of ranges of the transformed variables (min and max in
first and second row)
scale
a vector of scales used to determine convergence for a
transformation.
formula
the formula (if x was a formula)
The list optionally also contains a vector of shrinkage factors used for predicting
each variable from the others. For asis variables, the scale
is the average absolute difference about the median. For other
variables it is unity, since canonical variables are standardized.
For xcoef, row i has the coefficients to predict
transformed variable i, with the column for the coefficient of
variable i set to NA. If imputed=TRUE was given,
an optional element imputed also appears. This is a list with
the vector of imputed values (on the original scale) for each variable
containing NAs. Matrices rather than vectors are returned if
n.impute is given. If trantab=TRUE, the trantab
element also appears, as described above. If n.impute > 0,
transcan also returns a list residuals that can be used
for future multiple imputation.
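A short sketch of inspecting some of these elements on simulated data follows. The data are made up for illustration and Hmisc is assumed to be installed. Only elements named in the list above are accessed.

```r
library(Hmisc)

set.seed(4)
n  <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
y  <- x1 - x2 + rnorm(n)
x2[1:25] <- NA                      # make 25 values of x2 missing
d  <- data.frame(x1, x2, y)

f <- transcan(~ x1 + x2 + y, imputed=TRUE, trantab=TRUE,
              data=d, pl=FALSE, pr=FALSE)

class(f)                 # "transcan"
f$iter                   # number of iterations done
f$rsq                    # R-squares for predicting each variable from the others
f$xcoef                  # across-variables coefficients, one row per variable
length(f$imputed$x2)     # imputed values for the NAs in x2 (original scale)
names(f$trantab)         # variables with a stored transformation table
```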
impute returns a vector (the same length as var) of
class impute with NA values imputed.
predict returns a matrix with the same number of columns or
variables as were in x.
fit.mult.impute returns a fit object that is a modification of
the fit object created by fitting the completed dataset for the final
imputation. The var matrix in the fit object has the
imputation-corrected variance-covariance matrix. coefficients
is the average (over imputations) of the coefficient vectors,
variance.inflation.impute is a vector containing the ratios of
the diagonals of the imputation-corrected variance matrix to the
diagonals of the average apparent (within-imputation) variance
matrix. missingInfo is
Rubin's rate of missing information and dfmi is
Rubin's degrees of freedom for a t-statistic
for testing a single parameter. The last two objects are vectors
corresponding to the diagonal of the variance matrix. The class
"fit.mult.impute" is prepended to the other classes produced by
the fitting function.
fit.mult.impute stores intercepts attributes in the
coefficient matrix and in var for orm fits.
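The quantities returned by fit.mult.impute can be illustrated with a base R sketch of the usual Rubin combining rules. This is not the package's source code. pool_rubin is a hypothetical helper, the simplified missing-information formula r/(1+r) is an assumption, and the inflation factor is computed here as total over within-imputation variance, which is consistent with the factors near 1 printed in the Results transcript.

```r
# coefs: list of m coefficient vectors, one per completed dataset
# vcovs: list of m variance-covariance matrices, in the same order
pool_rubin <- function(coefs, vcovs) {
  m    <- length(coefs)
  Q    <- do.call(rbind, coefs)            # m x p matrix of estimates
  qbar <- colMeans(Q)                      # coefficients averaged over imputations
  W    <- Reduce(`+`, vcovs) / m           # average within-imputation variance
  B    <- cov(Q)                           # between-imputation variance
  Tot  <- W + (1 + 1/m) * B                # imputation-corrected variance
  r    <- (1 + 1/m) * diag(B) / diag(W)    # relative increase in variance
  list(coefficients = qbar,
       var          = Tot,
       inflation    = diag(Tot) / diag(W), # equals 1 + r
       missingInfo  = r / (1 + r),         # simplified rate of missing information
       dfmi         = (m - 1) * (1 + 1/r)^2)
}

# Toy usage: 5 completed datasets made by naive random fill-in.
# This is NOT a proper imputation method; it only exercises the function.
set.seed(10)
n    <- 100
x    <- rnorm(n)
y    <- 1 + 2*x + rnorm(n)
miss <- 1:15
fits <- lapply(1:5, function(i) {
  xi <- x
  xi[miss] <- rnorm(length(miss))
  lm(y ~ xi)
})
p <- pool_rubin(lapply(fits, coef), lapply(fits, vcov))
sqrt(diag(p$var))    # pooled standard errors
p$missingInfo
```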
Side Effects
prints, plots, and impute.transcan creates new variables.
References
Kuhfeld, Warren F: The PRINQUAL Procedure. SAS/STAT User's Guide, Fourth
Edition, Volume 2, pp. 1265–1323, 1990.
Van Houwelingen JC, Le Cessie S: Predictive value of statistical models.
Statistics in Medicine 9:1303–1325, 1990.
Copas JB: Regression, prediction and shrinkage. JRSS B 45:311–354, 1983.
He X, Shen L: Linear regression after spline transformation.
Biometrika 84:474–481, 1997.
Little RJA, Rubin DB: Statistical Analysis with Missing Data. New
York: Wiley, 1987.
Rubin DB, Schenker N: Multiple imputation in health-care databases: An
overview and some applications. Stat in Med 10:585–598, 1991.
Faris PD, Ghali WA, et al: Multiple imputation versus data enhancement
for dealing with missing data in observational health care outcome
analyses. J Clin Epidem 55:184–191, 2002.
Examples
## Not run:
x <- cbind(age, disease, blood.pressure, pH)
#cbind will convert factor object `disease' to integer
par(mfrow=c(2,2))
x.trans <- transcan(x, categorical="disease", asis="pH",
                    transformed=TRUE, imputed=TRUE)
summary(x.trans) #Summary distribution of imputed values, and R-squares
f <- lm(y ~ x.trans$transformed) #use transformed values in a regression
#Now replace NAs in original variables with imputed values, if not
#using transformations
age <- impute(x.trans, age)
disease <- impute(x.trans, disease)
blood.pressure <- impute(x.trans, blood.pressure)
pH <- impute(x.trans, pH)
#Do impute(x.trans) to impute all variables, storing new variables under
#the old names
summary(pH) #uses summary.impute to tell about imputations
#and summary.default to tell about pH overall
# Get transformed and imputed values on some new data frame xnew
newx.trans <- predict(x.trans, xnew)
w <- predict(x.trans, xnew, type="original")
age <- w[,"age"] #inserts imputed values
blood.pressure <- w[,"blood.pressure"]
Function(x.trans) #creates .age, .disease, .blood.pressure, .pH()
#Repeat first fit using a formula
x.trans <- transcan(~ age + disease + blood.pressure + I(pH),
imputed=TRUE)
age <- impute(x.trans, age)
predict(x.trans, expand.grid(age=50, disease="pneumonia",
blood.pressure=60:260, pH=7.4))
z <- transcan(~ age + factor(disease.code), # disease.code categorical
transformed=TRUE, trantab=TRUE, imputed=TRUE, pl=FALSE)
ggplot(z, scale=TRUE)
plot(z$transformed)
## End(Not run)
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
set.seed(1)
x1 <- factor(sample(c('a','b','c'),100,TRUE))
x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
y <- x2 + 1*(x1=='c') + rnorm(100)
x1[1:20] <- NA
x2[18:23] <- NA
d <- data.frame(x1,x2,y)
n <- naclus(d)
plot(n); naplot(n) # Show patterns of NAs
f <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
options(digits=3)
summary(f)
f <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
summary(f)
h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
# Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
# for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
diag(vcov(h))
h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
h.complete
diag(vcov(h.complete))
# Note: had the rms ols function been used in place of lm, any
# function run on h (anova, summary, etc.) would have automatically
# used imputation-corrected variances and covariances
# Example demonstrating how using the multinomial logistic model
# to impute a categorical variable results in a frequency
# distribution of imputed values that matches the distribution
# of non-missing values of the categorical variable
## Not run:
set.seed(11)
x1 <- factor(sample(letters[1:4], 1000,TRUE))
x1[1:200] <- NA
table(x1)/sum(table(x1))
x2 <- runif(1000)
z <- transcan(~ x1 + I(x2), n.impute=20, impcat='multinom')
table(z$imputed$x1)/sum(table(z$imputed$x1))
# Here is how to create a completed dataset
d <- data.frame(x1, x2)
z <- transcan(~x1 + I(x2), n.impute=5, data=d)
imputed <- impute(z, imputation=1, data=d,
list.out=TRUE, pr=FALSE, check=FALSE)
sapply(imputed, function(x)sum(is.imputed(x)))
sapply(imputed, function(x)sum(is.na(x)))
## End(Not run)
# Example where multiple imputations are for basic variables and
# modeling is done on variables derived from these
set.seed(137)
n <- 400
x1 <- runif(n)
x2 <- runif(n)
y <- x1*x2 + x1/(1+x2) + rnorm(n)/3
x1[1:5] <- NA
d <- data.frame(x1,x2,y)
w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
# Add ,show.imputed.actual for graphical diagnostics
## Not run:
g <- fit.mult.impute(y ~ product + ratio, ols, w,
                     data=data.frame(x1,x2,y),
                     derived=expression({
                       product <- x1*x2
                       ratio   <- x1/(1+x2)
                       print(cbind(x1,x2,x1*x2,product)[1:6,])}))
## End(Not run)
# Here's a method for creating a permanent data frame containing
# one set of imputed values for each variable specified to transcan
# that had at least one NA, and also containing all the variables
# in an original data frame. The following is based on the fact
# that the default output location for impute.transcan is
# given by the global environment
## Not run:
xt <- transcan(~. , data=mine,
imputed=TRUE, shrink=TRUE, n.impute=10, trantab=TRUE)
attach(mine, use.names=FALSE)
impute(xt, imputation=1) # use first imputation
# omit imputation= if using single imputation
detach(1, 'mine2')
## End(Not run)
# Example of using invertTabulated outside transcan
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(1,2,3,4,5,5,5,5,9,10)
freq <- c(1,1,1,1,1,2,3,4,1,1)
# x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
# Within a tolerance of .05*(10-1) all y's match exactly
# so the distance measure does not play a role
set.seed(1) # so can reproduce
for(inverse in c('linearInterp','sample'))
  print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))
# Test inverse='sample' when the estimated transformation is
# flat on the right. First show default imputations
set.seed(3)
x <- rnorm(1000)
y <- pmin(x, 0)
x[1:500] <- NA
for(inverse in c('linearInterp','sample')) {
  par(mfrow=c(2,2))
  w <- transcan(~ x + y, imputed.actual='hist',
                inverse=inverse, curtail=FALSE,
                data=data.frame(x,y))
  if(inverse=='sample') next
  # cat('Click mouse on graph to proceed\n')
  # locator(1)
}
Results
> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
> ### Name: transcan
> ### Title: Transformations/Imputations using Canonical Variates
> ### Aliases: transcan summary.transcan print.transcan plot.transcan
> ### ggplot.transcan impute.transcan predict.transcan Function
> ### Function.transcan fit.mult.impute vcov.default vcov.fit.mult.impute
> ### [.transcan invertTabulated
> ### Keywords: smooth regression multivariate methods models
>
> ### ** Examples
>
>
>
> # Multiple imputation and estimation of variances and covariances of
> # regression coefficient estimates accounting for imputation
> set.seed(1)
> x1 <- factor(sample(c('a','b','c'),100,TRUE))
> x2 <- (x1=='b') + 3*(x1=='c') + rnorm(100)
> y <- x2 + 1*(x1=='c') + rnorm(100)
> x1[1:20] <- NA
> x2[18:23] <- NA
> d <- data.frame(x1,x2,y)
> n <- naclus(d)
> plot(n); naplot(n) # Show patterns of NAs
> f <- transcan(~y + x1 + x2, n.impute=10, shrink=FALSE, data=d)
Convergence criterion:0.695 0.273 0.07 0.039
Convergence in 5 iterations
R-squared achieved in predicting each variable:
y x1 x2
0.785 0.740 0.843
Adjusted R-squared:
y x1 x2
0.769 0.718 0.830
Warning message:
In transcan(~y + x1 + x2, n.impute = 10, shrink = FALSE, data = d) :
transcan provides only an approximation to true multiple imputation.
A better approximation is provided by the aregImpute function.
The MICE and other S libraries provide imputations from Bayesian posterior distributions.
> options(digits=3)
> summary(f)
transcan(x = ~y + x1 + x2, n.impute = 10, shrink = FALSE, data = d)
Iterations: 5
R-squared achieved in predicting each variable:
y x1 x2
0.785 0.740 0.843
Adjusted R-squared:
y x1 x2
0.769 0.718 0.830
Coefficients of canonical variates for predicting each (row) variable
y x1 x2
y -0.15 0.81
x1 -0.20 -0.88
x2 0.57 -0.50
Summary of imputed values
x1
n missing unique Info Mean
200 0 3 0.86 2.15
1 (40, 20%), 2 (90, 45%), 3 (70, 35%)
x2
n missing unique Info Mean .05 .10 .25 .50 .75
60 0 56 1 1.947 -0.6738 -0.5983 0.7509 1.8408 3.3847
.90 .95
3.9993 4.4396
lowest : -1.7442 -1.1792 -0.9121 -0.6612 -0.6470
highest: 4.4000 4.4229 4.7583 4.8547 4.9804
Starting estimates for imputed values:
y x1 x2
1.70 2.00 1.28
>
>
> f <- transcan(~y + x1 + x2, n.impute=10, shrink=TRUE, data=d)
Convergence criterion:0.599 0.321 0.088 0.037
Convergence in 5 iterations
R-squared achieved in predicting each variable:
y x1 x2
0.783 0.740 0.843
Adjusted R-squared:
y x1 x2
0.767 0.718 0.830
Shrinkage factors:
y x1 x2
0.949 0.952 0.939
Warning message:
In transcan(~y + x1 + x2, n.impute = 10, shrink = TRUE, data = d) :
transcan provides only an approximation to true multiple imputation.
A better approximation is provided by the aregImpute function.
The MICE and other S libraries provide imputations from Bayesian posterior distributions.
> summary(f)
transcan(x = ~y + x1 + x2, n.impute = 10, shrink = TRUE, data = d)
Iterations: 5
R-squared achieved in predicting each variable:
y x1 x2
0.783 0.740 0.843
Adjusted R-squared:
y x1 x2
0.767 0.718 0.830
Shrinkage factors:
y x1 x2
0.949 0.952 0.939
Coefficients of canonical variates for predicting each (row) variable
y x1 x2
y -0.15 0.77
x1 -0.19 -0.84
x2 0.53 -0.47
Summary of imputed values
x1
n missing unique Info Mean
200 0 3 0.83 2.2
1 (30, 15%), 2 (100, 50%), 3 (70, 35%)
x2
n missing unique Info Mean .05 .10 .25
60 0 51 1 2.137 -0.580955 -0.005509 0.946750
.50 .75 .90 .95
2.169410 3.524645 4.367843 4.980400
lowest : -1.9144 -0.6950 -0.6906 -0.5752 -0.2548
highest: 4.2406 4.2629 4.3461 4.5631 4.9804
Starting estimates for imputed values:
y x1 x2
1.70 2.00 1.28
>
>
> h <- fit.mult.impute(y ~ x1 + x2, lm, f, data=d)
Variance Inflation Factors Due to Imputation:
(Intercept) x1b x1c x2
1.02 1.01 1.03 1.02
Rate of Missing Information:
(Intercept) x1b x1c x2
0.02 0.01 0.02 0.02
d.f. for t-distribution for Tests of Single Coefficients:
(Intercept) x1b x1c x2
36915 69194 14651 23766
The following fit components were averaged over the 10 model fits:
fitted.values
Warning message:
In fit.mult.impute(y ~ x1 + x2, lm, f, data = d) :
If you use print, summary, or anova on the result, lm methods use the
sum of squared residuals rather than the Rubin formula for computing
residual variance and standard errors. It is suggested to use ols
instead of lm.
> # Add ,fit.reps=TRUE to save all fit objects in h, then do something like:
> # for(i in 1:length(h$fits)) print(summary(h$fits[[i]]))
>
>
> diag(vcov(h))
(Intercept) x1b x1c x2
0.0490 0.1157 0.2851 0.0173
>
>
> h.complete <- lm(y ~ x1 + x2, na.action=na.omit)
> h.complete
Call:
lm(formula = y ~ x1 + x2, na.action = na.omit)
Coefficients:
(Intercept) x1b x1c x2
0.238 -0.484 0.243 1.157
> diag(vcov(h.complete))
(Intercept) x1b x1c x2
0.0568 0.1408 0.3243 0.0201
>
>
> # Note: had the rms ols function been used in place of lm, any
> # function run on h (anova, summary, etc.) would have automatically
> # used imputation-corrected variances and covariances
>
>
>
> # Example where multiple imputations are for basic variables and
> # modeling is done on variables derived from these
>
>
> set.seed(137)
> n <- 400
> x1 <- runif(n)
> x2 <- runif(n)
> y <- x1*x2 + x1/(1+x2) + rnorm(n)/3
> x1[1:5] <- NA
> d <- data.frame(x1,x2,y)
> w <- transcan(~ x1 + x2 + y, n.impute=5, data=d)
Convergence criterion:0.167 0.025 0.002
Convergence in 4 iterations
R-squared achieved in predicting each variable:
x1 x2 y
0.507 0.114 0.542
Adjusted R-squared:
x1 x2 y
0.497 0.096 0.533
Warning message:
In transcan(~x1 + x2 + y, n.impute = 5, data = d) :
transcan provides only an approximation to true multiple imputation.
A better approximation is provided by the aregImpute function.
The MICE and other S libraries provide imputations from Bayesian posterior distributions.
> # Add ,show.imputed.actual for graphical diagnostics
>
>
>
>
> # Example of using invertTabulated outside transcan
> x <- c(1,2,3,4,5,6,7,8,9,10)
> y <- c(1,2,3,4,5,5,5,5,9,10)
> freq <- c(1,1,1,1,1,2,3,4,1,1)
> # x=5,6,7,8 with prob. .1 .2 .3 .4 when y=5
> # Within a tolerance of .05*(10-1) all y's match exactly
> # so the distance measure does not play a role
> set.seed(1) # so can reproduce
> for(inverse in c('linearInterp','sample'))
+ print(table(invertTabulated(x, y, freq, rep(5,1000), inverse=inverse)))
6.5
1000
5 6 7 8
110 194 291 405
>
>
> # Test inverse='sample' when the estimated transformation is
> # flat on the right. First show default imputations
> set.seed(3)
> x <- rnorm(1000)
> y <- pmin(x, 0)
> x[1:500] <- NA
> for(inverse in c('linearInterp','sample')) {
+ par(mfrow=c(2,2))
+ w <- transcan(~ x + y, imputed.actual='hist',
+ inverse=inverse, curtail=FALSE,
+ data=data.frame(x,y))
+ if(inverse=='sample') next
+ # cat('Click mouse on graph to proceed\n')
+ # locator(1)
+ }
Convergence criterion:0.032 0.016
Convergence in 3 iterations
R-squared achieved in predicting each variable:
x y
1 1
Adjusted R-squared:
x y
1 1
Convergence criterion:0.032 0.016
Convergence in 3 iterations
R-squared achieved in predicting each variable:
x y
1 1
Adjusted R-squared:
x y
1 1
Warning message:
In invertTabulated(x[!j, i], newy[!j], aty = newy[j], name = nam[i], :
No actual x has y value within 0.05* range(y) (7.35) of the following y values:5.4.
Consider increasing tolInverse. Used linear interpolation instead.
>
>
>
>
>
>