Last data update: 2014.03.03

R: Impute multilevel missing data using 'pan'
panImpute    R Documentation

Impute multilevel missing data using pan

Description

This function provides a user-friendly interface to the pan package for multiple imputation of multilevel data (Schafer & Yucel, 2002). Imputations can be generated using either the type or the formula argument, which offer different options for model specification.

Usage


panImpute(data, type, formula, n.burn=5000, n.iter=100, m=10, group=NULL, 
  prior=NULL, seed=NULL, save.pred=FALSE, silent=FALSE)

Arguments

data

A data frame containing incomplete and auxiliary variables, the cluster indicator variable, and any other variables that should be present in the imputed datasets.

type

An integer vector specifying the role of each variable in the imputation model (see details).

formula

A formula specifying the role of each variable in the imputation model. The basic model is constructed by model.matrix, which allows derived variables to be included in the imputation model using I() (see details and examples).

n.burn

The number of burn-in iterations before any imputations are drawn. The default is 5,000.

n.iter

The number of iterations between imputations. The default is 100.

m

The number of imputed data sets to generate. The default is 10.

group

(optional) A character string denoting the name of an additional grouping variable to be used with the formula argument. When specified, the imputation model is run separately within each of these groups.

prior

(optional) A list with components a, Binv, c, and Dinv for specifying prior distributions for the covariance matrix of random effects and the covariance matrix of residuals (see details). The default is to use least-informative priors.

seed

(optional) An integer value initializing pan's random number generator for reproducible results. The default is to use random seeds.

save.pred

(optional) Logical flag indicating whether variables derived using formula should be included in the imputed data sets. The default is FALSE.

silent

(optional) Logical flag indicating whether console output should be suppressed. The default is FALSE.

Details

This function serves as the main interface to the pan algorithm. The imputation model can be specified using either the type or the formula argument.

The type interface is designed to provide quick-and-easy imputations using pan. The type argument must be an integer vector denoting the role of each variable in the imputation model:

  • 1: target variables containing missing data

  • 2: predictors with fixed effect on all targets (completely observed)

  • 3: predictors with random effect on all targets (completely observed)

  • -1: grouping variable within which the imputation is run separately

  • -2: cluster indicator variable

  • 0: variables not featured in the model

At least one target variable and the cluster indicator must be specified. The intercept is automatically included both as a fixed and random effect. If a variable of type -1 is found, then imputations are performed separately within each level of that variable. This is useful if the cluster variable (e.g., schools) is contained in an even larger grouping variable for which imputation models are not deemed comparable (e.g., federal states, educational systems).
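
The coding scheme above can be easier to read when the type vector is built by name rather than by position. The following is a minimal sketch in base R; the column names are hypothetical and serve only to illustrate the codes, and it assumes that type is matched to the columns of data by position, as in the examples further below.

```r
# Hypothetical column names, used only to illustrate the coding scheme.
vars <- c("ID", "FedState", "Sex", "Score", "Motivation")

# Start with every variable excluded from the model (type 0).
type <- setNames(rep(0L, length(vars)), vars)

type["ID"]         <- -2L  # cluster indicator
type["FedState"]   <- -1L  # imputation is run separately within each level
type["Score"]      <-  1L  # target variable containing missing data
type["Motivation"] <-  2L  # completely observed predictor with fixed effect

type
```

A vector built this way would be passed as the type argument; variables left at 0 are kept in the data but play no role in the model.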

The formula argument is intended as a more flexible and feature-rich interface to pan. Specifying the formula argument is similar to specifying other formulae in R. Given below is a list of operators that panImpute currently understands:

  • ~: separates the target (left-hand) and predictor (right-hand) side of the model

  • +: adds target or predictor variables to the model

  • *: adds an interaction term of two or more predictors

  • |: denotes cluster-specific random effects and specifies the cluster indicator (e.g., 1|ID)

  • I(): defines functions to be interpreted by model.matrix

Predictors are allowed to have fixed effects, random effects or both on all target variables. The intercept is automatically included both as a fixed and random effect, but it can be constrained if necessary (see examples). Note that, when specifying random effects other than the intercept, these will not be automatically added as fixed effects and must be included explicitly. Any predictors defined by I() will be used for imputation but not included in the data set unless save.pred=TRUE.
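
Since the predictor side of the model is built by model.matrix, it can help to look at what model.matrix itself produces for a given right-hand side. The sketch below uses base R only and a toy data frame with hypothetical variables x and z; it illustrates the general behavior of the operators and is not a reproduction of the internal call made by panImpute.

```r
# Toy data with hypothetical variables x and z.
dat <- data.frame(x = c(1, 2, 3, 4), z = c(0, 1, 0, 1))

# '*' expands to both main effects plus their interaction;
# I() protects the arithmetic so that x^2 becomes its own column.
mm <- model.matrix(~ x * z + I(x^2), data = dat)
colnames(mm)

# Adding '0' (or '-1') to the formula drops the intercept column.
mm0 <- model.matrix(~ 0 + x, data = dat)
colnames(mm0)
```

Columns such as I(x^2) correspond to the derived predictors that are only kept in the imputed data when save.pred=TRUE.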

In order to run separate imputation models for an additional grouping variable, the group argument may be used. The name of the grouping variable must be given as a character string (i.e., in quotation marks), and the variable must be present in the data set.

As a default prior, panImpute uses least informative inverse-Wishart priors for the covariance matrices of random effects and of residuals, that is, with minimum degrees of freedom (largest dispersion) and identity matrices for scale. For better control, the prior argument may be used for specifying alternative prior distributions. These must be supplied as a list containing the following components:

  • a: degrees of freedom for the residual covariance matrix

  • Binv: scale matrix for the residual covariance matrix

  • c: degrees of freedom for the covariance matrix of random effects

  • Dinv: scale matrix for the covariance matrix of random effects

A sensible choice for a diffuse non-default prior is to set the degrees of freedom to the lowest value possible, and the scale matrices according to a prior guess of the corresponding covariance matrices (see Schafer & Yucel, 2002).
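
As a rough illustration of the list structure, the sketch below builds a prior with identity scale matrices and small degrees of freedom for a model with two target variables and two random effects (intercept and one slope). The dimensions are an assumption based on the conventions of the pan package as understood here (Binv is r x r and Dinv is (r*q) x (r*q), with r targets and q random effects per target), as are the minimum degrees of freedom; both should be checked against the pan documentation before use.

```r
r <- 2  # number of target variables (assumed for this sketch)
q <- 2  # random effects per target: intercept and one slope (assumed)

prior <- list(
  a    = r,               # degrees of freedom, residual covariance matrix
  Binv = diag(1, r),      # scale matrix, residual covariance matrix (r x r)
  c    = r * q,           # degrees of freedom, random-effects covariance matrix
  Dinv = diag(1, r * q)   # scale matrix, random effects ((r*q) x (r*q))
)

str(prior)
```

For a non-default prior, the identity matrices would be replaced by scale matrices derived from a prior guess of the corresponding covariance matrices, following Schafer & Yucel (2002).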

Value

Returns an object of class mitml. An object of class mitml is a list containing the following components:

data

The original (incomplete) data set that has been sorted according to the cluster variable and (if given) the grouping variable. An attribute "sort" contains the original row order. An attribute "group" contains the optional grouping variable.

replacement.mat

A matrix containing the multiple replacements (i.e., imputations) for each missing value. The replacement matrix contains one row for each missing value and one column for each imputed data set.

index.mat

A matrix containing the row and column index for each missing value. The index matrix is used to link the missing values in the data set with their corresponding rows in the replacement matrix.
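
The link between the index matrix and the replacement matrix can be illustrated with a toy example in base R. The data, the replacement values, and the helper fill_in below are all hypothetical and are not taken from the package; in practice, mitmlComplete should be used to extract imputed data sets.

```r
# Toy data with three missing values (hypothetical).
dat <- data.frame(y = c(1.2, NA, 3.1, NA), x = c(0.5, 0.7, NA, 0.2))

# One row per missing value: its row and column position in the data.
index.mat <- which(is.na(dat), arr.ind = TRUE)

# One row per missing value, one column per imputed data set (here m = 2).
replacement.mat <- matrix(c(2.0, 2.4,
                            4.0, 4.2,
                            0.9, 1.1), ncol = 2, byrow = TRUE)

# Hypothetical helper: fill the i-th set of replacements into a copy of the data.
fill_in <- function(dat, index.mat, replacement.mat, i) {
  out <- dat
  for (k in seq_len(nrow(index.mat))) {
    out[index.mat[k, 1], index.mat[k, 2]] <- replacement.mat[k, i]
  }
  out
}

completed1 <- fill_in(dat, index.mat, replacement.mat, 1)
completed1
```

Storing only the replacements keeps the object compact, because the observed values are held once rather than repeated in every imputed data set.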

call

The matched function call.

model

A list containing the names of the cluster variable, the target variables, and the predictor variables with fixed and random effects, respectively.

random.L1

A character string denoting the handling of random residual covariance matrices (see jomoImpute).

prior

The prior parameters used in the imputation model.

iter

A list containing the number of burn-in iterations, the number of iterations between imputations, and the number of imputed data sets.

par.burnin

A multi-dimensional array containing the parameters of the imputation model from the burn-in phase.

par.imputation

A multi-dimensional array containing the parameters of the imputation model from the imputation phase.

Note

For objects of class mitml, methods for the generic functions print, summary and plot have been defined. mitmlComplete is used to extract the imputed data sets.

Author(s)

Simon Grund, Alexander Robitzsch, Oliver Luedtke

References

Schafer, J. L., and Yucel, R. M. (2002). Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics, 11, 437-457.

See Also

jomoImpute, mitmlComplete, summary.mitml, plot.mitml

Examples

# NOTE: The number of iterations in these examples is much lower than it
# should be! This is done in order to comply with CRAN policies, and more
# iterations are recommended for applications in practice!

data(studentratings)

# *** ................................
# the 'type' interface
# 

# * Example 1.1: 'ReadDis' and 'SES', predicted by 'ReadAchiev' and 
# 'CognAbility', with random slope for 'ReadAchiev'

type <- c(-2,0,0,0,0,0,3,1,2,0)
names(type) <- colnames(studentratings)
type

imp <- panImpute(studentratings, type=type, n.burn=1000, n.iter=100, m=5)

# * Example 1.2: 'ReadDis' and 'SES' groupwise for 'FedState',
# and predicted by 'ReadAchiev'

type <- c(-2,-1,0,0,0,0,2,1,0,0)
names(type) <- colnames(studentratings)
type

imp <- panImpute(studentratings, type=type, n.burn=1000, n.iter=100, m=5)

# *** ................................
# the 'formula' interface
# 

# * Example 2.1: imputation of 'ReadDis', predicted by 'ReadAchiev'
# (random intercept)

fml <- ReadDis ~ ReadAchiev + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)

# ... the intercept can be suppressed using '0' or '-1' (here for fixed intercept)
fml <- ReadDis ~ 0 + ReadAchiev + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)

# * Example 2.2: imputation of 'ReadDis', predicted by 'ReadAchiev'
# (random slope)

fml <- ReadDis ~ ReadAchiev + (1+ReadAchiev|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)

# * Example 2.3: imputation of 'ReadDis', predicted by 'ReadAchiev',
# groupwise for 'FedState'

fml <- ReadDis ~ ReadAchiev + (1|ID)
imp <- panImpute(studentratings, formula=fml, group="FedState", n.burn=1000,
  n.iter=100, m=5)

# * Example 2.4: imputation of 'ReadDis', predicted by 'ReadAchiev'
# including the cluster mean of 'ReadAchiev' as an additional predictor

fml <- ReadDis ~ ReadAchiev + I(clusterMeans(ReadAchiev,ID)) + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5)

# ... using 'save.pred' to save the calculated cluster means in the data set
fml <- ReadDis ~ ReadAchiev + I(clusterMeans(ReadAchiev,ID)) + (1|ID)
imp <- panImpute(studentratings, formula=fml, n.burn=1000, n.iter=100, m=5,
  save.pred=TRUE)

head(mitmlComplete(imp,1))
