Last data update: 2014.03.03

R: CPS Sequential Hot-Deck Imputation
impute.CPS_SEQ_HD    R Documentation

CPS Sequential Hot-Deck Imputation

Description

Resolves missing data using CPS sequential hot-deck imputation.

Usage

impute.CPS_SEQ_HD(DATA = NULL, covariates = NULL, initialvalues = 0,
                  navalues = NA, modifyinplace = TRUE)

Arguments

DATA

Data containing missing values. Should be a matrix of numbers.

covariates

Vector containing the covariates (the columns that should be used to create the imputation classes). If is.null(covariates) | length(covariates)==0, this function defaults to impute.SEQ_HD. See the Note section for further details.

initialvalues

The initial values for the start-up process of the imputation. Should be "integer" and length(initialvalues)==1 | length(initialvalues)==dim(DATA)[2]. The default of 0 is not normally a good value.

navalues

NA code for each variable that should be imputed. Should be "integer" and length(navalues)==1 | length(navalues)==dim(DATA)[2]. Default is R's NA value.

modifyinplace

Should DATA be modified in place? (See the Section: Warning.) If not, a copy is made.
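
Because the default initialvalues of 0 is rarely a sensible start-up value, one option is to derive a value per column from the observed data. The following is a minimal sketch in base R, assuming missing values are coded as R's NA; the helper name startup_values and the choice of the rounded median are illustrative assumptions, not part of the package.

```r
# Hypothetical helper (not part of HotDeckImputation): one integer start-up
# value per column, taken as the rounded median of the observed entries.
# Assumes missing values are coded as R's NA.
startup_values <- function(DATA) {
  as.integer(round(apply(DATA, 2, median, na.rm = TRUE)))
}

# Small illustrative matrix with one missing value per column
Y <- matrix(c(1L, NA, 3L, 9L,
              0L, 1L, NA, 1L), ncol = 2)

init <- startup_values(Y)
init  # one value per column, suitable in length for initialvalues
```

The result has length dim(DATA)[2], which matches the length the initialvalues argument accepts.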

Details

This function imputes the missing values in any variable by creating imputation classes and then replicating the most recently observed value in the class and variable. Imputation classes are created by the adjustment cell method.
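
The procedure described above can be sketched in base R. This is an illustrative reimplementation under stated assumptions, not the package's actual code: the name seq_hd_sketch is hypothetical, missing values are assumed to be coded as R's NA, and covariate columns are assumed to be left untouched.

```r
# Hypothetical base-R sketch of sequential hot-deck imputation within
# adjustment cells; NOT the HotDeckImputation implementation.
seq_hd_sketch <- function(DATA, covariates, initialvalues = 0L) {
  p <- ncol(DATA)
  initialvalues <- rep_len(as.integer(initialvalues), p)

  # One class label per row: the cross-classification of the covariates.
  # A missing covariate is pasted as the string "NA" and forms its own level.
  classes <- apply(DATA[, covariates, drop = FALSE], 1, paste, collapse = "|")

  # The "deck": last observed value per class (rows) and variable (columns),
  # started at the initial values.
  labels <- unique(classes)
  deck <- matrix(initialvalues, nrow = length(labels), ncol = p,
                 byrow = TRUE, dimnames = list(labels, NULL))

  # Single pass over the rows; covariate columns are skipped (an assumption).
  for (i in seq_len(nrow(DATA))) {
    k <- classes[i]
    for (j in setdiff(seq_len(p), covariates)) {
      if (is.na(DATA[i, j])) {
        DATA[i, j] <- deck[k, j]   # impute from the most recent donor
      } else {
        deck[k, j] <- DATA[i, j]   # observed value becomes the new donor
      }
    }
  }
  DATA
}

# Column 1 is a binary covariate, column 2 has missing values
Y <- cbind(c(0L, 0L, 1L, 0L, 1L, 1L),
           c(NA, 5L, 7L, NA, NA, 2L))
imputed <- seq_hd_sketch(Y, covariates = 1, initialvalues = 0L)
imputed[, 2]  # 0 5 7 5 7 2
```

In this sketch the first row has no earlier donor in its class, so it receives the start-up value 0, which illustrates why the choice of initialvalues matters.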

Value

An imputed data matrix the same size as the input DATA.

Warning

If modifyinplace == TRUE, DATA (or rather, the variable supplied) is edited directly! This is significantly faster if the data set is large.

Note

This is a very fast imputation method; only one pass through the data is needed. With the use of proper covariates, the data may be missing at random (MAR). Covariates should be complete (no missing data). If they are not, NA is used as a value when building classes, which may or may not be appropriate for the data. The presence of missing values in the covariates is not checked.
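
Since missing values in the covariates are not checked, a caller may want to verify completeness before imputing. The following is a minimal sketch in base R, assuming missing values are coded as R's NA; the helper name check_covariates is hypothetical and not part of the package.

```r
# Hypothetical pre-check (not part of HotDeckImputation): stop if any
# covariate column contains missing values. Assumes NA is the missing code.
check_covariates <- function(DATA, covariates) {
  has_na <- colSums(is.na(DATA[, covariates, drop = FALSE])) > 0
  incomplete <- covariates[has_na]
  if (length(incomplete) > 0) {
    stop("covariate column(s) with missing values: ",
         paste(incomplete, collapse = ", "))
  }
  invisible(TRUE)
}

# Column 1 has a missing value, column 2 is complete
Y <- cbind(c(0L, 1L, NA), c(1L, 1L, 0L), c(4L, NA, 6L))

check_covariates(Y, covariates = 2)   # passes silently

# Using column 1 as a covariate raises an error; capture its message
result <- tryCatch(check_covariates(Y, covariates = c(1, 2)),
                   error = function(e) conditionMessage(e))
result
```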

Author(s)

Dieter William Joenssen Dieter.Joenssen@googlemail.com

References

Hanson, R.H. (1978) The Current Population Survey: Design and Methodology. Technical Paper No. 40. U.S. Bureau of the Census.

Joenssen, D.W. (2015) Hot-Deck-Verfahren zur Imputation fehlender Daten – Auswirkungen des Donor-Limits. Ilmenau: Ilmedia. [in German, Dissertation]

Joenssen, D.W. and Bankhofer, U. (2012) Donor Limited Hot Deck Imputation: Effects on Parameter Estimation. Journal of Theoretical and Applied Computer Science. 6, 58–70.

Joenssen, D.W. and Muellerleile, T. (2014) Fehlende Daten bei Data-Mining. HMD Praxis der Wirtschaftsinformatik. 51, 458–468. doi: 10.1365/s40702-014-0038-8 [in German]

See Also

impute.SEQ_HD, impute.mean, impute.NN_HD

Examples

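#Load the package that provides impute.CPS_SEQ_HD
library(HotDeckImputation)

#Set the random seed to an arbitrary number
set.seed(421)

n<-1000    #number of observations (rows)
m<-3       #number of variables that will contain missing values
pmiss<-.1  #proportion of missing values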

#Generate matrix of random integers and 2 binary covariates
Y<-cbind(matrix(sample(0:1,replace=TRUE,size=n*2),nrow=n),
		 matrix(sample(0:9,replace=TRUE,size=n*m),nrow=n))

#generate missing values, MCAR, in all but the first two columns
Y[,-c(1,2)][sample(1:length(Y[,-c(1,2)]),
				   size=floor(pmiss*length(Y[,-c(1,2)])))]<-NA

#perform the sequential imputation of Y within the
#classes created by cross-classifying variables 1 and 2
impute.CPS_SEQ_HD(DATA=Y,covariates=c(1,2),initialvalues=0,
                  navalues=NA, modifyinplace = FALSE)


####an example highlighting the modifyinplace option
#using cbind to show the results of the function and the initial data next to one another
cbind(impute.CPS_SEQ_HD(DATA=Y,covariates=c(1,2),initialvalues=0,
                        navalues=NA, modifyinplace = FALSE),Y)
#notice that columns 8-10 (representing Y) still have missing data

#same procedure, except modifyinplace is set to TRUE
cbind(impute.CPS_SEQ_HD(DATA=Y,covariates=c(1,2),initialvalues=0,
                        navalues=NA, modifyinplace = TRUE),Y)
#notice that columns 8-10 (representing Y) are identical to columns 3-5:
#Y (and any variable pointing to the same object) has been directly modified.
