R: Add missing values to a vector given a MCAR, MAR, or MNAR...
add_missing
R Documentation
Add missing values to a vector given a MCAR, MAR, or MNAR scheme
Description
Given an input vector, replace elements of this vector with missing values according to some scheme.
The default method uses an MCAR scheme in which, on average, 10% of the values are
replaced with NAs. MAR and MNAR schemes are supported by replacing the default fun argument.
Arguments
y
an input vector in which some elements are to be replaced with NA
fun
a user-defined function specifying the missing data mechanism for each element in y.
The function must return a vector of probabilities with length equal to the length of y.
Each value in the returned vector is the probability that
the respective element in y will be replaced with NA.
The function must contain the argument y, representing the
input vector; any number of additional arguments can also be included
...
additional arguments to be passed to fun
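The contract described for fun can be illustrated with a minimal base-R sketch. The helper name add_missing_sketch and the use of runif() for the per-element draw are assumptions made for illustration only; this is not the SimDesign source, and the package may implement the sampling differently.

```r
# Hypothetical stand-in for add_missing(); NOT the SimDesign implementation.
add_missing_sketch <- function(y,
                               fun = function(y, rate = 0.1, ...) rep(rate, length(y)),
                               ...) {
  # fun must return one probability per element of y
  probs <- fun(y, ...)
  stopifnot(length(probs) == length(y), all(probs >= 0 & probs <= 1))
  # independent Bernoulli draw per element: TRUE means "replace with NA"
  is_na <- runif(length(y)) < probs
  y[is_na] <- NA
  y
}

set.seed(1)
y <- rnorm(1000)
ymiss <- add_missing_sketch(y)                 # default: roughly 10% missing
mean(is.na(ymiss))
ymiss50 <- add_missing_sketch(y, rate = 0.5)   # rate reaches fun through ...
mean(is.na(ymiss50))
```

In this sketch any extra named arguments are simply forwarded to fun, which is why rate can be changed without redefining the function.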
Details
Given an input vector y and other relevant variables
inside (X) and outside (Z) the dataset, the three types of missingness are:
MCAR
Missing completely at random (MCAR). This is realized by randomly sampling the values of the
input vector (y) irrespective of the possible values in X and Z.
The missing values therefore do not depend on any characteristics of the data.
MAR
Missing at random (MAR). This is realized when values in the dataset (X)
predict the missing data mechanism in y; conceptually this is equivalent to
P(y = NA | X). This requires the user to define a custom missing data function.
MNAR
Missing not at random (MNAR). This is similar to MAR except
that the missing data mechanism depends on the value of y itself
or on variables outside the working dataset;
conceptually this is equivalent to P(y = NA | X, Z, y). This requires
the user to define a custom missing data function.
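The three mechanisms differ only in what the missingness probability is allowed to depend on. The base-R sketch below makes that concrete; the covariate x, the function names p_mcar, p_mar, p_mnar and draw_na, and all numeric coefficients are hypothetical choices for illustration and are not part of the package.

```r
set.seed(123)
n <- 10000
y <- rnorm(n)
x <- rbinom(n, 1, 0.5)   # hypothetical observed covariate (a column of X)

# One probability function per mechanism (illustrative names and values)
p_mcar <- function(y) rep(0.1, length(y))            # constant: ignores y and X
p_mar  <- function(y, x) ifelse(x == 1, 0.3, 0.05)   # depends only on observed x
p_mnar <- function(y) plogis(-2 + 1.5 * y)           # depends on y itself

# Turn a vector of probabilities into a logical missingness indicator
draw_na <- function(p) runif(length(p)) < p

m_mcar <- draw_na(p_mcar(y))
m_mar  <- draw_na(p_mar(y, x))
m_mnar <- draw_na(p_mnar(y))

# Under MAR the missing rate differs across the levels of x
tapply(m_mar, x, mean)
# Under this MNAR rule large y values are removed more often,
# so the mean of the observed values falls below the full-data mean
c(full = mean(y), observed = mean(y[!m_mnar]))
```

Any of these probability functions could be supplied as a custom missing data function, provided the extra inputs it needs are passed along as additional arguments.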
Value
the input vector y with the sampled NA values
(according to the fun scheme)
Examples
set.seed(1)
y <- rnorm(1000)
## 10% missing rate with default FUN
head(ymiss <- add_missing(y), 10)
## 50% missing with default FUN
head(ymiss <- add_missing(y, rate = .5), 10)
## missing values only when female and low
X <- data.frame(group = sample(c('male', 'female'), 1000, replace=TRUE),
                level = sample(c('high', 'low'), 1000, replace=TRUE))
head(X)
fun <- function(y, X, ...){
    p <- rep(0, length(y))
    p[X$group == 'female' & X$level == 'low'] <- .2
    p
}
ymiss <- add_missing(y, X, fun=fun)
tail(cbind(ymiss, X), 10)
## missingness as a function of elements in X (i.e., a type of MAR)
fun <- function(y, X){
    # missingness with a logistic regression approach
    df <- data.frame(y, X)
    mm <- model.matrix(y ~ group + level, df)
    cfs <- c(-5, 2, 3) #intercept, group, and level coefs
    z <- cfs %*% t(mm)
    plogis(z)
}
ymiss <- add_missing(y, X, fun=fun)
tail(cbind(ymiss, X), 10)
## missing values when y elements are large (i.e., a type of MNAR)
fun <- function(y) ifelse(abs(y) > 1, .4, 0)
ymiss <- add_missing(y, fun=fun)
tail(cbind(y, ymiss), 10)
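To see what the logistic regression example above implies, the coefficients can be evaluated for each group/level combination. This sketch assumes R's default treatment contrasts with alphabetical factor levels, so that 'female' and 'high' are the reference categories; the cells data frame is introduced here purely for illustration.

```r
# All four group/level combinations, with the reference levels listed first
cells <- expand.grid(group = c("female", "male"),
                     level = c("high", "low"),
                     stringsAsFactors = TRUE)
mm <- model.matrix(~ group + level, cells)
cfs <- c(-5, 2, 3)   # intercept, groupmale, levellow (under the assumed coding)
cells$p_missing <- as.vector(plogis(mm %*% cfs))
cells
```

Under that coding, only the male/low cell has a linear predictor of 0 and hence a missingness probability of 0.5; the other three cells have probabilities plogis(-5), plogis(-3) and plogis(-2), all well below 0.5.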
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(SimDesign)
> ### Name: add_missing
> ### Title: Add missing values to a vector given a MCAR, MAR, or MNAR scheme
> ### Aliases: add_missing
>
> ### ** Examples
>
>
> set.seed(1)
> y <- rnorm(1000)
>
> ## 10% missing rate with default FUN
> head(ymiss <- add_missing(y), 10)
 [1] -0.6264538         NA -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884
>
> ## 50% missing with default FUN
> head(ymiss <- add_missing(y, rate = .5), 10)
 [1] -0.6264538         NA         NA         NA  0.3295078 -0.8204684
 [7]         NA         NA  0.5757814 -0.3053884
>
> ## missing values only when female and low
> X <- data.frame(group = sample(c('male', 'female'), 1000, replace=TRUE),
+                 level = sample(c('high', 'low'), 1000, replace=TRUE))
> head(X)
   group level
1   male  high
2 female  high
3   male  high
4   male   low
5 female  high
6   male  high
>
> fun <- function(y, X, ...){
+     p <- rep(0, length(y))
+     p[X$group == 'female' & X$level == 'low'] <- .2
+     p
+ }
>
> ymiss <- add_missing(y, X, fun=fun)
> tail(cbind(ymiss, X), 10)
          ymiss  group level
991  -0.4826525   male  high
992          NA female   low
993   0.5128013   male  high
994   1.0489099   male   low
995   0.1210582 female   low
996  -0.3132929   male  high
997  -0.8806707   male  high
998  -0.4192869   male  high
999  -1.4827517   male  high
1000 -0.6973182   male  high
>
> ## missingness as a function of elements in X (i.e., a type of MAR)
> fun <- function(y, X){
+     # missingness with a logistic regression approach
+     df <- data.frame(y, X)
+     mm <- model.matrix(y ~ group + level, df)
+     cfs <- c(-5, 2, 3) #intercept, group, and level coefs
+     z <- cfs %*% t(mm)
+     plogis(z)
+ }
>
> ymiss <- add_missing(y, X, fun=fun)
> tail(cbind(ymiss, X), 10)
          ymiss  group level
991  -0.4826525   male  high
992  -0.6691135 female   low
993   0.5128013   male  high
994          NA   male   low
995   0.1210582 female   low
996  -0.3132929   male  high
997  -0.8806707   male  high
998  -0.4192869   male  high
999  -1.4827517   male  high
1000 -0.6973182   male  high
>
> ## missing values when y elements are large (i.e., a type of MNAR)
> fun <- function(y) ifelse(abs(y) > 1, .4, 0)
> ymiss <- add_missing(y, fun=fun)
> tail(cbind(y, ymiss), 10)
                 y      ymiss
[991,]  -0.4826525 -0.4826525
[992,]  -0.6691135 -0.6691135
[993,]   0.5128013  0.5128013
[994,]   1.0489099         NA
[995,]   0.1210582  0.1210582
[996,]  -0.3132929 -0.3132929
[997,]  -0.8806707 -0.8806707
[998,]  -0.4192869 -0.4192869
[999,]  -1.4827517 -1.4827517
[1000,] -0.6973182 -0.6973182