Last data update: 2014.03.03

R: Create simulated cross-classification data
poLCA.simdata    R Documentation

Create simulated cross-classification data

Description

Uses the latent class model's assumed data-generating process to create a simulated dataset that can be used to test the properties of the poLCA latent class and latent class regression estimator.
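The data-generating process referred to here can be illustrated in base R. The following is a minimal sketch, not the package's implementation: sim_lca is a hypothetical helper, and the probs and P values are made up for illustration. Each individual is first assigned a latent class according to the mixing proportions, and each manifest variable is then drawn from the outcome probabilities of that class.

```r
# Minimal sketch of a latent class data-generating process (base R only).
# sim_lca is a hypothetical helper for illustration; it is not part of poLCA.
sim_lca <- function(N, probs, P) {
  nclass <- length(P)
  # Step 1: draw each individual's latent class from the mixing proportions P.
  trueclass <- sample(seq_len(nclass), size = N, replace = TRUE, prob = P)
  # Step 2: for each manifest variable, draw an outcome using the row of the
  # probability matrix that matches the individual's class.
  y <- sapply(probs, function(pm) {
    vapply(trueclass,
           function(k) sample.int(ncol(pm), 1, prob = pm[k, ]),
           integer(1))
  })
  colnames(y) <- paste0("Y", seq_along(probs))
  list(dat = as.data.frame(y), trueclass = trueclass)
}

set.seed(1)
# Illustrative class-conditional probabilities: 2 classes, 2 manifest variables.
probs <- list(
  matrix(c(0.9, 0.1,
           0.2, 0.8), ncol = 2, byrow = TRUE),       # Y1: 2 outcomes
  matrix(c(0.7, 0.2, 0.1,
           0.1, 0.3, 0.6), ncol = 3, byrow = TRUE)   # Y2: 3 outcomes
)
sim <- sim_lca(N = 2000, probs = probs, P = c(0.4, 0.6))

# Empirical class shares should be close to P.
round(prop.table(table(sim$trueclass)), 2)
```

With a large N, the empirical class shares and the within-class outcome frequencies should approach P and probs respectively, which is what makes such data useful for checking an estimator.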

Usage

poLCA.simdata(N = 5000, probs = NULL, nclass = 2, ndv = 4, 
              nresp = NULL, x = NULL, niv = 0, b = NULL, 
              P = NULL, missval = FALSE, pctmiss = NULL)

Arguments

N

number of observations.

probs

a list of matrices of dimension nclass by nresp, with each matrix corresponding to one manifest variable and each row containing the class-conditional outcome probabilities (which must sum to 1). If probs is NULL (default), then the outcome probabilities are generated randomly.

nclass

number of latent classes. If probs is specified, then nclass is set equal to the number of rows in each matrix in that list. If P is specified, then nclass is set equal to the length of that vector. If b is specified, then nclass is set equal to one greater than the number of columns in b. Otherwise, the default is two.

ndv

number of manifest variables. If probs is specified, then ndv is set equal to the number of matrices in that list. If nresp is specified, then ndv is set equal to the length of that vector. Otherwise, the default is four.

nresp

number of possible outcomes for each manifest variable. If probs is specified, then nresp is set equal to the number of columns in each matrix in that list. If both probs and nresp are NULL (default), then the manifest variables are assigned a random number of outcomes between two and five.

x

a matrix of concomitant variables with N rows and niv columns. If x=NULL (default), but niv>0, then niv concomitant variables will be generated as mutually independent random draws from a standard normal distribution.

niv

number of concomitant variables (covariates). Setting niv=0 (default) creates a data set assuming no covariates. If nclass=1 then niv is automatically set equal to 0. If both x and niv are entered, then the number of columns in x overrides the value of niv. The number of rows in b, less one, also overrides niv.

b

when using covariates, an niv+1 by nclass-1 matrix of (multinomial) logit coefficients. If b is NULL (default), then coefficients are generated as random integers between -2 and 2.

P

a vector of mixing proportions (class population shares) of length nclass. P must sum to 1. Disregarded if b is specified or niv>1 because then P is, in part, a function of the concomitant variables. If P is NULL (default), then the mixing proportions are generated randomly.

missval

logical. If TRUE, then a fraction pctmiss of the manifest variable values is randomly dropped as missing. Default is FALSE.

pctmiss

percentage of values to be dropped as missing, if missval=TRUE. If pctmiss is NULL (default), then a value between 5 and 40 percent is chosen randomly.
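When covariates are used, the x and b arguments together determine each individual's class probabilities, which is why P is then disregarded. The base R sketch below illustrates that relationship under one assumption that is not stated on this page: the first latent class is treated as the reference category, with its linear predictor fixed at zero. The variable names eta and prior are illustrative only.

```r
set.seed(2)
N <- 1000; niv <- 3; nclass <- 2

# Covariates drawn as independent standard normals, as described for x = NULL.
x <- matrix(rnorm(N * niv), nrow = N, ncol = niv)

# (niv + 1) by (nclass - 1) coefficient matrix: intercept row, then one row
# per covariate.
b <- matrix(c(1, -2, 1, -1), nrow = niv + 1)

# Linear predictor for each class. ASSUMPTION: class 1 is the reference
# category, so its column is fixed at 0.
eta <- cbind(0, cbind(1, x) %*% b)

# Multinomial logit: N by nclass matrix of individual class probabilities.
prior <- exp(eta) / rowSums(exp(eta))

# Draw each individual's class from their own row of probabilities.
trueclass <- apply(prior, 1, function(p) sample.int(nclass, 1, prob = p))

# Average class probabilities across the sample.
round(colMeans(prior), 2)
```

Because the class probabilities vary from one individual to the next, there is no single vector of mixing proportions to supply; the column means of prior play the role of population class shares.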

Details

Note that entering probs overrides nclass, ndv, and nresp. It also overrides P if the length of the P vector is not equal to the length of the probs list. Likewise, if probs=NULL, then length(nresp) overrides ndv and length(P) overrides nclass. Setting niv>1 causes any user-entered value of P to be disregarded.
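The way probs determines nclass, ndv, and nresp can be checked by hand before calling the function. The base R sketch below derives the three quantities from a probs list as the argument descriptions state, and adds two sanity checks; the check names same_rows and rows_sum_to_one are illustrative and are not part of poLCA.

```r
# Two manifest variables, three latent classes.
probs <- list(
  matrix(c(0.6, 0.1, 0.3,
           0.6, 0.3, 0.1,
           0.3, 0.1, 0.6), ncol = 3, byrow = TRUE),   # Y1: 3 outcomes
  matrix(c(0.2, 0.8,
           0.7, 0.3,
           0.3, 0.7), ncol = 2, byrow = TRUE)         # Y2: 2 outcomes
)

ndv    <- length(probs)        # number of matrices in the list
nclass <- nrow(probs[[1]])     # number of rows in each matrix
nresp  <- sapply(probs, ncol)  # number of columns in each matrix

# Every matrix should have one row per class ...
same_rows <- all(sapply(probs, nrow) == nclass)
# ... and each row should be a probability distribution.
rows_sum_to_one <- all(sapply(probs, function(pm) {
  isTRUE(all.equal(rowSums(pm), rep(1, nrow(pm))))
}))

c(ndv = ndv, nclass = nclass)
nresp
```

Running such checks first makes it easier to anticipate which user-supplied arguments will be overridden.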

Value

dat

a data frame containing the simulated variables. Variable names for manifest variables are Y1, Y2, etc. Variable names for concomitant variables are X1, X2, etc.

probs

a list of matrices of dimension nclass by nresp containing the class-conditional response probabilities.

nresp

a vector containing the number of possible outcomes for each manifest variable.

b

coefficients on covariates, if used.

P

mixing proportions corresponding to each latent class.

pctmiss

percent of observations missing.

trueclass

N by 1 vector containing the "true" class membership for each individual.
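To relate the returned pctmiss to the data, the realized share of missing responses can be computed directly from a data frame of manifest variables. The base R sketch below builds a small stand-in data set and blanks out cells at random. It assumes each cell is dropped independently with probability pctmiss/100; the mechanism actually used by poLCA.simdata may differ.

```r
set.seed(3)
N <- 500; ndv <- 4

# Stand-in manifest responses with three outcomes each (illustrative only).
y <- matrix(sample.int(3, N * ndv, replace = TRUE), nrow = N,
            dimnames = list(NULL, paste0("Y", 1:ndv)))

pctmiss <- 20
# ASSUMPTION: each cell is dropped independently with probability pctmiss/100.
drop <- matrix(runif(N * ndv) < pctmiss / 100, nrow = N)
y[drop] <- NA
dat <- as.data.frame(y)

# Realized percentage of missing responses across all manifest variables.
realized <- 100 * mean(is.na(dat))
round(realized, 1)
```

The same mean(is.na(...)) check can be applied to the manifest columns of the dat component returned by the function.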

See Also

poLCA

Examples

library(poLCA)

# Create a sample data set with 3 classes and no covariates,
# and run poLCA to recover the specified parameters.
# Each matrix in the probs list contains one of the manifest variables'
# "true" conditional response probabilities.

probs <- list(matrix(c(0.6,0.1,0.3,     0.6,0.3,0.1,     0.3,0.1,0.6    ),ncol=3,byrow=TRUE), # Y1
              matrix(c(0.2,0.8,         0.7,0.3,         0.3,0.7        ),ncol=2,byrow=TRUE), # Y2
              matrix(c(0.3,0.6,0.1,     0.1,0.3,0.6,     0.3,0.6,0.1    ),ncol=3,byrow=TRUE), # Y3
              matrix(c(0.1,0.1,0.5,0.3, 0.5,0.3,0.1,0.1, 0.3,0.1,0.1,0.5),ncol=4,byrow=TRUE), # Y4
              matrix(c(0.1,0.1,0.8,     0.1,0.8,0.1,     0.8,0.1,0.1    ),ncol=3,byrow=TRUE)) # Y5
simdat <- poLCA.simdata(N=1000,probs,P=c(0.2,0.3,0.5))
f1 <- cbind(Y1,Y2,Y3,Y4,Y5)~1
lc1 <- poLCA(f1,simdat$dat,nclass=3)
table(lc1$predclass,simdat$trueclass)

# Create a sample dataset with 2 classes and three covariates.
# Then compare predicted class memberships when the model is 
# estimated "correctly" with covariates to when it is estimated
# "incorrectly" without covariates.

simdat2 <- poLCA.simdata(N=1000,ndv=7,niv=3,nclass=2,b=matrix(c(1,-2,1,-1)))
f2a <- cbind(Y1,Y2,Y3,Y4,Y5,Y6,Y7)~X1+X2+X3
lc2a <- poLCA(f2a,simdat2$dat,nclass=2)
f2b <- cbind(Y1,Y2,Y3,Y4,Y5,Y6,Y7)~1
lc2b <- poLCA(f2b,simdat2$dat,nclass=2)
table(lc2a$predclass,lc2b$predclass)
