
LongitudinalOverdispersedCounts {OpenMx}    R Documentation

Longitudinal, Overdispersed Count Data

Description

Four-timepoint longitudinal data generated from an arbitrary Monte Carlo simulation, for 1000 simulees. The response variable is a discrete count variable. There are three time-invariant covariates. The data are available in both "wide" and "long" format.

Usage

data("LongitudinalOverdispersedCounts")

Format

The "long" format data frame, longData, has 4000 rows (1000 simulees x 4 waves) and the following variables (columns):

  1. id: Factor; simulee ID code.

  2. tiem: Numeric; the time metric (wave of assessment). Note that the variable is indeed named 'tiem', not 'time', in the dataset.

  3. x1: Numeric; time-invariant covariate.

  4. x2: Numeric; time-invariant covariate.

  5. x3: Numeric; time-invariant covariate.

  6. y: Numeric; the response ("dependent") variable.

The "wide" format dataset, wideData, is a numeric 1000x12 matrix containing the following variables (columns):

  1. id: Simulee ID code.

  2. x1: Time-invariant covariate.

  3. x2: Time-invariant covariate.

  4. x3: Time-invariant covariate.

  5. y0: Response at initial wave of assessment.

  6. y1: Response at first follow-up.

  7. y2: Response at second follow-up.

  8. y3: Response at third follow-up.

  9. t0: Time variable at initial wave of assessment (in this case, 0).

  10. t1: Time variable at first follow-up (in this case, 1).

  11. t2: Time variable at second follow-up (in this case, 2).

  12. t3: Time variable at third follow-up (in this case, 3).

Examples

data(LongitudinalOverdispersedCounts)
head(wideData)
str(longData)
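#The wide and long objects hold the same data.  As a minimal sketch (assuming 
#the column layout shown by head(wideData) above), the wide matrix could be 
#reshaped to the long layout with base R's reshape():
longFromWide <- reshape(as.data.frame(wideData), direction="long",
                        varying=list(c("y0","y1","y2","y3"), c("t0","t1","t2","t3")),
                        v.names=c("y","tiem"), idvar="id", timevar="wave")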
#Let's try ordinary least-squares (OLS) regression:
olsmod <- lm(y~tiem+x1+x2+x3, data=longData)
#The diagnostic plots will show that the residuals are poorly approximated 
#by a normal distribution and are heteroskedastic.  We also know that the 
#residuals are not independent of one another, because these are 
#repeated-measures data:
plot(olsmod)
#In the summary, it looks as though all of the regression coefficients are 
#significantly different from zero, but because the assumptions of OLS 
#regression are violated, we should not trust these results:
summary(olsmod)
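#One partial remedy for the clustered residuals, sketched here on the 
#assumption that the 'sandwich' and 'lmtest' packages are available, is to 
#keep the OLS point estimates but use standard errors that are robust to 
#clustering on simulee ID:
if (requireNamespace("sandwich", quietly=TRUE) && requireNamespace("lmtest", quietly=TRUE)) {
  print(lmtest::coeftest(olsmod, vcov. = sandwich::vcovCL(olsmod, cluster = ~id)))
}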

#Let's try a generalized linear model (GLM).  We'll use the quasi-Poisson quasilikelihood 
#function to see how well the y variable is approximated by a Poisson distribution 
#(conditional on time and covariates):
glm.mod <- glm(y~tiem+x1+x2+x3, data=longData, family="quasipoisson")
#The estimate of the dispersion parameter should be about 1.0 if the data are 
#conditionally Poisson.  We can see that it is actually greater than 2, 
#indicating overdispersion:
summary(glm.mod)
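#As a check, the dispersion estimate reported in the summary can be computed 
#by hand as the sum of squared Pearson residuals divided by the residual 
#degrees of freedom:
sum(residuals(glm.mod, type="pearson")^2) / df.residual(glm.mod)
#Given the overdispersion, one possible next step (a sketch, not the only 
#option) is a negative-binomial GLM via glm.nb() from MASS, which is loaded 
#along with OpenMx.  Like the quasi-Poisson fit, this still ignores the 
#within-simulee dependence:
nbmod <- MASS::glm.nb(y~tiem+x1+x2+x3, data=longData)
summary(nbmod)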

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(OpenMx)
Loading required package: digest
Loading required package: MASS
Loading required package: Matrix
Loading required package: Rcpp
Loading required package: parallel

Attaching package: 'OpenMx'

The following objects are masked from 'package:Matrix':

    %&%, expm

> ### Name: LongitudinalOverdispersedCounts
> ### Title: Longitudinal, Overdispersed Count Data
> ### Aliases: LongitudinalOverdispersedCounts longData wideData
> ### Keywords: datasets
> 
> ### ** Examples
> 
> data(LongitudinalOverdispersedCounts)
> head(wideData)
     id          x1          x2         x3 y0 y1 y2 y3 t0 t1 t2 t3
[1,]  1  0.09028680 -0.70454619 0.98179355  1  4  4 13  0  1  2  3
[2,]  2 -0.60569794  1.84021070 0.34143632  2  3 24 23  0  1  2  3
[3,]  3 -1.64132905  0.06420197 0.18268172  0  3  3  9  0  1  2  3
[4,]  4 -0.94034250  0.13452838 1.41092610  2  2  2 17  0  1  2  3
[5,]  5 -0.08902176 -0.64903624 0.08836685  1 12  6 23  0  1  2  3
[6,]  6 -1.61535407  0.99948904 0.03628061  1  5  4 15  0  1  2  3
> str(longData)
'data.frame':	4000 obs. of  6 variables:
 $ id  : Factor w/ 1000 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ tiem: num  0 1 2 3 0 1 2 3 0 1 ...
 $ x1  : num  0.0903 0.0903 0.0903 0.0903 -0.6057 ...
 $ x2  : num  -0.705 -0.705 -0.705 -0.705 1.84 ...
 $ x3  : num  0.982 0.982 0.982 0.982 0.341 ...
 $ y   : num  1 4 4 13 2 3 24 23 0 3 ...
> #Let's try ordinary least-squares (OLS) regression:
> olsmod <- lm(y~tiem+x1+x2+x3, data=longData)
> #The diagnostic plots will show that the residuals are poorly approximated 
> #by a normal distribution and are heteroskedastic.  We also know that the 
> #residuals are not independent of one another, because these are 
> #repeated-measures data:
> plot(olsmod)
> #In the summary, it looks as though all of the regression coefficients are 
> #significantly different from zero, but because the assumptions of OLS 
> #regression are violated, we should not trust these results:
> summary(olsmod)

Call:
lm(formula = y ~ tiem + x1 + x2 + x3, data = longData)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.566  -4.507  -0.873   3.405  55.311 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.4667     0.1802  -2.590  0.00963 ** 
tiem          6.8208     0.0968  70.461  < 2e-16 ***
x1            2.9791     0.1141  26.100  < 2e-16 ***
x2            1.8477     0.1153  16.029  < 2e-16 ***
x3           -1.0792     0.1109  -9.731  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.793 on 3935 degrees of freedom
  (60 observations deleted due to missingness)
Multiple R-squared:  0.6238,	Adjusted R-squared:  0.6234 
F-statistic:  1631 on 4 and 3935 DF,  p-value: < 2.2e-16

> 
> #Let's try a generalized linear model (GLM).  We'll use the quasi-Poisson quasilikelihood 
> #function to see how well the y variable is approximated by a Poisson distribution 
> #(conditional on time and covariates):
> glm.mod <- glm(y~tiem+x1+x2+x3, data=longData, family="quasipoisson")
> #The estimate of the dispersion parameter should be about 1.0 if the data are 
> #conditionally Poisson.  We can see that it is actually greater than 2, 
> #indicating overdispersion:
> summary(glm.mod)

Call:
glm(formula = y ~ tiem + x1 + x2 + x3, family = "quasipoisson", 
    data = longData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.8007  -1.1976  -0.2377   0.7545   4.8464  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.515972   0.022339   23.10   <2e-16 ***
tiem         0.840302   0.008741   96.14   <2e-16 ***
x1           0.305381   0.007884   38.73   <2e-16 ***
x2           0.194600   0.008152   23.87   <2e-16 ***
x3          -0.111168   0.007792  -14.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.157702)

    Null deviance: 41602.8  on 3939  degrees of freedom
Residual deviance:  8293.9  on 3935  degrees of freedom
  (60 observations deleted due to missingness)
AIC: NA

Number of Fisher Scoring iterations: 5
