Last data update: 2014.03.03

sampcont    R Documentation

Unmatched Control Sampling

Description

Take all cases and a random sample of controls from a data frame. Simple random sampling and stratified random sampling are available. For stratified random sampling, strata can be defined by region, or by region and time. If no regions are specified, the function creates a regular grid for sampling.

Usage

sampcont(rdata, type = "stratified", regions = NULL, times = NULL, n = 1, 
         nrow = 100, ncol = 100)

Arguments

rdata

A data frame with the outcome (coded as 0/1) in the 1st column, and the geocoordinates (e.g., X and Y) in the 2nd and 3rd columns. Additional columns are not used in the sampling scheme but are retained in the sampled data frame.

type

"stratified" (default) or "simple". If "simple" then a simple random sample of n controls (rows of rdata with outcome=0) is obtained. If "stratified" then a stratified random sample of controls is obtained, with up to n controls per stratum. Sampling strata are defined by the regions and times arguments. All cases (rows with outcome=1) are taken for the sample regardless of the value supplied for type.
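The "simple" design described above can be sketched in base R. This is only an illustration of the idea, not the package's internal code; the toy data frame and all variable names are hypothetical.

```r
# Toy data (hypothetical): outcome in column 1, coordinates in columns 2-3
set.seed(1)
toy <- data.frame(case = rep(c(1, 0), times = c(20, 180)),
                  X = runif(200), Y = runif(200))

n <- 50                                  # number of controls to draw
is_case <- toy[[1]] == 1
control_rows <- which(!is_case)

# Draw positions with sample.int() and index into control_rows; this avoids
# the sample(x) pitfall when x happens to have length 1
n_take <- min(n, length(control_rows))
keep_controls <- control_rows[sample.int(length(control_rows), n_take)]

# All cases are kept, plus the sampled controls
simple_sample <- toy[c(which(is_case), keep_controls), ]
table(simple_sample$case)
```

The sampled data frame keeps every case and n controls, so the case fraction in the sample no longer reflects the source data.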

regions

A vector of length equal to the number of rows in rdata, used to construct sampling strata. Only used if type = "stratified". If regions = NULL and the PBSmapping package is available then the function will define regions as a vector of specific grid cells on a regular grid with nrow rows and ncol columns. If times = NULL then the nonempty regions are used as the sampling strata. If times is a vector, then the sampling strata are all nonempty combinations of regions and times.
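A user-supplied regions vector can be built in many ways. The sketch below assigns each point to a cell of a regular grid using base R's cut(); it is an assumed stand-in for illustration and not the PBSmapping-based routine the function uses when regions = NULL. The coordinates and grid dimensions are made up.

```r
# Hypothetical coordinates for 500 observations
set.seed(2)
X <- runif(500, -120, -70)
Y <- runif(500, 25, 50)

nrow_grid <- 10   # grid rows (bins along Y)
ncol_grid <- 20   # grid columns (bins along X)

# cut() with a single number of breaks splits the observed range into
# equal-width intervals; labels = FALSE returns integer bin codes
col_id <- cut(X, breaks = ncol_grid, labels = FALSE)
row_id <- cut(Y, breaks = nrow_grid, labels = FALSE)

# One integer label per grid cell, one entry per observation
regions <- (row_id - 1) * ncol_grid + col_id
length(unique(regions))   # number of nonempty cells
```

A vector built this way has one entry per row of the data and could be passed as the regions argument.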

times

A vector of length equal to the number of rows in rdata, used to construct sampling strata. If times = NULL then the sampling strata are defined only by the regions argument. If times is a vector, then the sampling strata are all nonempty combinations of regions and times. Continuous times should generally be binned before being passed through this argument, as there are no efficiency gains if each value in times is unique.
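As a minimal sketch of the binning advice above, continuous times can be grouped with cut() before being used to define strata. The day-of-year values, the region labels, and the choice of five equal-width bins are assumptions made for illustration.

```r
# Hypothetical continuous times (day of year) and region labels
set.seed(3)
times_raw <- runif(300, 0, 365)
regions <- sample(c("A", "B", "C"), 300, replace = TRUE)

# Five equal-width bins of 73 days each
time_bin <- cut(times_raw, breaks = seq(0, 365, by = 73), include.lowest = TRUE)

# Strata are the nonempty region-by-time combinations
strata <- interaction(regions, time_bin, drop = TRUE)
nlevels(strata)
```

Here time_bin, rather than times_raw, is the kind of vector that would be supplied as the times argument, so that each stratum contains more than a single observation.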

n

The number of controls to sample from the eligible controls in each stratum. All available controls will be taken for strata with fewer than n eligible controls.
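The "up to n per stratum" rule can be illustrated with a small base R sketch. The strata below are hypothetical and the code is not taken from the package; it only shows how strata with fewer than n eligible controls contribute all of their controls.

```r
# Hypothetical stratum labels for 9 eligible controls
set.seed(4)
stratum <- c(rep("s1", 5), rep("s2", 1), rep("s3", 3))
n <- 2

# Row indices of the controls, grouped by stratum
rows_by_stratum <- split(seq_along(stratum), stratum)

# Sample min(n, stratum size) rows within each stratum
sampled <- lapply(rows_by_stratum, function(idx) {
  idx[sample.int(length(idx), min(n, length(idx)))]
})

lengths(sampled)   # s1: 2, s2: 1, s3: 2
```

Stratum s2 has a single eligible control, so that control is always selected.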

nrow

The number of rows used to create a regular grid for sampling regions. Only used when regions = NULL.

ncol

The number of columns used to create a regular grid for sampling regions. Only used when regions = NULL.

Value

rdata

A data frame with all cases and a random sample of controls.

w

Inverse probability weights for the rows in rdata. These should be included as weights in subsequent analyses.
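To show why such weights matter, the sketch below uses the standard inverse-probability construction: each sampled control is weighted by the number of eligible controls in its stratum divided by the number sampled from it, and each case gets weight 1. This is an assumption about the general technique, not a statement of how the function computes w, and all counts are hypothetical.

```r
# Hypothetical stratum sizes, with up to n = 2 controls sampled per stratum
eligible <- c(s1 = 5, s2 = 1, s3 = 3)
taken <- pmin(eligible, 2)

# Inverse of the sampling probability within each stratum
w_control <- eligible / taken

# Sample of 4 cases (weight 1) followed by the sampled controls
n_cases <- 4
outcome <- c(rep(1, n_cases), rep(0, sum(taken)))
w <- c(rep(1, n_cases), rep(w_control, times = taken))

mean(outcome)               # case fraction in the sample: 4 / 9
weighted.mean(outcome, w)   # case fraction in the source: 4 / (4 + 9)
```

The control weights sum to the number of eligible controls, which is why the weighted mean recovers the source prevalence while the unweighted mean does not.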

ncont

The total number of controls in the sample.

Author(s)

Scott Bartell sbartell@uci.edu.

See Also

modgam

Examples

#### load beertweets data, which has 719 cases and 9281 controls
data(beertweets)
# take a simple random sample of 1000 controls
samp1 <- sampcont(beertweets, type="simple", n=1000)

# take a stratified random sample of controls on an 80x50 grid
# requires PBSmapping package
samp2 <- NULL

if(require(PBSmapping)) samp2 <- sampcont(beertweets, nrow=80, ncol=50)

# Compare locations for the two sampling designs (cases in red)
par(mfrow=c(2,1), mar=c(0,3,4,3))
plot(samp1$rdata$longitude, samp1$rdata$latitude, col=3-samp1$rdata$beer,
	cex=0.5, type="p", axes=FALSE, ann=FALSE)
# Show US base map if maps package is available
mapUS <- require(maps)
if (mapUS) map("state", add=TRUE)
title("Simple Random Sample, 1000 Controls")

if (!is.null(samp2)) {
	plot(samp2$rdata$longitude, samp2$rdata$latitude, 
		col=3-samp2$rdata$beer, cex=0.5, type="p", axes=FALSE, 
		ann=FALSE)
	if (mapUS) map("state", add=TRUE)
	title(paste("Spatially Stratified Sample,",samp2$ncont,"Controls"))
	}

par(mfrow=c(1,1))

## Note that weights are needed in statistical analyses
# Prevalence of cases in sample--not in source data
mean(samp1$rdata$beer)		 
# Estimated prevalence of cases in source data 
weighted.mean(samp1$rdata$beer, w=samp1$w)	
## Do beer tweet odds differ below the 36.5 degree parallel?
# Using full data
glm(beer~I(latitude<36.5), family=binomial, data=beertweets) 
# Stratified sample requires sampling weights 
if (!is.null(samp2)) glm(beer~I(latitude<36.5), family=binomial, 
	data=samp2$rdata, weights=samp2$w)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(MapGAM)
Loading required package: sp
Loading required package: gam
Loading required package: splines
Loading required package: foreach
Loaded gam 1.12

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/MapGAM/sampcont.Rd_%03d_medium.png", width=480, height=480)
> ### Name: sampcont
> ### Title: Unmatched Control Sampling
> ### Aliases: sampcont
> ### Keywords: misc
> 
> ### ** Examples
> 
> #### load beertweets data, which has 719 cases and 9281 controls
> data(beertweets)
> # take a simple random sample of 1000 controls
> samp1 <- sampcont(beertweets, type="simple", n=1000)
> 
> # take a stratified random sample of controls on a 80x50 grid
> # requires PBSmapping package
> samp2 <- NULL
> ## No test: 
> if(require(PBSmapping)) samp2 <- sampcont(beertweets, nrow=80, ncol=50)
Loading required package: PBSmapping

-----------------------------------------------------------
PBS Mapping 2.69.76 -- Copyright (C) 2003-2016 Fisheries and Oceans Canada

-----------------------------------------------------------


1000 controls selected from 9295 eligibles in 1000 strata.
> 
> # Compare locations for the two sampling designs (cases in red)
> par(mfrow=c(2,1), mar=c(0,3,4,3))
> plot(samp1$rdata$longitude, samp1$rdata$latitude, col=3-samp1$rdata$beer,
+ 	cex=0.5, type="p", axes=FALSE, ann=FALSE)
> # Show US base map if maps package is available
> mapUS <- require(maps)
Loading required package: maps

 # maps v3.1: updated 'world': all lakes moved to separate new #
 # 'lakes' database. Type '?world' or 'news(package="maps")'.  #


> if (mapUS) map("state", add=TRUE)
> title("Simple Random Sample, 1000 Controls")
> 
> if (!is.null(samp2)) {
+ 	plot(samp2$rdata$longitude, samp2$rdata$latitude, 
+ 		col=3-samp2$rdata$beer, cex=0.5, type="p", axes=FALSE, 
+ 		ann=FALSE)
+ 	if (mapUS) map("state", add=TRUE)
+ 	title(paste("Spatially Stratified Sample,",samp2$ncont,"Controls"))
+ 	}
> ## End(No test)
> par(mfrow=c(1,1))
> 
> ## Note that weights are needed in statistical analyses
> # Prevalence of cases in sample--not in source data
> mean(samp1$rdata$beer)		 
[1] 0.4134897
> # Estimated prevalence of cases in source data 
> weighted.mean(samp1$rdata$beer, w=samp1$w)	
[1] 0.8676018
> ## Do beer tweet odds differ below the 36.5 degree parallel?
> # Using full data
> glm(beer~I(latitude<36.5), family=binomial, data=beertweets) 

Call:  glm(formula = beer ~ I(latitude < 36.5), family = binomial, data = beertweets)

Coefficients:
           (Intercept)  I(latitude < 36.5)TRUE  
               -2.4804                 -0.2494  

Degrees of Freedom: 9999 Total (i.e. Null);  9998 Residual
Null Deviance:	    5099 
Residual Deviance: 5089 	AIC: 5093
> # Stratified sample requires sampling weights 
> if (!is.null(samp2)) glm(beer~I(latitude<36.5), family=binomial, 
+ 	data=samp2$rdata, weights=samp2$w)

Call:  glm(formula = beer ~ I(latitude < 36.5), family = binomial, data = samp2$rdata, 
    weights = samp2$w)

Coefficients:
           (Intercept)  I(latitude < 36.5)TRUE  
               -2.4808                 -0.2485  

Degrees of Freedom: 1704 Total (i.e. Null);  1703 Residual
Null Deviance:	    5099 
Residual Deviance: 5089 	AIC: 5093
> dev.off()
null device 
          1 
>