Last data update: 2014.03.03

R: Perform Association Test
assocTestR Documentation

Perform Association Test

Description

Method for performing a kernel-based association test given a genotype, VCF file, or kernel matrix

Usage

## S4 method for signature 'GenotypeMatrix,NullModel'
assocTest(Z, model, ranges,
          kernel=c("linear.podkat", "localsim.podkat",
                   "quadratic.podkat", "linear.SKAT",
                   "localsim.SKAT", "quadratic.SKAT"),
          width=1000, weights=NULL, weightFunc=betaWeights(),
          method=NULL, adj=c("automatic", "none", "force"),
          pValueLimit=0.05)
## S4 method for signature 'matrix,NullModel'
assocTest(Z, model, method=NULL,
          adj=c("automatic", "none", "force"), pValueLimit=0.05)
## S4 method for signature 'TabixFile,NullModel'
assocTest(Z, model, ranges,
          kernel=c("linear.podkat", "localsim.podkat",
                   "quadratic.podkat", "linear.SKAT",
                   "localsim.SKAT", "quadratic.SKAT"),
          cl=NULL, nnodes=1, batchSize=20,
          noIndels=TRUE, onlyPass=TRUE, na.limit=1, MAF.limit=1,
          na.action=c("impute.major", "omit"),
          MAF.action=c("invert", "omit", "ignore"),
          sex=NULL, weightFunc=betaWeights(), width=1000,
          method=NULL, adj=c("automatic", "none", "force"),
          pValueLimit=(0.1 / length(ranges)), tmpdir=tempdir(),
          displayProgress=TRUE)
## S4 method for signature 'character,NullModel'
assocTest(Z, model, ...)

Arguments

Z

an object of class GenotypeMatrix, a quadratic kernel matrix, an object of class TabixFile, or a character string with a file name

model

an object of class NullModel

ranges

an object with genomic regions to be tested; may be an object of class GRanges or GRangesList. If missing, assocTest takes the whole genotype matrix or the genotypes in the VCF file as a whole.

kernel

determines the kernel that should be used for association testing (see Subsection 9.2 of the package vignette for details)

width

tolerance radius parameter for position-dependent kernels “linear.podkat”, “quadratic.podkat”, and “localsim.podkat”; must be single positive numeric value; ignored for kernels “linear.SKAT”, “quadratic.SKAT”, and “localsim.SKAT” (see Subsection 9.2 of the package vignette for details)

weights

for the method with signature GenotypeMatrix,NullModel, it is also possible to supply weights directly as a numeric vector that is as long as the number of columns of Z. In this case, the argument weightFunc is ignored. Use NULL (default) to use automatic weighting with the function supplied as argument weightFunc. If weightFunc is NULL too, no weighting takes place, i.e. an unweighted kernel is used.

weightFunc

function for computing variant weights from minor allele frequencies (MAFs); see weightFuncs for weighting and Subsection 9.3 of the package vignette for functions provided by the podkat package. Use NULL for unweighted kernels.

method

identifies the method for computing the p-values. If the null model is of type “logistic” and small sample correction is applied (see argument adj below), possible values are “unbiased”, “population”, “sample”, and “SKAT” (see details below and Subsection 9.5 of the package vignette). If the null model is of type “linear” or if the null model is of type “logistic” and no small sample correction is applied, possible values are “davies”, “liu”, and “liu.mod” (see details below and Subsection 9.1 of the package vignette). If the null model is of type “bernoulli”, this argument is ignored.

adj

whether or not to use small sample correction for logistic models (binary trait with covariates). The choice “none” turns off small sample correction. If “force” is chosen, small sample correction is turned on unconditionally. If “automatic” is chosen (default), small sample correction is turned on if the number of samples does not exceed 2,000. This argument is ignored for any type of model except “logistic” and small sample correction is switched off. For details how to train a null model for small sample correction, see nullModel and Sections 4 and 9.5 of the package vignette. An adjustment of higher moments is performed whenever sampled null model residuals are available in the null model model (slot res.resampled.adj, see NullModel).

pValueLimit

if the null model is of type “bernoulli”, assocTest performs an exact mixture of Bernoulli test. This test uses a combinatorial algorithm to compute exact p-values and, for the sake of computational efficiency, quits if a pre-specified p-value threshold is exceeded. This threshold can be specified with the pValueLimit argument. This argument is ignored for other types of tests/null models.

cl

if cl is an object of class SOCKcluster, association testing is carried out in parallel on the cluster specified by cl. If NULL (default), either no parallelization is done (if nnodes=1) or assocTest launches a cluster with nnodes R client processes on localhost. See Subsection 8.5.2 of the package vignette.

nnodes

if cl is NULL and nnodes is greater than 1, makePSOCKcluster is called with nnodes nodes on localhost, i.e. nnodes R slave processes are launched on which association testing is carried out in parallel. The default is 1. See Subsection 8.5.2 of the package vignette.

batchSize

parameter which determines how many regions of ranges are processed at once. The larger batchSize, the larger the the batches that are read from the VCF file Z. A larger batchSize reduces the number of individual read operations, which improves performance. However, a larger batchSize also requires larger amounts of memory. A good choice of batchSize, therefore, depends on the size and sparseness of the VCF file and as well on the available memory. See Subsection 8.5 of the package vignette.

noIndels

if TRUE (default), only single nucleotide variants (SNVs) are considered and indels in the VCF file Z are skipped.

onlyPass

if TRUE (default), only variants are considered whose value in the FILTER column is “PASS”.

na.limit

all variants with a missing value ratio above this threshold in the VCF file Z are not considered.

MAF.limit

all variants with a minor allele frequency (MAF) above this threshold in the VCF file Z are not considered.

na.action

if “impute.major”, all missing values will be imputed by major alleles before association testing. If “omit”, all columns containing missing values in the VCF file Z are ignored.

MAF.action

if “invert”, all columns with an MAF exceeding 0.5 will be inverted in the sense that all minor alleles will be replaced by major alleles and vice versa. If “omit”, all variants in the VCF file with an MAF greater than 0.5 are ignored. If “ignore”, no action is taken and MAFs greater than 0.5 are kept as they are.

sex

if NULL, all samples are treated the same without any modifications; if sex is a factor with levels F (female) and M (male) that is as long as length(model), this argument is interpreted as the sex of the samples. In this case, the genotypes corresponding to male samples are doubled before further processing. This is designed for mixed-sex analyses of the X chromosome outside of the pseudoautosomal regions.

tmpdir

if computations are parallelized over multiple client processes (see arguments nnodes and cl), the exchange of the null model object between the master process and the client processes is done via a temporary file. The tmpdir argument allows to specify into which directory the temporary file should be saved. On multi-core systems, the default should be sufficient. If the computations are distributed over a custom cluster, the tmpdir argument needs to be chosen such that all clients can access it via the same path.

displayProgress

if TRUE (default) and if ranges is a GRangesList, a progress message is printed upon completion of each list component (typically consisting of regions of one chromosome); this argument is ignored if ranges is not an object of class GRangesList.

...

all other parameters are passed on to the assocTest method with signature TabixFile,NullModel.

Details

The assocTest method is the main function of the podkat package. For a given genotype and a null model, it performs the actual association test(s).

For null models of types “linear” and “logistic” (see NullModel and nullModel), a variance component score test is used (see Subsection 9.1 of the package vignette for details). The test relies on the choice of a particular kernel to measure the pairwise similarities of genotypes. The choice of the kernel can be made with the kernel argument (see computeKernel and Subsection 9.2 of the package vignette for more details). For null models of type “linear”, the test statistic follows a mixture of chi-squares distribution. For models of typ “logistic”, the test statistic approximately follows a mixture of chi-squares distribution. The computation of p-values for a given mixture of chi-squares can be done according to Davies (1980) (which is the default), according to Liu et al. (2009), or using a modified method similar to the one suggested by Liu et al. (2009) as implemented in the SKAT package, too. Which method is used can be controlled using the method argument. If method according to Davies (1980) fails, assocTest resorts to the method by Liu et al. (2009). See also Subsection 9.1 of the package vignette for more details.

For null models of type “logistic”, the assocTest method also offers the small sample correction suggested by Lee et al. (2012). Whether small sample correction is applied, is controlled by the adj argument. The additional adjustment of higher moments as suggested by Lee et al. (2012) is performed whenever resampled null model residuals are available in the null model model (slot res.resampled.adj, see NullModel). In this case, the method argument controls how the excess kurtosis of test statistics sampled from the null distribution are computed. The default setting “unbiased” computes unbiased estimates by using the exact expected value determined from the mixture components. The settings “population” and “sample” use almost unbiased and biased sample statistics, respectively. The choice “SKAT” uses the same method as implemented in the SKAT package. See Subsection 9.5 of the package vignette for more details.

If the null model is of type “bernoulli”, the test statistic follows a mixture of Bernoulli distributions. In this case, an exact p-value is determined that is computed as the probability to observe a test statistic for random Bernoulli-distributed traits (under the null hypothesis) that is at least as large as the observed test statistic. For reasons of computational complexity, this option is limited to sample numbers not larger than 100. See Subsection 9.1 of the package vignette for more details.

The podkat package offers multiple interfaces for association testing all of which require the second argument model to be a NullModel object. The simplest method is to call assocTest for an object of class GenotypeMatrix as first argument Z. If the ranges argument is not supplied, a single association test is performed using the entire genotype matrix contained in Z and an object of class AssocTestResult is returned. In this case, all variants need to reside on the same chromosome (compare with computeKernel). If the ranges argument is specified, each region in ranges is tested separately and the result is returned as an AssocTestResultRanges object.

As said, the simplest method is to store the entire genotype in a GenotypeMatrix object and to call assocTest as described above. This approach has the shortcoming that the entire genotype must be read (e.g. from a VCF file) and kept in memory as a whole. For large studies, in particular, whole-genome studies, this is not feasible. In order to be able to cope with large studies, the podkat package offers an interface that allows for reading from a VCF file piece by piece without the need to read and store the entire genotype at once. If Z is a TabixFile object or the name of a VCF file, assocTest reads from the file in batches of batchSize regions, performs the association tests for these regions, and returns the results as an AssocTestResultRanges object. This sequential batch processing can also be parallelized. The user can either set up a cluster him-/herself and pass the SOCKcluster object as cl argument. If the cl is NULL, users can leave the setup of the cluster to assocTest. In this case, the only thing necessary is to determine the number of R client processes by the nnodes argument. The variant with the VCF interface supports the same pre-processing and filter arguments as readGenotypeMatrix to control which variants are actually taken into account and how to handle variants with MAFs greater than 50%.

If the argument Z is a numeric matrix, Z is interpreted as a kernel matrix old{K}. Then a single association test is performed as described above and the result is returned as an AssocTestResult object. This allows the user to use a custom kernel not currently implemented in the podkat package. The assocTest function assumes that row and column objects in the kernel matrix are in the same order. It does not perform any check whether row and column names are the same or whether the kernel matrix is actually positive semi-definite. Users should be aware that running the function for invalid kernels matrices, i.e. for a matrix that is not positive semi-definite, produces meaningless results and may even lead to unexpected errors.

Finally, note that the samples in the null model model and in the genotype (GenotypeMatrix object or VCF file) need not be aligned to each other. If both the samples in model and in the genotype are named (i.e. row names are defined for Z if it is a GenotypeMatrix object; VCF files always contain sample names anyway), assocTest checks if all samples in model are present in the genotype. If so, it selects only those samples from the genotype that occur in the null model. If not, it quits with an error. If either the samples in the null model or the genotypes are not named, assocTest assumes that the samples are aligned to each other. This applies only if the number of samples in the null model and the number of genotypes are the same or if the number of genotypes equals the number of samples in the null model plus the number of samples that were omitted from the null model when it was trained (see NullModel and nullModel). Otherwise, the function quits with an error. An analogous procedure is applied if the kernel matrix interface is used (signature matrix,NullModel).

Value

an object of class AssocTestResult or AssocTestResultRanges (see details above)

Author(s)

Ulrich Bodenhofer bodenhofer@bioinf.jku.at

References

http://www.bioinf.jku.at/software/podkat

Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011) Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82-93. DOI: 10.1016/j.ajhg.2011.05.029.

Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J., Nickerson, D. A., NHLBI Exome Sequencing Project - ESP Lung Project Team, Christiani, D. C., Wurfel, M. M., and Lin, X. (2012) Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91, 224-237. DOI: 10.1016/j.ajhg.2012.06.007.

Davies, R. B. (1980) The distribution of a linear combination of chi^2 random variables. J. R. Stat. Soc. Ser. C-Appl. Stat. 29, 323-333.

Liu, H., Tang, Y., and Zhang, H. (2009) A new chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables. Comput. Stat. Data Anal. 53, 853-856.

See Also

AssocTestResult, AssocTestResultRanges, nullModel, NullModel, computeKernel, weightFuncs, readGenotypeMatrix, GenotypeMatrix, plot, qqplot, p.adjust, filterResult

Examples

## load genome description
data(hgA)

## partition genome into overlapping windows
windows <- partitionRegions(hgA)

## load genotype data from VCF file
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
Z <- readGenotypeMatrix(vcfFile)

## read phenotype data from CSV file (continuous trait + covariates)
phenoFile <- system.file("examples/example1lin.csv", package="podkat")
pheno.c <- read.table(phenoFile, header=TRUE, sep=",")

## train null model with all covariates in data frame 'pheno'
model.c <- nullModel(y ~ ., pheno.c)

## perform association test
res <- assocTest(Z, model.c, windows)
print(res)

## perform association test using the VCF interface
res <- assocTest(vcfFile, model.c, windows, batchSize=100)
print(res)

## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")

## read phenotype data from CSV file (binary trait + covariates)
phenoFile <- system.file("examples/example1log.csv", package="podkat")
pheno.b <-read.table(phenoFile, header=TRUE, sep=",")

## train null model with all covariates in data frame 'pheno'
model.b <- nullModel(y ~ ., pheno.b)

## perform association test
res <- assocTest(Z, model.b, windows)
print(res)

## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(podkat)
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/podkat/assocTest-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: assocTest
> ### Title: Perform Association Test
> ### Aliases: assocTest assocTest-methods
> ###   assocTest,GenotypeMatrix,NullModel-method
> ###   assocTest,matrix,NullModel-method
> ###   assocTest,TabixFile,NullModel-method
> ###   assocTest,character,NullModel-method
> ### Keywords: methods
> 
> ### ** Examples
> 
> ## load genome description
> data(hgA)
> 
> ## partition genome into overlapping windows
> windows <- partitionRegions(hgA)
> 
> ## load genotype data from VCF file
> vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
> Z <- readGenotypeMatrix(vcfFile)
> 
> ## read phenotype data from CSV file (continuous trait + covariates)
> phenoFile <- system.file("examples/example1lin.csv", package="podkat")
> pheno.c <- read.table(phenoFile, header=TRUE, sep=",")
> 
> ## train null model with all covariates in data frame 'pheno'
> model.c <- nullModel(y ~ ., pheno.c)
> 
> ## perform association test
> res <- assocTest(Z, model.c, windows)
> print(res)
Overview of association test:
	Null model: linear 
	Number of samples: 200 
	Number of regions: 79 
	Number of regions without variants: 0 
	Average number of variants in regions: 24.1 
	Genome: hgA 
	Kernel: linear.podkat 
	p-value adjustment: none 

Overview of significance of results:
	Number of tests with p < 0.05: 8

Results for the 8 most significant regions:
  seqnames  start    end width  n         Q      p.value
1     chr1   7501  12500  5000 31 769748.34 1.294084e-07
2     chr1  10001  15000  5000 33 764828.81 4.874460e-06
3     chr1 140001 145000  5000 15  79937.68 3.599077e-03
4     chr1   5001  10000  5000 34 152555.30 9.785569e-03
5     chr1 132501 137500  5000 21  89287.55 1.349559e-02
6     chr1 142501 147500  5000 23  94629.68 3.338620e-02
7     chr1  42501  47500  5000 19  58191.23 3.341032e-02
8     chr1  25001  30000  5000 23 103713.12 3.754557e-02
> 
> ## perform association test using the VCF interface
> res <- assocTest(vcfFile, model.c, windows, batchSize=100)
> print(res)
Overview of association test:
	Null model: linear 
	Number of samples: 200 
	Number of regions: 79 
	Number of regions without variants: 0 
	Average number of variants in regions: 24.1 
	Genome: hgA 
	Kernel: linear.podkat 
	p-value adjustment: none 

Overview of significance of results:
	Number of tests with p < 0.05: 8

Results for the 8 most significant regions:
  seqnames  start    end width  n         Q      p.value
1     chr1   7501  12500  5000 31 769748.34 1.294084e-07
2     chr1  10001  15000  5000 33 764828.81 4.874460e-06
3     chr1 140001 145000  5000 15  79937.68 3.599077e-03
4     chr1   5001  10000  5000 34 152555.30 9.785569e-03
5     chr1 132501 137500  5000 21  89287.55 1.349559e-02
6     chr1 142501 147500  5000 23  94629.68 3.338620e-02
7     chr1  42501  47500  5000 19  58191.23 3.341032e-02
8     chr1  25001  30000  5000 23 103713.12 3.754557e-02
> 
> ## create Manhattan plot of adjusted p-values
> plot(p.adjust(res), which="p.value.adj")
> 
> ## read phenotype data from CSV file (binary trait + covariates)
> phenoFile <- system.file("examples/example1log.csv", package="podkat")
> pheno.b <-read.table(phenoFile, header=TRUE, sep=",")
> 
> ## train null model with all covariates in data frame 'pheno'
> model.b <- nullModel(y ~ ., pheno.b)
small sample correction applied
> 
> ## perform association test
> res <- assocTest(Z, model.b, windows)
> print(res)
Overview of association test:
	Null model: logistic 
	Number of samples: 200 
	Number of regions: 79 
	Number of regions without variants: 0 
	Average number of variants in regions: 24.1 
	Genome: hgA 
	Kernel: linear.podkat 
	p-value adjustment: none 

Overview of significance of results:
	Number of tests with p < 0.05: 22

Results for the 10 most significant regions:
   seqnames start   end width  n        Q      p.value
1      chr1  7501 12500  5000 31 38386.55 3.254320e-05
2      chr1 10001 15000  5000 33 43084.90 6.958311e-05
3      chr1 22501 27500  5000 27 25640.34 7.636106e-04
4      chr1 20001 25000  5000 23 16120.63 2.093675e-03
5      chr1 25001 30000  5000 23 10650.48 4.071057e-03
6      chr1 77501 82500  5000 17  4890.26 6.785409e-03
7      chr1 87501 92500  5000 30 12891.61 6.965278e-03
8      chr1 35001 40000  5000 23 22436.29 7.924275e-03
9      chr1 37501 42500  5000 25 22570.67 8.967666e-03
10     chr1 27501 32500  5000 27 32423.04 9.744144e-03
> 
> ## create Manhattan plot of adjusted p-values
> plot(p.adjust(res), which="p.value.adj")
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>