an object of class GenotypeMatrix, a
quadratic kernel matrix, an object of class
TabixFile, or a character string
with a file name
model
an object of class NullModel
ranges
an object with genomic regions to be tested; may be
an object of class GRanges or
GRangesList. If missing, assocTest takes the
whole genotype matrix or the genotypes in the VCF file as a whole.
kernel
determines the kernel that should be used for
association testing (see Subsection 9.2 of the package vignette for
details)
width
tolerance radius parameter for position-dependent kernels
“linear.podkat”, “quadratic.podkat”,
and “localsim.podkat”; must be single positive numeric value;
ignored for kernels “linear.SKAT”, “quadratic.SKAT”,
and “localsim.SKAT” (see Subsection 9.2 of the package
vignette for details)
weights
for the method with signature
GenotypeMatrix,NullModel, it is also possible to supply
weights directly as a numeric vector that is as long as the number of
columns of Z. In this case, the argument weightFunc
is ignored. Use NULL (default) to use automatic weighting
with the function supplied as argument weightFunc. If
weightFunc is NULL too, no weighting takes place, i.e.
an unweighted kernel is used.
weightFunc
function for computing variant weights from minor allele
frequencies (MAFs); see weightFuncs for weighting
and Subsection 9.3 of the package vignette for functions provided by
the podkat package. Use NULL
for unweighted kernels.
method
identifies the method for computing the p-values. If the
null model is of type “logistic” and small sample correction
is applied (see argument adj below), possible values are
“unbiased”, “population”, “sample”, and
“SKAT” (see details below and Subsection 9.5 of the package
vignette). If the null model is of type “linear” or if
the null model is of type “logistic” and no small sample
correction is applied, possible values are “davies”,
“liu”, and “liu.mod” (see details
below and Subsection 9.1 of the package vignette). If the null
model is of type “bernoulli”, this argument is ignored.
adj
whether or not to use small sample correction for
logistic models (binary trait with covariates). The choice
“none” turns off small sample correction. If “force”
is chosen, small sample correction is turned on unconditionally.
If “automatic” is chosen (default), small sample correction
is turned on if the number of samples does not exceed 2,000. This
argument is ignored for any type of model except “logistic”
and small sample correction is switched off. For details how to
train a null model for small sample correction, see
nullModel and Sections 4 and 9.5 of the package
vignette. An adjustment of higher moments is
performed whenever sampled null model residuals are available
in the null model model (slot res.resampled.adj, see
NullModel).
pValueLimit
if the null model is of type “bernoulli”,
assocTest performs an exact mixture of Bernoulli test. This
test uses a combinatorial algorithm to compute exact p-values and,
for the sake of computational efficiency, quits if a pre-specified
p-value threshold is exceeded. This threshold can be specified with
the pValueLimit argument. This argument is ignored for other
types of tests/null models.
cl
if cl is an object of class
SOCKcluster, association
testing is carried out in parallel on the cluster specified by cl.
If NULL (default), either no parallelization is done (if
nnodes=1) or assocTest launches a cluster with
nnodes R client processes on localhost. See Subsection
8.5.2 of the package vignette.
nnodes
if cl is NULL and nnodes is greater
than 1, makePSOCKcluster is called with nnodes
nodes on localhost, i.e. nnodes R slave processes are
launched on which association testing is carried out in parallel.
The default is 1. See Subsection 8.5.2 of
the package vignette.
batchSize
parameter which determines how many
regions of ranges
are processed at once. The larger batchSize, the larger the
the batches that are read from the VCF file Z. A larger
batchSize reduces the number of individual read operations,
which improves performance. However, a larger batchSize also
requires larger amounts of memory. A good choice of batchSize,
therefore, depends on the size and sparseness of the VCF file and as
well on the available memory. See Subsection 8.5 of
the package vignette.
noIndels
if TRUE (default), only single nucleotide
variants (SNVs) are considered
and indels in the VCF file Z are skipped.
onlyPass
if TRUE (default), only variants are considered
whose value in the FILTER column is “PASS”.
na.limit
all variants with a missing value ratio above this
threshold in the VCF file Z are not considered.
MAF.limit
all variants with a minor allele frequency (MAF) above this
threshold in the VCF file Z are not considered.
na.action
if “impute.major”, all missing values will
be imputed by major alleles before association testing. If
“omit”, all columns containing missing values in the VCF file
Z are ignored.
MAF.action
if “invert”, all columns with an MAF
exceeding 0.5 will be inverted in the sense that all minor alleles
will be replaced by major alleles and vice versa.
If “omit”, all variants in the VCF file with an MAF greater
than 0.5 are ignored. If “ignore”,
no action is taken and MAFs greater than 0.5 are kept as they are.
sex
if NULL, all samples are treated the same
without any modifications; if sex is a factor with levels
F (female) and M (male) that is as long as
length(model), this argument is
interpreted as the sex of the samples. In this case, the genotypes
corresponding to male samples are
doubled before further processing. This is designed for mixed-sex
analyses of the X chromosome outside of the pseudoautosomal
regions.
tmpdir
if computations are parallelized over multiple client
processes (see arguments nnodes and cl), the exchange
of the null model object between the master process and the client
processes is done via a temporary file. The tmpdir argument
allows to specify into which directory the temporary file should be
saved. On multi-core systems, the default should be sufficient.
If the computations are distributed over a custom cluster,
the tmpdir argument needs to be chosen such that all clients
can access it via the same path.
displayProgress
if TRUE (default) and if ranges is a
GRangesList, a progress message is printed
upon completion of each list component (typically consisting of
regions of one chromosome); this argument is ignored if
ranges is not an object of class GRangesList.
...
all other parameters are passed on to the assocTest
method with signature TabixFile,NullModel.
Details
The assocTest method is the main function of the podkat
package. For a given genotype and a null model, it performs the actual
association test(s).
For null models of types “linear” and
“logistic” (see NullModel and
nullModel), a variance component score test is used
(see Subsection 9.1 of the package vignette for details).
The test relies on the choice of a particular kernel to measure the
pairwise similarities of genotypes. The choice of the kernel can be
made with the kernel argument (see computeKernel
and Subsection 9.2 of the package vignette for more details).
For null models of type “linear”, the test statistic follows
a mixture of chi-squares distribution. For models of typ
“logistic”, the test statistic approximately follows a mixture
of chi-squares distribution. The computation of p-values for a given
mixture of chi-squares can be done according to Davies (1980) (which is
the default), according to Liu et al. (2009), or using a
modified method similar to the one suggested by
Liu et al. (2009) as implemented in the SKAT package,
too. Which method is used can be controlled using the method
argument. If method according to Davies (1980) fails,
assocTest resorts to the method by Liu et al. (2009).
See also Subsection 9.1 of the package vignette for more details.
For null models of type “logistic”, the assocTest
method also offers the small sample correction suggested by
Lee et al. (2012). Whether small sample correction is applied,
is controlled by the adj argument. The additional adjustment
of higher moments as suggested by Lee et al. (2012) is
performed whenever resampled null model residuals are available in
the null model model (slot res.resampled.adj, see
NullModel). In this case, the method
argument controls how the excess kurtosis of test statistics sampled
from the null distribution are computed. The
default setting “unbiased” computes unbiased estimates by
using the exact expected value determined from the mixture components.
The settings “population” and
“sample” use almost unbiased and biased sample statistics,
respectively. The choice “SKAT” uses the same method as
implemented in the SKAT package. See Subsection 9.5 of the
package vignette for more details.
If the null model is of type “bernoulli”, the test statistic
follows a mixture of Bernoulli distributions. In this case, an exact
p-value is determined that is computed as the probability to observe
a test statistic for random Bernoulli-distributed traits (under the null
hypothesis) that is at least as large as the observed test
statistic. For reasons of computational complexity, this option is
limited to sample numbers not larger than 100. See Subsection 9.1 of
the package vignette for more details.
The podkat package offers multiple interfaces for association
testing all of which require the second argument model to be a
NullModel object.
The simplest method is to call assocTest for an object
of class GenotypeMatrix as first argument
Z. If the ranges argument is not supplied, a single
association test is performed using the entire genotype matrix
contained in Z and an object of class
AssocTestResult is returned. In this case,
all variants need to reside on the same chromosome (compare with
computeKernel). If the ranges argument is
specified, each region in ranges is tested separately and the
result is returned as an AssocTestResultRanges
object.
As said, the simplest method is to store the entire genotype in a
GenotypeMatrix object and to call assocTest
as described above. This approach has the shortcoming that the entire
genotype must be read (e.g. from a VCF file) and kept in memory as a
whole. For large studies, in particular, whole-genome studies, this is
not feasible. In order to be able to cope with large studies, the
podkat package offers an interface that allows for reading from
a VCF file piece by piece without the need to read and store the entire
genotype at once. If Z is a TabixFile
object or the name of a VCF file, assocTest reads from the file
in batches of batchSize regions, performs the association tests
for these regions, and returns the results as an
AssocTestResultRanges object. This sequential
batch processing can also be parallelized. The user can either set up
a cluster him-/herself and pass the
SOCKcluster object as
cl argument. If the cl is NULL, users can
leave the setup of the cluster to assocTest. In this case, the
only thing necessary is to determine the number of R client processes
by the nnodes argument. The variant with the VCF interface
supports the same pre-processing and filter arguments as
readGenotypeMatrix to control which variants are actually
taken into account and how to handle variants with MAFs greater than
50%.
If the argument Z is a numeric matrix, Z is
interpreted as a kernel matrix
old{K}. Then a single association test is
performed as described above and the result is returned as an
AssocTestResult object. This allows the user to
use a custom kernel not currently implemented in the podkat
package. The assocTest function assumes that row and column
objects in the kernel matrix are in the same order. It does not
perform any check whether row and column names are the same or whether
the kernel matrix is actually positive semi-definite. Users should be
aware that running the function for invalid kernels matrices,
i.e. for a matrix that is not positive semi-definite, produces
meaningless results and may even lead to unexpected errors.
Finally, note that the samples in the null model model and
in the genotype (GenotypeMatrix object or VCF file)
need not be aligned to each other. If both the samples in model
and in the genotype are named (i.e. row names are defined
for Z if it is a GenotypeMatrix object; VCF files
always contain sample names anyway), assocTest checks if all
samples in model are present in the genotype. If so, it
selects only those samples from the genotype that occur in the null
model. If not, it quits with an error. If either the samples in the
null model or the genotypes are not named, assocTest assumes
that the samples are aligned to each other. This applies only if
the number of samples in the null model and the number of genotypes
are the same or if the number of genotypes equals the number of
samples in the null model plus the number of samples that were omitted
from the null model when it was trained (see
NullModel and nullModel).
Otherwise, the function quits with an error.
An analogous procedure is applied if the kernel matrix interface is
used (signature matrix,NullModel).
Value
an object of class AssocTestResult or
AssocTestResultRanges (see details above)
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011)
Rare-variant association testing for sequencing data with the sequence
kernel association test. Am. J. Hum. Genet.89, 82-93.
DOI:
10.1016/j.ajhg.2011.05.029.
Lee, S., Emond, M. J., Bamshad, M. J., Barnes, K. C., Rieder, M. J.,
Nickerson, D. A., NHLBI Exome Sequencing Project - ESP Lung Project
Team, Christiani, D. C., Wurfel, M. M., and Lin, X. (2012)
Optimal unified approach for rare-variant association testing with
application to small-sample case-control whole-exome sequencing studies.
Am. J. Hum. Genet.91, 224-237. DOI:
10.1016/j.ajhg.2012.06.007.
Davies, R. B. (1980) The distribution of a linear combination of
chi^2 random
variables. J. R. Stat. Soc. Ser. C-Appl. Stat.29,
323-333.
Liu, H., Tang, Y., and Zhang, H. (2009) A new chi-square
approximation to the distribution of non-negative definite quadratic
forms in non-central normal variables. Comput. Stat. Data Anal.53, 853-856.
## load genome description
data(hgA)
## partition genome into overlapping windows
windows <- partitionRegions(hgA)
## load genotype data from VCF file
vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
Z <- readGenotypeMatrix(vcfFile)
## read phenotype data from CSV file (continuous trait + covariates)
phenoFile <- system.file("examples/example1lin.csv", package="podkat")
pheno.c <- read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.c <- nullModel(y ~ ., pheno.c)
## perform association test
res <- assocTest(Z, model.c, windows)
print(res)
## perform association test using the VCF interface
res <- assocTest(vcfFile, model.c, windows, batchSize=100)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
## read phenotype data from CSV file (binary trait + covariates)
phenoFile <- system.file("examples/example1log.csv", package="podkat")
pheno.b <-read.table(phenoFile, header=TRUE, sep=",")
## train null model with all covariates in data frame 'pheno'
model.b <- nullModel(y ~ ., pheno.b)
## perform association test
res <- assocTest(Z, model.b, windows)
print(res)
## create Manhattan plot of adjusted p-values
plot(p.adjust(res), which="p.value.adj")
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(podkat)
Loading required package: Rsamtools
Loading required package: GenomeInfoDb
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/podkat/assocTest-methods.Rd_%03d_medium.png", width=480, height=480)
> ### Name: assocTest
> ### Title: Perform Association Test
> ### Aliases: assocTest assocTest-methods
> ### assocTest,GenotypeMatrix,NullModel-method
> ### assocTest,matrix,NullModel-method
> ### assocTest,TabixFile,NullModel-method
> ### assocTest,character,NullModel-method
> ### Keywords: methods
>
> ### ** Examples
>
> ## load genome description
> data(hgA)
>
> ## partition genome into overlapping windows
> windows <- partitionRegions(hgA)
>
> ## load genotype data from VCF file
> vcfFile <- system.file("examples/example1.vcf.gz", package="podkat")
> Z <- readGenotypeMatrix(vcfFile)
>
> ## read phenotype data from CSV file (continuous trait + covariates)
> phenoFile <- system.file("examples/example1lin.csv", package="podkat")
> pheno.c <- read.table(phenoFile, header=TRUE, sep=",")
>
> ## train null model with all covariates in data frame 'pheno'
> model.c <- nullModel(y ~ ., pheno.c)
>
> ## perform association test
> res <- assocTest(Z, model.c, windows)
> print(res)
Overview of association test:
Null model: linear
Number of samples: 200
Number of regions: 79
Number of regions without variants: 0
Average number of variants in regions: 24.1
Genome: hgA
Kernel: linear.podkat
p-value adjustment: none
Overview of significance of results:
Number of tests with p < 0.05: 8
Results for the 8 most significant regions:
seqnames start end width n Q p.value
1 chr1 7501 12500 5000 31 769748.34 1.294084e-07
2 chr1 10001 15000 5000 33 764828.81 4.874460e-06
3 chr1 140001 145000 5000 15 79937.68 3.599077e-03
4 chr1 5001 10000 5000 34 152555.30 9.785569e-03
5 chr1 132501 137500 5000 21 89287.55 1.349559e-02
6 chr1 142501 147500 5000 23 94629.68 3.338620e-02
7 chr1 42501 47500 5000 19 58191.23 3.341032e-02
8 chr1 25001 30000 5000 23 103713.12 3.754557e-02
>
> ## perform association test using the VCF interface
> res <- assocTest(vcfFile, model.c, windows, batchSize=100)
> print(res)
Overview of association test:
Null model: linear
Number of samples: 200
Number of regions: 79
Number of regions without variants: 0
Average number of variants in regions: 24.1
Genome: hgA
Kernel: linear.podkat
p-value adjustment: none
Overview of significance of results:
Number of tests with p < 0.05: 8
Results for the 8 most significant regions:
seqnames start end width n Q p.value
1 chr1 7501 12500 5000 31 769748.34 1.294084e-07
2 chr1 10001 15000 5000 33 764828.81 4.874460e-06
3 chr1 140001 145000 5000 15 79937.68 3.599077e-03
4 chr1 5001 10000 5000 34 152555.30 9.785569e-03
5 chr1 132501 137500 5000 21 89287.55 1.349559e-02
6 chr1 142501 147500 5000 23 94629.68 3.338620e-02
7 chr1 42501 47500 5000 19 58191.23 3.341032e-02
8 chr1 25001 30000 5000 23 103713.12 3.754557e-02
>
> ## create Manhattan plot of adjusted p-values
> plot(p.adjust(res), which="p.value.adj")
>
> ## read phenotype data from CSV file (binary trait + covariates)
> phenoFile <- system.file("examples/example1log.csv", package="podkat")
> pheno.b <-read.table(phenoFile, header=TRUE, sep=",")
>
> ## train null model with all covariates in data frame 'pheno'
> model.b <- nullModel(y ~ ., pheno.b)
small sample correction applied
>
> ## perform association test
> res <- assocTest(Z, model.b, windows)
> print(res)
Overview of association test:
Null model: logistic
Number of samples: 200
Number of regions: 79
Number of regions without variants: 0
Average number of variants in regions: 24.1
Genome: hgA
Kernel: linear.podkat
p-value adjustment: none
Overview of significance of results:
Number of tests with p < 0.05: 22
Results for the 10 most significant regions:
seqnames start end width n Q p.value
1 chr1 7501 12500 5000 31 38386.55 3.254320e-05
2 chr1 10001 15000 5000 33 43084.90 6.958311e-05
3 chr1 22501 27500 5000 27 25640.34 7.636106e-04
4 chr1 20001 25000 5000 23 16120.63 2.093675e-03
5 chr1 25001 30000 5000 23 10650.48 4.071057e-03
6 chr1 77501 82500 5000 17 4890.26 6.785409e-03
7 chr1 87501 92500 5000 30 12891.61 6.965278e-03
8 chr1 35001 40000 5000 23 22436.29 7.924275e-03
9 chr1 37501 42500 5000 25 22570.67 8.967666e-03
10 chr1 27501 32500 5000 27 32423.04 9.744144e-03
>
> ## create Manhattan plot of adjusted p-values
> plot(p.adjust(res), which="p.value.adj")
>
>
>
>
>
> dev.off()
null device
1
>