R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Combine statistics across multiple tests

combineTests

R Documentation

Combine statistics across multiple tests

Description

Combines p-values across clustered tests using Simes' method to control the cluster FDR.

Usage

combineTests(ids, tab, weight=NULL, pval.col=NULL, fc.col=NULL)

Arguments

`ids`	an integer vector containing the cluster ID for each test
`tab`	a dataframe of results with `PValue` and at least one `logFC` field for each test
`weight`	a numeric vector of weights for each window, defaults to 1 for each test
`pval.col`	an integer scalar specifying the column of `tab` containing the p-values, or a character string containing the name of that column
`fc.col`	an integer vector specifying the columns of `tab` containing the log-fold changes, or a character vector containing the names of those columns

Details

This function uses Simes' procedure to compute the combined p-value for each cluster of tests with the same value of ids. Each combined p-value represents evidence against the global null hypothesis, i.e., all individual nulls are true in each cluster. This may be more relevant than examining each test individually when multiple tests in a cluster represent parts of the same underlying event, e.g., genomic regions consisting of clusters of windows. The BH method is also applied to control the FDR across all clusters.

The importance of each test within a cluster can be adjusted by supplying different relative weight values. This may be useful for downweighting low-confidence tests, e.g., those in repeat regions. In Simes' procedure, weights are interpreted as relative frequencies of the tests in each cluster. Note that these weights have no effect between clusters and will not be used to adjust the computed FDR.

By default, the relevant fields in tab are identified by matching the column names to their expected values. Multiple fields in tab containing the logFC substring are allowed, e.g., to accommodate ANOVA-like contrasts. If the column names are different from what is expected, specification of the correct columns can be performed using pval.col and fc.col. This will overwrite any internal selection of the appropriate fields.

This function will report the number of windows with log-fold changes above 0.5 and below -0.5, to give some indication of whether binding increases or decreases in the cluster. If a cluster contains non-negligble numbers of up and down windows, this indicates that there may be a complex DB event within that cluster. Similarly, complex DB may be present if the total number of windows is larger than the number of windows in either category (i.e., change is not consistent across the cluster). Note that the threshold of 0.5 is arbitrary and has no impact on the significance calculations.

A simple clustering approach for windows is provided in mergeWindows. However, anything can be used so long as it does not compromise type I error control, e.g., promoters, gene bodies, independently called peaks. Any tests with NA values for ids will be ignored.

Value

A dataframe with one row per cluster and the numeric fields PValue, the combined p-value; and FDR, the q-value corresponding to the combined p-value. An integer field nWindows specifies the total number of windows in each cluster. There are also two integer fields *.up and *.down for each log-FC column in tab, containing the number of windows with log-FCs above 0.5 or below -0.5, respectively. The name of each row represents the ID of the corresponding cluster.

Author(s)

Aaron Lun

References

Simes RJ (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751-754.

Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B 57, 289-300.

Benjamini Y and Hochberg Y (1997). Multiple hypotheses testing with weights. Scand. J. Stat. 24, 407-418.

Lun ATL and Smyth GK (2014). De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res. 42, e95

Examples

 
ids <- round(runif(100, 1, 10))
tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
combined <- combineTests(ids, tab)
head(combined)

# With window weighting.
w <- round(runif(100, 1, 5))
combined <- combineTests(ids, tab, weight=w)
head(combined)

# With multiple log-FCs.
tab$logFC.whee <- rnorm(100, 5)
combined <- combineTests(ids, tab)
head(combined)

# Manual specification of column IDs.
combined <- combineTests(ids, tab, fc.col=c(1,4), pval.col=3)
head(combined)

combined <- combineTests(ids, tab, fc.col="logFC.whee", pval.col="PValue")
head(combined)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(csaw)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/csaw/combineTests.Rd_%03d_medium.png", width=480, height=480)
> ### Name: combineTests
> ### Title: Combine statistics across multiple tests
> ### Aliases: combineTests
> ### Keywords: testing
> 
> ### ** Examples
>  
> ids <- round(runif(100, 1, 10))
> tab <- data.frame(logFC=rnorm(100), logCPM=rnorm(100), PValue=rbeta(100, 1, 2))
> combined <- combineTests(ids, tab)
> head(combined)
  nWindows logFC.up logFC.down     PValue       FDR
1       13        4          4 0.42243709 0.4224371
2       10        2          7 0.17718477 0.3437955
3       13        7          4 0.33555787 0.4194473
4       16        3          9 0.38353868 0.4224371
5        8        2          3 0.18867245 0.3437955
6        9        4          4 0.03535743 0.3437955
> 
> # With window weighting.
> w <- round(runif(100, 1, 5))
> combined <- combineTests(ids, tab, weight=w)
> head(combined)
  nWindows logFC.up logFC.down    PValue       FDR
1       13        4          4 0.3713733 0.3713733
2       10        2          7 0.1694872 0.3505163
3       13        7          4 0.3303954 0.3671060
4       16        3          9 0.3236108 0.3671060
5        8        2          3 0.2358406 0.3505163
6        9        4          4 0.0188573 0.1885730
> 
> # With multiple log-FCs.
> tab$logFC.whee <- rnorm(100, 5)
> combined <- combineTests(ids, tab)
> head(combined)
  nWindows logFC.up logFC.down logFC.whee.up logFC.whee.down     PValue
1       13        4          4            13               0 0.42243709
2       10        2          7            10               0 0.17718477
3       13        7          4            13               0 0.33555787
4       16        3          9            16               0 0.38353868
5        8        2          3             8               0 0.18867245
6        9        4          4             9               0 0.03535743
        FDR
1 0.4224371
2 0.3437955
3 0.4194473
4 0.4224371
5 0.3437955
6 0.3437955
> 
> # Manual specification of column IDs.
> combined <- combineTests(ids, tab, fc.col=c(1,4), pval.col=3)
> head(combined)
  nWindows logFC.up logFC.down logFC.whee.up logFC.whee.down     PValue
1       13        4          4            13               0 0.42243709
2       10        2          7            10               0 0.17718477
3       13        7          4            13               0 0.33555787
4       16        3          9            16               0 0.38353868
5        8        2          3             8               0 0.18867245
6        9        4          4             9               0 0.03535743
        FDR
1 0.4224371
2 0.3437955
3 0.4194473
4 0.4224371
5 0.3437955
6 0.3437955
> 
> combined <- combineTests(ids, tab, fc.col="logFC.whee", pval.col="PValue")
> head(combined)
  nWindows logFC.whee.up logFC.whee.down     PValue       FDR
1       13            13               0 0.42243709 0.4224371
2       10            10               0 0.17718477 0.3437955
3       13            13               0 0.33555787 0.4194473
4       16            16               0 0.38353868 0.4224371
5        8             8               0 0.18867245 0.3437955
6        9             9               0 0.03535743 0.3437955
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>