R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Compute the cluster-level FDR

clusterFDR

R Documentation

Compute the cluster-level FDR

Description

Compute the FDR across clusters based on the test-level FDR threshold

Usage

clusterFDR(ids, threshold, weight=NULL)
controlClusterFDR(target, adjp, FUN, ..., weight=NULL, grid.param=NULL)

Arguments

`ids`	an integer vector of cluster IDs for each significant test below `threshold`
`threshold`	a numeric scalar, specifying the FDR threshold used to define the significant tests
`target`	a numeric scalar specifying the desired cluster-level FDR threshold
`adjp`	a numeric vector of window-level adjusted p-values
`FUN`	a clustering function that takes a logical vector indicating which windows are significant, and returns an integer vector of cluster IDs
`...`	additional arguments to be passed to `FUN`
`weight`	a numeric vector of frequency weights, for internal use
`grid.param`	a named list of grid search parameters, see Details

Value

For clusterFDR, a numeric scalar is returned as the cluster-level FDR.

For controlClusterFDR, a list is returned containing two numeric scalars – threshold, the window-level FDR threshold to control the cluster-level FDR near target; and FDR, the estimate of the cluster-level FDR corresponding to threshold.

Definition of the cluster-level FDR

The clusterFDR function computes an informal estimate of the cluster-level FDR, where each cluster is formed by aggregating only significant tests. In the context of ChIP-seq, each significant test refers to a DB window that is detected at a FDR below threshold. The idea is to obtain an error rate while reporting the precise coordinates of a DB subinterval in a complex region.

The cluster-level FDR is defined as the proportion of reported clusters that have no true positives. Simply using threshold is not appropriate, as the cluster- and window-level FDRs are not equivalent. This function also differs from the standard pipeline that is based on combineTests. Specifically, region definition in the standard pipeline must be independent of DB so that precise coordinates of the DB subinterval cannot be reported. This is overcome by clustering directly on DB windows and applying post-hoc control of the cluster-level FDR.

Note that the calculation of the cluster-level FDR here is not statistically rigorous. In particular, the observed number of false positive tests is estimated based on threshold and the total number of significant tests. This is not guaranteed to be an upper bound, especially when the observed window-level FDR is variable. Thus, users should use the standard combineTests-based pipeline wherever possible. Clustering on significant windows should only be performed where the precise coordinates of the DB subinterval are important for interpretation.

Searching for the best threshold

controlClusterFDR will identify the window-level FDR threshold required to control the cluster-level FDR at target. This is necessary as the latter is not a simple function of the former. In fact, the discontinuous nature of the cluster-level FDR means that the returned value may not be below target. Rather, the closest estimate is returned, e.g., for a target of 0.05, a value of 0.051 is preferred to a value of 0.02, even though only the latter is less than the target.

A grid search is used to identify clusters of significant windows at each window-level threshold, and the corresponding cluster-level FDR is computed with clusterFDR. At each iteration, the grid point with the closest cluster-level FDR to target is chosen and the grid is recentered around that point. The size of the grid is also scaled down to provide greater resolution. Users can tune the settings of the grid search by specifying elements in grid.param as:

length:: an integer scalar specifying the length of the grid, defaults to 21
range:: a numeric scalar indicating the range of the grid in logit units, defaults to 20
iter:: an integer scalar indicating the number of iterations of the grid search
scale:: a numeric scalar specifying how the range of the grid should be downscaled at each iteration

Note about weights

In both functions, the weight argument is assumed to contain frequency weights of significant tests/windows. For example, a weight of 2 for a test would be equivalent to repeating that test (i.e., repeating the same window so it shows up twice in your analysis). These weights should be the same as those used during weighted FDR control to compute adjusted p-values. In general, users should not set this argument unless you know what you're doing.

Author(s)

Aaron Lun

Examples

# Setting up the windows and p-values.
set.seed(100)
windows <- GRanges("chrA", IRanges(1:1000, 1:1000))
test.p <- runif(1000)
test.p[c(1:10, 100:110, 220:240)] <- 0 # 3 significant subintervals.

# Defining significant windows.
threshold <- 0.05
is.sig <- p.adjust(test.p, method="BH") <= threshold

# Assuming that we only cluster significant windows.
merged <- mergeWindows(windows[is.sig], tol=0)
clusterFDR(merged$id, threshold)

# Setting up another example with more subintervals.
test.p <- runif(1000)
test.p[rep(1:2, 50) + rep(0:49, each=2) * 20] <- 0 
adj.p <- p.adjust(test.p, method="BH")
clusterFUN <- function(x) { mergeWindows(windows[x], tol=0)$id }
controlClusterFDR(0.05, adj.p, clusterFUN)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(csaw)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/csaw/clusterFDR.Rd_%03d_medium.png", width=480, height=480)
> ### Name: clusterFDR
> ### Title: Compute the cluster-level FDR
> ### Aliases: clusterFDR controlClusterFDR
> ### Keywords: testing
> 
> ### ** Examples
> 
> # Setting up the windows and p-values.
> set.seed(100)
> windows <- GRanges("chrA", IRanges(1:1000, 1:1000))
> test.p <- runif(1000)
> test.p[c(1:10, 100:110, 220:240)] <- 0 # 3 significant subintervals.
> 
> # Defining significant windows.
> threshold <- 0.05
> is.sig <- p.adjust(test.p, method="BH") <= threshold
> 
> # Assuming that we only cluster significant windows.
> merged <- mergeWindows(windows[is.sig], tol=0)
> clusterFDR(merged$id, threshold)
[1] 0.3333333
> 
> # Setting up another example with more subintervals.
> test.p <- runif(1000)
> test.p[rep(1:2, 50) + rep(0:49, each=2) * 20] <- 0 
> adj.p <- p.adjust(test.p, method="BH")
> clusterFUN <- function(x) { mergeWindows(windows[x], tol=0)$id }
> controlClusterFDR(0.05, adj.p, clusterFUN)             
$threshold
[1] 0.04853633

$FDR
[1] 0.05660377

> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>