Last data update: 2014.03.03

R: Calculate the cross correlation for a given GRanges object.
calculateCrossCorrelationR Documentation

Calculate the cross correlation for a given GRanges object.

Description

This method calculates the cross correlation, i.e. the Pearson correlation between the coverages of the positive and negative strand from a DNA sequencing experiment. The cross correlation can be used as a quality measure in ChIP-seq experiments (Kharchenko et al. 2008). Cross correlation can also be used to estimate the fragment size by determining the shift (given in base pairs) that maximizes the cross correlation.

Usage

calculateCrossCorrelation(object, shift=c(200,250,300), bin=10, mode="none", minReads=10000, chrs=NA, mc.cores=1)

Arguments

object

An GRanges object containing the aligned reads.

shift

The number of bases that the negative strand is shifted towards its three prime end. This can be a vector, if the correlation should be calculated for different shifts.

bin

If bin is larger than one, the coverage is calculated for bins of size bin and not for each single base. This speeds up calculations and might be beneficial in cases of low coverage. Note that shifting is performed after binning, so that the shift(s) should be a multiple of bin (otherwise, shift is rounded to the nearest multiple of bin).

mode

mode defines how bases (or bins) without reads are handled. both means that only bases covered on both strands are included when calculating the correlation. one means that the base has to be covered on at least one strand and none mean that all bases are included independent of their coverage.

minReads

If not at least minReads are mapped to a chromosome, the chromosome is omitted.

chrs

A character vector with the chromosomes that should be included into the calculation. NA means all chromosomes.

mc.cores

Number of cores to be used.

Details

Only 5 prime start positions of reads are used for calculating the coverage. Therefore, after removing duplicates in a single end sequencing experiment, the coverage can not be larger than one, if the bin size is set to one. (In this setting, mode both is meaningless.) If bin is larger than one, the coverage within a bin is aggregated. Then, the correlation is calculated for each shift. A shift (given in basepairs) should be multiple of the bin size (given in basepairs, too). If not, the binnend coverage is shifted by round(shift/bin) elements.

The different modes define whether regions without coverage or with only one covered strand should used. The original implementation in the package "spp" does not make use of regions without coverage. However, this seems to be a loss of information, since no coverage has also a biological meaning in a ChIP-seq experiment. If the fragment size is approximately 500bp, setting shift=seq(200, 800, 10), bin=10 and mode="none" should be a good setting.

After the cross correlation was calculated for each chromosome, the weighted mean correlation across all chromosomes is calculated. The weight for a specific chromosome equals the fraction of all reads that were aligned to that chromosome.

Value

A numeric vector with the cross correlation for each shift. The names of the vector correspond to the shifts.

Author(s)

Hans-Ulrich Klein (hklein@broadinstitute.org)

References

Kharchenko PV, Tolstorukov MY and Park PJ. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26(12):1351-9

Landt SG et al., ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22(9):1813-31

See Also

GRanges-class

Examples

triangularKernel <- function(x, pos, h)  {
  res <- 1 - (abs(x - pos) / h)
  res[res < 0] <- 0
  return(res)
}
covPos <- round(triangularKernel(1:100, 60, 50) * 100)
covNeg <- round(triangularKernel(1:100, 65, 50) * 100)

reads <- GRanges(IRanges(start=c(rep(seq_along(covPos), covPos), rep(seq_along(covNeg), covNeg) - 9),
                         width=10),
                         strand=c(rep("+", sum(covPos)), rep("-", sum(covNeg))),
                         seqnames=rep("1", sum(covPos)+sum(covNeg)))

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="none", minReads=1)
cor(covPos, covNeg)
cor(covPos[1:(length(covPos)-10)], covNeg[11:length(covNeg)])

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="one", minReads=1)
cor(covPos[covPos != 0 | covNeg != 0], covNeg[covPos != 0 | covNeg != 0])

calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="both", minReads=1)
cor(covPos[covPos != 0 & covNeg != 0], covNeg[covPos != 0 & covNeg != 0])


covPos2 <- round(triangularKernel(1:100, 60, 50) * 50)
covNeg2 <- round(triangularKernel(1:100, 68, 50) * 50)
reads2 <- GRanges(IRanges(start=c(rep(seq_along(covPos2), covPos2), rep(seq_along(covNeg2), covNeg2) - 9),
                          width=10),
                          strand=c(rep("+", sum(covPos2)), rep("-", sum(covNeg2))),
                          seqnames=rep("2", sum(covPos2)+sum(covNeg2)))
seqlevels(reads2) <- c("1", "2")
allReads <- c(reads, reads2)

calculateCrossCorrelation(allReads, shift=5, minReads=1, bin=1, mode="none")
cor1 <- cor(covPos[1:(length(covPos)-5)], covNeg[6:length(covNeg)])
cor2 <- cor(covPos2[1:(length(covPos2)-5)], covNeg2[6:length(covNeg2)])
cor1 * (sum(c(covPos, covNeg))/length(allReads)) +
  cor2 * (sum(c(covPos2, covNeg2))/length(allReads))

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(epigenomix)
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/epigenomix/calculateCrossCorrelation.Rd_%03d_medium.png", width=480, height=480)
> ### Name: calculateCrossCorrelation
> ### Title: Calculate the cross correlation for a given GRanges object.
> ### Aliases: calculateCrossCorrelation
> ###   calculateCrossCorrelation,GRanges-method
> ### Keywords: cross correlation
> 
> ### ** Examples
> 
> triangularKernel <- function(x, pos, h)  {
+   res <- 1 - (abs(x - pos) / h)
+   res[res < 0] <- 0
+   return(res)
+ }
> covPos <- round(triangularKernel(1:100, 60, 50) * 100)
> covNeg <- round(triangularKernel(1:100, 65, 50) * 100)
> 
> reads <- GRanges(IRanges(start=c(rep(seq_along(covPos), covPos), rep(seq_along(covNeg), covNeg) - 9),
+                          width=10),
+                          strand=c(rep("+", sum(covPos)), rep("-", sum(covNeg))),
+                          seqnames=rep("1", sum(covPos)+sum(covNeg)))
> 
> calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="none", minReads=1)
        0        10 
0.9577904 0.9555028 
> cor(covPos, covNeg)
[1] 0.9577904
> cor(covPos[1:(length(covPos)-10)], covNeg[11:length(covNeg)])
[1] 0.9555028
> 
> calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="one", minReads=1)
        0        10 
0.9430905 0.9472663 
> cor(covPos[covPos != 0 | covNeg != 0], covNeg[covPos != 0 | covNeg != 0])
[1] 0.9430905
> 
> calculateCrossCorrelation(reads, shift=c(0,10), bin=1, mode="both", minReads=1)
        0        10 
0.9294835 0.9327396 
> cor(covPos[covPos != 0 & covNeg != 0], covNeg[covPos != 0 & covNeg != 0])
[1] 0.9294835
> 
> 
> covPos2 <- round(triangularKernel(1:100, 60, 50) * 50)
> covNeg2 <- round(triangularKernel(1:100, 68, 50) * 50)
> reads2 <- GRanges(IRanges(start=c(rep(seq_along(covPos2), covPos2), rep(seq_along(covNeg2), covNeg2) - 9),
+                           width=10),
+                           strand=c(rep("+", sum(covPos2)), rep("-", sum(covNeg2))),
+                           seqnames=rep("2", sum(covPos2)+sum(covNeg2)))
> seqlevels(reads2) <- c("1", "2")
> allReads <- c(reads, reads2)
> 
> calculateCrossCorrelation(allReads, shift=5, minReads=1, bin=1, mode="none")
        5 
0.9950325 
> cor1 <- cor(covPos[1:(length(covPos)-5)], covNeg[6:length(covNeg)])
> cor2 <- cor(covPos2[1:(length(covPos2)-5)], covNeg2[6:length(covNeg2)])
> cor1 * (sum(c(covPos, covNeg))/length(allReads)) +
+   cor2 * (sum(c(covPos2, covNeg2))/length(allReads))
[1] 0.9950325
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>