Last data update: 2014.03.03

R: Compute correlation coefficients between reads
correlateReadsR Documentation

Compute correlation coefficients between reads

Description

Computes the auto- or cross-correlation coefficients between read positions across a set of delay intervals.

Usage

correlateReads(bam.files, max.dist=1000, cross=TRUE, param=readParam())

Arguments

bam.files

a character vector containing paths to sorted and indexed BAM files

max.dist

integer scalar specifying the maximum delay distance over which correlation coefficients will be calculated

cross

a logical scalar specifying whether cross-correlations should be computed

param

a readParam object containing read extraction parameters, or a list of such objects (one for each BAM file)

Details

If cross=TRUE, reads are separated into those mapping on the forward and reverse strands. Positions on the forward strand are shifted forward by a delay interval. The chromosome-wide correlation coefficient between the shifted forward positions and the original reverse positions are computed. This is repeated for all delay intervals less than max.dist. A weighted mean for the cross-correlation is taken across all chromosomes, with weighting based on the number of reads.

Cross-correlation plots can be used to check the quality of immunoprecipitation for ChIP-Seq experiments involving transcription factors or punctate histone marks. Strong immunoprecipitation should result in a peak at a delay corresponding to the fragment length. A spike may also be observed at the delay corresponding to the read length. This is probably an artefact of the mapping process where unique mapping occurs to the same sequence on each strand.

By default, marked duplicate reads are removed from each BAM file prior to calculation of coefficients. This is strongly recommended, even if the rest of the analysis will be performed with duplicates retained. Otherwise, the read length spike will dominate the plot, such that the fragment length peak will no longer be easily visible.

If cross=FALSE, auto-correlation coefficients are computed without use of strand information. This is designed to guide estimation of the average width of enrichment for diffuse histone marks. For example, the width can be defined as the delay distance at which the autocorrelations become negligble. However, this tends to be ineffective in practice as diffuse marks tend to have very weak correlations to begin with.

If multiple BAM files are specified in bam.files, the reads from all libraries are pooled prior to calculation of the correlation coefficients. This is convenient for determining the average correlation profile across an entire dataset. Separate calculations for each file will require multiple calls to correlateReads.

Paired-end data is also supported, whereby correlations are computed using only those reads in proper pairs. This may be less meaningful as the presence of proper pairs will inevitably result in a strong peak at the fragment length. Instead, IP efficiency can be diagnosed by treating paired-end data as single-end, e.g., with pe="first" in readParam.

Value

A numeric vector of length max.dist+1 containing the correlation coefficients for each delay interval from 0 to max.dist.

Author(s)

Aaron Lun

References

Kharchenko PV, Tolstorukov MY and Park, PJ (2008). Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 26, 1351-1359.

See Also

ccf

Examples

n <- 20
bamFile <- system.file("exdata", "rep1.bam", package="csaw")
par(mfrow=c(2,2))

x <- correlateReads(bamFile, max.dist=n)
plot(0:n, x, xlab="delay (bp)", ylab="ccf")

x <- correlateReads(bamFile, max.dist=n, param=readParam(dedup=TRUE))
plot(0:n, x, xlab="delay (bp)", ylab="ccf")

x <- correlateReads(bamFile, max.dist=n, cross=FALSE)
plot(0:n, x, xlab="delay (bp)", ylab="acf")

# Also works on paired-end data.
bamFile <- system.file("exdata", "pet.bam", package="csaw")
x <- correlateReads(bamFile, param=readParam(pe="both"))
head(x)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(csaw)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/csaw/correlateReads.Rd_%03d_medium.png", width=480, height=480)
> ### Name: correlateReads
> ### Title: Compute correlation coefficients between reads
> ### Aliases: correlateReads
> ### Keywords: diagnostics
> 
> ### ** Examples
> 
> n <- 20
> bamFile <- system.file("exdata", "rep1.bam", package="csaw")
> par(mfrow=c(2,2))
> 
> x <- correlateReads(bamFile, max.dist=n)
> plot(0:n, x, xlab="delay (bp)", ylab="ccf")
> 
> x <- correlateReads(bamFile, max.dist=n, param=readParam(dedup=TRUE))
> plot(0:n, x, xlab="delay (bp)", ylab="ccf")
> 
> x <- correlateReads(bamFile, max.dist=n, cross=FALSE)
> plot(0:n, x, xlab="delay (bp)", ylab="acf")
> 
> # Also works on paired-end data.
> bamFile <- system.file("exdata", "pet.bam", package="csaw")
> x <- correlateReads(bamFile, param=readParam(pe="both"))
> head(x)
[1] -0.02256113 -0.02274439  0.12439047 -0.02057973 -0.02074556 -0.02091439
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>