R: Calculate the cross correlation for a given GRanges object.
calculateCrossCorrelation
R Documentation
Description
This method calculates the cross correlation, i.e. the Pearson
correlation between the coverages of the positive and negative strand
from a DNA sequencing experiment. The cross correlation can be used as a
quality measure in ChIP-seq experiments (Kharchenko et al. 2008). Cross
correlation can also be used to estimate the fragment size by determining
the shift (given in base pairs) that maximizes the cross correlation.
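The idea can be illustrated with a minimal base R sketch on simulated data. It is not the package's implementation: the helper crossCor and all variable names are hypothetical, and the toy data assume that every positive-strand 5 prime start has a matching negative-strand start exactly fragSize bases downstream.

```r
# Toy single-chromosome example; all names here are illustrative.
set.seed(1)
chrLen   <- 5000
fragSize <- 150

# Positive-strand 5' starts at random sites; negative-strand 5' starts
# lie fragSize bases downstream of them (idealised, noise-free data).
sites <- sample(seq_len(chrLen - fragSize), 200)
pos <- tabulate(sites, nbins = chrLen)
neg <- tabulate(sites + fragSize, nbins = chrLen)

# Pearson correlation after moving the negative-strand coverage `shift`
# bases towards lower coordinates (its three prime end).
crossCor <- function(pos, neg, shift) {
  n <- length(pos)
  cor(pos[seq_len(n - shift)], neg[(shift + 1):n])
}

shifts <- seq(100, 200, by = 10)
cc <- vapply(shifts, function(s) crossCor(pos, neg, s), numeric(1))
names(cc) <- shifts

# Estimated fragment size: the shift with the highest correlation
as.numeric(names(cc)[which.max(cc)])  # 150
```

With real data the maximum is far less sharp, so a finer grid of shifts around the expected fragment size is usually more informative.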
Arguments
shift
The number of bases by which the negative strand is shifted towards its
three prime end. This can be a vector if the correlation should be
calculated for different shifts.
bin
If bin is larger than one, the coverage is calculated for bins of size
bin and not for each single base. This speeds up calculations
and might be beneficial in cases of low coverage. Note that shifting
is performed after binning, so that the shift(s) should be a multiple of
bin (otherwise, shift is rounded to the nearest multiple of bin).
mode
mode defines how bases (or bins) without reads are
handled. both means that only bases covered on both strands
are included when calculating the correlation. one means that
a base has to be covered on at least one strand, and none means
that all bases are included regardless of their coverage.
minReads
If fewer than minReads reads are mapped to a chromosome, the
chromosome is omitted.
chrs
A character vector with the chromosomes that should be included
in the calculation. NA means all chromosomes.
mc.cores
Number of cores to be used.
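The three values of mode described above can be pictured as logical masks over the two strand coverages. The following is a small sketch under that reading; keepPositions is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: which positions enter the correlation under each mode
keepPositions <- function(pos, neg, mode = c("none", "one", "both")) {
  mode <- match.arg(mode)
  switch(mode,
         both = pos > 0 & neg > 0,          # covered on both strands
         one  = pos > 0 | neg > 0,          # covered on at least one strand
         none = rep(TRUE, length(pos)))     # every position is kept
}

# Toy coverages of the positive and negative strand
pos <- c(0, 2, 0, 1, 3, 0)
neg <- c(0, 1, 4, 0, 2, 0)

sum(keepPositions(pos, neg, "both"))  # 2
sum(keepPositions(pos, neg, "one"))   # 4
sum(keepPositions(pos, neg, "none"))  # 6
```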
Details
Only the 5 prime start positions of reads are used for calculating
the coverage. Therefore, after removing duplicates in a single-end sequencing
experiment, the coverage cannot be larger than one if the bin size is
set to one. (In this setting, mode both is meaningless, because every
retained base then has a coverage of exactly one on both strands.)
If bin is larger than one, the coverage within a bin is aggregated.
Then, the correlation is calculated for each shift. A shift
(given in base pairs) should be a multiple of the bin size
(also given in base pairs). If not, the binned coverage is shifted by
round(shift/bin) elements.
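The binning and the bin-wise shift can be sketched in base R as follows. The helpers binCoverage and binnedCrossCor are hypothetical, and summing the counts within a bin is an assumption about how the coverage is aggregated.

```r
# Aggregate per-base 5' start counts into consecutive bins of size `bin`
# (assumption: counts within a bin are summed)
binCoverage <- function(cov, bin) {
  idx <- (seq_along(cov) - 1) %/% bin
  as.vector(tapply(cov, idx, sum))
}

# Correlate the binned strands after shifting the negative strand by
# round(shift / bin) bins towards lower coordinates
binnedCrossCor <- function(pos, neg, shift, bin) {
  p <- binCoverage(pos, bin)
  q <- binCoverage(neg, bin)
  k <- round(shift / bin)
  n <- length(p)
  cor(p[seq_len(n - k)], q[(k + 1):n])
}

binCoverage(c(1, 0, 2, 0, 0, 3, 1), bin = 3)  # 3 3 1
round(c(200, 204, 206) / 10)                  # 20 20 21
```

The last line shows why shifts that are not multiples of bin are not distinguishable from their neighbours: 200 and 204 both move the binned coverage by 20 elements.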
The different modes define whether regions without coverage or with
only one covered strand should be used. The original implementation in
the package "spp" does not make use of regions without
coverage. However, this seems to be a loss of information, since the
absence of coverage also has a biological meaning in a ChIP-seq experiment. If
the fragment size is approximately 500bp, setting shift=seq(200, 800, 10),
bin=10 and mode="none" should be a good setting.
After the cross correlation has been calculated for each chromosome,
the weighted mean correlation across all chromosomes is
calculated. The weight for a specific chromosome equals the fraction
of all reads that were aligned to that chromosome.
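As a worked example of this weighting, consider a single shift and three chromosomes. The numbers are made up for illustration, and all three chromosomes are assumed to pass the minReads filter.

```r
# Hypothetical per-chromosome results for a single shift
ccByChr    <- c(chr1 = 0.30, chr2 = 0.20, chr3 = 0.10)
readsByChr <- c(chr1 = 6000, chr2 = 3000, chr3 = 1000)

# Weight = fraction of all reads aligned to each chromosome
w <- readsByChr / sum(readsByChr)

# Weighted mean: 0.6 * 0.30 + 0.3 * 0.20 + 0.1 * 0.10
ccAll <- sum(w * ccByChr)
ccAll  # 0.25
```

The same value is returned by weighted.mean(ccByChr, readsByChr), since weighted.mean normalises the weights itself.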
Value
A numeric vector with the cross correlation for each shift. The names
of the vector correspond to the shifts.
Author(s)
Hans-Ulrich Klein (hklein@broadinstitute.org)
References
Kharchenko PV, Tolstorukov MY and Park PJ. Design and analysis of
ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 2008, 26(12):1351-9
Landt SG et al. ChIP-seq guidelines and practices of the ENCODE and
modENCODE consortia. Genome Res 2012, 22(9):1813-31