R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Identify clusters containing high-confidence substitutions...

getClusters

R Documentation

Identify clusters containing high-confidence substitutions and resolve boundaries at high resolution

Description

Identifies clusters using either the mini-rank norm (MRN) algorithm (default and recommended to achieve highest sensitivity) or via a continuous wavelet transform (CWT) based approach. The former employs thresholding of background coverage differences and finds the optimal cluster boundaries by exhaustively evaluating all putative clusters using a rank-based approach. This method has higher sensitivity and an approximately 10-fold faster running time than the CWT-based cluster identification algorithm. The latter, maintained for compatibility with wavClusteR, computes the CWT on a 1 kb window of the coverage function centered at a high-confidence substitution site, and identifies cluster boundaries by extending away from peak positions.

Usage

getClusters(highConfSub, coverage, sortedBam, method = 'mrn', cores =
1, threshold, step = 1, snr = 3)

Arguments

`highConfSub`	GRanges object containing high-confidence substitution sites as returned by the getHighConfSub function
`coverage`	An Rle object containing the coverage at each genomic position as returned by a call to coverage
`sortedBam`	a GRanges object containing all aligned reads, including read sequence (qseq) and MD tag (MD), as returned by the readSortedBam function
`method`	a character, either set to "mrn" or to "cwt" to compute clusters using the mini-rank norm or the wavelet transform-based algorithm, respectively. Default is "mrn" (recommended).
`cores`	integer, the number of cores to be used for parallel evaluation. Default is 1.
`threshold`	numeric, if `method = "mrn"`, the difference in coverage to be considered noise. If not specified, a Gaussian mixture model is used to learn a threshold from the data. Empirically, 10% of the minimum coverage required at substitutions (see argument `minCov` in the getHighConfSub function) might suffice to provide highly resolved clusters. However, if `minCov` is much lower than the median strand-specific coverage at substitutions m, which can be computed using `summary(elementMetadata(highConfSub)[, 'coverage'])['Median']`), 10% of m might represent an optimal choice.
`step`	numeric, if `method = "cwt"`, step size of window shift. If two high-confidence substitution sites are located within a distance less than `step`, the wavelet transform is computed only once. Default: 1, i.e. each high-confidence substitution site is considered independently.
`snr`	numeric, if `method = "cwt"`, signal-to-noise ratio controlling the peak calling as performed by `wavCWTPeaks` implemented in the wmtsa package. Default: 3.

Value

GRanges object containing the identified cluster boundaries.

Note

Clusters returned by this function need to be further merged by the function filterClusters, which also computes all relevant cluster statistics.

Author(s)

Federico Comoglio and Cem Sievers

References

William Constantine and Donald Percival (2011), wmtsa: Wavelet Methods for Time Series Analysis, http://CRAN.R-project.org/package=wmtsa

Sievers C, Schlumpf T, Sawarkar R, Comoglio F and Paro R. (2012) Mixture models and wavelet transforms reveal high confidence RNA-protein interaction sites in MOV10 PAR-CLIP data, Nucleic Acids Res. 40(20):e160. doi: 10.1093/nar/gks697

Comoglio F, Sievers C and Paro R (2015) Sensitive and highly resolved identification of RNA-protein interaction sites in PAR-CLIP data, BMC Bioinformatics 16, 32.

Examples


filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
example <- readSortedBam( filename = filename )
countTable <- getAllSub( example, minCov = 10, cores = 1 )
highConfSub <- getHighConfSub( countTable, supportStart = 0.2, supportEnd = 0.7, substitution = "TC" )
coverage <- coverage( example )
clusters <- getClusters( highConfSub = highConfSub, 
                         coverage = coverage, 
                         sortedBam = example, 
	                 method = 'mrn', 
	                 cores = 1, 
	                 threshold = 2 )

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(wavClusteR)
Loading required package: GenomicRanges
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/wavClusteR/getClusters.Rd_%03d_medium.png", width=480, height=480)
> ### Name: getClusters
> ### Title: Identify clusters containing high-confidence substitutions and
> ###   resolve boundaries at high resolution
> ### Aliases: getClusters
> ### Keywords: core
> 
> ### ** Examples
> 
> 
> filename <- system.file( "extdata", "example.bam", package = "wavClusteR" )
> example <- readSortedBam( filename = filename )
> countTable <- getAllSub( example, minCov = 10, cores = 1 )
Loading required package: doMC
Loading required package: foreach
Loading required package: iterators
Considering substitutions, n = 497, processing in 1 chunks
   chunk #: 1
   considering the + strand
Computing local coverage at substitutions...
   considering the - strand
Computing local coverage at substitutions...
> highConfSub <- getHighConfSub( countTable, supportStart = 0.2, supportEnd = 0.7, substitution = "TC" )
> coverage <- coverage( example )
> clusters <- getClusters( highConfSub = highConfSub, 
+                          coverage = coverage, 
+                          sortedBam = example, 
+ 	                 method = 'mrn', 
+ 	                 cores = 1, 
+ 	                 threshold = 2 ) 
Computing start/end read positions
Number of chromosomes exhibiting high confidence transitions: 1
...Processing = chrX
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>