Last data update: 2014.03.03

R: Select optimal bin size based on Shimazaki formula
selectBinSizeR Documentation

Select optimal bin size based on Shimazaki formula

Description

The function iteratively estimates the cost of increasing bin size within a defined range and finally selects the bin size with minimum cost.

Usage

selectBinSize(alignGR, minBinSize, maxBinSize = 1000, 
	increment = 5, getFullResults = FALSE)

Arguments

alignGR

GRanges object of alignments on a single chromosome

minBinSize

Minimum bin size to start with (Default: 5 * read length)

maxBinSize

Maximum bin size to end with (Default: 1000).

increment

Number of bases to increment the bin size in searching for the optimal bin size within the defined range (Default: 5).

getFullResults

Binary indicator. If TRUE, the optimal bin size (with the minimum cost), minimum cost, and all of the bin sizes considered and their costs are returned. If FALSE, only the optimal bin size is returned.

Details

Based on the preprocessed alignments for a chromosome, RIPSeeker divides the chromosome into bins of equal size b and compute the number of reads that b needs to be determined either empirically (e.g., based on the gel-selected length of the RNA fragment) or computationally. If the bin size is too small, the read counts fluctuates greatly, making it difficult to discern the underlying read count distribution. Additionally, input size to HMM increases as bin size decreases. A very small bin size results in a very long Markov chain of read counts to model, making the computation inefficient. On the other hand, if a bin size is too large, resolution becomes poor. Consequently, one cannot detect the local RIP region with subtle but intrinsic difference from the background, and the RIP regions tend to be too wide for designing specific primer for validation.

Intuitively, selecting an appropriate bin size for each chromosome is metaphorically equivalent to choosing an optimal intervals for building a histogram (Song, 2011). Here we implement the algorithm developed by Shimazaki and Shinomoto (2007), which is based on the goodness of the fit of the time histogram to estimate the rate of neural response of an animal to certain stimuli in a spike-in experiment. This approach has been successfully applied in a recently developed ChIP-seq program (Song and Smith, 2011). Algorithm 1 describes the pseudocode adapted from Shimazaki and Shinomoto (2007) that iteratively estimates the cost C of increasing bin size b within a defined range is outlined as follows.

For b = minBinSize to maxBinSize; do

1. Divide chromosome sequence into N bins of width b.

2. Count number of read counts x_i that enter the i'th bin.

3. Compute: ar{x} = frac{1}{N}∑_{i=1}^{N}x_{i} and v = frac{1}{N}∑_{i=1}^{N}(x_{i} - ar{x})^{2}.

4. Compute: C(b) = frac{2ar{x}-v}{b^{2}}

End For

Choose b that minimize C(b)

Value

When getFullResults is TRUE, return a list containing:

bestBinSize

the optimal bin size (with the minimum cost)

minCosts

cost of the optimal bin size

binSizes

all of the bin sizes considered

costs

all of the costs

When getFullResults is FALSE, only the optimal bin size (bestBinSize) is returned.

Author(s)

Yue Li

References

Hideaki Shimazaki and Shigeru Shinomoto. A method for selecting the bin size of a time histogram. Neural computation, 19(6):1503-1527, June 2007.

Qiang Song and Andrew D. Smith. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England), 27(6):870-871, March 2011.

See Also

evalBinSize, binCount

Examples

# Retrieve system files
extdata.dir <- system.file("extdata", package="RIPSeeker") 

bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)

bamFiles <- grep("PRC2", bamFiles, value=TRUE)

alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9")

alignGR <- as(alignGal, "GRanges")

alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR))))

minBinSize <- 200

maxBinSize <- 1200

gr <- alignGRList$chrX

b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE)

plot(b$binSizes, b$costs)

chrname <- as.character(runValue(seqnames(gr)))

chrlen <- seqlengths(gr)[chrname]

legend("topright", box.lty=0,
		
		sprintf("%s: 1-%d;\nTotal mapped reads: %d;\nOptimal bin size = %d bp",
		
		chrname, chrlen, length(gr), b$bestBinSize))	

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RIPSeeker)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit


Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums

Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
Loading required package: GenomicAlignments
Loading required package: rtracklayer
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/RIPSeeker/selectBinSize.Rd_%03d_medium.png", width=480, height=480)
> ### Name: selectBinSize
> ### Title: Select optimal bin size based on Shimazaki formula
> ### Aliases: selectBinSize
> 
> ### ** Examples
> 
> # Retrieve system files
> extdata.dir <- system.file("extdata", package="RIPSeeker") 
> 
> bamFiles <- list.files(extdata.dir, ".bam$", recursive=TRUE, full.names=TRUE)
> 
> bamFiles <- grep("PRC2", bamFiles, value=TRUE)
> 
> alignGal <- getAlignGal(bamFiles[1], reverseComplement=TRUE, genomeBuild="mm9")
Processing /home/ddbj/local/lib64/R/library/RIPSeeker/extdata/PRC2/SRR039210_processed_tophat/accepted_hits_noDup_sel_chrX.bam ... All hits are returned with flags.
> 
> alignGR <- as(alignGal, "GRanges")
> 
> alignGRList <- GRangesList(as.list(split(alignGR, seqnames(alignGR))))
> 
> minBinSize <- 200
> 
> maxBinSize <- 1200
> 
> gr <- alignGRList$chrX
> 
> b <- selectBinSize(gr, minBinSize, maxBinSize, increment=100, getFullResults=TRUE)
> 
> plot(b$binSizes, b$costs)
> 
> chrname <- as.character(runValue(seqnames(gr)))
> 
> chrlen <- seqlengths(gr)[chrname]
> 
> legend("topright", box.lty=0,
+ 		
+ 		sprintf("%s: 1-%d;\nTotal mapped reads: %d;\nOptimal bin size = %d bp",
+ 		
+ 		chrname, chrlen, length(gr), b$bestBinSize))	
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>