Mappability threshold [0, 1] below which points are ignored during creating
the correction curve.
samplesize
The number of points sampled during LOESS fitting, decreasing reduces
runtime and memory usage, at the expense of robustness to data randomness.
verbose
Set to FALSE it messages are not desired.
Details
Input read counts are contained in the IRanges object, where number of reads
within bins (or sometimes called windows) of defined genomic size are
specified. GC content should also be computed using the exact same boundaries
for each bin.
Ensure that the GC content and mappability files have been mapped to the same genome
build (e.g. hg18) as the tumour and normal read libraries.
Here is a summary of the correction procedure.
Filter out bins with 0 reads and 0 GC content
Filter out bins with reads within the top and bottom 1% quantile (assumed to be outliers)
Filter out bins with GC content within the top and bottom 1% quantile
Filter out bins with a mappability score of greater than 0.9 ('mappability' argument).
Randomly sample up to 50000 ('samplesize' argument) of the remaining high-quality bins for the purposes keeping a good runtime.
The first loess (on the reads-by-GC curve) with a small span (smoothing window) is performed, obtaining typically a highly sensitive curve (follows low density tails of distribution, but gets jagged in high density center).
A second loess (on the first loess results) with a larger span is performed, recapitulating the curve in the low density tails and smoothing out the jagged regions in the high density center.
'cor.gc' is obtained by correcting each bin for GC content. The number of observed reads is divided by the number of reads predicted by the loess curve given an observed GC proportion.
Filter out just the top 1% quantile of the cor.gc bins, then _randomly_ sample up to 50000 ('samplesize' argument) bins.
A separate lowess curve is computed for mappability-by-GC (cor.gc).
'cor.map' is obtained by correcting each bin for mappability. The cor.gc value is divided by the cor.gc value predicted by the mappability lowess curve generated in the previous step.
'copy' is obtained by setting all cor.map values <= to NA (i.e. NaN), then apply log2
Value
The original A RangedData object
appended with 5 new columns:
valid
Valid bins, which have valid read, gc, and mappability values
ideal
Ideal bins of high mappability and no outliers
cor.gc
GC-corrected readcounts
cor.map
Mappability corrected GC-corrected readcounts
copy
cor.map values after log base 2
Author(s)
Daniel Lai
References
Yuval Benjamini and Terence P Speed. Summarizing and correcting the gc content bias
in high-throughput sequencing. Nucleic Acids Res, 40(10):e72, May 2012.
See Also
wigsToRangedData to easily generate the proper input.
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(HMMcopy)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: geneplotter
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: lattice
Loading required package: annotate
Loading required package: AnnotationDbi
Loading required package: XML
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/HMMcopy/correctReadcount.Rd_%03d_medium.png", width=480, height=480)
> ### Name: correctReadcount
> ### Title: Readcount correction for GC and mappability bias
> ### Aliases: correctReadcount
> ### Keywords: manip
>
> ### ** Examples
>
> data(tumour) # Load tumour_reads
> tumour_copy <- correctReadcount(tumour_reads)
Applying filter on data...
Correcting for GC bias...
Correcting for mappability bias...
>
>
>
>
>
> dev.off()
null device
1
>