R: Fit the exomeCopy or exomeCopyVar model to the observed...
exomeCopy
R Documentation
Fit the exomeCopy or exomeCopyVar model to the observed counts.
Description
Fits a hidden Markov model to observed read counts using positional
covariates. It returns an object containing the fitted parameters and
the Viterbi path, the most likely path of hidden states, which is the
predicted copy count at each window.
exomeCopy is designed to run on read counts from consecutive
genomic ranges on a single chromosome, as it tries to identify higher
or lower read depth relative to a baseline. Please see the vignette
for an example of how to prepare input data for exomeCopy, how
to loop the function over multiple chromosomes and samples, and how to
extract the resulting predicted CNVs.
exomeCopy requires as input a RangedData object
containing read counts in genomic ranges along with the covariates.
Some convenience functions are provided for preparing input for
exomeCopy:
subdivideGRanges, to subdivide a GRanges
object containing the genomic ranges of the targeted region into
genomic ranges of nearly equal width,
countBamInGRanges, to count the number of read starts
from a BAM read mapping file in a GRanges object,
getGCcontent, to get the GC-content in the ranges
given a FASTA file of the reference sequence,
generateBackground, to calculate median normalized
read depth over a set of control samples, and also any statistic over
normalized read depth.
A RangedData object with the sample counts and positional covariates
over the genomic ranges.
sample.name
The name of the value column of rdata with the sample read counts.
X.names
The names of the value columns of rdata with covariates for
estimating mu.
Y.names
(optional) the names of the value columns of rdata with covariates for
estimating phi, only required if fit.var = TRUE.
fit.var
A logical, whether the model should fit the overdispersion parameter
phi with a linear combination of covariates (exomeCopyVar) or with a
scalar (exomeCopy). Defaults to FALSE (exomeCopy).
reltol
The relative tolerance for convergence used in the optim function
for optimizing the parameter settings. From testing, the default
value was sufficient for fitting parameters, but lower relative
tolerances can be used.
S
A vector of possible copy numbers for the different states.
d
The expected copy number for the normal state. This should be set
to 2 for autosomes and 1 for haploid data.
goto.cnv
The initial setting for probability to transfer to a CNV state.
goto.normal
The initial setting for probability to transfer to a normal state.
init.phi
Either "norm" or "counts": initialize phi with the moment estimate
using residuals from a linear model of read counts on covariates or
with the raw counts.
Details
exomeCopy fits transitional and emission parameters of an HMM to best
explain the observed counts of a sample from exome or targeted
sequencing. The set of underlying copy number states, S, in the
sample must be provided before running the algorithm.
The emission probabilities are given as a negative binomial
distribution using positional covariates, such as log background read
depth, quadratic terms for GC-content, and range width, which are
stored in a matrix X. Optionally, for fitting the variance of the
distribution, the standard deviation and/or variance of the background
set can be included in a matrix Y. All covariates are normalized
within exomeCopy for improved optimization.
For the observed count at range t, O_t, the emission
probability is given by:
O_t ~ NB(mu_ti, phi)
The mean parameter mu_ti is given by:
mu_ti
= (S_i / d) e^(x_t* beta)
Here S_i is the i-th possible copy number state, d is the
expected background copy number (d = 2 for diploid sequence), and
beta is a vector of coefficients fitted by the
model. x_t* is the t-th row of the matrix X.
mu must be positive, so it is replaced with a small positive
number if the value is less than zero.
For exomeCopyVar, which also fits the variance, the emission
probability includes a location-dependent dispersion parameter.
where
log(phi_t) = y_t* gamma
or a small positive number if this is less than zero.
Two transition probabilities are fitted in the model: the
probabilities of transitioning to a normal state and to a CNV state.
exomeCopy calls negLogLike to evaluate the likelihood of
the HMM. The parameters are fit using Nelder-Mead optimization with
the optim function on the negative likelihood. The
Viterbi path is calculated by calling viterbiPath.
Value
Returns an ExomeCopy-class object. See this page for the
slot descriptions. Also see the vignette and
copyCountSegments on how to extract segments.
Author(s)
Michael Love
References
Love, Michael I.; Mysickova, Alena; Sun, Ruping; Kalscheuer, Vera;
Vingron, Martin; and Haas, Stefan A. (2011) "Modeling Read Counts for
CNV Detection in Exome Sequencing Data," Statistical Applications in
Genetics and Molecular Biology: Vol. 10 : Iss. 1, Article 52. DOI: 10.2202/1544-6115.1732
http://cmb.molgen.mpg.de/publications/Love_2011_exomeCopy.pdf.
References for HMM algorithms and use of HMM for segmentation of
genomic data by copy number:
Rabiner, L. R. (1989): "A tutorial on hidden Markov models and
selected applications in speech recognition," Proceedings of the
IEEE, 77, 257, 286, http://dx.doi.org/10.1109/5.18626.
Fridlyand, J., A. M. Snijders, D. Pinkel, D. G. Albertson, and Jain
(2004): "Hidden Markov models approach to the analysis of array CGH
data," Journal of Multivariate Analysis, 90, 132, 153,
http://dx.doi.org/10.1016/j.jmva.2004.02.008.
Marioni, J. C., N. P. Thorne, and S. Tavare (2006):"BioHMM: a
heterogeneous hidden Markov model for segmenting array CGH data."
Bioinformatics, 22, 1144, 1146,
http://view.ncbi.nlm.nih.gov/pubmed/16533818.
## The following is an example of running exomeCopy on simulated
## read counts using the model parameters defined above. For an example
## using real exome sequencing read counts (with simulated CNV) please
## see the vignette.
## create RangedData for storing genomic ranges and covariate data
## (background, background stdev, GC-content)
m <- 5000
rdata <- RangedData(IRanges(start=0:(m-1)*100+1,width=100),
space=rep("chr1",m), universe="hg19", log.bg=rnorm(m), log.bg.var=rnorm(m),
gc=runif(m,30,50))
## create read depth distributional parameters mu and phi
rdata$gc.sq <- rdata$gc^2
X <- cbind(bg=rdata$log.bg,gc=rdata$gc,gc.sq=rdata$gc.sq)
Y <- cbind(bg.sd=rdata$log.bg.var)
beta <- c(5,1,.01,-.01)
gamma <- c(-3,.1)
rdata$mu <- exp(beta[1] + scale(X) %*% beta[2:4])
rdata$phi <- exp(gamma[1] + scale(Y) %*% gamma[2])
## create observed counts with simulated heterozygous duplication
cnv.nranges <- 200
bounds <- (round(m/2)+1):(round(m/2)+cnv.nranges)
O <- rnbinom(nrow(rdata),mu=rdata$mu,size=1/rdata$phi)
O[bounds] <- O[bounds] + rbinom(cnv.nranges,prob=0.5,size=O[bounds])
rdata[["sample1"]] <- O
## run exomeCopy() and list segments
fit <- exomeCopy(rdata,"sample1",X.names=c("log.bg","gc","gc.sq"))
# an example call with variance fitting.
# see paper: this does not necessarily improve the fit
fit <- exomeCopy(rdata,"sample1",X.names=c("log.bg","gc","gc.sq"),
Y.names="log.bg",fit.var=TRUE)
## see man page for copyCountSegments() for summary of
## the predicted segments of constant copy count, and
## for plot.ExomeCopy() for plotting fitted objects
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(exomeCopy)
Loading required package: IRanges
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: Rsamtools
Loading required package: Biostrings
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/exomeCopy/exomeCopy.Rd_%03d_medium.png", width=480, height=480)
> ### Name: exomeCopy
> ### Title: Fit the exomeCopy or exomeCopyVar model to the observed counts.
> ### Aliases: exomeCopy
>
> ### ** Examples
>
>
> ## The following is an example of running exomeCopy on simulated
> ## read counts using the model parameters defined above. For an example
> ## using real exome sequencing read counts (with simulated CNV) please
> ## see the vignette.
>
> ## create RangedData for storing genomic ranges and covariate data
> ## (background, background stdev, GC-content)
>
> m <- 5000
> rdata <- RangedData(IRanges(start=0:(m-1)*100+1,width=100),
+ space=rep("chr1",m), universe="hg19", log.bg=rnorm(m), log.bg.var=rnorm(m),
+ gc=runif(m,30,50))
>
> ## create read depth distributional parameters mu and phi
> rdata$gc.sq <- rdata$gc^2
> X <- cbind(bg=rdata$log.bg,gc=rdata$gc,gc.sq=rdata$gc.sq)
> Y <- cbind(bg.sd=rdata$log.bg.var)
> beta <- c(5,1,.01,-.01)
> gamma <- c(-3,.1)
> rdata$mu <- exp(beta[1] + scale(X) %*% beta[2:4])
> rdata$phi <- exp(gamma[1] + scale(Y) %*% gamma[2])
>
> ## create observed counts with simulated heterozygous duplication
> cnv.nranges <- 200
> bounds <- (round(m/2)+1):(round(m/2)+cnv.nranges)
> O <- rnbinom(nrow(rdata),mu=rdata$mu,size=1/rdata$phi)
> O[bounds] <- O[bounds] + rbinom(cnv.nranges,prob=0.5,size=O[bounds])
> rdata[["sample1"]] <- O
>
> ## run exomeCopy() and list segments
> fit <- exomeCopy(rdata,"sample1",X.names=c("log.bg","gc","gc.sq"))
>
> # an example call with variance fitting.
> # see paper: this does not necessarily improve the fit
> fit <- exomeCopy(rdata,"sample1",X.names=c("log.bg","gc","gc.sq"),
+ Y.names="log.bg",fit.var=TRUE)
>
> ## see man page for copyCountSegments() for summary of
> ## the predicted segments of constant copy count, and
> ## for plot.ExomeCopy() for plotting fitted objects
>
>
>
>
>
>
> dev.off()
null device
1
>