R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Estimate expression of gene splicing variants de novo.

calcDenovo

R Documentation

Estimate expression of gene splicing variants de novo.

Description

calcDenovo estimates expression of gene splicing variants, considering both known variants and variants that have not been previously described.

Usage

calcDenovo(distrs, targetGenomeDB, knownGenomeDB=targetGenomeDB, pc,
readLength, islandid, priorq=3, mprior, minpp=0.001, selectBest=FALSE,
searchMethod="submodels", niter, exactMarginal=TRUE,
integrateMethod="plugin", verbose=TRUE, mc.cores=1)

Arguments

`distrs`	List of fragment distributions as generated by the `getDistrs` function
`targetGenomeDB`	`annotatedGenome` object with isoforms we wish to quantify. By default these are the same as in `knownGenomeDB`, but more typically `targetGenomeDB` is imported from a .gtf file produced by some isoform prediction software.
`knownGenomeDB`	`annotatedGenome` object with known isoforms, e.g. from UCSC or GENCODE annotations. Used to set the prior probability that any given isoform is expressed. `knownGenome` should be the same genome annotations used to create argument `mprior` (when provided)
`pc`	Named vector of exon path counts as returned by `pathCounts`
`readLength`	Read length in bp, e.g. in a paired-end experiment where 75bp are sequenced on each end one would set `readLength=75`.
`islandid`	Name of the gene island to be analyzed. If not specified, all gene islands are analyzed.
`priorq`	Parameter of the prior distribution on the proportion of reads coming from each variant. The prior is Dirichlet with prior sample size for each variant equal to priorq. We recommend `priorq=3` as this defines a non-local prior that penalizes falsely predicted isoforms that show low expression.
`mprior`	Prior on the model space returned by `modelPrior`, used to favor isoforms consistent with `knownGenomeDB`. If left missing it is estimated from `knownGenomeDB`. See details.
`minpp`	Models (i.e. splicing configurations) with posterior probability less than `minpp` are not reported. This argument can help reduce substantially the amount of required memory to store the results.
`selectBest`	If set to `TRUE` only the model with highest posterior probability is reported. While this can save memory, we do not recommend this option as it may ignore a substantial amount of uncertainty.
`searchMethod`	Method used to perform the model search. `"allmodels"` enumerates all possible models (warning: this is not feasible for genes with >5 exons). `"rwmcmc"` uses a random-walk MCMC scheme to focus on models with high posterior probability. `"submodels"` considers that some isoforms in `targetGenomeDB` may not be expressed, but does not search for new variants. `"auto"` uses `"allmodels"` for genes with up to 5 exons and `"rwmcmc"` for longer genes. See details.
`niter`	Number of MCMC iterations.
`exactMarginal`	Set to `FALSE` to estimate posterior model probabilities as the proportion of MCMC visits. Set to `TRUE` to use the integrated likelihoods (default). See details.
`integrateMethod`	Method to compute integrated likelihoods. The default (`'plugin'`) evaluates likelihood*prior at the posterior mode and is the faster option. Set `'Laplace'` for Laplace approximations and `'IS'` for Importance Sampling. The latter increases computation cost very substantially.
`verbose`	Set to `TRUE` to display progress information.
`mc.cores`	Number of processors to be used for parallel computation. Can only be used if the package `multicore` is available for your system. Warning: using multiple processors substantially increases the memory requirements, so set this value carefully.

Details

calcDenovo explores which subset of the isoforms indicated in targetGenomeDB are truly expressed. It also adds new isoforms when some reads follow an exon path that is not possible under any of the isoforms in targetGenomeDB. calcDenovo the posterior probability of each model (i.e. configuration of expressed variants) via Bayes theorem.

P(model|y) "proportional to" m(y|model) P(model)

where m(y|model) is the integrated likelihood and P(model) is the prior probability of the model. For example, a gene with 20 predicted isoforms in targetGenome gives rise 2^20 - 1 possible models (configurations of expressed isoforms).

Importantly, P(model) can be set by analyzing available genome annotations in knownGenomeDB. For instance, genes with 20 exons have isoforms that tend to use most of the 20 exons. They also tend to express more isoforms than genes with 5 exons. The function modelPrior analyzes knownGenomeDB to set reasonable values for P(model).

An exhaustive enumeration of all possible models is not feasible unless the gene is very short (e.g. around 5 exons). For longer genes we use computational strategies to search a subset of "interesting" models. This is controlled by the argument searchMethod (see above).

In order to compute P(model|y) one can either use the computed m(y|model) P(model) (option exactMarginal==TRUE) or the proportion of MCMC visits (option exactMarginal==FALSE). Unless niter is large the former option typically provides more precise estimates.

Value

A denovoGenomeExpr object.

Author(s)

Camille Stephan-Otto Attolini, Manuel Kroiss, David Rossell

References

Rossell D, Stephan-Otto Attolini C, Kroiss M, Stocker A. Quantifying Alternative Splicing from Paired-End RNA-sequencing data. Annals of Applied Statistics, 8(1):309-330.

Examples

## See help(denovoExpr)