R: Estimate and fit dispersions for a CountDataSet.
estimateDispersions
R Documentation
Estimate and fit dispersions for a CountDataSet.
Description
This function obtains dispersion estimates for a count data set. For each condition
(or collectively for all conditions, see 'method' argument below) it first computes
for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression
a dispersion-mean relationship and finally chooses for each gene a
dispersion parameter that will be used in subsequent tests from the
empirical and the fitted value according to the 'sharingMode'
argument.
There are three ways how the empirical dispersion can be computed:
pooled - Use the samples from all conditions with
replicates to estimate a single pooled empirical dispersion value,
called "pooled", and assign it to all samples.
pooled-CR - Fit models according to modelFormula and
estimate the dispersion by maximizing a Cox-Reid adjusted profile
likelihood (CR-APL). This method
is much slower than method=="pooled" but works also with crossed
factors (as may occur, e.g., in designs with paired samples).
Usually, you will need to specify the model formula, which should be the
same as the one used later in the call to nbinomFitGLMs for fitting the full model.
Note: The method of using CR-APL maximization for this application has been
developed by McCarthy, Chen and Smyth [Nucl. Acid Res., 2012 and been
first implemented in edgeR (in 2010). DESeq optimizes the expression
for the CR-APL given in McCarthy et al.'s paper, but does not use the
weigthed maximum likelihood scheme proposed there.
per-condition - For each condition with replicates, compute
a gene's empirical dispersion value by considering the data from samples for this
condition. For samples of unreplicated conditions, the maximum
of empirical dispersion values from the other conditions is used. If object has a
multivariate design (i.e., if a data frame was passed instead of a
factor for the condition argument in
newCountDataSet), this method is not available. (Note: This method was called “normal” in
previous versions.)
blind - Ignore the sample labels and compute a
gene's empirical dispersion value as if all samples were replicates of a
single condition. This can be done even if there are no biological
replicates. This method can lead to loss of power; see
the vignette for details. The single estimated dispersion condition
is called "blind" and used for all samples.
sharingMode
After the empirical dispersion values have been computed for each
gene, a dispersion-mean relationship is fitted for sharing
information across genes in order to reduce variability of the
dispersion estimates. After that, for each gene, we have two values: the
empirical value (derived only from this gene's data), and the
fitted value (i.e., the dispersion value typical for genes with an
average expression similar to those of this gene). The
sharingMode argument specifies which of these two values
will be written to the featureData's disp_ columns
and hence will be used by the functions nbinomTest
and fitNbinomGLMs.
fit-only - use only the fitted value, i.e., the
empirical value is used only as input to the fitting, and then
ignored. Use this only with very few replicates, and when
you are not too concerned about false positives from dispersion outliers, i.e. genes
with an unusually high variability.
maximum - take the maximum of the two values. This is
the conservative or prudent choice, recommended once you have at
least three or four replicates and maybe even with only two replicates.
gene-est-only - No fitting or sharing, use only the
empirical value. This method is preferable when the number of
replicates is large and the empirical dispersion values are
sufficiently reliable. If the number of replicates is small, this
option may lead to many cases where the dispersion
of a gene is accidentally underestimated and a false positive arises in
the subsequent testing.
fitType
parametric - Fit a dispersion-mean relation of the
form dispersion = asymptDisp + extraPois / mean via a robust
gamma-family GLM. The coefficients asymptDisp and extraPois
are given in the attribute coefficients of the dispFunc
in the fitInfo (see below).
local - Use the locfit package to fit a dispersion-mean
relation, as described in the DESeq paper.
locfit_extra_args, lp_extra_args
(only for fitType=local)
Options to be passed to the locfit and to the lp
function of the locfit package. Use this to adjust the local
fitting. For example, you may pass a value for nn different
from the default (0.7) if the fit seems too smooth or too rough by
setting lp_extra_agrs=list(nn=0.9). As another example, you
can set locfit_extra_args=list(maxk=200) if you get the
error that locfit ran out of nodes. See the documentation of the
locfit package for details. In most cases, you will not
need to provide these parameters, as the defaults seem to work
quite well.
modelFrame
By default, the information in conditions(object) or pData(object) is used to
determine which samples are replicates (see newCountDataSet).
For method="pooled", a data frame can be passed
here, and all rows that are identical in this data frame are considered
to indicate replicate samples in object. For method="pooled-CR",
the data frame is used in the fits. For the other methods, this argument is
ignored.
modelFormula
For method="pooled-CR", this is the formual used for the dispersion fits.
For all other methods, this argument is ignored.
...
extra arguments are ignored
Details
Behaviour for method="per-condition": For each replicated condition, a list, named
with the condition's name, is placed in the environment object@fitInfo. This list
has five named elements: The vector perGeneDispEsts contains the
empirical dispersions. The function dispFunc
is the fitted function, i.e., it takes as its argument a normalized
mean expression value and returns the corresponding
fitted dispersion. The values fitted according to this function are
in the third element fittedDispEst, a vector of the same
length as perGeneDispEsts. The fourt element, df,
is an integer, indicating the number of degrees of freedom of
the per-gene estimation. The fifth element, sharingMode,
stores the value of the sharingMode argument to
esimateDispersions.
Behaviour for method="blind" and method="pooled": Only one list is produced,
named "blind" or "pooled" and placed in object@fitInfo.
For each list in the fitInfo environment, the dispersion
values that are intended to be used in subsequent testing are computed according to
the value of sharingMode and are placed in the
featureData data frame, in a column
named with the same name, prefixed with "disp_".
Then, the dispTable (see there) is filled to assign to each
condition the appropriate dispersion column in the phenoData frame.
Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was
called estimateVarianceFunctions, stored its result differently and
did not have the arguments sharingMode and fitType.
estimatevarianceFunction's behaviour corresponded
to the settings sharingMode="fit-only" and fitType="local". Note that
these are not the default, because the new defaults sharingMode="maximum"
and fitType="parametric" are more robust and tend to give better results.
Value
The CountDataSet cds, with the slots fitInfo and
featureData updated as described in Details.
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(DESeq)
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: locfit
locfit 1.5-9.1 2013-03-22
Loading required package: lattice
Welcome to 'DESeq'. For improved performance, usability and
functionality, please consider migrating to 'DESeq2'.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/DESeq/estimateDispersions.Rd_%03d_medium.png", width=480, height=480)
> ### Name: estimateDispersions
> ### Title: Estimate and fit dispersions for a CountDataSet.
> ### Aliases: estimateDispersions,CountDataSet-method estimateDispersions
>
> ### ** Examples
>
> cds <- makeExampleCountDataSet()
> cds <- estimateSizeFactors( cds )
> cds <- estimateDispersions( cds )
> str( fitInfo( cds ) )
List of 5
$ perGeneDispEsts: num [1:10000] 0.0554 1.1392 0.4875 0.2173 0.053 ...
$ dispFunc :function (q)
..- attr(*, "coefficients")= Named num [1:2] 0.217 0.828
.. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
..- attr(*, "fitType")= chr "parametric"
$ fittedDispEsts : num [1:10000] 0.218 0.261 0.218 0.24 0.218 ...
$ df : int 3
$ sharingMode : chr "maximum"
> head( fData( cds ) )
disp_pooled
gene_1_F 0.2175652
gene_2_F 1.1392220
gene_3_F 0.4874638
gene_4_F 0.2402484
gene_5_T 0.2180398
gene_6_F 0.5299781
>
>
>
>
>
> dev.off()
null device
1
>