R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Estimate and fit dispersions for a CountDataSet.

estimateDispersions

R Documentation

Estimate and fit dispersions for a CountDataSet.

Description

This function obtains dispersion estimates for a count data set. For each condition (or collectively for all conditions, see 'method' argument below) it first computes for each gene an empirical dispersion value (a.k.a. a raw SCV value), then fits by regression a dispersion-mean relationship and finally chooses for each gene a dispersion parameter that will be used in subsequent tests from the empirical and the fitted value according to the 'sharingMode' argument.

Usage

## S4 method for signature 'CountDataSet'
estimateDispersions( object,
   method = c( "pooled", "pooled-CR", "per-condition", "blind" ),
   sharingMode = c( "maximum", "fit-only", "gene-est-only" ),
   fitType = c("parametric", "local"),
   locfit_extra_args=list(), lp_extra_args=list(),
   modelFrame = NULL, modelFormula = count ~ condition, ... )

Arguments

`object`	a `CountDataSet` with size factors.
`method`	There are three ways how the empirical dispersion can be computed: `pooled` - Use the samples from all conditions with replicates to estimate a single pooled empirical dispersion value, called "pooled", and assign it to all samples. `pooled-CR` - Fit models according to `modelFormula` and estimate the dispersion by maximizing a Cox-Reid adjusted profile likelihood (CR-APL). This method is much slower than `method=="pooled"` but works also with crossed factors (as may occur, e.g., in designs with paired samples). Usually, you will need to specify the model formula, which should be the same as the one used later in the call to `nbinomFitGLMs` for fitting the full model. Note: The method of using CR-APL maximization for this application has been developed by McCarthy, Chen and Smyth [Nucl. Acid Res., 2012 and been first implemented in edgeR (in 2010). DESeq optimizes the expression for the CR-APL given in McCarthy et al.'s paper, but does not use the weigthed maximum likelihood scheme proposed there. `per-condition` - For each condition with replicates, compute a gene's empirical dispersion value by considering the data from samples for this condition. For samples of unreplicated conditions, the maximum of empirical dispersion values from the other conditions is used. If `object` has a multivariate design (i.e., if a data frame was passed instead of a factor for the `condition` argument in `newCountDataSet`), this method is not available. (Note: This method was called “normal” in previous versions.) `blind` - Ignore the sample labels and compute a gene's empirical dispersion value as if all samples were replicates of a single condition. This can be done even if there are no biological replicates. This method can lead to loss of power; see the vignette for details. The single estimated dispersion condition is called "blind" and used for all samples.
`sharingMode`	After the empirical dispersion values have been computed for each gene, a dispersion-mean relationship is fitted for sharing information across genes in order to reduce variability of the dispersion estimates. After that, for each gene, we have two values: the empirical value (derived only from this gene's data), and the fitted value (i.e., the dispersion value typical for genes with an average expression similar to those of this gene). The `sharingMode` argument specifies which of these two values will be written to the `featureData`'s `disp_` columns and hence will be used by the functions `nbinomTest` and `fitNbinomGLMs`. `fit-only` - use only the fitted value, i.e., the empirical value is used only as input to the fitting, and then ignored. Use this only with very few replicates, and when you are not too concerned about false positives from dispersion outliers, i.e. genes with an unusually high variability. `maximum` - take the maximum of the two values. This is the conservative or prudent choice, recommended once you have at least three or four replicates and maybe even with only two replicates. `gene-est-only` - No fitting or sharing, use only the empirical value. This method is preferable when the number of replicates is large and the empirical dispersion values are sufficiently reliable. If the number of replicates is small, this option may lead to many cases where the dispersion of a gene is accidentally underestimated and a false positive arises in the subsequent testing.
`fitType`	`parametric` - Fit a dispersion-mean relation of the form `dispersion = asymptDisp + extraPois / mean` via a robust gamma-family GLM. The coefficients `asymptDisp` and `extraPois` are given in the attribute `coefficients` of the `dispFunc` in the `fitInfo` (see below). `local` - Use the locfit package to fit a dispersion-mean relation, as described in the DESeq paper.
`locfit_extra_args, lp_extra_args`	(only for `fitType=local`) Options to be passed to the `locfit` and to the `lp` function of the locfit package. Use this to adjust the local fitting. For example, you may pass a value for `nn` different from the default (0.7) if the fit seems too smooth or too rough by setting `lp_extra_agrs=list(nn=0.9)`. As another example, you can set `locfit_extra_args=list(maxk=200)` if you get the error that locfit ran out of nodes. See the documentation of the `locfit` package for details. In most cases, you will not need to provide these parameters, as the defaults seem to work quite well.
`modelFrame`	By default, the information in `conditions(object)` or `pData(object)` is used to determine which samples are replicates (see `newCountDataSet`). For `method="pooled"`, a data frame can be passed here, and all rows that are identical in this data frame are considered to indicate replicate samples in `object`. For `method="pooled-CR"`, the data frame is used in the fits. For the other methods, this argument is ignored.
`modelFormula`	For `method="pooled-CR"`, this is the formual used for the dispersion fits. For all other methods, this argument is ignored.
`...`	extra arguments are ignored

Details

Behaviour for method="per-condition": For each replicated condition, a list, named with the condition's name, is placed in the environment object@fitInfo. This list has five named elements: The vector perGeneDispEsts contains the empirical dispersions. The function dispFunc is the fitted function, i.e., it takes as its argument a normalized mean expression value and returns the corresponding fitted dispersion. The values fitted according to this function are in the third element fittedDispEst, a vector of the same length as perGeneDispEsts. The fourt element, df, is an integer, indicating the number of degrees of freedom of the per-gene estimation. The fifth element, sharingMode, stores the value of the sharingMode argument to esimateDispersions.

Behaviour for method="blind" and method="pooled": Only one list is produced, named "blind" or "pooled" and placed in object@fitInfo.

For each list in the fitInfo environment, the dispersion values that are intended to be used in subsequent testing are computed according to the value of sharingMode and are placed in the featureData data frame, in a column named with the same name, prefixed with "disp_".

Then, the dispTable (see there) is filled to assign to each condition the appropriate dispersion column in the phenoData frame.

Note: Up to DESeq version 1.4.x (Bioconductor release 2.8), this function was called estimateVarianceFunctions, stored its result differently and did not have the arguments sharingMode and fitType. estimatevarianceFunction's behaviour corresponded to the settings sharingMode="fit-only" and fitType="local". Note that these are not the default, because the new defaults sharingMode="maximum" and fitType="parametric" are more robust and tend to give better results.

Value

The CountDataSet cds, with the slots fitInfo and featureData updated as described in Details.

Author(s)

Simon Anders, sanders@fs.tum.de

Examples

cds <- makeExampleCountDataSet()
cds <- estimateSizeFactors( cds )
cds <- estimateDispersions( cds )
str( fitInfo( cds ) )
head( fData( cds ) )

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(DESeq)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: locfit
locfit 1.5-9.1 	 2013-03-22
Loading required package: lattice
    Welcome to 'DESeq'. For improved performance, usability and
    functionality, please consider migrating to 'DESeq2'.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/DESeq/estimateDispersions.Rd_%03d_medium.png", width=480, height=480)
> ### Name: estimateDispersions
> ### Title: Estimate and fit dispersions for a CountDataSet.
> ### Aliases: estimateDispersions,CountDataSet-method estimateDispersions
> 
> ### ** Examples
> 
> cds <- makeExampleCountDataSet()
> cds <- estimateSizeFactors( cds )
> cds <- estimateDispersions( cds )
> str( fitInfo( cds ) )
List of 5
 $ perGeneDispEsts: num [1:10000] 0.0554 1.1392 0.4875 0.2173 0.053 ...
 $ dispFunc       :function (q)  
  ..- attr(*, "coefficients")= Named num [1:2] 0.217 0.828
  .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
  ..- attr(*, "fitType")= chr "parametric"
 $ fittedDispEsts : num [1:10000] 0.218 0.261 0.218 0.24 0.218 ...
 $ df             : int 3
 $ sharingMode    : chr "maximum"
> head( fData( cds ) )
         disp_pooled
gene_1_F   0.2175652
gene_2_F   1.1392220
gene_3_F   0.4874638
gene_4_F   0.2402484
gene_5_T   0.2180398
gene_6_F   0.5299781
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>