R: Differential expression analysis based on the Negative...
DESeq
R Documentation
Differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution
Description
This function performs a default analysis through the steps:
estimation of size factors: estimateSizeFactors
estimation of dispersion: estimateDispersions
Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
For complete details on each step, see the manual pages of the respective
functions. After the DESeq function returns a DESeqDataSet object,
results tables (log2 fold changes and p-values) can be generated
using the results function. See the manual page
for results for information on independent filtering and
p-value adjustment for multiple test correction.
a DESeqDataSet object, see the constructor functions
DESeqDataSet,
DESeqDataSetFromMatrix,
DESeqDataSetFromHTSeqCount.
test
either "Wald" or "LRT", which will then use either
Wald significance tests (defined by nbinomWaldTest),
or the likelihood ratio test on the difference in deviance between a
full and reduced model formula (defined by nbinomLRT)
fitType
either "parametric", "local", or "mean"
for the type of fitting of dispersions to the mean intensity.
See estimateDispersions for description.
betaPrior
whether or not to put a zero-mean normal prior on
the non-intercept coefficients
See nbinomWaldTest for description of the calculation
of the beta prior. By default, the beta prior is used only for the
Wald test, but can also be specified for the likelihood ratio test.
full
for test="LRT", the full model formula,
which is restricted to the formula in design(object).
alternatively, it can be a model matrix constructed by the user.
advanced use: specifying a model matrix for full and test="Wald"
is possible if betaPrior=FALSE
reduced
for test="LRT", a reduced formula to compare against,
i.e., the full formula with the term(s) of interest removed.
alternatively, it can be a model matrix constructed by the user
quiet
whether to print messages at each step
minReplicatesForReplace
the minimum number of replicates required
in order to use replaceOutliers on a
sample. If there are samples with so many replicates, the model will
be refit after these replacing outliers, flagged by Cook's distance.
Set to Inf in order to never replace outliers.
modelMatrixType
either "standard" or "expanded", which describe
how the model matrix, X of the GLM formula is formed.
"standard" is as created by model.matrix using the
design formula. "expanded" includes an indicator variable for each
level of factors in addition to an intercept. for more information
see the Description of nbinomWaldTest.
betaPrior must be set to TRUE in order for expanded model matrices
to be fit.
parallel
if FALSE, no parallelization. if TRUE, parallel
execution using BiocParallel, see next argument BPPARAM.
A note on running in parallel using BiocParallel: it may be
advantageous to remove large, unneeded objects from your current
R environment before calling DESeq,
as it is possible that R's internal garbage collection
will copy these files while running on worker nodes.
BPPARAM
an optional parameter object passed internally
to bplapply when parallel=TRUE.
If not specified, the parameters last registered with
register will be used.
Details
The differential expression analysis uses a generalized linear model of the form:
K_ij ~ NB(mu_ij, alpha_i)
mu_ij = s_j q_ij
log2(q_ij) = x_j. beta_i
where counts K_ij for gene i, sample j are modeled using
a Negative Binomial distribution with fitted mean mu_ij
and a gene-specific dispersion parameter alpha_i.
The fitted mean is composed of a sample-specific size factor
s_j and a parameter q_ij proportional to the
expected true concentration of fragments for sample j.
The coefficients beta_i give the log2 fold changes for gene i for each
column of the model matrix X.
The sample-specific size factors can be replaced by
gene-specific normalization factors for each sample using
normalizationFactors.
For details on the fitting of the log2 fold changes and calculation of p-values,
see nbinomWaldTest if using test="Wald",
or nbinomLRT if using test="LRT".
Experiments without replicates do not allow for estimation of the dispersion
of counts around the expected value for each group, which is critical for
differential expression analysis. If an experimental design is
supplied which does not contain the necessary degrees of freedom for differential
analysis, DESeq will provide a message to the user and follow
the strategy outlined in Anders and Huber (2010)
under the section 'Working without replicates', wherein all the samples
are considered as replicates of a single group for the estimation of dispersion.
As noted in the reference above: "Some overestimation of the variance
may be expected, which will make that approach conservative."
Furthermore, "while one may not want to draw strong conclusions from such an analysis,
it may still be useful for exploration and hypothesis generation."
The argument minReplicatesForReplace is used to decide which samples
are eligible for automatic replacement in the case of extreme Cook's distance.
By default, DESeq will replace outliers if the Cook's distance is
large for a sample which has 7 or more replicates (including itself).
This replacement is performed by the replaceOutliers
function. This default behavior helps to prevent filtering genes
based on Cook's distance when there are many degrees of freedom.
See results for more information about filtering using
Cook's distance, and the 'Dealing with outliers' section of the vignette.
Unlike the behavior of replaceOutliers, here original counts are
kept in the matrix returned by counts, original Cook's
distances are kept in assays(dds)[["cooks"]], and the replacement
counts used for fitting are kept in assays(dds)[["replaceCounts"]].
Note that if a log2 fold change prior is used (betaPrior=TRUE)
then expanded model matrices will be used in fitting. These are
described in nbinomWaldTest and in the vignette. The
contrast argument of results should be used for
generating results tables.
Value
a DESeqDataSet object with results stored as
metadata columns. These results should accessed by calling the results
function. By default this will return the log2 fold changes and p-values for the last
variable in the design formula. See results for how to access results
for other variables.
Author(s)
Michael Love
References
Michael I Love, Wolfgang Huber, Simon Anders: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014, 15:550. http://dx.doi.org/10.1186/s13059-014-0550-8
See Also
nbinomWaldTest, nbinomLRT
Examples
# see vignette for suggestions on generating
# count tables from RNA-Seq data
cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10)
cond <- factor(rep(1:2, each=5))
# object construction
dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)
# standard analysis
dds <- DESeq(dds)
res <- results(dds)
# an alternate analysis: likelihood ratio test
ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
resLRT <- results(ddsLRT)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(DESeq2)
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/DESeq2/DESeq.Rd_%03d_medium.png", width=480, height=480)
> ### Name: DESeq
> ### Title: Differential expression analysis based on the Negative Binomial
> ### (a.k.a. Gamma-Poisson) distribution
> ### Aliases: DESeq
>
> ### ** Examples
>
>
> # see vignette for suggestions on generating
> # count tables from RNA-Seq data
> cnts <- matrix(rnbinom(n=1000, mu=100, size=1/0.5), ncol=10)
> cond <- factor(rep(1:2, each=5))
>
> # object construction
> dds <- DESeqDataSetFromMatrix(cnts, DataFrame(cond), ~ cond)
converting counts to integer mode
>
> # standard analysis
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
> res <- results(dds)
>
> # an alternate analysis: likelihood ratio test
> ddsLRT <- DESeq(dds, test="LRT", reduced= ~ 1)
using pre-existing size factors
estimating dispersions
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
-- note: fitType='parametric', but the dispersion trend was not well captured by the
function: y = a/x + b, and a local regression fit was automatically substituted.
specify fitType='local' or 'mean' to avoid this message next time.
final dispersion estimates
fitting model and testing
> resLRT <- results(ddsLRT)
>
>
>
>
>
>
> dev.off()
null device
1
>