varianceStabilizingTransformation
R Documentation
Apply a variance stabilizing transformation (VST) to the count data
Description
This function calculates a variance stabilizing transformation (VST) from the
fitted dispersion-mean relation(s) and then transforms the count data (normalized
by division by the size factors or normalization factors), yielding a matrix
of values which are now approximately homoskedastic (having constant variance along the range
of mean values). The transformation also normalizes with respect to library size.
The rlog is less sensitive
to size factors, which can be an issue when size factors vary widely.
These transformations are useful when checking for outliers or as input for
machine learning techniques such as clustering or linear discriminant analysis.
blind
logical, whether to blind the transformation to the experimental
design. blind=TRUE should be used for comparing samples in a manner unbiased by
prior information on samples, for example to perform sample QA (quality assurance).
blind=FALSE should be used for transforming data for downstream analysis,
where the full use of the design information should be made.
blind=FALSE will skip re-estimation of the dispersion trend, if this has already been calculated.
If many genes have large differences in counts due to
the experimental design, it is important to set blind=FALSE for downstream
analysis.
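As a minimal sketch of the two settings described for blind (assuming the DESeq2 package is installed, and using its simulated example data rather than a real experiment), the same dataset can be transformed both ways. The object names vsd_qa and vsd_downstream are illustrative only.

```r
library(DESeq2)

set.seed(1)  # simulated counts; the seed is only for reproducibility
dds <- makeExampleDESeqDataSet(m = 6)

# blind = TRUE (the default): the transformation ignores the experimental
# design, which suits sample-level quality assessment
vsd_qa <- varianceStabilizingTransformation(dds, blind = TRUE)

# blind = FALSE: the design stored in dds is used, which suits
# transformed data intended for downstream analysis
vsd_downstream <- varianceStabilizingTransformation(dds, blind = FALSE)

# Both calls return transformed values with the same dimensions
dim(assay(vsd_qa))
dim(assay(vsd_downstream))
```

The simulated data here have no strong design effect, so the two results are expected to be similar; the distinction matters more when many genes differ between conditions.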
fitType
in case dispersions have not yet been estimated for object,
this parameter is passed on to estimateDispersions (options described there).
Details
For each sample (i.e., column of counts(dds)), the full variance function
is calculated from the raw variance (by scaling according to the size factor and adding
the shot noise). We recommend a blind estimation of the variance function, i.e.,
one ignoring conditions. This is performed by default, and can be modified using the
'blind' argument.
Note that neither rlog transformation nor the VST are used by the
differential expression estimation in DESeq, which always
occurs on the raw count data, through generalized linear modeling which
incorporates knowledge of the variance-mean dependence. The rlog transformation
and VST are offered as separate functionality which can be used for visualization,
clustering or other machine learning tasks. See the transformation section of the
vignette for more details.
The transformation does not require that one has already estimated size factors
and dispersions.
A typical workflow is shown in Section Variance stabilizing transformation
in the package vignette.
If estimateDispersions was called with:
fitType="parametric",
a closed-form expression for the variance stabilizing
transformation is used on the normalized
count data. The expression can be found in the file ‘vst.pdf’
which is distributed with the vignette.
fitType="local",
the reciprocal of the square root of the variance of the normalized counts, as derived
from the dispersion fit, is then numerically
integrated, and the integral (approximated by a spline function) is evaluated for each
count value in the column, yielding a transformed value.
fitType="mean", a VST is applied for Negative Binomial distributed counts, 'k',
with a fixed dispersion, 'a': ( 2 asinh(sqrt(a k)) - log(a) - log(4) )/log(2).
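The numerical approach described for fitType="local" can be sketched in base R. This is an illustration under stated assumptions, not the code used inside the package: the variance of a normalized count with mean mu is assumed to be mu + alpha(mu) * mu^2, the function name vst_by_integration and the grid size are hypothetical, and the dispersion function must return finite values at mu = 0. A constant dispersion is used only because the integral then has a closed form to check against.

```r
# Sketch only: integrate 1/sqrt(variance) and approximate it by a spline
vst_by_integration <- function(k, dispersion_fn, n_grid = 2001) {
  # Work in u = sqrt(mu) so the integrand stays finite at mu = 0:
  #   integral of 1 / sqrt(mu + alpha(mu) * mu^2) dmu
  #     = integral of 2 / sqrt(1 + alpha(u^2) * u^2) du
  u <- seq(0, sqrt(max(k)), length.out = n_grid)
  g <- 2 / sqrt(1 + dispersion_fn(u^2) * u^2)
  # Cumulative trapezoidal rule along the grid
  cum <- c(0, cumsum(diff(u) * (g[-1] + g[-n_grid]) / 2))
  # Spline approximation of the integral, evaluated at each count value
  splinefun(u, cum)(sqrt(k))
}

# Hypothetical constant dispersion; for this case the integral equals
# (2 / sqrt(a)) * asinh(sqrt(a * k))
a <- 0.1
k <- c(0, 1, 10, 100, 1000)
numeric_vst <- vst_by_integration(k, function(mu) rep(a, length(mu)))
closed_form <- (2 / sqrt(a)) * asinh(sqrt(a * k))

# Largest disagreement between the numerical and closed-form values
max(abs(numeric_vst - closed_form))
```

The result is unscaled; an affine rescaling would be needed to match the base-2 logarithm for large counts.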
In all cases, the transformation is scaled such that, for large
counts, it becomes asymptotically equal to the base-2 logarithm
of the normalized counts.
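The asymptotic behaviour can be checked directly for the fitType="mean" formula, which is simple enough to transcribe into base R. The function name vst_fixed_dispersion and the values of a and k below are illustrative assumptions, not package defaults.

```r
# VST for negative binomial counts k with fixed dispersion a,
# transcribed from the fitType="mean" formula in the text
vst_fixed_dispersion <- function(k, a) {
  (2 * asinh(sqrt(a * k)) - log(a) - log(4)) / log(2)
}

a <- 0.1                      # illustrative dispersion (assumed value)
k <- c(1, 10, 100, 1e4, 1e6)  # illustrative normalized counts

# Gap between the transformed value and log2(k)
gap <- vst_fixed_dispersion(k, a) - log2(k)
data.frame(k = k, vst = vst_fixed_dispersion(k, a), log2k = log2(k), gap = gap)
```

The gap shrinks toward zero as k grows, because asinh(x) approaches log(2x) for large x, so the log(a) and log(4) terms cancel.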
The variance stabilizing transformation from a previous dataset
can be frozen and reapplied to new samples. See the 'Data quality assessment'
section of the vignette for strategies to see if new samples are
sufficiently similar to previous datasets.
The frozen VST is accomplished by saving the dispersion function
accessible with dispersionFunction, assigning this
to the DESeqDataSet with the new samples, and running
varianceStabilizingTransformation with 'blind' set to FALSE
(see example below).
Then the dispersion function from the previous dataset will be used
to transform the new sample(s).
Limitations: In order to preserve normalization, the same
transformation has to be used for all samples. As a result, the
variance stabilization is only approximate. The more the size
factors differ, the more residual dependence of the variance on the
mean will be found in the transformed data. rlog is a
transformation which can perform better in these cases.
As shown in the vignette, the function meanSdPlot
from the package vsn can be used to see whether this is a problem.
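A rough base-R stand-in for that check is sketched below, assuming DESeq2 is installed and using simulated data. It plots the per-gene standard deviation against the rank of the per-gene mean, which is the same kind of view; with the vsn package available, meanSdPlot(assay(vsd)) is the function the text refers to.

```r
library(DESeq2)

set.seed(1)  # simulated counts; the seed is only for reproducibility
dds <- makeExampleDESeqDataSet(m = 6)
vsd <- varianceStabilizingTransformation(dds)
mat <- assay(vsd)

# Per-gene mean and standard deviation of the transformed values
row_mean <- rowMeans(mat)
row_sd <- apply(mat, 1, sd)

# A roughly flat cloud suggests little residual dependence of the
# variance on the mean; a clear trend suggests the opposite
plot(rank(row_mean), row_sd, pch = ".",
     xlab = "rank(mean)", ylab = "sd")
```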
Value
varianceStabilizingTransformation returns a
DESeqTransform if a DESeqDataSet was provided,
or returns a matrix if a count matrix was provided.
Note that for DESeqTransform output, the matrix of
transformed values is stored in assay(vsd).
getVarianceStabilizedData also returns a matrix.
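The return types can be inspected with a short sketch, assuming DESeq2 is installed and using simulated data. The names mat and mat2 are illustrative; size factors and dispersions are estimated before getVarianceStabilizedData is called, since that function works from an already fitted dispersion function.

```r
library(DESeq2)

set.seed(1)  # simulated counts; the seed is only for reproducibility
dds <- makeExampleDESeqDataSet(m = 6)

# DESeqDataSet in, DESeqTransform out; values are reached with assay()
vsd <- varianceStabilizingTransformation(dds)
class(vsd)
head(assay(vsd), 3)

# Count matrix in, matrix out
mat <- varianceStabilizingTransformation(counts(dds))
is.matrix(mat)

# getVarianceStabilizedData returns a matrix from a fitted DESeqDataSet
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
mat2 <- getVarianceStabilizedData(dds)
is.matrix(mat2)
```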
Author(s)
Simon Anders
References
Reference for the variance stabilizing transformation for counts with a dispersion trend:
Examples
dds <- makeExampleDESeqDataSet(m=6)
vsd <- varianceStabilizingTransformation(dds)
dists <- dist(t(assay(vsd)))
plot(hclust(dists))
# learn the dispersion function of a dataset
design(dds) <- ~ 1
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
# use the previous dispersion function for a new sample
ddsNew <- makeExampleDESeqDataSet(m=1)
ddsNew <- estimateSizeFactors(ddsNew)
dispersionFunction(ddsNew) <- dispersionFunction(dds)
vsdNew <- varianceStabilizingTransformation(ddsNew, blind=FALSE)
Results
> library(DESeq2)
> dds <- makeExampleDESeqDataSet(m=6)
> vsd <- varianceStabilizingTransformation(dds)
> dists <- dist(t(assay(vsd)))
> plot(hclust(dists))
>
> # learn the dispersion function of a dataset
> design(dds) <- ~ 1
> dds <- estimateSizeFactors(dds)
> dds <- estimateDispersions(dds)
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
>
> # use the previous dispersion function for a new sample
> ddsNew <- makeExampleDESeqDataSet(m=1)
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
all genes have equal values for all samples. will not be able to perform differential analysis
> ddsNew <- estimateSizeFactors(ddsNew)
> dispersionFunction(ddsNew) <- dispersionFunction(dds)
variance of dispersion residuals not estimated (necessary only for differential expression calling)
> vsdNew <- varianceStabilizingTransformation(ddsNew, blind=FALSE)