Last data update: 2014.03.03

R: Decompose the gene-level variance
decomposeVarR Documentation

Decompose the gene-level variance

Description

Decompose the gene-specific variance into biological and technical components for single-cell RNA-seq data.

Usage

## S4 method for signature 'matrix,list'
decomposeVar(x, fit, design=NA)
## S4 method for signature 'SCESet,list'
decomposeVar(x, fit, ..., assay="exprs", get.spikes=FALSE)

Arguments

x

A numeric matrix of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SCESet object containing such a matrix.

fit

A list containing the output of trendVar, run on log-expression values for spike-in genes.

design

A numeric matrix describing the systematic factors contributing to expression in each cell.

...

Additional arguments to pass to decomposeVar,matrix,list-method.

assay

A string specifying which assay values to use, e.g., counts or exprs.

get.spikes

A logical scalar specifying whether decomposition should be performed for spike-ins.

Details

This function computes the variance of the log-CPMs for each endogenous gene. The technical component of the variance for each gene is determined by interpolating the fitted trend in fit at the mean log-CPM for that gene. This represents variance due to sequencing noise, variability in capture efficiency, etc. The biological component is determined by subtracting the technical component from the total variance.

Highly variable genes (HVGs) can be identified as those with large biological components. Unlike other methods for decomposition, this approach estimates the variance of the log-CPMs rather than of the counts themselves. The log-transformation blunts the impact of large positive outliers and ensures that the HVG list is not dominated by outliers. Interpretation is not compromised – HVGs will still be so, regardless of whether counts or log-CPMs are considered.

The design matrix can be set if there are factors that should be blocked, e.g., batch effects, known (and uninteresting) clusters. If NULL, it will be set to an all-ones matrix, i.e., all cells are replicates. If NA, it will be extracted from fit$design, assuming that the same cells were used to fit the trend.

Value

A data frame is returned, containing:

mean:

A numeric vector of mean log-CPMs for all cellular genes.

total:

A numeric vector of the variances of log-CPMs for all cellular genes.

bio:

A numeric vector containing the biological component of the variance for all genes.

tech:

A numeric vector containing the technical component of the variance for all genes.

Rows corresponding to spike-in transcripts are set to NA unless get.spikes=TRUE.

Author(s)

Aaron Lun

See Also

trendVar

Examples

set.seed(100)

nspikes <- ncells <- 100
spike.means <- 2^runif(nspikes, 3, 8)
spike.disp <- 100/spike.means + 0.5
spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)

ngenes <- 10000
cell.means <- 2^runif(ngenes, 2, 10)
cell.disp <- 100/cell.means + 0.5
cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)

combined <- rbind(cell.data, spike.data)
colnames(combined) <- seq_len(ncells)
rownames(combined) <- seq_len(nrow(combined))
y <- newSCESet(countData=combined)
isSpike(y) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))

# Normalizing.
y <- computeSpikeFactors(y) # or computeSumFactors
y <- normalize(y)

# Decomposing technical and biological noise.
fit <- trendVar(y)
results <- decomposeVar(y, fit)

plot(results$mean, results$total)
o <- order(results$mean)
lines(results$mean[o], results$tech[o], col="red", lwd=2)

plot(results$mean, results$bio)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(scran)
Loading required package: BiocParallel
Loading required package: scater
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: ggplot2

Attaching package: 'scater'

The following object is masked from 'package:stats':

    filter

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/scran/decomposeVar.Rd_%03d_medium.png", width=480, height=480)
> ### Name: decomposeVar
> ### Title: Decompose the gene-level variance
> ### Aliases: decomposeVar decomposeVar,matrix,list-method
> ###   decomposeVar,SCESet,list-method
> ### Keywords: variance
> 
> ### ** Examples
> 
> set.seed(100)
> 
> nspikes <- ncells <- 100
> spike.means <- 2^runif(nspikes, 3, 8)
> spike.disp <- 100/spike.means + 0.5
> spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)
> 
> ngenes <- 10000
> cell.means <- 2^runif(ngenes, 2, 10)
> cell.disp <- 100/cell.means + 0.5
> cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)
> 
> combined <- rbind(cell.data, spike.data)
> colnames(combined) <- seq_len(ncells)
> rownames(combined) <- seq_len(nrow(combined))
> y <- newSCESet(countData=combined)
> isSpike(y) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))
> 
> # Normalizing.
> y <- computeSpikeFactors(y) # or computeSumFactors
> y <- normalize(y)
> 
> # Decomposing technical and biological noise.
> fit <- trendVar(y)
> results <- decomposeVar(y, fit)
> 
> plot(results$mean, results$total)
> o <- order(results$mean)
> lines(results$mean[o], results$tech[o], col="red", lwd=2)
> 
> plot(results$mean, results$bio)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>