Last data update: 2014.03.03

R: Get the biological variability
trendVarR Documentation

Get the biological variability

Description

Compute the biological and technical components of the gene-specific variance in single-cell RNA-seq data.

Usage

## S4 method for signature 'matrix'
trendVar(x, trend=c("poly", "loess"), df=5, span=0.3, design=NULL)
## S4 method for signature 'SCESet'
trendVar(x, ..., assay="exprs", use.spikes=TRUE)

Arguments

x

A numeric matrix of normalized expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SCESet object that contains such values.

trend

A string indicating whether the trend should be polynomial or loess-based.

df

An integer scalar specifying the degrees of freedom for polynomial fitting.

span

An numeric scalar specifying the span for loess fitting.

design

A numeric matrix describing the systematic factors contributing to expression in each cell.

...

Additional arguments to pass to trendVar,matrix,list-method.

assay

A string specifying which assay values to use, e.g., counts or exprs.

use.spikes

A logical scalar specifying whether the trend should be fitted to variances for spike-in or endogenous genes.

Details

The strategy is to fit an abundance-dependent trend to the variance of the log-normalized expression for the spike-in genes, using trendVar. For SCESet objects, these expression values can be computed by normalize after setting the size factors, e.g., with computeSpikeFactors. Log-transformed values are used as these tend to be more robust to genes with strong expression in only one or two outlier cells.

The mean and variance of the log-CPMs is calculated for each spike-in gene. A polynomial or loess trend is then fitted to the variance against the mean for all genes. This represents technical variability due to sequencing, drop-outs during capture, etc. Variance decomposition to biological and technical components for endogenous genes can then be performed

The design matrix can be set if there are factors that should be blocked, e.g., batch effects, known (and uninteresting) clusters. Otherwise, it will default to an all-ones matrix, effectively treating all cells as part of the same group.

Value

A named list is returned, containing:

mean:

A numeric vector of mean log-CPMs for all spike-in genes.

var:

A numeric vector of the variances of log-CPMs for all spike-in genes.

trend:

A function that returns the fitted value of the trend at any mean log-CPM.

design:

A numeric matrix, containing the design matrix that was used.

Additional notes

By default, a polynomial trend with df degrees of freedom is fitted to the spike-in variances as it is more precise than the loess curve. Note that this method is rather dependent on the quality of the spike-ins – the fit will obviously be poor if the coverage of all spike-ins is low. In some data sets, a loess curve may yield better results, though this may require some fiddling with the span.

When spike-ins are not available, trendVar can also be applied directly to the counts for endogenous genes by setting use.spikes=FALSE (or by manually supplying a matrix of normalized expression for endogenous genes, for trendVar,matrix-method). This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population. A loess curve is recommended here as it is more robust to a subset of genes with very large or very small variances. If use.spikes=NA, every row will be used for trend fitting, regardless of whether it corresponds to a spike-in transcript or endogenous gene.

Author(s)

Aaron Lun

See Also

poly, loess, decomposeVar, computeSpikeFactors, computeSumFactors, normalize

Examples

set.seed(100)

nspikes <- ncells <- 100
spike.means <- 2^runif(nspikes, 3, 8)
spike.disp <- 100/spike.means + 0.5
spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)

ngenes <- 10000
cell.means <- 2^runif(ngenes, 2, 10)
cell.disp <- 100/cell.means + 0.5
cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)

combined <- rbind(cell.data, spike.data)
colnames(combined) <- seq_len(ncells)
rownames(combined) <- seq_len(nrow(combined))
y <- newSCESet(countData=combined)
isSpike(y) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))

# Normalizing.
y <- computeSpikeFactors(y) # or computeSumFactors.
y <- normalize(y)

# Fitting a trend to the spike-ins.
fit <- trendVar(y)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)

# Fitting a trend to the endogenous genes. 
fit <- trendVar(y, use.spikes=FALSE)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(scran)
Loading required package: BiocParallel
Loading required package: scater
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: ggplot2

Attaching package: 'scater'

The following object is masked from 'package:stats':

    filter

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/scran/trendVar.Rd_%03d_medium.png", width=480, height=480)
> ### Name: trendVar
> ### Title: Get the biological variability
> ### Aliases: trendVar trendVar,matrix-method trendVar,SCESet-method
> ### Keywords: variance
> 
> ### ** Examples
> 
> set.seed(100)
> 
> nspikes <- ncells <- 100
> spike.means <- 2^runif(nspikes, 3, 8)
> spike.disp <- 100/spike.means + 0.5
> spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)
> 
> ngenes <- 10000
> cell.means <- 2^runif(ngenes, 2, 10)
> cell.disp <- 100/cell.means + 0.5
> cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)
> 
> combined <- rbind(cell.data, spike.data)
> colnames(combined) <- seq_len(ncells)
> rownames(combined) <- seq_len(nrow(combined))
> y <- newSCESet(countData=combined)
> isSpike(y) <- rep(c(FALSE, TRUE), c(ngenes, nspikes))
> 
> # Normalizing.
> y <- computeSpikeFactors(y) # or computeSumFactors.
> y <- normalize(y)
> 
> # Fitting a trend to the spike-ins.
> fit <- trendVar(y)
> plot(fit$mean, fit$var)
> x <- sort(fit$mean)
> lines(x, fit$trend(x), col="red", lwd=2)
> 
> # Fitting a trend to the endogenous genes. 
> fit <- trendVar(y, use.spikes=FALSE)
> plot(fit$mean, fit$var)
> x <- sort(fit$mean)
> lines(x, fit$trend(x), col="red", lwd=2)
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>