Last data update: 2014.03.03

R: LDA model inference
compute.lda {cellTree}	R Documentation

LDA model inference

Description

This function fits a Latent Dirichlet Allocation (LDA) model to single-cell RNA-seq data.

Usage

compute.lda(data, method = "maptpx", k.topics = if (method == "maptpx") 2:15
  else 4, log.scale = TRUE, sd.filter = 0.5, tot.iter = if (method ==
  "Gibbs") 200 else 1e+06, tol = if (method == "maptpx") 0.05 else 10^-5)

Arguments

data

A matrix of (non-negative) RNA-seq expression levels where each row is a gene and each column is a sequenced cell.

method

LDA inference method to use. Can be any unique prefix of ‘maptpx’, ‘Gibbs’ or ‘VEM’ (defaults to ‘maptpx’).

k.topics

Integer (optional). Number of topics to fit in the model. If method is ‘maptpx’, k.topics can be a vector of candidate topic numbers, and the best model (evaluated by Bayes factor against a null single-topic model) will be returned.

log.scale

Boolean (optional). Whether the data should be log-scaled.

sd.filter

Numeric or FALSE (optional). Standard-deviation threshold below which genes should be removed from the data (no filtering if set to FALSE).
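As an illustration (not the package's exact implementation), the preprocessing implied by log.scale = TRUE and sd.filter = 0.5 can be sketched as: log-transform the counts, then drop genes whose standard deviation falls below the threshold. The pseudo-count of 1 below is an assumption.

```r
# Hypothetical sketch of the preprocessing controlled by log.scale and
# sd.filter (the package's internal transform may differ in detail).
set.seed(1)
data <- matrix(rpois(40, lambda = 5), nrow = 8)  # toy gene-by-cell matrix
logged <- log(data + 1)                          # assumed pseudo-count of 1
keep <- apply(logged, 1, sd) >= 0.5              # genes above the sd threshold
filtered <- logged[keep, , drop = FALSE]         # low-variance rows removed
```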

tot.iter, tol

Numeric parameters (optional) forwarded to the chosen LDA inference method's control class.

Details

Latent Dirichlet allocation (LDA) is a generative model in which sets of observations are explained by unobserved groups (topics) that account for why some parts of the data are similar [Blei, 2003]. Each topic is modelled as a (Dirichlet) distribution over observations, and each set of observations is in turn modelled as a (Dirichlet) distribution over topics. In lieu of the traditional NLP context of word occurrence counts in documents, our model uses RNA-seq observation counts in single cells. Three separate LDA inference methods can currently be used:

  • Gibbs uses the Collapsed Gibbs Sampling method (implemented by Xuan-Hieu Phan and co-authors in the topicmodels package [Phan, 2008]) to infer the parameters of the Dirichlet distributions for a given number of topics. It gives high accuracy but is very time-consuming to run on large numbers of cells and genes.

  • VEM uses the Variational Expectation-Maximisation method (as described in [Hoffman, 2010]). It tends to converge faster than collapsed Gibbs sampling, albeit with lower accuracy.

  • maptpx uses the method described in [Taddy, 2011] and implemented in the maptpx package to estimate the parameters of the topic model for an increasing number of topics (using previous estimates as a starting point for larger topic numbers). The best model (i.e. number of topics) is selected based on the Bayes factor against the null model. Although potentially less accurate, this method provides the fastest way to train and select from a large number of models when the number of topics is not well known.
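The ‘maptpx’ selection rule amounts to keeping the candidate with the largest log Bayes factor against the single-topic null. A minimal sketch, using the log BF values printed in the k = 3:6 example run reproduced later in this document:

```r
# Pick the number of topics with the highest log Bayes factor vs the null
# model (values taken from the k = 3:6 example run in this document).
log.bf <- c(`3` = 93906.43, `4` = 141779.42, `5` = 147144.41, `6` = 149036.90)
best.k <- as.integer(names(which.max(log.bf)))
best.k  # 6
```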

When in doubt, the function can be run with its default parameter values and should produce a usable LDA model in reasonable time (using the ‘maptpx’ inference method). The model can then be refined for a specific number of topics with the slower methods. While larger models (with a large number of topics) might fit the data well, they carry a high risk of overfitting, and it is recommended to use the smallest number of topics that still explains the observations well. Anecdotally, a typical number of topics for cell differentiation data (from pluripotent to fully specialised) seems to be around 4 or 5.

Value

An LDA model fitted to data, of class LDA (for methods ‘Gibbs’ and ‘VEM’) or topics (for ‘maptpx’).
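A hedged sketch of how the two return types could be inspected, assuming the documented fields of the maptpx ‘topics’ object ($omega, cells-by-topics proportions) and the topicmodels posterior() accessor; inspect.topics is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: extract per-cell topic proportions from either
# return class. Field/accessor names follow the maptpx and topicmodels docs.
inspect.topics <- function(lda.results) {
  if (inherits(lda.results, "topics")) {
    lda.results$omega                            # maptpx: cells x topics
  } else {
    topicmodels::posterior(lda.results)$topics   # Gibbs/VEM fits
  }
}
```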

References

  • Blei, Ng, and Jordan. “Latent dirichlet allocation.” the Journal of machine Learning research 3 (2003): 993-1022.

  • Hoffman, Blei and Bach (2010). “Online Learning for Latent Dirichlet Allocation.” In J Lafferty, CKI Williams, J Shawe-Taylor, R Zemel, A Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 856-864. MIT Press, Cambridge, MA.

  • Hornik and Grün. “topicmodels: An R package for fitting topic models.” Journal of Statistical Software 40.13 (2011): 1-30.

  • Phan, Nguyen and Horiguchi. “Learning to classify short and sparse text & web with hidden topics from large-scale data collections.” Proceedings of the 17th international conference on World Wide Web. ACM, 2008.

  • Taddy. “On estimation and selection for topic models.” arXiv preprint arXiv:1109.4518 (2011).

See Also

LDA, topics, LDA_Gibbscontrol-class, CTM_VEMcontrol-class

Examples

# Load skeletal myoblast RNA-Seq data from HSMMSingleCell package:
library(HSMMSingleCell)
data(HSMM_expr_matrix)

# Run LDA inference using 'maptpx' method for k = 4:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")


# Run LDA inference using 'maptpx' method for number of topics k = 3 to 6:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")

# Run LDA inference using 'Gibbs' [collapsed sampling] method for number of k = 4 topics:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(cellTree)
Loading required package: topGO
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: graph
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: GO.db
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums


Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve


groupGOTerms: 	GOBPTerm, GOMFTerm, GOCCTerm environments built.

Attaching package: 'topGO'

The following object is masked from 'package:IRanges':

    members

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/cellTree/compute.lda.Rd_%03d_medium.png", width=480, height=480)
> ### Name: compute.lda
> ### Title: LDA model inference
> ### Aliases: compute.lda
> 
> ### ** Examples
> 
> # Load skeletal myoblast RNA-Seq data from HSMMSingleCell package:
> library(HSMMSingleCell)
> data(HSMM_expr_matrix)
> 
> # Run LDA inference using 'maptpx' method for k = 4:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")
Computing LDA model using: maptpx
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...
Loading required namespace: maptpx

Estimating on a 271 document collection.
Fitting the 4 topic model.
log posterior increase: 1799.62, 3613.11, 2764.04, 1187.68, 704.87, done.
Selected k = 4 topics
> 
> ## No test: 
> # Run LDA inference using 'maptpx' method for number of topics k = 3 to 6:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")
Computing LDA model using: maptpx
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...

Estimating on a 271 document collection.
Fit and Bayes Factor Estimation for K = 3 ... 6
log posterior increase: 275.05, 875.3, 3576.09, 2351.75, 1110.8, 691.45, done.
log BF( 3 ) = 93906.43
log posterior increase: 38214.43, 1787.53, 740.49, done.
log BF( 4 ) = 141779.42
log posterior increase: 19928.15, 905.14, 507.36, 376.38, done.
log BF( 5 ) = 147144.41
log posterior increase: 11509.18, 954.61, 489.19, done.
log BF( 6 ) = 149036.9
Selected k = 6 topics
> 
> # Run LDA inference using 'Gibbs' [collapsed sampling] method for number of k = 4 topics:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")
Computing LDA model using: Gibbs
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...
Loading required namespace: topicmodels
K = 4; V = 13515; M = 271
Sampling 230 iterations!
Iteration 10 ...
Iteration 20 ...