Last data update: 2014.03.03

R: LDA model inference
compute.lda {cellTree}	R Documentation

LDA model inference

Description

This function fits a Latent Dirichlet Allocation (LDA) model to single-cell RNA-seq data.

Usage

compute.lda(data, method = "maptpx", k.topics = if (method == "maptpx") 2:15
  else 4, log.scale = TRUE, sd.filter = 0.5, tot.iter = if (method ==
  "Gibbs") 200 else 1e+06, tol = if (method == "maptpx") 0.05 else 10^-5)

Arguments

data

A matrix of (non-negative) RNA-seq expression levels where each row is a gene and each column is a sequenced cell.

method

LDA inference method to use. Can be any unique prefix of ‘maptpx’, ‘Gibbs’ or ‘VEM’ (defaults to ‘maptpx’).

k.topics

Integer (optional). Number of topics to fit in the model. If method is ‘maptpx’, k.topics can be a vector of candidate topic numbers, and the best model (evaluated by Bayes factor against a null single-topic model) will be returned.

log.scale

Boolean (optional). Whether the data should be log-scaled.

sd.filter

Numeric or FALSE (optional). Standard-deviation threshold below which genes should be removed from the data (no filtering if set to FALSE).
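As an illustration (not the package's exact implementation), the preprocessing implied by log.scale = TRUE and sd.filter = 0.5 can be sketched as: log-transform the counts, then drop genes whose standard deviation falls below the threshold. The pseudo-count of 1 below is an assumption.

```r
# Hypothetical sketch of the preprocessing controlled by log.scale and
# sd.filter (the package's internal transform may differ in detail).
set.seed(1)
data <- matrix(rpois(40, lambda = 5), nrow = 8)  # toy gene-by-cell matrix
logged <- log(data + 1)                          # assumed pseudo-count of 1
keep <- apply(logged, 1, sd) >= 0.5              # genes above the sd threshold
filtered <- logged[keep, , drop = FALSE]         # low-variance rows removed
```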

tot.iter, tol

Numeric parameters (optional) forwarded to the chosen LDA inference method's control class.

Details

Latent Dirichlet allocation (LDA) is a generative model in which sets of observations are explained by unobserved groups (topics) that account for why some parts of the data are similar [Blei, 2003]. Each topic is modelled as a (Dirichlet) distribution over observations, and each set of observations is in turn modelled as a (Dirichlet) distribution over topics. In lieu of the traditional NLP context of word occurrence counts in documents, our model uses RNA-seq observation counts in single cells. Three separate LDA inference methods can currently be used:

  • Gibbs uses the Collapsed Gibbs Sampling method (implemented by Xuan-Hieu Phan and co-authors in the topicmodels package [Phan, 2008]) to infer the parameters of the Dirichlet distributions for a given number of topics. It gives high accuracy but is very time-consuming to run on large numbers of cells and genes.

  • VEM uses the Variational Expectation-Maximisation method (as described in [Hoffman, 2010]). It tends to converge faster than collapsed Gibbs sampling, albeit with lower accuracy.

  • maptpx uses the method described in [Taddy, 2011] and implemented in the maptpx package to estimate the parameters of the topic model for an increasing number of topics (using previous estimates as a starting point for larger topic numbers). The best model (i.e. number of topics) is selected based on the Bayes factor against the null model. Although potentially less accurate, this method provides the fastest way to train and select from a large number of models when the number of topics is not well known.
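The ‘maptpx’ selection rule amounts to keeping the candidate with the largest log Bayes factor against the single-topic null. A minimal sketch, using the log BF values printed in the k = 3:6 example run reproduced later in this document:

```r
# Pick the number of topics with the highest log Bayes factor vs the null
# model (values taken from the k = 3:6 example run in this document).
log.bf <- c(`3` = 93906.43, `4` = 141779.42, `5` = 147144.41, `6` = 149036.90)
best.k <- as.integer(names(which.max(log.bf)))
best.k  # 6
```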

When in doubt, the function can be run with its default parameter values and should produce a usable LDA model in reasonable time (using the ‘maptpx’ inference method). The model can then be refined for a specific number of topics with the slower methods. While larger models (with a large number of topics) might fit the data well, they carry a high risk of overfitting, and it is recommended to use the smallest number of topics that still explains the observations well. Anecdotally, a typical number of topics for cell differentiation data (from pluripotent to fully specialised) seems to be around 4 or 5.

Value

An LDA model fitted to data, of class LDA (for methods ‘Gibbs’ and ‘VEM’) or topics (for ‘maptpx’).
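A hedged sketch of how the two return types could be inspected, assuming the documented fields of the maptpx ‘topics’ object ($omega, cells-by-topics proportions) and the topicmodels posterior() accessor; inspect.topics is a hypothetical helper, not part of the package:

```r
# Hypothetical helper: extract per-cell topic proportions from either
# return class. Field/accessor names follow the maptpx and topicmodels docs.
inspect.topics <- function(lda.results) {
  if (inherits(lda.results, "topics")) {
    lda.results$omega                            # maptpx: cells x topics
  } else {
    topicmodels::posterior(lda.results)$topics   # Gibbs/VEM fits
  }
}
```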

References

  • Blei, Ng, and Jordan. “Latent dirichlet allocation.” the Journal of machine Learning research 3 (2003): 993-1022.

  • Hoffman, Blei and Bach (2010). “Online Learning for Latent Dirichlet Allocation.” In J Lafferty, CKI Williams, J Shawe-Taylor, R Zemel, A Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 856-864. MIT Press, Cambridge, MA.

  • Hornik and Grün. “topicmodels: An R package for fitting topic models.” Journal of Statistical Software 40.13 (2011): 1-30.

  • Phan, Nguyen and Horiguchi. “Learning to classify short and sparse text & web with hidden topics from large-scale data collections.” Proceedings of the 17th international conference on World Wide Web. ACM, 2008.

  • Taddy. “On estimation and selection for topic models.” arXiv preprint arXiv:1109.4518 (2011).

See Also

LDA, topics, LDA_Gibbscontrol-class, CTM_VEMcontrol-class

Examples

# Load skeletal myoblast RNA-Seq data from HSMMSingleCell package:
library(HSMMSingleCell)
data(HSMM_expr_matrix)

# Run LDA inference using 'maptpx' method for k = 4:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")


# Run LDA inference using 'maptpx' method for number of topics k = 3 to 6:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")

# Run LDA inference using 'Gibbs' [collapsed sampling] method for number of k = 4 topics:
 lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(cellTree)
Loading required package: topGO
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Loading required package: graph
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: GO.db
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    colMeans, colSums, expand.grid, rowMeans, rowSums


Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve


groupGOTerms: 	GOBPTerm, GOMFTerm, GOCCTerm environments built.

Attaching package: 'topGO'

The following object is masked from 'package:IRanges':

    members

> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/cellTree/compute.lda.Rd_%03d_medium.png", width=480, height=480)
> ### Name: compute.lda
> ### Title: LDA model inference
> ### Aliases: compute.lda
> 
> ### ** Examples
> 
> # Load skeletal myoblast RNA-Seq data from HSMMSingleCell package:
> library(HSMMSingleCell)
> data(HSMM_expr_matrix)
> 
> # Run LDA inference using 'maptpx' method for k = 4:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="maptpx")
Computing LDA model using: maptpx
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...
Loading required namespace: maptpx

Estimating on a 271 document collection.
Fitting the 4 topic model.
log posterior increase: 1799.62, 3613.11, 2764.04, 1187.68, 704.87, done.
Selected k = 4 topics
> 
> ## No test: 
> # Run LDA inference using 'maptpx' method for number of topics k = 3 to 6:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=3:6, method="maptpx")
Computing LDA model using: maptpx
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...

Estimating on a 271 document collection.
Fit and Bayes Factor Estimation for K = 3 ... 6
log posterior increase: 275.05, 875.3, 3576.09, 2351.75, 1110.8, 691.45, done.
log BF( 3 ) = 93906.43
log posterior increase: 38214.43, 1787.53, 740.49, done.
log BF( 4 ) = 141779.42
log posterior increase: 19928.15, 905.14, 507.36, 376.38, done.
log BF( 5 ) = 147144.41
log posterior increase: 11509.18, 954.61, 489.19, done.
log BF( 6 ) = 149036.9
Selected k = 6 topics
> 
> # Run LDA inference using 'Gibbs' [collapsed sampling] method for number of k = 4 topics:
>  lda.results = compute.lda(HSMM_expr_matrix, k.topics=4, method="Gibbs")
Computing LDA model using: Gibbs
Converting to log values...
Filtering out rows with standard deviation < 0.5 (47192 -> 13515)...
Loading required namespace: topicmodels
K = 4; V = 13515; M = 271
Sampling 230 iterations!
Iteration 10 ...
Iteration 20 ...