These functions use a collapsed Gibbs sampler to fit three different
models: latent Dirichlet allocation (LDA), the mixed-membership stochastic
blockmodel (MMSB), and supervised LDA (sLDA). These functions take
sparsely represented input documents, perform inference, and return
point estimates of the latent parameters using the state at the last
iteration of Gibbs sampling. Multinomial logit for sLDA is supported
using the multinom function from the nnet package.
documents
A list whose length is equal to the number of documents, D. Each
element of documents is an integer matrix with two rows. Each
column of documents[[i]] (i.e., document i) represents a
word occurring in the document.
documents[[i]][1, j] is a 0-indexed word identifier for the jth word
in document i; that is, it is the index of that word in vocab minus
1. documents[[i]][2, j] is an integer specifying the number of times
that word appears in the document.
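For concreteness, a minimal sketch of this format; the vocabulary and counts are invented for illustration, and the final call mirrors the arguments described on this page:

```r
## A three-word vocabulary; word identifiers are 0-indexed into vocab.
vocab <- c("apple", "banana", "cherry")

## Two documents. Row 1 of each matrix holds 0-indexed word
## identifiers; row 2 holds the count of that word in the document.
documents <- list(
  matrix(c(0L, 1L,    # document 1: "apple" once,
           1L, 2L),   #             "banana" twice
         nrow = 2, byrow = TRUE),
  matrix(c(2L,        # document 2: "cherry" once
           1L),
         nrow = 2, byrow = TRUE)
)

## Not run: result <- lda.collapsed.gibbs.sampler(documents, K = 2,
##   vocab, num.iterations = 25, alpha = 0.1, eta = 0.1)
```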
network
For mmsb.collapsed.gibbs.sampler, a D \times D
matrix (coercible to logical) representing the adjacency matrix for
the network. Note that elements on the diagonal are ignored.
K
An integer representing the number of topics in the model.
vocab
A character vector specifying the vocabulary words associated with
the word indices used in documents.
num.iterations
The number of sweeps of Gibbs sampling over the entire corpus to make.
num.e.iterations
For slda.em, the number of Gibbs sampling sweeps to make over
the entire corpus for each iteration of EM.
num.m.iterations
For slda.em, the number of EM iterations to make.
alpha
The scalar value of the Dirichlet hyperparameter for
topic proportions.
beta.prior
For mmsb.collapsed.gibbs.sampler, the Beta hyperparameter
for each entry of the block relations matrix. This parameter should
be a length-2 list whose entries are K \times K matrices; the
elements of the two matrices give the two shape parameters of the
Beta variable for each entry.
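As a sketch (the prior values here are arbitrary), a flat symmetric prior for K = 3 could be built as:

```r
## beta.prior is a length-2 list of K x K matrices: entry [i, j] of
## the first matrix is the first shape parameter of the Beta variable
## for block (i, j); the second matrix holds the second shape
## parameter. A flat Beta(1, 1) prior on every entry (illustrative):
K <- 3
beta.prior <- list(matrix(1, nrow = K, ncol = K),
                   matrix(1, nrow = K, ncol = K))
```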
eta
The scalar value of the Dirichlet hyperparameter for topic
multinomials.
initial
A list of initial topic assignments for words. It should be
in the same format as the assignments field of the return
value. If this field is NULL, then the sampler will be initialized
with random assignments.
burnin
A scalar integer indicating the number of Gibbs sweeps to consider
as burn-in (i.e., throw away) for lda.collapsed.gibbs.sampler
and mmsb.collapsed.gibbs.sampler. If this parameter is non-NULL, it
will also have the side-effect of enabling the
document_expects field of the return value (see below for
details). Note that burnin iterations do NOT count towards num.iterations.
compute.log.likelihood
A scalar logical which, when TRUE, will cause the sampler to
compute the log likelihood of the words (up to a constant) after
each sweep over the variables. The log likelihood for each
iteration is stored in the log.likelihoods field of the result.
This is useful for assessing convergence, but slows sampling down
slightly.
annotations
A length D numeric vector of covariates, one per document. Used
only by slda.em, which jointly models the documents and their
numeric annotations. When using the logistic option, annotations
must be consecutive integers starting from 0.
params
For slda.em, a numeric vector of length K \times (number of
classes - 1) giving the regression coefficients at which the EM
algorithm should be initialized.
variance
For slda.em, the variance associated with the Gaussian
response modeling the annotations in annotations.
logistic
For slda.em, a scalar logical which, when TRUE, causes
the annotations to be modeled using a logistic response instead of a
Gaussian (the covariates must be consecutive integers starting from
zero when used with sLDA).
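A hedged sketch of how these arguments fit together for a Gaussian response; the annotation values and parameter settings are invented, and the call simply follows the argument list above (documents and vocab as described earlier):

```r
## One numeric annotation per document (illustrative values only).
annotations <- c(2.1, -0.3, 1.7)

## Not run: fitted <- slda.em(documents, K = 5, vocab,
##   num.e.iterations = 10, num.m.iterations = 4,
##   alpha = 1.0, eta = 0.1, annotations = annotations,
##   params = rep(0, 5),   # initial regression coefficients, length K
##   variance = 0.25,      # Gaussian response variance
##   logistic = FALSE, method = "sLDA")
```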
lambda
When regularise is TRUE, a scalar giving the standard deviation
of the Gaussian prior on the regression coefficients.
regularise
When TRUE, a Gaussian prior is used for the regression
coefficients. This requires the penalized package.
method
For slda.em, a character indicating how to model the
annotations. Only "sLDA", the stock model given in the
references, is officially supported at the moment.
trace
When trace is greater than zero, diagnostic messages will be
output. Larger values of trace imply more messages.
MaxNWts
Passed as the MaxNWts argument to the multinom function in the
nnet package; the maximum number of weights, with a default of
3000. Increasing this value may be necessary when using logistic
sLDA with a large number of topics, at the expense of longer run
times.
freeze.topics
When TRUE, topic assignments will occur but the counts of
words associated with topics will not change. initial should be
set when this option is used. This is most useful for sampling test
documents.
Value
A fitted model as a list with the following components:
assignments
A list of length D. Each element of the list, say
assignments[[i]] is an integer vector of the same length as the
number of columns in documents[[i]] indicating the topic
assignment for each word.
topics
A K \times V matrix where each entry indicates the
number of times a word (column) was assigned to a topic (row). The column
names should correspond to the vocabulary words given in vocab.
topic_sums
A length K vector where each entry indicates the
total number of times words were assigned to each topic.
document_sums
A K \times D matrix where each entry is an
integer indicating the number of times words in each document
(column) were assigned to each topic (row).
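For downstream use, document_sums is commonly normalized into per-document topic proportions; a self-contained sketch with toy counts standing in for a real fit:

```r
## Toy K = 2, D = 2 counts standing in for result$document_sums.
document_sums <- matrix(c(3L, 1L,
                          1L, 5L), nrow = 2, byrow = TRUE)

## Divide each column by its total: column d gives the estimated
## topic proportions for document d.
proportions <- sweep(document_sums, 2, colSums(document_sums), "/")
```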
log.likelihoods
Only for lda.collapsed.gibbs.sampler. A
matrix with 2 rows and num.iterations columns of log likelihoods when the flag
compute.log.likelihood is set to TRUE. The first row
contains the full log likelihood (including the prior), whereas the
second row contains the log likelihood of the observations
conditioned on the assignments.
document_expects
This field only exists if burnin is
non-NULL. This field is like document_sums but instead of only
aggregating counts for the last iteration, this field aggregates
counts over all iterations after burnin.
net.assignments.left
Only for
mmsb.collapsed.gibbs.sampler. A D \times D integer matrix of
topic assignments for the source document corresponding to the link
between one document (row) and another (column).
net.assignments.right
Only for
mmsb.collapsed.gibbs.sampler. A D \times D integer matrix of
topic assignments for the destination document corresponding to the link
between one document (row) and another (column).
blocks.neg
Only for
mmsb.collapsed.gibbs.sampler. A K \times K integer
matrix indicating the number of times the source of a non-link was
assigned to a topic (row) and the destination was assigned to
another (column).
blocks.pos
Only for
mmsb.collapsed.gibbs.sampler. A K \times K integer
matrix indicating the number of times the source of a link was
assigned to a topic (row) and the destination was assigned to
another (column).
model
For slda.em, the regression model fitted to the annotations (an
object of class lm).
coefs
For slda.em, a numeric vector of length K \times (number of
classes - 1) containing the coefficients of the regression model.
Note
WARNING: This function does not compute precisely the correct thing
when the count associated with a word in a document is not 1 (this
is for speed reasons currently). A workaround when a word appears
multiple times is to replicate the word across several columns of a
document. This will likely be fixed in a future version.
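The workaround described above can be sketched as a small helper (the name expand.counts is invented for this example):

```r
## Replicate each word across as many columns as its count, so that
## every column of the returned document matrix has count 1.
expand.counts <- function(doc) {
  idx <- rep(seq_len(ncol(doc)), doc[2, ])
  rbind(doc[1, idx], rep(1L, length(idx)))
}

doc <- matrix(c(0L, 1L,
                1L, 2L), nrow = 2, byrow = TRUE)  # word 1 has count 2
expand.counts(doc)  # 2 x 3 matrix; every count is 1
```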
References
Blei, David M. and Ng, Andrew and Jordan, Michael. Latent
Dirichlet allocation. Journal of Machine Learning Research, 2003.
Airoldi, Edoardo M. and Blei, David M. and Fienberg, Stephen
E. and Xing, Eric P. Mixed Membership Stochastic
Blockmodels. Journal of Machine Learning Research, 2008.
Blei, David M. and McAuliffe, John. Supervised topic models.
Advances in Neural Information Processing Systems, 2008.
Griffiths, Thomas L. and Steyvers, Mark. Finding scientific
topics. Proceedings of the National Academy of Sciences, 2004.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On
smoothing and inference for topic models. Uncertainty in Artificial Intelligence,
2009.
See Also
read.documents and lexicalize can be used
to generate the input data to these models.
top.topic.words,
predictive.distribution, and slda.predict for operations on the fitted models.
Examples
## See demos for the three functions:
## Not run: demo(lda)
## Not run: demo(slda)
## Not run: demo(mmsb)