These functions use a collapsed Gibbs sampler to fit three different
models: latent Dirichlet allocation (LDA), the mixed-membership stochastic
blockmodel (MMSB), and supervised LDA (sLDA). These functions take
sparsely represented input documents, perform inference, and return
point estimates of the latent parameters using the state at the last
iteration of Gibbs sampling. Multinomial logit for sLDA is supported
using the multinom function from the nnet package.
documents
A list whose length is equal to the number of documents, D. Each
element of documents is an integer matrix with two rows. Each
column of documents[[i]] (i.e., document i) represents a
word occurring in the document.
documents[[i]][1, j] is a 0-indexed word identifier for the jth word
in document i; that is, it is the index of that word in vocab minus
1. documents[[i]][2, j] is an integer specifying the number of times
that word appears in the document.
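For concreteness, a minimal sketch of this format; the vocabulary and counts are invented for illustration, and the final call mirrors the arguments described on this page:

```r
## A three-word vocabulary; word identifiers are 0-indexed into vocab.
vocab <- c("apple", "banana", "cherry")

## Two documents. Row 1 of each matrix holds 0-indexed word
## identifiers; row 2 holds the count of that word in the document.
documents <- list(
  matrix(c(0L, 1L,    # document 1: "apple" once,
           1L, 2L),   #             "banana" twice
         nrow = 2, byrow = TRUE),
  matrix(c(2L,        # document 2: "cherry" once
           1L),
         nrow = 2, byrow = TRUE)
)

## Not run: result <- lda.collapsed.gibbs.sampler(documents, K = 2,
##   vocab, num.iterations = 25, alpha = 0.1, eta = 0.1)
```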
network
For mmsb.collapsed.gibbs.sampler, a D \times D
matrix (coercible to logical) representing the adjacency matrix for
the network. Note that elements on the diagonal are ignored.
K
An integer representing the number of topics in the model.
vocab
A character vector specifying the vocabulary words associated with
the word indices used in documents.
num.iterations
The number of sweeps of Gibbs sampling over the entire corpus to make.
num.e.iterations
For slda.em, the number of Gibbs sampling sweeps to make over
the entire corpus for each iteration of EM.
num.m.iterations
For slda.em, the number of EM iterations to make.
alpha
The scalar value of the Dirichlet hyperparameter for
topic proportions.
beta.prior
For mmsb.collapsed.gibbs.sampler, the Beta hyperparameter
for each entry of the block relations matrix. This parameter should
be a length-2 list whose entries are K \times K matrices; the
elements of the two matrices give the two shape parameters of the
Beta variable for each entry.
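As a sketch (the prior values here are arbitrary), a flat symmetric prior for K = 3 could be built as:

```r
## beta.prior is a length-2 list of K x K matrices: entry [i, j] of
## the first matrix is the first shape parameter of the Beta variable
## for block (i, j); the second matrix holds the second shape
## parameter. A flat Beta(1, 1) prior on every entry (illustrative):
K <- 3
beta.prior <- list(matrix(1, nrow = K, ncol = K),
                   matrix(1, nrow = K, ncol = K))
```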
eta
The scalar value of the Dirichlet hyperparameter for topic
multinomials.
initial
A list of initial topic assignments for words. It should be
in the same format as the assignments field of the return
value. If this field is NULL, then the sampler will be initialized
with random assignments.
burnin
A scalar integer indicating the number of Gibbs sweeps to consider
as burn-in (i.e., throw away) for lda.collapsed.gibbs.sampler
and mmsb.collapsed.gibbs.sampler. If this parameter is non-NULL, it
will also have the side-effect of enabling the
document_expects field of the return value (see below for
details). Note that burnin iterations do NOT count towards num.iterations.
compute.log.likelihood
A scalar logical which, when TRUE, will cause the sampler to
compute the log likelihood of the words (up to a constant) after
each sweep over the variables. The log likelihood for each
iteration is stored in the log.likelihoods field of the result.
This is useful for assessing convergence, but slows sampling down
slightly.
annotations
A length D numeric vector of covariates, one per document. Used
only by slda.em, which jointly models the documents and their
numeric annotations. When using the logistic option, annotations
must be consecutive integers starting from 0.
params
For slda.em, a numeric vector of length K \times (number of
classes - 1) giving the regression coefficients at which the EM
algorithm should be initialized.
variance
For slda.em, the variance associated with the Gaussian
response modeling the annotations in annotations.
logistic
For slda.em, a scalar logical which, when TRUE, causes
the annotations to be modeled using a logistic response instead of a
Gaussian (the covariates must be consecutive integers starting from
zero when used with sLDA).
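A hedged sketch of how these arguments fit together for a Gaussian response; the annotation values and parameter settings are invented, and the call simply follows the argument list above (documents and vocab as described earlier):

```r
## One numeric annotation per document (illustrative values only).
annotations <- c(2.1, -0.3, 1.7)

## Not run: fitted <- slda.em(documents, K = 5, vocab,
##   num.e.iterations = 10, num.m.iterations = 4,
##   alpha = 1.0, eta = 0.1, annotations = annotations,
##   params = rep(0, 5),   # initial regression coefficients, length K
##   variance = 0.25,      # Gaussian response variance
##   logistic = FALSE, method = "sLDA")
```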
lambda
When regularise is TRUE, a scalar giving the standard deviation
of the Gaussian prior on the regression coefficients.
regularise
When TRUE, a Gaussian prior is used for the regression
coefficients. This requires the penalized package.
method
For slda.em, a character indicating how to model the
annotations. Only "sLDA", the stock model given in the
references, is officially supported at the moment.
trace
When trace is greater than zero, diagnostic messages will be
output. Larger values of trace imply more messages.
MaxNWts
Passed as the MaxNWts argument to the multinom function in the
nnet package; the maximum number of weights, with a default of
3000. Increasing this value may be necessary when using logistic
sLDA with a large number of topics, at the expense of longer run
times.
freeze.topics
When TRUE, topic assignments will occur but the counts of
words associated with topics will not change. initial should be
set when this option is used. This is most useful for sampling test
documents.
Value
A fitted model as a list with the following components:
assignments
A list of length D. Each element of the list, say
assignments[[i]] is an integer vector of the same length as the
number of columns in documents[[i]] indicating the topic
assignment for each word.
topics
A K \times V matrix where each entry indicates the
number of times a word (column) was assigned to a topic (row). The column
names should correspond to the vocabulary words given in vocab.
topic_sums
A length K vector where each entry indicates the
total number of times words were assigned to each topic.
document_sums
A K \times D matrix where each entry is an
integer indicating the number of times words in each document
(column) were assigned to each topic (row).
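For downstream use, document_sums is commonly normalized into per-document topic proportions; a self-contained sketch with toy counts standing in for a real fit:

```r
## Toy K = 2, D = 2 counts standing in for result$document_sums.
document_sums <- matrix(c(3L, 1L,
                          1L, 5L), nrow = 2, byrow = TRUE)

## Divide each column by its total: column d gives the estimated
## topic proportions for document d.
proportions <- sweep(document_sums, 2, colSums(document_sums), "/")
```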
log.likelihoods
Only for lda.collapsed.gibbs.sampler. A
matrix with 2 rows and num.iterations columns of log likelihoods when the flag
compute.log.likelihood is set to TRUE. The first row
contains the full log likelihood (including the prior), whereas the
second row contains the log likelihood of the observations
conditioned on the assignments.
document_expects
This field only exists if burnin is
non-NULL. This field is like document_sums but instead of only
aggregating counts for the last iteration, this field aggregates
counts over all iterations after burnin.
net.assignments.left
Only for
mmsb.collapsed.gibbs.sampler. A D \times D integer matrix of
topic assignments for the source document corresponding to the link
between one document (row) and another (column).
net.assignments.right
Only for
mmsb.collapsed.gibbs.sampler. A D \times D integer matrix of
topic assignments for the destination document corresponding to the link
between one document (row) and another (column).
blocks.neg
Only for
mmsb.collapsed.gibbs.sampler. A K \times K integer
matrix indicating the number of times the source of a non-link was
assigned to a topic (row) and the destination was assigned to
another (column).
blocks.pos
Only for
mmsb.collapsed.gibbs.sampler. A K \times K integer
matrix indicating the number of times the source of a link was
assigned to a topic (row) and the destination was assigned to
another (column).
model
For slda.em, the regression model fitted to the annotations (an
object of class lm).
coefs
For slda.em, a numeric vector of length K \times (number of
classes - 1) containing the coefficients of the regression model.
Note
WARNING: This function does not compute precisely the correct thing
when the count associated with a word in a document is not 1 (this
is for speed reasons currently). A workaround when a word appears
multiple times is to replicate the word across several columns of a
document. This will likely be fixed in a future version.
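The workaround described above can be sketched as a small helper (the name expand.counts is invented for this example):

```r
## Replicate each word across as many columns as its count, so that
## every column of the returned document matrix has count 1.
expand.counts <- function(doc) {
  idx <- rep(seq_len(ncol(doc)), doc[2, ])
  rbind(doc[1, idx], rep(1L, length(idx)))
}

doc <- matrix(c(0L, 1L,
                1L, 2L), nrow = 2, byrow = TRUE)  # word 1 has count 2
expand.counts(doc)  # 2 x 3 matrix; every count is 1
```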
References
Blei, David M. and Ng, Andrew and Jordan, Michael. Latent
Dirichlet allocation. Journal of Machine Learning Research, 2003.
Airoldi, Edoardo M. and Blei, David M. and Fienberg, Stephen
E. and Xing, Eric P. Mixed Membership Stochastic
Blockmodels. Journal of Machine Learning Research, 2008.
Blei, David M. and McAuliffe, John. Supervised topic models.
Advances in Neural Information Processing Systems, 2008.
Griffiths, Thomas L. and Steyvers, Mark. Finding scientific
topics. Proceedings of the National Academy of Sciences, 2004.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On
smoothing and inference for topic models. Uncertainty in Artificial Intelligence,
2009.
See Also
read.documents and lexicalize can be used
to generate the input data to these models.
top.topic.words,
predictive.distribution, and slda.predict for operations on the fitted models.
Examples
## See demos for the three functions:
## Not run: demo(lda)
## Not run: demo(slda)
## Not run: demo(mmsb)