predictive.distribution
R Documentation
Compute predictive distributions for fitted LDA-type models.
Description
This function takes a fitted LDA-type model and computes a predictive
distribution for new words in a document. This is useful for making
predictions about held-out words.
Usage
predictive.distribution(document_sums, topics, alpha, eta)
Arguments
document_sums
A K \times D matrix where each entry is numeric and proportional
to the probability of seeing a topic (row) conditioned on document
(column) (this entry is sometimes denoted θ_{d,k} in the
literature; see Details). Either the document_sums field or
the document_expects field from the output of
lda.collapsed.gibbs.sampler can be used.
topics
A K \times V matrix where each entry is numeric and proportional
to the probability of seeing the word (column) conditioned on topic
(row) (this entry is sometimes denoted β_{w,k} in the
literature; see Details). The column names should correspond to the
words in the vocabulary. The topics field from the output of
lda.collapsed.gibbs.sampler can be used.
alpha
The scalar value of the Dirichlet hyperparameter for
topic proportions. See references for details.
eta
The scalar value of the Dirichlet hyperparameter for topic
multinomials. See references for details.
Details
The formula used to compute predictive probability is p_d(w) =
∑_k (θ_{d, k} + α) (β_{w, k} + η).
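The sum over topics can be written as a single matrix product. A minimal base-R sketch, using made-up toy counts (K = 2 topics, V = 3 words, D = 2 documents) rather than the output of a real sampler:

```r
## Toy K x D topic-by-document counts and K x V topic-by-word counts.
document_sums <- matrix(c(5, 1,
                          2, 4), nrow = 2)
topics <- matrix(c(3, 0, 1,
                   0, 4, 2), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("learning", "model", "data")))
alpha <- 0.1
eta <- 0.1

## p_d(w) = sum_k (theta_{d,k} + alpha) * (beta_{w,k} + eta),
## vectorized as one matrix product yielding a V x D matrix whose
## row names are the vocabulary.
predictions <- t(topics + eta) %*% (document_sums + alpha)
```

Smoothing by α and η before the product means every word receives nonzero mass, even in documents where its topics were never sampled.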
Value
A V \times D matrix of the probability of seeing a word (row) in
a document (column). The row names of the matrix are set to the
column names of topics.
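Since the entries are only proportional to probabilities, a caller who needs each document's column to be a proper distribution can rescale it. A minimal sketch with a toy V \times D matrix standing in for the returned value:

```r
## Toy V x D matrix of unnormalized word scores for two documents.
predictions <- matrix(c(2, 3, 5,
                        1, 1, 2), nrow = 3)

## Rescale each column (document) to sum to one.
probs <- prop.table(predictions, margin = 2)
colSums(probs)  ## each column now sums to 1
```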
References
Blei, David M., Ng, Andrew, and Jordan, Michael. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003.
See Also
lda.collapsed.gibbs.sampler for the format of
topics and document_sums and details of the model.
top.topic.words demonstrates another use for a fitted
topic matrix.
Examples
## Fit a model (from demo(lda)).
data(cora.documents)
data(cora.vocab)
K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1,
                                      0.1)
## Predict new words for the first two documents
predictions <- predictive.distribution(result$document_sums[, 1:2],
                                       result$topics,
                                       0.1, 0.1)
## Use top.topic.words to show the top 5 predictions in each document.
top.topic.words(t(predictions), 5)
## [,1] [,2]
## [1,] "learning" "learning"
## [2,] "algorithm" "paper"
## [3,] "model" "problem"
## [4,] "paper" "results"
## [5,] "algorithms" "system"