Last data update: 2014.03.03

predictive.distribution                                        R Documentation

Compute predictive distributions for fitted LDA-type models.

Description

This function takes a fitted LDA-type model and computes a predictive distribution for new words in a document. This is useful for making predictions about held-out words.

Usage

predictive.distribution(document_sums, topics, alpha, eta)

Arguments

document_sums

A K \times D matrix where each entry is a numeric value proportional to the probability of seeing a topic (row) conditioned on a document (column) (this entry is sometimes denoted θ_{d,k} in the literature; see details). Either the document_sums field or the document_expects field from the output of lda.collapsed.gibbs.sampler can be used.

topics

A K \times V matrix where each entry is a numeric value proportional to the probability of seeing a word (column) conditioned on a topic (row) (this entry is sometimes denoted β_{w,k} in the literature; see details). The column names should correspond to the words in the vocabulary. The topics field from the output of lda.collapsed.gibbs.sampler can be used.

alpha

The scalar value of the Dirichlet hyperparameter for topic proportions. See references for details.

eta

The scalar value of the Dirichlet hyperparameter for topic multinomials. See references for details.

Details

The formula used to compute predictive probability is p_d(w) = ∑_k (θ_{d, k} + α) (β_{w, k} + η).
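The formula above can be written out directly in a few lines of R. The sketch below is illustrative only (`manual.predictive` is a hypothetical name, not part of the package); it assumes document_sums (K \times D) and topics (K \times V) as described under Arguments, and rescales each column so the entries form a proper probability distribution:

```r
## Illustrative sketch of p_d(w) = sum_k (theta_{d,k} + alpha) (beta_{w,k} + eta).
## `manual.predictive` is a hypothetical helper, not an exported function.
manual.predictive <- function(document_sums, topics, alpha, eta) {
  ## Smooth both factors and contract over the topic index k: V x D matrix.
  unnormalized <- t(topics + eta) %*% (document_sums + alpha)
  ## Rescale each document's column to sum to 1.
  probs <- sweep(unnormalized, 2, colSums(unnormalized), "/")
  rownames(probs) <- colnames(topics)  ## label rows with vocabulary words
  probs
}
```

The matrix product replaces the explicit sum over topics, so the whole computation is vectorized across words and documents at once.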

Value

A V \times D matrix of the probability of seeing a word (row) in a document (column). The row names of the matrix are set to the column names of topics.

Author(s)

Jonathan Chang (slycoder@gmail.com)

References

Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

See Also

lda.collapsed.gibbs.sampler for the format of topics and document_sums and details of the model.

top.topic.words demonstrates another use for a fitted topic matrix.

Examples

## Fit a model (from demo(lda)).
library(lda)
data(cora.documents)
data(cora.vocab)

K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1,
                                      0.1) 

## Predict new words for the first two documents
predictions <-  predictive.distribution(result$document_sums[,1:2],
                                        result$topics,
                                        0.1, 0.1)

## Use top.topic.words to show the top 5 predictions in each document.
top.topic.words(t(predictions), 5)

##      [,1]         [,2]      
## [1,] "learning"   "learning"
## [2,] "algorithm"  "paper"   
## [3,] "model"      "problem" 
## [4,] "paper"      "results" 
## [5,] "algorithms" "system"  
