Last data update: 2014.03.03

predictive.distribution                                        R Documentation

Compute predictive distributions for fitted LDA-type models.

Description

This function takes a fitted LDA-type model and computes a predictive distribution for new words in a document. This is useful for making predictions about held-out words.

Usage

predictive.distribution(document_sums, topics, alpha, eta)

Arguments

document_sums

A K \times D matrix where each entry is a numeric value proportional to the probability of seeing a topic (row) conditioned on a document (column) (this entry is sometimes denoted θ_{d,k} in the literature; see details). Either the document_sums field or the document_expects field from the output of lda.collapsed.gibbs.sampler can be used.

topics

A K \times V matrix where each entry is a numeric value proportional to the probability of seeing a word (column) conditioned on a topic (row) (this entry is sometimes denoted β_{w,k} in the literature; see details). The column names should correspond to the words in the vocabulary. The topics field from the output of lda.collapsed.gibbs.sampler can be used.

alpha

The scalar value of the Dirichlet hyperparameter for topic proportions. See references for details.

eta

The scalar value of the Dirichlet hyperparameter for topic multinomials. See references for details.

Details

The formula used to compute predictive probability is p_d(w) = ∑_k (θ_{d, k} + α) (β_{w, k} + η).
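The formula above can be written out directly in a few lines of R. The sketch below is illustrative only (`manual.predictive` is a hypothetical name, not part of the package); it assumes document_sums (K \times D) and topics (K \times V) as described under Arguments, and rescales each column so the entries form a proper probability distribution:

```r
## Illustrative sketch of p_d(w) = sum_k (theta_{d,k} + alpha) (beta_{w,k} + eta).
## `manual.predictive` is a hypothetical helper, not an exported function.
manual.predictive <- function(document_sums, topics, alpha, eta) {
  ## Smooth both factors and contract over the topic index k: V x D matrix.
  unnormalized <- t(topics + eta) %*% (document_sums + alpha)
  ## Rescale each document's column to sum to 1.
  probs <- sweep(unnormalized, 2, colSums(unnormalized), "/")
  rownames(probs) <- colnames(topics)  ## label rows with vocabulary words
  probs
}
```

The matrix product replaces the explicit sum over topics, so the whole computation is vectorized across words and documents at once.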

Value

A V \times D matrix of the probability of seeing a word (row) in a document (column). The row names of the matrix are set to the column names of topics.

Author(s)

Jonathan Chang (slycoder@gmail.com)

References

Blei, David M. and Ng, Andrew and Jordan, Michael. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

See Also

lda.collapsed.gibbs.sampler for the format of topics and document_sums and details of the model.

top.topic.words demonstrates another use for a fitted topic matrix.

Examples

## Fit a model (from demo(lda)).
library(lda)
data(cora.documents)
data(cora.vocab)

K <- 10 ## Num clusters
result <- lda.collapsed.gibbs.sampler(cora.documents,
                                      K,  ## Num clusters
                                      cora.vocab,
                                      25,  ## Num iterations
                                      0.1,
                                      0.1) 

## Predict new words for the first two documents
predictions <-  predictive.distribution(result$document_sums[,1:2],
                                        result$topics,
                                        0.1, 0.1)

## Use top.topic.words to show the top 5 predictions in each document.
top.topic.words(t(predictions), 5)

##      [,1]         [,2]      
## [1,] "learning"   "learning"
## [2,] "algorithm"  "paper"   
## [3,] "model"      "problem" 
## [4,] "paper"      "results" 
## [5,] "algorithms" "system"  
