R: Read LDA-formatted Document and Vocabulary Files
read.documents
R Documentation
Description
These functions read in the document and vocabulary files associated
with a corpus. The format of the files is the same as that used by
LDA-C (see below for details). The return value of these functions
can be used by the inference procedures defined in the lda package.
Arguments
filename
A length-1 character vector specifying the path to the document or
vocabulary file. The defaults are ‘mult.dat’ for read.documents and
‘vocab.dat’ for read.vocab.
Details
The details of the format are also described in the readme for LDA-C.
The format of the documents file is appropriate for typical text data
as it sparsely encodes observed features. A single file encodes a
corpus (a collection of documents). Each line of the file
encodes a single document (a feature vector).
The line encoding a document begins with an integer followed by a
number of feature-count pairs, all separated by spaces. A
feature-count pair consists of two integers separated by a colon. The
first integer indicates the feature (note that this is zero-indexed!)
and the second integer indicates the count (i.e., value) of that
feature. The initial integer of a line indicates how many
feature-count pairs are to be expected on that line.
Note that we permit a feature to appear more than once on a line, in
which case the value for that feature will be the sum of all instances
(the behavior for such files is undefined for LDA-C). For example, a
line reading
    4 7:1 0:2 7:3 1:1
will yield a document with feature 0 occurring twice, feature 1
occurring once, and feature 7 occurring four times, with all other
features occurring zero times.
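The line format described above can be sketched in base R. This is only an illustration of the format, not the parser used by the lda package; the helper name parse_ldac_line and its 2-row matrix layout (features in the first row, counts in the second) are assumptions made for this example.

```r
# Hypothetical helper (not part of the lda package): parse one line of an
# LDA-C documents file into a 2-row integer matrix.
parse_ldac_line <- function(line) {
  # Split the line on whitespace: first token is the pair count,
  # the remaining tokens are "feature:count" pairs.
  tokens <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n_pairs <- as.integer(tokens[1])
  pairs <- tokens[-1]
  if (length(pairs) != n_pairs) {
    stop("expected ", n_pairs, " feature-count pairs, found ", length(pairs))
  }

  # Split each pair on the colon into a feature index and a count.
  parts <- strsplit(pairs, ":", fixed = TRUE)
  feature <- as.integer(vapply(parts, `[`, character(1), 1))
  count <- as.integer(vapply(parts, `[`, character(1), 2))

  # Sum the counts of features that appear more than once on the line.
  totals <- tapply(count, feature, sum)
  rbind(feature = as.integer(names(totals)),
        count = as.integer(totals))
}

# The example line from the text: feature 7 appears twice (1 + 3 = 4).
doc <- parse_ldac_line("4 7:1 0:2 7:3 1:1")
print(doc)
```

Note that the feature indices are kept zero-indexed here, as in the file.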
The format of the vocabulary is a set of newline-separated strings
corresponding to features. That is, the first line of the vocabulary
file will correspond to the label for feature 0, the second for
feature 1, etc.
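Because features are zero-indexed in the files while R vectors are one-indexed, looking up a label requires an offset of one. A minimal sketch using only base R follows; the file contents and the variable names are made up for illustration.

```r
# Write a small, made-up vocabulary file: one label per line.
vocab_path <- tempfile(fileext = ".dat")
writeLines(c("apple", "banana", "cherry"), vocab_path)

# Each line is the label of one feature, in order (line 1 is feature 0).
vocab <- readLines(vocab_path)

# Zero-indexed feature ids, as they would appear in a documents file.
feature_ids <- c(0L, 2L)

# Add 1 to convert to R's one-based indexing before looking up labels.
labels <- vocab[feature_ids + 1L]
print(labels)
```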
Value
read.documents returns a list of matrices suitable as input for
the inference routines in lda. See
lda.collapsed.gibbs.sampler for details.
read.vocab returns a character vector of strings corresponding to
features.
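A short usage sketch, assuming the lda package is installed. The two temporary files and their contents are invented for this example, and the calls to read.documents and read.vocab are skipped if the package is not available.

```r
# Create a tiny corpus in LDA-C format: two documents over three features.
doc_path <- tempfile(fileext = ".dat")
vocab_path <- tempfile(fileext = ".dat")
writeLines(c("2 0:1 2:3", "1 1:2"), doc_path)
writeLines(c("apple", "banana", "cherry"), vocab_path)

# Read the files back only if the lda package can be loaded.
if (requireNamespace("lda", quietly = TRUE)) {
  documents <- lda::read.documents(doc_path)
  vocab <- lda::read.vocab(vocab_path)
  str(documents)  # inspect the structure of the returned list
  print(vocab)
}
```

The resulting documents and vocab objects are the inputs expected by the inference routines; see lda.collapsed.gibbs.sampler for details.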