Last data update: 2014.03.03

R: Import text documents into Mallet format
mallet.import    R Documentation

Import text documents into Mallet format

Description

This function takes an array of document IDs and an array of document texts (as character strings) and converts them into a Mallet instance list.

Usage

mallet.import(id.array, text.array, stoplist.file, preserve.case, token.regexp)

Arguments

id.array

An array of document IDs.

text.array

An array of text strings to use as documents. The type of the array must be character.
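
As a minimal sketch of the character requirement: in R versions before 4.0, data.frame() converts strings to factors by default, so a text column may need an explicit conversion first. The documents data frame and its contents below are hypothetical.

```r
# Hypothetical documents; stringsAsFactors = TRUE mimics the pre-4.0 default.
documents <- data.frame(
  id   = c("doc1", "doc2"),
  text = c("The first document.", "The second document."),
  stringsAsFactors = TRUE
)

is.character(documents$text)   # FALSE: the column is a factor

# Convert to character before passing the columns to mallet.import().
documents$id   <- as.character(documents$id)
documents$text <- as.character(documents$text)
```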

stoplist.file

The name of a file containing stopwords (words to ignore), one per line. If the file is not in the current working directory, you may need to include a full path.
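
One way to prepare such a file from within R is sketched below. The word list and the variable names are hypothetical, and a temporary file stands in for a real stoplist such as "en.txt".

```r
# Hypothetical stopword list; a real list would be much longer.
stopwords <- c("the", "a", "an", "and", "of")

# Write one word per line to a temporary file.
stoplist.file <- tempfile(fileext = ".txt")
writeLines(stopwords, stoplist.file)

# Resolve the full path, which is useful when the file is outside the
# current working directory.
stoplist.path <- normalizePath(stoplist.file)
```

The resulting stoplist.path could then be supplied as the stoplist.file argument.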

preserve.case

Logical. By default, the input text is converted to all lowercase; set this to TRUE to keep the original case.

token.regexp

A quoted string representing a regular expression that defines a token. The default is one or more Unicode letters: "[\\p{L}]+". Note that backslashes in the pattern must be doubled, because R itself treats a single backslash in a string as an escape character.
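
The effect of a token pattern can be previewed in base R before importing. This is only a rough sketch: the tokens() helper is hypothetical, and it uses R's PCRE engine (perl = TRUE), whereas Mallet applies the pattern with Java's regular expression engine, so results may differ in edge cases.

```r
# In R source, "\\p{L}" is the regex \p{L} with a single backslash.
default.regexp <- "[\\p{L}]+"
punct.regexp   <- "\\p{L}[\\p{L}\\p{P}]+\\p{L}"

# Hypothetical helper: approximates tokenization with R's PCRE engine.
# Mallet itself applies the pattern with Java's regex engine.
tokens <- function(text, pattern) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

text <- "Topic models aren't magic"
default.tokens <- tokens(text, default.regexp)
punct.tokens   <- tokens(text, punct.regexp)
```

With the default pattern the apostrophe splits "aren't" into "aren" and "t". The second pattern keeps it as one token, but it only matches tokens of three or more characters.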

See Also

mallet.word.freqs returns term and document frequencies, which may be useful in selecting stopwords.

Examples

## Not run: 
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

## End(Not run)
