Last data update: 2014.03.03

newsgroups

R Documentation

A shortened collection of newsgroup messages with the first 3 classes.

Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. This package uses only the first 3 classes, for demonstration purposes.

Usage

data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)

Format

newsgroup.train.documents and newsgroup.test.documents comprise a corpus of 2731 newsgroup documents partitioned into 1633 training and 1098 test cases evenly distributed across 3 classes.

newsgroup.train.labels is a numeric vector of length 1633 which gives a class label from 1 to 3 for each training document in the corpus.

newsgroup.test.labels is a numeric vector of length 1098 which gives a class label from 1 to 3 for each test document in the corpus.

newsgroup.vocab is the vocabulary of the corpus.

stopwords contains English stopwords extracted from the tm package.

Source

http://qwone.com/~jason/20Newsgroups/

Examples
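
A minimal sketch of loading the label vectors and checking them against the sizes given under Format. It assumes the package that ships these data sets has already been attached with library(); the package name is not stated on this page, so none is given here.

```r
# Assumption: the package providing these data sets is already attached;
# its name does not appear on this page.
data(newsgroup.train.labels)
data(newsgroup.test.labels)

# The Format section states 1633 training and 1098 test labels.
length(newsgroup.train.labels)
length(newsgroup.test.labels)

# Count the documents in each of the 3 classes.
table(newsgroup.train.labels)
table(newsgroup.test.labels)
```

If the class counts differ noticeably, the split is only approximately even, which matches the "(nearly) evenly" wording in the Description.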