tokenize is a simple regular-expression-based parser that
splits the components of a character vector into tokens while
protecting infix punctuation. If lines = TRUE, assume x
was imported with readLines and that end-of-line markers need to
be added back to the components.
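As a sketch of the basic calls (assuming the tau package is attached; the exact token output is shown in the Examples below):

```r
library("tau")

## Split a sentence into tokens; infix punctuation such as the
## apostrophe in "isn't" stays inside its token.
tokenize("The cat isn't here.")

## With lines = TRUE, the components are treated as lines imported
## with readLines() and end-of-line markers are added back.
tokenize(c("first line", "second line"), lines = TRUE)
```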
remove_stopwords removes the tokens given in words from
x. If lines = FALSE, it assumes the components of both
vectors contain tokens which can be compared using match.
Otherwise, it assumes the tokens in x are delimited by word
boundaries (including infix punctuation) and uses regular
expression matching.
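The two matching modes can be sketched as follows (assuming the tau package is attached; outputs are not asserted here):

```r
library("tau")

## lines = FALSE (the default): both arguments are token vectors,
## and tokens are compared componentwise using match().
x <- tokenize("the dog and the cat")
remove_stopwords(x, words = c("the", "and"))

## lines = TRUE: x holds untokenized lines of text; the stopwords
## are matched at word boundaries via regular expressions.
remove_stopwords("the dog and the cat", words = c("the", "and"),
                 lines = TRUE)
```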
Value
The same type of object as x.
Author(s)
Christian Buchta
Examples
txt <- "\"It's almost noon,\" it@dot.net said."
## split
x <- tokenize(txt)
x
## reconstruct
t <- paste(x, collapse = "")
t
if (require("tm", quietly = TRUE)) {
    words <- readLines(system.file("stopwords", "english.dat",
                                   package = "tm"))
    remove_stopwords(x, words)
    remove_stopwords(t, words, lines = TRUE)
} else
    remove_stopwords(t, words = c("it", "it's"), lines = TRUE)