This function computes the stem of each word
in the given character vector.
Stemming reduces a word to its base component,
making it easier to compare related words
such as win, winning, and winner.
See http://snowball.tartarus.org/ for
more information about the concept and algorithms
for stemming.
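To make the idea concrete, here is a minimal sketch of suffix stripping in base R. The function toy_stem is a hypothetical helper written only for illustration; it is not part of this package and it is not the Porter algorithm, which applies a much larger set of ordered rules.

```r
# Toy suffix stripper for illustration only; NOT the Porter algorithm.
toy_stem <- function(words) {
  # Drop a trailing "ing" or "s" when at least three characters remain.
  stems <- sub("^(.{3,}?)(ing|s)$", "\\1", words, perl = TRUE)
  # Collapse a doubled final consonant left behind ("winn" -> "win").
  sub("([b-df-hj-np-tv-z])\\1$", "\\1", stems, perl = TRUE)
}

toy_stem(c("win", "winning", "wins", "winner"))
# [1] "win"    "win"    "win"    "winner"
```

Even this toy rule set maps win, winning, and wins to the same stem, which is what makes the words comparable. Like the example output further down, it leaves winner unchanged.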
Usage
wordStem(words, language = character(), warnTested = FALSE)
Arguments
words
a character vector of words whose stems are to be
computed.
language
the language to use for stemming.
This should be either a single string that is an element of the
vector returned by getStemLanguages(), or
a character vector of length 3
giving the names of the routines for
creating a Snowball SN_env environment, closing it,
and performing the stem (in that order).
See the examples below.
warnTested
a logical value controlling whether a warning is issued
for languages that have not been explicitly tested as part of the
package's unit tests. These warnings can usually be ignored,
so they are suppressed by default (warnTested = FALSE). A global
option may control this in the future.
Details
This uses Dr. Martin Porter's stemming algorithm
and the interface generated by Snowball
(http://snowball.tartarus.org/).
Value
A character vector of the same length
as the input vector;
each element is the stem
of the corresponding input word.
Examples
# Simple example
# Returns: "win" "win" "winner"
wordStem(c("win", "winning", "winner"))
# test the supplied vocabulary.
testWords = readLines(system.file("words", "english", "voc.txt", package = "RTextTools"))
validate = readLines(system.file("words", "english", "output.txt", package = "RTextTools"))
## Not run:
# Read the test words directly from the snowball site over the Web
testWords = readLines(url("http://snowball.tartarus.org/english/voc.txt"))
## End(Not run)
testOut = wordStem(testWords)
all(validate == testOut)
# Specify the language from one of the built-in languages.
testOut = wordStem(testWords, "english")
all(validate == testOut)
# To illustrate using the dynamic lookup of symbols that allows one
# to easily add new languages or create and close environment
# routines (for example, to manage pools if this were an efficiency
# issue!)
testOut = wordStem(testWords, c("testDynCreate", "testDynClose", "testDynStem"))
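One caveat about the vocabulary checks above: system.file() returns an empty string when it cannot find the requested file, and all() applied to a zero-length vector is TRUE. If the word lists are not installed with the package, the comparison can therefore succeed without comparing anything. The file("") warnings in the session output below suggest this may have happened in that run. A minimal sketch of a stricter check follows; read_package_lines and stems_match are hypothetical helper names, not functions provided by the package.

```r
# Hypothetical helper (not part of the package): read a file shipped with a
# package, failing loudly if system.file() cannot find it.
read_package_lines <- function(..., package) {
  path <- system.file(..., package = package)
  # system.file() returns "" when no matching file exists.
  if (!nzchar(path)) {
    stop("file not found in package '", package, "'", call. = FALSE)
  }
  readLines(path)
}

# Hypothetical helper: compare stems, rejecting empty or mismatched input.
stems_match <- function(expected, actual) {
  length(expected) > 0 &&
    length(expected) == length(actual) &&
    all(expected == actual)
}

all(character(0) == character(0))              # TRUE, although nothing was compared
stems_match(character(0), character(0))        # FALSE
stems_match(c("win", "win"), c("win", "win"))  # TRUE
```

With these helpers, the check above could be written as stems_match(validate, wordStem(testWords)), after loading both vectors with read_package_lines(). Passing mustWork = TRUE to system.file() is another way to turn a missing file into an error.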
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
> library(RTextTools)
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
> ### Name: wordStem
> ### Title: Get the common root/stem of words
> ### Aliases: wordStem
> ### Keywords: IO utilities
>
> ### ** Examples
>
>
> # Simple example
> # "win" "win" "winner"
> wordStem(c("win", "winning", 'winner'))
[1] "win" "win" "winner"
>
>
> # test the supplied vocabulary.
> testWords = readLines(system.file("words", "english", "voc.txt", package = "RTextTools"))
Warning message:
In file(con, "r") :
file("") only supports open = "w+" and open = "w+b": using the former
> validate = readLines(system.file("words", "english", "output.txt", package = "RTextTools"))
Warning message:
In file(con, "r") :
file("") only supports open = "w+" and open = "w+b": using the former
>
> ## Not run:
> ##D # Read the test words directly from the snowball site over the Web
> ##D testWords = readLines(url("http://snowball.tartarus.org/english/voc.txt"))
> ## End(Not run)
>
>
> testOut = wordStem(testWords)
> all(validate == testOut)
[1] TRUE
>
> # Specify the language from one of the built-in languages.
> testOut = wordStem(testWords, "english")
> all(validate == testOut)
[1] TRUE
>
> # To illustrate using the dynamic lookup of symbols that allows one
> # to easily add new languages or create and close environment
> # routines (for example, to manage pools if this were an efficiency
> # issue!)
> testOut = wordStem(testWords, c("testDynCreate", "testDynClose", "testDynStem"))