vec2xxx
R Documentation
Type-Token Statistics for Samples and Empirical Data (zipfR)
Description
Compute type-frequency list, frequency spectrum and vocabulary growth
curve from a token vector representing a random sample or an observed
sequence of tokens.
Arguments
x
a vector of length N_0, representing a random sample or
other observed data set of N_0 tokens. For each token, the
corresponding element of x specifies the type that the
token belongs to. Usually, x is a character vector, but it
might also specify integer IDs in some cases.
steps
number of steps for which vocabulary growth data
V(N) is calculated. The values of N will be evenly
spaced (up to rounding differences) from N=1 to N=N_0.
stepsize
alternative way of specifying the steps of the
vocabulary growth curve. In this case, vocabulary growth data will
be calculated every stepsize tokens. The first step is
chosen such that the last step corresponds to the full sample
(N=N_0). Only one of the parameters steps and
stepsize may be specified.
m.max
an integer in the range 1 ... 9, specifying how many
spectrum elements V_m(N) to include in the vocabulary growth
curve. By default only vocabulary size V(N) is calculated,
i.e. m.max=0.
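As a brief sketch of how these parameters interact (assuming a token vector x as described above; the three calls below are illustrative, not exhaustive):

```r
## granularity of the vocabulary growth curve can be controlled
## either by the number of steps or by a fixed step size
vgc1 <- vec2vgc(x, steps=100)      # 100 evenly spaced sample sizes
vgc2 <- vec2vgc(x, stepsize=1000)  # one data point every 1000 tokens

## m.max adds spectrum elements V_1(N), V_2(N) to the growth curve
vgc3 <- vec2vgc(x, m.max=2)
```

Note that steps and stepsize are mutually exclusive; specifying both raises an error.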
Details
There are two main applications for the vec2xxx functions:
a)
They can be used to calculate type-token statistics and
vocabulary growth curves for random samples generated from a LNRE
model (with the rlnre function).
b)
They provide an easy way to process a user's own data
without having to rely on external scripts to compute frequency
spectra and vocabulary growth curves. All that is needed is a
text file in one-token-per-line format (i.e. where each token is
given on a separate line). See "Examples" below for further
hints.
Both applications work well for samples of up to approx. 1 million
tokens. For considerably larger data sets, specialized external
software should be used, such as the Perl scripts provided on the
zipfR homepage.
Value
An object of class tfl, spc or vgc, representing
the type frequency list, frequency spectrum or vocabulary growth curve
of the token vector x, respectively.
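The returned objects can be inspected with the usual zipfR accessor methods; a minimal sketch (assuming a token vector x as above):

```r
## sample size, vocabulary size and hapax count of the sample,
## read off the type frequency list and frequency spectrum
tfl <- vec2tfl(x)
spc <- vec2spc(x)
N(tfl)      # sample size N_0
V(spc)      # vocabulary size V(N_0)
Vm(spc, 1)  # number of hapax legomena V_1(N_0)
```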
See Also
tfl, spc and vgc for more
information about type frequency lists, frequency spectra and
vocabulary growth curves
rlnre for generating random samples (in the form of the
required token vectors) from a LNRE model
readLines and scan for loading token
vectors from disk files
Examples
## type-token statistics for random samples from a LNRE distribution
model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)
vec2tfl(x)
vec2spc(x) # same as tfl2spc(vec2tfl(x))
vec2vgc(x)
sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
## Not run: plot(exp.spc, sample.spc)
sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
## Not run: plot(exp.vgc, sample.vgc, add.m=1)
## load token vector from a file in one-token-per-line format
## Not run: x <- readLines(filename)
## Not run: x <- readLines(file.choose()) # with file selection dialog
## you can also perform whitespace tokenization and filter the data
## Not run: brown <- scan("brown.pos", what=character(0), quote="")
## Not run: nouns <- grep("/NNS?$", brown, value=TRUE)
## Not run: plot(vec2spc(nouns))
## Not run: plot(vec2vgc(nouns, m.max=1), add.m=1)