'ZIClass' objects are key items in ZooImage. They contain everything that is
required to automatically classify plankton from .zid files. They can be used
as black boxes by all users (but building them requires users trained in
machine learning techniques). Hence, ZooImage is very simple for biologists
who just want to use classifiers and do not want to worry about the
complexities of what is done inside the engine.
Usage
ZIClass(formula, data, method = getOption("ZI.mlearning", "mlRforest"),
calc.vars = getOption("ZI.calcVars", calcVars), drop.vars = NULL,
drop.vars.def = dropVars(), cv.k = 10, cv.strat = TRUE,
..., subset, na.action = na.omit)
## S3 method for class 'ZIClass'
print(x, ...)
## S3 method for class 'ZIClass'
summary(object, sort.by = "Fscore", decreasing = TRUE,
na.rm = FALSE, ...)
## S3 method for class 'ZIClass'
predict(object, newdata, calc = TRUE, class.only = TRUE,
type = "class", ...)
## S3 method for class 'ZIClass'
confusion(x, y = response(x), labels = c("Actual", "Predicted"),
useNA = "ifany", prior, use.cv = TRUE, ...)
Arguments
formula
a formula whose left-hand side is the class variable and whose
right-hand side is a list of predictor variables separated by '+' signs.
Since data is supposed to be previously filtered using
calc.vars and the class variable in a 'ZITrain' object is always
named Class, the formula almost always reduces to Class ~ .
data
a data frame (usually a 'ZITrain' object), containing both the
measurements and the manual classification (a factor variable usually named
'Class').
method
the machine learning method to use. It should produce
results compatible with mlearning objects as returned by the various
mlXXX() functions in the mlearning package. By default, the
random forest algorithm is used (it is among the ones that give the best
results with plankton).
calc.vars
a function to use to calculate variables from the original
data frame.
drop.vars
a character vector with names of variables to drop for the
classification, or NULL (by default) to keep them all.
drop.vars.def
a second list of variables to drop, contained in a
character vector. That list is supposed to match the names of variables that
are obviously uninformative and are dropped by default. It can be gathered
automatically using dropVars(). See ?calcVars for more details.
cv.k
the number of folds (k) to use for k-fold cross-validation.
cv.strat
should stratified sampling be used for cross-validation?
(recommended).
...
further arguments to pass to the classification algorithm (see
help of that particular function).
subset
an expression for subsetting the original data frame.
na.action
the function to filter the initial data frame for missing
values. Although the default in R is na.fail, leading to failure if
at least one NA is found in the data frame, the default here is
na.omit, which eliminates all rows containing at least
one NA. Check how many items remain if your dataset contains
many NAs!
x
a 'ZIClass' object.
object
a 'ZIClass' object.
newdata
a 'ZIDat' object, or a 'data.frame' to use for prediction.
sort.by
the statistic to use to sort the table (by default, F-score).
decreasing
should the table be sorted in decreasing order (TRUE, the default) or in
increasing order?
na.rm
do we eliminate entries with missing data first (using
na.omit())?
calc
a boolean indicating if variables have to be recalculated
before running the prediction.
class.only
if TRUE, return just a vector with the classification,
otherwise, return the 'ZIDat' object with a 'Predicted' column appended to it.
type
the type of result to return, "class" by default. No other
value is permitted if class.only is FALSE.
y
a factor with reference classes.
labels
labels to use for, respectively, the reference class and the
predicted class.
useNA
do we keep NAs as a separate category? The default "ifany"
creates this category only if there are missing values. Other possibilities
are "no", or "always". The default is suitable for test sets
because unclassified items (those in the "_" directory or one of its
subdirectories) get NA for Class.
prior
class frequencies to use for the first classifier, the one that
is tabulated in the rows of the confusion matrix. This is either a single
positive number to set all class frequencies to this value (use 1 for
relative frequencies and 100 for relative frequencies in percent), or a
vector of positive numbers of the same length as the levels in the object.
If the vector is named, names must match levels. Alternatively, providing
NULL or an object of null length resets row class frequencies to
their initial values.
use.cv
the predicted values extracted from the 'ZIClass' object can
either be the predicted values from the training set, or the cross-validated
predictions (by default). Most of the time, you want the cross-validated
predictions, which give an unbiased (or at least less biased) evaluation of
the classifier's predictions. If in doubt, you are probably better off
leaving the default value.
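
The effect of na.action, cv.k and cv.strat can be illustrated in base R,
independently of ZooImage. The sketch below only shows the general technique
(row-wise removal of NAs, then stratified assignment of the remaining items to
k folds); it is not the code used internally by ZIClass(). The data frame
train and the helper stratified_folds() are hypothetical names introduced for
this illustration.

```r
# Hypothetical toy training set: two measurements and a manual classification
train <- data.frame(
  Area  = c(1.2, 0.8, NA, 2.5, 2.1, 0.9, 2.8, 1.1, 2.4, 1.0, 2.2, 0.7),
  ECD   = c(1.1, 0.9, 1.0, 1.8, 1.6, NA, 1.9, 1.0, 1.7, 1.1, 1.6, 0.8),
  Class = factor(rep(c("Copepod", "Diatom"), each = 6))
)

# Hypothetical helper: assign each item to one of k folds, class by class,
# so that every fold receives a similar share of each class
stratified_folds <- function(classes, k) {
  folds <- integer(length(classes))
  for (lev in levels(classes)) {
    idx <- which(classes == lev)
    idx <- idx[sample.int(length(idx))]            # shuffle within the class
    folds[idx] <- rep_len(seq_len(k), length(idx)) # deal fold numbers in turn
  }
  folds
}

set.seed(42)
clean <- na.omit(train)                  # same filter as na.action = na.omit
n_dropped <- nrow(train) - nrow(clean)   # how many items were lost to NAs
folds <- stratified_folds(clean$Class, k = 2)
fold_table <- table(Class = clean$Class, Fold = folds)
print(n_dropped)
print(fold_table)
```

Counting the dropped rows before training is a cheap way to notice a dataset
where na.omit silently removes a large share of one class.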
Value
ZIClass() is the constructor that builds the 'ZIClass' object.
print(), summary() and predict() are the methods to
print the object, to calculate statistics on this classifier based on the
confusion matrix, and to predict groups for ZooImage samples, respectively,
using one 'ZIClass' object.
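
As a rough illustration of the kind of statistics that can be derived from a
confusion matrix, including the F-score used by default for sorting, the
following base R sketch computes recall, precision and F-score per class. The
matrix values and the helper name class_stats() are hypothetical, and the
summary() method itself may compute and report other statistics as well.

```r
# Hypothetical confusion matrix: rows = actual classes, columns = predicted
conf <- matrix(c(8, 2, 0,
                 1, 6, 3,
                 1, 0, 9),
               nrow = 3, byrow = TRUE,
               dimnames = list(Actual    = c("Copepod", "Diatom", "Larva"),
                               Predicted = c("Copepod", "Diatom", "Larva")))

# Hypothetical helper: per-class statistics from a square confusion matrix
class_stats <- function(conf) {
  tp <- diag(conf)                   # correctly classified items per class
  recall <- tp / rowSums(conf)       # share of each actual class recovered
  precision <- tp / colSums(conf)    # share of each predicted class correct
  fscore <- 2 * precision * recall / (precision + recall)
  data.frame(Recall = recall, Precision = precision, Fscore = fscore)
}

stats <- class_stats(conf)
# Sort by F-score in decreasing order, best classes first
ranked <- stats[order(stats$Fscore, decreasing = TRUE), ]
print(ranked)
```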
Note
Always carefully analyze the properties, performance and limitations of a
'ZIClass' object before using it to classify objects of one series. For
instance, you can use confusion() to compare two classifiers, or an
automatic classifier with a manual classification done by a taxonomist.
Always respect the limitations in the use of a 'ZIClass' object (for
instance, a classifier specific to one given series should not be used to
classify items in a different series)! It is good practice to write a
report documenting a 'ZIClass' object, together with the comments of the
taxonomists who made the reference training set, and with details on the
analysis of the performance of the classifier.
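
Comparing an automatic classification with a manual one amounts to
cross-tabulating two factors. The base R sketch below shows the idea with
table(), then rescales each row to percentages, which is the kind of view that
prior = 100 is described as giving in confusion(). The two factors are made-up
data, and this is not the implementation of confusion().

```r
classes <- c("Copepod", "Diatom", "Larva")

# Made-up manual (reference) and automatic classifications of ten items
manual <- factor(c("Copepod", "Copepod", "Copepod", "Copepod", "Diatom",
                   "Diatom", "Larva", "Larva", "Larva", "Larva"),
                 levels = classes)
auto   <- factor(c("Copepod", "Copepod", "Copepod", "Diatom", "Diatom",
                   "Larva", "Larva", "Larva", "Larva", "Copepod"),
                 levels = classes)

# Raw counts: rows = manual classification, columns = automatic one
conf <- table(Actual = manual, Predicted = auto, useNA = "ifany")

# Row percentages: each actual class sums to 100
conf_pct <- 100 * prop.table(conf, margin = 1)

# Overall agreement between the two classifications
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(round(conf_pct, 1))
print(accuracy)
```

Row percentages make classes of very different abundance comparable, which is
useful when writing the report on the classifier.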