R: Cross-validation for tuning parameter selection
cvTuning
R Documentation
Description
Select tuning parameters of a model by estimating the
respective prediction errors via (repeated) K-fold
cross-validation. The model can be supplied either as a
model fitting function or as an unevaluated call to a
model fitting function.
Usage
cvTuning(object, ...)
## S3 method for class 'function'
cvTuning(object, formula,
    data = NULL, x = NULL, y, tuning = list(),
    args = list(), cost = rmspe, K = 5, R = 1,
    foldType = c("random", "consecutive", "interleaved"),
    folds = NULL, names = NULL, predictArgs = list(),
    costArgs = list(), selectBest = c("min", "hastie"),
    seFactor = 1, envir = parent.frame(), seed = NULL, ...)
## S3 method for class 'call'
cvTuning(object, data = NULL, x = NULL,
    y, tuning = list(), cost = rmspe, K = 5, R = 1,
    foldType = c("random", "consecutive", "interleaved"),
    folds = NULL, names = NULL, predictArgs = list(),
    costArgs = list(), selectBest = c("min", "hastie"),
    seFactor = 1, envir = parent.frame(), seed = NULL, ...)
Arguments
object
a function or an unevaluated function call
for fitting a model (see call for the
latter).
formula
a formula describing
the model.
data
a data frame containing the variables
required for fitting the models. This is typically used
if the model in the function call is described by a
formula.
x
a numeric matrix containing the predictor
variables. This is typically used if the function call
for fitting the models requires the predictor matrix and
the response to be supplied as separate arguments.
y
a numeric vector or matrix containing the
response.
tuning
a list of arguments giving the tuning
parameter values to be evaluated. The names of the list
components should thereby correspond to the argument
names of the tuning parameters. For each tuning
parameter, a vector of values can be supplied.
Cross-validation is then applied over the grid of all
possible combinations of tuning parameter values.
args
a list of additional arguments to be passed
to the model fitting function.
cost
a cost function measuring prediction loss.
It should expect the observed values of the response to
be passed as the first argument and the predicted values
as the second argument, and must return either a
non-negative scalar value, or a list with the first
component containing the prediction error and the second
component containing the standard error. The default is
to use the root mean squared prediction error (see
cost).
K
an integer giving the number of groups into
which the data should be split (the default is five).
Keep in mind that this should be chosen such that all
groups are of approximately equal size. Setting K
equal to n yields leave-one-out cross-validation.
R
an integer giving the number of replications for
repeated K-fold cross-validation. This is ignored
for leave-one-out cross-validation and other
non-random splits of the data.
foldType
a character string specifying the type of
folds to be generated. Possible values are
"random" (the default), "consecutive" or
"interleaved".
folds
an object of class "cvFolds" giving
the folds of the data for cross-validation (as returned
by cvFolds). If supplied, this is
preferred over K and R.
names
an optional character vector giving names
for the arguments containing the data to be used in the
function call (see “Details”).
predictArgs
a list of additional arguments to be
passed to the predict method of the
fitted models.
costArgs
a list of additional arguments to be
passed to the prediction loss function cost.
selectBest
a character string specifying a
criterion for selecting the best model. Possible values
are "min" (the default) or "hastie". The
former selects the model with the smallest prediction
error. The latter is useful for models with a tuning
parameter controlling the complexity of the model (e.g.,
penalized regression). It selects the most parsimonious
model whose prediction error is no larger than
seFactor standard errors above the prediction
error of the best overall model. Note that the models
are thereby assumed to be ordered from the most
parsimonious one to the most complex one. A
one-standard-error rule (seFactor = 1) is frequently applied.
seFactor
a numeric value giving a multiplication
factor of the standard error for the selection of the
best model. This is ignored if selectBest is
"min".
envir
the environment in which to
evaluate the function call for fitting the models (see
eval).
seed
optional initial seed for the random number
generator (see .Random.seed).
...
additional arguments to be passed down.
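The grid described under tuning can be pictured with base R. The following is a minimal sketch, not the package's internal code: the parameter names lambda and alpha are hypothetical, and expand.grid is used here only as a stand-in for however cvTuning builds its grid.

```r
# Hypothetical tuning list: one vector of candidate values per parameter
# (the names 'lambda' and 'alpha' are illustrative only)
tuning <- list(lambda = c(0.1, 1, 10), alpha = c(0, 0.5, 1))

# All possible combinations, one row per candidate model
grid <- expand.grid(tuning, stringsAsFactors = FALSE)

nrow(grid)   # 3 * 3 = 9 combinations
head(grid)
```

Each row of such a grid corresponds to one model for which a prediction error is estimated.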
Details
(Repeated) K-fold cross-validation is performed in
the following way. The data are first split into K
blocks of approximately equal size, which are generated beforehand.
Each of the K blocks is left out once, the model is fit
to the remaining data, and predictions are computed for the
observations in the left-out block with the
predict method of the fitted model.
Thus a prediction is obtained for each observation.
The response variable and the obtained predictions for
all observations are then passed to the prediction loss
function cost to estimate the prediction error.
For repeated cross-validation, this process is replicated
and the estimated prediction errors from all replications
as well as their average are included in the returned
object.
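The procedure described above can be sketched by hand in base R. This is an illustration under stated assumptions rather than the package's implementation: it uses lm on the built-in mtcars data, a single replication, and a hand-written root mean squared prediction error in place of the cost function.

```r
set.seed(1234)                        # reproducible random folds
n <- nrow(mtcars)
K <- 5

# Assign each observation to one of K blocks of approximately equal size
fold <- sample(rep(seq_len(K), length.out = n))

pred <- numeric(n)
for (k in seq_len(K)) {
  test <- fold == k
  # Fit the model without block k ...
  fit <- lm(mpg ~ wt + hp, data = mtcars[!test, ])
  # ... and predict the observations in the left-out block
  pred[test] <- predict(fit, newdata = mtcars[test, ])
}

# One prediction per observation; compare all of them with the response
rmspeHat <- sqrt(mean((mtcars$mpg - pred)^2))
rmspeHat
```

For repeated cross-validation, the same loop would be run R times with new random folds, keeping each estimate as well as their average.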
Furthermore, if the response is a vector but the
predict method of the fitted models
returns a matrix, the prediction error is computed for
each column. A typical use case for this behavior would
be if the predict method returns
predictions from an initial model fit and stepwise
improvements thereof.
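As a small illustration of the column-wise behavior, the sketch below computes a root mean squared prediction error for each column of a prediction matrix. The data and the column names initial and refined are made up for this example.

```r
y <- c(1, 2, 3, 4)

# Hypothetical predictions: column 1 from an initial fit,
# column 2 from an improved fit
yHat <- cbind(initial = c(1.5, 2.5, 2.5, 3.5),
              refined = c(1, 2, 3, 4.2))

# One prediction error per column
err <- apply(yHat, 2, function(p) sqrt(mean((y - p)^2)))
err
```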
If formula or data are supplied, all
variables required for fitting the models are added as
one argument to the function call, which is the typical
behavior of model fitting functions with a
formula interface. In this case,
the accepted values for names depend on the
method. For the function method, a character
vector of length two should be supplied, with the first
element specifying the argument name for the formula and
the second element specifying the argument name for the
data (the default is to use c("formula", "data")).
Note that names for both arguments should be supplied
even if only one is actually used. For the call
method, which does not have a formula argument, a
character string specifying the argument name for the
data should be supplied (the default is to use
"data").
If x is supplied, on the other hand, the predictor
matrix and the response are added as separate arguments
to the function call. In this case, names should
be a character vector of length two, with the first
element specifying the argument name for the predictor
matrix and the second element specifying the argument
name for the response (the default is to use c("x",
"y")). It should be noted that the formula or
data arguments take precedence over x.
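The two ways of adding data to the function call can be sketched in base R as follows. This is only an illustration of the idea, assuming lm for the formula interface and lm.fit for the x/y interface; it is not the code cvTuning runs internally.

```r
train <- mtcars[-(1:5), ]   # pretend the first five rows are left out

# Formula interface: all variables are added as one argument,
# here under the default name "data"
fCall <- call("lm", formula = mpg ~ wt)
fCall$data <- train
fit1 <- eval(fCall)

# x/y interface: predictor matrix and response are added as separate
# arguments, here under the default names c("x", "y")
argNames <- c("x", "y")
xyArgs <- setNames(list(cbind(1, train$wt), train$mpg), argNames)
xyCall <- as.call(c(list(as.name("lm.fit")), xyArgs))
fit2 <- eval(xyCall)

# Both calls fit the same regression
coef(fit1)
fit2$coefficients
```

A fitting function whose arguments are named differently would need the corresponding names supplied instead of the defaults.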
Value
If tuning is an empty list, cvFit is
called to return an object of class "cv".
Otherwise an object of class "cvTuning" (which
inherits from class "cvSelect") with the following
components is returned:
n
an integer giving the number of observations.
K
an integer giving the number of folds.
R
an integer giving the number of replications.
tuning
a data frame containing the grid of tuning
parameter values for which the prediction error was
estimated.
best
an integer vector giving the indices of the
optimal combinations of tuning parameters.
cv
a data frame containing the estimated
prediction errors for all combinations of tuning
parameter values. For repeated cross-validation, those
are average values over all replications.
se
a data frame containing the estimated standard
errors of the prediction loss for all combinations of
tuning parameter values.
selectBest
a character string specifying the
criterion used for selecting the best model.
seFactor
a numeric value giving the multiplication
factor of the standard error used for the selection of
the best model.
reps
a data frame containing the estimated
prediction errors from all replications for all
combinations of tuning parameter values. This is only
returned for repeated cross-validation.
seed
the seed of the random number generator
before cross-validation was performed.
call
the matched function call.
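The difference between the two selection criteria can be shown on made-up numbers. The sketch below assumes that the models are ordered from most parsimonious to most complex and that the standard error of the best overall model defines the threshold; the values in cv and se are hypothetical.

```r
# Hypothetical results, ordered from most parsimonious to most complex
cv <- c(1.30, 1.12, 1.05, 1.00, 1.02)   # estimated prediction errors
se <- c(0.10, 0.09, 0.08, 0.08, 0.09)   # estimated standard errors
seFactor <- 1

# "min": the model with the smallest prediction error
bestMin <- which.min(cv)

# "hastie": the most parsimonious model within seFactor standard
# errors of the best overall model
threshold <- cv[bestMin] + seFactor * se[bestMin]
bestHastie <- which(cv <= threshold)[1]

c(min = bestMin, hastie = bestHastie)
```

Here the minimum is attained by the fourth model, while the one-standard-error rule picks the simpler third model.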
Note
The same cross-validation folds are used for all
combinations of tuning parameter values for maximum
comparability.
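To illustrate what fixing the folds means, the sketch below generates fold assignments once in base R. It is a rough picture of the three fold types named under foldType, not a reproduction of cvFolds; in particular, which blocks receive the extra observations is an assumption.

```r
n <- 10
K <- 3
set.seed(1234)

# Block sizes differ by at most one observation
sizes <- rep(n %/% K, K) + (seq_len(K) <= n %% K)

consecutive <- rep(seq_len(K), times = sizes)       # 1 1 1 1 2 2 2 3 3 3
interleaved <- rep(seq_len(K), length.out = n)      # 1 2 3 1 2 3 1 2 3 1
random      <- sample(interleaved)                  # random permutation

# The same assignment (e.g. 'random') would then be reused for every
# combination of tuning parameter values
table(random)
```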
Author(s)
Andreas Alfons
References
Hastie, T., Tibshirani, R. and Friedman, J. (2009)
The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, 2nd edition.
See Also
cvTool, cvFit,
cvSelect, cvFolds,
cost
Examples
library("cvTools")
library("robustbase")
data("coleman")
## evaluate MM regression models tuned for 85% and 95% efficiency
tuning <- list(tuning.psi = c(3.443689, 4.685061))
## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
cvTuning(lmrob, formula = Y ~ ., data = coleman, tuning = tuning,
    cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1),
    seed = 1234)
## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvTuning(call, data = coleman, y = coleman$Y, tuning = tuning,
    cost = rtmspe, K = 5, R = 10, costArgs = list(trim = 0.1),
    seed = 1234)