distance metric to be used: 'euclid'
(Euclidean distance) or 'mahal' (Mahalanobis distance,
default).
pc
optional. If not specified, distances are
computed in the Euclidean space. Otherwise, distances
are computed in the principal component score space and
pc is the number of principal components retained.
If pc < 1, the number of principal components kept
corresponds to the number of components explaining at
least (pc * 100) percent of the total variance.
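The rule for pc described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's internal code: n_retained is a hypothetical helper, and prcomp is assumed as the PCA routine.

```r
# Hypothetical helper (not part of the package): number of principal
# components retained for a given value of pc
n_retained <- function(X, pc, center = TRUE, scale = FALSE) {
  pca <- prcomp(X, center = center, scale. = scale)
  if (pc >= 1) {
    return(min(pc, length(pca$sdev)))  # pc is a number of components
  }
  # cumulative share of the total variance explained by the components
  explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(explained >= pc)[1]            # first count reaching the threshold
}

set.seed(1)
X <- matrix(rnorm(200), ncol = 4) %*% diag(c(10, 3, 1, 0.1))
n_retained(X, pc = 2)     # a fixed number of components
n_retained(X, pc = 0.99)  # enough components for at least 99% of the variance
```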
group
An optional factor (or vector that
can be coerced to a factor by as.factor) of
length equal to nrow(X), giving the identifier of related
observations (e.g. samples of the same batch of
measurements, of the same origin, or of the same soil
profile). When one observation is selected by the
procedure, all observations of the same group are removed
together and assigned to the same set (calibration or validation).
This makes it possible to select calibration and validation samples
that are independent from each other.
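The grouping behaviour can be pictured with a small base R sketch. It assumes only what is described above (selecting one observation pulls in its whole group); expand_to_group is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: indices of all observations sharing a group
# with any of the selected observations
expand_to_group <- function(selected, group) {
  group <- as.factor(group)
  which(group %in% group[selected])
}

# Six observations from three soil profiles (made-up labels)
group <- c("p1", "p1", "p2", "p3", "p3", "p3")
expand_to_group(3, group)        # observation 3 is alone in its group
expand_to_group(c(1, 5), group)  # brings in the rest of groups p1 and p3
```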
.center
logical value indicating whether the input
matrix should be centered before Principal Component
Analysis. Default set to TRUE.
.scale
logical value indicating whether the input
matrix should be scaled before Principal Component
Analysis. Default set to FALSE.
Details
The DUPLEX algorithm is similar to the Kennard-Stone
algorithm (see kenStone) but selects
both calibration and validation points that are
independent. Like the Kennard-Stone algorithm, it
starts by selecting the pair of points that are the
farthest apart. They are assigned to the calibration set
and removed from the list of points. Then, the pair of
remaining points that are farthest apart is assigned to the
validation set and removed from the list. In a third step,
the procedure assigns remaining points alternately to
the calibration and validation sets based on the distance
to the points already selected. As in the
Kennard-Stone algorithm, the distance metric is chosen with
the metric argument: 'mahal' (Mahalanobis distance, default)
or 'euclid' (Euclidean distance). Distances can also be
computed in the principal component score space using the
pc argument (see kenStone).
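The three steps above can be sketched in base R. This is a minimal illustration, not the package's implementation: duplex_sketch is a hypothetical helper, plain Euclidean distances are used, and the third step is assumed to pick the candidate whose nearest already-selected point is farthest away (the Kennard-Stone maximin criterion). It requires k >= 2 and at least 2 * k distinct points.

```r
# Hypothetical helper, base R only; not the package's implementation
duplex_sketch <- function(X, k) {
  X <- as.matrix(X)
  stopifnot(k >= 2, 2 * k <= nrow(X))
  D <- as.matrix(dist(X))               # pairwise Euclidean distances
  remaining <- seq_len(nrow(X))

  # two points of idx that are farthest apart
  farthest_pair <- function(idx) {
    sub <- D[idx, idx, drop = FALSE]
    pos <- arrayInd(which.max(sub), dim(sub))
    idx[as.vector(pos)]
  }
  # point of idx whose nearest neighbour in set is farthest away
  next_point <- function(idx, set) {
    min_d <- apply(D[idx, set, drop = FALSE], 1, min)
    idx[which.max(min_d)]
  }

  model <- farthest_pair(remaining)      # step 1: calibration pair
  remaining <- setdiff(remaining, model)
  test <- farthest_pair(remaining)       # step 2: validation pair
  remaining <- setdiff(remaining, test)

  while (length(model) < k) {            # step 3: alternate between sets
    pick <- next_point(remaining, model)
    model <- c(model, pick)
    remaining <- setdiff(remaining, pick)
    pick <- next_point(remaining, test)
    test <- c(test, pick)
    remaining <- setdiff(remaining, pick)
  }
  list(model = model, test = test)
}

# Six points on a line: 0, 1, 2, 3, 4, 10
X <- matrix(c(0, 1, 2, 3, 4, 10), ncol = 1)
sel <- duplex_sketch(X, k = 3)
sort(sel$model)  # calibration indices
sort(sel$test)   # validation indices
```

On this toy example the two extreme points go to calibration first, the next most distant pair goes to validation, and the two remaining points are then shared between the sets.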
Value
a list with components:
'model' numeric vector giving the row
indices of the input data selected for calibration
'test' numeric vector giving the row
indices of the input data selected for validation
'pc' if the pc argument is specified,
a numeric matrix of the scaled pc scores
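The returned indices are typically used to split the input data. In the sketch below, sel is a hand-made stand-in for the list returned by duplex, and the data and index values are made up for illustration.

```r
# Toy data and a hand-made stand-in for the list returned by duplex
X <- data.frame(VAR1 = 1:10, VAR2 = (1:10)^2)
sel <- list(model = c(1, 10, 5), test = c(2, 9, 6))

calibration <- X[sel$model, ]  # rows selected for calibration
validation  <- X[sel$test, ]   # rows selected for validation

# The two sets should not share any observation
length(intersect(sel$model, sel$test)) == 0
```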
Author(s)
Antoine Stevens & Leonardo Ramirez-Lopez
References
Kennard, R.W., and Stone, L.A., 1969. Computer aided design
of experiments. Technometrics 11, 137-148.
Snee, R.D., 1977. Validation of regression models: methods
and examples. Technometrics 19, 415-428.
See Also
kenStone, honigs,
shenkWest, naes
Examples
library(prospectr)
data(NIRsoil)
# Select with k = 30 in the PC score space (components explaining at least 99% of the variance)
sel <- duplex(NIRsoil$spc, k = 30, metric = 'mahal', pc = .99)
plot(sel$pc[, 1:2], xlab = 'PC1', ylab = 'PC2')
points(sel$pc[sel$model, 1:2], pch = 19, col = 2) # points selected for calibration
points(sel$pc[sel$test, 1:2], pch = 18, col = 3) # points selected for validation
# Test on artificial data
# 20 x 20 grid (400 points, 2 variables) with small Gaussian noise
X <- expand.grid(1:20, 1:20) + rnorm(800, 0, .1)
plot(X[, 1], X[, 2], xlab = 'VAR1', ylab = 'VAR2')
sel <- duplex(X, k = 25, metric = 'mahal')
points(X[sel$model, ], pch = 19, col = 2) # points selected for calibration
points(X[sel$test, ], pch = 15, col = 3) # points selected for validation