Last data update: 2014.03.03

R: k-means sampling
naes R Documentation

k-means sampling

Description

Perform a k-means sampling on a matrix or data.frame for multivariate calibration

Usage

naes(X, k, pc, iter.max = 10, method = 0, .center = TRUE, .scale = FALSE)

Arguments

X

numeric matrix or data.frame

k

either the number of calibration samples to select or a set of cluster centres to initiate the k-means clustering.

pc

optional. If not specified, k-means is run directly in the variable (Euclidean) space. Otherwise, a PCA is performed before k-means and pc is the number of principal components kept. If pc < 1, the number of principal components kept is the smallest number of components explaining at least (pc * 100) percent of the total variance.

iter.max

maximum number of iterations allowed for the k-means clustering. Default is iter.max = 10 (see ?kmeans)

method

the method used for selecting calibration samples within each cluster: either samples closest to the cluster centers (method = 0, default), samples farthest away from the centre of the data (method = 1) or random selection (method = 2)

.center

logical value indicating whether the input matrix should be centered before Principal Component Analysis. Default set to TRUE.

.scale

logical value indicating whether the input matrix should be scaled before Principal Component Analysis. Default set to FALSE.
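The rule for pc < 1 can be illustrated with base R. This is a minimal sketch, not the package's own code: the helper name n_components is hypothetical, and it assumes the variance threshold is applied to the cumulative proportion of variance returned by prcomp.

```r
# Hypothetical helper (not part of the package): translate a `pc` value
# into a number of principal components to keep.
n_components <- function(X, pc, center = TRUE, scale = FALSE) {
  # pc >= 1 is taken as the number of components itself
  if (pc >= 1) return(as.integer(pc))
  pca <- prcomp(X, center = center, scale. = scale)
  # cumulative proportion of the total variance explained
  explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  # smallest number of components reaching the threshold
  min(which(explained >= pc), length(explained))
}

set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.1)  # make two variables correlated
n_pc_90 <- n_components(X, 0.90)  # components needed for at least 90% variance
n_pc_3  <- n_components(X, 3)     # a value >= 1 is used as is
```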

Details

K-means sampling is a simple procedure based on cluster analysis to select calibration samples from large multivariate datasets. The method can be described in three steps (Naes et al., 2002):

  1. Perform a PCA and decide how many principal components to keep,

  2. Carry out a k-means clustering on the principal component scores and choose the number of resulting clusters to be equal to the number of desired calibration samples,

  3. Select one sample from each cluster.
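The three steps above can be sketched with base R only (prcomp and kmeans from the stats package). This is an illustration under stated assumptions, not the implementation of naes: the function name kmeans_sample_sketch and its n_pc argument are hypothetical, the scores are used unscaled, and the three selection rules follow the description of the method argument.

```r
# Illustrative sketch only; names and defaults are assumptions.
kmeans_sample_sketch <- function(X, k, n_pc = 2, iter.max = 10, method = 0) {
  X <- as.matrix(X)
  # Step 1: PCA on the centred data; keep the first n_pc columns of scores
  pca <- prcomp(X, center = TRUE, scale. = FALSE)
  scores <- pca$x[, seq_len(n_pc), drop = FALSE]
  # Step 2: k-means with one cluster per desired calibration sample
  km <- kmeans(scores, centers = k, iter.max = iter.max)
  data_centre <- colMeans(scores)
  # squared Euclidean distance from each row of m to the point p
  sq_dist <- function(m, p) rowSums(sweep(m, 2, p)^2)
  # Step 3: select one sample from each cluster
  model <- vapply(seq_len(nrow(km$centers)), function(j) {
    members <- which(km$cluster == j)
    m <- scores[members, , drop = FALSE]
    pick <- switch(as.character(method),
      "0" = which.min(sq_dist(m, km$centers[j, ])),  # closest to cluster centre
      "1" = which.max(sq_dist(m, data_centre)),      # farthest from data centre
      "2" = sample.int(length(members), 1L)          # random member
    )
    members[pick]
  }, integer(1))
  list(model = model,
       test = setdiff(seq_len(nrow(X)), model),
       cluster = km$cluster,
       centers = km$centers)
}

set.seed(42)
X <- matrix(rnorm(100 * 5), ncol = 5)  # simulated data, for illustration
sel <- kmeans_sample_sketch(X, k = 4)
sel$model  # row indices of the selected samples, one per cluster
```

Because each cluster contributes exactly one sample, the number of clusters fixes the size of the calibration set.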

Value

a list with components:

  • 'model' numeric vector giving the row indices of the input data selected for calibration

  • 'test' numeric vector giving the row indices of the remaining observations

  • 'pc' if the pc argument is specified, a numeric matrix of the scaled pc scores

  • 'cluster' integer vector indicating the cluster to which each point was assigned

  • 'centers' a matrix of cluster centres
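The 'model' and 'test' indices can be used directly to split the input data. A minimal sketch, in which X and sel are hand-made stand-ins for the input matrix and the returned list (the values are illustrative, not real results):

```r
# Stand-in data and selection (hypothetical values, for illustration only)
X <- matrix(seq_len(40), nrow = 10)      # 10 samples, 4 variables
sel <- list(model = c(2L, 5L, 9L))       # rows chosen for calibration
sel$test <- setdiff(seq_len(nrow(X)), sel$model)

X_cal  <- X[sel$model, , drop = FALSE]   # calibration set
X_test <- X[sel$test, , drop = FALSE]    # remaining observations
```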

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez

References

Naes, T., 1987. The design of calibration in near infra-red reflectance analysis by clustering. Journal of Chemometrics 1, 121-134.

Naes, T., Isaksson, T., Fearn, T., and Davies, T., 2002. A user friendly guide to multivariate calibration and classification. NIR Publications, Chichester, United Kingdom.

See Also

kenStone, honigs, duplex, shenkWest

Examples

library(prospectr)
data(NIRsoil)
# select 5 calibration samples; keep the PCs explaining at least 99% of the variance
sel <- naes(NIRsoil$spc, k = 5, pc = .99, method = 0)
plot(sel$pc[, 1:2], col = sel$cluster + 2) # clusters
# points selected for calibration with method = 0
points(sel$pc[sel$model, 1:2], col = 2, pch = 19, cex = 1)
# pre-defined centers can also be provided
sel2 <- naes(NIRsoil$spc, k = sel$centers, pc = .99, method = 1)
# points selected for calibration with method = 1
points(sel$pc[sel2$model, 1:2], col = 1, pch = 15, cex = 1)

Results