an integer giving the maximum number of clusters to
consider.
B
an integer giving the number of bootstraps.
criteria
a character string indicating which criterion to use when
evaluating the gap data. One of ‘“tibshirani”’
(default), ‘“DandF”’ or ‘“none”’. Can be
abbreviated.
nnode
an integer giving the number of CPUs to use for parallel
processing. Defaults to NULL, i.e. no parallel processing.
scale
logical. Should the data be scaled?
...
Any other arguments passed from the generic function.
Details
This code performs the gap analysis using lga. The gap statistic is
defined as the difference between the log of the Residual Orthogonal
Sum of Squared Distances (denoted log(W_k)) and its expected
value derived using bootstrapping under the null hypothesis that there
is only one cluster. In this implementation, the reference
distribution used for the bootstrapping is a random uniform hypercube,
transformed by the principal components of the underlying data set.
For further details see Tibshirani et al (2001).
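The reference distribution described above can be sketched as follows. This is a minimal illustration of one common construction from Tibshirani et al (2001), not the package's internal code, which may differ in detail. The helper name ref_sample is hypothetical.

```r
## Hypothetical helper (not part of lga): draw one reference data set
## from a uniform box aligned with the principal components of x.
ref_sample <- function(x) {
  x <- as.matrix(x)
  centre <- colMeans(x)
  xc <- sweep(x, 2, centre)            # centre the columns
  v <- svd(xc)$v                       # principal axes (right singular vectors)
  xrot <- xc %*% v                     # data in principal-component coordinates
  rng <- apply(xrot, 2, range)         # 2 x p matrix: min and max per axis
  z <- sapply(seq_len(ncol(xrot)), function(j)
    runif(nrow(xrot), min = rng[1, j], max = rng[2, j]))
  sweep(z %*% t(v), 2, centre, "+")    # rotate back and restore the centre
}

set.seed(1)
x <- cbind(rnorm(50), rnorm(50, sd = 3))
z <- ref_sample(x)
dim(z)   # same dimensions as x
```

Sampling in the rotated coordinates lets the reference box follow the orientation of the data rather than the original axes.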
Different rules apply for different criteria. With
‘“tibshirani”’ (ibid) we calculate the gap statistic for
k = 1, …, K, stopping when
gap(k) >= gap(k+1) - s_(k+1)
where s_(k+1) is a function of the standard deviation of
the bootstrapped estimates.
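As a rough sketch of how the gap values and s_k can be obtained from the bootstrap output, the code below follows the formulas in Tibshirani et al (2001): the gap is the mean bootstrapped log(W_k) minus the observed log(W_k), and s_k = sd_k * sqrt(1 + 1/B). The helper gap_from_boot and its inputs are hypothetical, and the sign convention and exact form of s_k are assumptions about this package.

```r
## Hypothetical helper (not part of lga).
## logW:      vector of length K with the observed log(W_k)
## logW_boot: B x K matrix of log(W_k) from the B reference data sets
gap_from_boot <- function(logW, logW_boot) {
  B <- nrow(logW_boot)
  expected <- colMeans(logW_boot)
  ## standard deviation with divisor B, as in Tibshirani et al (2001)
  sdk <- sqrt(colMeans(sweep(logW_boot, 2, expected)^2))
  list(gap = expected - logW, s = sdk * sqrt(1 + 1 / B))
}

logW <- c(5.0, 4.0)
logW_boot <- rbind(c(5.1, 4.6), c(5.3, 4.8))
res <- gap_from_boot(logW, logW_boot)
res$gap   # 0.2 0.7
res$s     # both equal to 0.1 * sqrt(1.5)
```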
With the ‘“DandF”’ criterion from Dudoit and Fridlyand
(2002), we calculate the gap statistic for
all values of k = 1, …, K, selecting the number of clusters
as
khat = smallest k >= 1 such that gap(k) >= gap(kstar) - s_(kstar)
where kstar = argmax_(k >= 1) gap(k).
Finally, for the criterion “none”, no rules are applied, and
just the gap data is returned.
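The two selection rules can be written out directly. The sketch below is illustrative only: the helper names choose_tibshirani and choose_dandf are hypothetical, and the inputs are made-up vectors of gap(k) and s_k indexed by k = 1, …, K.

```r
## Hypothetical helpers (not part of lga).
## gap and s are numeric vectors indexed by k = 1, ..., K.
choose_tibshirani <- function(gap, s) {
  K <- length(gap)
  for (k in seq_len(K - 1)) {
    ## stop at the first k with gap(k) >= gap(k+1) - s_(k+1)
    if (gap[k] >= gap[k + 1] - s[k + 1]) return(k)
  }
  NA_integer_   # the rule was never satisfied
}

choose_dandf <- function(gap, s) {
  kstar <- which.max(gap)
  ## smallest k with gap(k) >= gap(kstar) - s_(kstar)
  min(which(gap >= gap[kstar] - s[kstar]))
}

gap_vals <- c(0.10, 0.45, 0.50, 0.40)   # illustrative values only
s_vals   <- c(0.02, 0.03, 0.08, 0.03)
choose_tibshirani(gap_vals, s_vals)   # 2
choose_dandf(gap_vals, s_vals)        # 2
```

The “tibshirani” rule can fail to stop before K, in which case this sketch returns NA. The “DandF” rule always returns a value, because kstar itself satisfies the inequality.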
As lga is ostensibly unsupervised in this case, the parameter niter
is set to 20 to ensure convergence.
This function is parallel computing aware via the nnode
argument, and works with the package snow. In order to
use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary.
For further details, see the documentation for snow.
Value
An object of class ‘“gap”’ with components
finished
a logical. For the “tibshirani” criterion, was a
solution found?
nclust
an integer giving the estimated number of clusters. Returns
NA if nothing conclusive is found.
data
the original data set, scaled if specified in the
arguments.
References
Tibshirani, R., Walther, G. and Hastie, T. (2001)
‘Estimating the number of clusters in a data set via the gap
statistic’, J. R. Statist. Soc. B 63, 411–423.
Dudoit, S. and Fridlyand, J. (2002) ‘A prediction-based
resampling method for estimating the number of clusters in a
dataset’, Genome Biology 3.
Van Aelst, S., Wang, X., Zamar, R. and Zhu, R. (2006)
‘Linear Grouping Using Orthogonal Regression’,
Computational Statistics & Data Analysis 50,
1287–1312.
See Also
lga
Examples
## Synthetic example
## Make a dataset with 2 clusters in 2 dimensions
library(MASS)
set.seed(1234)
X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
           mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))
gap(X, K=4, B=20)
## to run this using parallel processing with 4 nodes, the equivalent
## code would be
## Not run: gap(X, K=4, B=20, nnode=4)
## Quakes data (from package:datasets)
## Including the first two dimensions versus three dimensions
## yields different results
set.seed(1234)
## Not run:
gap(quakes[,1:2], K=4, B=20)
gap(quakes[,1:3], K=4, B=20)
## End(Not run)
library(maps)
lgaout1 <- lga(quakes[,1:2], k=3)
plot(lgaout1)
lgaout2 <- lga(quakes[,1:3], k=2)
plot(lgaout2)
## Let's put this in context
par(mfrow=c(1,2))
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)
map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)