INCAnumclu helps to estimate the number of clusters in a
dataset. It calculates the INCA index for partitions with
different numbers of clusters.
d
a distance matrix or a dist object with distance information between units.
K
the maximum number of clusters to be considered. For each value of k (k=2,...,K) a partition with k clusters is calculated.
method
character string defining the clustering method used to
obtain the partitions. Clustering is performed
via the functions pam, agnes,
diana and fanny in
package cluster. The available clustering methods are "pam" (default method), "average" (UPGMA), "single" (single linkage),
"complete" (complete linkage), "ward" (Ward's method),
"weighted" (weighted average linkage), "diana" (hierarchical
divisive) and "fanny" (fuzzy clustering). Alternatively, the user can supply
custom partitions by setting method="partition" and specifying
the partitions in argument pert.
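As a rough illustration of what two of these method names correspond to, the sketch below calls pam and agnes from package cluster directly on a toy distance object. The data, the seed and the object names (part_pam, tree_avg, part_avg) are assumptions made for the example, and this is not necessarily how INCAnumclu obtains its partitions internally.

```r
library(cluster)

set.seed(1)
# Toy data (illustrative only): 3 groups of 20 points in 2 dimensions
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 5),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
d <- dist(x)

k <- 3

# "pam": partitioning around medoids, run on the dissimilarities
part_pam <- pam(d, k = k)$clustering

# "average": agglomerative UPGMA tree, then cut into k groups
tree_avg <- agnes(d, method = "average")
part_avg <- cutree(as.hclust(tree_avg), k = k)

# Cross-tabulate the two partitions to see how far they agree
table(part_pam, part_avg)
```

Each result is an integer vector with one cluster label per unit, which is the form a partition takes throughout this page.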
pert
only used when method="partition"; it
is a matrix in which each column contains a partition of the units. That is, each column is a vector of length n giving the group each unit belongs to. The values in each column of pert are expected to be numbers greater than or equal to 1 (for instance 1,2,3,4,...,k).
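A matrix with this layout can be assembled in several ways. The sketch below is one possibility, using hclust and cutree from the stats package on toy data; the data, the seed and the choice of average linkage are assumptions made for the example.

```r
set.seed(1)
# Toy data (illustrative only): 3 groups of 20 points in 2 dimensions
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 5),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
d <- dist(x)

# Hierarchical tree cut at k = 2, 3, 4, 5: cutree() returns one column per k
hc   <- hclust(d, method = "average")
pert <- cutree(hc, k = 2:5)

dim(pert)            # one row per unit, one column per partition
apply(pert, 2, max)  # largest label in each column, i.e. its number of groups
```

Every column holds integer labels starting at 1, one per unit, which matches the description of pert above.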
L
default value NULL. When the user considers some units
to be noise units, L must be specified in one of two ways: (a)
L is a number greater than or equal to 1, and all units in clusters with
cardinality <= L are considered noise units; (b) L="custom", when the user wants to specify which units are considered noise units. In that case the units must be given in argument noise.
noise
when L="custom", it is a logical vector indicating
the units considered by the user as noise units.
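To make rule (a) concrete, the sketch below flags the units that fall in clusters of cardinality <= L for a small made-up partition. It is only a reading of the rule as described here, not the package's own code; the vector partition and the value L = 2 are hypothetical.

```r
# Hypothetical partition of 12 units: cluster sizes are 5, 4, 1 and 2
partition <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 4)
L <- 2

# Cardinality of each cluster
sizes <- table(partition)

# Labels of clusters whose cardinality is <= L
small <- names(sizes)[sizes <= L]

# Logical vector, one entry per unit: TRUE marks a noise unit
noise <- as.character(partition) %in% small

which(noise)  # positions of the units flagged as noise
```

A logical vector of this form, one entry per unit, is also what argument noise expects when L="custom".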
Value
Returns an object of class incanc, a numeric vector containing the INCA index associated with each of the partitions into k clusters (k=2,...,K). When noise is not NULL, the function returns a list with the INCA index for each partition, calculated both without and with the noise units. The associated plot method draws the INCA index, in that case both with and without noise.
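One way to read the returned values is to look for the k just before the sharpest drop in the index. The sketch below applies that heuristic to the values printed for k=2,...,5 in the run shown under Results. The heuristic is an assumption, since this page does not state a selection rule, and the names inca, drops and k_hat are illustrative.

```r
# INCA index values printed for k = 2, 3, 4, 5 in the run under Results
k    <- 2:5
inca <- c(1, 1, 0.49, 0.10)

# Decrease in the index when moving from k to k + 1 clusters
drops <- -diff(inca)

# Candidate number of clusters: the k just before the largest decrease
k_hat <- k[which.max(drops)]
k_hat
```

For these values the largest decrease occurs between k=3 and k=4, so the heuristic points to 3, the number of clusters generated in Example 1.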
Author(s)
Itziar Irigoien itziar.irigoien@ehu.es; Konputazio Zientziak eta Adimen Artifiziala, Euskal Herriko Unibertsitatea (UPV-EHU), Donostia, Spain.
Conchita Arenas carenas@ub.edu; Departament d'Estadistica, Universitat de Barcelona, Barcelona, Spain.
References
Irigoien, I. and Arenas, C. (2008). INCA: New statistic for estimating the number of clusters and identifying atypical units. Statistics in Medicine, 27(15), 2948–2973.
Arenas, C. and Cuadras, C.M. (2002). Some recent statistical methods based on distances. Contributions to Science, 2, 183–191.
See Also
INCAindex, estW
Examples
#------- Example 1 --------------------------------------
#generate 3 clusters, each of them with 20 objects in dimension 5.
mu1 <- sample(1:10, 5, replace=TRUE)
x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
mu2 <- sample(1:10, 5, replace=TRUE)
x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
mu3 <- sample(1:10, 5, replace=TRUE)
x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
x <- rbind(x1,x2,x3)
# calculate the Euclidean distance between them
d <- dist(x)
# calculate the INCA index associated to partitions with k=2, ..., k=5 clusters.
INCAnumclu(d, K=5)
out <- INCAnumclu(d, K=5)
plot(out)
#------- Example 1 cont. --------------------------------
# With hypothetical noise elements
noiseunits <- rep(FALSE, 60)
noiseunits[sample(1:60, 20)] <- TRUE
out <- INCAnumclu(d, K=5, L="custom", noise=noiseunits)
plot(out)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
> library(ICGE)
Loading required package: MASS
Loading required package: cluster
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/ICGE/INCAnumclu.Rd_%03d_medium.png", width=480, height=480)
> ### Name: INCAnumclu
> ### Title: Estimation of Number of Clusters in Data
> ### Aliases: INCAnumclu plot.incanc print.incanc
> ### Keywords: multivariate cluster
>
> ### ** Examples
>
> #------- Example 1 --------------------------------------
> #generate 3 clusters, each of them with 20 objects in dimension 5.
> mu1 <- sample(1:10, 5, replace=TRUE)
> x1 <- matrix(rnorm(20*5, mean = mu1, sd = 1),ncol=5, byrow=TRUE)
> mu2 <- sample(1:10, 5, replace=TRUE)
> x2 <- matrix(rnorm(20*5, mean = mu2, sd = 1),ncol=5, byrow=TRUE)
> mu3 <- sample(1:10, 5, replace=TRUE)
> x3 <- matrix(rnorm(20*5, mean = mu3, sd = 1),ncol=5, byrow=TRUE)
> x <- rbind(x1,x2,x3)
>
> # calculte euclidean distance between them
> d <- dist(x)
>
> # calculate the INCA index associated to partitions with k=2, ..., k=5 clusters.
> INCAnumclu(d, K=5)
---INCA index to estimate the number of clusters---
Clustering method: pam
k= 2 1
k= 3 1
k= 4 0.49
k= 5 0.1
> out <- INCAnumclu(d, K=5)
> plot(out)
>
> #------- Example 1 cont. --------------------------------
> # With hypothetical noise elements
> noiseunits <- rep(FALSE, 60)
> noiseunits[sample(1:60, 20)] <- TRUE
> out <- INCAnumclu(d, K=5, L="custom", noise=noiseunits)
> plot(out)
>
> dev.off()
null device
1
>