Last data update: 2014.03.03

R: Optimal and Fast Univariate k-Means Clustering
Ckmeans.1d.dpR Documentation

Optimal and Fast Univariate k-Means Clustering

Description

Perform optimal and fast univariate k-means clustering by dynamic programming and divide-and-conquer.

Usage

Ckmeans.1d.dp(x, k=c(1,9), y=1)

Arguments

x

a numeric vector of data to be clustered. All NA elements must be removed from x before calling this function.

k

either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is c(1,9). When k is a range, the actual number of clusters is determined by Bayesian information criterion.

y

a value of 1 to specify equal weights or a numeric vector of unequal weights for each element. The default weight is one. It is highly recommended to use positive (instead of zero) weights to account for the influence of every element. The weights have a strong impact on the clustering result.

Details

In contrast to the heuristic k-means algorithms implemented in function kmeans, this function optimally assigns elements in numeric vector x into k clusters by dynamic programming (Wang and Song, 2011). It minimizes the total of within-cluster sums of squared distances (withinss) between each element and its corresponding cluster mean. When a range is provided for k, the exact number of clusters is determined by Bayesian information criterion. Different from the heuristic k-means algorithms whose results may be non-optimal or change from run to run, the result of Ckmeans.1d.dp is guaranteed to be optimal and reproducible, and its advantage in efficiency and accuracy over heuristic k-means methods become increasingly evident as k increases.

Value

An object of class "Ckmeans.1d.dp" with a print method. It is a list containing the following components:

cluster

a vector of clusters assigned to each element in x. Each cluster is indexed by an integer from 1 to k.

centers

a numeric vector of the (weighted) means for each cluster.

withinss

a numeric vector of the (weighted) within-cluster sum of squares for each cluster.

size

a vector of the (weighted) number of elements in each cluster.

totss

total sum of squared distances between each element and the sample mean. This statistic is not dependent on the clustering result.

tot.withinss

total sum of within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters.

betweenss

sum of squared distances between each cluster mean and sample mean weighed by cluster size. This statistic is maximized given the number of clusters.

Author(s)

Joe Song and Haizhou Wang

References

Wang, H. and Song, M. (2011) Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. The R Journal 3(2), 29–33. Retrieved from http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf

Examples

# Ex. 1 The number of clusters is provided.
# Generate data from a Gaussian mixture model of two components
x <- c(rnorm(50, sd=0.3), rnorm(50, mean=1, sd=0.3))
# Divide x into 2 clusters
k <- 2
result <- Ckmeans.1d.dp(x, k)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal k-means clustering given k",
     sub=paste("Number of clusters given:", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("bottom", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")

# Ex. 2 The number of clusters is determined by Bayesian information criterion
# Generate data from a Gaussian mixture model of two components
x <- c(rnorm(50, mean=-1, sd=0.3), rnorm(50, mean=1, sd=1))
# Divide x into k clusters, k automatically selected (default: 1~9)
result <- Ckmeans.1d.dp(x)
k <- max(result$cluster)
plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
     main="Optimal k-means clustering with k estimated",
     sub=paste("Number of clusters is estimated to be", k))
abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")

# Ex. 3 Segmenting a time course using weighted k-means
t <- seq(0, 2*pi*2, length=250)
n1 <- 1:125
n2 <- 126:250
y1 <- abs(sin(2*t[n1]) + 0.1*rnorm(length(n1)))
y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2)))
y <- c(y1, y2)

w <- y * 2 # exp((y - min(y)))

res <- Ckmeans.1d.dp(t, k=c(5), w)
plot(t, y, main = "Time course segmentation", col=res$cluster,
     pch=res$cluster, xlab="time", ylab="intensity", type="h")
abline(v=res$centers, col="chocolate", lty="dashed")
text(res$centers, max(y) * .95, cex=1.5, font=2,
     paste(round(res$size / sum(res$size) * 100, 2), "/100"))

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(Ckmeans.1d.dp)
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/Ckmeans.1d.dp/Ckmeans.1d.dp.Rd_%03d_medium.png", width=480, height=480)
> ### Name: Ckmeans.1d.dp
> ### Title: Optimal and Fast Univariate k-Means Clustering
> ### Aliases: Ckmeans.1d.dp
> 
> ### ** Examples
> 
> # Ex. 1 The number of clusters is provided.
> # Generate data from a Gaussian mixture model of two components
> x <- c(rnorm(50, sd=0.3), rnorm(50, mean=1, sd=0.3))
> # Divide x into 2 clusters
> k <- 2
> result <- Ckmeans.1d.dp(x, k)
> plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
+      main="Optimal k-means clustering given k",
+      sub=paste("Number of clusters given:", k))
> abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
> legend("bottom", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")
> 
> # Ex. 2 The number of clusters is determined by Bayesian information criterion
> # Generate data from a Gaussian mixture model of two components
> x <- c(rnorm(50, mean=-1, sd=0.3), rnorm(50, mean=1, sd=1))
> # Divide x into k clusters, k automatically selected (default: 1~9)
> result <- Ckmeans.1d.dp(x)
> k <- max(result$cluster)
> plot(x, col=result$cluster, pch=result$cluster, cex=1.5,
+      main="Optimal k-means clustering with k estimated",
+      sub=paste("Number of clusters is estimated to be", k))
> abline(h=result$centers, col=1:k, lty="dashed", lwd=2)
> legend("topleft", paste("Cluster", 1:k), col=1:k, pch=1:k, cex=1.5, bty="n")
> 
> # Ex. 3 Segmenting a time course using weighted k-means
> t <- seq(0, 2*pi*2, length=250)
> n1 <- 1:125
> n2 <- 126:250
> y1 <- abs(sin(2*t[n1]) + 0.1*rnorm(length(n1)))
> y2 <- abs(sin(0.5*t[n2]) + 0.1*rnorm(length(n2)))
> y <- c(y1, y2)
> 
> w <- y * 2 # exp((y - min(y)))
> 
> res <- Ckmeans.1d.dp(t, k=c(5), w)
> plot(t, y, main = "Time course segmentation", col=res$cluster,
+      pch=res$cluster, xlab="time", ylab="intensity", type="h")
> abline(v=res$centers, col="chocolate", lty="dashed")
> text(res$centers, max(y) * .95, cex=1.5, font=2,
+      paste(round(res$size / sum(res$size) * 100, 2), "/100"))
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>