Last data update: 2014.03.03

R: Copula-Based Clustering Algorithm

CoClust

R Documentation

Copula-Based Clustering Algorithm

Description

Cluster analysis based on copula functions

Usage

CoClust(m, dimset = 2:5, noc = 4, copula = "frank", fun = median,
  method.ma = c("empirical", "pseudo"), method.c = c("ml", "mpl", "irho", "itau"),
   dfree = NULL, writeout = 5, penalty = c("BICk", "AICk", "LL"), ...)

Arguments

`m`	a data matrix.
`dimset`	the set of dimensions for which the function tries the clustering.
`noc`	sample size of the set for selecting the number of clusters.
`copula`	a copula model. This should be one of "normal", "t", "frank", "clayton" and "gumbel". See the Details section.
`fun`	combination function of the pairwise Spearman's rho used to select the k-plets. The default is `median`.
`method.ma`	estimation method for margins. See the Details section.
`method.c`	estimation method for copula. See `fitCopula`.
`dfree`	degrees of freedom for the t copula.
`writeout`	writes a message on the number of allocated observations every writeout observations.
`penalty`	the likelihood criterion used for selecting the number of clusters: one of "BICk", "AICk" and "LL". See the Details section.
`...`	further parameters for `fitCopula`.

Details

Usage for Frank copula:

CoClust(m, dimset = 2:5, noc = 4, copula = "frank",
 fun = median, method.ma = c("empirical", "pseudo"), method.c = "mpl",
 penalty = "BICk", ...)

CoClust is a clustering algorithm based on copula functions. It groups observations according to the multivariate dependence structure of the generating process, without any assumptions on the margins.

For each k in dimset, the algorithm builds a sample of noc observations (rows of the data matrix m) by using the matrix of Spearman's rho correlation coefficients, which are combined by means of the function fun (median by default). The number of clusters K is selected by means of a criterion based on the likelihood of the copula fit. The argument penalty selects one of three criteria ("BICk", "AICk" or "LL"); the choice "LL" corresponds to using the likelihood without penalty terms. The remaining observations are then allocated to the clusters as follows:

1. select a K-plet of observations on the basis of fun applied to the pairwise Spearman's rho;
2. allocate or discard the K-plet on the basis of the likelihood of the copula fit.
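The Spearman step described above can be sketched in base R. This is a simplified illustration, not the package's internal code: the toy data, the helper name `kplet_score`, the exhaustive search over candidates, and the rule of keeping the candidate with the largest combined rho are all assumptions made for the example.

```r
set.seed(1)

# Toy data (an assumption for illustration): 6 rows x 50 columns.
# Rows 1-3 share a latent signal z through different monotone transforms,
# rows 4-6 are independent noise.
z <- rnorm(50)
m <- rbind(z + rnorm(50, sd = 0.3),
           exp(z + rnorm(50, sd = 0.3)),
           pnorm(z + rnorm(50, sd = 0.3)),
           matrix(rnorm(150), nrow = 3))

# Pairwise Spearman's rho between rows (cor() works on columns, hence t()).
rho <- cor(t(m), method = "spearman")

# Hypothetical helper: combine the pairwise rho values of a candidate k-plet
# (a vector of row indexes) with a function such as median.
kplet_score <- function(rows, rho, fun = median) {
  pairs <- combn(rows, 2)
  fun(rho[cbind(pairs[1, ], pairs[2, ])])
}

# Score every candidate 3-plet and keep the one with the largest score.
candidates <- combn(nrow(m), 3)
scores     <- apply(candidates, 2, kplet_score, rho = rho)
best       <- candidates[, which.max(scores)]
best
```

Because Spearman's rho is rank-based, the three dependent rows are picked up even though their margins differ, which is the property the algorithm relies on.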

The estimation approach for the copula fit is semiparametric: the user can choose among nonparametric estimators of the margins and parametric copula models. The CoClust algorithm requires neither the number of clusters to be set a priori nor a starting classification.
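To make the nonparametric treatment of the margins concrete, here is a minimal base R sketch of pseudo-observations built from rescaled ranks. The helper name `pseudo_obs` is introduced here for illustration, and whether the "empirical" option of CoClust uses exactly this rescaling is an assumption; the copula package provides `pobs` for the same purpose.

```r
# Rescaled ranks map a margin to the open interval (0, 1)
# without assuming any distribution for it.
pseudo_obs <- function(x) rank(x) / (length(x) + 1)

set.seed(2)
x <- rgamma(20, shape = 3, rate = 4)
u <- pseudo_obs(x)

# Ranks are unchanged by strictly increasing transformations,
# so the shape of the margin drops out of the copula fit.
all.equal(u, pseudo_obs(log(x)))
range(u)
```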

Note that the dependence structure for the Gaussian and the t copula is set to exchangeable. Unstructured dependence structures will be allowed in a future version.

Value

An object of S4 class "CoClust" with the following slots:

Number.of.Clusters

the number K of identified clusters.

Index.Matrix

an n.obs by (K+1) matrix, where n.obs is the number of observations put in each cluster. The first K columns contain the row indexes of the observations of the data matrix m; the last column contains the log-likelihood of the copula fit.

Data.Clusters

the matrix of the final clustering.

Dependence

a list containing:

`Copula`	the copula model used for the clustering.
`Param`	the estimated dependence parameter between clusters.
`Std.Err`	the standard error of Param.
`P.value`	the p-value associated with the null hypothesis `H_0: theta=0`.

LogLik

the maximized log-likelihood of the copula fit.

Est.Method

the estimation method used for the copula fit.

Opt.Method

the optimization method used for the copula fit.

LLC

the value of the LogLikelihood Criterion for each k in dimset.

Index.dimset

a list that, for each k in dimset, contains the index matrix of the initial set of nk observations used for selecting the number of clusters, together with the associated log-likelihood.
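Two of the returned quantities can be read back against the console output under Results on this page. The sketch below types in the printed LLC, Param and Std.Err values by hand, so no CoClust object is needed. It checks that the selected K = 3 is the k with the smallest LLC value, and that a Wald-type z statistic is consistent with the reported p-value of 0. Both the "smallest LLC wins" rule and the Wald form of the test are assumptions inferred from this single run, not statements about the package internals.

```r
# Values copied by hand from the console output on this page.
llc   <- c("2" = -63.58120, "3" = -114.48603, "4" = -40.71821)
param <- 10.30767
se    <- 0.7387905

# Assumption: the selected number of clusters is the k with the smallest LLC.
k_selected <- as.integer(names(llc)[which.min(llc)])

# Assumption: a two-sided Wald-type test of H_0: theta = 0.
z_stat  <- param / se
p_value <- 2 * pnorm(-abs(z_stat))

k_selected
z_stat
p_value
```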

Note

The final clustering is composed of K groups: observations in the same group are independent, whereas observations that belong to different groups and form a K-plet are dependent.

Author(s)

Francesca Marta Lilja Di Lascio <marta.dilascio@unibz.it>,

Simone Giannerini <simone.giannerini@unibo.it>

References

Di Lascio, F.M.L. and Giannerini, S. (2014) "An Improved Copula-Based Clustering Algorithm", Working Paper.

Di Lascio, F.M.L. and Giannerini, S. (2012) "A Copula-Based Algorithm for Discovering Patterns of Dependent Observations", Journal of Classification, Volume 29, Number 1, 50-75.

Di Lascio, F.M.L. (2008) "Analyzing the dependence structure of microarray data: a copula-based approach", PhD thesis, Dipartimento di Scienze Statistiche, Università di Bologna, Italy.

Examples

## ******************************************************************
## 1. builds a 3-variate copula with different margins
##    (Gaussian, Gamma, Beta)
##
## 2. generates a data matrix xm with 15 rows and 21 columns and
##    builds the matrix of the true cluster indexes
##
## 3. applies the CoClust to the rows of xm and recovers the
##    multivariate dependence structure of the data
## ******************************************************************

## Step 1. **********************************************************
library(CoClust)          # also attaches the copula package
n      <- 105             # total number of observations
n.col  <- 21              # number of columns of the data matrix m
n.marg <- 3               # dimension of the copula
n.row  <- n*n.marg/n.col  # number of rows of the data matrix m

theta  <- 10
copula <- frankCopula(theta, dim = n.marg)
mymvdc <- mvdc(copula, c("norm", "gamma", "beta"),list(list(mean=7, sd=2),
                list(shape=3, rate=4), list(shape1=2, shape2=1)))

## Step 2. **********************************************************
set.seed(11)
x.samp <- rMvdc(n, mymvdc)
xm     <- matrix(x.samp, nrow = n.row, ncol = n.col, byrow=TRUE)

index.true <-  matrix(1:15,5,3)
colnames(index.true) <- c("Cluster 1","Cluster 2", "Cluster 3")

## Step 3. **********************************************************

clust <- CoClust(xm, dimset = 2:4, noc=2, copula="frank",
                 method.ma="empirical", method.c="ml",writeout=1)
clust
clust@"Number.of.Clusters"
clust@"Dependence"$Param
clust@"Data.Clusters"
index.clust <- clust@"Index.Matrix"

## compare with index.true
index.clust
index.true
##
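Cluster labels are arbitrary, so index.clust and index.true should be compared up to a permutation of the columns. The following is a minimal sketch of such a comparison, with the two matrices typed in from the output shown under Results so that it runs without the package. The name `match_true` is introduced here for illustration.

```r
# Recovered index matrix, copied from the printed output (LogLik column dropped).
index.clust <- cbind("Cluster 1" = c(11, 13, 12, 14, 15),
                     "Cluster 2" = c(1, 3, 2, 4, 5),
                     "Cluster 3" = c(6, 8, 7, 9, 10))
# True index matrix, built as in Step 2 of the example.
index.true <- matrix(1:15, 5, 3)

# For each recovered cluster, find the true cluster holding the same rows
# (NA if no true cluster matches exactly).
match_true <- apply(index.clust, 2, function(found) {
  hit <- which(apply(index.true, 2, function(truth) setequal(found, truth)))
  if (length(hit) == 1) hit else NA_integer_
})
match_true

# The clustering is recovered exactly when every recovered cluster
# maps to a distinct true cluster.
all(!is.na(match_true)) && !anyDuplicated(match_true)
```

In this run every recovered column matches one true column, so the differences between the two printed matrices are only a relabelling of the clusters and a reordering of the rows.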

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)

> library(CoClust)
Loading required package: copula
> ## ******************************************************************
> ## 1. builds a 3-variate copula with different margins
> ##    (Gaussian, Gamma, Beta)
> ##
> ## 2. generates a data matrix xm with 15 rows and 21 columns and
> ##    builds the matrix of the true cluster indexes
> ##
> ## 3. applies the CoClust to the rows of xm and recovers the
> ##    multivariate dependence structure of the data
> ## ******************************************************************
> 
> ## Step 1. **********************************************************
> n      <- 105             # total number of observations
> n.col  <- 21              # number of columns of the data matrix m
> n.marg <- 3               # dimension of the copula
> n.row  <- n*n.marg/n.col  # number of rows of the data matrix m
> 
> theta  <- 10
> copula <- frankCopula(theta, dim = n.marg)
> mymvdc <- mvdc(copula, c("norm", "gamma", "beta"),list(list(mean=7, sd=2),
+                 list(shape=3, rate=4), list(shape1=2, shape2=1)))
> 
> ## Step 2. **********************************************************
> set.seed(11)
> x.samp <- rMvdc(n, mymvdc)
> xm     <- matrix(x.samp, nrow = n.row, ncol = n.col, byrow=TRUE)
> 
> index.true <-  matrix(1:15,5,3)
> colnames(index.true) <- c("Cluster 1","Cluster 2", "Cluster 3")
> 
> ## Step 3. **********************************************************
> 
> clust <- CoClust(xm, dimset = 2:4, noc=2, copula="frank",
+                  method.ma="empirical", method.c="ml",writeout=1)
  Number of clusters selected:  3 
  Allocated observations:  3 
  Allocated observations:  4 
  Allocated observations:  5 
> clust
An object of class "CoClust"
Slot "Number.of.Clusters":
[1] 3

Slot "Index.Matrix":
     Cluster 1 Cluster 2 Cluster 3    LogLik
[1,]        11         1         6  29.59298
[2,]        13         3         8  59.11185
[3,]        12         2         7  87.55890
[4,]        14         4         9 118.68674
[5,]        15         5        10 148.43749

Slot "Data.Clusters":
       Cluster 1 Cluster 2 Cluster 3
  [1,] 0.3821719  4.724571 0.1722283
  [2,] 0.3911910  5.738707 0.1548273
  [3,] 0.9923630 11.130341 1.3684076
  [4,] 0.6133405  6.711067 0.5572253
  [5,] 0.1489788  2.763910 0.3727072
  [6,] 0.8220294  9.946755 0.7025062
  [7,] 0.7748810  8.849824 0.7472348
  [8,] 0.7060588  6.977426 0.6037493
  [9,] 0.7420107  6.979874 1.2564989
 [10,] 0.8868274  8.121758 0.8366584
 [11,] 0.8408346  9.083846 0.8449545
 [12,] 0.9265845  7.497380 0.5735631
 [13,] 0.7366919  9.083588 0.8049378
 [14,] 0.1854497  4.737831 0.1091059
 [15,] 0.7962249  7.314255 0.5804155
 [16,] 0.5226436  6.609135 0.4938606
 [17,] 0.3812399  4.108089 0.3293736
 [18,] 0.6755020  7.028995 0.5640552
 [19,] 0.3773079  4.926126 0.4780587
 [20,] 0.2413054  4.050170 0.2756642
 [21,] 0.5509937  5.554924 0.5825189
 [22,] 0.5547837  6.950117 0.5624433
 [23,] 0.8341096  7.723815 1.2358954
 [24,] 0.1245806  5.237230 0.2698407
 [25,] 0.8439165  8.633602 1.0305195
 [26,] 0.8290546  8.875230 0.7430665
 [27,] 0.2653793  3.818821 0.3310576
 [28,] 0.1377339  3.847222 0.1881491
 [29,] 0.4238974  5.987543 0.4669754
 [30,] 0.7442955  7.278040 0.6419422
 [31,] 0.7690821  6.851152 0.5348509
 [32,] 0.5766055  5.569177 0.5740944
 [33,] 0.9077906  7.246912 1.6961565
 [34,] 0.5948870  5.276076 0.3273010
 [35,] 0.3188433  2.466374 0.2048978
 [36,] 0.8563399  8.215923 1.0216487
 [37,] 0.4859458  5.125675 0.4037320
 [38,] 0.5840352  5.207848 0.4522179
 [39,] 0.4421058  5.275799 0.4154543
 [40,] 0.7105057  6.313148 0.5755700
 [41,] 0.0623775  3.980188 0.1791108
 [42,] 0.4427724  4.347047 0.1617050
 [43,] 0.8102775  7.328941 0.8441203
 [44,] 0.9520337  9.014100 1.0392218
 [45,] 0.2989031  4.227565 0.2305394
 [46,] 0.9201125  8.991290 1.3522166
 [47,] 0.7760083  7.578709 0.8481472
 [48,] 0.5374009  6.116144 0.5733275
 [49,] 0.7421856  7.733539 0.7321689
 [50,] 0.8619819  8.923534 1.2041696
 [51,] 0.7675674  7.438398 1.0839021
 [52,] 0.9134812 12.299546 1.2930689
 [53,] 0.7558059  7.901530 0.9108837
 [54,] 0.3538221  4.070995 0.3053536
 [55,] 0.7324060  9.486247 0.7039008
 [56,] 0.8761552  9.117614 0.7884164
 [57,] 0.7343466  6.498987 0.6840582
 [58,] 0.3253134  6.100498 0.5853185
 [59,] 0.5710626  6.738952 0.3936210
 [60,] 0.5760979  6.354835 0.6985677
 [61,] 0.6924759  4.877017 0.5914819
 [62,] 0.8750170  8.938650 0.6987183
 [63,] 0.4963767  5.154385 0.5267843
 [64,] 0.9845955 10.830024 1.6332640
 [65,] 0.9130914  6.378303 0.7150011
 [66,] 0.6379774  7.991038 0.5805825
 [67,] 0.8735105  7.361490 0.7832649
 [68,] 0.6784702  7.413119 0.7021784
 [69,] 0.9622892  8.632043 1.1318317
 [70,] 0.9638345  9.265671 1.4903069
 [71,] 0.6172756  6.299267 0.8306458
 [72,] 0.8349335  8.792616 0.9115916
 [73,] 0.7755230  7.888000 0.7354790
 [74,] 0.6527278  6.332678 0.5341079
 [75,] 0.4630175  3.717388 0.2649272
 [76,] 0.8449912  8.392278 2.3192944
 [77,] 0.8263662  8.620581 1.3452385
 [78,] 0.8194348  7.847350 1.1544766
 [79,] 0.4565947  5.809143 0.5332319
 [80,] 0.8729451  7.225545 0.5724901
 [81,] 0.5329342  6.176113 0.4141617
 [82,] 0.4477027  5.114751 0.4234471
 [83,] 0.9191833  9.377256 1.5187152
 [84,] 0.2842830  3.502181 0.4284995
 [85,] 0.4114047  6.398566 0.1623977
 [86,] 0.9633245  9.471324 1.3984225
 [87,] 0.9856805  8.968840 1.3766028
 [88,] 0.6509489  7.088675 1.1073758
 [89,] 0.9299693  9.482104 1.1620751
 [90,] 0.6456944  6.448404 0.5497347
 [91,] 0.1856461  1.937598 0.2026946
 [92,] 0.9422877  6.825404 0.7958724
 [93,] 0.5364393  5.751602 0.2944656
 [94,] 0.6440514  6.226674 0.6087897
 [95,] 0.8683983  8.995176 0.9055552
 [96,] 0.6618030  5.422601 0.4316463
 [97,] 0.8657631  9.153244 0.8223764
 [98,] 0.8635867  8.491585 1.0403472
 [99,] 0.5914044  5.790188 0.4643897
[100,] 0.9022617  8.710650 0.9941284
[101,] 0.6248820  6.794992 0.4528421
[102,] 0.3686887  3.487050 0.2717204
[103,] 0.8063894 10.786327 1.3713073
[104,] 0.4984841  5.251239 0.2757156
[105,] 0.9103522 10.797712 1.0076279

Slot "Dependence":
$Copula
[1] "frank"

$Param
[1] 10.30767

$Std.Err
[1] 0.7387905

$P.value
[1] 0


Slot "LogLik":
[1] 148.4375

Slot "Est.Method":
[1] "maximum likelihood"

Slot "Opt.Method":
[1] "ml"

Slot "LLC":
         2          3          4 
 -63.58120 -114.48603  -40.71821 

Slot "Index.dimset":
$`2`
      1 2   LogLik
[1,] 11 1 18.59320
[2,]  8 3 33.65943

$`3`
      1 2 3   LogLik
[1,] 11 1 6 29.59298
[2,] 13 3 8 59.11185

$`4`
      1 2  3  4    LogLik
[1,] 11 1  6 12  3.370454
[2,]  7 3 13  8 22.227938


> clust@"Number.of.Clusters"
[1] 3
> clust@"Dependence"$Param
[1] 10.30767
> clust@"Data.Clusters"
## (105 rows omitted: identical to Slot "Data.Clusters" printed above)
> index.clust <- clust@"Index.Matrix"
> 
> ## compare with index.true
> index.clust
     Cluster 1 Cluster 2 Cluster 3    LogLik
[1,]        11         1         6  29.59298
[2,]        13         3         8  59.11185
[3,]        12         2         7  87.55890
[4,]        14         4         9 118.68674
[5,]        15         5        10 148.43749
> index.true
     Cluster 1 Cluster 2 Cluster 3
[1,]         1         6        11
[2,]         2         7        12
[3,]         3         8        13
[4,]         4         9        14
[5,]         5        10        15
> ##