Perform unsupervised clustering using various optimization procedures to find
the maximum likelihood estimate of the general Gaussian mixture copula
model by Tewari et al. (2011).
Usage
fit.full.GMCM(u, m, theta = choose.theta(u, m), method = c("NM", "SANN",
"L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE, ...)
Arguments
u
An n by d matrix of ranked and scaled test statistics. Rows correspond to
observations and columns to the dimensions of the variables.
m
The number of components to be fitted.
theta
A list of parameters as defined in rtheta. If
theta is not provided, then heuristic starting values are chosen
using the k-means algorithm.
method
A character vector of length 1. The optimization
method used. Should be either "NM", "SANN", "L-BFGS",
"L-BFGS-B", or "PEM" which are the Nelder-Mead, Simulated
Annealing, limited-memory quasi-Newton method, limited-memory quasi-Newton
method with box constraints, and the pseudo EM algorithm, respectively.
Default is "NM". See optim for further details.
max.ite
The maximum number of iterations. If the method is
"SANN", this is the total number of iterations, as there is no other
stopping criterion. (See optim)
verbose
Logical. If TRUE, a trace of the parameter estimates
is printed.
...
Arguments passed to the control list of
optim, or to PseudoEMAlgorithm if the method
is "PEM".
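For instance, the arguments above can be combined as in this minimal sketch (a hypothetical call; `u` is assumed to be an n by d matrix of ranked and scaled statistics, e.g. produced by Uhat):

```r
# Minimal sketch (hypothetical call; 'u' is assumed to be a ranked and
# scaled n-by-d matrix, e.g. produced by Uhat):
fit <- fit.full.GMCM(u = u, m = 3, method = "NM",
                     max.ite = 500, verbose = FALSE,
                     reltol = 1e-4)  # forwarded via '...' to optim's control list
```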
Details
The "L-BFGS-B" method does not perform a transformation of the
parameters and uses the box constraints as implemented in optim.
Note that many parameter configurations are poorly estimable or directly
unidentifiable.
Value
A list of parameters formatted as described in rtheta.
Note
All the optimization procedures depend strongly on the initial
values and, for "SANN", on the cooling scheme. It is therefore advisable
to fit the model from multiple different initial parameter sets and select the
best fit. choose.theta itself chooses a random initialization;
hence, the output can vary when theta is not supplied directly.
See optim for further details.
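A sketch of such a multi-start strategy (hypothetical; assumes `u` holds ranked and scaled data as above, and simply keeps all fits for comparison):

```r
# Fit from several random k-means-based starting values; since
# choose.theta() is random, each call yields a different initialization.
set.seed(1)
starts <- replicate(5, choose.theta(u, m = 3), simplify = FALSE)
fits <- lapply(starts, function(th)
  fit.full.GMCM(u = u, theta = th, method = "NM",
                max.ite = 1000, verbose = FALSE))
# The best fit can then be selected, e.g. by comparing the GMCM
# log-likelihoods of the fitted parameter lists.
```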
Author(s)
Anders Ellern Bilgrau <anders.ellern.bilgrau@gmail.com>
References
Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011).
Measuring reproducibility of high-throughput experiments. The Annals of
Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466
Tewari, A., Giering, M. J., & Raghunathan, A. (2011). Parametric
Characterization of Multimodal Distributions with Non-Gaussian Modes. 2011
IEEE 11th International Conference on Data Mining Workshops, 286-292.
doi:10.1109/ICDMW.2011.135
See Also
optim, get.prob
Examples
set.seed(17)
sim <- SimulateGMCMData(n = 1000, m = 3, d = 2)
# Plotting simulated data
par(mfrow = c(1,2))
plot(sim$z, col = rainbow(3)[sim$K], main = "Latent process")
plot(sim$u, col = rainbow(3)[sim$K], main = "GMCM process")
# Observed data
uhat <- Uhat(sim$u)
# The model should be fitted multiple times using different starting estimates
start.theta <- choose.theta(uhat, m = 3) # Random starting estimate
res <- fit.full.GMCM(u = uhat, theta = start.theta,
method = "NM", max.ite = 3000,
reltol = 1e-2, trace = TRUE) # Note, 1e-2 is too big
# Confusion matrix
Khat <- apply(get.prob(uhat, theta = res), 1, which.max)
table("Khat" = Khat, "K" = sim$K) # Note, some components have been swapped
# Simulation from GMCM with the fitted parameters
simfit <- SimulateGMCMData(n = 1000, theta = res)
# As seen, the underlying latent process is hard to estimate.
# The clustering, however, is very good.
par(mfrow = c(2,2))
plot(simfit$z, col = simfit$K, main = "Model check 1\nSimulated GMM")
plot(simfit$u, col = simfit$K, main = "Model check 2\nSimulated GMCM")
plot(sim$u, col = Khat, main = "MAP clustering")
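Since GMCM component labels are arbitrary, the swap seen in the confusion matrix can be undone by matching each fitted component to the simulated class it most often co-occurs with. A hypothetical post-processing sketch (not part of the package):

```r
# Map each fitted label to its most frequent simulated class.
# Note that two fitted components may map to the same class.
relabel <- apply(table(Khat, sim$K), 1, which.max)
Khat.aligned <- relabel[Khat]
table("Khat.aligned" = Khat.aligned, "K" = sim$K)
```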
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(GMCM)
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/GMCM/fit.full.GMCM.Rd_%03d_medium.png", width=480, height=480)
> ### Name: fit.full.GMCM
> ### Title: Unsupervised clustering using a general GMCM
> ### Aliases: fit.full.GMCM fit.full.gmcm
>
> ### ** Examples
>
> set.seed(17)
> sim <- SimulateGMCMData(n = 1000, m = 3, d = 2)
>
> # Plotting simulated data
> par(mfrow = c(1,2))
> plot(sim$z, col = rainbow(3)[sim$K], main = "Latent process")
> plot(sim$u, col = rainbow(3)[sim$K], main = "GMCM process")
>
> # Observed data
> uhat <- Uhat(sim$u)
>
> # The model should be fitted multiple times using different starting estimates
> start.theta <- choose.theta(uhat, m = 3) # Random starting estimate
> res <- fit.full.GMCM(u = uhat, theta = start.theta,
+ method = "NM", max.ite = 3000,
+ reltol = 1e-2, trace = TRUE) # Note, 1e-2 is too big
Nelder-Mead direct search function minimizer
function value for initial parameters = -532.119142
Scaled convergence tolerance is 5.32129
Stepsize computed as 0.908776
BUILD 15 592.493941 -579.264880
LO-REDUCTION 17 -191.688195 -579.264880
HI-REDUCTION 19 -199.665759 -579.264880
  ... (further LO-REDUCTION, HI-REDUCTION, REFLECTION, and EXTENSION steps) ...
LO-REDUCTION 189 -712.446052 -718.215761
LO-REDUCTION 191 -712.973631 -718.463450
Exiting from Nelder Mead minimizer
193 function evaluations used
>
> # Confusion matrix
> Khat <- apply(get.prob(uhat, theta = res), 1, which.max)
> table("Khat" = Khat, "K" = sim$K) # Note, some components have been swapped
K
Khat 1 2 3
1 0 403 0
2 0 0 293
3 97 207 0
>
> # Simulation from GMCM with the fitted parameters
> simfit <- SimulateGMCMData(n = 1000, theta = res)
>
> # As seen, the underlying latent process is hard to estimate.
> # The clustering, however, is very good.
> par(mfrow = c(2,2))
> plot(simfit$z, col = simfit$K, main = "Model check 1\nSimulated GMM")
> plot(simfit$u, col = simfit$K, main = "Model check 2\nSimulated GMCM")
> plot(sim$u, col = Khat, main = "MAP clustering")
>
>
>
>
>
> dev.off()
null device
1
>