
KODAMA {KODAMA}                                        R Documentation

Knowledge Discovery by Accuracy Maximization

Description

KODAMA (KnOwledge Discovery by Accuracy MAximization) is an unsupervised and semisupervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, the peculiarity of KODAMA is that it is driven by an integrated procedure of cross validation of the results.

Usage

KODAMA(data,M=100,Tcycle=20,
       FUN_VAR=function(x){ceiling(ncol(x))},
       FUN_SAM=function(x){ceiling(nrow(x)*0.75)},
       bagging=FALSE,
       FUN=KNN.CV,
       f.par=list(kn=10),
       W=NULL,
       constrain=NULL,
       fix=rep(FALSE,nrow(data)),
       epsilon=0.05,
       shake=FALSE)
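
A call with only the data matrix uses all of the defaults above. For instance (a sketch, assuming a numeric matrix named data):

kk  <- KODAMA(data)
kk2 <- KODAMA(data, M = 50, f.par = list(kn = 5))  # 50 repetitions of steps I-III, 5 nearest neighbours for KNN.CV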

Arguments

data

a matrix.

M

number of iterative processes (steps I-III).

Tcycle

number of iterative cycles that lead to the maximization of the cross-validated accuracy.

FUN_VAR

function to select the number of variables to be randomly chosen. By default all the variables are taken.

FUN_SAM

function to select the number of samples to be randomly chosen. By default 75% of the samples are taken.

bagging

if bagging = TRUE, sampling is performed with replacement. By default bagging = FALSE.

FUN

classifier to be considered. Choices are "KNN.CV", "PLS.SVM.CV", and "PCA.CA.KNN.CV".

f.par

parameters of the classifier.

W

a vector of nrow(data) elements. The KODAMA procedure can be started from different initializations of the vector W. Without any a priori information, W can be initialized with each element different from the others (i.e., each sample placed in its own one-element class). Alternatively, W can be initialized by a clustering procedure, such as kmeans (see the sketch at the end of this argument list).

constrain

a vector of nrow(data) elements. Supervised constraints can be imposed by linking some samples in such a way that if one of them is changed, the linked ones must change in the same way (i.e., they are forced to belong to the same class) during the maximization of the cross-validation accuracy procedure. Samples with the same constrain identifier will be forced to stay together (see the sketch at the end of this argument list).

fix

a vector of nrow(data) elements. The values of this vector must be TRUE or FALSE. By default all elements are FALSE. Samples with fix = TRUE will not change the class label defined in W during the maximization of the cross-validation accuracy procedure (see the sketch at the end of this argument list).

epsilon

cut-off value for low proximities. High proximities are typical of intracluster relationships, whereas low proximities are expected for intercluster relationships. Very low proximities between samples are ignored by (default) setting epsilon = 0.05.

shake

if shake = FALSE, the cross-validated accuracy is computed with the classes defined in W before the maximization of the cross-validation accuracy procedure; if shake = TRUE, it is not.
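
The semisupervised arguments W, constrain, and fix can be combined. The following is a minimal sketch (not taken from the package examples), assuming a numeric matrix data and a numeric vector lab holding known class labels for its first 20 rows:

# Start every sample in its own one-element class, then overwrite the first 20
# entries of W with their known labels and freeze them with fix = TRUE.
W0 <- max(lab[1:20]) + seq_len(nrow(data))
W0[1:20] <- lab[1:20]
fx <- rep(FALSE, nrow(data))
fx[1:20] <- TRUE

# Link samples 21 and 22 so that they are forced into the same class during
# the maximization of the cross-validated accuracy.
con <- seq_len(nrow(data))
con[22] <- con[21]

kk <- KODAMA(data, W = W0, fix = fx, constrain = con, f.par = list(kn = 10))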

Details

KODAMA consists of five steps. For a simple description of the method, KODAMA can be divided into two parts: (i) the maximization of the cross-validated accuracy by an iterative process (steps I and II), resulting in the construction of a proximity matrix (step III), and (ii) the definition of a dissimilarity matrix (steps IV and V). The first part entails the core idea of KODAMA, that is, the partitioning of the data guided by the maximization of the cross-validated accuracy. At the beginning of this part, a fraction of the total samples (defined by FUN_SAM) is randomly selected from the original data. The whole iterative process (steps I-III) is repeated M times to average out the effects owing to the randomness of the iterative procedure. Each time this part is repeated, a different fraction of samples is selected. The second part aims at collecting and processing these results by constructing a dissimilarity matrix that provides a holistic view of the data while maintaining their intrinsic structure (steps IV and V). A schematic of this two-part structure is sketched below.
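
The sketch below illustrates only the outer structure of the procedure: M repetitions on random subsets of samples, accumulated into an averaged proximity matrix and then turned into a dissimilarity. The inner accuracy maximization of steps I and II is replaced here by a placeholder clustering (kmeans with an arbitrary number of centres); this is a schematic, not the package implementation.

proximity_sketch <- function(data, M = 100, frac = 0.75) {
  n    <- nrow(data)
  prox <- matrix(0, n, n)   # co-classification counts
  seen <- matrix(0, n, n)   # how often each pair of samples was drawn together
  for (m in seq_len(M)) {
    idx  <- sample(n, ceiling(n * frac))                  # random fraction of samples (cf. FUN_SAM)
    cl   <- kmeans(data[idx, , drop = FALSE], 3)$cluster  # placeholder for steps I-II
    same <- outer(cl, cl, "==")                           # step III: same-class indicator
    prox[idx, idx] <- prox[idx, idx] + same
    seen[idx, idx] <- seen[idx, idx] + 1
  }
  prox / pmax(seen, 1)                                    # averaged proximity matrix
}

# Part (ii), schematically: a simple dissimilarity from the averaged proximity.
# (The actual steps IV and V of KODAMA are more elaborate; see the reference.)
# diss <- 1 - proximity_sketch(data)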

Value

The function returns a list with 4 items:

dissimilarity

a dissimilarity matrix.

acc

a vector with the M cross-validated accuracies.

proximity

a proximity matrix.

v

a matrix containing all the classifications obtained by maximizing the cross-validation accuracy.
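
For example (a sketch, assuming kk holds the result of a KODAMA call):

mean(kk$acc)                        # average cross-validated accuracy over the M iterations
pp <- cmdscale(kk$dissimilarity)    # two-dimensional map of the dissimilarity matrix
plot(pp)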

Author(s)

Stefano Cacciatore and Leonardo Tenori

References

Cacciatore S, Luchinat C, Tenori L.
Knowledge discovery by accuracy maximization.
Proc Natl Acad Sci U S A 2014;111(14):5117-22.

See Also

cmdscale

Examples


# data(iris)
# kk=KODAMA(iris[,-5])
# pp = cmdscale(kk$dissimilarity)
# plot(pp,col=rep(2:4,each=50))
#
#
#
# WARNING: The next example is computationally intensive
#
# data(MetRef);
# u=MetRef$data;
# u=u[,-which(colSums(u)==0)]
# u=scaling(u)$newXtrain
# class=as.factor(unlist(MetRef$donor))
# kk=KODAMA(u,FUN=PCA.CA.KNN.CV, W=function(x) as.numeric(kmeans(x,50)$cluster))
# pp = cmdscale(kk$dissimilarity)
# plot(pp,col=class)
