
dirknn.tune {Directional}    R Documentation

k-NN algorithm using the arc cosine distance. Tuning the k neighbours

Description

It estimates the percentage of correct classification via m-fold cross-validation. The bias is estimated as well, using the algorithm suggested by Tibshirani and Tibshirani (2009), and is subtracted.
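
As a rough sketch of this correction (not the package's internal code), suppose acc is an M x K matrix of per-fold classification accuracies, one column per candidate number of neighbours; the names tt.correct, acc, best.k and fold.best are illustrative assumptions:

tt.correct <- function(acc) {
  best.k <- which.max( colMeans(acc) )        ## k chosen on the average accuracy curve
  fold.best <- apply(acc, 1, max)             ## best achievable accuracy in each fold
  bias <- mean( fold.best - acc[, best.k] )   ## estimated optimism from selecting k
  colMeans(acc)[best.k] - bias                ## the bias is subtracted
}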

Usage

dirknn.tune(z, M = 10, A = 5, ina, type = "S", mesos = TRUE, mat = NULL)

Arguments

z

The data: a numeric matrix whose rows are unit vectors.

M

The number of folds for the m-fold cross-validation, set to 10 by default.

A

The maximum number of nearest neighbours, set to 5 by default. The single nearest neighbour (k = 1) is not used.

ina

A variable indicating the groups of the data z.

type

If type is "S", the standard k-NN algorithm is used; if it is "NS", the non-standard one is used. See the Details section for more information.

mesos

A logical variable used only in the case of the non-standard algorithm (type = "NS"). Should the average of the distances be calculated (TRUE) or not (FALSE)? If FALSE, the harmonic mean of the distances is calculated.

mat

You can specify your own folds by supplying mat, a matrix in which each column is one fold and contains the indices of the observations in that fold. If you leave it NULL, the function creates the folds itself. A possible construction is sketched below.
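
For instance, one hypothetical way to build such a matrix for n observations and M folds (assuming n is divisible by M; the names here are illustrative):

n <- 200  ;  M <- 10
mat <- matrix( sample(1:n), ncol = M )  ## shuffled indices, one fold per column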

Details

The standard algorithm keeps the k nearest observations and looks at their group labels: the new observation is allocated to the group seen most frequently among them. The non-standard algorithm calculates the classical (arithmetic) mean or the harmonic mean of the distances of the k nearest observations within each group: the new observation is allocated to the group with the smallest mean distance. The estimated bias is calculated as suggested by Tibshirani and Tibshirani (2009).
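
A minimal sketch of the two allocation rules for a single new observation, assuming d holds its arc cosine distances to the training data, ina holds numeric training group labels, and knn.allocate is an illustrative name rather than the package's code:

knn.allocate <- function(d, ina, k, type = "S", mesos = TRUE) {
  if (type == "S") {
    nei <- ina[ order(d)[1:k] ]  ## groups of the k nearest observations
    as.numeric( names( which.max( table(nei) ) ) )  ## most frequent group wins
  } else {
    means <- tapply(d, ina, function(di) {
      di <- sort(di)[ 1:min(k, length(di)) ]  ## the k nearest within each group
      if (mesos) mean(di) else 1 / mean(1 / di)  ## arithmetic or harmonic mean
    })
    as.numeric( names( which.min(means) ) )  ## smallest mean distance wins
  }
}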

We have made a memory-efficient implementation (though not a maximally efficient one). Even if you have hundreds of thousands of observations, the computer will not crash; the computation will simply take longer. Instead of calculating the full distance matrix once at the beginning, we calculate, within each fold, only the distances of the out-of-sample observations from the rest. If the distance matrix were calculated once at the beginning, it could have dimensions of thousands by thousands and would not fit into memory. If you have a few hundred observations, the runtime is about the same (maybe a bit less, maybe a bit more) as calculating the distance matrix in the first place.
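
A minimal sketch of this fold-wise strategy, assuming x is an n x p matrix of unit vectors and test holds the indices of the current fold's out-of-sample observations (fold.dist is an illustrative name):

fold.dist <- function(x, test) {
  ## arc cosine distances of the out-of-sample observations from the rest:
  ## a |test| x (n - |test|) matrix instead of the full n x n matrix
  cosa <- tcrossprod( x[test, , drop = FALSE], x[-test, , drop = FALSE] )
  acos( pmin( pmax(cosa, -1), 1 ) )  ## clamp to [-1, 1] before acos
}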

Value

A list including:

per

The average percentage of correct classification for each number of nearest neighbours.

percent

The bias-corrected percentage of correct classification.

runtime

The run time of the algorithm, as a numeric vector: the first element is the user time, the second is the system time and the third is the elapsed time.

Author(s)

Michail Tsagris

R implementation and documentation: Michail Tsagris <mtsagris@yahoo.gr> and Giorgos Athineou <athineou@csd.uoc.gr>

References

Tibshirani, Ryan J. and Robert Tibshirani (2009). A bias correction for the minimum error rate in cross-validation. The Annals of Applied Statistics, 3(2): 822-829.

See Also

dirknn, vmf.da, mix.vmf

Examples

k <- runif(4, 4, 20)  ## concentration parameters of 4 vMF components
prob <- c(0.2, 0.4, 0.3, 0.1)  ## mixing probabilities
mu <- matrix(rnorm(16), ncol = 4)
mu <- mu / sqrt( rowSums(mu^2) )  ## project the mean directions onto the unit sphere
da <- rmixvmf(200, prob, mu, k)  ## simulate 200 observations from the vMF mixture
x <- da$x  ## the simulated unit vectors
ina <- da$id  ## the component (group) of each observation
dirknn.tune(x, M = 5, A = 10, ina, type = "S", mesos = TRUE)
dirknn.tune(x, M = 10, A = 5, ina, type = "S", mesos = TRUE)

Results


> library(Directional)
> k <- runif(4, 4, 20)
> prob <- c(0.2, 0.4, 0.3, 0.1)
> mu <- matrix(rnorm(16), ncol = 4)
> mu <- mu / sqrt( rowSums(mu^2) )
> da <- rmixvmf(200, prob, mu, k)
> x <- da$x
> ina <- da$id
> dirknn.tune(x, M = 5, A = 10, ina, type = "S", mesos = TRUE)
$per
  k=2   k=3   k=4   k=5   k=6   k=7   k=8   k=9  k=10 
0.855 0.890 0.890 0.890 0.895 0.915 0.910 0.915 0.905 

$percent
Bias corrected estimated percentage 
                              0.905 

$runtime
   user  system elapsed 
  0.212   0.000   0.211 

> dirknn.tune(x, M = 10, A = 5, ina, type = "S", mesos = TRUE)
$per
  k=2   k=3   k=4   k=5 
0.850 0.875 0.900 0.900 

$percent
Bias corrected estimated percentage 
                               0.89 

$runtime
   user  system elapsed 
  0.108   0.000   0.107 
