Given a set of observations, yai
1) separates the observations into reference and target observations,
2) applies the specified method to project X-variables into a Euclidean
space (not always, see argument method), and
3) finds the k-nearest neighbors within the reference observations
and between the reference and target observations.
An alternative method using randomForest
classification and regression trees is provided for steps 2 and 3.
Target observations are those with values for X-variables and
not for Y-variables, while reference observations are those
with no missing values for X- and Y-variables (see Details for the
exception).
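For orientation, a minimal sketch of a call (the data frames xdat and
ydat are hypothetical; ydat holds rows for the reference observations
only, so rows of xdat absent from ydat are treated as targets):
nn <- yai(x = xdat, y = ydat, method = "msn", k = 1)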
Arguments
x
1) a matrix or data frame containing the X-variables for all
observations, with row names serving as the identification for the
observations, or 2) a one-sided formula defining the X-variables as a
linear formula (see the sketch following the argument descriptions). If
a formula is coded for x, one must be used for y as well, if
needed.
y
1) a matrix or data frame containing the Y-variables for the
reference observations, or 2) a one-sided formula defining the
Y-variables as a linear formula.
data
when x and y are formulas, then data is a data frame or
matrix that contains all the variables, with row names serving as the
identification for the observations.
The observations are split by yai into two sets.
k
the number of nearest neighbors; default is 1.
noTrgs
when TRUE, skip finding neighbors for target observations.
noRefs
when TRUE, skip finding neighbors for reference observations.
nVec
the number of canonical vectors to use (methods msn and msn2),
or the number of independent X-variables in the reference data when method
is mahalanobis. When NULL, the number is set by the function.
pVal
significance level for canonical vectors, used when method is
msn or msn2.
method
the strategy used for computing distance, and therefore for finding
neighbors; the options are quoted keywords (see Details):
euclidean - distance is computed in a normalized X space.
raw - like euclidean, except no normalization is done.
mahalanobis - distance is computed in its namesake space.
ica - like mahalanobis, but based on Independent Component Analysis using
package fastICA.
msn - distance is computed in a projected canonical space.
msn2 - like msn, but with variance weighting (canonical regression
rather than correlation).
msnPP - like msn, except that the canonical correlation is computed using
projection pursuit from ccaPP (see argument ppControl).
gnn - distance is computed using a projected ordination of
Xs found using canonical correspondence analysis
(cca from package vegan). If cca fails, rda is used
and a warning is issued.
randomForest - distance is one minus the proportion of
randomForest trees where a target observation is in the same terminal node
as a reference observation (see randomForest).
random - like raw, except that the X space is a single vector of uniform
random [0,1] numbers generated using runif; this results in random
assignment of neighbors and forces ann to be FALSE.
ann
TRUE if ann is used to find neighbors, FALSE if a slow search is used.
mtry
the number of X-variables picked at random when method is randomForest
(see randomForest); the default is sqrt(number of X-variables).
ntree
the number of classification and regression trees when method is randomForest.
When more than one Y-variable is used, the trees are divided among the variables.
Alternatively, ntree can be a vector of values corresponding to each Y-variable.
rfMode
when set to buildClasses and method is randomForest, continuous variables
are internally converted to classes, forcing randomForest to build
classification trees for the variable. Otherwise, regression trees are
built if your version of randomForest is newer than 4.5-18.
bootstrap
if TRUE, the reference observations are sampled with replacement.
ppControl
used to control how canonical correlation analysis via
projection pursuit is done; see Details.
sampleVars
the X- and/or Y-variables will be sampled (without replacement)
if this is not NULL and greater than zero. If specified as a single unnamed
value, that value controls the sample size of both X- and Y-variables. If
two unnamed values are given, the first is taken for X-variables and the
second for Y-variables. If zero, no sampling is done. Otherwise, values
less than 1.0 are taken as the proportion of the number of variables, and
values greater than or equal to 1 are the number of variables to be
included in the sample. Specifying a large number will cause the sequence
of variables to be randomized.
rfXsubsets
a named list of character vectors, with one vector for each
Y-variable (see Details); only applies when method="randomForest".
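A brief sketch of the formula interface described for arguments x, y,
and data (the data frame dat and the column names elev, slope, ba, and
ht are hypothetical):
# X-variables elev and slope; Y-variables ba and ht; row names of dat
# identify the observations.
nn <- yai(x = ~ elev + slope, y = ~ ba + ht, data = dat)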
Details
The following information is in addition to the content in the papers.
You need not have any Y-variables to run yai for the following methods:
euclidean, raw, mahalanobis, ica, random, and
randomForest (in which case unsupervised classification is
performed). However, yai normally classifies reference
observations as those with no missing values for X- and Y-variables, and
target observations as those with values for X-variables and
missing data for Y-variables. When y is NULL (there are no Y-variables),
all the observations are considered references. See
newtargets for an example of how to use yai in this
situation.
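A minimal sketch of an unsupervised run (xdat is a hypothetical data
frame of X-variables); because no y is given, all observations are
treated as references:
nn <- yai(x = xdat, method = "mahalanobis")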
When bootstrap=TRUE, the reference observations are sampled with replacement. The
sample size is set to the number of reference observations. Normally, about a third of
the reference observations are left out of the sample; they are often called out-of-bag
samples. The out-of-bag observations are then treated as targets.
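A minimal sketch of a bootstrapped run (xdat and ydat are hypothetical;
the out-of-bag references become the targets):
nn <- yai(x = xdat, y = ydat, method = "msn", bootstrap = TRUE)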
When method="msnPP" projection pursuit from ccaPP is used. The method is
further controlled using argument ppControl to specify a character vector that has
has two named components.
method One of the following
"spearman", "kendall", "quadrant", "M", "pearson", default is "spearman"
search If "data" or "proj", then ccaProj
is used, otherwise the default ccaGrid is used.
Here are some details on argument rfXsubsets. When method="randomForest",
one call to randomForest is generated for each Y-variable. When
argument rfXsubsets is left NULL, all the X-variables are used for each of
the Y-variables. However, sometimes better results can be achieved by using specific subsets
of X-variables for each Y-variable. This is done by setting rfXsubsets equal
to a named list of character vectors. The names correspond to the Y-variable names and the
character vectors hold the list of X-variables for the corresponding Y-variable.
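For example, a hedged sketch (the Y-variable names ba and ht and the
X-variable names elev, slope, and precip are hypothetical; they must
match the column names of ydat and xdat):
sub <- list(ba = c("elev", "slope"), ht = c("slope", "precip"))
rf <- yai(x = xdat, y = ydat, method = "randomForest", rfXsubsets = sub)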
Value
An object of class yai, which is a list with
the following tags:
call
the call.
yRefs, xRefs
matrices of the X- and Y-variables for just
the reference observations (unscaled). The scale factors
are attached as attributes.
obsDropped
a list of the row names for observations
dropped for various reasons (missing data).
trgRows
a list of the row names for target observations
as a subset of all observations.
xall
the X-variables for all observations.
cancor
returned from cancor function when method msn or
msn2 is used (NULL otherwise).
ccaVegan
an object of class cca (from package vegan) when
method gnn is used.
ftest
a list containing partial F statistics and a vector of
Pr>F (pgf) corresponding to the canonical correlation coefficients
when method msn or msn2 is used (NULL otherwise).
yScale, xScale
scale data used on yRefs and xRefs as needed.
k
the value of k.
pVal
as input; only used when method msn, msn2 or msnPP is used.
projector
NULL when not used. For methods msn, msn2, msnPP, gnn,
and mahalanobis, this is a matrix that projects normalized X-variables
into a space suitable for computing Euclidean distances.
nVec
number of canonical vectors used (methods msn and msn2),
or number of independent X-variables in the reference data when method
mahalanobis is used.
method
as input, the method used.
ranForest
a list of the forests if method randomForest is used. There is
one forest for each Y-variable, or just one forest when there are no
Y-variables.
ICA
a list of information from fastICA
when method ica is used.
ann
the value of ann, TRUE when ann is used, FALSE otherwise.
xlevels
NULL if no factors are used as predictors; otherwise a list
of predictors that have factors and their levels (see lm).
neiDstTrgs
a matrix of distances between a target
(identified by its row name) and the k references. There are k columns.
neiIdsTrgs
a matrix of reference identifications
that correspond to neiDstTrgs.
neiDstRefs, neiIdsRefs
counterparts for references.
bootstrap
a vector of reference rownames that constitute the bootstrap sample;
or the value FALSE when bootstrap is not used.
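A brief sketch of pulling the neighbor results from a fitted object
(nn is any object returned by yai):
head(nn$neiIdsTrgs) # reference ids imputed to each target
head(nn$neiDstTrgs) # the corresponding distances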
Examples
require(yaImpute)
data(iris)
# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)
# form some test data, y's are defined only for reference
# observations.
refs <- sample(rownames(iris),50)
x <- iris[,1:2] # Sepal.Length Sepal.Width
y <- iris[refs,3:4] # Petal.Length Petal.Width
# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")
# compare these results using the generalized mean distances. mal wins!
grmsd(mal,msn)
# use projection pursuit and specify ppControl (loads package ccaPP)
if (require(ccaPP))
{
msnPP <- yai(x=x,y=y,method="msnPP",ppControl=c(method="kendall",search="proj"))
grmsd(mal,msnPP,msn)
}
#############
data(MoscowMtStJoe)
# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01 # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar, 1, function(x)
  c(x[1]*cos(x[2]), x[1]*sin(x[2]))))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]
msn <- yai(x=x, y=y, method="msn", k=1)
mal <- yai(x=x, y=y, method="mahalanobis", k=1)
# the results can be plotted.
plot(mal,vars=yvars(mal)[1:16])
# compare these results using the generalized mean distances.
grmsd(mal,msn)
# try method="randomForest"
if (require(randomForest))
{
# reduce the plant community data for randomForest.
yba <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7) # see help on whatsMax
rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)
# compare the results using individual rmsd's
compare.yai(mal,msn,rforig)
plot(compare.yai(mal,msn,rforig))
# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ
# but one is not clearly better than the other.
rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)
}