R: Calculate BCMI between a set of continuous variables
cmi
R Documentation
Calculate BCMI between a set of continuous variables
Description
This function calculates mutual information (MI) and bias-corrected mutual
information (BCMI) between a set of continuous variables held as columns in
a matrix. It also performs jackknife bias correction and provides a z-score
for the hypothesis of no association. Also included are the *.pw functions,
which calculate MI between two vectors only. The *njk functions skip the
jackknife and are therefore faster.
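To make the quantity being estimated concrete, the following is a minimal base-R sketch of a kernel-based MI estimate between two vectors. It is an illustration only and is not the code used by cmi() or cmi.pw(): the function name kernel_mi, the Gaussian kernel, and the bw.nrd0() bandwidths are all assumptions chosen to keep the example self-contained.

```r
# Hypothetical helper: resubstitution estimate of MI using Gaussian kernels.
kernel_mi <- function(x, y, hx = bw.nrd0(x), hy = bw.nrd0(y)) {
  # n x n matrices of kernel weights between every pair of observations
  kx <- dnorm(outer(x, x, "-") / hx) / hx
  ky <- dnorm(outer(y, y, "-") / hy) / hy
  fx  <- rowMeans(kx)        # marginal density estimate at each x[i]
  fy  <- rowMeans(ky)        # marginal density estimate at each y[i]
  fxy <- rowMeans(kx * ky)   # joint density estimate (product kernel)
  # Average log-ratio of joint to product of marginals
  mean(log(fxy / (fx * fy)))
}

set.seed(1)
x     <- rnorm(200)
y_dep <- x + rnorm(200, sd = 0.3)  # strongly associated with x
y_ind <- rnorm(200)                # independent of x

mi_dep    <- kernel_mi(x, y_dep)
mi_ind    <- kernel_mi(x, y_ind)
mi_scaled <- kernel_mi(10 * x, y_dep)  # rescaling x leaves the estimate unchanged
```

Because the bandwidths here scale with the data, multiplying x by a constant does not change the estimate, which mirrors the scale-invariance of MI itself.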
Arguments
cts
The data matrix. Each row is an observation and each column is a
variable of interest. The data should be numeric.
level
The number of levels used for plug-in bandwidth estimation (see
the documentation for the KernSmooth package).
na.rm
Remove missing values if TRUE. This is required for the
bandwidth calculation.
h
A (double) vector of smoothing bandwidths, one for each variable. If
missing this will be calculated using the dpik() function from the
KernSmooth package.
...
Additional options passed to dpik() if necessary.
v1
The first vector for the pairwise version.
v2
The second vector for the pairwise version.
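As a rough illustration of the h argument, bandwidths can be computed up front with dpik() and inspected before being supplied. This sketch assumes the KernSmooth package is installed and uses dpik()'s own defaults, which may differ from the defaults used internally by the functions documented here.

```r
library(KernSmooth)

# One plug-in bandwidth per column (variable) of the data matrix
x <- as.matrix(USArrests)
h <- apply(x, 2, dpik)

# The same, but using the standard deviation as the scale estimate,
# which corresponds to passing scalest = "stdev" through `...`
h_sd <- apply(x, 2, dpik, scalest = "stdev")

print(h)
print(h_sd)
```

A vector such as h, with one entry per column, is the form expected by the h argument; the pairwise functions need only two entries.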
Details
The results of cmi() are in many ways similar to a correlation matrix,
with each row and column index corresponding to a given variable.
cminjk() and cminjk.pw() return only the MI values, without performing the
jackknife. cmi.pw() and cminjk.pw() each require only two bandwidths, one
for each variable. The number of processor cores used can be changed by
setting the environment variable "OMP_NUM_THREADS" before starting R.
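The current setting can be checked from within a session as sketched below. The shell command in the final comment is an assumption about a typical Unix-like setup; the thread count of 4 is arbitrary.

```r
# Report the value of OMP_NUM_THREADS seen by this R session, if any
threads <- Sys.getenv("OMP_NUM_THREADS", unset = NA)

if (is.na(threads)) {
  message("OMP_NUM_THREADS is not set; the OpenMP runtime picks a default.")
} else {
  message("OMP_NUM_THREADS = ", threads)
}

# To change it, set the variable before launching R, for example from a shell:
#   OMP_NUM_THREADS=4 R
```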
Value
Returns a list of 3 matrices, each of size ncol(cts) by
ncol(cts):
mi
The raw MI estimates.
bcmi
Jackknife bias-corrected MI estimates (BCMI). Each is the raw MI value
minus the corresponding jackknife estimate of bias.
zvalues
Z-scores for each hypothesis that the corresponding
BCMI value is zero. These have poor statistical properties but can be useful
as a rough measure of the strength of association.
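The relationship between the three returned quantities can be sketched with a generic leave-one-out jackknife. The example below applies it to the plug-in variance rather than to MI so that it runs in base R alone; the helper names are hypothetical, and forming the z-score as the corrected estimate divided by the jackknife standard error is an assumption about the usual construction, not a statement of how the functions documented here compute it.

```r
# Hypothetical helper: leave-one-out jackknife for any scalar estimator
jackknife <- function(x, estimator) {
  n <- length(x)
  theta_hat <- estimator(x)                      # raw estimate on all data
  theta_loo <- vapply(seq_len(n),
                      function(i) estimator(x[-i]),
                      numeric(1))                # leave-one-out estimates
  bias <- (n - 1) * (mean(theta_loo) - theta_hat)
  se   <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))
  corrected <- theta_hat - bias                  # raw minus estimated bias
  list(raw = theta_hat, corrected = corrected,
       se = se, z = corrected / se)
}

# A deliberately biased estimator: variance with divisor n
plugin_var <- function(x) mean((x - mean(x))^2)

set.seed(42)
x  <- rnorm(30)
jk <- jackknife(x, plugin_var)

# For this estimator the jackknife correction recovers var(x) exactly
c(raw = jk$raw, corrected = jk$corrected, unbiased = var(x))
```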
Examples
##################################################
# The USArrests dataset
# Matrix version
c1 <- cmi(USArrests)
lapply(c1, round, 2)
# Pairwise version
cmi.pw(USArrests[,1], USArrests[,2])
# Without jackknife
c2 <- cminjk(USArrests)
round(c2, 2)
cminjk.pw(USArrests[,1], USArrests[,2])
##################################################
# A look at Anscombe's famous dataset.
par(mfrow = c(2,2))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)
cor(anscombe$x1, anscombe$y1)
cor(anscombe$x2, anscombe$y2)
cor(anscombe$x3, anscombe$y3)
cor(anscombe$x4, anscombe$y4)
cmi.pw(anscombe$x1, anscombe$y1)
cmi.pw(anscombe$x2, anscombe$y2)
cmi.pw(anscombe$x3, anscombe$y3)
# dpik() has some trouble with zero scale estimates on this one:
cmi.pw(anscombe$x4, anscombe$y4, scalest = "stdev")
##################################################
# The highly collinear Longley dataset
pairs(longley, main = "longley data")
l1 <- cmi(longley)
lapply(l1, round, 2)
# Here we demonstrate the scale-invariance of MI.
# Note: Scaling can help stabilise estimates when there are
# difficulties with the bandwidth estimation, but is unnecessary
# here.
long2 <- scale(longley)
l2 <- cmi(long2)
lapply(l2, round, 2)
##################################################
# See the vignette for large-scale examples.