mmi
R Documentation
Calculate mixed-pair BCMI between a set of continuous variables and a set
of discrete variables.
Description
This function calculates mutual information (MI) and bias-corrected mutual
information (BCMI) between a set of continuous variables and a set of
discrete variables (variables in columns). It also performs jackknife bias
correction and provides a z-score for the hypothesis of no association. The
*.pw functions calculate MI between two vectors only. The *njk functions
skip the jackknife and are therefore faster.
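To make the quantity being estimated concrete, here is a minimal base-R sketch of a kernel-based MI estimate for one continuous and one discrete vector. It is an illustration only, not the package's implementation: the function name mi_mixed_sketch, the Gaussian kernel, the resubstitution form of the estimate and the use of bw.nrd0() for the bandwidth are all assumptions made for this example.

```r
# Hypothetical helper (not part of the package): resubstitution estimate of
# MI between continuous x and discrete y, in nats, using Gaussian kernel
# density estimates with a single bandwidth h.
mi_mixed_sketch <- function(x, y, h) {
  y <- as.integer(factor(y))
  # n x n matrix of kernel weights K((x_i - x_j) / h) / h
  k <- dnorm(outer(x, x, "-"), sd = h)
  f_marg <- rowMeans(k)                        # density estimate at x_i
  same <- outer(y, y, "==")                    # TRUE where y_i == y_j
  f_cond <- rowSums(k * same) / rowSums(same)  # density at x_i within class y_i
  mean(log(f_cond / f_marg))
}

x <- InsectSprays$count
y <- InsectSprays$spray
h <- bw.nrd0(x)  # stand-in bandwidth rule, chosen for this sketch only
est <- mi_mixed_sketch(x, y, h)
est
```

When y has a single level the conditional and marginal densities coincide, so the sketch returns zero, which is the behaviour expected of MI under no association.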
Arguments
cts
The data matrix. Each row is an observation and each column is a
variable of interest. Should be numerical data. (For the pairwise functions this
should be a vector.)
disc
Matrix of discrete data. Each row is an observation and each
column is a variable. Will be coerced to integers. (For the pairwise functions this
should be a vector.)
level
The number of levels used for plug-in bandwidth estimation (see
the documentation for the KernSmooth package).
na.rm
Remove missing values if TRUE. This is required for the
bandwidth calculation.
h
A (double) vector of smoothing bandwidths, one for each continuous
variable. If missing, this will be calculated using the dpik() function from
the KernSmooth package.
...
Additional options passed to dpik() if necessary.
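As a rough illustration of the h argument, the sketch below computes one plug-in bandwidth per continuous variable with dpik() and then supplies the result through h. The value level = 3L is an arbitrary choice for this example, and whether these bandwidths match the ones the function would compute internally is an assumption. The final call only runs if the mpmi package is installed.

```r
library(KernSmooth)

cts <- state.x77
# One plug-in bandwidth per column; `level` is passed on to dpik().
# level = 3L is an illustrative value, not a documented default.
h <- apply(cts, 2, dpik, level = 3L)
h

# Supply the precomputed bandwidths instead of letting them be estimated.
if (requireNamespace("mpmi", quietly = TRUE)) {
  disc <- data.frame(state.division, state.region)
  m_h <- mpmi::mmi(cts, disc, h = h)
}
```

Precomputing h this way avoids repeating the bandwidth estimation when the same continuous data are compared against several sets of discrete variables.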
Details
mminjk() and mminjk.pw() return just the MI values without performing the
jackknife. mmi.pw() and mminjk.pw() require only one bandwidth, for the
continuous variable. The number of processor cores used can be changed by
setting the environment variable "OMP_NUM_THREADS" before starting R.
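Because the variable has to be set before R starts, it can be worth confirming what the session actually inherited. The snippet below is a small check using base R only; the shell line in the comment is one common way to set the variable and the value 2 is just an example.

```r
# Set in the shell before launching R, for example:
#   OMP_NUM_THREADS=2 R
# Inside the session, check the inherited value (NA means it was not set).
threads <- Sys.getenv("OMP_NUM_THREADS", unset = NA)
threads
```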
Value
Returns a list of three matrices, each of size ncol(cts) by
ncol(disc). Each row index represents a continuous variable and each column
index a discrete variable.
mi
The raw MI estimates.
bcmi
Jackknife bias-corrected MI estimates (BCMI). Each is the MI value
minus the corresponding jackknife estimate of bias.
zvalues
z-scores for the hypothesis that each corresponding
bcmi value is zero. These have poor statistical properties but can be useful
as a rough measure of the strength of association.
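The relationship between the raw estimate, the jackknife bias estimate and the corrected value can be sketched with a generic leave-one-out jackknife in base R. This is not the package's code: the helper name jackknife_sketch is hypothetical, the plug-in variance stands in for MI because its bias is known exactly, and forming the z-score as the corrected estimate divided by the jackknife standard error is an assumption about how such a score would typically be built.

```r
# Hypothetical helper: leave-one-out jackknife for any estimator of a vector.
jackknife_sketch <- function(x, estimator) {
  n <- length(x)
  theta_hat <- estimator(x)
  # Re-estimate with each observation left out in turn
  theta_loo <- vapply(seq_len(n), function(i) estimator(x[-i]), numeric(1))
  bias <- (n - 1) * (mean(theta_loo) - theta_hat)
  se <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))
  corrected <- theta_hat - bias  # raw estimate minus jackknife bias estimate
  list(raw = theta_hat, bias = bias, corrected = corrected,
       z = corrected / se)       # assumed form of the z-score
}

# The plug-in variance divides by n and so is biased downwards.
plug_in_var <- function(x) mean((x - mean(x))^2)

x <- morley$Speed
jk <- jackknife_sketch(x, plug_in_var)
# For this estimator the jackknife correction recovers the usual var()
all.equal(jk$corrected, var(x))
```

For MI the same idea applies with whole observations (rows) left out one at a time, which is why the jackknife versions of the functions take noticeably longer than the *njk versions.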
Examples
##################################################
# A dataset with discrete and continuous variables
cts <- state.x77
disc <- data.frame(state.division,state.region)
summary(cts)
table(disc)
m1 <- mmi(cts, disc)
lapply(m1, round, 2)
# Division gives more information about the continuous variables than region.
# Here is one (column 6, HS Grad) where both division and region show a
# strong association:
boxplot(cts[,6] ~ disc[,1])
boxplot(cts[,6] ~ disc[,2])
# In this case the states need to be divided into regions before a clear
# association can be seen:
boxplot(cts[,1] ~ disc[,1])
boxplot(cts[,1] ~ disc[,2])
# Look at associations within the continuous variables:
pairs(cts, col = state.region)
c1 <- cmi(cts)
lapply(c1, round, 2)
##################################################
# A pairwise comparison
# Note that the ANOVA homoskedasticity assumption is not satisfied here.
boxplot(InsectSprays[,1] ~ InsectSprays[,2])
mmi.pw(InsectSprays[,1], InsectSprays[,2])
##################################################
# Another pairwise comparison
boxplot(morley[,3] ~ morley[,1])
m2 <- mmi.pw(morley[,3], morley[,1])
m2
##################################################
# See the vignette for large-scale examples.