Last data update: 2014.03.03

R: Matrix of Hoeffding's D Statistics
hoeffdR Documentation

Matrix of Hoeffding's D Statistics

Description

Computes a matrix of Hoeffding's (1948) D statistics for all possible pairs of columns of a matrix. D is a measure of the distance between F(x,y) and G(x)H(y), where F(x,y) is the joint CDF of X and Y, and G and H are marginal CDFs. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. The D statistic is robust against a wide variety of alternatives to independence, such as non-monotonic relationships. The larger the value of D, the more dependent are X and Y (for many types of dependencies). D used here is 30 times Hoeffding's original D, and ranges from -0.5 to 1.0 if there are no ties in the data. print.hoeffd prints the information derived by hoeffd. The higher the value of D, the more dependent are x and y. hoeffd also computes the mean and maximum absolute values of the difference between the joint empirical CDF and the product of the marginal empirical CDFs.

Usage

hoeffd(x, y)
## S3 method for class 'hoeffd'
print(x, ...)

Arguments

x

a numeric matrix with at least 5 rows and at least 2 columns (if y is absent), or an object created by hoeffd

y

a numeric vector or matrix which will be concatenated to x

...

ignored

Details

Uses midranks in case of ties, as described by Hollander and Wolfe. P-values are approximated by linear interpolation on the table in Hollander and Wolfe, which uses the asymptotically equivalent Blum-Kiefer-Rosenblatt statistic. For P<.0001 or >0.5, P values are computed using a well-fitting linear regression function in log P vs. the test statistic. Ranks (but not bivariate ranks) are computed using efficient algorithms (see reference 3).

Value

a list with elements D, the matrix of D statistics, n the matrix of number of observations used in analyzing each pair of variables, and P, the asymptotic P-values. Pairs with fewer than 5 non-missing values have the D statistic set to NA. The diagonals of n are the number of non-NAs for the single variable corresponding to that row and column.

Author(s)

Frank Harrell
Department of Biostatistics
Vanderbilt University
f.harrell@vanderbilt.edu

References

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.

Hollander M. and Wolfe D.A. (1973). Nonparametric Statistical Methods, pp. 228–235, 423. New York: Wiley.

Press WH, Flannery BP, Teukolsky SA, Vetterling, WT (1988): Numerical Recipes in C. Cambridge: Cambridge University Press.

See Also

rcorr, varclus

Examples

x <- c(-2, -1, 0, 1, 2)
y <- c(4,   1, 0, 1, 4)
z <- c(1,   2, 3, 4, NA)
q <- c(1,   2, 3, 4, 5)
hoeffd(cbind(x,y,z,q))


# Hoeffding's test can detect even one-to-many dependency
set.seed(1)
x <- seq(-10,10,length=200)
y <- x*sign(runif(200,-1,1))
plot(x,y)
hoeffd(x,y)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, round.POSIXt, trunc.POSIXt, units

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/Hmisc/hoeffd.Rd_%03d_medium.png", width=480, height=480)
> ### Name: hoeffd
> ### Title: Matrix of Hoeffding's D Statistics
> ### Aliases: hoeffd print.hoeffd
> ### Keywords: nonparametric htest
> 
> ### ** Examples
> 
> x <- c(-2, -1, 0, 1, 2)
> y <- c(4,   1, 0, 1, 4)
> z <- c(1,   2, 3, 4, NA)
> q <- c(1,   2, 3, 4, 5)
> hoeffd(cbind(x,y,z,q))
D
   x  y  z  q
x  1  0 NA  1
y  0  1 NA  0
z NA NA  1 NA
q  1  0 NA  1

avg|F(x,y)-G(x)H(y)|
     x    y z    q
x 0.00 0.04 0 0.16
y 0.04 0.00 0 0.04
z 0.00 0.00 0 0.00
q 0.16 0.04 0 0.00

max|F(x,y)-G(x)H(y)|
     x   y z    q
x 0.00 0.1 0 0.24
y 0.10 0.0 0 0.10
z 0.00 0.0 0 0.00
q 0.24 0.1 0 0.00

n
  x y z q
x 5 5 4 5
y 5 5 4 5
z 4 4 4 4
q 5 5 4 5

P
  x      y      z q     
x        0.3633   0.0000
y 0.3633          0.3633
z                       
q 0.0000 0.3633         
> 
> 
> # Hoeffding's test can detect even one-to-many dependency
> set.seed(1)
> x <- seq(-10,10,length=200)
> y <- x*sign(runif(200,-1,1))
> plot(x,y)
> hoeffd(x,y)
D
     x    y
x 1.00 0.06
y 0.06 1.00

avg|F(x,y)-G(x)H(y)|
       x      y
x 0.0000 0.0407
y 0.0407 0.0000

max|F(x,y)-G(x)H(y)|
       x      y
x 0.0000 0.0763
y 0.0763 0.0000

n= 200 

P
  x  y 
x     0
y  0   
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>