R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Computes various rank correlation coefficients

comprank

R Documentation

Computes various rank correlation coefficients

Description

Computes various rank order correlations with and withou ties present.

Usage

comprank(p, q = NULL, indr, tiex = "woodbury", sizer = 100000,
	 repgin = 1000, print = TRUE)

Arguments

`p`	a numeric vector, matrix or data frame.
`q`	NULL (default) or a vector with compatible dimensions to p.
`indr`	a character string that specifies the rank correlation that is used in the test statistic. Acceptable values are: "spearman", "kendall", "gini","r4", "fy" (Fisher-Yates correlation), "filliben" (Filliben rank correlation). Only enough of the string to be unique is required.
`tiex`	a character string that specifies the method for breaking ties. Possible options are: "woodbury", "gh","wgh", "midrank","dubois". Only enough of the string to be unique is required. This option is ignored when ties are absent. The options: `"midrank"` and `"dubois"` are also ignored in the calculation of r_4, Fisher-Yates and Filliben.
`sizer`	number of replications for resolving ties by randomization. Default value: 10^5.
`repgin`	number of sampled pairs of permutations needed to apply the weighted max-min method (wgh). Default value: 10^3.
`print`	if `FALSE` suppresses some of the output.

Details

Consider a fixed set of n distinct items ordered according to the different degree in which they possess two common attributes represented by p and q. Let us suppose that each attribute consist of a host of intangibles that can be ranked but not measured and that the evaluations are expressed in terms of an ordinal scale of n ranks: (p_i, q_i), i=1,2, ..., n where p is re-arranged in the natural order.

Gideon & Hollister (1987) pointed out that reasonable coefficients of rank correlation need to possess certain properties and gave a list of postulates for nonparametric measures of dependence based on works of Racute{e}nyi (1959) and Schweizer & Wolff (1981). On this specific issue, see also Scarisini, (1984), King & Chinchilli (2001), and Genest & Plante (2003). In short, the properties that any index of rank correlation should satisfy are the following

r(p,q) is defined for any pair of permutations p,q.
Comparability. -1 ≤ r(p,q) ≤ 1 with r(p,p) = r(q,q) = 1 and r(p,p^*) = r(q,q^*) = -1 where p^* = n+1- p and q^* = n+1- q. It is also required that increasing levels (decreasing levels) of agreement are reflected by increasing positive (negative) values.
Symmetry. r(p,q) = r(q,p).
Right-invariance. r(p[θ],q[θ]) = r(p,q) for all θ, p, q in S_n, where S_n is the set of all n! permutations of integers 1, 2,cdots, n.
Antisymmetry under reversal. r(p,q^*) = r(p^*,q) = -r(p,q^*).
Zero expected value under independence. E_{p,q in S_n}[r(q,p)]=0 when all rankings are equally probable with probability 1/n!

pvrank includes six admissible rank correlations (in the sense that they have the desirable properties described above).

Spearman: r_1=1-(6/(n^3-n)S_1;quad S_1=∑_{i=1}^n (p_i-q_i)^2

Kendall: r_2=(2/(n^2-n))S_2;quad S_2=∑_{i=1}^{n-1}∑_{j=i+1}^n sign(p_i-q_j)sign(p_i-q_j)

Gini: r_3=(2/(n^2-k_n))S_3;quad S_3=∑_{i=1}^n (|n+1-p_i-q_i|-|p_i-q_i|)

r_4=Big[∑_{i=1}^ng_i(p,q^*)Big]Big[∑_{i=1}^ng_j(p^*,q)Big]-Big[∑_{i=1}^ng_i(p^*,q^*)Big]Big[∑_{i=1}^ng_j(p,q)Big]/M_n

where k_n is zero if n is even and one if n is odd. The function g≤ft(. ight) is defined as g_h(p_i,q_j)=max(p_i/q_j,q_j/p_i). The quantity M_n is given by

M_n=Big[k_n+2∑_{i=1}^{n/2}(n+1-i)/iBig]^2 -n^2

Two other admissible coefficients are:

Fisher-Yates: r_5=frac{∑_{i=1}^n ξ(p_i|n)ξ(q_i|n)}{∑_{i=1}^n ξ(p_i|n)^2}

Filliben: r_6=frac{∑_{i=1}^n ζ(p_i|n)ζ(q_i|n)}{∑_{i=1}^n ζ(p_i|n)^2}

The expression ξ(p_i|n) denotes the expected values of the i-th largest standardized deviate in a sample of size n from a standad Gaussian population. See, for example, function evNormOrdStats in package EnvStats, Millard (2013). An implementation of the Gaussian rank correlation coefficient is given in the rococo package. See Bodenhofer et al. (2013). Finally, ζ(p_i|n) is the i-th order statistic median from a standard Gaussian distribution. These medians ζ(p_i|n) are exactly related to the order statistic medians m_i from a uniform distribution on [0,1], that is, ζ(p_i|n)=Φ^{-1}(m_i) where Φ^{-1}(x) is the quantile function of the standard Gaussian cumulative distribution function. The medians of the order statistics from a unit uniform distribution are given by m_i=B^{-1}(0.5,i,n+1-i) where B^{-1}(x,a,b) denotes the inverse Beta distribution with parametes a,b evaluated at the percentage x. Fortunately, B^{-1}(,) is a standard function: qbeta in package stats. See Tarsitano & Amerise (2016) and Amerise & Tarsitano (2016).

In the absence of ties r_1, r_2, r_3 can assume (n^3-n)/6+1, (n^2-n)/2+1, (n^2-k_n)/2+1 distinct values, respectively. The coefficient r_4 can assume a number of different values of the order 0.25n! more or less uniformly spaced from each other. When n>3, r_1 can be zero if, and only if, n is not of the form n=4m+2 where m is a positive integer. The quantity S_2 is even if, and only if, n=4m or n=4m+1; S_2 only takes on an odd value if n is not in that form. For n>3, zero is always a value of r_3. If n≥5, coefficient r_4 fails to be zero for n=7. The number of distinct values of r_5 and r_6 is only limited by the number of bits used in the representation of float-point values in a computer.

Equal values are common when rank methods are applied to rounded data or data consisting solely of small integers. A popular technique for resolving ties in rank correlation is the mid-rank method: the mean of the rankings remain unaltered, but the variance is reduced and changes according to the number and location of ties.

Woodbury (1940) notes that the quadratic mean has been proposed to preserve the sum of squares of the ranks: n(n+1)(2n+1)/6. In this case the condition on the total variance is satisfied, but not that on the total mean, which turns out to be greater than (n+1)/2. The Woodbury method yields a corrected values of the rank correlation which could be regarded as the average of the values of the conventional coefficient obtained by all possible rankings of tied observations. In essence, the methods "gh" (see Gideon and Hollister, 1987) and "wgh" (see Gini, 1939) consider two special orderings with the objective to determine the pair of permutations that maximizes the direct association or positive correlation and the pair of permutations that maximizes the inverse association or, in absolute terms, the negative correlation. The method "gh" yields the simple average of the bounds; the method "wgh" yields a weighted average of the bounds. See Amerise and Tarsitano (2014).

Value

a list with the following components :

`r`	The value of the required rank correlation.
`ities`	Method for dealing with ties.

Author(s)

Ilaria Lucrezia Amerise

References

Amerise, I. L. and Tarsitano, A. (2014). Correction methods for ties in rank correlations. Journal of Applied statistics, 42, 1–13.

Amerise, I. L. and Tarsitano, A. (2016). A re-edition of the probability plot correlation coefficient leading to a new rank correlation. Submitted.

Bodenhofer, U. et al. (2013). "Testing noisy numerical data for monotonic association". information Sciences, 245, 21–37.

Genest, C. and Plante, J.-F. (2003). "On Blest's measure of rank correlation". The Canadian Journal of Statistics, 31, 35–52.

Gideon, R. and Hollister, A. (1987). "A rank correlation coefficient resistant to outliers". Journal of the American Statistical Association, 82, 656–666.

Gini, C. (1939). "Sulla determinazione dell'indice di cograduazione". Metron, 13, 41–48.

King, T.S. and Chinchilli, V.M. (2001). "Robust estimators of the concordance correlation coefficient". Journal of Biopharmaceutical Statistics, 11, 83–105.

Millard, S.P. (2013). EnvStats: An R Package for Environmental Statistics. Springer, New York.

Racute{e}nyi, A. (1959). "On measures of dependence". Acta Mathematica Hungarica, 10, 441–451.

Scarsini, M. (1984). "On measures of dependence". Stochastica, 8, 201–218.

Schweizer, B. and Wolff, E.F. (1981). "On nonparametric measures of dependence for random variables". Annals of Statistics, 9, 879–885.

Tarsitano, A. and Lombardo, R. (2013). "A coefficient of correlation based on ratios of ranks and anti-ranks". Jahrbucher fur Nationalokonomie und Statistik, 233, 206–224.

Tarsitano, A. and Amerise, I. L. (2016). "Effectiveness of rank correlation statistics in non-linear relationships". Submitted.

Woodbury, M. A. (1940). "Rank correlation when there are equal variates". The Annals of Mathematical Statistics 11, 358–362.

Examples


data(Zoutus);attach(Zoutus);print(Zoutus)
a<-comprank(LogDose,LogTime,"spearman","woodbury")
cat(a$r,a$ities,"\n")
a<-comprank(LogDose,LogTime,"kendall","woodbury")
cat(a$r,a$ities,"\n")
detach(Zoutus)
#####
#
# Yotopoulos, P. A.  Nugent, J. B. (1973). A balanced-growth version of the
# linkage hypothesis: a test. The Quarterly Journal of Economics, 87, 157-171.
x<-1:18
y<-c(7,1,4,8,3,9,2,5,10,6,17,13,14,12,11,16,15,18)
a<-comprank(x,y,"gini");cat(a$r,a$ities,"\n")
#####
#
data(Franzen);attach(Franzen)
op<-par(mfrow=c(1,1))
plot(MECISSP,PPP, main="Environmental Attitudes in International Comparison",
xlab="Mean Environmental concern ISSP", ylab="Purchasing power parity",
pch=19, cex=0.8,col="salmon3")
abline(h=mean(PPP),col="darkred",lty=2,lwd=1)
abline(v=mean(MECISSP),col="darkred",lty=2,lwd=1)
par(op)
a<-comprank(MECISSP,PPP,"kendall","gh");cat(a$r,a$ities,"\n")
a<-comprank(MECISSP,PPP,"gini","wgh");cat(a$r,a$ities,"\n")
detach(Franzen)
#####
#
data(Viscoh);attach(Viscoh)
Viscoh<-as.matrix(Viscoh)
a<-comprank(Viscoh,"spearman","gh",print=FALSE)
print(a$r);cat(" method:", a$ities,"\n")
b<-comprank(Viscoh,"r4","gh",print=FALSE)
print(b$r);cat(" method:", b$ities,"\n")
c<-comprank(Viscoh,"fy","wgh",print=FALSE)
print(c$r);cat("method:", c$ities,"\n")
d<-comprank(Viscoh,"fil","wgh",print=FALSE)
print(d$r);cat(" method:", d$ities,"\n")
detach(Viscoh)
#####
#
data(Laudaher);attach(Laudaher)
a1<-comprank(Duration,Infiltration,"gini","midrank")
a2<-comprank(Duration,Infiltration,"gini","dubois")
a3<-comprank(Duration,Infiltration,"spearman","midrank")
a4<-comprank(Duration,Infiltration,"spearman","dubois")
cat("Coefficient","method","\n",a1$r,a1$ities,"\n",a2$r,a2$ities,"\n",a3$r,a3$ities,
"\n",a4$r,a4$ities,"\n")
detach(Laudaher)
#####
# Asymptotic confidence intervals.
r.cofint <- function(r, n, index, Level=.95) {
	asd<-rep(0,4)
	asd[1]<-1/sqrt(n-1) # Spearman
	asd[2]<-sqrt((4*n+10)/(9*n*(n-1))) # Kendall
	asd[3]<-1/sqrt(1.5*n) # Gini
	asd[4]<-1/sqrt(1.00762*(n-1)) # r4
	# Fisher-Yates and Filliben
	if(index<=4) {zse<-r*asd[index]} else {zse<-atan(r*(1-0.6/(n+8)))/sqrt(n-3)} 
	rlow <- r - zse * qnorm((1-Level)/2,lower.tail=FALSE);rlow<-max(-1,rlow)
 	rupp <- r + zse * qnorm((1-Level)/2,lower.tail=FALSE);rupp<-min(1,rupp)
 	out<-list(Lower.r=rlow, Upper.r=rupp)
 	return(out)
}
#
# Rajalingam S. and Zeya O. "Summative assessments by Spearman`s correlation
# coefficient: a case study. in enhancing learning: teaching & learning international 
# conference. Nov 24-26 2011. Miri, Sarawak: Curtin University
FA=c(39.09,39.77,35.62,34.69,34.42,35.57,35.94,38.68,38.41,36.4,37.95,37.03,38.03,
35.5,38.55,30.68,26.3,36.28)
SA=c(27.5,34,24,24,17,17.5,16.5,26,25.5,28.5,26.5,13,12.5,9.5,28.5,23,24.5,22)
n<-length(FA)
op<-par(mfrow=c(1,1))
plot(FA,SA, main="Academic progresses of students",
xlab="Formative assessment", ylab="summative assessment",
pch=19,col="steelblue4")
text(FA,SA,label=1:n,cex=0.8,pos=2)
abline(h=mean(SA),col="darkblue",lty=2,lwd=1)
abline(v=mean(FA),col="darkblue",lty=2,lwd=1)
par(op)
rct<-c("spearman","kendall","gini","r4", "fy","filliben")
for (index in 1:6){
	r<-comprank(FA,SA,rct[index])$r
	cir<-r.cofint(r,n,index,Level=0.99)
	cat(rct[index],"Low:",cir$Lower,"Value:",r," Upp: ",cir$Upper.r,"\n")
}
#####
#
# Daniel, C.  Wood, F. S. Fitting Equations to Data. New York: John Wiley,
# 1971, p. 45. Pilot-plant data
# The response variable (y) corresponds to the acid content determined by
# titration and the explanatory variable (x) is the organic acid content determined 
# by extraction and weighting
y<-c(76, 70, 55, 71, 55, 48, 50, 66, 41, 43, 82, 68, 88, 58, 64, 88, 89, 88,
84, 88)
x<-c(123, 109, 62, 104, 57, 37, 44, 100, 16, 28, 138, 105, 159, 75, 88, 164,
169, 167, 149, 167)
out1<-comprank(x,y,"s","woodbury")
out2<-comprank(x,y,"k","woodbury")
out3<-comprank(x,y,"g","woodbury")
out4<-comprank(x,y,"r4","woodbury")
out5<-comprank(x,y,"fy","woodbury")
out6<-comprank(x,y,"fil","woodbury")
ind<-c(out1$r,out2$r,out3$r,out4$r,out5$r,out6$r)
cat(out1$ities,"\n")
print(round(ind,5))
#####
#
data(DietFish);attach(DietFish)
op<-par(mfrow=c(1,1))
plot(occ_D,occ_H, main="Comparison of the diets of banded killfish",xlab="occ F_diaphanus",
ylab="occ F_heteroclitus", pch=19, cex=0.8,col="darkcyan")
text(occ_D,occ_H, labels = rownames(DietFish), cex=0.5, pos=3)
abline(h=mean(occ_D),col="darkred",lty=2,lwd=1)
abline(v=mean(occ_H),col="darkred",lty=2,lwd=1)
par(op)
a<-comprank(occ_D,occ_H,"sp","woodbury");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","gh");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","wgh");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","midrank");cat(a$r,a$ities,"\n")
a<-comprank(occ_D,occ_H,"sp","dubois");cat(a$r,a$ities,"\n")
detach(DietFish)
#####
#
data(Radiation);attach(Radiation);Radiation<-as.matrix(Radiation)
r1<-comprank(Radiation,"spearman","midrank",print=FALSE);eigen(r1$r)
r2<-comprank(Radiation,"kendall","midrank",print=FALSE);eigen(r2$r)
r3<-comprank(Radiation,"gini","midrank",print=FALSE);eigen(r3$r)
r4<-comprank(Radiation,"r4","gh",print=FALSE);eigen(r4$r)
r5<-comprank(Radiation,"fy","gh",print=FALSE);eigen(r5$r)
r6<-comprank(Radiation,"fil","gh",print=FALSE);eigen(r6$r)
detach(Radiation)
#####
#
# Correlation matrix
data(Marozzi);attach(Marozzi)
Marozzi<-as.matrix(Marozzi)
cor1<-comprank(Marozzi,"spearman","midrank",print=FALSE)
rownames(cor1$r)<-colnames(Marozzi);rownames(cor1$r)<-colnames(Marozzi)
print(round(cor1$r,3))
cor1 <-comprank(Marozzi,"kendall","midrank",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"gini","midrank",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"r4","wgh",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"fy","wgh",print=FALSE)
print(round(cor1$r,3))
cor1<-comprank(Marozzi,"fil","wgh",print=FALSE)
print(round(cor1$r,3))
detach(Marozzi)