Last data update: 2014.03.03

R: Pairwise association measure between categorical variables
pw.assoc    R Documentation

Pairwise association measure between categorical variables

Description

This function computes four association measures (Cramer's V, Goodman–Kruskal lambda and tau, and Theil's uncertainty coefficient) between a nominal categorical response variable and each of the available predictors, which must also be categorical.

Usage

pw.assoc(formula, data, weights=NULL, freq0c=NULL)

Arguments

formula

A formula of the type y~x1+x2, where y is the name of the categorical variable (a factor in R) that plays the role of the dependent variable, and x1 and x2 are the names of the predictors (both categorical variables). Numeric variables are not allowed; any numeric variable should be categorized (see function cut) before being passed to pw.assoc.
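As a minimal sketch of the categorization step, a numeric variable can be turned into a factor with cut before it appears in the formula. The data frame, the variable names and the break points below are illustrative assumptions, not part of the package.

```r
# Hypothetical data: 'age' is numeric and must be categorized first
df <- data.frame(
  y   = factor(c("a", "b", "a", "b", "a", "b")),
  age = c(23, 35, 47, 52, 61, 78)
)

# Illustrative break points; intervals are right-closed by default,
# i.e. (0,30], (30,60], (60,Inf]
df$age_cl <- cut(df$age, breaks = c(0, 30, 60, Inf),
                 labels = c("young", "middle", "old"))

table(df$age_cl)
```

The resulting factor age_cl could then be used on the right-hand side of the formula in place of age.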

data

The data frame which contains the variables called by formula.

weights

The name of the optional variable in data that provides the units' weights. Weights are used to estimate frequencies (a cell frequency is estimated by summing the weights of the units that fall in that cell). Default is NULL (no weights available, each unit counts as 1).
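The idea of estimating cell frequencies by summing weights can be sketched in base R with xtabs. The data and column names here are hypothetical, and this is not necessarily how pw.assoc builds its tables internally.

```r
# Hypothetical data with a weight column 'w'
df <- data.frame(
  row_var = factor(c("a", "a", "b", "b")),
  col_var = factor(c("x", "y", "x", "x")),
  w       = c(1.5, 2.0, 0.5, 3.0)
)

# Weighted table: each cell is the sum of the weights of its units
tab_w <- xtabs(w ~ row_var + col_var, data = df)

# Unweighted table for comparison: each unit counts as 1
tab_u <- table(df$row_var, df$col_var)

tab_w
tab_u
```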

freq0c

A small number that replaces any cell with zero frequency, in order to avoid computation failures. When NULL (default), a cell with zero frequency is replaced with 1/N^2, where N is the sample size.
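A minimal sketch of the default zero-cell replacement, using a hypothetical 2x2 table. It only illustrates the substitution described above; whether the function rescales the table afterwards is not stated here and is not assumed.

```r
# Hypothetical contingency table with one empty cell
tab <- matrix(c(5, 0, 3, 2), nrow = 2)

N <- sum(tab)            # sample size (10 in this example)
freq0c <- 1 / N^2        # default replacement value described above

# Replace empty cells so that terms such as log(p_ij) stay finite
tab_adj <- tab
tab_adj[tab_adj == 0] <- freq0c

tab_adj
```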

Details

This function computes association measures between the response variable and each of the predictors specified in the formula. The following association measures are considered:

Cramer's V:

(Chi^2/(N*min(I-1,J-1)))^0.5

N is the sample size, I is the number of rows and J is the number of columns. Cramer's V ranges from 0 to 1.
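The formula above can be sketched directly in base R. The table is a hypothetical example, and the Pearson chi-square is computed without continuity correction, which is an assumption about what Chi^2 denotes here.

```r
# Hypothetical 2x2 table of counts (rows = response, columns = predictor)
tab <- matrix(c(10, 20, 30, 40), nrow = 2)

N <- sum(tab)
# Expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / N
# Pearson chi-square statistic, no continuity correction
chi2 <- sum((tab - expected)^2 / expected)

# Cramer's V = (Chi^2 / (N * min(I-1, J-1)))^0.5
V <- sqrt(chi2 / (N * min(nrow(tab) - 1, ncol(tab) - 1)))
V
```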

Goodman–Kruskal lambda(R|C):

lambda(R|C) = (sum_j max_i(p_ij) - max_i(p_i+))/(1 - max_i(p_i+))

It ranges from 0 to 1, and denotes how much the knowledge of the column variable (predictor) helps in reducing the prediction error of the values of the row variable.
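As a sketch, lambda(R|C) can be computed from a table of proportions as follows. The counts are hypothetical.

```r
# Hypothetical table of counts (rows = response R, columns = predictor C)
tab <- matrix(c(30, 10, 10, 50), nrow = 2)
p <- tab / sum(tab)                  # joint proportions p_ij

max_row_marg <- max(rowSums(p))      # max_i p_i+
sum_col_max  <- sum(apply(p, 2, max))  # sum_j max_i p_ij

lambda_RC <- (sum_col_max - max_row_marg) / (1 - max_row_marg)
lambda_RC
```

In this example the best guess without the predictor is wrong 40% of the time, while guessing the modal row within each column is wrong 20% of the time, so the prediction error is halved.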

Goodman–Kruskal tau(R|C):

tau(R|C) = (sum_ij p^2_ij / p_+j - sum_i p^2_i+)/(1 - sum_i p^2_i+)

It takes values in the interval [0,1] and has the same proportional reduction in error (PRE) interpretation as lambda.
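A minimal sketch of tau(R|C) in base R, assuming the standard Goodman–Kruskal definition in which the marginal term is the sum of the squared row proportions. The counts are hypothetical.

```r
# Hypothetical table of counts (rows = response R, columns = predictor C)
tab <- matrix(c(30, 10, 10, 50), nrow = 2)
p <- tab / sum(tab)                  # joint proportions p_ij

# sum_ij p_ij^2 / p_+j : divide each column of p^2 by its column total
cond_term <- sum(sweep(p^2, 2, colSums(p), "/"))
# sum_i p_i+^2 (assumed squared, as in the standard definition)
marg_term <- sum(rowSums(p)^2)

tau_RC <- (cond_term - marg_term) / (1 - marg_term)
tau_RC
```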

Theil's Uncertainty coefficient:

U(R|C) = (sum_ij p_ij log(p_ij/p_+j) - sum_i p_i+ log p_i+) / (-sum_i p_i+ log p_i+)

It takes values in the interval [0,1] and measures the reduction of uncertainty in the row variable due to knowing the column variable.
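The uncertainty coefficient can be sketched as a relative reduction in entropy. The helper name theil_u is hypothetical, and the sketch assumes a table with no empty cells (zero cells would need replacing first, as freq0c does).

```r
# Hypothetical helper: Theil's U(R|C) for a table with no zero cells
theil_u <- function(tab) {
  p <- tab / sum(tab)                       # joint proportions p_ij
  p_row <- rowSums(p)                       # p_i+
  p_col <- colSums(p)                       # p_+j
  h_r  <- -sum(p_row * log(p_row))          # entropy of R
  cond <- sweep(p, 2, p_col, "/")           # p_ij / p_+j
  h_rc <- -sum(p * log(cond))               # conditional entropy of R given C
  (h_r - h_rc) / h_r
}

tab <- matrix(c(30, 10, 10, 50), nrow = 2)  # hypothetical counts
U_RC <- theil_u(tab)
U_RC
```

For a table whose rows and columns are exactly independent, the conditional entropy equals the marginal entropy and the function returns 0 (up to rounding).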

Note that lambda, tau and U are asymmetric measures of the proportional reduction in the variance of the row variable when passing from its marginal distribution to its conditional distribution given the column variable. All three are obtained from the general expression (cf. Agresti, 2002, p. 56):

(V(R) - E[V(R|C)])/V(R)

They differ in the way the variance is measured, since there is no generally accepted definition of the variance of a categorical variable.
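For reference, the variation measures that yield each coefficient when substituted into the general expression can be written as follows. This is a sketch using the standard choices (misclassification rate for lambda, Gini concentration for tau, entropy for U), with the notation of the formulas above.

```latex
\begin{align*}
\lambda:\quad & V(R) = 1 - \max_i p_{i+}, & E[V(R \mid C)] &= 1 - \sum_j \max_i p_{ij} \\
\tau:\quad & V(R) = 1 - \sum_i p_{i+}^2, & E[V(R \mid C)] &= 1 - \sum_{i,j} \frac{p_{ij}^2}{p_{+j}} \\
U:\quad & V(R) = -\sum_i p_{i+} \log p_{i+}, & E[V(R \mid C)] &= -\sum_{i,j} p_{ij} \log \frac{p_{ij}}{p_{+j}}
\end{align*}
```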

Value

A list object with four components.

V

A vector with the estimated Cramer's V for each response-predictor pair.

lambda

A vector with the values of Goodman-Kruskal lambda(R|C) for each response-predictor pair.

tau

A vector with the values of Goodman-Kruskal tau(R|C) for each response-predictor pair.

U

A vector with the values of Theil's uncertainty coefficient U(R|C) for each response-predictor pair.

Author(s)

Marcello D'Orazio madorazi@istat.it

References

Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.

Examples

library(StatMatch) # provides pw.assoc
data(quine, package="MASS") # loads quine from MASS
str(quine)

# Lrn is the response variable
pw.assoc(Lrn~Age+Sex+Eth, data=quine)

# usage of units' weights
set.seed(1) # makes the random weights reproducible
quine$ww <- runif(nrow(quine), 1,4) # random weights, 1<=weights<=4
pw.assoc(Lrn~Age+Sex+Eth, data=quine, weights="ww")
