Last data update: 2014.03.03

R: Frechet bounds of cells in a contingency table
Frechet.bounds.cat R Documentation

Frechet bounds of cells in a contingency table

Description

This function derives the bounds for the cell probabilities of the table Y vs. Z, starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables.

Usage

Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", tol= 0.001) 

Arguments

tab.x

An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g.
tab.x <- xtabs(~x1+x2+x3, data=data.all).
When tab.x = NULL, only tab.xy and tab.xz need to be supplied.

tab.xy

An R table of the X variables vs. the Y variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xy <- xtabs(~x1+x2+x3+y, data=data.A).

Only a single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xy. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to that in tab.x; a warning is issued if this is not true (see argument tol).

When tab.x = NULL, tab.xy should be a one-dimensional table providing the marginal distribution of the Y variable.

tab.xz

An R table of the X variables vs. the Z variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xz <- xtabs(~x1+x2+x3+z, data=data.B).

Only a single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Moreover, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to that in tab.x; a warning is issued if this is not true (see argument tol).

When tab.x = NULL, tab.xz should be a one-dimensional table providing the marginal distribution of the Z variable.

print.f

A string: when print.f="tables" (default), all the cell estimates are returned as tables in a list. When print.f="data.frame", they are returned as columns of a data.frame.

tol

Tolerance used when comparing the joint distributions of the X variables (default tol = 0.001); the joint distribution of the X variables computed from tab.xy and from tab.xz should be equal to that in tab.x.
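The check controlled by tol can be pictured with a few lines of base R. This is only a sketch under stated assumptions: the toy tables are hypothetical, and comparing the largest absolute difference between relative frequencies against tol is an assumed reading of the description above, not necessarily the comparison the function performs internally.

```r
# Hypothetical toy tables: one X variable (levels a, b), one Y and one Z variable.
tab.x  <- as.table(c(a = 40, b = 60))
tab.xy <- as.table(matrix(c(30, 10,
                            20, 40), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Y = c("y1", "y2"))))
tab.xz <- as.table(matrix(c(12, 20, 10,
                            30, 15, 15), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Z = c("z1", "z2", "z3"))))
tol <- 0.001

# Relative frequencies of X in each input table
p.x    <- as.vector(prop.table(tab.x))
p.x.xy <- as.vector(prop.table(margin.table(tab.xy, 1)))
p.x.xz <- as.vector(prop.table(margin.table(tab.xz, 1)))

# Assumed criterion: largest absolute discrepancy compared with tol
diff.xy <- max(abs(p.x.xy - p.x))
diff.xz <- max(abs(p.x.xz - p.x))
ok.xy <- diff.xy <= tol   # X distribution in tab.xy agrees with tab.x
ok.xz <- diff.xz <= tol   # X distribution in tab.xz does not agree in this toy case
```

In this toy case tab.xz has a slightly different X distribution, which is the situation in which a warning would be expected.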

Details

This function computes the Frechet bounds for the relative frequencies in the contingency table of Y vs. Z, starting from the distributions P(Y|X), P(Z|X) and P(X). The bounds for the relative frequencies p(Y=j,Z=k) in the table Y vs. Z are:

p(Y=j,Z=k) >= sum_i(p(X=i) * max(0, p(Y=j|X=i) + p(Z=k|X=i) - 1))

p(Y=j,Z=k) <= sum_i(p(X=i) * min(p(Y=j|X=i),p(Z=k|X=i)))

The relative frequencies p(X=i) = n_i/n are computed from the frequencies in tab.x;
the relative frequencies p(Y=j|X=i) = n_ij/n_i. are computed from tab.xy;
finally, p(Z=k|X=i) = n_ik/n_i. are derived from tab.xz.

It is assumed that the marginal distribution of the X variables is the same in all the input tables: tab.x, tab.xy and tab.xz. If this is not true a warning message will appear.
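The conditional bounds can be traced by hand in base R. The following is a minimal sketch on hypothetical toy tables with a single X variable; it follows the two formulas given in this section and is not taken from the function's source code.

```r
# Hypothetical toy tables: X (levels a, b) vs. Y and X vs. Z, same X margins.
tab.xy <- as.table(matrix(c(30, 10,
                            20, 40), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Y = c("y1", "y2"))))
tab.xz <- as.table(matrix(c(10, 20, 10,
                            30, 15, 15), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Z = c("z1", "z2", "z3"))))
tab.x <- margin.table(tab.xy, 1)

p.x   <- as.vector(prop.table(tab.x))      # p(X=i)
p.y.x <- unclass(prop.table(tab.xy, 1))    # p(Y=j|X=i), each row sums to 1
p.z.x <- unclass(prop.table(tab.xz, 1))    # p(Z=k|X=i), each row sums to 1

# Accumulate the weighted bounds over the categories of X
low.cx <- up.cx <- matrix(0, ncol(p.y.x), ncol(p.z.x),
                          dimnames = list(Y = colnames(p.y.x),
                                          Z = colnames(p.z.x)))
for (i in seq_along(p.x)) {
  s <- outer(p.y.x[i, ], p.z.x[i, ], "+") - 1   # p(Y=j|X=i) + p(Z=k|X=i) - 1
  m <- outer(p.y.x[i, ], p.z.x[i, ], pmin)      # min(p(Y=j|X=i), p(Z=k|X=i))
  low.cx <- low.cx + p.x[i] * pmax(s, 0)
  up.cx  <- up.cx  + p.x[i] * m
}
low.cx
up.cx
```

With several X variables the same idea applies after collapsing their combinations into a single index i.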

Note that the cell bounds for the relative frequencies in the contingency table of Y vs. Z are also computed without considering the X variables:

max(0, p(Y=j) + p(Z=k) - 1) <= p(Y=j,Z=k) <= min(p(Y=j), p(Z=k))

These are the only bounds returned when tab.x = NULL.
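The unconditional bounds need only the two marginal distributions. A short base R sketch, using hypothetical one-dimensional tables for Y and Z:

```r
# Hypothetical marginal tables of Y and Z
tab.y <- as.table(c(y1 = 70, y2 = 30))
tab.z <- as.table(c(z1 = 60, z2 = 30, z3 = 10))

# Marginal relative frequencies as plain named vectors
p.y <- setNames(as.vector(prop.table(tab.y)), names(tab.y))
p.z <- setNames(as.vector(prop.table(tab.z)), names(tab.z))

# max(0, p(Y=j) + p(Z=k) - 1) and min(p(Y=j), p(Z=k)) for every cell
low.u <- pmax(outer(p.y, p.z, "+") - 1, 0)
up.u  <- outer(p.y, p.z, pmin)
low.u
up.u
```

The matrix is passed as the first argument of pmax() so that its dimensions and dimnames are kept in the result.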

Finally, the contingency table of Y vs. Z estimated under the Conditional Independence Assumption (CIA) is obtained by considering:

p(Y=j,Z=k) = sum_i(p(Y=j|X=i) * p(Z=k|X=i) * p(X=i))

When tab.x = NULL, the expected table under the assumption of independence between Y and Z is also provided:

p(Y=j,Z=k) = p(Y=j) * p(Z=k)
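Both point estimates can be reproduced in base R. The sketch below uses hypothetical toy tables with a single X variable; the CIA table is obtained by summing the product p(Y=j|X=i) * p(Z=k|X=i) * p(X=i) over the categories of X, while the independence table is the outer product of the two margins.

```r
# Hypothetical toy tables: X (levels a, b) vs. Y and X vs. Z, same X margins.
tab.xy <- as.table(matrix(c(30, 10,
                            20, 40), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Y = c("y1", "y2"))))
tab.xz <- as.table(matrix(c(10, 20, 10,
                            30, 15, 15), nrow = 2, byrow = TRUE,
                          dimnames = list(X = c("a", "b"), Z = c("z1", "z2", "z3"))))

p.x   <- as.vector(prop.table(margin.table(tab.xy, 1)))  # p(X=i)
p.y.x <- unclass(prop.table(tab.xy, 1))                  # p(Y=j|X=i)
p.z.x <- unclass(prop.table(tab.xz, 1))                  # p(Z=k|X=i)

# CIA: sum over i of p(Y=j|X=i) * p(Z=k|X=i) * p(X=i)
# (p.x * p.z.x scales row i of p.z.x by p(X=i))
cia <- crossprod(p.y.x, p.x * p.z.x)

# Independence between Y and Z, ignoring X
p.y <- colSums(p.x * p.y.x)   # p(Y=j)
p.z <- colSums(p.x * p.z.x)   # p(Z=k)
ia  <- outer(p.y, p.z)
cia
ia
```

Both tables share the same Y and Z margins; they differ only in how the association between Y and Z is filled in.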

Note that many zero cells in the input contingency tables indicate sparseness. This is an unappealing situation when estimating the cells' relative frequencies needed to derive the bounds, and the corresponding results may be unreliable. A possible alternative consists in estimating the required parameters with a pseudo-Bayes estimator (see pBayes); in practice, the input tab.x, tab.xy and tab.xz should then be the ones provided by the pBayes function.

Value

When print.f="tables" (default) a list with the following components:

low.u

The estimated lower bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

up.u

The estimated upper bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

CIA

The estimated relative frequencies in the table Y vs. Z under the Conditional Independence Assumption (CIA).

low.cx

The estimated lower bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

up.cx

The estimated upper bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

uncertainty

The uncertainty associated with the input data, summarized in terms of the average width of the uncertainty bounds with and without conditioning on the X variables.

When print.f="data.frame" the output list contains just two components:

bounds

A data.frame whose columns report the estimated uncertainty bounds.

uncertainty

The uncertainty associated with the input data, summarized in terms of the average width of the uncertainty bounds with and without conditioning on the X variables.
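As an illustration of the uncertainty summary, the average width of the bounds can be computed as the mean difference between upper and lower bounds over the cells. The matrices below are hypothetical values for a 2 x 3 table of Y vs. Z, and the plain unweighted mean is an assumed reading of "average width"; the function may compute its summary differently.

```r
# Hypothetical bounds for a 2 x 3 table of Y vs. Z
dn <- list(Y = c("y1", "y2"), Z = c("z1", "z2", "z3"))
low.u  <- matrix(0, nrow = 2, ncol = 3, dimnames = dn)
up.u   <- matrix(c(0.40, 0.35, 0.25,
                   0.40, 0.35, 0.25), nrow = 2, byrow = TRUE, dimnames = dn)
low.cx <- matrix(c(0.00, 0.10, 0.00,
                   0.10, 0.00, 0.00), nrow = 2, byrow = TRUE, dimnames = dn)
up.cx  <- matrix(c(0.30, 0.35, 0.25,
                   0.40, 0.25, 0.25), nrow = 2, byrow = TRUE, dimnames = dn)

# Assumed summary: unweighted mean of the cell-wise widths
av.u  <- mean(up.u - low.u)     # without conditioning on X
av.cx <- mean(up.cx - low.cx)   # when conditioning on X
c(av.u = av.u, av.cx = av.cx)
```

A smaller average width when conditioning on the X variables indicates that the common variables reduce the uncertainty about the table of Y vs. Z.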

Author(s)

Marcello D'Orazio madorazi@istat.it

References

Ballin, M., D'Orazio, M., Di Zio, M., Scanu, M. and Torelli, N. (2009) “Statistical Matching of Two Surveys with a Common Subset”. Working Paper, 124. Dip. Scienze Economiche e Statistiche, Univ. di Trieste, Trieste.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

See Also

Fbwidths.by.x, harmonize.x

Examples


library(StatMatch) # provides Frechet.bounds.cat(), comp.prop() and harmonize.x()
data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=FALSE) # 70 distinct rows, so quine.B has N-70 rows
quine.A <- quine[lab.A, 1:3]
quine.B <- quine[-lab.A, 2:4]

# compute the tables required by Frechet.bounds.cat()
freq.x <- xtabs(~Sex+Age, data=quine.A)
freq.xy <- xtabs(~Sex+Age+Eth, data=quine.A)
freq.xz <- xtabs(~Sex+Age+Lrn, data=quine.B)

# apply Frechet.bounds.cat()
bounds.yz <- Frechet.bounds.cat(tab.x=freq.x, tab.xy=freq.xy,
        tab.xz=freq.xz, print.f="data.frame")
bounds.yz

#compare marg. distribution of Xs in A and B
comp.prop(p1=margin.table(freq.xy,c(1,2)), p2=margin.table(freq.xz,c(1,2)), 
          n1=nrow(quine.A), n2=nrow(quine.B))

# harmonize distr. of Sex vs. Age before applying
# Frechet.bounds.cat()

N <- nrow(quine)
quine.A$pop <- N
quine.A$f <- N/70 # reciprocal sampling fraction
quine.B$pop <- N
quine.B$f <- N/(N-70)

# derive the table of Sex vs. Age related to the whole data set
tot.sex.age <- colSums(model.matrix(~Sex*Age-1, data=quine))
tot.sex.age

# use harmonize.x() to harmonize the Sex vs. Age between
# quine.A and quine.B

# create svydesign objects
library(survey) # provides svydesign()
svy.qA <- svydesign(~1, weights=~f, fpc=~pop, data=quine.A)
svy.qB <- svydesign(~1, weights=~f, fpc=~pop, data=quine.B)

# apply harmonize.x 
out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, form.x=~Sex*Age-1, x.tot=tot.sex.age)

# compute the new tables required by Frechet.bounds.cat()
freq.x <- xtabs(out.hz$weights.A~Sex+Age, data=quine.A)
freq.xy <- xtabs(out.hz$weights.A~Sex+Age+Eth, data=quine.A)
freq.xz <- xtabs(out.hz$weights.B~Sex+Age+Lrn, data=quine.B)

#compare marg. distribution of Xs in A and B
comp.prop(p1=margin.table(freq.xy,c(1,2)), p2=margin.table(freq.xz,c(1,2)), 
          n1=nrow(quine.A), n2=nrow(quine.B))

# apply Frechet.bounds.cat()
bounds.yz <- Frechet.bounds.cat(tab.x=freq.x, tab.xy=freq.xy,
        tab.xz=freq.xz, print.f="data.frame")
bounds.yz

Results