R: Computing overlap proportions among ordered vectors


computeCat computes the overlap proportions between pairs of ordered vectors of identifiers. The input is a data.frame containing non-redundant identifiers and a number of ranking statistics organized by columns. The function can compare all possible pairs of columns, or use one column as the reference ranking against which the remaining columns are compared. The output can be passed to plotCat, which draws correspondence-at-the-top (CAT) curves, as used in Irizarry et al, Nat Methods (2005), for comparing differential gene expression across platforms and labs.


computeCat(data, size=nrow(data), idCol=1, ref,
			method = c("equalRank", "equalStat"),
			decreasing = TRUE)



data
A data.frame produced by mergeData, containing a column of unique identifiers and at least two columns of ranking statistics (e.g. t-statistics, fold-change, Cox coefficients).


size
numeric. The number of top-ranking statistics to be considered when computing the overlap proportions. If omitted, all rows in data are considered. If size is large, computation time may be long.


idCol
numeric or character. The index (by default equal to one) or the name of the column containing the common identifiers (e.g. ENTREZID, SYMBOL, ...).


ref
character. The name of the column of ranking statistics to be used as the reference in all pairwise comparisons. If omitted, all possible pairs of columns are compared.


method
character. The method used to compute the overlap proportion between two ordered vectors of identifiers: either "equalRank" or "equalStat". The first method computes the overlap based on equal ranks, whereas the second uses equal statistics.


decreasing
logical. Defines whether the ranking statistics are sorted in decreasing (the default) or increasing order.


computeCat computes overlap proportions between pairs of ordered vectors of identifiers. The function first finds all possible pairs of vectors, then computes the corresponding overlap proportions. If a column is selected as the reference through the ref argument, only the pairs involving this column are returned.
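As a rough illustration of which comparisons are made, the pairs can be enumerated with combn from base R. The column names below are made up for the example, and the sketch only mirrors the description above; it is not the package's own code.

```r
## Made-up names for the ranking-statistic columns
statCols <- c("dataSetA.t", "dataSetB.t", "dataSetC.t")

## All possible pairs of columns: choose(3, 2) comparisons
allPairs <- combn(statCols, 2, simplify = FALSE)

## With a reference column, keep only the pairs that involve it
ref <- "dataSetA.t"
refPairs <- Filter(function(p) ref %in% p, allPairs)

length(allPairs)  # 3
length(refPairs)  # 2
```

With k columns of statistics there are choose(k, 2) pairs in total, but only k - 1 once a reference column is fixed.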

Briefly, for each CAT curve the two vectors of identifiers are first ordered by the chosen ranking statistics, then the overlap between the two vectors is computed for progressively larger numbers of top-ranked identifiers (the vector size).
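For the equal-rank case, this procedure can be sketched in base R as follows. It is a minimal illustration under stated assumptions, not the implementation used by computeCat: the helper catEqualRank and the toy data frame are hypothetical names and values introduced here.

```r
## Hypothetical helper (not part of matchBox): overlap proportion at each
## list size i, comparing the top-i identifiers of two rankings.
catEqualRank <- function(ids, statA, statB, size = length(ids),
                         decreasing = TRUE) {
  ## Order the identifiers by each ranking statistic
  topA <- ids[order(statA, decreasing = decreasing)]
  topB <- ids[order(statB, decreasing = decreasing)]
  ## For i = 1..size, proportion of identifiers shared by the two top-i lists
  vapply(seq_len(size), function(i) {
    length(intersect(topA[seq_len(i)], topB[seq_len(i)])) / i
  }, numeric(1))
}

## Toy data: four identifiers ranked by two statistics (made-up values)
toy <- data.frame(SYMBOL = c("g1", "g2", "g3", "g4"),
                  A.t = c(4, 3, 2, 1),
                  B.t = c(3, 4, 1, 2),
                  stringsAsFactors = FALSE)

## Overlap proportions for list sizes 1 to 4: 0, 1, 2/3, 1
catToy <- catEqualRank(toy$SYMBOL, toy$A.t, toy$B.t)
```

Plotting catToy against the list size gives the shape of a CAT curve for this pair of rankings.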

This function can compute overlap proportions using two distinct methods: "equalRank" or "equalStat". With "equalRank" the overlap is obtained between vectors of the same size, defined by equal ranks, which can correspond to ranking statistics of different magnitude (i.e. the vectors are of the same size, but might have different ranking statistics). With "equalStat" the overlap is obtained between vectors defined by equal ranking statistics, which can correspond to different ranks and hence to vectors of different size (i.e. the vectors are of different size, but have similar ranking statistics).
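The difference between the two methods can be seen on a small example. The sketch below assumes that "equalStat" selects, in each vector, the identifiers whose statistic is at or above a common cutoff. This reading, the helper setsAtCutoff, and the toy values are illustrative assumptions; how computeCat turns the shared count into a proportion is not reproduced here.

```r
## Hypothetical helper (not part of matchBox): identifiers selected in each
## ranking when the same cutoff is applied to both statistics.
setsAtCutoff <- function(ids, statA, statB, cutoff) {
  setA <- ids[statA >= cutoff]
  setB <- ids[statB >= cutoff]
  list(sizeA = length(setA),
       sizeB = length(setB),
       shared = length(intersect(setA, setB)))
}

ids   <- c("g1", "g2", "g3", "g4")
statA <- c(5.0, 3.2, 2.5, 0.4)  # made-up t-statistics, data set A
statB <- c(2.8, 1.1, 2.1, 0.3)  # made-up t-statistics, data set B

## Same cutoff, different set sizes: 3 identifiers in A, 2 in B, 2 shared
eqStat <- setsAtCutoff(ids, statA, statB, cutoff = 2)
```

Under "equalRank" both sets would instead have the same size by construction, whatever the magnitude of the statistics at that rank.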


A list of lists in which each element corresponds to a CAT curve. If a reference column is provided through the ref argument, the number of list elements equals the number of combinations involving the reference column; otherwise all possible combinations are returned. With the "equalRank" method each list element contains only the overlap proportion, whereas with the "equalStat" method the number of genes with equal statistics is stored along with the overlap proportion. This output is used to produce CAT curves with the plotCat function, as described in Irizarry et al, Nat Methods (2005).


Given the combinatorial nature of the computation, running time can be long if the input data contains many columns and many rows (number of features). In such cases, consider limiting the number of rows with the size argument.


Luigi Marchionni


See Also

See mergeData and plotCat.


###load the package and the example data
library(matchBox)
data(matchBoxExpression)

###the column name for the identifiers
idCol <- "SYMBOL"

###the column name for the ranking statistics
byCol <- "t"

###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)

###select t-statistics and merge into a new data.frame using SYMBOL
mat <- mergeData(newMatchBoxExpression, idCol=idCol, byCol=byCol)

###Compute CAT for decreasing t-statistics: all genes
cpH2L <- computeCat(mat, idCol=1, decreasing=TRUE, method="equalRank")

###Compute CAT for increasing t-statistics: only the first 300 genes
cpL2H <- computeCat(mat, idCol=1, size=300, decreasing=FALSE, method="equalRank")

###Compute CAT for increasing t-statistics: only the first 300 genes
###use the second column as the reference
cpL2H.ref <- computeCat(mat, idCol=1, size=300, ref="dataSetA.t",
	decreasing=FALSE, method="equalRank")


