computeCat
R Documentation
Computing overlap proportions among ordered vectors
Description
computeCat computes the overlap proportions between
pairs of ordered vectors of identifiers.
The input to this function is a data.frame containing non-redundant
identifiers and a number of ranking statistics organized by columns.
This function can compare all possible pairwise combinations of columns,
or use one selected column as the reference ranking for the remaining ones.
The output of this function can be used as the input to
plotCat, which creates correspondence-at-the-top (CAT)
curves, as used in Irizarry et al., Nat Methods (2005), for
comparing differential gene expression across platforms and labs.
Arguments
data
A data.frame produced by mergeData,
containing a column of unique identifiers and at least two columns
of ranking statistics (e.g. t-statistics, fold-change, Cox coefficients).
size
numeric. The number of top ranking statistics
to be considered in the computation of the overlap proportions.
If omitted, all rows in data will be considered.
If size is large, computation time may be long.
idCol
numeric or character. The index (by default equal to one),
or the name of the column containing the common identifiers
(e.g. ENTREZID, SYMBOLS, ...).
ref
character. The column name corresponding to
the ranking statistics to be used as the reference in all pairs of
comparisons.
method
character. The method used to compute the overlap
proportion between two ordered vectors of identifiers: either "equalRank"
or "equalStat". The first method computes the overlap based
on equal ranks, whereas the second uses equal statistics.
decreasing
logical. Whether the ranking statistics should be sorted in
decreasing (TRUE) or increasing (FALSE) order.
Details
computeCat computes overlapping proportions
between pairs of ordered vectors of identifiers.
This function first finds all possible pairs of vector combinations,
then it computes the corresponding overlapping
proportions. If a column is selected as the reference,
using the argument ref, only the combinations
involving this column will be returned.
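The pairing step described above can be illustrated with base R. This is only a minimal sketch of the idea, not the code used inside computeCat. Apart from "dataSetA.t", which appears in the Examples, the column names are hypothetical.

```r
# Hypothetical names of the ranking-statistic columns
# (the identifier column is left out).
cols <- c("dataSetA.t", "dataSetB.t", "dataSetC.t", "dataSetD.t")

# All unordered pairs of columns: one CAT curve per pair.
allPairs <- t(combn(cols, 2))
nrow(allPairs)

# With a reference column, keep only the pairs that involve it.
ref <- "dataSetA.t"
refPairs <- allPairs[apply(allPairs, 1, function(p) ref %in% p), , drop = FALSE]
nrow(refPairs)
```

With four columns there are six possible pairs, and three of them involve the reference column.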
Briefly, for each CAT curve two vectors of identifiers
are first ordered by the ranking statistics of choice,
then the overlap between the two vectors is computed
by considering more and more identifiers (vector size).
This function can compute overlap proportions
using two distinct methods: "equalRank" or "equalStat".
With "equalRank" the overlap is obtained between vectors
of the same size using equal ranks, which in turn can
potentially correspond to ranking statistics of different
magnitude (i.e. the vectors are of the same
size, but might have different ranking statistics).
With "equalStat" the overlap is obtained between vectors
defined by using equal ranking statistics, which can
potentially correspond to different ranks, and hence to
vectors of different size (i.e. the vectors are of different
size, but have similar ranking statistics).
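The two methods can be contrasted on a toy example written in base R. This is a hedged sketch of the idea described above, not the implementation in computeCat. The data, the helper functions rankIds, overlapEqualRank and overlapEqualStat, and the cutoffs are all hypothetical. The denominator used for the "equalStat"-style proportion (the size of the first vector) is an assumption made only for illustration.

```r
# Toy data: identifiers with two made-up ranking statistics.
toy <- data.frame(
  id = c("g1", "g2", "g3", "g4", "g5", "g6"),
  statA = c(5.0, 4.0, 3.0, 2.0, 1.0, 0.5),
  statB = c(4.5, 1.5, 3.5, 0.2, 2.5, 0.1),
  stringsAsFactors = FALSE
)

# Identifiers ordered by a statistic (largest first when decreasing = TRUE).
rankIds <- function(ids, stat, decreasing = TRUE) {
  ids[order(stat, decreasing = decreasing)]
}

# "equalRank"-style overlap: compare the top k of each list for k = 1..size.
overlapEqualRank <- function(idsA, idsB, size = length(idsA)) {
  vapply(seq_len(size), function(k) {
    length(intersect(idsA[seq_len(k)], idsB[seq_len(k)])) / k
  }, numeric(1))
}

# "equalStat"-style overlap: the same cutoff on the statistic defines both
# sets, so the two sets can differ in size.
overlapEqualStat <- function(ids, statA, statB, cutoffs) {
  do.call(rbind, lapply(cutoffs, function(cut) {
    setA <- ids[statA >= cut]
    setB <- ids[statB >= cut]
    common <- length(intersect(setA, setB))
    data.frame(
      cutoff = cut, nA = length(setA), nB = length(setB),
      # Assumption: proportion taken relative to the first vector.
      overlap = if (length(setA) > 0) common / length(setA) else NA_real_
    )
  }))
}

idsA <- rankIds(toy$id, toy$statA)
idsB <- rankIds(toy$id, toy$statB)
catRank <- overlapEqualRank(idsA, idsB)
catStat <- overlapEqualStat(toy$id, toy$statA, toy$statB, cutoffs = c(4, 3, 2))
```

In catRank every position compares two lists of the same length k. In catStat the columns nA and nB show that a shared cutoff selects a different number of identifiers from each ranking.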
Value
A list of lists in which each element corresponds to a
CAT curve. If a specific reference column is provided
through the ref argument, the number of
list elements is equal to the number of combinations
involving the reference group, otherwise all possible
combinations are returned.
When the "equalRank" method is used, each list element
contains only the overlap proportion; when
the "equalStat" method is used, the number of genes
with equal statistics is stored along with the overlap
proportion.
This output is used to produce CAT curves
with the plotCat function, as described
in Irizarry et al., Nat Methods (2005).
Note
Given the combinatorial nature of the computation,
a long computational time can be necessary if the input
data contains many columns and many rows
(number of features).
In such a case, consider limiting the number of rows
with the size argument.
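A rough back-of-the-envelope count shows why limiting size helps. The cost model below (one overlap evaluation per pair of columns and per vector size) is an assumption made for illustration and is not taken from the package. The helper nEvaluations and the numbers of columns and rows are hypothetical.

```r
# Rough count of overlap evaluations under the assumed cost model:
# one evaluation per pair of columns and per vector size from 1 to `size`.
nEvaluations <- function(nCols, size, useRef = FALSE) {
  nPairs <- if (useRef) nCols - 1 else choose(nCols, 2)
  nPairs * size
}

allRows <- nEvaluations(nCols = 6, size = 20000)              # all pairs, all rows
top300 <- nEvaluations(nCols = 6, size = 300)                 # all pairs, top 300
top300Ref <- nEvaluations(nCols = 6, size = 300, useRef = TRUE) # reference only
c(allRows = allRows, top300 = top300, top300Ref = top300Ref)
```

Under these assumptions, restricting the computation to the top 300 rows and to a single reference column reduces the count from 300000 to 1500.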
References
Irizarry, R. A.; Warren, D.; Spencer, F.; Kim, I. F.; Biswal, S.;
Frank, B. C.; Gabrielson, E.; Garcia, J. G. N.; Geoghegan, J.;
Germino, G.; Griffin, C.; Hilmer, S. C.; Hoffman, E.;
Jedlicka, A. E.; Kawasaki, E.; Martinez-Murillo, F.;
Morsberger, L.; Lee, H.; Petersen, D.; Quackenbush, J.;
Scott, A.; Wilson, M.; Yang, Y.; Ye, S. Q.;
and Yu, W. Multiple-laboratory comparison of microarray platforms.
Nat Methods, 2005, 2, 345-350.
Ross, A. E.; Marchionni, L.; Vuica-Ross, M.; Cheadle, C.;
Fan, J.; Berman, D. M.; and Schaeffer, E. M.
Gene Expression Pathways of High Grade Localized Prostate Cancer.
Prostate, 2011, 71, 1568-1578.
Benassi, B.; Flavin, R.; Marchionni, L.; Zanata, S.; Pan, Y.;
Chowdhury, D.; Marani, M.; Strano, S.; Muti, P.; and Blandino, G.
c-Myc is activated via USP2a-mediated modulation of microRNAs
in prostate cancer. Cancer Discovery, 2012, 2, 236-247.
See Also
See mergeData and plotCat.
Examples
###load the package and the data
library(matchBox)
data(matchBoxExpression)
###the column name for the identifiers
idCol <- "SYMBOL"
###the column name for the ranking statistics
byCol <- "t"
###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)
###select t-statistics and merge into a new data.frame using SYMBOL
mat <- mergeData(newMatchBoxExpression, idCol=idCol, byCol=byCol)
###compute CAT for decreasing t-statistics: all genes
cpH2L <- computeCat(mat, idCol=1, decreasing=TRUE, method="equalRank")
###compute CAT for increasing t-statistics: only the first 300 genes
cpL2H <- computeCat(mat, idCol=1, size=300, decreasing=FALSE, method="equalRank")
###compute CAT for increasing t-statistics: only the first 300 genes
###use the second column as the reference
cpL2H.ref <- computeCat(mat, idCol=1, size=300, ref="dataSetA.t",
                        decreasing=FALSE, method="equalRank")