R: This function removes redundant features from a data.frame
filterRedundant
R Documentation
This function removes redundant features from a data.frame
Description
Before computing the proportion of overlap between ranked vectors of features,
it is necessary to remove the redundant features.
This can be accomplished using one of several methods implemented
in the filterRedundant function, as explained below.
Arguments
a data.frame from which redundant features
(rows) must be removed.
method
character. The method used for removing redundancy.
Currently available methods are maxORmin,
geoMean, random, mean, and median
(see Details below).
idCol
character or numeric. Name or index of the column
containing redundant identifiers (e.g. ENTREZID, SYMBOLS, ...).
byCol
character or numeric. Name or index of the column
containing the ranking statistics (used only with maxORmin
method).
absolute
logical. Indicates whether the absolute statistics,
as defined by byCol, should be used when reordering
(used only with maxORmin method).
decreasing
logical. Indicates whether reordering should be
decreasing or not (used only with maxORmin method).
trim
numeric. Indicates whether a trimmed mean should
be computed (used only with mean method).
...
further arguments to be passed (not currently implemented).
Details
The maxORmin method removes redundant features by keeping,
for each identifier, the row that corresponds to the maximum
or minimum value of a selected statistic.
With this approach the rows are first ranked in increasing
or decreasing order, as defined by the decreasing argument,
using the ranking statistic defined by byCol,
on either its original or its absolute scale,
as defined by the absolute argument.
Subsequently, the data.frame rows corresponding to redundant
identifiers in the column defined by idCol are identified
with the duplicated function and removed, so that only the
top-ranked row for each identifier is retained.
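The ranking-then-deduplication steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's actual implementation: the helper name keepExtreme and the toy data are hypothetical, while the argument names mirror those documented on this page.

```r
# Toy data: identifier "A" appears three times, "B" twice, "C" once.
toy <- data.frame(
  SYMBOL = c("A", "B", "A", "C", "B", "A"),
  t      = c(1.2, -3.5, -4.1, 0.3, 2.0, 2.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of matchBox) mimicking the described steps.
keepExtreme <- function(df, idCol, byCol, absolute = TRUE, decreasing = TRUE) {
  stat <- df[[byCol]]
  if (absolute) stat <- abs(stat)
  # Step 1: rank the rows by the (optionally absolute) statistic.
  ranked <- df[order(stat, decreasing = decreasing), , drop = FALSE]
  # Step 2: duplicated() flags every occurrence of an identifier after
  # the first, so negating it keeps the top-ranked row per identifier.
  ranked[!duplicated(ranked[[idCol]]), , drop = FALSE]
}

res <- keepExtreme(toy, idCol = "SYMBOL", byCol = "t")
print(res)
```

With absolute = TRUE and decreasing = TRUE, the row with the largest absolute t is kept for each identifier, so "A" is represented by the row with t = -4.1 and "B" by the row with t = -3.5.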
The mean, median, geoMean,
and random methods provide alternative ways
for summarizing numerical values corresponding to
redundant features, as defined by the idCol
argument:
mean takes the average,
median takes the median,
geoMean takes the geometric mean,
random selects a random value.
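The summarizing methods listed above can be illustrated with a small base R sketch. The helpers summariseBy, geoMean, and pickRandom, the column names, and the toy values are all hypothetical and chosen for illustration; how the package itself computes these summaries is not shown on this page.

```r
# Toy data with repeated identifiers and strictly positive values.
toy <- data.frame(
  SYMBOL = c("A", "B", "A", "C", "B", "A"),
  value  = c(1, 4, 2, 5, 9, 4),
  stringsAsFactors = FALSE
)

# Geometric mean; only meaningful for strictly positive values.
geoMean <- function(x) exp(mean(log(x)))

# Pick one value at random. Indexing by position avoids the
# sample(x, 1) pitfall when x is a single number.
pickRandom <- function(x) x[sample.int(length(x), 1)]

# Hypothetical helper: collapse valueCol to one number per identifier.
summariseBy <- function(df, idCol, valueCol, fun) {
  out <- tapply(df[[valueCol]], df[[idCol]], fun)
  data.frame(id = names(out), value = as.vector(out),
             stringsAsFactors = FALSE)
}

byMean    <- summariseBy(toy, "SYMBOL", "value", mean)
byMedian  <- summariseBy(toy, "SYMBOL", "value", median)
byGeoMean <- summariseBy(toy, "SYMBOL", "value", geoMean)
set.seed(1)
byRandom  <- summariseBy(toy, "SYMBOL", "value", pickRandom)
print(byGeoMean)
```

For identifier "A" (values 1, 2, 4) the mean is 7/3, while the median and the geometric mean are both 2, which shows how the choice of method can change the summarized value.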
Value
A data.frame with fewer rows than the input,
unique by the identifier specified by the idCol argument.
Note
filterRedundant is a utility function providing various
methods to remove redundant rows from a data.frame.
The choice of the method depends on the nature of the values
and on the final goal.
Caution should therefore be used when taking the mean
or the median across few values, or when setting the arguments
for the maxORmin method (for instance, it would
make no sense to use a decreasing ordering if the ranking
statistic is a p-value).
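The p-value caveat can be made concrete with a base R sketch. The data, the column name P.Value, and the use of order and duplicated are assumptions made for illustration; this is not a call to filterRedundant.

```r
# Toy data: two rows per identifier, ranked by a p-value.
toy <- data.frame(
  SYMBOL  = c("A", "A", "B", "B"),
  P.Value = c(0.001, 0.40, 0.03, 0.0002),
  stringsAsFactors = FALSE
)

# Increasing order: the smallest p-value per identifier comes first
# and is therefore the row that survives deduplication.
inc <- toy[order(toy$P.Value, decreasing = FALSE), ]
inc <- inc[!duplicated(inc$SYMBOL), ]

# Decreasing order: the largest (least significant) p-value is kept
# instead, which is rarely what is intended.
dec <- toy[order(toy$P.Value, decreasing = TRUE), ]
dec <- dec[!duplicated(dec$SYMBOL), ]

print(inc)
print(dec)
```

Here the increasing ordering keeps p-values 0.0002 and 0.001, whereas the decreasing ordering keeps 0.40 and 0.03.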
Author(s)
Luigi Marchionni <marchion@jhu.edu>
See Also
See duplicated.
Examples
###load data
data(matchBoxExpression)
###check whether there are redundant identifiers
sapply(matchBoxExpression,nrow)
###the column name for the identifiers
idCol <- "SYMBOL"
###the column name for the ranking statistics
byCol <- "t"
###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)
###recheck number of rows
sapply(newMatchBoxExpression, nrow)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
> library(matchBox)
> ### Name: filterRedundant
> ### Title: This functions removes redundant features from a data.frame
> ### Aliases: filterRedundant
> ### Keywords: manip
>
> ### ** Examples
>
> ###load data
> data(matchBoxExpression)
>
> ###check whether there are redundant identifiers
> sapply(matchBoxExpression,nrow)
dataSetA dataSetB dataSetC
    1375     2358     1800
>
> ###the column name for the identifiers
> idCol <- "SYMBOL"
>
> ###the column name for the ranking statistics
> byCol <- "t"
>
> ###use lapply to remove redundancy from all data.frames
> ###default method is "maxORmin"
> newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)
>
> ###recheck number of rows
> sapply(newMatchBoxExpression, nrow)
dataSetA dataSetB dataSetC
    1075     2058     1500