Last data update: 2014.03.03

filterRedundant    R Documentation

This function removes redundant features from a data.frame

Description

Before computing the proportion of overlap between ranked vectors of features, it is necessary to remove redundant features. This can be accomplished using one of several methods implemented in the filterRedundant function, as explained below.

Usage

filterRedundant(object,
    method=c("maxORmin", "geoMean", "mean", "median","random"),
    idCol=1, byCol=2, absolute=TRUE, decreasing=TRUE, trim=0, ...)

Arguments

object

a data.frame from which redundant features (rows) must be removed.

method

character. The method used for removing redundancy. Currently available methods are: maxORmin, geoMean, random, mean, and median (see Details below).

idCol

character or numeric. Name or index of the column containing redundant identifiers (e.g. ENTREZID, SYMBOLS, ...).

byCol

character or numeric. Name or index of the column containing the ranking statistic (used only with maxORmin method).

absolute

logical. Indicates whether the absolute value of the statistic defined by byCol should be used when reordering (used only with maxORmin method).

decreasing

logical. Indicates whether reordering should be decreasing or not (used only with maxORmin method).

trim

numeric. Controls whether a trimmed mean is computed; the default, 0, requests no trimming (used only with mean method).

...

further arguments to be passed (not currently implemented).

Details

The maxORmin method removes redundant features by keeping, for each identifier, the row with the maximum or minimum value of a selected statistic. With this approach the rows are first ranked in increasing or decreasing order, as defined by the decreasing argument, using the ranking statistic defined by byCol, either on its original or absolute scale, as defined by the absolute argument. Subsequently, the rows corresponding to redundant identifiers in the column defined by idCol are identified using the duplicated function and removed.
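The rank-then-deduplicate logic described above can be sketched in base R. This is only an illustration under stated assumptions, not the package source: keep_extreme and the toy data.frame are hypothetical names introduced here.

```r
# Hypothetical helper illustrating the maxORmin idea; NOT the matchBox source.
keep_extreme <- function(df, idCol = 1, byCol = 2,
                         absolute = TRUE, decreasing = TRUE) {
  stat <- df[[byCol]]
  if (absolute) stat <- abs(stat)
  # Rank rows so that the preferred row for each identifier comes first
  ranked <- df[order(stat, decreasing = decreasing), , drop = FALSE]
  # duplicated() flags every occurrence after the first one, so dropping
  # the flagged rows keeps exactly one row per identifier
  ranked[!duplicated(ranked[[idCol]]), , drop = FALSE]
}

# Toy data (made up for illustration): TP53 and MYC appear twice
toy <- data.frame(SYMBOL = c("TP53", "TP53", "BRCA1", "MYC", "MYC"),
                  t = c(1.2, -3.4, 0.5, 2.0, 2.5),
                  stringsAsFactors = FALSE)
res <- keep_extreme(toy, idCol = "SYMBOL", byCol = "t")
print(res)
```

With absolute = TRUE and decreasing = TRUE, the row kept for each identifier is the one with the largest absolute statistic, so a strongly negative value is preferred over a weakly positive one.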

The mean, median, geoMean, and random methods provide alternative ways of summarizing the numerical values corresponding to redundant features, as defined by the idCol argument: mean takes the average, median the median, geoMean the geometric mean, and random selects a random value.
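As a rough illustration of these summaries, the sketch below collapses a single numeric column per identifier using base R. It assumes column names (not indices) are supplied and that geoMean is only applied to positive values; summarise_ids and the toy data are hypothetical and do not reproduce how filterRedundant handles its columns internally.

```r
# Hypothetical helper; a sketch of per-identifier summaries, NOT the package code.
summarise_ids <- function(df, idCol, valueCol,
                          method = c("mean", "median", "geoMean", "random"),
                          trim = 0) {
  method <- match.arg(method)
  fun <- switch(method,
    mean    = function(x) mean(x, trim = trim),
    median  = median,
    geoMean = function(x) exp(mean(log(x))),          # positive values only
    random  = function(x) x[sample.int(length(x), 1)] # one value at random
  )
  # Apply the chosen summary within each identifier
  vals <- tapply(df[[valueCol]], df[[idCol]], fun)
  out <- data.frame(names(vals), as.numeric(vals), stringsAsFactors = FALSE)
  names(out) <- c(idCol, valueCol)
  out
}

# Toy data (made up for illustration)
toy2 <- data.frame(SYMBOL = c("TP53", "TP53", "BRCA1", "MYC", "MYC"),
                   expr = c(2, 8, 5, 1, 3),
                   stringsAsFactors = FALSE)
m <- summarise_ids(toy2, "SYMBOL", "expr", method = "mean")
g <- summarise_ids(toy2, "SYMBOL", "expr", method = "geoMean")
print(m)
print(g)
```

For the two TP53 values 2 and 8, the arithmetic mean is 5 while the geometric mean is 4, which shows how the choice of method changes the summarized value.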

Value

A data.frame with fewer rows than the input, unique by the identifier specified by the idCol argument.

Note

filterRedundant is a utility function providing various methods to remove redundant rows from a data.frame. The choice of method depends on the nature of the values and on the final goal. Caution should therefore be used when taking the mean or the median across few values, or when setting the arguments for the maxORmin method (for instance, it would make no sense to use a decreasing ordering if the ranking statistic is a p-value).
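The p-value caveat can be made concrete with a small base R sketch. The pick helper and the toy p-values are hypothetical; the code mimics the ranking step only and is not a call to filterRedundant itself.

```r
# Toy p-values (made up for illustration): each gene appears twice
pv <- data.frame(SYMBOL = c("TP53", "TP53", "MYC", "MYC"),
                 p = c(0.001, 0.40, 0.20, 0.03),
                 stringsAsFactors = FALSE)

# Hypothetical helper: rank by p, then keep the first row per identifier
pick <- function(df, decreasing) {
  ranked <- df[order(df$p, decreasing = decreasing), , drop = FALSE]
  ranked[!duplicated(ranked$SYMBOL), , drop = FALSE]
}

wrong <- pick(pv, decreasing = TRUE)   # keeps the LEAST significant rows
right <- pick(pv, decreasing = FALSE)  # keeps the most significant rows
print(wrong)
print(right)
```

With a decreasing ordering the rows kept are those with p = 0.40 and p = 0.20, whereas an increasing ordering keeps p = 0.001 and p = 0.03. Based on the Usage section, the analogous setting in filterRedundant would presumably be decreasing=FALSE when byCol holds p-values.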

Author(s)

Luigi Marchionni <marchion@jhu.edu>

See Also

See duplicated.

Examples

###load data
data(matchBoxExpression)

###count the rows before removing redundant identifiers
sapply(matchBoxExpression,nrow)

###the column name for the identifiers
idCol <- "SYMBOL"

###the column name for the ranking statistics
byCol <- "t"

###use lapply to remove redundancy from all data.frames
###default method is "maxORmin"
newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)

###recheck number of rows
sapply(newMatchBoxExpression, nrow)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"

> library(matchBox)
> ### Name: filterRedundant
> ### Title: This functions removes redundant features from a data.frame
> ### Aliases: filterRedundant
> ### Keywords: manip
> 
> ### ** Examples
> 
> ###load data
> data(matchBoxExpression)
> 
> ###check whether there are redundant identifiers
> sapply(matchBoxExpression,nrow)
dataSetA dataSetB dataSetC 
    1375     2358     1800 
> 
> ###the column name for the identifiers
> idCol <- "SYMBOL"
> 
> ###the column name for the ranking statistics
> byCol <- "t"
> 
> ###use lapply to remove redundancy from all data.frames
> ###default method is "maxORmin"
> newMatchBoxExpression <- lapply(matchBoxExpression, filterRedundant, idCol=idCol, byCol=byCol)
> 
> ###recheck number of rows
> sapply(newMatchBoxExpression, nrow)
dataSetA dataSetB dataSetC 
    1075     2058     1500 
> 