Last data update: 2014.03.03

R: Filter genes by variance
filterGenesByVariance    R Documentation

Filter genes by variance

Description

Filters genes by variance in one of three ways: it can remove genes whose variance is above or below a given magnitude, remove genes that fall outside a minimum and maximum variance percentile, or select the top N most variable genes.
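
The three filtering modes can be illustrated in base R on a toy matrix. This is only a sketch of the idea, not the package's implementation: the matrix values, the cut-offs, the inclusive boundaries, and the use of fractions between 0 and 1 for the percentiles are all assumptions made for illustration.

```r
# Toy matrix: 6 features (rows) x 4 samples (columns). Values are made up.
expr <- rbind(
  geneA = c(1, 1, 1, 1),
  geneB = c(1, 2, 3, 4),
  geneC = c(0, 4, 0, 4),
  geneD = c(2, 2, 3, 3),
  geneE = c(0, 10, 0, 10),
  geneF = c(5, 5, 5, 6)
)

# Per-feature variance across samples.
geneVars <- apply(expr, 1, var)

# 1) Magnitude thresholds (hypothetical cut-offs, boundaries assumed inclusive).
minVar <- 0.3
maxVar <- 10
keepByMagnitude <- names(geneVars)[geneVars >= minVar & geneVars <= maxVar]

# 2) Percentile thresholds (assumed to be fractions passed to quantile()).
minVarPercentile <- 0.25
maxVarPercentile <- 0.75
cuts <- quantile(geneVars, probs = c(minVarPercentile, maxVarPercentile))
keepByPercentile <- names(geneVars)[geneVars >= cuts[1] & geneVars <= cuts[2]]

# 3) Top N most variable features.
numTopVarGenes <- 2
keepTopN <- names(sort(geneVars, decreasing = TRUE))[seq_len(numTopVarGenes)]

# Subset the matrix to the selected features, keeping it a matrix.
filteredExpr <- expr[keepTopN, , drop = FALSE]
```

Each mode yields a vector of feature names, so the same subsetting step applies whichever mode is used.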

Usage

filterGenesByVariance(study, plotSaveDir = "~/", minVarPercentile, 
maxVarPercentile=1, maxVar, minVar, exprIndex = "expr", 
keysIndex = "keys", outputFile = "varCal.txt", plotVarianceHist = FALSE,
varMetric = c("everything", "all.obs", "complete.obs", 
"na.or.complete", "pairwise.complete.obs"), 
sampleCol = TRUE, numTopVarGenes)

Arguments

study

A list containing, at minimum, a gene expression (or other molecular data) matrix and a list of keys. The matrix has keys (molecular features, such as genes) in the rows and patient samples in the columns. Because this function can be used with any type of molecular data, the names of these two list elements can be changed via exprIndex and keysIndex.

plotSaveDir

Character string specifying the directory where the histogram plot should be saved. Only used if plotVarianceHist is TRUE.

minVarPercentile

Minimum variance percentile. Must be provided in conjunction with maxVarPercentile to use percentiles to threshold genes.

maxVarPercentile

Maximum variance percentile. Default is 1, i.e. 1%. Must be provided in conjunction with minVarPercentile to use percentiles to threshold genes.

maxVar

If maxVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes are removed that are above a certain variance magnitude. This may be useful if a user suspects very highly varying genes are actually technical noise/outliers. May be used in conjunction with minVar or in isolation.

minVar

If minVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes whose variance is below this magnitude are removed. This is helpful before running certain algorithms, such as the popular ComBat batch normalization technique, that can throw errors if genes with extremely low variances are in the data matrix. May be used in conjunction with maxVar or in isolation.

exprIndex

Character string. List slot name for the data matrix, presumably an expression matrix.

keysIndex

Character string. List slot name for the feature names, presumably probes or gene names.

outputFile

Output file for messages that report the status of the filtering. Include the full path if the file should not be written to the current working directory.

plotVarianceHist

Plot the histogram of variances overall? Good for exploratory analyses to understand the distribution of variance across all data points. Default is FALSE to avoid saving a ggplot image for every function run.

varMetric

Standard options taken from the base var() function. May be important if you have NA values in your data matrix; otherwise, "everything" is usually fine.

sampleCol

Are samples in the columns of the expression matrix? If not, this function will first transpose the matrix, as the function assumes samples are in the columns and features are in the rows.

numTopVarGenes

A numeric value indicating the number of genes (features) to select; the function will only take this number of genes that have the highest variance across all genes.
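
As a sketch of how study, exprIndex, keysIndex and sampleCol fit together, the snippet below builds a study list with non-default slot names from a matrix that has samples in the rows. The data, the probe names, and the slot names "beta" and "probes" are hypothetical. The transposition is done by hand only to show what sampleCol = FALSE is documented to do, and the package call is left commented out because it needs curatedBreastData to be installed.

```r
# Hypothetical data with samples in the ROWS and features in the columns.
beta <- matrix(
  c(0.1, 0.9, 0.5,
    0.2, 0.8, 0.5,
    0.1, 0.7, 0.6,
    0.3, 0.9, 0.4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), c("cg01", "cg02", "cg03"))
)

# Slot names are free to choose; they are passed via exprIndex / keysIndex.
study <- list(beta = beta, probes = colnames(beta))

# What sampleCol = FALSE is documented to do first: transpose the matrix so
# that features are in the rows and samples are in the columns.
sampleCol <- FALSE
mat <- study[["beta"]]
if (!sampleCol) mat <- t(mat)

# The matching call would look like this (not run here):
# filterGenesByVariance(study, exprIndex = "beta", keysIndex = "probes",
#                       sampleCol = FALSE, numTopVarGenes = 2)
```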

Value

A list: output <- list(study=study,filteredStudy=filteredStudy,p=p);

study

Original study list object

filteredStudy

filteredStudy object, i.e. the gene expression and keys only for the desired filtered keys/features.

Note

Filtering by variance is equivalent to filtering on the coefficient of variation if the data are log-transformed. Future work includes allowing the user to threshold on the coefficient of variation instead of the variance itself.
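
A small base R illustration of the intuition behind this note, using made-up values: two features with the same relative spread but a 10-fold difference in scale have very different raw variances, yet the same coefficient of variation and the same variance after a log transform. It is a sketch of why the log scale behaves like the coefficient of variation, not a proof of exact equivalence; the cv() helper is defined here and is not part of the package.

```r
# Two made-up features: same relative spread, 10-fold difference in scale.
lowGene  <- c(90, 100, 110)
highGene <- 10 * lowGene

# Coefficient of variation (helper defined for this example only).
cv <- function(x) sd(x) / mean(x)

# On the raw scale the variances differ by the square of the scale factor.
rawVarRatio <- var(highGene) / var(lowGene)

# The coefficient of variation ignores the scale factor...
sameCV <- isTRUE(all.equal(cv(lowGene), cv(highGene)))

# ...and so does the variance of the log-transformed values, because
# log2(10 * x) = log2(10) + log2(x) only shifts the values by a constant.
sameLogVar <- isTRUE(all.equal(var(log2(lowGene)), var(log2(highGene))))
```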

It is highly suggested you use filterAndImputeSamples() beforehand to remove any NA values, to avoid -Inf or NA variance calculations.
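
To see why NA values matter, the sketch below shows how base var() behaves under some of the options listed for varMetric, assuming varMetric is forwarded to the use argument of var(). It also shows a quick way to find features that contain NA values before filtering. The toy values are made up.

```r
x <- c(2, 4, NA, 6)

# "everything": the NA propagates, so the variance is NA.
varEverything <- var(x, use = "everything")

# "complete.obs": missing values are dropped before computing the variance.
varComplete <- var(x, use = "complete.obs")

# "all.obs": the presence of an NA raises an error, caught here.
varAllObs <- tryCatch(var(x, use = "all.obs"), error = function(e) "error")

# Quick check for features (rows) that contain any NA value.
expr <- rbind(geneA = c(1, 2, 3), geneB = c(1, NA, 3))
featuresWithNA <- rownames(expr)[rowSums(is.na(expr)) > 0]
```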

Author(s)

Katie Planey <katie.planey@gmail.com>

Examples

#load up our datasets
data(curatedBreastDataExprSetList);

#just perform on one dataset as an example, GSE1379. 
#This dataset does not have NA values, which makes for a
#good example without extra pre-processing.
#highestVariance calculation may take a minute to run.
#create study list object. 
study <- list(expr=exprs(curatedBreastDataExprSetList[[1]]),
keys=curatedBreastDataExprSetList[[1]]@featureData$gene_symbol)
#take top 100 varying genes

filterGeneStudy <- filterGenesByVariance(study, exprIndex = "expr", 
keysIndex = "keys", outputFile = "./varCal.txt", 
plotVarianceHist = FALSE,
varMetric = c("everything"), sampleCol = TRUE, numTopVarGenes=100)

#names of output
names(filterGeneStudy)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocStyle
> ### Name: filterGenesByVariance
> ### Title: Filter genes by variance
> ### Aliases: filterGenesByVariance
> 
> ### ** Examples
> 
> #load up our datasets
> data(curatedBreastDataExprSetList);
> 
> #just perform on one dataset as an example, GSE1379. 
> #This dataset does not have NA values, which makes for a
> #good example without extra pre-processing.
> #highestVariance calculation may take a minute to run.
> #create study list object. 
> study <- list(expr=exprs(curatedBreastDataExprSetList[[1]]),
+ keys=curatedBreastDataExprSetList[[1]]@featureData$gene_symbol)
> #take top 100 varying genes
> 
> filterGeneStudy <- filterGenesByVariance(study, exprIndex = "expr", 
+ keysIndex = "keys", outputFile = "./varCal.txt", 
+ plotVarianceHist = FALSE,
+ varMetric = c("everything"), sampleCol = TRUE, numTopVarGenes=100)
> 
> #names of output
> names(filterGeneStudy)
[1] "study"         "filteredStudy" "p"            