Filters genes by variance. The function can remove genes whose variance lies above or below a given magnitude, remove genes that fall outside a minimum and maximum variance percentile, or select the top N most variable genes.
study
A list containing, at minimum, a gene expression (or other molecular data) matrix, with keys (molecular features, such as genes) in the rows and patient samples in the columns, and a keys list. Because this function can be used for any type of molecular data, the names of these two list slots can be changed through exprIndex and keysIndex.
plotSaveDir
Character string specifying where the variance histogram should be saved. Only used if plotVarianceHist is TRUE.
minVarPercentile
Minimum variance percentile. Must be provided in conjunction with maxVarPercentile to use percentiles to threshold genes.
maxVarPercentile
Maximum variance percentile. Default is 1, i.e. 1%. Must be provided in conjunction with minVarPercentile to use percentiles to threshold genes.
maxVar
If maxVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes are removed that are above a certain variance magnitude. This may be useful if a user suspects very highly varying genes are actually technical noise/outliers. May be used in conjunction with minVar or in isolation.
minVar
If minVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes are removed that are below a certain variance magnitude. This is helpful before running certain algorithms, such as the popular ComBat batch normalization technique, that can throw errors if genes with extremely low variances are in the data matrix. May be used in conjunction with maxVar or in isolation.
exprIndex
Character string. List slot name for the data matrix, presumably an expression matrix.
keysIndex
Character string. List slot name for the feature names, presumably probes or gene names.
outputFile
Output file for the messages that report the status of the filtering. Include the full path if the file should not be written to the current working directory.
plotVarianceHist
Whether to plot a histogram of the variances of all features. Useful in exploratory analyses to understand the distribution of variance across all data points. Default is FALSE, which avoids saving a ggplot image on every function run.
varMetric
Standard options taken from the base var() function. May be important if you have NA values in your data matrix; otherwise, "everything" is usually fine.
sampleCol
Are samples in the columns of the expression matrix? If not, this function will first transpose the matrix, as it assumes that samples are in the columns and features are in the rows.
numTopVarGenes
A numeric value indicating the number of genes (features) to select; the function will only take this number of genes that have the highest variance across all genes.
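The three filtering modes described by these arguments can be illustrated with base R on a toy matrix. This is a minimal sketch and not the package's implementation: the matrix values, the thresholds, and the variable names (geneVars, keepMagnitude, keepPercentile, topGenes) are made up for illustration, and it assumes that percentiles are expressed as fractions between 0 and 1.

```r
# Toy data: 5 features (rows) x 5 samples (columns); all values are made up.
expr <- rbind(
  geneA = c(1, 1, 1, 1, 1),      # constant gene, variance 0
  geneB = c(1, 2, 3, 4, 5),      # variance 2.5
  geneC = c(2, 4, 6, 8, 10),     # variance 10
  geneD = c(0, 10, 20, 30, 40),  # variance 250
  geneE = c(1, 1, 2, 2, 1)       # variance 0.3
)
colnames(expr) <- paste0("sample", 1:5)

# Per-feature variance across samples (samples are in the columns).
geneVars <- apply(expr, 1, var, use = "everything")

# 1) Magnitude thresholds, analogous to minVar / maxVar.
#    Hypothetical cutoffs chosen for this toy data.
keepMagnitude <- geneVars >= 0.1 & geneVars <= 100

# 2) Percentile thresholds, analogous to minVarPercentile / maxVarPercentile.
#    Assumption: percentiles given as fractions (0.2 = 20th percentile).
pctBounds <- quantile(geneVars, probs = c(0.2, 0.8))
keepPercentile <- geneVars >= pctBounds[1] & geneVars <= pctBounds[2]

# 3) Top N most variable features, analogous to numTopVarGenes.
topGenes <- names(sort(geneVars, decreasing = TRUE))[1:2]

# Subset the matrix with any of the three selections.
filteredExpr <- expr[topGenes, , drop = FALSE]
```

In this toy example the magnitude and percentile filters both keep geneB, geneC and geneE, dropping the constant gene and the extreme one, while the top-N selection keeps geneD and geneC.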
Value
A list: output <- list(study=study,filteredStudy=filteredStudy,p=p);
study
Original study list object
filteredStudy
filteredStudy object, i.e. the gene expression and keys only for the desired filtered keys/features.
Note
If the data are log-transformed, filtering by variance is approximately equivalent to filtering on the coefficient of variation. Planned further work includes letting the user threshold on the coefficient of variation directly, as opposed to the raw variance.
It is strongly recommended that you run filterAndImputeSamples() beforehand to remove any NA values, to avoid -Inf or NA variance calculations.
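The effect of missing values on a variance calculation can be seen with the var() function alone. This is a small sketch on made-up values for a single gene, not output from the package; the variable names are hypothetical.

```r
# Made-up expression values for one gene, with one missing sample.
x <- c(2, 4, NA, 6)

# With use = "everything", the NA propagates and the variance is NA.
varEverything <- var(x, use = "everything")

# With use = "complete.obs", the NA is dropped first: var(c(2, 4, 6)) is 4.
varComplete <- var(x, use = "complete.obs")

# A quick check before filtering: does the data contain any NA at all?
hasMissing <- anyNA(x)
```

A gene whose variance is NA cannot be compared against a threshold or ranked reliably, which is why removing or imputing NA values first is the safer route.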
Author(s)
Katie Planey <katie.planey@gmail.com>
Examples
#load the package and our datasets
library(curatedBreastData)
data(curatedBreastDataExprSetList)
#just perform on one dataset as an example, GSE1379.
#This dataset does not have NA values, which makes for a
#good example without extra pre-processing.
#highestVariance calculation may take a minute to run.
#create study list object.
study <- list(expr = exprs(curatedBreastDataExprSetList[[1]]),
              keys = curatedBreastDataExprSetList[[1]]@featureData$gene_symbol)
#take top 100 varying genes
filterGeneStudy <- filterGenesByVariance(study, exprIndex = "expr",
                                         keysIndex = "keys", outputFile = "./varCal.txt",
                                         plotVarianceHist = FALSE,
                                         varMetric = c("everything"), sampleCol = TRUE,
                                         numTopVarGenes = 100)
#names of output
names(filterGeneStudy)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocStyle
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/curatedBreastData/filterGenesByVariance.Rd_%03d_medium.png", width=480, height=480)
> ### Name: filterGenesByVariance
> ### Title: Filter genes by variance
> ### Aliases: filterGenesByVariance
>
> ### ** Examples
>
> #load up our datasets
> data(curatedBreastDataExprSetList);
>
> #just perform on one dataset as an example, GSE1379.
> #This dataset does not have NA values, which makes for a
> #good example without extra pre-processing.
> #highestVariance calculation may take a minute to run.
> #create study list object.
> study <- list(expr=exprs(curatedBreastDataExprSetList[[1]]),
+ keys=curatedBreastDataExprSetList[[1]]@featureData$gene_symbol)
> #take top 100 varying genes
>
> filterGeneStudy <- filterGenesByVariance(study, exprIndex = "expr",
+ keysIndex = "keys", outputFile = "./varCal.txt",
+ plotVarianceHist = FALSE,
+ varMetric = c("everything"), sampleCol = TRUE, numTopVarGenes=100)
>
> #names of output
> names(filterGeneStudy)
[1] "study" "filteredStudy" "p"
>
> dev.off()
null device
1
>