R: Collapse/handle duplicated probes (genes) in a dataset
collapseDupProbes
R Documentation
Collapse/handle duplicated probes (genes) in a dataset
Description
Used internally by processExpressionSet. Collapses duplicated "keys" (probes or genes, each corresponding to a row of the expression matrix "expr") in one of two ways: either by averaging across each set of duplicated keys, or by keeping the key that has the highest variance within each set.
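As a rough illustration of the two strategies described above, the following base-R sketch collapses a toy matrix. The helper collapse_sketch and the method label "average" are hypothetical names chosen for this example. This is not the package's implementation, and it assumes the matrix contains no NA values.

```r
# Toy expression matrix: 4 probes (rows) x 3 samples (columns).
expr <- matrix(
  c(1, 2, 3,    # probe 1, key "A"
    2, 4, 9,    # probe 2, key "A" (the more variable of the two "A" probes)
    5, 5, 5,    # probe 3, key "B"
    7, 8, 9),   # probe 4, key "C"
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("s1", "s2", "s3"))
)
keys <- c("A", "A", "B", "C")

# Hypothetical helper, NOT part of curatedBreastData.
# Assumes expr has no NA values and length(keys) == nrow(expr).
collapse_sketch <- function(expr, keys,
                            method = c("average", "highestVariance")) {
  method <- match.arg(method)
  uniq <- unique(keys)
  rows <- lapply(uniq, function(k) {
    # All rows that share this key, kept as a matrix even if only one row.
    block <- expr[keys == k, , drop = FALSE]
    if (method == "average") {
      colMeans(block)                           # per-sample mean across duplicates
    } else {
      block[which.max(apply(block, 1, var)), ]  # row with the largest variance
    }
  })
  out <- do.call(rbind, rows)
  rownames(out) <- uniq
  list(expr = out, keys = uniq)
}

avg <- collapse_sketch(expr, keys, method = "average")
hv  <- collapse_sketch(expr, keys, method = "highestVariance")
avg$expr["A", ]  # per-sample mean of the two "A" probes
hv$expr["A", ]   # the "A" probe with the larger variance
```

Both calls return one row per unique key, mirroring the "expr" and "keys" items described under Value.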
Arguments
expr
An expression matrix with genes in the rows and samples in the columns.
sampleColNames
Sample column names. Needed for internal debugging; usually the default colnames(expr) is appropriate.
keys
Generally the list of gene symbols, or some molecular key, that needs to be "collapsed" because it contains duplicated names.
method
Method used to collapse probes: take the mean across all duplicated keys, or just pick the key with the highest variance?
debug
Run internal unit tests that stop the code if they detect a bug?
removeNA_keys
Remove any NA keys?
varMetric
Standard options taken from the var() function that ships with R (package stats). May be important if your data matrix contains NA values; otherwise, "everything" is usually fine.
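To show why the choice of varMetric can matter when NA values are present, here is a small sketch using var() directly. It assumes that varMetric is passed to the use argument of var(); the page does not state this explicitly, so treat that mapping as an assumption.

```r
# One probe measured across four samples, with one missing value.
x <- c(2.1, NA, 3.4, 5.0)

# "everything": the missing value propagates, so the result is NA.
v_everything <- var(x, use = "everything")

# "complete.obs": missing values are dropped before computing the variance.
v_complete <- var(x, use = "complete.obs")

# "all.obs": any missing value is treated as an error.
v_all <- try(var(x, use = "all.obs"), silent = TRUE)

is.na(v_everything)
v_complete
inherits(v_all, "try-error")
```

With "everything", a single NA in a probe makes its variance NA, which is why imputing first or choosing another option may be preferable for data with missing values.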
Value
Returns a list with the items "expr" (the collapsed expression matrix) and "keys" (the final list of keys).
Author(s)
Katie Planey <katie.planey@gmail.com>
Examples
#load up our datasets
data(curatedBreastDataExprSetList);
#just perform on second dataset, GSE2034, as an example.
#This dataset has no NAs already but does have duplicated genes
#highestVariance calculation may take a minute to run.
collapsedData <- collapseDupProbes(expr=exprs(curatedBreastDataExprSetList[[2]]),
    keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
    method = c("highestVariance"), debug = TRUE, removeNA_keys = TRUE,
    varMetric = c("everything"))
#look at names of outputs
names(collapsedData)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocStyle
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/curatedBreastData/collapseDupProbes.Rd_%03d_medium.png", width=480, height=480)
> ### Name: collapseDupProbes
> ### Title: Collapse/handle duplicated probes (genes) in a dataset
> ### Aliases: collapseDupProbes
>
> ### ** Examples
>
> #load up our datasets
> data(curatedBreastDataExprSetList);
>
> #just perform on second dataset, GSE2034, as an example.
> #This dataset has no NAs already but does have duplicated genes
> #highestVariance calculation may take a minute to run.
> collapsedData <- collapseDupProbes(expr=exprs(curatedBreastDataExprSetList[[2]]),
+ keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
+ method = c("highestVariance"), debug = TRUE, removeNA_keys = TRUE,
+ varMetric = c("everything"))
Warning messages:
1: In collapseDupProbes(expr = exprs(curatedBreastDataExprSetList[[2]]), :
It's best to impute NA values before running this function
otherwise it may set averages to NA if there is 1 NA present.
This function just removes any genes whose key is NA.
2: In collapseDupProbes(expr = exprs(curatedBreastDataExprSetList[[2]]), :
You may get a warning here because key names are duplicated
so it can't use them as row names. That's OK.
> #look at names of outputs
> names(collapsedData)
[1] "expr" "keys"
>
> dev.off()
null device
1
>
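The first warning in the transcript notes that averaging can yield NA when even one NA is present. A minimal base-R sketch of that effect on a toy block of two duplicated probes follows; the values are made up for illustration and the code does not call the package.

```r
# Two duplicated probes (rows) measured in three samples (columns);
# the first probe is missing its value for the second sample.
block <- rbind(c(1, NA, 3),
               c(3,  4, 5))

# Plain averaging: a single NA makes that sample's average NA.
means_raw <- colMeans(block)

# Dropping NAs averages only the observed values in each sample.
means_narm <- colMeans(block, na.rm = TRUE)

means_raw
means_narm
```

Imputing missing values before collapsing, as the warning suggests, avoids having to choose between these two behaviors.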