Last data update: 2014.03.03

R: Collapse/handle duplicated probes (genes) in a dataset
collapseDupProbes    R Documentation

Collapse/handle duplicated probes (genes) in a dataset

Description

Used internally by processExpressionSet. Collapses duplicated "keys" (probes or genes, which correspond to the rows of the expression matrix "expr") in one of two ways: either by averaging across each set of duplicated keys, or by keeping the key with the highest variance within each set.
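
The two strategies can be illustrated on a toy matrix using base R only. This is a minimal sketch of the general idea, not the package's internal code; the matrix, the key names, and the helper variables (group_sums, keep, and so on) are made up for illustration.

```r
# Toy expression matrix: 4 probes (rows) x 3 samples (columns).
expr <- matrix(
  c(1, 2, 3,    # probe 1 -> GENE_A
    2, 4, 6,    # probe 2 -> GENE_A (duplicated key, higher variance)
    5, 5, 5,    # probe 3 -> GENE_B
    7, 8, 9),   # probe 4 -> GENE_C
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("s1", "s2", "s3"))
)
keys <- c("GENE_A", "GENE_A", "GENE_B", "GENE_C")

# Strategy 1 ("average"): per-sample mean over the rows sharing a key.
# rowsum() adds up the rows within each key; dividing by the number of
# rows per key turns the sums into means.
group_sums <- rowsum(expr, group = keys)
group_sizes <- as.vector(table(keys)[rownames(group_sums)])
avg_expr <- group_sums / group_sizes

# Strategy 2 ("highestVariance"): keep one row per key, namely the row
# whose variance across samples is largest.
row_var <- apply(expr, 1, var)
keep <- vapply(
  split(seq_along(keys), keys),
  function(idx) idx[which.max(row_var[idx])],
  integer(1)
)
hv_expr <- expr[keep, , drop = FALSE]
rownames(hv_expr) <- names(keep)

avg_expr
hv_expr
```

In this toy case both results have one row per unique key; only the GENE_A row differs between them, because GENE_A is the only duplicated key.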

Usage

collapseDupProbes(expr, sampleColNames = colnames(expr), keys,
  method = c("average", "highestVariance"), debug = TRUE,
  removeNA_keys = TRUE,
  varMetric = c("everything", "all.obs", "complete.obs", "na.or.complete",
    "pairwise.complete.obs"))

Arguments

expr

An expression matrix with genes in the rows and samples in the columns.

sampleColNames

Sample column names. Needed for internal debugging; usually the default colnames(expr) is appropriate.

keys

The keys for the rows of expr, generally gene symbols or some other molecular identifier, that need to be "collapsed" because they contain duplicated names.

method

Method used to collapse probes: "average" takes the mean across all duplicated keys; "highestVariance" keeps only the key with the highest variance.

debug

Run internal unit tests that stop the code if they detect a bug?

removeNA_keys

Remove any NA keys?

varMetric

Standard options for the "use" argument of the var() function (stats package). May be important if you have NA values in your data matrix; otherwise, "everything" is usually fine.
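
To see why varMetric can matter when NAs are present, the sketch below calls var() directly with each option on a small vector containing one NA. It only demonstrates the behavior of stats::var(); how collapseDupProbes passes the option through internally is not shown here, and the vector x is invented for illustration.

```r
# A vector with one missing value.
x <- c(1, 2, NA, 4)

# Options that return a value when an NA is present.
opts <- c("everything", "complete.obs", "na.or.complete",
          "pairwise.complete.obs")
res <- vapply(opts, function(u) var(x, use = u), numeric(1))
res

# "all.obs" raises an error when any NA is present, so it is wrapped in
# try() here to keep the script running.
all_obs_fails <- inherits(try(var(x, use = "all.obs"), silent = TRUE),
                          "try-error")
all_obs_fails
```

With "everything" the NA propagates into the variance, while the "complete" variants drop the missing observation first. With the "highestVariance" method, a key whose variance is NA cannot be ranked against its duplicates, which is presumably why the option is exposed.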

Value

Returns a list with the items "expr" (the processed expression matrix) and "keys" (the final list of keys).
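
As a rough illustration of working with a result of this shape, the sketch below builds a hypothetical stand-in list by hand (it does not call collapseDupProbes) and attaches the keys as row names. It assumes that the returned keys line up one-to-one with the rows of the returned matrix; check this on real output before relying on it.

```r
# Hypothetical stand-in with the documented shape: a list holding
# "expr" and "keys". Values are invented for illustration.
collapsed <- list(
  expr = matrix(c(2, 4, 6,
                  5, 5, 5),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("s1", "s2", "s3"))),
  keys = c("GENE_A", "GENE_B")
)

# Assumption: one key per row of the matrix.
stopifnot(nrow(collapsed$expr) == length(collapsed$keys))

# Once the keys are unique, they can safely serve as row names.
if (!anyDuplicated(collapsed$keys)) {
  rownames(collapsed$expr) <- collapsed$keys
}

collapsed$expr["GENE_A", ]
```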

Author(s)

Katie Planey <katie.planey@gmail.com>

Examples

#attach the package, then load up our datasets
library(curatedBreastData)
data(curatedBreastDataExprSetList)

#just perform on second dataset, GSE2034, as an example.
#This dataset has no NAs already but does have duplicated genes.
#highestVariance calculation may take a minute to run.
collapsedData <- collapseDupProbes(
  expr = exprs(curatedBreastDataExprSetList[[2]]),
  keys = curatedBreastDataExprSetList[[2]]@featureData$gene_symbol,
  method = c("highestVariance"), debug = TRUE, removeNA_keys = TRUE,
  varMetric = c("everything"))
#look at names of outputs
names(collapsedData)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocStyle
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/curatedBreastData/collapseDupProbes.Rd_%03d_medium.png", width=480, height=480)
> ### Name: collapseDupProbes
> ### Title: Collapse/handle duplicated probes (genes) in a dataset
> ### Aliases: collapseDupProbes
> 
> ### ** Examples
> 
> #load up our datasets
> data(curatedBreastDataExprSetList);
> 
> #just perform on second dataset, GSE2034, as an example.
> #This dataset has no NAs already but does have duplicated genes
> #highestVariance calculation may take a minute to run.
> collapsedData <- collapseDupProbes(expr=exprs(curatedBreastDataExprSetList[[2]]),  
+ keys=curatedBreastDataExprSetList[[2]]@featureData$gene_symbol, 
+ method = c("highestVariance"), debug = TRUE, removeNA_keys = TRUE, 
+ varMetric = c("everything"))
Warning messages:
1: In collapseDupProbes(expr = exprs(curatedBreastDataExprSetList[[2]]),  :
  It's best to impute NA values before running this function
otherwise it may set averages to NA if there is 1 NA present.
This function just removes any genes whose key is NA.
2: In collapseDupProbes(expr = exprs(curatedBreastDataExprSetList[[2]]),  :
  
You may get a warning here because key names are duplicated 
  so it can't use them as row names. That's OK.

> #look at names of outputs
> names(collapsedData)
[1] "expr" "keys"
> 
> dev.off()
null device 
          1 
>