Last data update: 2014.03.03

processExpressionSet    R Documentation

Post-process a normalized assayData in an ExpressionSet object

Description

This function post-processes normalized assayData in an ExpressionSet object. It is assumed that the assay data are already baseline normalized (for microarray data, for example, this could mean quantile normalized and then log-transformed).

Usage

processExpressionSet(exprSet, outputFileDirectory = "./", numTopVarGenes,
                     minVarPercentile, maxVarPercentile = 1, minVar)

Arguments

exprSet

ExpressionSet S4 object with expression (assay) data, featureData, and phenoData.

outputFileDirectory

Output file directory for messages that print status of post-processing the ExpressionSet.

minVarPercentile

Minimum variance percentile. Must be provided in conjunction with maxVarPercentile to use percentiles to threshold genes.

maxVarPercentile

Maximum variance percentile, given as a fraction between 0 and 1. Default is 1, i.e. the 100th percentile. Must be provided in conjunction with minVarPercentile to use percentiles to threshold genes.

minVar

If minVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes whose variance falls below this magnitude are removed. This is helpful before running certain algorithms, such as the popular ComBat batch normalization technique, that can throw errors if genes with extremely low variances are in the data matrix. May be used in conjunction with maxVar or in isolation.

numTopVarGenes

A numeric value indicating the number of genes (features) to select; the function keeps only this many genes, chosen as those with the highest variance among all genes.
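The three ways of thresholding genes by variance described above (a percentile window, an absolute minimum variance, and a top-N selection) can be sketched in base R on a toy matrix. This is an illustrative sketch only, not the package's implementation: the matrix, the 0.5 variance floor, and the use of inclusive comparisons are all assumptions.

```r
# Toy expression matrix (hypothetical data): 8 genes (rows) x 5 samples (columns).
# Row i is drawn with standard deviation i, so variance tends to grow with row index.
set.seed(1)
exprMat <- matrix(rnorm(40, sd = rep(1:8, times = 5)), nrow = 8,
                  dimnames = list(paste0("gene", 1:8), paste0("s", 1:5)))

# Per-gene variance across samples.
geneVar <- apply(exprMat, 1, var)

# (a) Percentile window, analogous to minVarPercentile = .75, maxVarPercentile = 1.
#     Inclusive bounds are an assumption of this sketch.
bounds <- quantile(geneVar, probs = c(0.75, 1))
keepPct <- geneVar >= bounds[1] & geneVar <= bounds[2]

# (b) Absolute variance floor, analogous to minVar (0.5 is a made-up threshold).
keepMin <- geneVar >= 0.5

# (c) Top-N genes by variance, analogous to numTopVarGenes = 3.
topN <- names(sort(geneVar, decreasing = TRUE))[1:3]

filteredByPct <- exprMat[keepPct, , drop = FALSE]
```

With 8 genes and a window of .75 to 1, the percentile filter keeps the 2 genes with the largest variance, i.e. the top 25%.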

Details

This function performs several post-processing tasks: filtering out genes and samples with high NA rates, imputing missing values, collapsing duplicated features/genes to make a unique feature list, removing any samples for which there is already a sample with the same patient ID (duplicated samples), and filtering genes by variance. This function is a wrapper for the functions filterAndImputeSamples(), collapseDupProbes(), removeDuplicatedPatients(), and filterGenesByVariance(). It is run after initial dataset normalization, such as quantile normalization on microarray datasets.
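The steps before the variance filter can be illustrated conceptually in base R. The sketch below does not call the package's functions, whose full signatures are not shown here. It swaps in row-mean imputation for the k-nearest-neighbour imputation (impute.knn) that the package reports using, assumes duplicated features are collapsed by averaging, and assumes the first sample per patient is the one kept. The 0.5 NA-rate cutoff, the gene keys, and the patient IDs are hypothetical.

```r
# Toy data (hypothetical): 4 probes x 4 samples, with a probe-to-gene key.
exprMat <- matrix(c( 1,  2, NA, 4,
                    NA, NA, NA, 1,
                     3,  3,  5, 5,
                     2,  4,  6, 8),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, paste0("s", 1:4)))
geneSymbol <- c("A", "B", "A", "C")        # gene "A" has two probes
patientID  <- c("p1", "p2", "p2", "p3")    # patient "p2" has two samples

# 1. Drop features whose NA fraction exceeds an assumed cutoff of 0.5.
keep <- rowMeans(is.na(exprMat)) <= 0.5
exprMat <- exprMat[keep, , drop = FALSE]
geneSymbol <- geneSymbol[keep]

# 2. Impute remaining NAs. Row-mean imputation is a simple stand-in here,
#    not the k-nearest-neighbour method used by the package.
rowMeansObs <- rowMeans(exprMat, na.rm = TRUE)
naIdx <- which(is.na(exprMat), arr.ind = TRUE)
exprMat[naIdx] <- rowMeansObs[naIdx[, "row"]]

# 3. Collapse duplicated genes by averaging their probes (assumed strategy).
counts <- table(geneSymbol)
collapsed <- rowsum(exprMat, group = geneSymbol)
collapsed <- collapsed / as.vector(counts[rownames(collapsed)])

# 4. Keep one sample per patient (here: the first one encountered).
dedup <- collapsed[, !duplicated(patientID), drop = FALSE]
```

Imputing before collapsing matters: averaging probes that still contain an NA would propagate the NA into the collapsed gene, which is what the collapseDupProbes() warning in the Results below cautions about.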

Value

A post-processed S4 ExpressionSet. Tests are run to confirm that the final S4 object is a valid ExpressionSet object before it is returned.
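A caller can repeat a similar validity check on any ExpressionSet. The sketch below builds a small hypothetical ExpressionSet with Biobase and checks it with validObject() from the methods package, which raises an error if the object is invalid. Whether processExpressionSet runs exactly these checks internally is an assumption.

```r
library(Biobase)
library(methods)

# Hypothetical toy matrix: 4 genes x 3 samples.
set.seed(1)
m <- matrix(rnorm(12), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Wrap the matrix in an ExpressionSet; featureData and phenoData are left empty here.
eset <- ExpressionSet(assayData = m)

# Confirm the class, then run the S4 validity method (errors if invalid).
isExprSet <- is(eset, "ExpressionSet")
validObject(eset)

# Dimensions of the assay data: features x samples.
assayDim <- dim(exprs(eset))
```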

Author(s)

Katie Planey <katie.planey@gmail.com>

Examples

# Load the package and its datasets.
library(curatedBreastData)
data(curatedBreastDataExprSetList)

# Process a single dataset as an example: GSE9893.
# This dataset has NA values, so the impute.knn progress
# is printed to the screen.
# Keep only genes whose variance falls between the
# .75 and 1 percentiles (i.e. the top 25% of genes by variance).

post_procExprSet <- processExpressionSet(
  exprSet = curatedBreastDataExprSetList[[5]],
  outputFileDirectory = "./",
  minVarPercentile = .75, maxVarPercentile = 1)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, append,
    as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
    get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
    rbind, rownames, sapply, setdiff, sort, table, tapply, union,
    unique, unsplit

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: BiocStyle
> ### Name: processExpressionSet
> ### Title: Post-process a normalized assayData in an ExpressionSet object
> ### Aliases: processExpressionSet
> 
> ### ** Examples
> 
> #load up our datasets
> data(curatedBreastDataExprSetList);
> 
> #just perform on one dataset as an example, GSE9893. 
> #This dataset does have NA values, so
> #you'll see the impute.knn progress printed to the screen.
> #also take only genes that fall in 
> #the variance percentiles between .75 and 1 
> #(i.e. the top 25% of genes by variance.)
> 
> post_procExprSet <- processExpressionSet(exprSet=
+ curatedBreastDataExprSetList[[5]], 
+ outputFileDirectory = "./",
+ minVarPercentile=.75, maxVarPercentile = 1)
Cluster size 22898 broken into 7597 15301 
Cluster size 7597 broken into 1303 6294 
Done cluster 1303 
Cluster size 6294 broken into 2904 3390 
Cluster size 2904 broken into 1773 1131 
Cluster size 1773 broken into 1372 401 
Done cluster 1372 
Done cluster 401 
Done cluster 1773 
Done cluster 1131 
Done cluster 2904 
Cluster size 3390 broken into 1993 1397 
Cluster size 1993 broken into 977 1016 
Done cluster 977 
Done cluster 1016 
Done cluster 1993 
Done cluster 1397 
Done cluster 3390 
Done cluster 6294 
Done cluster 7597 
Cluster size 15301 broken into 6374 8927 
Cluster size 6374 broken into 4092 2282 
Cluster size 4092 broken into 1948 2144 
Cluster size 1948 broken into 951 997 
Done cluster 951 
Done cluster 997 
Done cluster 1948 
Cluster size 2144 broken into 979 1165 
Done cluster 979 
Done cluster 1165 
Done cluster 2144 
Done cluster 4092 
Cluster size 2282 broken into 613 1669 
Done cluster 613 
Cluster size 1669 broken into 780 889 
Done cluster 780 
Done cluster 889 
Done cluster 1669 
Done cluster 2282 
Done cluster 6374 
Cluster size 8927 broken into 6446 2481 
Cluster size 6446 broken into 3049 3397 
Cluster size 3049 broken into 1504 1545 
Cluster size 1504 broken into 1080 424 
Done cluster 1080 
Done cluster 424 
Done cluster 1504 
Cluster size 1545 broken into 899 646 
Done cluster 899 
Done cluster 646 
Done cluster 1545 
Done cluster 3049 
Cluster size 3397 broken into 1179 2218 
Done cluster 1179 
Cluster size 2218 broken into 1764 454 
Cluster size 1764 broken into 1009 755 
Done cluster 1009 
Done cluster 755 
Done cluster 1764 
Done cluster 454 
Done cluster 2218 
Done cluster 3397 
Done cluster 6446 
Cluster size 2481 broken into 1695 786 
Cluster size 1695 broken into 960 735 
Done cluster 960 
Done cluster 735 
Done cluster 1695 
Done cluster 786 
Done cluster 2481 
Done cluster 8927 
Done cluster 15301 

Starting with  155patients.
found no multiple samples from the same patient(s)
Warning messages:
1: In filterAndImputeSamples(study, studyName = "study", outputFile = paste0(outputFileDirectory,  :
  
Just a warning: this function assumes your missing values
  are proper NAs, not "null",etc.

2: In collapseDupProbes(expr = exprSet@assayData$exprs, sampleColNames = colnames(exprSet@assayData$exprs),  :
  It's best to impute NA values before running this function
otherwise it may set averages to NA if there is 1 NA present.
This function just removes any genes whose key is NA.
3: In collapseDupProbes(expr = exprSet@assayData$exprs, sampleColNames = colnames(exprSet@assayData$exprs),  :
  
You may get a warning here because key names are duplicated 
  so it can't use them as row names. That's OK.
