processExpressionSet
R Documentation
Post-process a normalized assayData in an ExpressionSet object
Description
This function post-processes normalized assayData in an ExpressionSet object. It is assumed that the assay data is already baseline normalized (for microarray data, for example, this could mean quantile normalized and then log-transformed).
Arguments
exprSet
ExpressionSet S4 object with expression (assay) data, featureData and phenoData.
outputFileDirectory
Output file directory for messages that print status of post-processing the ExpressionSet.
minVarPercentile
Minimum variance percentile. Must be provided in conjunction with maxVarPercentile to use percentiles to threshold genes.
maxVarPercentile
Maximum variance percentile. Default is 1, i.e. the 100th percentile. Must be provided in conjunction with minVarPercentile to use percentiles to threshold genes.
minVar
If minVar is provided, as opposed to minVarPercentile and maxVarPercentile, genes whose variance falls below this magnitude are removed. This is helpful before running certain algorithms, such as the popular ComBat batch normalization technique, that can throw errors if genes with extremely low variances are in the data matrix. May be used in conjunction with maxVar or in isolation.
numTopVarGenes
A numeric value giving the number of genes (features) to select; only this many of the highest-variance genes are kept.
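The three variance-based options described above (percentile bounds, an absolute minimum variance, and a top-N cutoff) can be illustrated with a small base R sketch. This is not the package's implementation: the toy matrix, the variable names, and the choice of inclusive bounds are all assumptions made for illustration, and filterGenesByVariance() may treat boundaries differently.

```r
# Toy expression matrix: 8 genes (rows) x 5 samples (columns).
# Row i is i * c(-2, -1, 0, 1, 2), so its variance is 2.5 * i^2.
exprMat <- outer(1:8, c(-2, -1, 0, 1, 2))
dimnames(exprMat) <- list(paste0("gene", 1:8), paste0("sample", 1:5))

# Per-gene variance across samples.
geneVars <- apply(exprMat, 1, var)

# (a) Percentile thresholds, analogous to minVarPercentile / maxVarPercentile.
#     Inclusive bounds are an assumption of this sketch.
bounds <- quantile(geneVars, probs = c(0.75, 1))
byPercentile <- exprMat[geneVars >= bounds[1] & geneVars <= bounds[2], ,
                        drop = FALSE]

# (b) Absolute variance floor, analogous to minVar (hypothetical value).
minVar <- 20
byMinVar <- exprMat[geneVars >= minVar, , drop = FALSE]

# (c) Keep the N most variable genes, analogous to numTopVarGenes.
numTopVarGenes <- 3
topIdx <- order(geneVars, decreasing = TRUE)[seq_len(numTopVarGenes)]
byTopN <- exprMat[topIdx, , drop = FALSE]

print(rownames(byPercentile))
print(rownames(byMinVar))
print(rownames(byTopN))
```

Using drop = FALSE keeps the result a matrix even if only one gene survives a filter, which matters when the filtered data is passed to later steps that expect two dimensions.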
Details
This function performs several post-processing tasks: filtering out genes and samples with high NA rates, imputing missing values, collapsing duplicated features/genes to make a unique feature list, removing any samples for which there is already a sample with the same patient ID (duplicated samples), and filtering genes by variance. This function is a wrapper for the functions filterAndImputeSamples(), collapseDupProbes(), removeDuplicatedPatients(), and filterGenesByVariance(). It is run after initial dataset normalization, such as quantile normalization on microarray datasets.
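The sequence of steps described above can be sketched on a toy matrix in base R. This is only an illustration under stated assumptions, not the code of the wrapped functions: the NA-rate cutoff, the data, and all variable names are hypothetical, and row-mean imputation is used here as a simple stand-in for the k-nearest-neighbour imputation (impute.knn) that the package reports using.

```r
# Toy data: 5 probes x 4 samples, with missing values.
exprMat <- matrix(
  c( 1,  2,  3,  4,
    NA, NA, NA,  8,
     2,  4, NA,  8,
     5,  5,  5,  5,
     3,  6,  9, 12),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:4))
)
geneSymbols <- c("TP53", "BRCA1", "TP53", "ESR1", "ESR1")  # probe-to-gene keys
patientIds  <- c("p1", "p2", "p2", "p3")                    # sample-to-patient IDs

# 1. Drop probes whose NA rate exceeds a hypothetical cutoff.
maxNaRate <- 0.5
keep <- rowMeans(is.na(exprMat)) <= maxNaRate
exprMat <- exprMat[keep, , drop = FALSE]
geneSymbols <- geneSymbols[keep]

# 2. Impute remaining NAs with the probe's row mean
#    (a stand-in for impute.knn, chosen to keep the sketch dependency-free).
naIdx <- which(is.na(exprMat), arr.ind = TRUE)
exprMat[naIdx] <- rowMeans(exprMat, na.rm = TRUE)[naIdx[, "row"]]

# 3. Collapse probes that map to the same gene by averaging them.
sums <- rowsum(exprMat, group = geneSymbols)
collapsed <- sums / as.vector(table(geneSymbols)[rownames(sums)])

# 4. Keep only the first sample seen for each patient ID.
collapsed <- collapsed[, !duplicated(patientIds), drop = FALSE]

print(collapsed)
```

Variance filtering would follow as a final step. Imputing before collapsing matters because averaging probes that still contain an NA would propagate the NA into the collapsed gene value, which is the situation the collapseDupProbes() warning in the transcript refers to.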
Value
A post-processed S4 ExpressionSet. Tests are run to confirm that the final S4 object is a valid ExpressionSet before it is returned.
Author(s)
Katie Planey <katie.planey@gmail.com>
Examples
# Load the package and its example datasets.
library(curatedBreastData)
data(curatedBreastDataExprSetList)

# Process a single dataset as an example: GSE9893.
# This dataset contains NA values, so impute.knn progress
# is printed to the screen.
# Keep only genes whose variance falls between the 0.75 and 1
# percentiles (i.e. the top 25% of genes by variance).
post_procExprSet <- processExpressionSet(
  exprSet = curatedBreastDataExprSetList[[5]],
  outputFileDirectory = "./",
  minVarPercentile = 0.75, maxVarPercentile = 1)
Results
> library(curatedBreastData)
Loading required package: ggplot2
Loading required package: impute
Loading required package: XML
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: BiocStyle
> ### Name: processExpressionSet
> ### Title: Post-process a normalized assayData in an ExpressionSet object
> ### Aliases: processExpressionSet
>
> ### ** Examples
>
> #load up our datasets
> data(curatedBreastDataExprSetList);
>
> #just perform on one dataset as an example, GSE9893.
> #This dataset does have NA values, so
> #you'll see the impute.knn progress printed to the screen.
> #also take only genes that fall in
> #the variance percentiles between .75 and 1
> #(i.e. top 75th percentile genes by variance.)
>
> post_procExprSet <- processExpressionSet(exprSet=
+ curatedBreastDataExprSetList[[5]],
+ outputFileDirectory = "./",
+ minVarPercentile=.75, maxVarPercentile = 1)
Cluster size 22898 broken into 7597 15301
Cluster size 7597 broken into 1303 6294
Done cluster 1303
Cluster size 6294 broken into 2904 3390
Cluster size 2904 broken into 1773 1131
Cluster size 1773 broken into 1372 401
Done cluster 1372
Done cluster 401
Done cluster 1773
Done cluster 1131
Done cluster 2904
Cluster size 3390 broken into 1993 1397
Cluster size 1993 broken into 977 1016
Done cluster 977
Done cluster 1016
Done cluster 1993
Done cluster 1397
Done cluster 3390
Done cluster 6294
Done cluster 7597
Cluster size 15301 broken into 6374 8927
Cluster size 6374 broken into 4092 2282
Cluster size 4092 broken into 1948 2144
Cluster size 1948 broken into 951 997
Done cluster 951
Done cluster 997
Done cluster 1948
Cluster size 2144 broken into 979 1165
Done cluster 979
Done cluster 1165
Done cluster 2144
Done cluster 4092
Cluster size 2282 broken into 613 1669
Done cluster 613
Cluster size 1669 broken into 780 889
Done cluster 780
Done cluster 889
Done cluster 1669
Done cluster 2282
Done cluster 6374
Cluster size 8927 broken into 6446 2481
Cluster size 6446 broken into 3049 3397
Cluster size 3049 broken into 1504 1545
Cluster size 1504 broken into 1080 424
Done cluster 1080
Done cluster 424
Done cluster 1504
Cluster size 1545 broken into 899 646
Done cluster 899
Done cluster 646
Done cluster 1545
Done cluster 3049
Cluster size 3397 broken into 1179 2218
Done cluster 1179
Cluster size 2218 broken into 1764 454
Cluster size 1764 broken into 1009 755
Done cluster 1009
Done cluster 755
Done cluster 1764
Done cluster 454
Done cluster 2218
Done cluster 3397
Done cluster 6446
Cluster size 2481 broken into 1695 786
Cluster size 1695 broken into 960 735
Done cluster 960
Done cluster 735
Done cluster 1695
Done cluster 786
Done cluster 2481
Done cluster 8927
Done cluster 15301
Starting with 155patients.
found no multiple samples from the same patient(s)
Warning messages:
1: In filterAndImputeSamples(study, studyName = "study", outputFile = paste0(outputFileDirectory, :
Just a warning: this function assumes your missing values
are proper NAs, not "null",etc.
2: In collapseDupProbes(expr = exprSet@assayData$exprs, sampleColNames = colnames(exprSet@assayData$exprs), :
It's best to impute NA values before running this function
otherwise it may set averages to NA if there is 1 NA present.
This function just removes any genes whose key is NA.
3: In collapseDupProbes(expr = exprSet@assayData$exprs, sampleColNames = colnames(exprSet@assayData$exprs), :
You may get a warning here because key names are duplicated
so it can't use them as row names. That's OK.