R: Phenotype prediction using microarray data: approach of the...
predictDSC
R Documentation
Phenotype prediction using microarray data: approach of the best overall team
in the IMPROVER Diagnostic Signature Challenge
Description
This function implements the classification pipeline of the best overall team (Team221)
in the IMPROVER Diagnostic Signature Challenge. The function ofers also eploring other
combinations of data preprocessing, feature selection and classifier types.
A data frame with two columns: files and group giving the names of the Affymetrix
.cel files (no full path) and their corresponding groups. Only two groups are allowed as well
as a third group called "Test". The samples corresponding to these will not be used in
training but will be used to normalize the training data with.
celfile.path
The location of the directory where the .cel files are located.
annotation
The names of a package that can be used to map the probesets to the ENTREZ
gene IDS in order to deal with duplicate probesets pre gene. E.g.hgu133plues2.db
preprocs
A character vector giving the names of the normalization methods to try.
Supported options are "rma","gcrma","mas5"
filters
A character vector giving the names of the methods to use to rank features.
Supported options are "mttest" for moderated t-test using limma package,"ttest" for regular t-test,
and "wilcox" for wilcoxon test.
classifiers
A character vector giving the names of the classifier types to use for
learning the relation between expression levels and phenotype. Supported options are
"LDA","kNN","svm".
FCT
A numeric value giving the fold change threshokd to be used to filter out non-relevant
features. Note, setting it to a too large value can produce an error as there need to be at least
NF probestes with a fold change larger than FCT in each fold of the cross-validation.
CVP
The number of cross-validation partitions to create (minimum is 2). Do use a CVP
value which ensures that at least two samples from the smalest group are kept for testing
at each fold. E.g. If you have 10 samples in the smalest of the 2 groups a CVP of 4 would be maximum.
NF
The maximum number of features that would make sense to consider using as predcitors
in the models. NF should be less than the number of training samples.
by
The size of the step when searching for the number of features to include. By default th esearch starts with the top 2 features, and
a number of "by" features are added up to NF.
NR
An integer number between 1 and Inf giving the number of times the cross-validation
should be repeated to ensure a robust solution to the question: how many features to use as predictors
in the model?.
Details
See cited documents for more details.
Value
A list object containing one item for each possible combination between the elements of preprocs,
filters, and classifiers. Each item of the list contains the following information:
predictions - a data frame with the predicted class membership belief value
(posterior probability) for each sample (row) and each class (column).
features - Names of the Affy probesets used as predictors by the model. A letter "F" is
added as suffix to the probeset names.
model - A fitted model object as produced by the lda, svm and kNN functions.
performanceTr - A matrix giving the number of features tested (NN) mean AUC over
all folds and repetitions (meanAUC), and the standard deviation of AUC values accross folds and
repeats of the cross-validation.
bestAUC - The value of mean AUC corresponding to the optimal number of features chosen.
Author(s)
Adi Laurentiu Tarca <atarca@med.wayne.edu>
References
Adi L. Tarca, Mario Lauria, Michael Unger, Erhan Bilal, Stephanie Boue, Kushal Kumar Dey,
Julia Hoeng, Heinz Koeppl, Florian Martin, Pablo Meyer, Preetam Nandy, Raquel Norel,
Manuel Peitsch, Jeremy J Rice, Roberto Romero, Gustavo Stolovitzky, Marja Talikka,
Yang Xiang, Christoph Zechner, and IMPROVER DSC Collaborators,
Strengths and limitations of microarray-based phenotype prediction: Lessons learned
from the IMPROVER Diagnostic Signature Challenge. Bioinformatics, submitted 2013.
Tarca AL, Than NG, Romero R, Methodological Approach from the Best Overall
Team in the IMPROVER Diagnostic Signature Challenge, Systems Biomedicine, submitted, 2013.
See Also
aggregateDSC
Examples
library(maPredictDSC)
library(LungCancerACvsSCCGEO)
data(LungCancerACvsSCCGEO)
anoLC
gsLC
table(anoLC$group)
#run a series of methods combinations
modlist=predictDSC(ano=anoLC,celfile.path=system.file("extdata/lungcancer",package="LungCancerACvsSCCGEO"),
annotation="hgu133plus2.db",
preprocs=c("rma"),filters=c("mttest","wilcox"),FCT=1.0,classifiers=c("LDA","kNN"),
CVP=2,NF=4, NR=1)
#rank combinations by the performance on training data (AUC)
trainingAUC=sort(unlist(lapply(modlist,"[[","best_AUC")),decreasing=TRUE)
trainingAUC
#optional step; since we know the class of the test samples, let's see how the
#methods combinations perform on the test data
perfTest=function(out){
perfDSC(pred=out$predictions,gs=gsLC)
}
testPerf=t(data.frame(lapply(modlist,perfTest)))
testPerf=testPerf[order(testPerf[,"AUC"],decreasing=TRUE),]
testPerf
#aggregate predictions from top 3 combinations of methods
best3=names(trainingAUC)[1:3]
aggpred=aggregateDSC(modlist[best3])
#test the aggregated model on the test data
perfDSC(aggpred,gsLC)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(maPredictDSC)
Loading required package: MASS
Loading required package: affy
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: limma
Attaching package: 'limma'
The following object is masked from 'package:BiocGenerics':
plotMA
Loading required package: gcrma
Loading required package: ROC
Loading required package: class
Loading required package: e1071
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2
Loading required package: hgu133plus2.db
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Attaching package: 'AnnotationDbi'
The following object is masked from 'package:MASS':
select
Loading required package: org.Hs.eg.db
Loading required package: ROCR
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:IRanges':
space
The following object is masked from 'package:S4Vectors':
space
The following object is masked from 'package:stats':
lowess
Loading required package: LungCancerACvsSCCGEO
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/maPredictDSC/predictDSC.Rd_%03d_medium.png", width=480, height=480)
> ### Name: predictDSC
> ### Title: Phenotype prediction using microarray data: approach of the best
> ### overall team in the IMPROVER Diagnostic Signature Challenge
> ### Aliases: predictDSC
> ### Keywords: parametric methods
>
> ### ** Examples
>
>
> library(maPredictDSC)
> library(LungCancerACvsSCCGEO)
> data(LungCancerACvsSCCGEO)
> anoLC
files group
1 GSM137916.CEL.gz AC
2 GSM258560.CEL.gz AC
3 GSM258579.CEL.gz AC
4 GSM258589.CEL.gz AC
5 GSM258598.CEL.gz AC
6 GSM353885.CEL.gz AC
7 GSM467021.CEL.gz AC
8 GSM152624.CEL.gz SCC
9 GSM258580.CEL.gz SCC
10 GSM277678.CEL.gz SCC
11 GSM466956.CEL.gz SCC
12 GSM466980.CEL.gz SCC
13 GSM466989.CEL.gz SCC
14 GSM467023.CEL.gz SCC
15 GSM46976.CEL.gz SCC
16 lung_100.CEL Test
17 lung_107.CEL Test
18 lung_111.CEL Test
19 lung_15.CEL Test
20 lung_150.CEL Test
21 lung_29.CEL Test
22 lung_30.CEL Test
23 lung_35.CEL Test
24 lung_40.CEL Test
25 lung_41.CEL Test
26 lung_50.CEL Test
27 lung_51.CEL Test
28 lung_59.CEL Test
29 lung_62.CEL Test
30 lung_8.CEL Test
> gsLC
SCC AC
lung_15.CEL 1 0
lung_29.CEL 0 1
lung_30.CEL 0 1
lung_59.CEL 0 1
lung_51.CEL 1 0
lung_100.CEL 1 0
lung_62.CEL 1 0
lung_8.CEL 1 0
lung_41.CEL 0 1
lung_40.CEL 1 0
lung_150.CEL 1 0
lung_111.CEL 1 0
lung_35.CEL 0 1
lung_107.CEL 1 0
lung_50.CEL 0 1
> table(anoLC$group)
AC SCC Test
7 8 15
>
> #run a series of methods combinations
> modlist=predictDSC(ano=anoLC,celfile.path=system.file("extdata/lungcancer",package="LungCancerACvsSCCGEO"),
+ annotation="hgu133plus2.db",
+ preprocs=c("rma"),filters=c("mttest","wilcox"),FCT=1.0,classifiers=c("LDA","kNN"),
+ CVP=2,NF=4, NR=1)
Background correcting
Normalizing
Calculating Expression
Getting probe level data...
Computing p-values
Making P/M/A Calls
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
rma_mttest_LDA
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
rma_wilcox_LDA
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
rma_mttest_kNN
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
rma_wilcox_kNN
There were 50 or more warnings (use warnings() to see the first 50)
>
>
> #rank combinations by the performance on training data (AUC)
> trainingAUC=sort(unlist(lapply(modlist,"[[","best_AUC")),decreasing=TRUE)
> trainingAUC
rma_mttest_LDA rma_mttest_kNN rma_wilcox_LDA rma_wilcox_kNN
0.9843750 0.9375000 0.5833333 0.5208333
>
>
> #optional step; since we know the class of the test samples, let's see how the
> #methods combinations perform on the test data
>
> perfTest=function(out){
+ perfDSC(pred=out$predictions,gs=gsLC)
+ }
> testPerf=t(data.frame(lapply(modlist,perfTest)))
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
> testPerf=testPerf[order(testPerf[,"AUC"],decreasing=TRUE),]
> testPerf
BCM CCEM AUPR AUC
rma_wilcox_kNN 0.9444444 0.9333333 0.9777778 0.9444444
rma_mttest_LDA 0.8888778 0.8666600 0.9555556 0.8888889
rma_wilcox_LDA 0.7719000 0.7305033 0.8803474 0.8888889
rma_mttest_kNN 0.8888889 0.8666667 0.9555556 0.8888889
>
> #aggregate predictions from top 3 combinations of methods
> best3=names(trainingAUC)[1:3]
> aggpred=aggregateDSC(modlist[best3])
> #test the aggregated model on the test data
> perfDSC(aggpred,gsLC)
NA in cutpts forces recomputation using smallest gap
NA in cutpts forces recomputation using smallest gap
BCM CCEM AUPR AUC
0.8498889 0.8434989 1.0000000 1.0000000
>
>
>
>
>
>
>
>
> dev.off()
null device
1
>