R: Cross Validation for Iterative Bayesian Model Averaging
crossVal
R Documentation
Cross Validation for Iterative Bayesian Model Averaging
Description
This function performs k runs of n-fold cross validation on a
training dataset for survival analysis on microarray data, where
k and n are specified by the user.
Data matrix for the training set where columns are variables
and rows are observations. In the case of gene expression data,
the columns (variables) represent genes, while the rows
(observations) represent samples. The data is not assumed
to be pre-sorted by rank.
survTime
Vector of survival times for the patient samples in the
training set. Survival times are assumed to be presented
in uniform format (e.g., months or days), and the length
of this vector should be equal to the number of rows in
exset.
censor
Vector of censor data for the patient samples in the
training set. In general, 0 = censored and 1 = uncensored.
The length of this vector should equal the number of rows
in exset and the number of elements in survTime.
diseaseType
String denoting the type of disease in the training dataset
(used for writing to file). Default is 'cancer'.
nbest
A number specifying the number of models of each size
returned to bic.surv in the BMA package.
The default is 10.
maxNvar
A number indicating the maximum number of variables used in
each iteration of bic.surv from the BMA package.
The default is 25.
p
A number indicating the maximum number of top univariate genes
used in the iterative bic.surv algorithm. This number is
assumed to be less than the total number of genes in the training data.
A larger p usually requires longer computational time as more iterations
of the bic.surv algorithm are potentially applied. The default is
100.
cutPoint
Threshold percent for separating high- from low-risk groups.
The default is 50.
verbose
A boolean variable indicating whether or not to print interim
information to the console. The default is FALSE.
noFolds
A number specifying the desired number of folds in each cross
validation run. The default is 10.
noRuns
A number specifying the desired number of cross validation runs.
The default is 10.
Details
This function performs k runs of n-fold cross validation, where k and n
are specified by the user through the noRuns and noFolds
arguments respectively. For each run of cross validation, the training
set, survival times, and censor data are re-ordered according to a
random permutation. For each fold of cross validation, 1/nth of the data
is set aside to act as the validation set. In each fold, the
iterateBMAsurv.train.predict.assess function is called in order
to carry out a complete run of survival analysis. This means the
univariate ranking measure for this cross validation function is Cox
Proportional Hazards Regression; see iterateBMAsurv.train.wrapper
to experiment with alternate univariate ranking methods. With each run
of cross validation, the survival analysis statistics are saved and
written to file.
Value
The output of this function is a series of files giving information on
cross validation results. The file beginning with 'foldresults' contains
information for every fold in the form of a 2 x 2 table indicating the
number of test samples in each category (high-risk or censored,
high-risk or uncensored, low-risk or censored, low-risk or uncensored). This file
also gives the accumulated percentage of uncensored statistic from each run. The file
beginning with 'runresults' gives the total number of test samples assigned
to each category along with percentage uncensored across the entire run. The end of
this file contains this same information, averaged across all runs. The
file beginning with 'stats' gives the statistics from each fold, including
the p-value, chi-square statistic, and variance matrix. Finally, the file
beginning with 'avg_p_value_chi_square' gives the overall means and standard
deviations of the p-values and chi-square statistics across all runs and all
folds.
Note
The BMA package is required. Also, smaller training sets may lead to
cross validation folds where all test samples are assigned to one risk group
or all samples are in the same censor category (all samples are either censored
or uncensored). In this case, the fold is skipped, and cross validation proceeds
from the next fold. This particular error will be evidenced by a missing fold
result in the output files. All averages will be calculated as if this fold had
never occurred.
References
Annest, A., Yeung, K.Y., Bumgarner, R.E., and Raftery, A.E. (2008).
Iterative Bayesian Model Averaging for Survival Analysis.
Manuscript in Progress.
Raftery, A.E. (1995).
Bayesian model selection in social research (with Discussion). Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.: Blackwells.
Volinsky, C., Madigan, D., Raftery, A., and Kronmal, R. (1997)
Bayesian Model Averaging in Proprtional Hazard Models: Assessing the Risk of a Stroke.
Applied Statistics 46: 433-448.
Yeung, K.Y., Bumgarner, R.E. and Raftery, A.E. (2005)
Bayesian Model Averaging: Development of an improved multi-class, gene selection and classification tool for microarray data.
Bioinformatics 21: 2394-2402.
library (BMA)
library(iterativeBMAsurv)
data(trainData)
data(trainSurv)
data(trainCens)
## Perform 1 run of 2-fold cross validation on the training set, using p=10 genes and nbest=5 for fast computation
cv <- crossVal (exset=trainData, survTime=trainSurv, censor=trainCens, diseaseType="DLBCL", noRuns=1, noFolds=2, p=10, nbest=5)
## Upon completion of this function, all relevant output files will be in the working R directory.
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(iterativeBMAsurv)
Loading required package: BMA
Loading required package: survival
Loading required package: leaps
Loading required package: robustbase
Attaching package: 'robustbase'
The following object is masked from 'package:survival':
heart
Loading required package: inline
Loading required package: rrcov
Scalable Robust Estimators with High Breakdown Point (version 1.3-11)
Loading required package: splines
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/iterativeBMAsurv/crossVal.Rd_%03d_medium.png", width=480, height=480)
> ### Name: crossVal
> ### Title: Cross Validation for Iterative Bayesian Model Averaging
> ### Aliases: crossVal
> ### Keywords: survival
>
> ### ** Examples
>
> library (BMA)
> library(iterativeBMAsurv)
> data(trainData)
> data(trainSurv)
> data(trainCens)
>
> ## Perform 1 run of 2-fold cross validation on the training set, using p=10 genes and nbest=5 for fast computation
> cv <- crossVal (exset=trainData, survTime=trainSurv, censor=trainCens, diseaseType="DLBCL", noRuns=1, noFolds=2, p=10, nbest=5)
******************** BEGINNING CV RUN 1 ********************
---------- BEGINNING FOLD 1 ----------
1: Explored up to variable # 10
Iterate bic.surv is done!
Selected genes:
[1] "X19373" "X29258" "X27585" "X16817" "X17280" "X17273" "X27766" "X28003"
[9] "X30209" "X27587"
Posterior probabilities of selected genes:
[1] 68.4 22.7 24.8 21.8 12.9 0.0 36.0 39.6 42.7 15.2
# selected genes = 9
# selected models = 30
Risk Table:
cens.vec.test
0 1
High Risk 7 11
Low Risk 8 6
Call:
survdiff(formula = mySurv.obj ~ unlist(risk.groups))
N Observed Expected (O-E)^2/E (O-E)^2/V
unlist(risk.groups)=High Risk 18 11 9.09 0.403 0.883
unlist(risk.groups)=Low Risk 14 6 7.91 0.462 0.883
Chisq= 0.9 on 1 degrees of freedom, p= 0.347
[,1] [,2]
[1,] 4.143528 -4.143528
[2,] -4.143528 4.143528
[1] 0.8831155
---------- BEGINNING FOLD 2 ----------
1: Explored up to variable # 10
Iterate bic.surv is done!
Selected genes:
[1] "X26512" "X26922" "X33310" "X24298" "X28197" "X16527" "X29650" "X16164"
[9] "X34729" "X24394"
Posterior probabilities of selected genes:
[1] 25.5 100.0 6.6 10.2 77.3 22.7 67.6 7.4 5.9 11.6
# selected genes = 10
# selected models = 17
Risk Table:
cens.vec.test
0 1
High Risk 5 11
Low Risk 10 7
Call:
survdiff(formula = mySurv.obj ~ unlist(risk.groups))
N Observed Expected (O-E)^2/E (O-E)^2/V
unlist(risk.groups)=High Risk 16 11 6.68 2.80 4.68
unlist(risk.groups)=Low Risk 17 7 11.32 1.65 4.68
Chisq= 4.7 on 1 degrees of freedom, p= 0.0305
[,1] [,2]
[1,] 3.993091 -3.993091
[2,] -3.993091 3.993091
[1] 4.678284
******************** END CV RUN 1 ********************
Overall results from this run:
Censored Uncensored Percent Uncensored
Low Risk 18 13 41.93548
High Risk 12 22 64.70588
Overall average result matrix over all runs:
Censored Uncensored Percent Uncensored
Low Risk 18 13 41.93548
High Risk 12 22 64.70588
Average p-value across all folds and all runs:
[1] 0.188948
Standard deviation of p-values across all folds and all runs:
[1] 0.2240139
Average chi-square value across all folds and all runs:
[1] 2.7807
Standard deviation of chi-square across all folds and all runs:
[1] 2.68359
There were 50 or more warnings (use warnings() to see the first 50)
>
> ## Upon completion of this function, all relevant output files will be in the working R directory.
>
>
>
>
>
>
> dev.off()
null device
1
>