R: Iterative Bayesian Model Averaging: training, prediction,...
iterateBMAsurv.train.predict.assess
R Documentation
Iterative Bayesian Model Averaging: training, prediction, assessment
Description
Survival analysis and variable selection on microarray data.
This is a multivariate technique to select a small number
of relevant variables (typically genes) to perform survival
analysis on microarray data. This function performs the training,
prediction, and assessment steps. The data is not assumed to be
pre-sorted by rank before this function is called.
Data matrix for the training set where columns are variables
and rows are observations. In the case of gene expression data,
the columns (variables) represent genes, while the rows
(observations) represent samples. If Cox Proportional Hazards
Regression is the desired univariate ranking measure, the data
does not need to be pre-sorted. To use a different univariate
ranking measure, see iterateBMAsurv.train.wrapper.
test.dat
Data matrix for the test set where columns are variables
and rows are observations. In the case of gene expression data,
the columns (variables) represent genes, while the rows
(observations) represent samples. The test set should contain
the same variables as the training set.
surv.time.train
Vector of survival times for the patient samples in the
training set. Survival times are assumed to be presented
in uniform format (e.g., months or days), and the length
of this vector should be equal to the number of rows in
train.dat.
surv.time.test
Vector of survival times for the patient samples in the
test set. Survival times are assumed to be presented
in uniform format (e.g., months or days), and the length
of this vector should be equal to the number of rows in
test.dat.
cens.vec.train
Vector of censor data for the patient samples in the
training set. In general, 0 = censored and 1 = uncensored.
The length of this vector should equal the number of rows
in train.dat and the number of elements in surv.time.train.
cens.vec.test
Vector of censor data for the patient samples in the
test set. In general, 0 = censored and 1 = uncensored.
The length of this vector should equal the number of rows
in train.dat and the number of elements in surv.time.test.
p
A number indicating the maximum number of top univariate genes
used in the iterative bic.surv algorithm. This number is
assumed to be less than the total number of genes in the training data.
A larger p usually requires longer computational time as more iterations
of the bic.surv algorithm are potentially applied. The default is
100.
nbest
A number specifying the number of models of each size
returned to bic.surv in the BMA package.
The default is 10.
maxNvar
A number indicating the maximum number of variables used in
each iteration of bic.surv from the BMA package.
The default is 25.
maxIter
A number indicating the maximum iterations of bic.surv.
The default is 200000.
thresProbne0
A number specifying the threshold for the posterior
probability that each variable (gene) is non-zero (in
percent). Variables (genes) with such posterior
probability less than this threshold are dropped in
the iterative application of bic.surv. The default
is 1 percent.
cutPoint
Threshold percent for separating high- from low-risk groups.
The default is 50.
verbose
A boolean variable indicating whether or not to print interim
information to the console. The default is FALSE.
suff.string
A string for writing to file.
Details
This function consists of the training phase, the prediction phase,
and the assessment phase. The training phase first orders all the
variables (genes) by a univariate measure called Cox Proportional
Hazards Regression, and then iteratively applies the bic.surv
algorithm from the BMA package. The prediction phase uses the
variables (genes) selected in the training phase to predict the risk
scores of the patient samples in the test set. In the assessment phase,
the risk scores are used to designate each test sample as either
high-risk or low-risk based on the user-designated cutPoint.
Prediction accuracy is measured by the p-value difference between
groups as calculated through the central chi-square distribution. In
addition, a Kaplan-Meier Survival Analysis Curve illustrating the
difference between risk groups is written to file in the working R
directory. If Cox Proportional Hazards Regression is the desired
univariate ranking algorithm, then calling this function with the
training and testing sets is all that is necessary for a complete
survival analysis run.
Value
If all samples are assigned to a single risk group or all samples are in
the same censor category, an error message is printed and a boolean variable
success is returned as FALSE.If both risk groups are present in the
patient test samples, a Kaplan-Meier Survival Analysis Curve is written to file,
and a list with 6 components is returned:
nvar
The number of variables selected by the last iteration of bic.surv.
nmodel
The number of models selected by the last iteration of bic.surv.
ypred
The predicted risk scores on the test samples.
result.table
A 2 x 2 table indicating the number of test samples in each
category (high-risk/censored, high-risk/uncensored,
low-risk/censored, low-risk/uncensored).
statistics
An object of class survdiff that contains the statistics
from survival analysis, including the variance matrix, chi-square
statistic, and p-value.
success
A boolean variable returned as TRUE if both risk groups are present
in the patient test samples.
Note
The BMA package is required.
References
Annest, A., Yeung, K.Y., Bumgarner, R.E., and Raftery, A.E. (2008).
Iterative Bayesian Model Averaging for Survival Analysis.
Manuscript in Progress.
Raftery, A.E. (1995).
Bayesian model selection in social research (with Discussion). Sociological Methodology 1995 (Peter V. Marsden, ed.), pp. 111-196, Cambridge, Mass.: Blackwells.
Volinsky, C., Madigan, D., Raftery, A., and Kronmal, R. (1997)
Bayesian Model Averaging in Proprtional Hazard Models: Assessing the Risk of a Stroke.
Applied Statistics 46: 433-448.
Yeung, K.Y., Bumgarner, R.E. and Raftery, A.E. (2005)
Bayesian Model Averaging: Development of an improved multi-class, gene selection and classification tool for microarray data.
Bioinformatics 21: 2394-2402.
library (BMA)
library(iterativeBMAsurv)
data(trainData)
data(trainSurv)
data(trainCens)
data(testData)
data(testSurv)
data(testCens)
## Use p=10 genes and nbest=5 for fast computation
ret.bma <- iterateBMAsurv.train.predict.assess (train.dat=trainData, test.dat=testData, surv.time.train=trainSurv, surv.time.test=testSurv, cens.vec.train=trainCens, cens.vec.test=testCens, p=10, nbest=5)
## Extract the statistics from this survival analysis run
number.genes <- ret.bma$nvar
number.models <- ret.bma$nmodel
evaluate.success <- ret.bma$statistics
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(iterativeBMAsurv)
Loading required package: BMA
Loading required package: survival
Loading required package: leaps
Loading required package: robustbase
Attaching package: 'robustbase'
The following object is masked from 'package:survival':
heart
Loading required package: inline
Loading required package: rrcov
Scalable Robust Estimators with High Breakdown Point (version 1.3-11)
Loading required package: splines
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/iterativeBMAsurv/iterateBMAsurv.train.predict.assess.Rd_%03d_medium.png", width=480, height=480)
> ### Name: iterateBMAsurv.train.predict.assess
> ### Title: Iterative Bayesian Model Averaging: training, prediction,
> ### assessment
> ### Aliases: iterateBMAsurv.train.predict.assess
> ### Keywords: multivariate survival
>
> ### ** Examples
>
> library (BMA)
> library(iterativeBMAsurv)
> data(trainData)
> data(trainSurv)
> data(trainCens)
> data(testData)
> data(testSurv)
> data(testCens)
>
> ## Use p=10 genes and nbest=5 for fast computation
> ret.bma <- iterateBMAsurv.train.predict.assess (train.dat=trainData, test.dat=testData, surv.time.train=trainSurv, surv.time.test=testSurv, cens.vec.train=trainCens, cens.vec.test=testCens, p=10, nbest=5)
1: Explored up to variable # 10
Iterate bic.surv is done!
Selected genes:
[1] "X33310" "X28197" "X19373" "X16527" "X27415" "X24394" "X28531" "X27585"
[9] "X27766" "X26940"
Posterior probabilities of selected genes:
[1] 100.0 18.9 16.9 0.0 19.6 0.0 47.0 12.6 0.0 4.3
# selected genes = 7
# selected models = 11
Risk Table:
cens.vec.test
0 1
High Risk 7 9
Low Risk 8 12
Call:
survdiff(formula = mySurv.obj ~ unlist(risk.groups))
N Observed Expected (O-E)^2/E (O-E)^2/V
unlist(risk.groups)=High Risk 16 9 8.21 0.0752 0.126
unlist(risk.groups)=Low Risk 20 12 12.79 0.0483 0.126
Chisq= 0.1 on 1 degrees of freedom, p= 0.722
[,1] [,2]
[1,] 4.891446 -4.891446
[2,] -4.891446 4.891446
[1] 0.1262768
There were 50 or more warnings (use warnings() to see the first 50)
>
> ## Extract the statistics from this survival analysis run
> number.genes <- ret.bma$nvar
> number.models <- ret.bma$nmodel
> evaluate.success <- ret.bma$statistics
>
>
>
>
>
>
> dev.off()
null device
1
>