Main end-user function for fitting a cross-validated Survival Bump Hunting (SBH) model.
Returns a cross-validated PRSP object, as generated by our Patient Recursive Survival Peeling or PRSP algorithm,
containing cross-validated estimates of end-points statistics of interest.
dataset
data.frame or numeric matrix of the input dataset containing the observed survival times and status indicator variables
in the first two columns, respectively, and all the covariates thereafter. If a data.frame is provided, it will be coerced
to a numeric matrix. Discrete (or nominal) covariates should be made (or re-arranged into) ordinal variables.
B
Positive integer scalar of the number of replications of the cross-validation procedure. Defaults to 10.
K
Integer giving the number of folds (partitions) into which the observations should be randomly split
for the cross-validation procedure. Setting K also specifies the type of cross-validation to be done:
K = 1 carries out no cross-validation.
K in {2,...,n-1} carries out K-fold cross-validation.
K = n carries out leave-one-out cross-validation.
A
Positive integer scalar of the number of permutations for the computation of cross-validated p-values. Defaults to 1000.
vs
logical scalar. Flag for optional variable (covariate) pre-selection.
Defaults to TRUE.
cpv
logical scalar. Flag for computation of permutation p-values.
Defaults to FALSE.
decimals
integer scalar. Number of user-specified significant decimals to output results.
Defaults to 2.
cvtype
Character vector describing the cross-validation technique in {"combined", "averaged", "none", NULL}.
If NULL, automatically reset to "none".
cvcriterion
Character vector describing the optimization criterion in {"lrt", "lhr", "cer", NULL}.
If NULL, automatically reset to "none".
arg
Character vector describing the PRSP parameters:
alpha = fraction to peel off at each step. Defaults to 0.05.
beta = minimum support size resulting from the peeling sequence. Defaults to 0.05.
minn = minimum number of observations that we want to be able to detect in a box. Defaults to 5.
L = fixed peeling length. Defaults to NULL.
peelcriterion in {"hr" for Log-Hazard Ratio (LHR),
"lr" for Log-Rank Test (LRT),
"ch" for Cumulative Hazard Summary (CHS)}.
Defaults to "lr".
Note that the parameters in arg come as a string of characters between double quotes,
where all parameter evaluations are separated by commas (see example).
probval
Numeric scalar of the survival probability at which we want to get the endpoint box survival time. Defaults to NULL.
timeval
Numeric scalar of the survival time at which we want to get the endpoint box survival probability. Defaults to NULL.
parallel
Logical. Is parallel computing to be performed? Optional. Defaults to FALSE.
conf
List of parameters for cluster configuration.
Inputs for the function makeCluster (R package parallel) for cluster setup.
Optional, defaults to NULL. See details for usage.
seed
Positive integer scalar of the user seed to reproduce the results.
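The arg argument described above packs several assignments into one string. As a rough illustration of how such a string can be turned into named values, here is a minimal sketch; the helper parse_arg is hypothetical and is not claimed to be how sbh processes arg internally.

```r
# Hypothetical helper (not part of PRIMsrc): split a comma-separated string of
# assignments and evaluate each piece in a private environment.
parse_arg <- function(arg) {
  pieces <- trimws(strsplit(arg, split = ",", fixed = TRUE)[[1]])
  env <- new.env()
  for (p in pieces) {
    eval(parse(text = p), envir = env)
  }
  as.list(env)
}

# Inner double quotes must be escaped so that the whole value stays one string.
params <- parse_arg("beta=0.05,
                     alpha=0.05,
                     minn=5,
                     L=NULL,
                     peelcriterion=\"lr\"")

params$alpha          # 0.05
params$peelcriterion  # "lr"
is.null(params$L)     # TRUE
```

The escaped inner quotes (\") matter: without them, the string literal ends early and the call does not parse.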
Details
At this point, the main function sbh performs the search of the first box of the recursive coverage (outer) loop of our
Patient Recursive Survival Peeling (PRSP) algorithm. It relies on an optional variable pre-selection procedure that is run before
the PRSP algorithm. At this point, this is done by Elastic-Net (EN) penalization of the partial likelihood, where both mixing (alpha)
and overall shrinkage (lambda) parameters are simultaneously estimated by cross-validation using
the glmnet::cv.glmnet function of the R package glmnet.
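To make the pre-selection idea concrete, the sketch below cross-validates an Elastic-Net Cox model over a small grid of mixing values, keeping the best lambda for each. It is an illustration under assumptions, not the PRIMsrc internals: the helper name preselect_en, the alpha grid, the fold assignment and the simulated data are all hypothetical.

```r
# Sketch only: joint cross-validation of the elastic-net mixing (alpha) and
# shrinkage (lambda) parameters of a penalized Cox model.
preselect_en <- function(x, time, status,
                         alphas = seq(0, 1, by = 0.25),
                         nfolds = 5, seed = 123) {
  stopifnot(requireNamespace("glmnet", quietly = TRUE))
  set.seed(seed)
  # Reuse one fold assignment for every alpha so the CV errors are comparable.
  foldid <- sample(rep(seq_len(nfolds), length.out = nrow(x)))
  y <- cbind(time = time, status = status)
  fits <- lapply(alphas, function(a) {
    glmnet::cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid)
  })
  # Smallest cross-validated partial-likelihood deviance reached for each alpha.
  cv_err <- vapply(fits, function(f) min(f$cvm), numeric(1))
  best <- which.min(cv_err)
  beta <- as.matrix(stats::coef(fits[[best]], s = "lambda.min"))
  list(alpha = alphas[best],
       lambda = fits[[best]]$lambda.min,
       selected = which(beta[, 1] != 0))
}

# Usage on simulated data (assumed here, not one of the package datasets).
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  x <- matrix(runif(n * 3), nrow = n, ncol = 3,
              dimnames = list(NULL, c("X1", "X2", "X3")))
  time <- rexp(n, rate = exp(2 * x[, 1]))
  status <- rbinom(n, size = 1, prob = 0.8)
  sel <- preselect_en(x, time, status)
  sel$alpha     # mixing value with the lowest CV error
  sel$selected  # indices of covariates with non-zero coefficients
}
```

Sharing the fold assignment across the alpha grid is a design choice of this sketch: it keeps the comparison between mixing values from being driven by different random splits.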
The returned S3-class PRSP object contains cross-validated estimates of all the decision-rules of pre-selected covariates and
all other statistical quantities of interest at each iteration of the peeling sequence (inner loop of the PRSP algorithm).
This enables the graphical display of results of profiling curves for model tuning, peeling trajectories, covariate traces and
survival distributions (see plotting functions for more details).
The function offers a number of options: the number of cross-validation replicates to be performed (B); the type of
cross-validation desired (K-fold (replicated) averaged or combined); the peeling and optimization criteria chosen
for model tuning; and a few more parameters for the PRSP algorithm.
In case replicated cross-validations are performed, a "summary" of the outputs is done over the B replicates,
which requires some explanation:
Even though the PRSP algorithm uses only one covariate at a time at each peeling step, the reported matrix of
"Replicated CV" box decision rules may show several covariates being used in a given step, simply because these decision rules
are averaged over the B replicates (see equation #21 in Dazard et al. 2015). This is also reflected in the reported
"Replicated CV" importance and usage plots of covariate traces.
Likewise, the output matrix of "Replicated CV" box membership indicator does not necessarily match exactly the output vector
of "Replicated CV" box support (and corresponding box sample size) for all peeling steps. The reason is that the reported
"Replicated CV" box membership indicators are computed (at each peeling step) as the point-wise majority vote over the B
replicates (see equation #22 in Dazard et al. 2015), whereas the "Replicated CV" box support vector (and corresponding box sample size)
is averaged (at each peeling step) over the B replicates.
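The two effects described above can be reproduced with a toy example. The numbers below are made up for illustration (B = 3 replicates, n = 4 observations, 3 covariates) and are not output of sbh.

```r
# Hypothetical box membership at one peeling step:
# rows = replicates, columns = observations.
member <- rbind(c(TRUE, TRUE,  FALSE, FALSE),
                c(TRUE, FALSE, TRUE,  FALSE),
                c(TRUE, FALSE, FALSE, TRUE))

# Box support averaged over the replicates (each replicate keeps 2 of 4).
avg_support <- mean(rowMeans(member))   # 0.5

# Point-wise majority vote over the replicates.
vote <- colMeans(member) > 0.5          # TRUE FALSE FALSE FALSE
vote_support <- mean(vote)              # 0.25

# Hypothetical lower bounds of the box after one peeling step:
# each replicate moves a single covariate away from 0.
lower <- rbind(rep1 = c(X1 = 0.05, X2 = 0.00, X3 = 0),
               rep2 = c(X1 = 0.00, X2 = 0.05, X3 = 0),
               rep3 = c(X1 = 0.05, X2 = 0.00, X3 = 0))
avg_rule <- colMeans(lower)
moved <- names(avg_rule)[avg_rule > 0]  # "X1" "X2"
```

Here the averaged support (0.5) differs from the support implied by the majority vote (0.25), and the averaged rule moves two covariates even though every replicate peeled only one.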
The function takes advantage of the R package parallel, which allows users to create a cluster of workstations on a local
and/or remote machine(s), enabling scaling-up with the number of CPU cores specified and efficient parallel execution.
If the computation of permutation p-values is desired, then running with the parallelization option is strongly advised
as it may take a while. In the case of large (p > n) or very large (p >> n) datasets, it is also required to use the
parallelization option.
To run a parallel session (and parallel RNG) of the PRIMsrc procedures (parallel=TRUE), argument conf
is to be specified (i.e. non NULL). It must list the specifications of the following parameters for cluster configuration:
"names", "cpus", "type", "homo", "verbose", "outfile". These match the arguments described in function makeCluster
of the R package parallel. All fields are required to properly configure the cluster, except that "names" is used only
for a cluster of type "SOCK" (socket), whereas "cpus" is used instead for a cluster of any other type.
See examples below.
"names": names : character vector specifying the host names on which to run the job.
It could default to a single local machine, in which case one may use the host name "localhost".
Each host name can be repeated up to the number of CPU cores available on the corresponding machine.
"cpus": spec : integer scalar specifying the total number of CPU cores to be used
across the network of available nodes, counting the worker nodes and the master node.
"type": type : character vector specifying the cluster type ("SOCK", "PVM", "MPI").
"homo": homogeneous : logical scalar to be set to FALSE for inhomogeneous clusters.
"verbose": verbose : logical scalar to be set to FALSE for quiet mode.
"outfile": outfile : character vector of the output log file name for the worker nodes.
Note that argument B is internally reset to conf$cpus*ceiling(B/conf$cpus) in case the
parallelization is used (i.e. conf is non NULL), where conf$cpus denotes the total number of CPUs to be
used (see above). The argument A is similarly reset.
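The reset rule above rounds B (and A) up to the next multiple of the number of CPUs. A small worked example, using a hypothetical helper name:

```r
# Round a replication count up to the next multiple of the number of CPUs,
# following the formula conf$cpus * ceiling(B / conf$cpus) stated above.
reset_to_multiple <- function(B, cpus) cpus * ceiling(B / cpus)

reset_to_multiple(10, 8)     # 16
reset_to_multiple(16, 8)     # 16
reset_to_multiple(1024, 8)   # 1024
reset_to_multiple(1000, 12)  # 1008
```

Values that are already a multiple of the number of CPUs (such as A = 1024 on 8 CPUs) are left unchanged.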
The actual creation of the cluster, its initialization, and closing are all done internally.
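Since the cluster itself is created internally, the user's only task is to supply a well-formed conf list. The helper below is a hypothetical convenience (not part of PRIMsrc) that assembles the six fields and checks the "names"/"cpus" requirement described above; defaulting "cpus" to the number of host names for a socket cluster is an assumption of this sketch.

```r
# Hypothetical helper: build a 'conf' list and check the required fields.
make_conf <- function(type = "SOCK", names = NULL, cpus = NULL,
                      homo = TRUE, verbose = FALSE, outfile = "") {
  type <- match.arg(type, c("SOCK", "PVM", "MPI"))
  if (type == "SOCK") {
    if (is.null(names)) stop("'names' is required for a cluster of type SOCK")
    # Assumption: one worker per listed host name.
    if (is.null(cpus)) cpus <- length(names)
  } else if (is.null(cpus)) {
    stop("'cpus' is required for a cluster of type other than SOCK")
  }
  list("names" = names, "cpus" = as.integer(cpus), "type" = type,
       "homo" = homo, "verbose" = verbose, "outfile" = outfile)
}

conf_sock <- make_conf("SOCK", names = rep("localhost", 2))
conf_mpi  <- make_conf("MPI", cpus = 16)
```

Failing early on a missing field is cheaper than discovering the problem after a long job has been submitted to the cluster.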
In addition, when random number generation is needed, the creation of separate streams of parallel RNG per node
is done internally by distributing the stream states to the nodes. For more details, see function makeCluster
(R package parallel) and/or http://www.stat.uiowa.edu/~luke/R/cluster/cluster.html.
Using a seed makes the results reproducible within the same type of session: the same seed will reproduce the same results
within a non-parallel session or within a parallel session, but it will not necessarily give the exact same results (up to sampling variability)
between a non-parallelized and a parallelized session, because the seed is managed differently in the two cases (see parallel RNG and
the value of the returned seed below).
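The following sketch illustrates the reproducibility statement with base R and the parallel package only. It assumes L'Ecuyer-CMRG streams of the kind handed out by parallel::nextRNGStream; the way PRIMsrc manages its streams internally may differ.

```r
old_kind <- RNGkind()[1]

# Non-parallel session: the same seed reproduces the same draws.
set.seed(123); a <- runif(3)
set.seed(123); b <- runif(3)
same_session <- identical(a, b)                  # TRUE

# Parallel-style session: one independent stream per worker.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
stream1 <- .Random.seed
stream2 <- parallel::nextRNGStream(stream1)

# Draw n uniforms starting from a given stream state.
draw_from <- function(state, n = 3) {
  assign(".Random.seed", state, envir = globalenv())
  runif(n)
}
w1 <- draw_from(stream1)
w2 <- draw_from(stream2)
same_stream <- identical(w1, draw_from(stream1)) # TRUE

# The same user seed does not give the same draws across the two session types.
across_sessions <- identical(a, w1)

# Restore the generator that was active before this example.
RNGkind(old_kind)
```

Each stream is reproducible on its own, but the draws obtained from a seeded stream are not those obtained from the default generator with the same seed, which is why results can differ between the two session types.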
Value
Object of class PRSP (Patient Recursive Survival Peeling)
List containing the following 19 fields:
x
numeric matrix of original dataset.
times
numeric vector of observed failure / survival times.
status
numeric vector of observed event indicator in {1,0}.
B
positive integer of the number of replications used in the cross-validation procedure.
K
positive integer of the number of folds used in the cross-validation procedure.
A
positive integer of the number of permutations used for the computation of permutation p-values.
vs
logical scalar of returned flag of optional variable pre-selection.
cpv
logical scalar of returned flag of optional computation of permutation p-values.
decimals
integer of the number of user-specified significant decimals.
cvtype
character vector of the cross-validation technique used.
cvcriterion
character vector of optimization criterion used.
arg
character vector of the parameters used.
probval
Numeric scalar of survival probability used.
timeval
Numeric scalar of survival time used.
cvfit
List with 10 fields of cross-validated estimates:
cv.maxsteps: numeric scalar of maximal number of peeling steps over the replicates.
cv.nsteps: numeric scalar of optimal number of peeling steps according to the optimization criterion.
cv.trace: numeric vector of the modal trace values of covariate usage for all peeling steps.
cv.boxind: logical matrix in {TRUE, FALSE} of individual observation box membership indicator (columns) for all peeling steps (rows).
cv.rules: data.frame of decision rules on the covariates (columns) for all peeling steps (rows).
cv.sign: numeric vector in {-1,+1} of directions of peeling for all pre-selected covariates.
cv.selected: numeric vector of pre-selected covariates in reference to original index.
cv.used: numeric vector of covariates used for peeling in reference to original index.
cv.stats: numeric matrix of box endpoint quantities of interest (columns) for all peeling steps (rows).
cv.pval: numeric vector of log-rank permutation p-values of separation of survival distributions.
cvprofiles
List (of size B) of numeric vectors, one for each replicate,
of the cross-validated statistics used in the optimization criterion (set by user) as a function of the number of peeling steps.
cvmeanprofiles
List of numeric vectors of the cross-validated mean statistics over the replicates,
used in the optimization criterion (set by user) as a function of the number of peeling steps.
plot
logical scalar of the returned flag for plotting or not the results of the fitted SBH model.
config
List with 7 fields of parameters used for configuring the parallelization including parallel and conf.
seed
User seed(s) used:
integer of a single value, if parallelization is used
integer vector of values, one for each replication, if parallelization is not used.
Note
Unique end-user function for fitting the Survival Bump Hunting model.
Acknowledgments: This project was partially funded by the National Institutes of Health
NIH - National Cancer Institute (R01-CA160593) to J-E. Dazard and J.S. Rao.
References
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015).
"Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods."
Statistical Analysis and Data Mining (in press).
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014).
"Cross-Validation of Survival Bump Hunting by Recursive Peeling Methods."
In JSM Proceedings, Survival Methods for Risk Estimation/Prediction Section. Boston, MA, USA.
American Statistical Association IMS - JSM, p. 3366-3380.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015).
"R package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression and Classification."
In JSM Proceedings, Statistical Programmers and Analysts Section. Seattle, WA, USA.
American Statistical Association IMS - JSM, (in press).
Dazard J-E. and J.S. Rao (2010).
"Local Sparse Bump Hunting."
J. Comp Graph. Statistics, 19(4):900-92.
See Also
makeCluster (R package parallel)
cv.glmnet (R package glmnet)
glmnet (R package glmnet)
Examples
#===================================================
# Loading the library and its dependencies
#===================================================
library("PRIMsrc")
#===================================================
# Package news
# Package citation
#===================================================
PRIMsrc.news()
citation("PRIMsrc")
#===================================================
# Demo with a synthetic dataset
# Use help for descriptions
#===================================================
data("Synthetic.1", package="PRIMsrc")
?Synthetic.1
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Non Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# Without parallelization
# Without computation of permutation p-values
#===================================================
CVCOMB.synt1 <- sbh(dataset = Synthetic.1,
                    cvtype = "combined", cvcriterion = "lrt",
                    B = 1, K = 5,
                    vs = TRUE, cpv = FALSE,
                    decimals = 2, probval = 0.5,
                    arg = "beta=0.05,
                           alpha=0.05,
                           minn=5,
                           L=NULL,
                           peelcriterion=\"lr\"",
                    parallel = FALSE, conf = NULL, seed = 123)
## Not run:
#===================================================
# Examples of parallel backend parametrization
#===================================================
# Example #1 - 1-Quad (4-core double threaded) PC
# Running WINDOWS
# With SOCKET communication
#===================================================
if (.Platform$OS.type == "windows") {
    cpus <- detectCores()
    conf <- list("names" = rep("localhost", cpus),
                 "cpus" = cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Example #2 - 1 master node + 3 worker nodes cluster
# All nodes equipped with identical setups and multicores
# Running LINUX
# With SOCKET communication
#===================================================
if (.Platform$OS.type == "unix") {
    masterhost <- Sys.getenv("HOSTNAME")
    slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
    nodes <- length(slavehosts) + 1
    cpus <- 8
    conf <- list("names" = c(rep(masterhost, cpus),
                             rep(slavehosts, cpus)),
                 "cpus" = nodes * cpus,
                 "type" = "SOCK",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Example #3 - Multinode multicore per node cluster
# Running LINUX
# with MPI communication
# Here, a file named ".nodes" (e.g. in the home directory)
# contains the list of nodes of the cluster
#===================================================
if (.Platform$OS.type == "unix") {
    hosts <- scan(file = paste(Sys.getenv("HOME"), "/.nodes", sep = ""),
                  what = "",
                  sep = "\n")
    hostnames <- unique(hosts)
    nodes <- length(hostnames)
    cpus <- length(hosts) / length(hostnames)
    conf <- list("cpus" = nodes * cpus,
                 "type" = "MPI",
                 "homo" = TRUE,
                 "verbose" = TRUE,
                 "outfile" = "")
}
#===================================================
# Simulated dataset #1 (n=250, p=3)
# Replicated Combined Cross-Validation (RCCV)
# Peeling criterion = LRT
# Optimization criterion = LRT
# With parallelization
# With computation of permutation p-values
#===================================================
CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1,
                       cvtype = "combined", cvcriterion = "lrt",
                       B = 10, K = 5, A = 1024,
                       vs = TRUE, cpv = TRUE,
                       decimals = 2, probval = 0.5,
                       arg = "beta=0.05,
                              alpha=0.05,
                              minn=5,
                              L=NULL,
                              peelcriterion=\"lr\"",
                       parallel = TRUE, conf = conf, seed = 123)
## End(Not run)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(PRIMsrc)
Loading required package: parallel
Loading required package: survival
Loading required package: Hmisc
Loading required package: lattice
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-5
Loading required package: MASS
PRIMsrc 0.6.3
Type PRIMsrc.news() to see new features, changes, and bug fixes
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/PRIMsrc/sbh.Rd_%03d_medium.png", width=480, height=480)
> ### Name: sbh
> ### Title: Cross-Validated Survival Bump Hunting
> ### Aliases: sbh
> ### Keywords: Exploratory Survival/Risk Analysis Survival/Risk Estimation &
> ### Prediction Non-Parametric Method Cross-Validation Bump Hunting
> ### Rule-Induction Method
>
> ### ** Examples
>
> #===================================================
> # Loading the library and its dependencies
> #===================================================
> library("PRIMsrc")
>
> #===================================================
> # Package news
> # Package citation
> #===================================================
> PRIMsrc.news()
Package: PRIMsrc
---------------------------------------------------------------------------------
Date : 2015-01-20
o RELEASE 0.1.0
- Initial release to GitHub.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-01-22
o RELEASE 0.2.0
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-02-01
o RELEASE 0.3.0
- Minor updates in the manual, email and version number.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-02-27
o RELEASE 0.4.0
- Extension to high-dimensional p > n and p >> n cases by adding an
internal variable selection procedure using the Elasticnet-Regularized Cox Regression
function of the 'glmnet' package.
- Removed (temporarily) interactive option in sbh() in case no variables are selected by glmnet(...).
- Added dependency to glmnet package for initial variable selection.
- Added synthetic dataset #5 and example with p > n.
- Added real dataset #2 and example with p >> n.
- Added new ouputs 'selected' and 'used' in main function sbh(...) for variables effectively
selected and used for peeling.
- Removed returned values of box vertices that were redudant with the returned rules.
- Changed return value of variable traces: now also returns the matrix of traces by replication.
- Corrected superfluous codes in the parallelization section, before clusterCall(...) in sbh(...).
- Corrected number of replications in sbh(...) in case of parallelization.
- Corrected stepwise variable selection procedure in peel.box() to account for missing values.
- Corrected definition of the cross-validated box vertices (definition)
in the case of "combined CV" technique.
- Corrected generation of random seed when none is provided.
- Minor updates, bugs and code improvements in sbh(...) and internal peel.box(...) functions.
- Updated manual, version number.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-03-04
o RELEASE 0.5.0
- Change of package name and GitHub repository name from PrimSRC to PRIMsrc.
- Added CRAN/GitHub subfolder doc for PDF documentation files
(including manual and applied study abstract).
- Removed option for overlaying plots of multiple PRSP objects
in plot_boxtrace(...) and plot_boxtraj(...).
- Added argument "toplot" to choose which covariates should be plotted in
plot_boxtrace(...) and plot_boxtraj(...).
- Corrected handling of empty PRSP object (failed peeling) in all plotting functions.
- Implementation of plotting device now internal to all plotting functions.
- Removed internal functions from the manual, updated manual, version number.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-03-16
o RELEASE 0.5.3
- Added S3-generic 'summary' function.
- Added S3-generic 'predict' function.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-04-10
o RELEASE 0.5.5
- Removed argument 'discr' in the main function: no special rounding of discrete covariate
decision rules is done any longer.
- Made the internal variable selection procedure conditional on whether p <= n or not.
- Corrected treatment of missing values in case of replications for the variable traces.
- Corrected output of variable trace modal values.
- Corrected pre-selected variable output.
- Several minor bugs corrected.
- Built and tested under R 3.1.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-06-19
o RELEASE 0.5.6
- Correction/extension of internal variable pre-selection procedure by cross-validing
both parameters alpha (mixing) and lambda (shrinkage) of the 'glmnet' package.
This allows to get true lasso-ridge shrinkage estimates.
- Improved robustness in internal functions list2mat and list2array.
- Minor improvement in internal function cv.folds.
- Added vignettes
- Built and tested under R 3.0.2 and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-07-28
o RELEASE 0.5.7
- Compliance with new R CMD check, which now checks code usage via 'codetools'.
Functions and packages from default packages other than base which are used in the package
code are now imported via the package namespace file (NAMESPACE).
Added new field 'Imports' in the package description file (DESCRIPTION)
to match the functions and packages newly imported via NAMESPACE.
- Added Cumulative Hazard Summary statistic (derived from the Nelson-Aalen estimator)
as new peeling criterion option in the PRSP algorithm.
- Built and tested under R-devel (2015-07-20 r68705).
- Initial release to CRAN and update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-08-28
o RELEASE 0.5.8
- Removed pre-selection of variables (covariates) by regular Cox-regression
and made the remaining Elastic-Net pre-selection of variables optional by
passing an additional argument in the main function sbh().
- Main function sbh() now returns the parameters used for configuring the parallelization.
- Replaced real dataset #2 of breast cancer data with lung cancer data for reason of size.
- Added S3-generic 'print' function and updated S3-generic 'summary' function.
- Created a new internal subroutine cv.presel() for (optional) variable pre-selection.
- Changed main argument of plot functions from `x` to `object`.
- Minor corrections in the manual.
- Built and tested under R-devel (2015-08-02 r68804) and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-09-07
o RELEASE 0.5.9
- Replaced plotting function plot_scatter(...) by S3-generic `plot` function.
- Corrected all plotting functions for the case of a NULL graphical device.
- Cross-validated estimates of box endpoint quantities of interest now contains
sample size for all peeling steps.
- Minor updates and corrections in the outputs of S3-generic functions.
- Minor updates and corrections in the documentation file and manual.
- Built and tested under R-devel (2015-08-02 r68804) and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-09-15
o RELEASE 0.6.0
- The matrix of original dataset is now returned by the main function sbh()
and not the submatrix of pre-selected covariates only.
- Corrected bugs in the output of main function sbh():
. the returned vectors of `pre-selected` and `used` covariates are now in reference
to the original index of variables.
. the value of traces and rules are now matched accordingly.
. plot_boxtraj() and plot_boxtrace() are now corrected accordingly.
- The value of `object$cvfit$cv.trace` of the `PRSP` object that is returned
by the main function sbh() now only contains the vector of the modal trace values
of covariate usage at each step.
- Updated S3-generic 'summary' and 'print' functions.
- Minor updates and corrections in the documentation file and manual.
- Built and tested under R-devel (2015-09-14 r69384) and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-10-11
o RELEASE 0.6.2
- Rename example datasets #4 and #5 into #1b and #4, respectively,
for consistency with companion article.
- Added argument `decimals` to main function sbh() to output results in
user-specified significant decimals.
- Added examples for all S3-generic functions.
- Corrected output of decision rules in S3-generic `print` function in case `vs=TRUE`.
- Renamed results 'varsign`, `selected` and `used` to 'CV.sign`, `CV.selected` and `CV.used` and
moved them to `cvfit` field of return `PRSP` object.
- Minor improvement in output plot axes names of plot_boxtrace() function.
- Updates of corresponding modifications in the documentation file and manual.
- Built and tested under R-devel (2015-09-14 r69384) and release update to GitHub.
---------------------------------------------------------------------------------
Date : 2015-11-16
o RELEASE 0.6.3
- Changed random splitting in the cross-validation step to random stratified splitting
with/by conservation of events.
- Changed default values of metaparameters `alpha` to 0.05 (instead of 0.10)
`minn` to 5 (instead of 10).
- Modified computation of replicated cross-validated maximal peeling length in order to avoid
getting below the minimal box support threshold (i.e. the greater of `beta*n` or `minn`)
that could occur when combining results from the cross-validation loops and replicates.
- Corrected behaviors in case `n` is less than `minn` and `n` is equal to `minn`.
- Corrected minor errors in list2array() and list2mat() internal functions.
- Corrected minor errors in plot() and predict() S3-generic functions.
- Updates in the manual file, including added explanation about the outputs of
averaged covariate traces, box membership indicators and box decision rules.
- Updates in the CITATION file.
- Built and tested under R-devel (2015-11-04 r69597) and release update to GitHub.
---------------------------------------------------------------------------------
> citation("PRIMsrc")
To cite PRIMsrc in publications use:
Dazard J-E. and Rao J.S. (2010). Local Sparse Bump Hunting. J. Comp
Graph. Statistics, 19(4):900-92.
Diaz-Pachon D.A., Rao J.S. and Dazard J-E. (2013). Optimization of
PRIM under Normality. In SCo Proceedings, Complex Data Modeling and
Computationally Intensive Statistical Methods for Estimation and
Prediction. Milan, Italy.
Diaz-Pachon D.A., Rao J.S and Dazard J-E. (2015). On the Explanatory
Power of Principal Components. (submitted).
Diaz-Pachon D.A., Dazard J-E. and Rao J.S. (2015). Unsupervised Bump
Hunting Using Principal Components. (submitted).
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014).
Cross-Validation of Survival Bump Hunting by Recursive Peeling
Methods. In JSM Proceedings, Survival Methods for Risk
Estimation/Prediction Section. Boston, MA, USA. American Statistical
Association-IMS, p. 3366-3380.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015).
Cross-validation and Peeling Strategies for Survival Bump Hunting
using Recursive Peeling Methods. Statistical Analysis and Data
Mining, x(x):xxx-xxx.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). R package
PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival,
Regression and Classification. In JSM Proceedings, Section for
Statistical Programmers and Analysts Section. Seattle, WA, USA.
American Statistical Association-IMS, p. xxxx-xxxx.
Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). PRIMsrc for
Identification and Characterization of Informative Prognostic
Subgroups by Survival Bump Hunting. (submitted)
>
> #===================================================
> # Demo with a synthetic dataset
> # Use help for descriptions
> #===================================================
> data("Synthetic.1", package="PRIMsrc")
> ?Synthetic.1
Synthetic.1 package:PRIMsrc R Documentation
Synthetic Dataset #1: p < n case
Description:
Dataset from simulated regression survival model #1 as described
in Dazard et al. (2015). Here, the regression function uses all
of the predictors, which are also part of the design matrix.
Survival time was generated from an exponential model with rate
parameter lambda (and mean frac{1}{lambda}) according to a Cox-PH
model with hazard exp(eta), where eta(.) is the regression
function. Censoring indicator were generated from a uniform
distribution on [0, 3]. In this synthetic example, all covariates
are continuous, i.i.d. from a multivariate uniform distribution on
[0, 1].
Usage:
Synthetic.1
Format:
Each dataset consists of a 'numeric' 'matrix' containing n=250
observations (samples) by rows and p=3 variables by columns, not
including the censoring indicator and (censored) time-to-event
variables. It comes as a compressed Rda data file.
Author(s):
* "Jean-Eudes Dazard, Ph.D." <email: jxd101@case.edu>
* "Michael Choe, M.D." <email: mjc206@case.edu>
* "Michael LeBlanc, Ph.D." <email: mleblanc@fhcrc.org>
* "Alberto Santana, MBA." <email: ahs4@case.edu>
Maintainer: "Jean-Eudes Dazard, Ph.D." <email: jxd101@case.edu>
Acknowledgments: This project was partially funded by the National
Institutes of Health NIH - National Cancer Institute
(R01-CA160593) to J-E. Dazard and J.S. Rao.
Source:
See simulated survival model #1 in Dazard et al., 2015.
References:
* Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015).
"_Cross-validation and Peeling Strategies for Survival Bump
Hunting using Recursive Peeling Methods._" Statistical
Analysis and Data Mining (in press).
* Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2014).
"_Cross-Validation of Survival Bump Hunting by Recursive
Peeling Methods._" In JSM Proceedings, Survival Methods for
Risk Estimation/Prediction Section. Boston, MA, USA.
American Statistical Association IMS - JSM, p. 3366-3380.
* Dazard J-E., Choe M., LeBlanc M. and Rao J.S. (2015). "_R
package PRIMsrc: Bump Hunting by Patient Rule Induction
Method for Survival, Regression and Classification._" In JSM
Proceedings, Statistical Programmers and Analysts Section.
Seattle, WA, USA. American Statistical Association IMS -
JSM, (in press).
* Dazard J-E. and J.S. Rao (2010). "_Local Sparse Bump
Hunting._" J. Comp Graph. Statistics, 19(4):900-92.
>
> #===================================================
> # Simulated dataset #1 (n=250, p=3)
> # Non Replicated Combined Cross-Validation (RCCV)
> # Peeling criterion = LRT
> # Optimization criterion = LRT
> # Without parallelization
> # Without computation of permutation p-values
> #===================================================
> CVCOMB.synt1 <- sbh(dataset = Synthetic.1,
+ cvtype = "combined", cvcriterion = "lrt",
+ B = 1, K = 5,
+ vs = TRUE, cpv = FALSE,
+ decimals = 2, probval = 0.5,
+ arg = "beta=0.05,
+ alpha=0.05,
+ minn=5,
+ L=NULL,
+ peelcriterion=\"lr\"",
+ parallel = FALSE, conf = NULL, seed = 123)
Survival dataset provided.
Requested single 5-fold cross-validation without replications
Cross-validation technique: COMBINED
Cross-validation criterion: LRT
Variable pre-selection: TRUE
Computation of permutation p-values: FALSE
Peeling criterion: LRT
Parallelization: FALSE
Pre-selection of covariates and determination of directions of peeling...
Pre-selected covariates:
X1 X2 X3 
 1  2  3 
Directions of peeling at each step of pre-selected covariates:
X1 X2 X3 
 1 -1 -1 
Fitting and cross-validating the Survival Bump Hunting model using the PRSP algorithm ...
replicate : 1
seed : 123
Fold : 1
Fold : 2
Fold : 3
Fold : 4
Fold : 5
Success! 1 (replicated) cross-validation(s) has(ve) completed
Generating cross-validated optimal peeling lengths from all replicates ...
Generating cross-validated box memberships at each step ...
Generating cross-validated box rules for the pre-selected covariates at each step ...
Generating cross-validated modal trace values of covariate usage at each step ...
Covariates used for peeling at each step, based on covariate trace modal values:
X1 X2 X3 
 1  2  3 
Generating cross-validated box statistics at each step ...
Finished!
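
The log above reports five folds run under a single seed. As a rough illustration of what a random K-fold split of n = 250 observations looks like, here is a minimal base-R sketch. It is not the partitioning code used internally by sbh(), and the helper name make_folds is hypothetical.

```r
# Hypothetical helper: randomly assign n observations to K folds of near-equal size.
make_folds <- function(n, K, seed = NULL) {
  stopifnot(K >= 1, K <= n)
  if (!is.null(seed)) set.seed(seed)
  # Repeat the fold labels 1..K out to length n, then shuffle them.
  sample(rep(seq_len(K), length.out = n))
}

folds <- make_folds(n = 250, K = 5, seed = 123)
table(folds)   # 250 / 5 = 50 observations per fold

# For fold k: train on everything outside the fold, test on the fold itself.
k <- 1
test_idx  <- which(folds == k)
train_idx <- which(folds != k)
length(test_idx)    # 50
length(train_idx)   # 200
```

With K = n every fold holds a single observation, which corresponds to the leave-one-out case described for the K argument.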
>
> ## Not run:
> ##D #===================================================
> ##D # Examples of parallel backend parametrization
> ##D #===================================================
> ##D # Example #1 - Single quad-core (4 cores, double-threaded) PC
> ##D # Running WINDOWS
> ##D # With SOCKET communication
> ##D #===================================================
> ##D if (.Platform$OS.type == "windows") {
> ##D cpus <- parallel::detectCores()
> ##D conf <- list("names" = rep("localhost", cpus),
> ##D "cpus" = cpus,
> ##D "type" = "SOCK",
> ##D "homo" = TRUE,
> ##D "verbose" = TRUE,
> ##D "outfile" = "")
> ##D }
> ##D #===================================================
> ##D # Example #2 - 1 master node + 3 worker nodes cluster
> ##D # All nodes equipped with identical setups and multicores
> ##D # Running LINUX
> ##D # With SOCKET communication
> ##D #===================================================
> ##D if (.Platform$OS.type == "unix") {
> ##D masterhost <- Sys.getenv("HOSTNAME")
> ##D slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
> ##D nodes <- length(slavehosts) + 1
> ##D cpus <- 8
> ##D conf <- list("names" = c(rep(masterhost, cpus),
> ##D rep(slavehosts, cpus)),
> ##D "cpus" = nodes * cpus,
> ##D "type" = "SOCK",
> ##D "homo" = TRUE,
> ##D "verbose" = TRUE,
> ##D "outfile" = "")
> ##D }
> ##D #===================================================
> ##D # Example #3 - Multinode multicore per node cluster
> ##D # Running LINUX
> ##D # with MPI communication
> ##D # Here, a file named ".nodes" (e.g. in the home directory)
> ##D # contains the list of nodes of the cluster
> ##D #===================================================
> ##D if (.Platform$OS.type == "unix") {
> ##D hosts <- scan(file=paste(Sys.getenv("HOME"), "/.nodes", sep=""),
> ##D what="",
> ##D sep="\n")
> ##D hostnames <- unique(hosts)
> ##D nodes <- length(hostnames)
> ##D cpus <- length(hosts)/length(hostnames)
> ##D conf <- list("cpus" = nodes * cpus,
> ##D "type" = "MPI",
> ##D "homo" = TRUE,
> ##D "verbose" = TRUE,
> ##D "outfile" = "")
> ##D }
> ##D #===================================================
> ##D # Simulated dataset #1 (n=250, p=3)
> ##D # Replicated Combined Cross-Validation (RCCV)
> ##D # Peeling criterion = LRT
> ##D # Optimization criterion = LRT
> ##D # With parallelization
> ##D # With computation of permutation p-values
> ##D #===================================================
> ##D CVCOMBREP.synt1 <- sbh(dataset = Synthetic.1,
> ##D cvtype = "combined", cvcriterion = "lrt",
> ##D B = 10, K = 5, A = 1024,
> ##D vs = TRUE, cpv = TRUE,
> ##D decimals = 2, probval = 0.5,
> ##D arg = "beta=0.05,
> ##D alpha=0.05,
> ##D minn=5,
> ##D L=NULL,
> ##D peelcriterion=\"lr\"",
> ##D parallel = TRUE, conf = conf, seed = 123)
> ## End(Not run)
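
In both socket-based configurations above, names holds one entry per worker process, so its length matches cpus. The sketch below rebuilds the host list of Example #2 with a placeholder master hostname and checks that arithmetic. It then repeats the Example #3 counting logic on a temporary file standing in for ".nodes". The hostname "master-node" and the contents of the temporary file are assumptions for illustration, and no cluster is started.

```r
# Example #2 layout: 1 master + 3 workers, 8 cores each (placeholder hostnames).
masterhost <- "master-node"   # assumption; the original reads Sys.getenv("HOSTNAME")
slavehosts <- c("compute-0-0", "compute-0-1", "compute-0-2")
nodes <- length(slavehosts) + 1
cpus  <- 8

conf <- list("names"   = c(rep(masterhost, cpus), rep(slavehosts, cpus)),
             "cpus"    = nodes * cpus,
             "type"    = "SOCK",
             "homo"    = TRUE,
             "verbose" = TRUE,
             "outfile" = "")

length(conf$names) == conf$cpus   # 8 + 3 * 8 entries versus 4 * 8 cpus
table(conf$names)                 # 8 entries per host

# Example #3 counting logic, using a temporary file in place of "~/.nodes".
# The file lists one hostname per available core (made-up content).
nodefile <- tempfile()
writeLines(rep(c("compute-0-0", "compute-0-1"), each = 4), nodefile)
hosts     <- scan(file = nodefile, what = "", sep = "\n", quiet = TRUE)
hostnames <- unique(hosts)
cpus_per_node <- length(hosts) / length(hostnames)   # 8 lines / 2 hosts
unlink(nodefile)
```

Note that the division in Example #3 only gives a whole number when every node appears the same number of times in the file, which is consistent with the "homo" = TRUE setting used in these examples.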
>
> dev.off()
null device
1
>