R: Random generalized linear model predictor thinning
thinRandomGLM
R Documentation
Random generalized linear model predictor thinning
Description
This function allows the user to define a "thinned" version of a random generalized linear
model predictor by focusing on those features that occur relatively frequently.
Usage
thinRandomGLM(rGLM, threshold)
Arguments
rGLM
a randomGLM object such as one returned by randomGLM.
threshold
integer threshold on the number of times a feature was selected across the bags in
rGLM. Only features selected at least threshold + 1 times are retained; features selected
threshold times or fewer are removed from the underlying regression models when the models are refit.
Appearances in interactions are not counted toward this number.
Details
The function "thins out" (reduces) a previously constructed random generalized linear model predictor by
removing rarely selected features and refitting each (generalized) linear model (GLM).
The GLM in each bag is refit using only those
features that were selected more than threshold times across all nBags bags. The
occurrence count excludes interactions (in other words, the threshold is applied to the first row of
timesSelectedByForwardRegression).
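The thresholding rule can be illustrated in base R without fitting a model. The sketch below uses a made-up count matrix shaped like timesSelectedByForwardRegression (the feature names f1 to f4 and all counts are hypothetical) and keeps the features whose first-row count strictly exceeds threshold. It mirrors the rule as described here and is not the package's internal code.

```r
# Hypothetical counts: rows = interaction order, columns = features.
timesSelected <- matrix(
  c(0, 7, 23, 0,    # row 1: order-1 (no interaction) selections
    3, 9,  1, 0),   # row 2: appearances in 2-way interactions
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("f1", "f2", "f3", "f4"))
)
threshold <- 7

# Only the first row is compared with the threshold,
# and the comparison is strict (count > threshold).
keep <- timesSelected[1, ] > threshold
keptFeatures <- colnames(timesSelected)[keep]
print(keptFeatures)
```

Note that f2 is dropped even though it appears 9 times in interactions: its first-row count equals the threshold, and second-row counts are ignored.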
Value
The function returns a valid randomGLM object (see randomGLM for details) that can be
used as input to the predict() method (see predict.randomGLM). The returned object contains a
copy of the input rGLM in which the following components were modified:
predictedOOB
the updated continuous prediction (if classify is FALSE)
or predicted classification (if classify is TRUE) of the input data based on out-of-bag
samples.
predictedOOB.response
In case of a binary outcome, the updated predicted probability of each
outcome
specified by y based on out-of-bag samples. In case of a continuous outcome, this is the predicted
value based on out-of-bag samples (i.e., a copy of predictedOOB).
featuresInForwardRegression
features selected by forward selection in each bag. A list with one
component per bag. Each component
is a matrix with maxInteractionOrder rows.
Each column represents one
interaction obtained by multiplying the features indicated by the entries in each column (0 means no
feature, i.e. a lower order interaction).
coefOfForwardRegression
coefficients of forward regression. A list with one
component per bag. Each component is a vector giving the coefficients of the model determined by forward
selection in the corresponding bag. The order of the coefficients is the same as the order of the terms in
the corresponding component of featuresInForwardRegression.
interceptOfForwardRegression
a vector with one component per bag giving the intercept of the
regression model in each bag.
timesSelectedByForwardRegression
a matrix with maxInteractionOrder rows and one column per feature.
Each entry gives the number of times the corresponding feature appeared in a predictor model at the
corresponding order of interactions. Interactions where a single feature enters more than once (e.g., a
quadratic interaction of the feature with itself) are counted once.
models
the "thinned" regression models for each bag.
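The column encoding of featuresInForwardRegression described above can be decoded with a few lines of base R. This is a minimal sketch on a hand-made matrix for a single bag. It assumes that the entries are column indices into the feature matrix. The matrix bagFeatures and the helper describeTerm are hypothetical names used only for illustration.

```r
# Hypothetical component for one bag, with maxInteractionOrder = 2.
# Assumption: entries are column indices of the feature matrix; 0 = no feature.
featureNames <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
bagFeatures <- matrix(
  c(3, 0,    # column 1: feature 3 alone (lower-order term)
    1, 4,    # column 2: feature 1 times feature 4
    2, 2),   # column 3: feature 2 times itself
  nrow = 2
)

# Turn one column of indices into a readable product of feature names.
describeTerm <- function(idx) {
  idx <- idx[idx > 0]                       # drop the 0 placeholders
  paste(featureNames[idx], collapse = " * ")
}

terms <- apply(bagFeatures, 2, describeTerm)
print(terms)
```

For a fitted object, the same helper could be applied to each element of the featuresInForwardRegression list, provided the index assumption holds.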
Author(s)
Lin Song, Steve Horvath, Peter Langfelder
References
Lin Song, Peter Langfelder, Steve Horvath: Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics (2013)
Examples
## binary outcome prediction
library(randomGLM)
# data generation
data(iris)
# Restrict data to first 100 observations
iris=iris[1:100,]
# Re-create the Species factor so that the unused third level is dropped
iris$Species = as.factor(as.character(iris$Species))
# Select a training and a test subset of the 100 observations
set.seed(1)
indx = sample(100, 67, replace=FALSE)
xyTrain = iris[indx,]
xyTest = iris[-indx,]
xTrain = xyTrain[, -5]
yTrain = xyTrain[, 5]
xTest = xyTest[, -5]
yTest = xyTest[, 5]
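# Predict with a small number of bags; normally nBags should be at least 100.
RGLM = randomGLM(xTrain, yTrain,
                 nCandidateCovariates = ncol(xTrain),
                 nBags = 30, keepModels = TRUE, nThreads = 1)
# Tabulate how often each feature was selected (interactions excluded)
table(RGLM$timesSelectedByForwardRegression[1, ])
# 0 7 23
# 2 1 1
# Keep only features selected more than 7 times and refit the model in each bag
thinnedRGLM = thinRandomGLM(RGLM, threshold=7)
# Predict on the test set with the thinned and with the original predictor
predictedThinned = predict(thinnedRGLM, newdata = xTest, type="class")
predictedFull = predict(RGLM, newdata = xTest, type="class")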