Several random subsets are sampled from the input data, and for each random subset various linear models are fitted using the lars method.
A score is assigned to each feature based on the tendency of LASSO to include that feature in the models.
Finally, the average score and the models are returned as the output.
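The idea of scoring features by how often they are selected across random subsets can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the helper `select.top.k`, and the plain inclusion frequency used as the score are all hypothetical, and a simple correlation ranking stands in for the LASSO fit that the package performs with lars.

```r
# Illustrative sketch only: a correlation ranking stands in for the LASSO/lars fit.
set.seed(1)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("f", 1:p)))
y <- as.numeric(X[, "f1"] + 0.5 * rnorm(n) > 0)   # only f1 is informative

# Hypothetical selector: keep the k features most correlated with the labels.
select.top.k <- function(X, y, k) {
  strength <- abs(cor(X, y))[, 1]
  names(sort(strength, decreasing = TRUE))[seq_len(k)]
}

num.models <- 50
counts <- setNames(numeric(p), colnames(X))
for (i in seq_len(num.models)) {
  idx <- sample(n, size = floor(0.75 * n))        # one random subset of the cases
  chosen <- select.top.k(X[idx, ], y[idx], k = 2)
  counts[chosen] <- counts[chosen] + 1            # tally which features were included
}
scores <- counts / num.models   # fraction of models that included each feature
```

In this toy setting the informative feature is included far more often than the noise features, which is the behaviour the scores are meant to capture.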
The vector of labels, named according to the rows of F.
maximum.features.num
Up to this number of features are allowed to contribute to each linear model.
total.num.of.models
The total number of models that are fitted.
gamma
A value in the range 0-1 that determines the size of each sample subset relative to the full data.
persistence
Maximum number of attempts at randomly choosing samples.
If the labels obtained are still all the same after this many attempts,
the function gives up (possibly all the labels are identical) with the error message: " Not enough variation in the labels...".
talk
If TRUE, some messages are printed during the computations.
minimum.class.size
The size of both positive and negative classes should be greater than this threshold after sampling.
report.fitting.failure
If TRUE, any failure in fitting the linear or logistic models will be printed.
return_linear.models
The models are memory intensive, so if there are more than 1000 of them, we may decide not to return them, to avoid running out of memory.
balance
If TRUE, the cases are balanced to have the same number of positives and negatives by oversampling before fitting the linear model.
replace
If TRUE, the subsets are sampled with replacement.
plot.scores
If TRUE, the scores are plotted on a logarithmic scale after each iteration.
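How gamma, replace, persistence, minimum.class.size and balance might interact can be sketched as follows. The helpers `draw.subset` and `balance.by.oversampling` are hypothetical and not part of FeaLect; the 0/1 label coding, the rounding with `floor`, and the default values shown are assumptions made for illustration only.

```r
# Hypothetical helper, not part of FeaLect: draw a subset of about gamma * n
# cases, retrying up to `persistence` times until both classes are large enough.
draw.subset <- function(L, gamma = 3/4, persistence = 1000,
                        minimum.class.size = 2, replace = FALSE) {
  n <- length(L)
  for (attempt in seq_len(persistence)) {
    idx <- sample(n, size = floor(gamma * n), replace = replace)
    class.sizes <- c(sum(L[idx] == 1), sum(L[idx] == 0))
    if (all(class.sizes > minimum.class.size)) return(idx)
  }
  stop("Not enough variation in the labels...")
}

# Hypothetical helper: oversample the smaller class until both classes match.
balance.by.oversampling <- function(idx, L) {
  pos <- idx[L[idx] == 1]
  neg <- idx[L[idx] == 0]
  deficit <- abs(length(pos) - length(neg))
  if (deficit == 0) return(idx)
  minority <- if (length(pos) < length(neg)) pos else neg
  extra <- minority[sample.int(length(minority), deficit, replace = TRUE)]
  c(idx, extra)
}

set.seed(42)
L <- c(rep(1, 7), rep(0, 15))          # toy labels, assumed to be coded 0/1
idx <- draw.subset(L, gamma = 0.75)    # 16 of the 22 cases
balanced <- balance.by.oversampling(idx, L)
table(L[balanced])                     # both classes now have the same count
```

If every label is identical, `draw.subset` exhausts its attempts and stops with the error, which mirrors the behaviour described for persistence.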
Details
See the reference for more details.
Value
Returns a list of:
log.scores
A vector containing the logarithm of final scores.
feature.matrix
The input feature matrix.
labels
The input labels.
total.num.of.models
The total number of models that are fitted.
maximum.features.num
Up to this number of features are allowed to contribute to each linear model.
feature.scores.history
A matrix recording the history of the feature scores; column i contains the scores after i runs.
num.of.features.score
A vector whose entry i contains the number of times that i has been the best number of features.
best.feature.num
The i'th value of this vector is the best number of features for the i'th model.
mislabeling.record
A vector that keeps track of the frequency of mislabelling for each case.
doctors
List of all models created by the train.doctor() function.
best.features.intersection
Best features are computed for each sampling, and their intersection is reported as this vector of feature names.
features.with.best.global.error
A list containing the sets of features. Set i was the best for the i'th sampling.
time.taken
Total time used for executing this function.
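The relationship between some of the returned items can be illustrated with toy values. The feature sets below are invented for the example, and treating each best set's size as its best number of features is an assumption; the code only shows how an intersection and a per-size tally of the kind described above could be derived in base R.

```r
# Toy stand-ins for features.with.best.global.error and best.feature.num
best.sets <- list(c("f1", "f4", "f7"), c("f1", "f7"), c("f1", "f2", "f7"))
best.feature.num <- lengths(best.sets)            # 3, 2, 3

# Features that appear in the best set of every sampling
common <- Reduce(intersect, best.sets)            # "f1" "f7"

# How often each candidate size 1..maximum.features.num was the best one
maximum.features.num <- 5
size.counts <- tabulate(best.feature.num, nbins = maximum.features.num)  # 0 1 2 0 0
```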
Note
Logistic regression is also done on top of fitting the linear models.
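A minimal sketch of fitting a logistic regression on top of linear scores, using base R only. Here `lm` stands in for the lars fit and `glm` stands in for the logistic fit (the session output in the Results section mentions lrm.fit from the rms package); the toy data and every variable name other than linear.scores are assumptions.

```r
set.seed(7)
n <- 60
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("f1", "f2", "f3")))
y <- as.numeric(X[, "f1"] - X[, "f2"] + rnorm(n) > 0)   # toy 0/1 labels

# Step 1: a linear model gives each case a real-valued score.
linear.fit <- lm(y ~ X)
linear.scores <- fitted(linear.fit)

# Step 2: logistic regression maps those scores to class probabilities.
logistic.fit <- glm(y ~ linear.scores, family = binomial)
probs <- predict(logistic.fit, type = "response")
predicted <- as.numeric(probs > 0.5)
accuracy <- mean(predicted == y)   # training accuracy on the toy data
```

When a subset has little variation, the second step can fail or become degenerate, which is consistent with the "singular information matrix" message that appears in the example output.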
Author(s)
Habil Zare
References
"Statistical Analysis of Overfitting Features", manuscript in preparation.
library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1]) # The Feature matrix
L <- as.numeric(mcl_sll[ ,1]) # The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ",dim(F)[2], " features.")
## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result <- FeaLect(F=F, L=L, maximum.features.num=10, total.num.of.models=20, talk=TRUE)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Platform: x86_64-pc-linux-gnu (64-bit)
> library(FeaLect)
Loading required package: lars
Loaded lars 1.2
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/FeaLect/FeaLect.Rd_%03d_medium.png", width=480, height=480)
> ### Name: FeaLect
> ### Title: Computes the scores of the features.
> ### Aliases: FeaLect
> ### Keywords: regression multivariate classif models
>
> ### ** Examples
>
> library(FeaLect)
> data(mcl_sll)
> F <- as.matrix(mcl_sll[ ,-1]) # The Feature matrix
> L <- as.numeric(mcl_sll[ ,1]) # The labels
> names(L) <- rownames(F)
> message(dim(F)[1], " samples and ",dim(F)[2], " features.")
22 samples and 236 features.
>
> ## For this data, total.num.of.models is suggested to be at least 100.
> FeaLect.result <-FeaLect(F=F,L=L,maximum.features.num=10,total.num.of.models=20,talk=TRUE)
***********************************************
Scoring 236 features using 22 samples.
- started at: 2016-07-04 18:03:40
- sampling.index: 1
- sampling.index: 2
----------------again
- sampling.index: 2
- sampling.index: 3
- sampling.index: 4
----------------again
- sampling.index: 4
- sampling.index: 5
----------------again
- sampling.index: 5
----------------again
- sampling.index: 5
- sampling.index: 6
- sampling.index: 7
- sampling.index: 8
----------------again
- sampling.index: 8
singular information matrix in lrm.fit (rank= 1 ). Offending variable(s):
linear.scores
----------------again
- sampling.index: 8
----------------again
- sampling.index: 8
----------------again
- sampling.index: 8
- sampling.index: 9
----------------again
- sampling.index: 9
- sampling.index: 10
----------------again
- sampling.index: 10
- sampling.index: 11
----------------again
- sampling.index: 11
- sampling.index: 12
- sampling.index: 13
----------------again
- sampling.index: 13
----------------again
- sampling.index: 13
- sampling.index: 14
- sampling.index: 15
----------------again
- sampling.index: 15
- sampling.index: 16
- sampling.index: 17
- sampling.index: 18
----------------again
- sampling.index: 18
----------------again
- sampling.index: 18
----------------again
- sampling.index: 18
- sampling.index: 19
----------------again
- sampling.index: 19
- sampling.index: 20
****************************************************
validation ended at: 2016-07-04 18:03:41 taking: 1.5087571144104
****************************************************
>
> dev.off()
null device
1
>