Description: For each feature, a score is computed that can be useful for feature selection. Several random subsets are sampled from the input data and, for each random subset, various linear models are fitted using the lars method. A score is assigned to each feature based on the tendency of LASSO to include that feature in the models. Finally, the average score and the models are returned as the output. Features with relatively low scores are best ignored, because they can lead to overfitting of the model to the training data. Moreover, for each random subset, the best set of features in terms of global error is returned. These sets are useful for applying Bolasso, an alternative feature selection method that recommends the intersection of the feature subsets.
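The sampling-and-counting idea described above can be illustrated with a minimal sketch. This is not FeaLect's actual scoring rule, which is more elaborate; it only counts how often each feature is active at the end of a short LASSO path fitted with lars(). The synthetic data, the numeric response, the 75% subsample size, and the names num.models, max.features, counts, selected.sets and bolasso are all assumptions made for illustration.

```r
# Simplified sketch of subsample-and-count scoring (NOT FeaLect's exact method).
library(lars)

set.seed(1)
n <- 22   # number of samples (synthetic data, assumption)
p <- 50   # number of features (synthetic data, assumption)
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("f", seq_len(p))))
# Synthetic numeric response that depends on the first two features only.
y <- X[, 1] - X[, 2] + rnorm(n, sd = 0.5)

num.models   <- 20   # number of random subsets (hypothetical name)
max.features <- 10   # cap on the number of LARS steps per fit (hypothetical name)

counts        <- setNames(numeric(p), colnames(X))
selected.sets <- vector("list", num.models)

for (i in seq_len(num.models)) {
  # Draw a random subset of the samples (75% is an arbitrary choice).
  idx <- sample(n, size = floor(0.75 * n))
  # Fit a LASSO path on the subset, stopping after max.features steps.
  fit <- lars(X[idx, ], y[idx], type = "lasso", max.steps = max.features)
  # Features with a nonzero coefficient at the last step of the path.
  last.beta <- fit$beta[nrow(fit$beta), ]
  active    <- colnames(X)[which(last.beta != 0)]
  counts[active]     <- counts[active] + 1
  selected.sets[[i]] <- active
}

# Score: fraction of subsets in which the feature was selected.
scores <- counts / num.models
# Bolasso-style selection: features chosen in every subset.
bolasso <- Reduce(intersect, selected.sets)

print(head(sort(scores, decreasing = TRUE), 5))
print(bolasso)
```

The intersection is usually much smaller than any individual selected set, which is why the averaged scores can be a more forgiving summary than a strict intersection.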
Details
Package: FeaLect
Type: Package
Version: 1.0
Date: 2010-10-27
License: GPL version 2 or newer
LazyLoad: yes
Suppose you have a feature matrix with 200 features for only 20 samples and your goal is to build a classifier.
You can run the FeaLect() function to compute the scores for your features. Only the features with relatively
high scores (say, the top 20) are recommended for further analysis. In this way, one can prevent overfitting by
reducing the number of features significantly.
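As a rough illustration of the reduction step, the sketch below keeps the 20 highest-scoring columns of a 20 x 200 matrix. The scores here are random placeholders standing in for the output of a scoring run; the names feature.matrix, feature.scores, top.k and reduced.matrix are hypothetical and are not part of the package.

```r
# Keep only the top-scoring features of a wide matrix (base R only).
set.seed(42)
n <- 20    # samples
p <- 200   # features
feature.matrix <- matrix(rnorm(n * p), nrow = n, ncol = p,
                         dimnames = list(paste0("s", seq_len(n)),
                                         paste0("f", seq_len(p))))

# Placeholder scores (assumption): in practice these would come from a scoring run.
feature.scores <- setNames(runif(p), colnames(feature.matrix))

top.k <- 20
# Names of the top.k features, ordered from highest to lowest score.
top.features <- names(sort(feature.scores, decreasing = TRUE))[seq_len(top.k)]

# drop = FALSE keeps the result a matrix even if top.k were 1.
reduced.matrix <- feature.matrix[, top.features, drop = FALSE]
message(nrow(reduced.matrix), " samples and ", ncol(reduced.matrix), " features kept.")
```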
"Statistical Analysis of Overfitting Features", manuscript in preparation.
Examples
library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1]) # The Feature matrix
L <- as.numeric(mcl_sll[ ,1]) # The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ",dim(F)[2], " features.")
## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result.1 <- FeaLect(F=F, L=L, maximum.features.num=10, total.num.of.models=20, talk=TRUE)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(FeaLect)
Loading required package: lars
Loaded lars 1.2
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/FeaLect/FeaLect-package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: FeaLect-package
> ### Title: Scores features for Feature seLection
> ### Aliases: FeaLect-package
> ### Keywords: package regression multivariate classif models
>
> ### ** Examples
>
> library(FeaLect)
> data(mcl_sll)
> F <- as.matrix(mcl_sll[ ,-1]) # The Feature matrix
> L <- as.numeric(mcl_sll[ ,1]) # The labels
> names(L) <- rownames(F)
> message(dim(F)[1], " samples and ",dim(F)[2], " features.")
22 samples and 236 features.
>
> ## For this data, total.num.of.models is suggested to be at least 100.
> FeaLect.result.1 <-FeaLect(F=F,L=L,maximum.features.num=10,total.num.of.models=20,talk=TRUE)
***********************************************
Scoring 236 features using 22 samples.
- started at: 2016-07-04 18:03:29
- sampling.index: 1
- sampling.index: 2
- sampling.index: 3
singular information matrix in lrm.fit (rank= 1 ). Offending variable(s):
linear.scores
----------------again
- sampling.index: 3
- sampling.index: 4
- sampling.index: 5
- sampling.index: 6
- sampling.index: 7
----------------again
- sampling.index: 7
- sampling.index: 8
----------------again
- sampling.index: 8
----------------again
- sampling.index: 8
----------------again
- sampling.index: 8
----------------again
- sampling.index: 8
- sampling.index: 9
- sampling.index: 10
- sampling.index: 11
- sampling.index: 12
- sampling.index: 13
- sampling.index: 14
----------------again
- sampling.index: 14
- sampling.index: 15
----------------again
- sampling.index: 15
- sampling.index: 16
- sampling.index: 17
- sampling.index: 18
- sampling.index: 19
----------------again
- sampling.index: 19
- sampling.index: 20
****************************************************
validation ended at: 2016-07-04 18:03:31 taking: 1.44863557815552
****************************************************
>
> dev.off()
null device
1
>