Last data update: 2014.03.03

R: Computes the scores of the features.

FeaLect

R Documentation

Computes the scores of the features.

Description

Several random subsets are sampled from the input data, and for each subset, various linear models are fitted using the lars method. A score is assigned to each feature based on the tendency of LASSO to include that feature in the models. Finally, the average scores and the models are returned as the output.
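The subsample-and-fit idea described above can be sketched roughly as follows. This is not the FeaLect implementation: the toy data, the variable names, and the use of plain inclusion frequency as a stand-in score are all assumptions made for illustration. Only `lars()` and `coef()` from the lars package are real library calls.

```r
library(lars)

set.seed(1)
n <- 40; p <- 15
# Toy data (assumption): only the first two features carry signal.
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("f", 1:p)))
y <- as.numeric(X[, 1] - X[, 2] + rnorm(n, sd = 0.5) > 0)

gamma <- 3/4       # relative size of each random subset
n.models <- 20     # number of random subsets to fit
inclusion <- setNames(numeric(p), colnames(X))

for (i in seq_len(n.models)) {
  # Draw a random subset of the samples (without replacement in this sketch).
  idx <- sample(n, size = floor(gamma * n))
  # Fit a short LASSO path on the subset.
  fit <- lars(X[idx, ], y[idx], type = "lasso", max.steps = 5)
  beta <- coef(fit)
  # Count the features with a nonzero coefficient at the last step of the path.
  inclusion <- inclusion + (beta[nrow(beta), ] != 0)
}

# Illustrative score: fraction of models that included each feature.
print(round(inclusion / n.models, 2))
```

Features that LASSO picks up in most subsets end up with a high fraction, while features that enter only by chance in a few subsets stay low. The scoring rule actually used by FeaLect is described in the reference.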

Usage

FeaLect(F, L, maximum.features.num = dim(F)[2], total.num.of.models, gamma = 3/4, 
	   persistence = 1000, talk = FALSE, minimum.class.size = 2, 
	   report.fitting.failure = FALSE, return_linear.models = TRUE, balance = TRUE,
	   replace = TRUE, plot.scores = TRUE)

Arguments

`F`	The feature matrix, each column is a feature.
`L`	The vector of labels named according to the rows of F.
`maximum.features.num`	Up to this number of features are allowed to contribute to each linear model.
`total.num.of.models`	The total number of models that are fitted.
`gamma`	A value in range 0-1 that determines the relative size of sample subsets.
`persistence`	Maximum number of tries for randomly choosing samples. If we try this many times and the obtained labels are all the same, we give up (perhaps all the labels are the same) with the error message: "Not enough variation in the labels...".
`talk`	If TRUE, some messages are printed during the computations.
`minimum.class.size`	The size of both positive and negative classes should be greater than this threshold after sampling.
`report.fitting.failure`	If TRUE, any failure in fitting the linear or logistic models will be printed.
`return_linear.models`	If TRUE, the fitted models are returned. The models are memory intensive, so if there are more than 1000 of them, we may decide to omit them to avoid running out of memory.
`balance`	If TRUE, the cases will be balanced by oversampling, so that the numbers of positive and negative cases are equal before the linear model is fitted.
`replace`	If TRUE, the subsets are sampled with replacement.
`plot.scores`	If TRUE, the scores are plotted in logarithmic scale after each iteration.
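How `gamma`, `replace`, `minimum.class.size`, `persistence` and `balance` might interact can be sketched in base R as below. The two helper functions are hypothetical and are not part of FeaLect. The sketch also assumes that labels are coded 0/1 and that the subset size is `floor(gamma * n)`; the package may do either differently.

```r
# Hypothetical helper (not a FeaLect function): draw one random subset,
# retrying until both classes are large enough or persistence is exhausted.
sample_subset <- function(labels, gamma = 3/4, replace = TRUE,
                          minimum.class.size = 2, persistence = 1000) {
  n <- length(labels)
  size <- floor(gamma * n)  # assumed meaning of "relative size"
  for (attempt in seq_len(persistence)) {
    idx <- sample.int(n, size = size, replace = replace)
    counts <- table(factor(labels[idx], levels = c(0, 1)))
    if (all(counts > minimum.class.size)) return(idx)
  }
  stop("Not enough variation in the labels...")
}

# Hypothetical helper: oversample the smaller class until both are equal.
balance_by_oversampling <- function(idx, labels) {
  pos <- idx[labels[idx] == 1]
  neg <- idx[labels[idx] == 0]
  target <- max(length(pos), length(neg))
  top_up <- function(v) {
    if (length(v) == target) return(v)
    c(v, v[sample.int(length(v), target - length(v), replace = TRUE)])
  }
  c(top_up(pos), top_up(neg))
}

set.seed(7)
labels <- c(rep(1, 6), rep(0, 16))   # toy labels: 6 positives, 16 negatives
idx <- sample_subset(labels)
balanced <- balance_by_oversampling(idx, labels)
print(table(labels[balanced]))
```

With labels that never satisfy the class-size threshold, for example `sample_subset(rep(1, 10), persistence = 5)`, the helper stops with the "Not enough variation in the labels..." error after the allowed number of tries.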

Details

See the reference for more details.

Value

Returns a list of:

`log.scores`	A vector containing the logarithm of final scores.
`feature.matrix`	The input feature matrix.
`labels`	The input labels.
`total.num.of.models`	The total number of models that are fitted.
`maximum.features.num`	Up to this number of features are allowed to contribute to each linear model.
`feature.scores.history`	The matrix of history of feature scores where column i contains the scores after i runs.
`num.of.features.score`	A vector, entry i contains the number of times that i has been the best number of features.
`best.feature.num`	The i'th value of this vector is the best number of features for the i'th model.
`mislabeling.record`	A vector that keeps track of the frequency of mislabelling for each case.
`doctors`	List of all models which are created by train.doctor() function.
`best.features.intersection`	The best features are computed for each sampling, and their intersection is reported as this vector of feature names.
`features.with.best.global.error`	A list containing the sets of features. Set i was the best for the i'th sampling.
`time.taken`	Total time used for executing this function.
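As a rough illustration of working with the returned list, the sketch below ranks features by `log.scores`. It uses a small mock list rather than a real FeaLect result: the feature names and values are invented, and it is an assumption that `log.scores` is a named vector.

```r
# Mock stand-in for a FeaLect result (values invented for illustration only).
mock.result <- list(
  log.scores = c(f1 = log(2.5), f2 = log(0.5), f3 = log(1.0)),
  # Rows are features; column i holds the scores after i runs.
  feature.scores.history = cbind(c(1.0, 0.0, 0.5), c(2.5, 0.5, 1.0))
)

# Rank features from the highest to the lowest log-score.
ranked <- sort(mock.result$log.scores, decreasing = TRUE)
top.features <- names(ranked)[1:2]
print(top.features)

# Number of runs recorded in the score history.
n.runs <- ncol(mock.result$feature.scores.history)
print(n.runs)
```

The same indexing should carry over to an actual result object if its `log.scores` vector carries feature names.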

Note

Logistic regression is also done on top of fitting the linear models.
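The idea of fitting a logistic regression on top of a linear model's output can be sketched as below. This is only an illustration on toy data: it uses base R `lm()` and `glm()` in place of the lars and rms fitting routines that the package loads, and the variable names are assumptions.

```r
set.seed(42)
n <- 60
x <- rnorm(n)
# Toy binary labels whose probability depends on x.
labels <- rbinom(n, 1, plogis(1.5 * x))

# Step 1: a linear model yields one real-valued score per sample.
linear.fit <- lm(labels ~ x)
linear.scores <- as.numeric(predict(linear.fit))

# Step 2: logistic regression maps the linear scores to class probabilities.
logistic.fit <- glm(labels ~ linear.scores, family = binomial)
prob <- predict(logistic.fit, type = "response")

print(summary(prob))
```

The second step turns unbounded linear scores into values between 0 and 1 that can be read as estimated probabilities of the positive class.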

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
L <- as.numeric(mcl_sll[ ,1])	# The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ", dim(F)[2], " features.")

## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result <- FeaLect(F = F, L = L, maximum.features.num = 10,
                          total.num.of.models = 20, talk = TRUE)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> library(FeaLect)
Loading required package: lars
Loaded lars 1.2

Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, round.POSIXt, trunc.POSIXt, units

Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/FeaLect/FeaLect.Rd_%03d_medium.png", width=480, height=480)
> ### Name: FeaLect
> ### Title: Computes the scores of the features.
> ### Aliases: FeaLect
> ### Keywords: regression multivariate classif models
> 
> ### ** Examples
> 
> library(FeaLect)
> data(mcl_sll)
> F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
> L <- as.numeric(mcl_sll[ ,1])	# The labels
> names(L) <- rownames(F)
> message(dim(F)[1], " samples and ",dim(F)[2], " features.")
22 samples and 236 features.
> 
> ## For this data, total.num.of.models is suggested to be at least 100.
> FeaLect.result <-FeaLect(F=F,L=L,maximum.features.num=10,total.num.of.models=20,talk=TRUE)	
***********************************************
Scoring 236 features using 22 samples.
 - started at: 2016-07-04 18:03:40
 - sampling.index: 1
 - sampling.index: 2
----------------again
 - sampling.index: 2
 - sampling.index: 3
 - sampling.index: 4
----------------again
 - sampling.index: 4
 - sampling.index: 5
----------------again
 - sampling.index: 5
----------------again
 - sampling.index: 5
 - sampling.index: 6
 - sampling.index: 7
 - sampling.index: 8
----------------again
 - sampling.index: 8
singular information matrix in lrm.fit (rank= 1 ).  Offending variable(s):
linear.scores 
----------------again
 - sampling.index: 8
----------------again
 - sampling.index: 8
----------------again
 - sampling.index: 8
 - sampling.index: 9
----------------again
 - sampling.index: 9
 - sampling.index: 10
----------------again
 - sampling.index: 10
 - sampling.index: 11
----------------again
 - sampling.index: 11
 - sampling.index: 12
 - sampling.index: 13
----------------again
 - sampling.index: 13
----------------again
 - sampling.index: 13
 - sampling.index: 14
 - sampling.index: 15
----------------again
 - sampling.index: 15
 - sampling.index: 16
 - sampling.index: 17
 - sampling.index: 18
----------------again
 - sampling.index: 18
----------------again
 - sampling.index: 18
----------------again
 - sampling.index: 18
 - sampling.index: 19
----------------again
 - sampling.index: 19
 - sampling.index: 20
****************************************************
validation ended at: 2016-07-04 18:03:41   taking:   1.5087571144104
****************************************************
> 
> dev.off()
null device 
          1 
>