FeaLect {FeaLect}		R Documentation

Computes the scores of the features.

Description

Several random subsets are sampled from the input data, and for each subset several linear models are fitted using the lars method. A score is assigned to each feature based on the tendency of the LASSO to include that feature in the models. Finally, the average scores and the fitted models are returned as the output (a sketch of this idea is given under Details).

Usage

FeaLect(F, L, maximum.features.num = dim(F)[2], total.num.of.models, gamma = 3/4, 
	   persistence = 1000, talk = FALSE, minimum.class.size = 2, 
	   report.fitting.failure = FALSE, return_linear.models = TRUE, balance = TRUE,
	   replace = TRUE, plot.scores = TRUE)

Arguments

F

The feature matrix; each column is a feature.

L

The vector of labels named according to the rows of F.

maximum.features.num

At most this many features are allowed to contribute to each linear model.

total.num.of.models

The total number of models that are fitted.

gamma

A value in the range 0-1 that determines the relative size of the sampled subsets.

persistence

The maximum number of tries for randomly choosing samples. If after this many attempts the sampled labels are still all the same, the function gives up (possibly all labels are identical) with the error message: "Not enough variation in the labels..." (see the sketch after this list of arguments).

talk

If TRUE, some messages are printed during the computations.

minimum.class.size

After sampling, the sizes of both the positive and the negative class should be greater than this threshold.

report.fitting.failure

If TRUE, any failure in fitting the linear or logistic models will be printed.

return_linear.models

If TRUE, the fitted linear models are returned. The models are memory intensive, so when more than about 1000 models are fitted one may prefer not to return them in order to avoid running out of memory.

balance

If TRUE, the cases are balanced by oversampling before fitting the linear model so that the numbers of positive and negative cases are equal.

replace

If TRUE, the subsets are sampled with replacement.

plot.scores

If TRUE, the scores are plotted in logarithmic scale after each iteration.
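
How gamma, minimum.class.size, persistence and replace interact when a random subset is drawn can be pictured with the rough sketch below; it is not the package's internal code, and draw.subset is a purely illustrative name.

draw.subset <- function(F, L, gamma = 3/4, minimum.class.size = 2,
                        persistence = 1000, replace = TRUE) {
    for (attempt in 1:persistence) {
        # a subset whose size is a gamma fraction of all samples
        rows <- sample(nrow(F), size = round(gamma * nrow(F)), replace = replace)
        # accept it only if both classes are present and large enough
        if (length(unique(L[rows])) > 1 && min(table(L[rows])) > minimum.class.size)
            return(rows)
    }
    stop(" Not enough variation in the labels...")
}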

Details

See the reference for more details.
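
The following rough sketch illustrates the scoring idea only; it is not the package's internal code. It assumes the lars package is installed, F and L are the feature matrix and labels described above, and toy.scores, n.subsets and rows are illustrative names.

library(lars)
toy.scores <- rep(0, ncol(F))          # one running score per feature
names(toy.scores) <- colnames(F)
n.subsets <- 20
for (i in 1:n.subsets) {
    rows <- sample(nrow(F), size = round(3/4 * nrow(F)))    # gamma = 3/4
    fit  <- lars(F[rows, ], L[rows], type = "lasso")        # LASSO path on the subset
    used <- which(apply(coef(fit) != 0, 2, any))            # features ever selected
    toy.scores[used] <- toy.scores[used] + 1                # tally inclusions
}
head(sort(toy.scores, decreasing = TRUE))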

Value

Returns a list of the following components (a usage sketch follows the list):

log.scores

A vector containing the logarithms of the final scores.

feature.matrix

The input feature matrix.

labels

The input labels.

total.num.of.models

The total number of models that are fitted.

maximum.features.num

At most this many features are allowed to contribute to each linear model.

feature.scores.history

A matrix recording the history of feature scores; column i contains the scores after i runs.

num.of.features.score

A vector; entry i contains the number of times that i was the best number of features.

best.feature.num

The i'th value of this vector is the best number of features for the i'th model.

mislabeling.record

A vector that keeps track of how frequently each case was mislabelled.

doctors

A list of all models created by the train.doctor() function.

best.features.intersection

The best features are computed for each sampling, and their intersection is reported in this vector of feature names.

features.with.best.global.error

A list containing sets of features; the i'th set was the best for the i'th sampling.

time.taken

Total time used for executing this function.
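
As a usage sketch (not part of the original documentation), the top-ranked features can be read off the returned list; FeaLect.result is assumed to be the output of FeaLect(), as in the Examples section below.

## sorting the log scores gives the same ranking as sorting the scores themselves
top.features <- names(sort(FeaLect.result$log.scores, decreasing = TRUE))[1:10]
top.features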

Note

Logistic regression is also performed on top of the fitted linear models.

Author(s)

Habil Zare

References

"Statistical Analysis of Overfitting Features", manuscript in preparation.

See Also

FeaLect, train.doctor, doctor.validate, random.subset, compute.balanced, compute.logistic.score, ignore.redundant, input.check.FeaLect

Examples

library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[, -1])	# The feature matrix
L <- as.numeric(mcl_sll[, 1])	# The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ", dim(F)[2], " features.")

## For this data, total.num.of.models is suggested to be at least 100.
FeaLect.result <- FeaLect(F = F, L = L, maximum.features.num = 10,
                          total.num.of.models = 20, talk = TRUE)
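
## The following lines are an illustrative extension, not part of the original
## example: they inspect how the scores evolve as more models are fitted,
## using the feature.scores.history matrix described under Value.
history <- FeaLect.result$feature.scores.history    # column i: scores after i runs
matplot(t(history), type = "l", lty = 1,
        xlab = "Number of fitted models", ylab = "Feature score")
FeaLect.result$best.features.intersection           # intersection of the best features across samplings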

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(FeaLect)
Loading required package: lars
Loaded lars 1.2

Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, round.POSIXt, trunc.POSIXt, units

Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/FeaLect/FeaLect.Rd_%03d_medium.png", width=480, height=480)
> ### Name: FeaLect
> ### Title: Computes the scores of the features.
> ### Aliases: FeaLect
> ### Keywords: regression multivariate classif models
> 
> ### ** Examples
> 
> library(FeaLect)
> data(mcl_sll)
> F <- as.matrix(mcl_sll[ ,-1])	# The Feature matrix
> L <- as.numeric(mcl_sll[ ,1])	# The labels
> names(L) <- rownames(F)
> message(dim(F)[1], " samples and ",dim(F)[2], " features.")
22 samples and 236 features.
> 
> ## For this data, total.num.of.models is suggested to be at least 100.
> FeaLect.result <-FeaLect(F=F,L=L,maximum.features.num=10,total.num.of.models=20,talk=TRUE)	
***********************************************
Scoring 236 features using 22 samples.
 - started at: 2016-07-04 18:03:40
 - sampling.index: 1
 - sampling.index: 2
----------------again
 - sampling.index: 2
 - sampling.index: 3
 - sampling.index: 4
----------------again
 - sampling.index: 4
 - sampling.index: 5
----------------again
 - sampling.index: 5
----------------again
 - sampling.index: 5
 - sampling.index: 6
 - sampling.index: 7
 - sampling.index: 8
----------------again
 - sampling.index: 8
singular information matrix in lrm.fit (rank= 1 ).  Offending variable(s):
linear.scores 
----------------again
 - sampling.index: 8
----------------again
 - sampling.index: 8
----------------again
 - sampling.index: 8
 - sampling.index: 9
----------------again
 - sampling.index: 9
 - sampling.index: 10
----------------again
 - sampling.index: 10
 - sampling.index: 11
----------------again
 - sampling.index: 11
 - sampling.index: 12
 - sampling.index: 13
----------------again
 - sampling.index: 13
----------------again
 - sampling.index: 13
 - sampling.index: 14
 - sampling.index: 15
----------------again
 - sampling.index: 15
 - sampling.index: 16
 - sampling.index: 17
 - sampling.index: 18
----------------again
 - sampling.index: 18
----------------again
 - sampling.index: 18
----------------again
 - sampling.index: 18
 - sampling.index: 19
----------------again
 - sampling.index: 19
 - sampling.index: 20
****************************************************
validation ended at: 2016-07-04 18:03:41   taking:   1.5087571144104
****************************************************
> 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>