If a subset of samples is selected randomly, the number of positive cases in it might be too small or even zero.
This function repeats the sampling until every class is adequately represented in this sense.
L_
The vector of labels, named according to the rows of F_.
gamma
A value in range 0-1 that determines the relative size of sample subsets.
persistence
Maximum number of attempts at randomly choosing samples.
If the sampled labels are still all the same after this many attempts,
the function gives up (perhaps all the labels are identical) with the error message: " Not enough variation in the labels...".
minimum.class.size
A lower bound on the number of samples in each class.
replace
If TRUE, sampling is done with replacement.
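The way gamma, persistence, minimum.class.size and replace interact can be sketched in base R as follows. This is only an illustration under stated assumptions, not the FeaLect implementation: the helper name sample_until_balanced, its default values, and the floor() rounding of the subset size are all hypothetical.

```r
# Hypothetical helper, not part of FeaLect: redraw a random subset of sample
# indices until at least two classes are present and each class has at least
# `minimum.class.size` members, giving up after `persistence` attempts.
sample_until_balanced <- function(labels, gamma, persistence = 1000,
                                  minimum.class.size = 2, replace = FALSE) {
  n <- length(labels)
  k <- max(1, floor(gamma * n))   # assumed rounding rule for the subset size
  classes <- unique(labels)
  for (attempt in seq_len(persistence)) {
    idx <- sample.int(n, size = k, replace = replace)
    counts <- table(factor(labels[idx], levels = classes))
    if (length(counts) >= 2 && all(counts >= minimum.class.size)) {
      return(list(indices = idx, attempts = attempt))
    }
  }
  stop(" Not enough variation in the labels...")
}

set.seed(1)
labels <- c(rep(0, 15), rep(1, 7))   # toy labels: 22 samples, two classes
draw <- sample_until_balanced(labels, gamma = 3/4, replace = TRUE)
table(labels[draw$indices])          # class counts in the accepted subset
```

With constant labels the loop can never succeed, so the sketch stops with the error after persistence attempts, which mirrors the behaviour described for the persistence argument.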
Details
The function also returns a refined feature matrix in which features that are too sparse after sampling are ignored.
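The page does not state what counts as "too sparse", so the following base R sketch assumes one plausible rule: a feature is dropped when it has fewer than min.nonzero nonzero entries among the sampled rows. Both the helper drop_sparse_features and the min.nonzero threshold are hypothetical and may differ from what FeaLect actually does.

```r
# Hypothetical helper, not part of FeaLect: keep only the columns that have
# at least `min.nonzero` nonzero entries (assumed definition of "not sparse").
drop_sparse_features <- function(X, min.nonzero = 1) {
  keep <- colSums(X != 0) >= min.nonzero
  X[, keep, drop = FALSE]
}

# Toy matrix: feature f2 is zero for every sampled row.
X <- matrix(c(1, 0, 2,
              0, 0, 3,
              4, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("s", 1:3), c("f1", "f2", "f3")))
kept <- drop_sparse_features(X)
colnames(kept)   # f2 is dropped under the assumed rule
```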
Value
Returns a list of:
X_
The sampled feature matrix; each column is a feature, with the redundant features removed.
Y_
The vector of labels named according to the rows of X_.
remainder.samples
The names of the rows of F_ that do not appear in X_; these can be used later for validation.
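The relationship between X_ and remainder.samples can be illustrated with base R alone. The sketch below uses a made-up matrix and a plain call to sample() instead of random.subset, so the data and variable names are assumptions for illustration only.

```r
set.seed(42)
# Toy feature matrix with named rows (made-up data, not mcl_sll).
F_ <- matrix(rnorm(40), nrow = 8,
             dimnames = list(paste0("PAT", 1:8), paste0("f", 1:5)))

# Draw row names with replacement, so some samples may repeat.
picked <- sample(rownames(F_), size = 6, replace = TRUE)
X_ <- F_[picked, , drop = FALSE]

# Rows of F_ that never appear in X_ play the role of remainder.samples.
remainder <- setdiff(rownames(F_), rownames(X_))

# The held-out rows can then serve as a validation set.
validation <- F_[remainder, , drop = FALSE]
dim(validation)
```

Because the held-out rows are defined by name, the same vector can be used to subset the label vector as long as it is named according to the rows of F_.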
Author(s)
Habil Zare
References
"Statistical Analysis of Overfitting Features", manuscript in preparation.
Examples
library(FeaLect)
data(mcl_sll)
F <- as.matrix(mcl_sll[, -1])  # The feature matrix
L <- as.numeric(mcl_sll[, 1])  # The labels
names(L) <- rownames(F)
message(dim(F)[1], " samples and ", dim(F)[2], " features.")
XY <- random.subset(F_ = F, L_ = L, gamma = 3/4, replace = TRUE)
XY$remainder.samples
Results
> library(FeaLect)
Loading required package: lars
Loaded lars 1.2
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Attaching package: 'Hmisc'
The following objects are masked from 'package:base':
format.pval, round.POSIXt, trunc.POSIXt, units
Loading required package: SparseM
Attaching package: 'SparseM'
The following object is masked from 'package:base':
backsolve
> ### Name: random.subset
> ### Title: Selects a random subset of the input.
> ### Aliases: random.subset
> ### Keywords: regression multivariate classif models
>
> ### ** Examples
>
> library(FeaLect)
> data(mcl_sll)
> F <- as.matrix(mcl_sll[ ,-1]) # The Feature matrix
> L <- as.numeric(mcl_sll[ ,1]) # The labels
> names(L) <- rownames(F)
> message(dim(F)[1], " samples and ",dim(F)[2], " features.")
22 samples and 236 features.
>
> XY <- random.subset(F_=F, L_=L, gamma=3/4,replace=TRUE)
> XY$remainder.samples
[1] "PAT10105" "PAT20762" "PAT14569" "PAT10301" "PAT10384" "PAT8355"
[7] "PAT8725" "PAT14706" "PAT8334" "PAT8893"
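The size of the remainder above (10 of 22 samples) is in line with what sampling with replacement would be expected to leave out. A rough check, assuming the subset consists of floor(gamma * n) independent uniform draws (the rounding rule is an assumption, not taken from the package):

```r
n <- 22                 # number of samples reported in the output above
k <- floor(3/4 * n)     # assumed number of draws: 16

# Each sample is missed by one draw with probability 1 - 1/n, so the
# expected number of distinct samples after k draws is n * (1 - (1 - 1/n)^k).
expected_unique <- n * (1 - (1 - 1/n)^k)
expected_remainder <- n - expected_unique
c(unique = expected_unique, remainder = expected_remainder)
```

Under these assumptions roughly 11 to 12 distinct samples are expected in the subset, leaving about 10 to 11 for validation.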