There are four methods available to perform classification: svm: support vector machines with a radial basis function kernel; bagsvm: support vector machines with a bagging ensemble; randomforest: random forest algorithm; cart: classification and regression trees algorithm.
normalize
Normalization method applied to the count data before classification. none: no normalization is applied and the raw count data are used for classification. deseq: deseq normalization. tmm: trimmed mean of M values.
deseqTransform
Transformation method applied after normalization. vst: variance stabilizing transformation. voomCPM: voom transformation (log of counts-per-million).
cv
Number of cross-validation folds.
rpt
Number of complete sets of folds to compute, i.e. the number of times the cross-validation is repeated.
B
Number of bootstrap samples for bagsvm method.
ref
User-defined reference class. Default is NULL.
...
Optional arguments for train() function from caret package.
Details
In RNA-Seq studies, normalization adjusts for between-sample differences before further analysis. This package provides the "deseq" and "tmm" normalization methods. "deseq" estimates a size factor for each sample as the median of the ratios of its counts to the geometric means of the transcript counts across samples. "tmm" computes a scaling factor between samples after trimming the transcripts with the most extreme log-fold changes and the most extreme absolute intensities.
After normalization, it is useful to transform the data for classification. The MLSeq package provides the "voomCPM" and "vst" transformation methods. "voomCPM" applies a logarithmic transformation (log-cpm) to the normalized count data. "vst" estimates the mean-dispersion relationship of the data and uses it, together with an error model, to derive a variance stabilizing transformation.
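The two steps described above can be sketched in base R. This is only an illustration of the ideas, not the code MLSeq runs internally: the toy matrix `counts` and the helper names `size_factors` and `log_cpm` are hypothetical, and the offsets 0.5 and 1 in the log-cpm formula are the ones used by the voom method.

```r
# Toy count matrix (hypothetical data): rows are features, columns are samples.
counts <- matrix(c(10,  20,  40,
                   100, 200, 400,
                   5,   10,  20,
                   50,  100, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Median-of-ratios size factors, the idea behind "deseq" normalization.
size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))  # log geometric mean of each feature
  keep <- is.finite(log_geo_means)        # skip features with a zero count
  apply(counts, 2, function(col) {
    exp(median(log(col[keep]) - log_geo_means[keep]))
  })
}

sf <- size_factors(counts)
normalized <- sweep(counts, 2, sf, "/")   # divide each sample by its size factor

# log-cpm in the style of voom: log2 counts-per-million with small offsets.
log_cpm <- function(counts) {
  lib_size <- colSums(counts)
  t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))
}

lcpm <- log_cpm(counts)
```

In this toy example the samples differ only by sequencing depth, so the normalized columns come out identical.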
For model validation, k-fold cross-validation (the "cv" option in the MLSeq package) is a widely used technique. With this technique, the training data are randomly split into k non-overlapping, equally sized subsets. A classification model is trained on (k-1) subsets and tested on the remaining subset, so that each subset is used for testing once. The MLSeq package also has a repeat option, "rpt", to obtain more generalizable models. Given m repeats, the cross-validation procedure is applied m times.
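The splitting scheme described above can be sketched in base R as follows. This is an illustrative sketch only: `make_folds` is a hypothetical helper, and the resampling that MLSeq actually performs is handled through the caret package, not by this code. The values k = 5 and m = 3 mirror `cv = 5` and `rpt = 3`, and n = 46 assumes an 80% training split of 58 samples.

```r
# Assign each of n samples to one of k folds, repeated m times.
make_folds <- function(n, k, m) {
  lapply(seq_len(m), function(r) {
    # Balanced fold labels 1..k, shuffled independently for each repeat.
    sample(rep(seq_len(k), length.out = n))
  })
}

set.seed(1)  # fixed seed so the split is reproducible
folds <- make_folds(n = 46, k = 5, m = 3)

# Repeat 1, fold 1: samples held out for testing and samples used for training.
test_idx  <- which(folds[[1]] == 1)
train_idx <- which(folds[[1]] != 1)
```

When n is not a multiple of k, the fold sizes differ by at most one sample, so "equally sized" holds only approximately.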
For more details, see the vignette.
Value
model
the fitted classification model
method
the classification method used
normalization
the normalization method used
deseqTransform
the transformation applied when deseq normalization is used
confusionMat
cross-tabulation of observed and predicted classes and corresponding statistics
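As a rough illustration of what a cross-tabulation and its statistics contain, the base R sketch below computes accuracy, sensitivity and specificity for two classes. The label vectors are made up for the example, and it assumes that the reference class (`ref`) is treated as the positive class; it is not the code used to build confusionMat.

```r
# Hypothetical observed and predicted labels for eight samples.
observed  <- factor(c("T", "T", "T", "N", "N", "N", "N", "T"), levels = c("N", "T"))
predicted <- factor(c("T", "T", "N", "N", "N", "T", "N", "T"), levels = c("N", "T"))
ref <- "T"  # assumed to play the role of the positive class

# Cross-tabulation: rows are predicted classes, columns are observed classes.
tab <- table(Predicted = predicted, Observed = observed)

tp <- tab[ref, ref]            # reference-class samples predicted correctly
tn <- sum(diag(tab)) - tp      # other-class samples predicted correctly (two classes)

accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tp / sum(tab[, ref])
specificity <- tn / sum(tab[, colnames(tab) != ref])
```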
References
Kuhn M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5) (http://www.jstatsoft.org/v28/i05/).
Anders S, Huber W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11:R106.
Witten DM. (2011). Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics, 5(4), 2493-2518.
Law CW. et al. (2014). Voom: precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology, 15:R29, doi:10.1186/gb-2014-15-2-r29.
Witten D. et al. (2010). Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls. BMC Biology, 8:58.
Robinson MD, Oshlack A. (2010). A scaling normalization method for differential expression analysis of RNA-Seq data. Genome Biology, 11:R25, doi:10.1186/gb-2010-11-3-r25.
See Also
predictClassify
Examples
library(MLSeq)
data(cervical)
data = cervical[c(1:150),] # a subset of the cervical data with the first 150 features.
class = data.frame(condition=factor(rep(c("N","T"),c(29,29)))) # defining sample classes.
n = ncol(data) # number of samples
p = nrow(data) # number of features
nTest = ceiling(n*0.2) # number of samples for test set (20% test, 80% train).
ind = sample(n,nTest,FALSE)
# train set
data.train = data[,-ind]
data.train = as.matrix(data.train + 1)
classtr = data.frame(condition=class[-ind,])
# train set in S4 class
data.trainS4 = DESeqDataSetFromMatrix(countData = data.train,
colData = classtr, formula(~ condition))
data.trainS4 = DESeq(data.trainS4, fitType="local")
# Classification and Regression Tree (CART) Classification
cart = classify(data = data.trainS4, method = "cart", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref="T")
cart
# Random Forest (RF) Classification
rf = classify(data = data.trainS4, method = "randomforest", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref="T")
rf
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> library(MLSeq)
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2
Loading required package: DESeq2
Loading required package: S4Vectors
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: GenomicRanges
Loading required package: GenomeInfoDb
Loading required package: SummarizedExperiment
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: limma
Attaching package: 'limma'
The following object is masked from 'package:DESeq2':
plotMA
The following object is masked from 'package:BiocGenerics':
plotMA
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:Biobase':
combine
The following object is masked from 'package:BiocGenerics':
combine
The following object is masked from 'package:ggplot2':
margin
Loading required package: edgeR
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/MLSeq/classify.Rd_%03d_medium.png", width=480, height=480)
> ### Name: classify
> ### Title: Fitting Classification Models to Sequencing Data
> ### Aliases: classify
> ### Keywords: RNA-seq classification
>
> ### ** Examples
>
> data(cervical)
>
> data = cervical[c(1:150),] # a subset of cervical data with first 150 features.
>
> class = data.frame(condition=factor(rep(c("N","T"),c(29,29))))# defining sample classes.
>
> n = ncol(data) # number of samples
> p = nrow(data) # number of features
>
> nTest = ceiling(n*0.2) # number of samples for test set (20% test, 80% train).
> ind = sample(n,nTest,FALSE)
>
> # train set
> data.train = data[,-ind]
> data.train = as.matrix(data.train + 1)
> classtr = data.frame(condition=class[-ind,])
>
> # train set in S4 class
> data.trainS4 = DESeqDataSetFromMatrix(countData = data.train,
+ colData = classtr, formula(~ condition))
converting counts to integer mode
> data.trainS4 = DESeq(data.trainS4, fitType="local")
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 12 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
>
> # Classification and Regression Tree (CART) Classification
> cart = classify(data = data.trainS4, method = "cart", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref="T")
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Loading required package: rpart
> cart
An object of class MLSeq
Method : cart
Accuracy(%) : 93.48
Sensitivity(%) : 91.67
Specificity(%) : 95.45
Reference Class : T
>
> # Random Forest (RF) Classification
> rf = classify(data = data.trainS4, method = "randomforest", normalize = "deseq", deseqTransform = "vst", cv = 5, rpt = 3, ref="T")
found already estimated dispersions, replacing these
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
> rf
An object of class MLSeq
Method : randomforest
Accuracy(%) : 100
Sensitivity(%) : 100
Specificity(%) : 100
Reference Class : T
>
> dev.off()
null device
1
>