Character matrix or data frame, containing SAAPs as columns
or alternatively as AAMultipleAlignment Biostrings object
phenotype
Numerical vector, where each element is a measured phenotype
corresponding to the observations of the genotype data.
technique
Two techniques are provided: random forests (rf) or linear
support vector machines (svm) (recommended = svm).
fold.cv
The cross-validation fraction (0, 1) of the data which is used
to train the classifier (recommended = 0.66). The ramaining fraction
(1-fold.cv) of the data is used to test the classifier.
boots
Number of bootstraps to be performed to estimate the
classification accuracy and the corresponding confidence intervals
(recommended >= 100).
Details
This procedure takes two types of data as input: first a genotype data
composed of a set of single amino acid polymorphisms (SAAPs), each of
which is represented by a column of character amino acids; second a
numerical phenotype vector, where the elements sorted to correspond
to the rows of the genotype data.
Using these two data types, it quantifies the association between
each SAAP and the phenotype. SAAPs are more complex than SNPs because
they may be composed of more than two genetic states, i.e. more than
two types of amino acid states, whereas SNPs are exclusively composed
of only two genetics states, i.e. two nucleotide states. To compute
the association between a SAAP and a phenotype, the SAAP is first
deconstructed into its amino acid substitution pairs. Following the
deconstruction of a SAAP, the procedure computes the association
between each amino acid substitution pair and the phenotype with
respect to the two metics “effect size” and “classification
accuracy”.
The effect size of an amino acid substitution pair is estimated by
computing the Cohen's d statistics (Cohen 1988). The 95% confidence
intervals are computed as well. The effect size quantifies the phenotypic
effect of substituting one amino acid state for the other at the specific
SAAP site. Substitution pairs characterized with a substantial effect
size and tight confidence intervals which do not include the null effect
are to be prioritized.
Classification accuracy is the second metric which is computed using
statistical learning techniques. This is the metric which is used to
quantify the strength of the association between an amino acid
substitution pair and a phenotype. The idea is to use either linear
suppport vector machines or random forests to build a classification
model between the phenotype vector and the substitution pair vector.
The more accurate the model, the easier we can predict the two states
of the substitution pair from the phenotype and hence the stronger is
the mutual association between the two vectors. In order to obtain a
robust classification accuracy measure, the classification analysis is
done in a bootstrapping fashion. First a subset of the substitution-
phenotype vectors is randomly selected to train a classifier, while
the remaining data is used to test the classifier. This step is repeated
multiple times after which the classification accuracies of all the
classifiers are averaged into a single classification accuracy measure
and the corresponding confidence intervals are computed.
In order to validate the classification accuracy, the tool also
computes the Cohen's kappa statistics (Cohen 1960) which compares
the observed classification accuracy with the expected classification
accuracy. If the expected and observed classification accuracies are
in concordance, the computed association can be taken seriously,
otherwise it can be discarded as noise.
Value
Five classes of results are computed for each SAAP with respect
to the phenotype, resulting in a 18 element vector which is stored
as a row in the final data frame:
data(genotype.saap)
#or data(genotype.saap.msa) in this case you cannot subset genotype.saap[, 1:3]
data(phenotype.saap)
genphen.results <- runGenphenSaap(genotype = genotype.saap[, 1:3],
phenotype = phenotype.saap, technique = "svm", fold.cv = 0.66, boots = 100)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(genphen)
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: e1071
Loading required package: ggplot2
Attaching package: 'ggplot2'
The following object is masked from 'package:randomForest':
margin
Loading required package: effsize
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from 'package:randomForest':
combine
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: XVector
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/genphen/genphenSAAP.Rd_%03d_medium.png", width=480, height=480)
> ### Name: runGenphenSaap
> ### Title: Performing genetic association analysis between SAAPs and
> ### phenotypes
> ### Aliases: runGenphenSaap
>
> ### ** Examples
>
> data(genotype.saap)
> #or data(genotype.saap.msa) in this case you cannot subset genotype.saap[, 1:3]
> data(phenotype.saap)
> genphen.results <- runGenphenSaap(genotype = genotype.saap[, 1:3],
+ phenotype = phenotype.saap, technique = "svm", fold.cv = 0.66, boots = 100)
>
>
>
>
>
> dev.off()
null device
1
>