Character matrix or data frame containing SNPs as columns,
or alternatively a DNAMultipleAlignment object from Biostrings.
phenotype
Numerical vector, where each element is a measured phenotype
corresponding to the observations of the genotype data.
technique
Two techniques are provided: random forests (rf) or linear
support vector machines (svm) (recommended = svm).
fold.cv
The cross-validation fraction, in the interval (0, 1), of the data which is
used to train the classifier (recommended = 0.66). The remaining fraction
(1 - fold.cv) of the data is used to test the classifier.
boots
Number of bootstraps to be performed to estimate the
classification accuracy and the corresponding confidence intervals
(recommended >= 100).
Details
This procedure takes two types of data as input: first, genotype data
composed of single nucleotide polymorphism (SNP) sites, each of which is
represented by a column of alleles, whereby at most two types of alleles
should exist in each column; second, a numerical phenotype vector whose
elements are sorted to correspond to the rows of the genotype data.
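The two input constraints described above can be checked before an analysis is started. The following is a minimal base R sketch; `check_snp_input` and the toy data are hypothetical and not part of genphen.

```r
# Hypothetical helper, not part of genphen: checks the input constraints
# described above (numeric phenotype, one phenotype per genotype row,
# at most two alleles per SNP column).
check_snp_input <- function(genotype, phenotype) {
  genotype <- as.matrix(genotype)
  if (!is.numeric(phenotype)) {
    stop("phenotype must be a numerical vector")
  }
  if (length(phenotype) != nrow(genotype)) {
    stop("phenotype needs one element per genotype row")
  }
  # Count the distinct alleles observed in each SNP column.
  allele.counts <- apply(genotype, 2, function(snp) length(unique(snp)))
  if (any(allele.counts > 2)) {
    stop("columns with more than two alleles: ",
         paste(which(allele.counts > 2), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data (made up for illustration): 4 observations, 2 SNPs.
toy.genotype <- matrix(c("A", "A", "G", "G",
                         "C", "T", "C", "T"), ncol = 2)
toy.phenotype <- c(1.2, 0.9, 3.1, 2.8)
check_snp_input(toy.genotype, toy.phenotype)
```

The check stops with an error message naming the offending columns, so a problem in the input is reported before any classifier is trained.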
Using these two data types, it computes the association between each SNP and
the phenotype. For each SNP two metrics are computed, called "effect size" and
"classification accuracy".
The effect size of a given SNP is obtained by computing the Cohen's d
statistic (Cohen 1988). The 95% confidence intervals are computed as well.
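As an illustration of the effect size, Cohen's d for one SNP can be sketched in base R by splitting the phenotype by allele. The pooled standard deviation used here is an assumption about the variant; `cohens_d` and the toy data are hypothetical, and the confidence interval is omitted.

```r
# Cohen's d with a pooled standard deviation (assumed variant).
cohens_d <- function(x, y) {
  n.x <- length(x)
  n.y <- length(y)
  pooled.sd <- sqrt(((n.x - 1) * var(x) + (n.y - 1) * var(y)) /
                      (n.x + n.y - 2))
  (mean(x) - mean(y)) / pooled.sd
}

# Toy data (made up): one SNP with two alleles and six phenotypes.
snp <- c("A", "A", "A", "G", "G", "G")
phenotype <- c(1, 2, 3, 3, 4, 5)

# Group means are 2 and 4, the pooled standard deviation is 1, so d is -2.
d <- cohens_d(phenotype[snp == "A"], phenotype[snp == "G"])
```

The sign of d depends only on which allele is passed first; its magnitude is what measures the size of the effect.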
Classification accuracy is the second metric, computed using statistical
learning techniques. It quantifies the strength of the association between
a SNP and a phenotype. The idea is to
use either linear support vector machines or random forests to build a
classification model between the phenotype vector and the SNP vector. The
more accurate the model, the more easily the two allele states of the SNP
can be predicted from the phenotype, and hence the stronger the mutual
association between the two vectors. In order to obtain a robust
classification accuracy measure, the classification analysis is done in a
bootstrapping fashion.
First a subset of the SNP-phenotype vectors is randomly selected to train a
classifier, while the remaining data is used to test the classifier. This
step is repeated multiple times after which the classification accuracies
of all the classifiers are averaged into a single classification accuracy
measure and the corresponding confidence intervals are computed.
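The repeated train/test procedure above can be sketched in base R. To avoid package dependencies, the sketch swaps in a simple nearest-class-mean classifier for the linear SVM or random forest, and it assumes a percentile-based 95% interval; `boot_accuracy` and the simulated data are hypothetical and do not reproduce genphen's internals.

```r
# Stand-in classifier (assumption): predict the allele whose mean training
# phenotype is closest to the test phenotype. The procedure described above
# uses a linear SVM or a random forest instead.
boot_accuracy <- function(snp, phenotype, fold.cv = 0.66, boots = 100) {
  n <- length(snp)
  accuracies <- numeric(boots)
  for (b in seq_len(boots)) {
    # Random training subset; the remaining observations form the test set.
    train <- sample(n, size = round(fold.cv * n))
    class.means <- tapply(phenotype[train], snp[train], mean)
    alleles <- names(class.means)
    centers <- as.vector(class.means)
    # For each test phenotype pick the nearest class mean.
    nearest <- vapply(phenotype[-train],
                      function(p) which.min(abs(centers - p)),
                      integer(1))
    accuracies[b] <- mean(alleles[nearest] == snp[-train])
  }
  # Average accuracy and percentile-based 95% interval (assumed method).
  c(mean = mean(accuracies),
    quantile(accuracies, probs = c(0.025, 0.975)))
}

# Simulated data (made up): allele G shifts the phenotype by 3 units.
set.seed(1)
snp <- rep(c("A", "G"), each = 20)
phenotype <- c(rnorm(20, mean = 0), rnorm(20, mean = 3))
result <- boot_accuracy(snp, phenotype, fold.cv = 0.66, boots = 100)
```

With well separated groups, as in the simulated data, the mean accuracy should be close to 1; for an unrelated SNP it should stay near the frequency of the more common allele.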
In order to validate the classification accuracy, the tool also computes
the Cohen's kappa statistic (Cohen 1960), which compares the observed
classification accuracy with the classification accuracy expected by
chance. If the observed accuracy clearly exceeds the expected one (kappa
well above 0), the computed association can be taken seriously; otherwise
it can be discarded as noise.
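For reference, Cohen's kappa is defined as (observed - expected) / (1 - expected), where the expected agreement is derived from the marginal frequencies of the actual and predicted classes. The following base R sketch computes it from two label vectors; `cohens_kappa` and the toy labels are hypothetical.

```r
# Cohen's kappa from actual and predicted class labels.
cohens_kappa <- function(actual, predicted) {
  lv <- union(actual, predicted)
  tab <- table(factor(actual, levels = lv),
               factor(predicted, levels = lv))
  n <- sum(tab)
  observed <- sum(diag(tab)) / n                      # observed accuracy
  expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (observed - expected) / (1 - expected)
}

# Toy labels (made up): 6 of 8 predictions are correct.
actual    <- c("A", "A", "A", "A", "G", "G", "G", "G")
predicted <- c("A", "A", "A", "G", "G", "G", "G", "A")

# Observed accuracy is 0.75, chance agreement is 0.5, so kappa is 0.5.
kappa.value <- cohens_kappa(actual, predicted)
```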
Value
Five classes of results are computed for each SNP with respect to the
phenotype, resulting in an 18-element vector which is stored as a row in
the final data frame:
Examples
data(genotype.snp)
# or data(genotype.snp.msa); in that case you cannot subset with genotype.snp[, 1:3]
data(phenotype.snp)
genphen.results <- runGenphenSnp(genotype = genotype.snp[, 1:3],
    phenotype = phenotype.snp, technique = "svm", fold.cv = 0.66, boots = 100)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
> library(genphen)
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Loading required package: e1071
Loading required package: ggplot2
Attaching package: 'ggplot2'
The following object is masked from 'package:randomForest':
margin
Loading required package: effsize
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following object is masked from 'package:randomForest':
combine
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: XVector
> ### Name: runGenphenSnp
> ### Title: Performing genetic association analysis between SNPs and
> ### phenotypes
> ### Aliases: runGenphenSnp
>
> ### ** Examples
>
> data(genotype.snp)
> #or data(genotype.snp.msa) in this case you cannot subset genotype.snp[, 1:3]
> data(phenotype.snp)
> genphen.results <- runGenphenSnp(genotype = genotype.snp[, 1:3],
+ phenotype = phenotype.snp, technique = "svm", fold.cv = 0.66, boots = 100)