R: DiscML: An R package for estimating evolutionary rates of...
DiscML-package
R Documentation
DiscML: An R package for estimating evolutionary rates
of discrete characters using maximum likelihood
Description
DiscML was developed as a unified R program for estimating the evolutionary
rates of discrete characters with no restriction on the number of character
states and having a great flexibility on transition models. DiscML performs
maximum likelihood estimation with the options to correct for unobservable
data, to implement a gamma distribution for rate variation, and to estimate
the prior root probabilities from the empirical data. It gives users the
ability to customize the instantaneous rate transition matrices, and to
choose a variety of pre-determined matrices. DiscML is ideal for the
analysis of binary (1s/0s) patterns, gene families, and multistate discrete
morphological characteristics.
Arguments
x
a vector, a matrix (in which each row is a vector for a gene family),
a data frame (in which each row is a vector for a gene family).
phy
phylogenetic information; an object of class "phylo" in "ape".
CI
a logical specifying whether to return the 95% confidence
intervals the likelihood of the different states.
model
a customized numeric matrix, or one of the pre-determined ones,
"ER", "ARD", "SYM", "BDER", "BDARD","BDSYM", "BDBI", "BDIER", "BDIARD",
"BDISYM"
ip
the initial values of the entries in the rate matrix used in the
maximum likelihood estimation.
alpha
a logical specifying whether to estimate the alpha parameter of
a discrete gamma distribution in the maximum likelihood estimation, or a
numeric value specifying an alpha value.
ialpha
the initial alpha value in a gamma distribution used in the
maximum likelihood estimation.
rootprobability
a logical specifying whether to estimate the prior root
probabilities in the maximum likelihood estimation.
irootprobability
an initial numeric vector for the prior root
probabilities optimization. Else it will automatically set to an even numeric vector.
zerocorrection
a logical specifying whether to correct for unobservable
data; see details in Felsenstein (1992).
simplify
a logical specifying whether to convert all nonzero character
states to character state "1" and perform binary analysis.
individualrates
a logical specifying whether to calculate likelihoods of gene families individually, rather than multiplying them together, given that there are more than one gene family.
characters
a logical specifying whether to automatically find minimum number of character states from given data, or a non-negative integer vector to manually specify all possible character states.
plotloglik
a logical specifying whether to print the plot of 'Log likelihoods Vs Gene families', when 'inividualrates = TRUE'.
plotmu
a logical specifying whether to print the plot of 'Mu vs Gene families', when 'individualrates = TRUE', or a string specifying wich mu's to be plotted.
Details
Package:
DiscML
Type:
Package
Version:
1.0
Date:
2014-04-28
License:
GPL (>= 2)
DiscML is flexible on both the size and type of the rate transition matrix.
The argument, 'model', specifies the rate transition matrix, which can be
customized by the user or chosen from pre-determined matrices. The
pre-determined matrices in DiscML are:
1. 'ER': an equal-rate matrix, in which all non-diagonal entries are equal,
e.g., matrix(c(0,1,1,1,0,1,1,1,0),ncol = 3, nrow = 3).
2. 'SYM': a symmetric matrix, which is identical with its transpose, e.g.,
matrix(c(0,1,2,1,0,3,2,3,0), ncol= 3, nrow = 3).
3. 'ARD': an all-rates-different matrix, in which all non-diagonal entries are
free to vary, e.g., matrix(c(0,1,2,3,0,4,5,6,0), ncol = 3, nrow = 3).
4. 'GTR': a general time reversible matrix, which is the same as a combination
of arguments, model ="SYM", reversible = TRUE, and rootprobability = TRUE.
5. 'BDER': a birth-and-death matrix, in which all non-zero entries are equal,
e.g., matrix( c(0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0), ncol= 4, nrow = 4)
6. 'BDSYM': a birth-and-death matrix with symmetric entries, e.g.,
matrix( c(0,1,0,0,1,0,2,0,0,2,0,3,0,0,3,0), ncol= 4, nrow = 4).
7. 'BDARD': a birth-and-death matrix with all non-zero entries free to vary,
e.g., matrix( c(0,1,0,0,2,0,3,0,0,4,0,5,0,0,6,0), ncol= 4, nrow = 4).
8. 'BDIER': a birth-death-and-innovation matrix with equal entries, ultimately
the same as 'BDER'.
8. 'BDISYM': a birth-death-and-innovation matrix with symmetric entries, e.g.,
matrix( c(0,1,0,0,1,0,2,0,0,2,0,2,0,0,2,0), ncol= 4, nrow = 4).
9. 'BDIARD': a birth-death-and-innovation matrix with variable entries, e.g.,
matrix( c(0,1,0,0,3,0,2,0,0,4,0,2,0,0,4,0), ncol= 4, nrow = 4).
10. 'BDBI' : a birth-and-death matrix with birth entries and death entries
being equal respectively. e.g.,
matrix( c(0,1,0,0,2,0,1,0,0,2,0,1,0,0,2,0), ncol= 4, nrow = 4).
When the prior root probabilities are estimated, the argument 'reversible'
allows reversibility for the pre-determined symmetric matrices, namely,
ER, SYM, BDER, BDIER, BDSYM, and BDISYM, by multiplying the entries by the
corresponding root probabilities.
The user can customize entries in the rate matrix by assigning a matrix
to the argument 'model'. The diagonal of the matrix will be ignored, and the
remaining entries are assigned non-negative integers. The entry with the number
'0' will be translated into a value of 0 in the rate matrix,
non-negative integers in the entries represent rate index.
For example, setting
model = matrix( c(NA,1,0,0,1,NA,2,0,0,2,NA,3,0,0,3,NA) ,nrow= 4, ncol= 4),
will be as same as setting model= "BDSYM".
Moreover, DiscML considers rate variation among the character sites by implementing
a discrete gamma distribution using alpha = TRUE.
Author(s)
Tane Kim, Weilong Hao
Maintainer: Weilong Hao <haow@wayne.edu>
References
Felsenstein, J. (1992). Phylogenies from restriction sites: A
maximum-likelihood approach. Evolution, 46, 159–173.
Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of
Phylogenetics and Evolution in R language. Bioinformatics, 20, 289–290.
Schliep K.P. (2011) phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4) 592-593.
Yang, Z. (1994), Maximum likelihood phylogenetic estimation from DNA sequences
with variable rates over sites: approximate methods. J Mol Evol, 39, 306–314.
Examples
# The default arguments in DiscML are:
# model = "ER"
# reversible = FALSE
# alpha = FALSE
# rootprobability = FALSE
x<- c(1,2,1,0,1)
phy <- rtree(length(x))
# x is a vector with 5 elements, and phy is a randomly generated
# 5-taxon tree (using rtree from the 'ape' package).
DiscML(x, phy)
# x here is a matrix
x <- matrix(c(1,2,0,2,1,0),2,3, byrow = TRUE)
# phy is a randomly generated tree containing 3 tips.
phy <- rtree(3)
# a symmetric rate transition matrix is used in the estimation.
DiscML(x, phy, model = "SYM")
# the prior root probabilities will be estimated.
DiscML(x, phy, rootprobability = TRUE)
# the prior root probabilities are fixed to be 1/16, 5/16, and 10/16.
DiscML(x, phy, rootprobability = c(1/16,5/16,10/16))
# the alpha value in a gamma distribution will be estimated.
DiscML(x, phy, alpha = TRUE)
# DiscML allows the reversibility for the symmetric matrices, e.g., ER, SYM..
DiscML(x, phy, rootprobability = TRUE, reversible = TRUE)
# DiscML can convert all non-zero character states to be '1's to perform
# binary analysis.
DiscML(x, phy, simplify = TRUE)
# DiscML can compute for each gene family of 'x' individually,
DiscML(x, phy, individualrates = TRUE)
# this is equivalent to:
# phy <- "(A$mu2:0.1,(B$mu0:0.3,C$mu1:0.4)$mu2:0.5);"
phy <- "(A$mu2:0.1,(B:0.3,C$mu1:0.4)$mu2:0.5);"
# DiscML can optimize different mu's for each branches.
DiscML(x, phy)
# DiscML can plot each mu vs Gene families when individualrates =TRUE.
DiscML(x, phy, individualrates = TRUE, plotmu = TRUE)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(DiscML)
Loading required package: ape
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/DiscML/DiscML-package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: DiscML-package
> ### Title: DiscML: An R package for estimating evolutionary rates of
> ### discrete characters using maximum likelihood
> ### Aliases: DiscML-package DiscML print.DiscML read.tree2
>
> ### ** Examples
>
>
> # The default arguments in DiscML are:
> # model = "ER"
> # reversible = FALSE
> # alpha = FALSE
> # rootprobability = FALSE
>
> x<- c(1,2,1,0,1)
> phy <- rtree(length(x))
>
> # x is a vector with 5 elements, and phy is a randomly generated
> # 5-taxon tree (using rtree from the 'ape' package).
> DiscML(x, phy)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy)
Log Likelighood (type '...$loglik' to get this data): -5.493061
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Standardized matrix values:
rate.index value
1 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 39.59884 46977.23
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.3333333
1 0.3333333
2 0.3333333
Total time elapsed in running this function (type '...$time' to get this data):
0.13 seconds
>
> # x here is a matrix
> x <- matrix(c(1,2,0,2,1,0),2,3, byrow = TRUE)
>
> # phy is a randomly generated tree containing 3 tips.
> phy <- rtree(3)
>
> # a symmetric rate transition matrix is used in the estimation.
> DiscML(x, phy, model = "SYM")
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, model = "SYM")
Log Likelighood (type '...$loglik' to get this data): -6.591674
Rate index matrix:
0 1 2
0 . 1 2
1 1 . 3
2 2 3 .
Standardized matrix values:
rate.index value
1 0.638299
2 0.638299
3 0.223402
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 17.50512 23979.45
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.3333333
1 0.3333333
2 0.3333333
Total time elapsed in running this function (type '...$time' to get this data):
0.14 seconds
>
> # the prior root probabilities will be estimated.
> DiscML(x, phy, rootprobability = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, rootprobability = TRUE)
Log Likelighood (type '...$loglik' to get this data): -5.075326
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Standardized matrix values:
rate.index value
1 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 3.150582 1.799449
Prior root probability estimated (type '...$rprob' to get this table):
characters estimate std-err
0 1 0.857079
1 0 0.877506
2 0 0.265415
Total time elapsed in running this function (type '...$time' to get this data):
0.11 seconds
>
> # the prior root probabilities are fixed to be 1/16, 5/16, and 10/16.
> DiscML(x, phy, rootprobability = c(1/16,5/16,10/16))
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, rootprobability = c(1/16, 5/16, 10/16))
Log Likelighood (type '...$loglik' to get this data): -6.591674
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Standardized matrix values:
rate.index value
1 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 157.2453 166850.5
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.0625
1 0.3125
2 0.6250
Total time elapsed in running this function (type '...$time' to get this data):
0.1 seconds
>
> # the alpha value in a gamma distribution will be estimated.
> DiscML(x, phy, alpha = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, alpha = TRUE)
Log Likelighood (type '...$loglik' to get this data): -6.591674
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Standardized matrix values:
rate.index value
1 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 30.90597 6750.505
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.3333333
1 0.3333333
2 0.3333333
alpha parameter estimate for discrete-gamma estimate (type '...$alpha' to get this table):
estimate std-err
alpha: 6.73391 3133.398
Total time elapsed in running this function (type '...$time' to get this data):
0.26 seconds
>
> # DiscML allows the reversibility for the symmetric matrices, e.g., ER, SYM..
> DiscML(x, phy, rootprobability = TRUE, reversible = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, reversible = TRUE, rootprobability = TRUE)
Log Likelighood (type '...$loglik' to get this data): -6.591674
Rate index matrix:
0 1 2
0 . 3 5
1 1 . 6
2 2 4 .
Standardized matrix values:
rate.index value
1 0.5
2 0.5
3 0.5
4 0.5
5 0.5
6 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 21.23218 25188.34
Prior root probability estimated (type '...$rprob' to get this table):
characters estimate std-err
0 0.333333 0.141108
1 0.333333 0.201108
2 0.333333 0.073795
Total time elapsed in running this function (type '...$time' to get this data):
0.3 seconds
>
> # DiscML can convert all non-zero character states to be '1's to perform
> # binary analysis.
> DiscML(x, phy, simplify = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, simplify = TRUE)
Log Likelighood (type '...$loglik' to get this data): -4.158883
Rate index matrix:
0 1
0 . 1
1 1 .
Standardized matrix values:
rate.index value
1 1
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 16.07465 26968.79
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.5
1 0.5
Total time elapsed in running this function (type '...$time' to get this data):
0.07 seconds
>
> # DiscML can compute for each gene family of 'x' individually,
> DiscML(x, phy, individualrates = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, individualrates = TRUE)
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Result summary (type '...$total' to get this data):
Site rate_index:1 Mu:mu0 std-err Prior_Prob:0 1 2 Log Likelihood
1 0.500000 21.263621 3.567444e+04 0.333333 0.333333 0.333333 -3.295837
2 0.500000 21.252994 4.117270e+04 0.333333 0.333333 0.333333 -3.295837
Total time elapsed in running this function (type '...$time' to get this data):
0.18 seconds
>
> # this is equivalent to:
> # phy <- "(A$mu2:0.1,(B$mu0:0.3,C$mu1:0.4)$mu2:0.5);"
> phy <- "(A$mu2:0.1,(B:0.3,C$mu1:0.4)$mu2:0.5);"
>
> # DiscML can optimize different mu's for each branches.
> DiscML(x, phy)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy)
Log Likelighood (type '...$loglik' to get this data): -6.591674
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Standardized matrix values:
rate.index value
1 0.5
The values of mu's estimated (Type ...$mu to get this table):
mu(ID) estimate std-err
mu0 19.95637 NaN
mu1 21.83955 NaN
mu2 15.80269 NaN
Prior root probability used (type '...$rprob' to get this table):
characters value
0 0.3333333
1 0.3333333
2 0.3333333
Total time elapsed in running this function (type '...$time' to get this data):
0.15 seconds
>
> # DiscML can plot each mu vs Gene families when individualrates =TRUE.
> DiscML(x, phy, individualrates = TRUE, plotmu = TRUE)
DiscML (Maximum Likelihood Method for Discrete Character States)
Call: DiscML(x = x, phy = phy, individualrates = TRUE, plotmu = TRUE)
Rate index matrix:
0 1 2
0 . 1 1
1 1 . 1
2 1 1 .
Result summary (type '...$total' to get this data):
Site rate_index:1 Mu:mu0 std-err mu1 std-err mu2 std-err Prior_Prob:0 1 2 Log Likelihood
1 0.500000 21.874094 NaN 16.540232 NaN 19.691987 NaN 0.333333 0.333333 0.333333 -3.295837
2 0.500000 21.868802 NaN 16.536418 NaN 19.687272 NaN 0.333333 0.333333 0.333333 -3.295837
Total time elapsed in running this function (type '...$time' to get this data):
0.3 seconds
>
>
>
>
>
> dev.off()
null device
1
>