Last data update: 2014.03.03

R: DiscML: An R package for estimating evolutionary rates of...
DiscML-packageR Documentation

DiscML: An R package for estimating evolutionary rates of discrete characters using maximum likelihood

Description

DiscML was developed as a unified R program for estimating the evolutionary rates of discrete characters with no restriction on the number of character states and having a great flexibility on transition models. DiscML performs maximum likelihood estimation with the options to correct for unobservable data, to implement a gamma distribution for rate variation, and to estimate the prior root probabilities from the empirical data. It gives users the ability to customize the instantaneous rate transition matrices, and to choose a variety of pre-determined matrices. DiscML is ideal for the analysis of binary (1s/0s) patterns, gene families, and multistate discrete morphological characteristics.

Arguments

x

a vector, a matrix (in which each row is a vector for a gene family), a data frame (in which each row is a vector for a gene family).

phy

phylogenetic information; an object of class "phylo" in "ape".

CI

a logical specifying whether to return the 95% confidence intervals the likelihood of the different states.

model

a customized numeric matrix, or one of the pre-determined ones, "ER", "ARD", "SYM", "BDER", "BDARD","BDSYM", "BDBI", "BDIER", "BDIARD", "BDISYM"

ip

the initial values of the entries in the rate matrix used in the maximum likelihood estimation.

alpha

a logical specifying whether to estimate the alpha parameter of a discrete gamma distribution in the maximum likelihood estimation, or a numeric value specifying an alpha value.

ialpha

the initial alpha value in a gamma distribution used in the maximum likelihood estimation.

rootprobability

a logical specifying whether to estimate the prior root probabilities in the maximum likelihood estimation.

irootprobability

an initial numeric vector for the prior root probabilities optimization. Else it will automatically set to an even numeric vector.

zerocorrection

a logical specifying whether to correct for unobservable data; see details in Felsenstein (1992).

simplify

a logical specifying whether to convert all nonzero character states to character state "1" and perform binary analysis.

individualrates

a logical specifying whether to calculate likelihoods of gene families individually, rather than multiplying them together, given that there are more than one gene family.

characters

a logical specifying whether to automatically find minimum number of character states from given data, or a non-negative integer vector to manually specify all possible character states.

plotloglik

a logical specifying whether to print the plot of 'Log likelihoods Vs Gene families', when 'inividualrates = TRUE'.

plotmu

a logical specifying whether to print the plot of 'Mu vs Gene families', when 'individualrates = TRUE', or a string specifying wich mu's to be plotted.

Details

Package: DiscML
Type: Package
Version: 1.0
Date: 2014-04-28
License: GPL (>= 2)

DiscML is flexible on both the size and type of the rate transition matrix. The argument, 'model', specifies the rate transition matrix, which can be customized by the user or chosen from pre-determined matrices. The pre-determined matrices in DiscML are:

1. 'ER': an equal-rate matrix, in which all non-diagonal entries are equal, e.g., matrix(c(0,1,1,1,0,1,1,1,0),ncol = 3, nrow = 3).

2. 'SYM': a symmetric matrix, which is identical with its transpose, e.g., matrix(c(0,1,2,1,0,3,2,3,0), ncol= 3, nrow = 3).

3. 'ARD': an all-rates-different matrix, in which all non-diagonal entries are free to vary, e.g., matrix(c(0,1,2,3,0,4,5,6,0), ncol = 3, nrow = 3).

4. 'GTR': a general time reversible matrix, which is the same as a combination of arguments, model ="SYM", reversible = TRUE, and rootprobability = TRUE.

5. 'BDER': a birth-and-death matrix, in which all non-zero entries are equal, e.g., matrix( c(0,1,0,0,1,0,1,0,0,1,0,1,0,0,1,0), ncol= 4, nrow = 4)

6. 'BDSYM': a birth-and-death matrix with symmetric entries, e.g., matrix( c(0,1,0,0,1,0,2,0,0,2,0,3,0,0,3,0), ncol= 4, nrow = 4).

7. 'BDARD': a birth-and-death matrix with all non-zero entries free to vary, e.g., matrix( c(0,1,0,0,2,0,3,0,0,4,0,5,0,0,6,0), ncol= 4, nrow = 4).

8. 'BDIER': a birth-death-and-innovation matrix with equal entries, ultimately the same as 'BDER'.

8. 'BDISYM': a birth-death-and-innovation matrix with symmetric entries, e.g., matrix( c(0,1,0,0,1,0,2,0,0,2,0,2,0,0,2,0), ncol= 4, nrow = 4).

9. 'BDIARD': a birth-death-and-innovation matrix with variable entries, e.g., matrix( c(0,1,0,0,3,0,2,0,0,4,0,2,0,0,4,0), ncol= 4, nrow = 4).

10. 'BDBI' : a birth-and-death matrix with birth entries and death entries being equal respectively. e.g., matrix( c(0,1,0,0,2,0,1,0,0,2,0,1,0,0,2,0), ncol= 4, nrow = 4).

When the prior root probabilities are estimated, the argument 'reversible' allows reversibility for the pre-determined symmetric matrices, namely, ER, SYM, BDER, BDIER, BDSYM, and BDISYM, by multiplying the entries by the corresponding root probabilities.

The user can customize entries in the rate matrix by assigning a matrix to the argument 'model'. The diagonal of the matrix will be ignored, and the remaining entries are assigned non-negative integers. The entry with the number '0' will be translated into a value of 0 in the rate matrix, non-negative integers in the entries represent rate index. For example, setting model = matrix( c(NA,1,0,0,1,NA,2,0,0,2,NA,3,0,0,3,NA) ,nrow= 4, ncol= 4), will be as same as setting model= "BDSYM".

Moreover, DiscML considers rate variation among the character sites by implementing a discrete gamma distribution using alpha = TRUE.

Author(s)

Tane Kim, Weilong Hao

Maintainer: Weilong Hao <haow@wayne.edu>

References

Felsenstein, J. (1992). Phylogenies from restriction sites: A maximum-likelihood approach. Evolution, 46, 159–173.

Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics, 20, 289–290.

Schliep K.P. (2011) phangorn: Phylogenetic analysis in R. Bioinformatics, 27(4) 592-593.

Yang, Z. (1994), Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol, 39, 306–314.

Examples

  
# The default arguments in DiscML are:
# model = "ER"
# reversible = FALSE
# alpha = FALSE
# rootprobability = FALSE

x<- c(1,2,1,0,1)
phy <- rtree(length(x))

# x is a vector with 5 elements, and phy is a randomly generated
# 5-taxon tree (using rtree from the 'ape' package).
DiscML(x, phy)

# x here is a matrix
x <- matrix(c(1,2,0,2,1,0),2,3, byrow = TRUE)

# phy is a randomly generated tree containing 3 tips.
phy <- rtree(3) 

# a symmetric rate transition matrix is used in the estimation.
DiscML(x, phy, model = "SYM")

# the prior root probabilities will be estimated.
DiscML(x, phy, rootprobability = TRUE)

# the prior root probabilities are fixed to be 1/16, 5/16, and 10/16.
DiscML(x, phy, rootprobability = c(1/16,5/16,10/16))

# the alpha value in a gamma distribution will be estimated.
DiscML(x, phy, alpha = TRUE)

# DiscML allows the reversibility for the symmetric matrices, e.g., ER, SYM..
DiscML(x, phy, rootprobability = TRUE, reversible = TRUE)

# DiscML can convert all non-zero character states to be '1's to perform
# binary analysis. 
DiscML(x, phy, simplify = TRUE)

# DiscML can compute for each gene family of 'x' individually,
DiscML(x, phy, individualrates = TRUE)

# this is equivalent to:
# phy <- "(A$mu2:0.1,(B$mu0:0.3,C$mu1:0.4)$mu2:0.5);" 
phy <- "(A$mu2:0.1,(B:0.3,C$mu1:0.4)$mu2:0.5);" 

# DiscML can optimize different mu's for each branches.
DiscML(x, phy)

# DiscML can plot each mu vs Gene families when individualrates =TRUE.
DiscML(x, phy, individualrates = TRUE, plotmu = TRUE)

Results


R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(DiscML)
Loading required package: ape
> png(filename="/home/ddbj/snapshot/RGM3/R_CC/result/DiscML/DiscML-package.Rd_%03d_medium.png", width=480, height=480)
> ### Name: DiscML-package
> ### Title: DiscML: An R package for estimating evolutionary rates of
> ###   discrete characters using maximum likelihood
> ### Aliases: DiscML-package DiscML print.DiscML read.tree2
> 
> ### ** Examples
> 
>   
> # The default arguments in DiscML are:
> # model = "ER"
> # reversible = FALSE
> # alpha = FALSE
> # rootprobability = FALSE
> 
> x<- c(1,2,1,0,1)
> phy <- rtree(length(x))
> 
> # x is a vector with 5 elements, and phy is a randomly generated
> # 5-taxon tree (using rtree from the 'ape' package).
> DiscML(x, phy)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy)

Log Likelighood (type '...$loglik' to get this data):  -5.493061 

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Standardized matrix values:
  rate.index value
          1   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 39.59884 46977.23

Prior root probability used (type '...$rprob' to get this table): 
 characters     value
          0 0.3333333
          1 0.3333333
          2 0.3333333

Total time elapsed in running this function (type '...$time' to get this data): 
0.13 seconds 
> 
> # x here is a matrix
> x <- matrix(c(1,2,0,2,1,0),2,3, byrow = TRUE)
> 
> # phy is a randomly generated tree containing 3 tips.
> phy <- rtree(3) 
> 
> # a symmetric rate transition matrix is used in the estimation.
> DiscML(x, phy, model = "SYM")

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, model = "SYM")

Log Likelighood (type '...$loglik' to get this data):  -6.591674 

Rate index matrix:
  0 1 2
0 . 1 2
1 1 . 3
2 2 3 .

Standardized matrix values:
  rate.index    value
          1 0.638299
          2 0.638299
          3 0.223402

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 17.50512 23979.45

Prior root probability used (type '...$rprob' to get this table): 
 characters     value
          0 0.3333333
          1 0.3333333
          2 0.3333333

Total time elapsed in running this function (type '...$time' to get this data): 
0.14 seconds 
> 
> # the prior root probabilities will be estimated.
> DiscML(x, phy, rootprobability = TRUE)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, rootprobability = TRUE)

Log Likelighood (type '...$loglik' to get this data):  -5.075326 

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Standardized matrix values:
  rate.index value
          1   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 3.150582 1.799449

Prior root probability estimated (type '...$rprob' to get this table): 
 characters estimate  std-err
          0        1 0.857079
          1        0 0.877506
          2        0 0.265415

Total time elapsed in running this function (type '...$time' to get this data): 
0.11 seconds 
> 
> # the prior root probabilities are fixed to be 1/16, 5/16, and 10/16.
> DiscML(x, phy, rootprobability = c(1/16,5/16,10/16))

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, rootprobability = c(1/16, 5/16, 10/16))

Log Likelighood (type '...$loglik' to get this data):  -6.591674 

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Standardized matrix values:
  rate.index value
          1   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 157.2453 166850.5

Prior root probability used (type '...$rprob' to get this table): 
 characters  value
          0 0.0625
          1 0.3125
          2 0.6250

Total time elapsed in running this function (type '...$time' to get this data): 
0.1 seconds 
> 
> # the alpha value in a gamma distribution will be estimated.
> DiscML(x, phy, alpha = TRUE)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, alpha = TRUE)

Log Likelighood (type '...$loglik' to get this data):  -6.591674 

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Standardized matrix values:
  rate.index value
          1   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 30.90597 6750.505

Prior root probability used (type '...$rprob' to get this table): 
 characters     value
          0 0.3333333
          1 0.3333333
          2 0.3333333

alpha parameter estimate for discrete-gamma estimate (type '...$alpha' to get this table):
        estimate  std-err
alpha:   6.73391 3133.398

Total time elapsed in running this function (type '...$time' to get this data): 
0.26 seconds 
> 
> # DiscML allows the reversibility for the symmetric matrices, e.g., ER, SYM..
> DiscML(x, phy, rootprobability = TRUE, reversible = TRUE)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, reversible = TRUE, rootprobability = TRUE)

Log Likelighood (type '...$loglik' to get this data):  -6.591674 

Rate index matrix:
  0 1 2
0 . 3 5
1 1 . 6
2 2 4 .

Standardized matrix values:
  rate.index value
          1   0.5
          2   0.5
          3   0.5
          4   0.5
          5   0.5
          6   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 21.23218 25188.34

Prior root probability estimated (type '...$rprob' to get this table): 
 characters estimate  std-err
          0 0.333333 0.141108
          1 0.333333 0.201108
          2 0.333333 0.073795

Total time elapsed in running this function (type '...$time' to get this data): 
0.3 seconds 
> 
> # DiscML can convert all non-zero character states to be '1's to perform
> # binary analysis. 
> DiscML(x, phy, simplify = TRUE)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, simplify = TRUE)

Log Likelighood (type '...$loglik' to get this data):  -4.158883 

Rate index matrix:
  0 1
0 . 1
1 1 .

Standardized matrix values:
  rate.index value
          1     1

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate  std-err
    mu0 16.07465 26968.79

Prior root probability used (type '...$rprob' to get this table): 
 characters value
          0   0.5
          1   0.5

Total time elapsed in running this function (type '...$time' to get this data): 
0.07 seconds 
> 
> # DiscML can compute for each gene family of 'x' individually,
> DiscML(x, phy, individualrates = TRUE)


  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, individualrates = TRUE)

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Result summary (type '...$total' to get this data): 

Site  rate_index:1  Mu:mu0     std-err       Prior_Prob:0  1         2         Log Likelihood  
1     0.500000      21.263621  3.567444e+04  0.333333      0.333333  0.333333  -3.295837       
2     0.500000      21.252994  4.117270e+04  0.333333      0.333333  0.333333  -3.295837       

Total time elapsed in running this function (type '...$time' to get this data): 
0.18 seconds 
> 
> # this is equivalent to:
> # phy <- "(A$mu2:0.1,(B$mu0:0.3,C$mu1:0.4)$mu2:0.5);" 
> phy <- "(A$mu2:0.1,(B:0.3,C$mu1:0.4)$mu2:0.5);" 
> 
> # DiscML can optimize different mu's for each branches.
> DiscML(x, phy)

  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy)

Log Likelighood (type '...$loglik' to get this data):  -6.591674 

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Standardized matrix values:
  rate.index value
          1   0.5

The values of mu's estimated (Type ...$mu to get this table):
 mu(ID) estimate std-err
    mu0 19.95637     NaN
    mu1 21.83955     NaN
    mu2 15.80269     NaN

Prior root probability used (type '...$rprob' to get this table): 
 characters     value
          0 0.3333333
          1 0.3333333
          2 0.3333333

Total time elapsed in running this function (type '...$time' to get this data): 
0.15 seconds 
> 
> # DiscML can plot each mu vs Gene families when individualrates =TRUE.
> DiscML(x, phy, individualrates = TRUE, plotmu = TRUE)


  DiscML (Maximum Likelihood Method for Discrete Character States)

Call: DiscML(x = x, phy = phy, individualrates = TRUE, plotmu = TRUE)

Rate index matrix:
  0 1 2
0 . 1 1
1 1 . 1
2 1 1 .

Result summary (type '...$total' to get this data): 

Site  rate_index:1  Mu:mu0     std-err  mu1        std-err  mu2        std-err  Prior_Prob:0  1         2         Log Likelihood  
1     0.500000      21.874094      NaN  16.540232      NaN  19.691987      NaN  0.333333      0.333333  0.333333  -3.295837       
2     0.500000      21.868802      NaN  16.536418      NaN  19.687272      NaN  0.333333      0.333333  0.333333  -3.295837       

Total time elapsed in running this function (type '...$time' to get this data): 
0.3 seconds 
> 
> 
> 
> 
> 
> dev.off()
null device 
          1 
>