R Graphical Manual

Browse All

Last data update: 2014.03.03

R: Find joint maximum-likelihood estimates of ancestry and...

threeway

R Documentation

Find joint maximum-likelihood estimates of ancestry and heterozygosity for a sample of hybrids in a three-way hybrid zone

Description

For hybrids with up to three parental lineages, the function estimates genomic proportions and ancestry indices by numerically searching likelihood space.

Usage

threeway(G, P, type = "codominant", surf = TRUE, method = "SANN", iterations = 500, 
	start = rep(1/6, 6), Cscale = NULL, props = NULL, 
	control = list(fnscale = -1, maxit = iterations))

Arguments

`G`	A matrix or data frame of genetic marker data. Each column is a locus. For `type="dominant"`, there should be one row per individual. For `type="codominant"`, each individual is to be represented in consecutive rows (one for each allele).
`P`	A matrix or data frame with the following columns (order is important!): Locus name, Allele name, P1 allele frequency, P2 allele frequency, P3 allele frequency. For `type="dominant"`, there should be one row per locus, giving the frequencies of the dominant or "1" allele. For `type="codominant"` there should be a separate row for each allele AND the Allele names should match the data in `G`.
`type`	A string representing the data type. The options are `"codominant"` or `"dominant"`.
`surf`	Logical: should the function find starting values by evaluating likelihoods on a grid?
`method`	Optimization method to search for maximum likelihood estimates of ancestry and heterozygosity. Alternatives are `"SANN"`, `"L-BFGS-B"`, `"surf"`, and `"mcmc"`. See details.
`iterations`	The desired number of MCMC steps to perform when `method="mcmc" or "SANN"`.
`start`	A vector including the starting values of the six genomic proportions. See details.
`Cscale`	An integer, controlling the the proposal distribution for `method = "SANN"` or `"mcmc"`. Smaller values will cause the algorithm to search more broadly, but could make the search inefficient. See details.
`props`	Optional: a matrix of genomic proportions to evaluate if `surf = TRUE` or `method = "surf"`. Columns correspond to the six genomic proportions in order: p11, p22, p33, p12, p13, p23.
`control`	A list of options to be passed to `control` in the `optim` function. Whatever else is chosen, be sure `fnscale` is negative to make `optim` search for a maximum rather than a minimum. Specifying `maxit` will override `iterations` for `method = "SANN"`.

Details

Given three ancestral species or parental populations (P1, P2, and P3), the genome of a hybrid can be described by six genomic proportions: p_{11} = proportion of one's genome that is homozygous for alleles inherited from P1, p_{22} = proportion of one's genome that is homozygous for alleles inherited from P2, p_{33} = proportion of one's genome that is homozygous for alleles inherited from P3, p_{12} = proportion of one's genome that is heterozygous for alleles inhertied from P1 and P2, p_{13} = proportion of one's genome that is heterozygous for alleles inhertied from P1 and P3, p_{23} = proportion of one's genome that is heterozygous for alleles inhertied from P2 and P3.

Currently, the function provides four methods for searching the likelihood space (a 5-simplex with vertices wherever each genomic proportion is equal to 1.0). method = "SANN" is probably the best; it uses the general purpose optimization function optim with its simulated annealing algorithm. For this estimation problem, a custom proposal function is passed to the option gr. This proposal function draws new genomic proportions from a 6-dimensional Dirichlet distribution centered on the old genomic proportions. The concentration of the proposal distribution is controlled by Cscale; the larger this value, the more the proposal distribution is concentrated near the current state.

method = "mcmc" uses a Markov-Chain Monte Carlo with Metropolis-Hastings sampling to explore the likelihood space. It also uses the Dirichlet proposal distribution, and could be useful (with some modification of the code) for generating posterior distributions. method = "SANN" is probaby superior for simply finding the MLE.

method = "L-BFGS-B" also uses optim, but with a quasi-Newton likelihood search algorithm to look for the maximum likelihood. This method is relatively fast, but it can miss the MLE if it is near an edge of the sample space.

"surf", finds all likelihoods on a 6-dimensional grid defined by props and chooses the maximum. By default, props is all possible combinations of 10 equally spaced proportions (from 0 to 1), subject to the constraint that they add to 1. By itself, this method is not going to find the MLE unless the MLE happens to be one of the grid points. However, using the option surf = TRUE with the SANN or mcmc methods can improve efficiency by initiating the search at the grid point nearest the MLE.

Value

A matrix with 10 named columns for each individual, containing estimated genomic proportions, ancestry indices, and the log-likelihood:

`p11`	Proportion of markers homozygous for lineage 1 alleles
`p22`	Proportion of markers homozygous for lineage 2 alleles
`p33`	Proportion of markers homozygous for lineage 3 alleles
`p12`	Proportion of markers heterozygous for lineage 1 and 2 alleles
`p13`	Proportion of markers heterozygous for lineage 1 and 3 alleles
`p23`	Proportion of markers heterozygous for lineage 2 and 3 alleles
`S1`	Lineage 1 ancestry index: proportion of alleles derived from parental lineage 1
`S2`	Lineage 2 ancestry index: proportion of alleles derived from parental lineage 2
`S3`	Lineage 3 ancestry index: proportion of alleles derived from parental lineage 3
`logLik`	log-likelihood of the genomic proportions given the individual marker data

Author(s)

Ben Fitzpatrick

References

Fitzpatrick, B. M. 2012. Estimating ancestry and heterozygosity of hybrids using molecular markers. BMC Evolutionary Biology 12:131. http://www.biomedcentral.com/1471-2148/12/131

Examples

	## Not run: 
	## all possible 2-way crosses after 2 generations
G <- rbind(
rep(1,12),rep(1,12),               # parental 1
rep(2,12),rep(2,12),               # parental 2
rep(3,12),rep(3,12),               # parental 3
rep(1,12),rep(2,12),               # 1 x 2 F1
rep(1:2,each=6),rep(1:2,6),        # 1 x 2 F2
rep(1,12),rep(1:2,6),              # 1 x 1 x 2 BC
rep(2,12),rep(1:2,6),              # 1 x 2 x 2 BC
rep(1,12),rep(3,12),               # 1 x 3 F1
rep(c(1,3),each=6),rep(c(1,3),6),  # 1 x 3 F2
rep(1,12),rep(c(1,3),6),           # 1 x 1 x 3 BC
rep(3,12),rep(c(1,3),6),           # 1 x 3 x 3 BC
rep(2,12),rep(3,12),               # 2 x 3 F1
rep(2:3,each=6),rep(2:3,6),        # 2 x 3 F2
rep(3,12),rep(2:3,6),              # 2 x 3 x 3 BC
rep(2,12),rep(2:3,6)               # 2 x 2 x 3 BC
)

P <- data.frame(Locus=rep(1:12,each=3),allele=rep(1:3,12),
	P1=rep(c(1,0,0),12),P2=rep(c(0,1,0),12),P3=rep(c(0,0,1),12))

mle.o <- threeway(G,P,surf=FALSE,iterations=99)
mle.c <- HIC3(G,P)

# compare the optimization (mle.o) to the closed-form (mle.c): 
# 99 iterations is not enough to converge on the known true values. 
# Try setting surf=TRUE and/or increasing iterations. 

## End(Not run)