mstmap.data.frame
R Documentation
Extremely fast linkage map construction for data frame objects using MSTmap.
Description
Extremely fast linkage map construction for data frame objects utilizing the
source code for MSTmap (see Wu et al., 2008). The construction includes
linkage group clustering, marker ordering and genetic distance calculations.
Arguments
object
A "data.frame" object containing marker information. The
data.frame must be arranged with markers in rows and
genotypes in columns. Marker names are obtained from the rownames of the
object and genotype names are obtained from the names
component of the object (see Details).
pop.type
Character string specifying the population type of the data frame
object. Accepted values are "DH" (doubled haploid),
"BC" (backcross), "RILn" (non-advanced RIL population with
n generations of selfing) and "ARIL" (advanced RIL) (see
Details). Default is "DH".
dist.fun
Character string defining the distance function used for calculation of
genetic distances. Options are "kosambi" and "haldane".
Default is "kosambi".
objective.fun
Character string defining the objective function to be used when
constructing the map. Options are "COUNT" for minimising the sum of
recombination events between markers and "ML" for maximising the
likelihood objective function. Default is "COUNT".
p.value
Numerical value to specify the threshold to use when constructing
linkage groups. Defaults to 1e-06. If a value greater than one
is given this feature is turned off and it is assumed that all marker
data inputted belong to the same linkage group (see Details).
noMap.dist
Numerical value specifying the smallest genetic distance at which a set of
isolated markers can appear distinct from other linked markers. Isolated
markers will appear in their own linkage groups and will be of the size
specified by noMap.size.
noMap.size
Numerical value to specify the maximum size of isolated marker linkage
groups that have been identified using noMap.dist. This feature
can be turned off by setting it to 0. Default is 0.
miss.thresh
Numerical value to specify the threshold proportion of missing marker
scores allowable in each of the markers. Markers above this threshold
will not be included in the linkage map. Default is 1.
mvest.bc
Logical value. If TRUE missing marker scores will be imputed
before clustering the markers into linkage groups. This is restricted
to "BC", "DH" and "ARIL" populations only (see Details).
detectBadData
Logical value. If TRUE possible genotyping errors are detected,
set to missing and then imputed as part of the
marker ordering algorithm. Genotyping errors will also be printed in the
file specified by trace. This is restricted
to "BC", "DH" and "ARIL" populations only (see Details). Default is FALSE.
as.cross
Logical value. If TRUE the constructed linkage map is returned as
an R/qtl cross object (see Details). If FALSE then the constructed
linkage map is returned as a data.frame with extra columns
indicating the linkage group, marker name/position and genetic distance.
Default is TRUE.
return.imputed
Logical value. If TRUE then the imputed marker probability matrix is
returned for the linkage groups that are constructed (see
Details). Default is FALSE.
trace
An automatic tracing facility. If trace = FALSE then
minimal MSTmap output is piped to the screen during the algorithm.
If trace = TRUE, then detailed output from MSTmap is
piped to "MSToutput.txt". This file is equivalent to the output that
would be obtained from running the MSTmap executable from the command line.
...
Currently ignored.
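For reference, the two options for dist.fun correspond to the standard Haldane and Kosambi map functions, which convert a recombination fraction r into a genetic distance. The following is a minimal sketch only: the function names are hypothetical, distances are assumed to be in centimorgans, and it is not taken from the MSTmap source.

```r
# Standard map functions for a recombination fraction r with 0 <= r < 0.5.
# Hypothetical helper names, for illustration only.
haldane_cM <- function(r) -50 * log(1 - 2 * r)
kosambi_cM <- function(r) 25 * log((1 + 2 * r) / (1 - 2 * r))

r <- c(0.01, 0.1, 0.3)
data.frame(r = r, haldane = haldane_cM(r), kosambi = kosambi_cM(r))
```

For small r the two functions give nearly the same distance. They diverge as r grows, with the Kosambi distance the smaller of the two.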
Details
The data frame object must have an explicit format with markers
in rows and genotypes in columns. The marker names are required to be in
the rownames component and the genotype names are
required to be in the names component of the object. In
each set of names there must be no spaces. If spaces are detected they
are exchanged for a "-". Each of the columns of the data frame must be of class
"character" (not factors). If converting from a matrix, this can
easily be achieved by using the stringsAsFactors = FALSE argument
for any data.frame method.
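As a minimal sketch of the required layout, the conversion from a matrix might look as follows. The matrix geno_mat and its marker and genotype names are hypothetical and used only for illustration.

```r
# Hypothetical marker scores: 3 markers (rows) by 4 genotypes (columns).
geno_mat <- matrix(
  c("A", "B", "A", "U",
    "B", "B", "A", "A",
    "A", "-", "B", "B"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("marker 1", "marker 2", "marker 3"),
                  c("line1", "line2", "line3", "line4"))
)

# stringsAsFactors = FALSE keeps every column as class "character".
geno_df <- as.data.frame(geno_mat, stringsAsFactors = FALSE)

# Replace spaces in the marker and genotype names with "-" ahead of time.
rownames(geno_df) <- gsub(" ", "-", rownames(geno_df))
names(geno_df) <- gsub(" ", "-", names(geno_df))

# Confirm that all columns are character vectors.
all_character <- all(vapply(geno_df, is.character, logical(1)))
```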
It is important to know what population type the data frame
object is and to correctly input this into pop.type. If
pop.type = "ARIL" then it is assumed that the minimal number of heterozygotes have been
set to missing before proceeding. The advanced RIL population is then
treated like a backcross population for the purpose of linkage map
construction. Genetic distances are adjusted post construction.
For non-advanced RIL populations pop.type =
"RILn", the number of generations of selfing is limited to 20 to
ensure sensible input.
The content of the markers in object can either be all numeric
(see below) or all character. If markers are of type character then
the following allelic coding must be adhered to. For pop.type "BC",
"DH" or "ARIL" the two allele types should
be represented as ("A" or "a") and ("B" or
"b"). For non-advanced RIL populations (pop.type = "RILn")
phase unknown heterozygotes should be represented as
"X". For all populations, missing marker scores should be represented
as ("U" or "-").
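The coding rules above can be checked before calling the function. The helper below is a hypothetical sketch, not part of the package, and it assumes that the n in "RILn" is written as an integer suffix (for example "RIL4").

```r
# Hypothetical helper: TRUE if every score in a character data frame uses
# an allele code accepted for the given population type.
valid_codes <- function(df, pop.type = "DH") {
  allowed <- c("A", "a", "B", "b", "U", "-")
  # Assumption: non-advanced RIL types look like "RIL2", "RIL3", ...
  # and additionally allow "X" for phase unknown heterozygotes.
  if (grepl("^RIL[0-9]+$", pop.type)) allowed <- c(allowed, "X")
  all(unlist(df) %in% allowed)
}

scores <- data.frame(g1 = c("A", "b"), g2 = c("X", "U"),
                     row.names = c("m1", "m2"), stringsAsFactors = FALSE)

valid_codes(scores, "DH")    # "X" is not an accepted code for "DH"
valid_codes(scores, "RIL4")  # "X" is an accepted code for "RILn"
```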
This function also extends the functionality of the MSTmap
algorithm by allowing users to input a complete numeric data frame of
marker probabilities for pop.type "BC", "DH" or
"ARIL". The values must lie between 0 (B) and 1 (A) inclusive and
represent the probability that the A allele is present. No
missing values are allowed.
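A numeric input can be checked against these requirements with base R. This is a minimal sketch in which the data frame prob_df and its names are hypothetical.

```r
# Hypothetical probabilities that the A allele is present
# (markers in rows, genotypes in columns).
prob_df <- data.frame(
  line1 = c(1, 0, 0.9),
  line2 = c(0, 0.25, 1),
  row.names = c("m1", "m2", "m3")
)

# Values must be numeric, complete and lie between 0 and 1 inclusive.
prob_mat <- as.matrix(prob_df)
is_valid <- is.numeric(prob_mat) && !anyNA(prob_mat) &&
  all(prob_mat >= 0 & prob_mat <= 1)
```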
The p.value threshold used for clustering markers into distinct
linkage groups can be adjusted (see Wu et al., 2008). A suitable
threshold is highly dependent on the number of individuals in
the population. As the number of individuals increases the
p.value threshold should be decreased accordingly. This may
require some trial and error to achieve the desired results.
If mvest.bc = TRUE and the population type is "BC", "DH" or "ARIL"
then missing values are imputed before markers are clustered into
linkage groups. This is only a simple imputation that assigns a
probability of 0.5 to the missing observation being one allele or the
other. It is used to assist the clustering algorithm when there are known
to be high numbers of missing observations between pairs of markers.
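The effect of this simple imputation can be illustrated in a few lines of R. The helper to_prob is hypothetical and only mirrors the description above; it is not the package's internal code.

```r
# Illustration only: code A as 1, B as 0 and missing scores as 0.5,
# so that a missing score is equally likely to be either allele.
to_prob <- function(x) {
  p <- rep(NA_real_, length(x))
  p[x %in% c("A", "a")] <- 1
  p[x %in% c("B", "b")] <- 0
  p[x %in% c("U", "-")] <- 0.5
  p
}

to_prob(c("A", "b", "U", "-"))
```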
It should be highlighted that for population types
"BC","DH","ARIL", imputation of missing values occurs
regardless of the value of mvest.bc. This is achieved using an EM algorithm that is
tightly coupled with marker ordering (see Wu et al., 2008). Initially
a marker order is obtained omitting missing marker scores and then
imputation is performed based on the underlying recombinant probabilities
of the flanking markers with the markers containing the missing
value. The recombinant probabilities are then recomputed and
the pairwise distances are updated. The ordering algorithm is then
run again and the complete process is repeated until
convergence. Note, the imputed probability matrix for the linkage map
being constructed is returned if return.imputed = TRUE.
For populations "BC","DH","ARIL", if detectBadData = TRUE,
the marker ordering algorithm also
includes the detection of genotyping errors. For any individual
genotype, the detection method is based on a weighted Euclidean metric
(see Wu et al., 2008) that is a function of the
recombination probabilities of all the markers with the marker containing
the suspicious observation. Any genotyping errors detected are set to
missing and the missing values are then imputed as part of the marker
ordering algorithm. Note, the detection of these errors and their
amendment is returned in the imputed probability matrix if
return.imputed = TRUE.
If as.cross = TRUE then the constructed object is returned as an
R/qtl cross object with the appropriate class structure. For "RILn"
populations the constructed object is given the class "bcsft" by
using the qtl package conversion function convert2bcsft
with arguments F.gen = n and BC.gen =
0. For "ARIL" populations the constructed object is given the
class "riself".
If return.imputed = TRUE and pop.type is one of
"BC","DH","ARIL", then the marker probability matrix is
returned for the linkage groups that have been constructed using the
algorithm. Each linkage group is named identically to the linkage groups
of the map and, if as.cross = TRUE, contains an ordered
"map" element and a "data"
element consisting of marker probabilities of the A allele being present
(i.e. P(A) = 1, P(B) = 0). Both elements contain a
possibly reduced version of the marker set that includes all
non-colocating markers as well as the first marker of any set of
co-locating markers. If as.cross = FALSE then an ordered data frame of marker
probabilities is returned.
Value
If as.cross = TRUE the function returns an R/qtl cross object with the appropriate
class structure. The object is a list with usual components
"pheno" and "geno". If as.cross = FALSE the
function returns an ordered data frame object
with additional columns that indicate the linkage group, the marker
name and position, and the genetic distance of the markers within each
linkage group. If markers were omitted for any reason during the
construction, the object will have an "omit" component with
all omitted markers in a collated matrix. If return.imputed =
TRUE then the object will also contain an "imputed.geno" element.
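A typical call might look like the sketch below. It assumes the function is provided by the ASMap package and dispatched through the generic mstmap; the simulated marker data, the recombination rate of 0.05 and all object names are hypothetical. The call itself is skipped if ASMap is not installed.

```r
set.seed(1)
n_mark <- 30
n_ind <- 100

# Simulate one chain of linked markers for a DH population: each marker
# copies the previous one and switches allele with probability 0.05.
geno <- matrix("A", n_mark, n_ind,
               dimnames = list(paste0("m", seq_len(n_mark)),
                               paste0("g", seq_len(n_ind))))
geno[1, ] <- sample(c("A", "B"), n_ind, replace = TRUE)
for (i in 2:n_mark) {
  flip <- runif(n_ind) < 0.05
  prev <- geno[i - 1, ]
  geno[i, ] <- ifelse(flip, ifelse(prev == "A", "B", "A"), prev)
}

# Markers in rows, genotypes in columns, all columns of class "character".
geno_df <- as.data.frame(geno, stringsAsFactors = FALSE)

# Assumption: ASMap exports the generic mstmap() for data frame input.
if (requireNamespace("ASMap", quietly = TRUE)) {
  map_df <- ASMap::mstmap(geno_df, pop.type = "DH", dist.fun = "kosambi",
                          objective.fun = "COUNT", p.value = 1e-06,
                          as.cross = FALSE, trace = FALSE)
  str(map_df)
}
```

With as.cross = FALSE the result is inspected here as a data frame. Setting as.cross = TRUE would instead return an R/qtl cross object.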
Author(s)
Julian Taylor, Dave Butler, Timothy Close, Yonghui Wu, Stefano Lonardi
References
Y. Wu, P. Bhat, T.J. Close, S. Lonardi. Efficient and Accurate
Construction of Genetic Linkage Maps from Minimum Spanning Tree of a
Graph. PLoS Genetics, Volume 4, Issue 10, 2008.