These functions read DNA sequences from a file and return a matrix or a
list of DNA sequences, with the names of the taxa read from the file as
rownames or names, respectively. By default, the sequences are stored
in binary format; otherwise (if as.character = TRUE) they are stored as
lower-case characters.
file
a file name specified by either a variable of mode character
or a double-quoted string.
format
a character string specifying the format of the DNA
sequences. Four choices are possible: "interleaved",
"sequential", "clustal", or "fasta", or any
unambiguous abbreviation of these.
skip
the number of lines of the input file to skip before
beginning to read data (ignored for FASTA files; see below).
nlines
the number of lines to be read (by default the file is
read until its end; ignored for FASTA files).
comment.char
a single character; the remainder of the line
after this character is ignored (not used for FASTA files).
as.character
a logical controlling whether to return the
sequences as characters (TRUE) or as an object of class "DNAbin"
(FALSE, the default).
as.matrix
(used if format = "fasta") one of the following three
values: (i) NULL: returns the sequences in a matrix if
they are of the same length, otherwise in a list; (ii) TRUE:
returns the sequences in a matrix, or stops with an error if they
are of different lengths; (iii) FALSE: always returns the
sequences in a list.
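The effect of as.matrix on FASTA input can be sketched as follows. This is only an illustration: it assumes the ape package is installed, and the file contents and labels (t1, t2) are invented for the example.

```r
library(ape)

# Two made-up FASTA files: one with equal-length sequences, one without.
aligned <- tempfile(fileext = ".fas")
ragged  <- tempfile(fileext = ".fas")
writeLines(c(">t1", "ACGTACGT", ">t2", "ACGTACGA"), aligned)
writeLines(c(">t1", "ACGTACGT", ">t2", "ACGT"), ragged)

# as.matrix = NULL (the default): matrix if lengths agree, otherwise a list.
x_default <- read.dna(aligned, format = "fasta")
y_default <- read.dna(ragged, format = "fasta")

# as.matrix = FALSE: always a list, even for equal-length sequences.
x_list <- read.dna(aligned, format = "fasta", as.matrix = FALSE)

# as.matrix = TRUE: stops with an error when the lengths differ.
attempt <- try(read.dna(ragged, format = "fasta", as.matrix = TRUE),
               silent = TRUE)
failed <- inherits(attempt, "try-error")
```

Here x_default is a matrix, while y_default and x_list are lists, and failed records that the last call stopped with an error.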
Details
read.dna follows the interleaved and sequential formats defined
in PHYLIP (Felsenstein, 1993), with the added feature that there
is no restriction on the lengths of the taxa names. For these two
formats, the first line of the file must contain the dimensions of the
data (the number of taxa and the number of nucleotides); the
sequences are considered as aligned and thus must be of the same
length for all taxa. For the FASTA format, the conventions defined in
the URL below (see References) are followed; the sequences are taken as
non-aligned. For all formats, the nucleotides can be arranged in any
way with blanks and line-breaks inside (with the restriction that the
first ten nucleotides must be contiguous for the interleaved and
sequential formats, see below). The names of the sequences are read from
the file. Particularities for each format are detailed below.
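As a rough illustration of the header line and of blanks and line breaks inside the sequences, the sketch below writes a small sequential file to a temporary location and reads it back. The taxon labels (t1, t2) and the sequences are made up for the example, and the ape package is assumed to be installed.

```r
library(ape)

phy <- tempfile(fileext = ".phy")
writeLines(c(
  "2 16",                      # dimensions: 2 taxa, 16 nucleotides
  "t1    ACGTACGTAC GTACGT",   # first ten nucleotides contiguous, then a blank
  "t2    ACGTACGTAC",          # sequence continues on the next line
  "GTACGA"
), phy)

x <- read.dna(phy, format = "sequential")
dims  <- dim(x)        # number of taxa, number of nucleotides
taxa  <- rownames(x)   # taxa names read from the file
```

The blank inside the first sequence and the line break inside the second are dropped, so both rows end up with the 16 nucleotides announced in the first line.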
Interleaved: the function starts to read the sequences after it
finds one or more spaces (or tabs). All characters before the
sequences are taken as the taxa names after removing the leading and
trailing spaces (so spaces in taxa names are allowed). It is assumed
that the taxa names are not repeated in the subsequent blocks of
nucleotides.
Sequential: the same criterion as for the interleaved format
is used to start reading the sequences and the taxa names; the
sequences are then read until the number of nucleotides specified in
the first line of the file is reached. This is repeated for each taxon.
Clustal: this is the format output by the Clustal programs
(.aln). It is somewhat similar to the interleaved format, the
differences being that the dimensions of the data are not indicated
in the file, and the names of the sequences are repeated in each block.
FASTA: this looks like the sequential format but the taxa names
(or rather a description of the sequence) are on separate lines
beginning with a ‘greater than’ character ‘>’ (there may be
leading spaces before this character). These lines are taken as taxa
names after removing the ‘>’ and the possible leading and trailing
spaces. All the data in the file before the first sequence is ignored.
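To make the layouts described above more concrete, here is a minimal sketch that stores the same three made-up sequences in interleaved and in FASTA form and reads both back. It assumes the ape package is installed; the labels are kept free of spaces and padded to the same width, a conservative choice for this example rather than a stated requirement.

```r
library(ape)

# Interleaved: dimensions first, names only in the first block.
inter_file <- tempfile(fileext = ".phy")
writeLines(c(
  "3 20",
  "t1    ACGTACGTAC",
  "t2    ACGTACGTAC",
  "t3    ACGTACGTAC",
  "",
  "GTACGTACGT",
  "GTACGTACGA",
  "GTACGTACGG"
), inter_file)

# FASTA: no dimensions, each name on its own line starting with '>'.
fasta_file <- tempfile(fileext = ".fas")
writeLines(c(
  ">t1", "ACGTACGTACGTACGTACGT",
  ">t2", "ACGTACGTACGTACGTACGA",
  ">t3", "ACGTACGTACGTACGTACGG"
), fasta_file)

inter <- read.dna(inter_file, format = "interleaved")
fasta <- read.dna(fasta_file, format = "fasta")

# Compare the nucleotides and the taxa names of the two objects.
same_bases <- all(as.character(inter) == as.character(fasta))
same_names <- identical(rownames(inter), rownames(fasta))
```

Because the FASTA sequences all have the same length, both calls return a matrix here, which is what makes the element-wise comparison possible.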
Value
a matrix or a list (if format = "fasta") of DNA sequences
stored in binary format, or of mode character (if as.character =
TRUE).
read.FASTA always returns a list of class "DNAbin".
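The two kinds of return value can be inspected with a short sketch like the one below. It assumes the ape package is installed and uses an invented two-sequence FASTA file with sequences of different lengths.

```r
library(ape)

fas <- tempfile(fileext = ".fas")
writeLines(c(">t1", "ACGT", ">t2", "ACGTAC"), fas)

# read.FASTA: a list of class "DNAbin", one element per sequence.
bin <- read.FASTA(fas)
bin_class <- class(bin)
bin_names <- names(bin)

# read.dna with as.character = TRUE: lower-case characters instead of
# the binary format (a list here, since the lengths differ).
chars <- read.dna(fas, format = "fasta", as.character = TRUE)
first <- chars[["t1"]]
```

In this sketch first holds the four nucleotides of t1 as single lower-case letters, whereas bin keeps the same data in the binary "DNAbin" representation.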