A symmetric N x N distance matrix with the values of dissimilarity between N sequences, or NULL if method is "inexact".
method
An agglomeration method to be used. This should be (an abbreviation of) one of "complete", "single", "UPGMA", "WPGMA", "NJ", "ML", or "inexact". (See details section below.)
cutoff
A vector with the maximum edge length separating the sequences in the same cluster. Multiple cutoffs may be provided in ascending or descending order. If asDendrogram=TRUE or showPlot=TRUE then only one cutoff may be specified. (See details section below.)
showPlot
Logical specifying whether or not to plot the resulting dendrogram. Not applicable if method='inexact'.
asDendrogram
Logical. If TRUE then the object returned is of class dendrogram. Not applicable if method='inexact'.
myXStringSet
If method is "ML", the DNAStringSet or RNAStringSet used in the creation of myDistMatrix. If method is "inexact", the DNAStringSet, RNAStringSet, or AAStringSet to cluster. Not applicable for other methods.
model
One or more of the available MODELS of DNA evolution. Only applicable if method is "ML".
processors
The number of processors to use, or NULL to automatically detect and use all available processors.
verbose
Logical indicating whether to display progress.
Details
IdClusters groups the input sequences into clusters using a set dissimilarities representing the distance between N sequences. Initially a phylogenetic tree is formed using the specified method. Then each leaf (sequence) of the tree is assigned to a cluster based on its edge lengths to the other sequences. The available clustering methods are described as follows:
Ultrametric methods: The method complete assigns clusters using complete-linkage so that sequences in the same cluster are no more than cutoff percent apart. The method single assigns clusters using single-linkage so that sequences in the same cluster are within cutoff of at least one other sequence in the same cluster. UPGMA (the default) or WPGMA assign clusters using average-linkage which is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to connect distant relatives that do not appear to be closely related. UPGMA produces an unweighted tree, where each leaf contributes equally to the average edge lengths, whereas WPGMA produces a weighted result.
Additive methods: NJ uses the Neighbor-Joining method proposed by Saitou and Nei that does not assume lineages evolve at the same rate (the molecular clock hypothesis). The NJ method is typically the most phylogenetically accurate of the above distance-based methods. ML creates a neighbor-joining tree and then iteratively maximizes the likelihood of the tree given the aligned sequences (myXStringSet). This is accomplished through a combination of optimizing edge lengths with Brent's method and improving tree topology with nearest-neighbor interchanges (NNIs). When method="ML", one or more MODELS of DNA evolution must be specified. Model parameters are iteratively optimized to maximize likelihood, except base frequencies which are empirically determined. If multiple models are given, the best model is automatically chosen based on BIC calculated from the likelihood and the sample size (defined as the number of variable sites in the DNA sequence).
Sequence-only method: inexact uses a heuristic algorithm to directly assign sequences to clusters without a distance matrix. First the sequences are ordered by length and the longest sequence becomes the first cluster seed. If the second sequence is less than cutoff percent distance then it is added to the cluster, otherwise it becomes a new cluster representative. The remaining sequences are matched to cluster representatives based on their k-mer distribution and then aligned to find the closest sequence. This approach is repeated until all sequences belong to a cluster. In the vast majority of cases, this process results in clusters with members separated by less than cutoff distance, where distance is defined as the percent dissimilarity between the overlapping region of a “glocal” alignment.
Multiple cutoffs may be provided if they are in increasing or decreasing order. If cutoffs are provided in descending order then clustering at each new value of cutoff is continued within the prior cutoff's clusters. In this way clusters at lower values of cutoff are completely contained within their umbrella clusters at higher values of cutoff. This is useful for defining taxonomy, where lower level groups (e.g., genera) are expected not to straddle multiple higher level groups (e.g., families). If multiple cutoffs are provided in ascending order then clustering at each level of cutoff is independent of the prior level. This may result in fewer high-level clusters for NJ and ML methods, but will have no impact on ultrametric methods. Providing cutoffs in descending order makes inexact clustering faster, but has negligible impact on the other methods.
Value
If asDendrogram=FALSE (the default), then a data.frame is returned with a column for each cutoff specified. This data.frame has dimensions N*M, where each one of N sequences is assigned to a cluster at the M-level of cutoff. The row.names of the data.frame correspond to the dimnames of myDistMatrix.
If asDendrogram=TRUE, returns an object of class dendrogram that can be used for further manipulation and plotting. Leaves of the dendrogram are randomly colored by cluster number.
Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution, 17(6), 368-376.
Ghodsi, M., Liu, B., & Pop, M. (2011) DNACLUST. BMC Bioinformatics, 12(1), 271. doi:10.1186/1471-2105-12-271.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.
See Also
DistanceMatrix, Add2DB, MODELS
Examples
# using the matrix from the original paper by Saitou and Nei
m <- matrix(0,8,8)
m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
m[4:8,3] <- c(5, 7, 10, 7, 11)
m[5:8,4] <- c(8, 11, 8, 12)
m[6:8,5] <- c(5, 6, 10)
m[7:8,6] <- c(9, 13)
m[8,7] <- c(8)
# returns an object of class "dendrogram"
myClusters <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, asDendrogram=TRUE)
# example of specifying multiple cutoffs
IdClusters(m, cutoff=c(2,6,10,20)) # returns a data frame
# example of 'inexact' clustering
fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
IdClusters(myXStringSet=dna, method="inexact", cutoff=0.05)
Results
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(DECIPHER)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, cbind, colnames, do.call, duplicated, eval, evalq,
get, grep, grepl, intersect, is.unsorted, lapply, lengths, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, rank,
rbind, rownames, sapply, setdiff, sort, table, tapply, union,
unique, unsplit
Loading required package: S4Vectors
Loading required package: stats4
Attaching package: 'S4Vectors'
The following objects are masked from 'package:base':
colMeans, colSums, expand.grid, rowMeans, rowSums
Loading required package: IRanges
Loading required package: XVector
Loading required package: RSQLite
Loading required package: DBI
> png(filename="/home/ddbj/snapshot/RGM3/R_BC/result/DECIPHER/IdClusters.Rd_%03d_medium.png", width=480, height=480)
> ### Name: IdClusters
> ### Title: Cluster Sequences By Distance or Sequence
> ### Aliases: IdClusters
>
> ### ** Examples
>
> # using the matrix from the original paper by Saitou and Nei
> m <- matrix(0,8,8)
> m[2:8,1] <- c(7, 8, 11, 13, 16, 13, 17)
> m[3:8,2] <- c(5, 8, 10, 13, 10, 14)
> m[4:8,3] <- c(5, 7, 10, 7, 11)
> m[5:8,4] <- c(8, 11, 8, 12)
> m[6:8,5] <- c(5, 6, 10)
> m[7:8,6] <- c(9, 13)
> m[8,7] <- c(8)
>
> # returns an object of class "dendrogram"
> myClusters <- IdClusters(m, cutoff=10, method="NJ", showPlot=TRUE, asDendrogram=TRUE)
| | | 0% | |================== | 25% | |================================ | 46% | |============================================= | 64% | |======================================================= | 78% | |============================================================== | 89% | |=================================================================== | 96% | |======================================================================| 100%
Time difference of 0.06 secs
>
> # example of specifying multiple cutoffs
> IdClusters(m, cutoff=c(2,6,10,20)) # returns a data frame
| | | 0% | |================== | 25% | |================================ | 46% | |============================================= | 64% | |======================================================= | 78% | |============================================================== | 89% | |=================================================================== | 96% | |======================================================================| 100%
Time difference of 0.01 secs
cluster2UPGMA cluster6UPGMA cluster10UPGMA cluster20UPGMA
1 7 5 2 1
2 5 3 2 1
3 4 2 2 1
4 3 2 2 1
5 2 1 1 1
6 1 1 1 1
7 6 4 1 1
8 8 6 3 1
>
> # example of 'inexact' clustering
> fas <- system.file("extdata", "50S_ribosomal_protein_L2.fas", package="DECIPHER")
> dna <- readDNAStringSet(fas)
> IdClusters(myXStringSet=dna, method="inexact", cutoff=0.05)
| | | 0% | |= | 1% | |= | 2% | |== | 3% | |=== | 4% | |==== | 5% | |==== | 6% | |===== | 7% | |====== | 8% | |====== | 9% | |======= | 10% | |======== | 11% | |======== | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 16% | |============ | 17% | |============= | 18% | |============= | 19% | |============== | 20% | |=============== | 21% | |=============== | 22% | |================ | 23% | |================= | 24% | |================== | 25% | |================== | 26% | |=================== | 27% | |==================== | 28% | |==================== | 29% | |===================== | 30% | |====================== | 31% | |====================== | 32% | |======================= | 33% | |======================== | 34% | |======================== | 35% | |========================= | 36% | |========================== | 37% | |=========================== | 38% | |=========================== | 39% | |============================ | 40% | |============================= | 41% | |============================= | 42% | |============================== | 43% | |=============================== | 44% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================== | 48% | |================================== | 49% | |=================================== | 50% | |==================================== | 51% | |==================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 56% | |======================================== | 57% | |========================================= | 58% | |========================================= | 59% | |========================================== | 60% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |================================================ | 68% | |================================================ | 69% | |================================================= | 70% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 80% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 83% | |=========================================================== | 84% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 93% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 100%
Time difference of 3.25 secs
cluster
Rickettsia prowazekii str. Dachau 102
Porphyromonas gingivalis W83 90
Porphyromonas gingivalis TDC60 90
Porphyromonas gingivalis ATCC 33277 90
Pasteurella multocida 671/90 103
Pasteurella multocida 36950 103
Xanthomonas campestris pv. campestris 79
Lactobacillus plantarum subsp. plantarum P-8 27
Lactobacillus plantarum ZJ316 27
Lactobacillus plantarum subsp. plantarum NC8 27
Lactobacillus plantarum subsp. plantarum ATCC 14917 27
Lactobacillus plantarum WCFS1 27
Xanthomonas citri pv. mangiferaeindicae LMG 941 80
Xanthomonas axonopodis pv. punicae str. LMG 859 80
Xanthomonas citri subsp. citri Aw12879 80
Xanthomonas vesicatoria ATCC 35937 80
Xanthomonas fuscans subsp. aurantifolii str. ICPB 11122 80
Xanthomonas fuscans subsp. aurantifolii str. ICPB 10535 80
Xanthomonas campestris pv. vesicatoria str. 85-10 80
Xanthomonas axonopodis pv. citrumelo F1 80
Xanthomonas oryzae pv. oryzicola BLS256 81
Xanthomonas campestris pv. raphani 756C 79
Bacillus halodurans C-125 56
Corynebacterium glutamicum SCgG2 20
Corynebacterium glutamicum K051 20
Synechocystis sp. PCC 6803 substr. PCC-N 57
Vibrio parahaemolyticus O1:K33 str. CDC_K4557 91
Vibrio parahaemolyticus BB22OP 91
Vibrio parahaemolyticus 10329 91
Streptococcus pyogenes SSI-1 115
Lactococcus lactis subsp. lactis IO-1 58
Lactococcus lactis subsp. lactis bv. diacetylactis str. LD61 58
Lactococcus lactis subsp. cremoris CNCM I-1631 58
Lactococcus lactis subsp. lactis KF147 58
Clostridium perfringens F262 41
Clostridium perfringens D str. JGS1721 41
Clostridium perfringens SM101 41
Mycoplasma pneumoniae M129-B7 4
Streptomyces avermitilis MA-4680 32
Treponema pallidum subsp. pallidum str. Nichols 104
Helicobacter pylori X47-2AL 59
Helicobacter pylori G27 60
Helicobacter pylori XZ274 61
Helicobacter pylori Shi417 61
Helicobacter pylori UM032 61
Helicobacter pylori Aklavik86 61
Helicobacter pylori 83 61
Helicobacter pylori Hp M3 62
Helicobacter pylori Hp P-15b 60
Helicobacter pylori Hp P-4d 62
Helicobacter pylori Hp P-62 62
Helicobacter pylori Hp P-30 60
Helicobacter pylori Hp P-3 62
Helicobacter pylori Hp H-23 62
Helicobacter pylori Hp H-19 62
Helicobacter pylori Hp H-18 62
Helicobacter pylori Hp H-11 59
Helicobacter pylori Hp A-16 62
Helicobacter pylori Hp H-44 62
Helicobacter pylori Hp H-41 62
Helicobacter pylori Hp H-27 59
Helicobacter pylori CPY6311 61
Helicobacter pylori CPY1313 61
Helicobacter pylori NQ4216 59
Helicobacter pylori Hp H-16 62
Helicobacter pylori Hp H-9 59
Helicobacter pylori Hp H-30 62
Helicobacter pylori Hp P-11 62
Helicobacter pylori CPY1962 59
Helicobacter pylori Hp H-4 62
Helicobacter pylori Hp H-5b 62
Helicobacter pylori CPY1124 59
Helicobacter pylori Hp H-42 62
Helicobacter pylori Hp H-36 62
Helicobacter pylori Hp H-10 62
Helicobacter pylori Hp P-74 59
Helicobacter pylori NQ4044 60
Helicobacter pylori Hp A-26 60
Helicobacter pylori CPY6081 59
Helicobacter pylori CPY3281 61
Helicobacter pylori NQ4200 59
Helicobacter pylori Hp A-8 62
Helicobacter pylori NQ4053 62
Helicobacter pylori Hp P-26 62
Helicobacter pylori CPY6261 59
Helicobacter pylori B128 62
Helicobacter pylori 98-10 61
Helicobacter pylori Hp A-6 62
Helicobacter pylori ELS37 59
Helicobacter pylori Puno135 59
Helicobacter pylori Puno120 59
Helicobacter pylori Gambia94/24 62
Helicobacter pylori 2017 62
Helicobacter pylori Sat464 59
Helicobacter pylori 908 62
Helicobacter pylori v225d 59
Helicobacter pylori J99 60
Helicobacter pylori P12 59
Helicobacter pylori F57 59
Helicobacter pylori F32 61
Helicobacter pylori F16 63
Helicobacter pylori Rif2 59
Helicobacter pylori PeCan18 64
Helicobacter pylori Hp P-13b 62
Helicobacter pylori Hp P-16 60
Helicobacter pylori Hp H-43 62
Helicobacter pylori Hp A-9 59
Hel