Evaluating tree distance metrics by cluster recovery

An effective measure of tree distance will recover clusters of similar trees. These datasets contain the results of tests modelled on those in Lin et al. (2012).

linTestOneResults

linTestTwoResults

linTestSPRResults

Format

A three-dimensional array.

Rows correspond to the clustering methods:

spc: spectral clustering
pam: partitioning around medioids
h...: hierarchical clustering using: h.cmp, complete; h.sng, single; and h.avg, average linkage.

Columns correspond to distance metrics; see 'Methods tested' below.

Slices correspond to values of k:

linTestOneResults: k = 30, 40, 50, 60, 70
linTestTwoResults: k = 10, 20, 30, 40
linTestSPRResults: k = 30, 40, 50, 60, 70

An object of class array of dimension 5 x 18 x 4.

An object of class array of dimension 5 x 18 x 5.

Source

Scripts used to generate data objects are housed in the data-raw directory.

Details

I used three approaches to generate clusters of similar trees, and tested each metric in its ability to recover these clusters (Lin et al., 2012).

For the first test, I generated 500 datasets of 100 binary trees with n = 40 leaves. Each set of trees was created by randomly selecting two k-leaf 'skeleton' trees, where k ranges from 0.3 n to 0.9 n. From each skeleton, 50 trees were generated by adding each of the remaining n - k leaves in turn at a uniformly selected point on the tree.

For the second and third test, each dataset was constructed by selecting at random two binary 40-leaf trees. From each starting tree, I generated 50 binary trees by conducting k leaf-label interchange (LLI) operations (test two) or k subtree prune and regraft (SPR) operations (test three) on the starting tree. An LLI operation swaps the positions of two randomly selected leaves, without affecting tree shape; an SPR operation moves a subtree to a new location within the tree.

For each dataset, I calculated the distance between each pair of trees. Trees where then partitioned into clusters using five methods, using the packages stats and cluster. I define the success rate of each distance measure as the proportion of datasets in which every tree generated from the same skeleton was placed in the same cluster.

For analysis of this data, see the accompanying vignette.

Methods tested

pid: Phylogenetic Information Distance (Smith 2020)
msid: Matching Split Information Distance (Smith 2020)
cid: Clustering Information Distance (Smith 2020)
qd: Quartet divergence (Smith 2019)
nye: Nye et al. tree distance (Nye et al. 2006)
jnc2, jnc4: Jaccard-Robinson-Foulds distances with k = 2, 4, conflicting pairings prohibited ('no-conflict')
joc2, jco4: Jaccard-Robinson-Foulds distances with k = 2, 4, conflicting pairings permitted ('conflict-ok')
ms: Matching Split Distance (Bogdanowicz & Giaro 2012)
mast: Size of Maximum Agreement Subtree (Valiente 2009)
masti: Information content of Maximum Agreement Subtree
nni_l, nni_u: Lower and upper bounds for nearest-neighbour interchange distance (Li et al. 1996)
spr: Approximate SPR distance
tbr_l, tbr_u: Lower and upper bound for tree bisection and reconnection (TBR) distance, calculated using TBRDist
rf: Robinson-Foulds distance (Robinson & Foulds 1981)
icrf: Information-corrected Robinson-Foulds distance (Smith 2020)
path: Path distance (Steel & Penny 1993)

References

Bogdanowicz D, Giaro K (2012). “Matching split distance for unrooted binary phylogenetic trees.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 150--160. doi: 10.1109/TCBB.2011.48 .

Li M, Tromp J, Zhang L (1996). “Some notes on the nearest neighbour interchange distance.” In Goos G, Hartmanis J, Leeuwen J, Cai J, Wong CK (eds.), Computing and Combinatorics, volume 1090, 343--351. Springer, Berlin, Heidelberg. ISBN 978-3-540-61332-9 978-3-540-68461-9, doi: 10.1007/3-540-61332-3_168 .

Kendall M, Colijn C (2016). “Mapping phylogenetic trees to reveal distinct patterns of evolution.” Molecular Biology and Evolution, 33(10), 2735--2743. doi: 10.1093/molbev/msw124 .

Nye TMW, Liò P, Gilks WR (2006). “A novel algorithm and web-based tool for comparing two alternative phylogenetic trees.” Bioinformatics, 22(1), 117--119. doi: 10.1093/bioinformatics/bti720 .

Robinson DF, Foulds LR (1981). “Comparison of phylogenetic trees.” Mathematical Biosciences, 53(1-2), 131--147. doi: 10.1016/0025-5564(81)90043-2 .

Smith MR (2019). “Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets.” Biology Letters, 15, 20180632. doi: 10.1098/rsbl.2018.0632 .

Smith MR (2020). “Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees.” Bioinformatics, online ahead of print. doi: 10.1093/bioinformatics/btaa614 .

Steel MA, Penny D (1993). “Distributions of tree comparison metrics---some new results.” Systematic Biology, 42(2), 126--141. doi: 10.1093/sysbio/42.2.126 .

Valiente G (2009). Combinatorial Pattern Matching Algorithms in Computational Biology using Perl and R, CRC Mathematical and Computing Biology Series. CRC Press, Boca Raton.

Yu Lin, Rajan V, Moret BME (2012). “A metric for phylogenetic trees based on matching.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(4), 1014--1022. doi: 10.1109/TCBB.2011.157 .