How did I do it – a short guide to a nice graph

At the end of the 20th and into the new 21st century, phylogenies have been largely reduced to stick graphs, often quite unappealing ones. In the papers I co-authored, I always tried to enhance the graphics, and I have not rarely been asked how I do it. So here's my protocol for a little basic tree-and-networks magic.

These days, when we think of a phylogeny, we usually mean a one-dimensional stick graph, a phylogenetic tree much different from Haeckel's famous oak (see this GWoN post for a short introduction into tree metaphors and further links).

Evolution is – in essence – change over time. Hence, it should be a standard to show (or at least document) the phylogram – the tree with branch lengths reflecting the amount of change, or genetic or other distances. Unfortunately, many publications still report only cladograms, trees without branch lengths.

The figure that got me a Nature co-authorship (Friis et al. 2007, fig. 3). See also this post

A possible phylogeny of extinct and extant cycads (Grimm, 1999) The branch lengths in this spiral tree are not proportional, but reflect the amount of change.

Molecular evolution of the internal transcribed spacers in beech, exemplified at hand of four sequence patterns (coloured version of Fig. 8 eventually published in Grimm, Denk & Hemleben 2007).
Even when I still did "cladistic" analyses (for my Diplom thesis), I tried to show my results in a more pleasant, visually appealing way. Without knowing it, I entered very early the realm of tree metaphors.

My first phylogenetic paper (Denk et al. 2002) naturally included just a standard phylogram, much in contrast to my doctoral thesis (Grimm 2003), which is quite packed with trees (straightforward or metaphorical ones) and graphics trying to visualise evolution rather than phylogeny.

I always felt phylogeneticists should not stop by inferring a tree, but a) try to understand its basis, the underlying data, b) to present a processed form of the inference results, and c) show all the data has to offer. The latter directly forced me into using networks. Mainly splits graphs such as the distance-based neighbour-net (Bryant & Moulton 2002, 2004), planar meta-phylogenetic networks, and support consensus networks (Holland & Moulton 2003; Schliep et al. 2017) based on Bayesian-sampled topologies or bootstrap (BS) pseudoreplicates.

Eventually, I managed to publish my first non-tree graphs in 2006
Some networks for maples (Grimm et al. 2006, figs 2, 3; open access). Left, distance-based neighbour-nets; right, a Bayesian posterior probability network ('bipartition network')

And in the following decade (till the end of my scientific career in 2016), any paper with me as co-author would include something like this (at least in the supplement)
2ISP-aware neighbour-nets with bootstrap supports mapped for selected phylogenetic splits (Potts et al. 2014, fig. 3)

You want that, too? Here's how I did it.

My standard work pipeline

Dealing with non-trivial data sets, I soon realised that Bayesian inference has little to offer. It is a method to infer a tree and assess the probability of its branches. Other algorithms also infer trees, but bootstrapping (or jack-knifing) have the potential to capture conflicting signals in the data. Bayesian posterior probabilities will usually tilt to one alternative. For the 2006 paper, I was brought into contact with Alexandros Stamatakis, who just finished his doctoral thesis and programmed RAxML III, and since then RAxML (now RAxML 8) has been my prime choice to analyse my data.

Ideally, you have a LINUX system (or computer cluster), so you can compile the most recent version of RAxML that can be found at GitHub. I often used the Windows-executables, which nice people compile from time-to-time (also included in the GitHub file list) on my stand-alone computer when dealing with quite small data sets.

The batch file for my standard analyses of oligogene data sets (here: a data set for Drosanthumum, an analysis I did for my last professional scientist paper) includes the following code lines (gray background). The input was a oligogene (5-gene) matrix labelled "DrosOnly.all.epf" in (extended) PHYLIP format ("extended" means, the names have no length restrictions as in the classic PHYLIP format); "1_ITS", etc. are exclusion files, "part" is the file defining the partitions (see RAxML's manual for formatting and syntax). The installed RAxML Windows/DOS executables (needs to be in same folder as the batch file) is "raxmlHPC-PTHREADS-SSE3"

Step 1: Cutting the combined matrix into its constituent genes
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 1_ITS
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 2_trnSG
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 3_rpl16
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 4_tQr16
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 5_r16tK
raxmlHPC-PTHREADS-SSE3 -T 1 -s DrosOnly.all.epf -m GTRCAT -n xx -E 6_cpAll -q part
The last code line generates a plastid-only dataset. When combining data from different genomes, e.g. nuclear and plastid, you always should run mututally exclusive tree and branch-support analyses to test for congruence: commonly used tests like the ILD test will not recognize incongruence and fail to identify jumping taxa, so-called "rogue taxa" in most cases.

Step 2: Inferring ML trees and establish branch-support via fast bootstrapping; number of necessary bootstrap replicates determined by Pattengale et al.'s 2009 extended majority rule bootstop criterion; the -T option determines how many processors will be used for parallelisation; I have a tetra-core i7, hence dedicating three or four leaves me five or six for running other programmes.

raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf -m GTRCAT -n 0_all.q -f a -p 32123 -x 65489 -# autoMRE -q part
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf -m GTRCAT -n 0_all.noq -f a -p 32123 -x 65489 -# autoMRE
These two lines invoke a full-partitioned and unpartitioned analysis. It has become a fashion to test models, and e.g. use PartitionFinder to define partitions: for oligogene datasets this is a waste of time, and may even be counterproductive. Hence, I just run a full (and biologically/genetically meaningful) partitioned analyses (e.g. noncoding vs. coding, in case of protein-coding genes one/two partitions for 1st and 2nd, and another for 3rd codon position) and an unpartitioned one as extreme endpoints. One observation I made is that they are usually not that different in the result.]
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.1_ITS -m GTRCAT -n 1_ITS -f a -p 32123 -x 65489 -# autoMRE
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.2_trnSG -m GTRCAT -n 2_trnSG -f a -p 32123 -x 65489 -# autoMRE
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.3_rpl16 -m GTRCAT -n 3_rpl16 -f a -p 32123 -x 65489 -# autoMRE
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.4_tQr16 -m GTRCAT -n 4_tQr16 -f a -p 32123 -x 65489 -# autoMRE
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.5_r16tK -m GTRCAT -n 5_r16tK -f a -p 32123 -x 65489 -# autoMRE
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.6_cpAll -m GTRCAT -n 6_cp.q -f a -p 32123 -x 65489 -# autoMRE -q part.6_cpAll
raxmlHPC-PTHREADS-SSE3 -T 3 -s DrosOnly.all.epf.6_cpAll -m GTRCAT -n 6_cp.noq -f a -p 32123 -x 65489 -# autoMRE

The result is a collection of trees (single-gene and combined) and bootstrap analyses (single-gene and combined), partitioned and unpartitioned. Since it's quickly done, and there are options to view and compare the results (such as bootstrap consensus networks, see e.g. Schliep et al. 2017), there is no excuse for not doing a full analysis these days. A full analysis should include single-gene inferences and genome-exclusive inferences to make sure the combined tree is not riddled by branching artefacts due to conflicting or partly incompatible signal. (You can find batch/shell files and according data input and RAxML output files in the supplementary data archives to my more recent papers including phylogenetic analyses, e.g. [Osmundaceae] [Loranthaceae1] [Loranthaceae2]; more here)

For the trees, I always opened and viewed the "RAxML_branchlabelled" tree in Dendroscope to avoid the potential re-rooting error (see Czech, Huerta-Cepas & Stamatakis 2017) in some tree viewers (a problem with the NEWICK format, since it fixes the support essentially to nodes, most-recent common ancestors, not branches/splits.

Primary graphing can be done in Dendroscope, such as colouring of subtrees and collapsing them into parallelograms proportional to the amount of constituent OTUs, and where the upper and lower edge reflect the length of the least and most distinct OTU with the collapsed subtree.

Dendroscope allows exporting as PDF, SVG, and other formats. I was lucky enough to purchase professional graphic software from my grant money: I use(d) Adobe Illustrator (a now outdated, but standalone version: 3.3) for manipulating the exported PDFs (or EPSs), and (originally) CorelDraw (equally outdated, but also working standalone version X4) for the exported SVG files (not recommendable anymore).  But there are no-cost alternatives (both Adobe and Corel now only sell subscriptions charging you monthly) such as Inkscape. Inkscape can handle PDF, EPS and SVG exports (in fact, it is more versatile in general regarding import and export than the costly alternatives), the only drawback is that you (still?!) don't have a layer manager. And the latter is a great tool, when you are trying to pimp up phylogenetic reconstructions including hundreds or thousands of elements.

An Adobe Illustrator (AI) project based on the EPS export from Dendroscope. Some clades were proportionally collapsed (parallelograms) and coloured in Dendroscope. For better handling, corresponding elements have been moved to different layers in AI (support values, taxon labels, the tree graph).

For the networks, the only option for graphical enhancement is still SplitsTree4, a programme that appeared for some time to be deserted by its inventors, but enjoys re-newed attention (current version is 14.6, and the last version of the now comprehensive manual dates to September last year).

SplitsTree can open and read in the "RAxML_bootstrap" files (or the .t file produced by MrBayes, if you want to have a PP network). A little window pop-up informing you about the to read-in trees (bootstrap pseudoreplicates), you just click the "Apply" button.

Side note: When you want to make Bayesian PP networks, keep the number of sampled topologies moderate, e.g. 1,000 or 10,000 per run, either by using a high burn-in fraction, e.g. keeping only the last 10% of sampled topologies (can be done manually by editing the .t file in a simple text-editor) or a low sample number (e.g. for 10 million generations you sample every 1000th, 5000th or 10,000th generation). 
My standard set-up for morphological data (typically very rough likelihood tree surfaces) was running a million generations on 10 parallel runs with no heating, and to sample every 1000th generation and then concatenate the .t files of all 10 runs into one for the SplitsTree import.
The more trees you try to consense and the more taxa and alternative splits you have, the more memory you need. SplitsTree informs you about the used memory in the lower right corner. If you reach the maximum the programme will just freeze. Opening several files will fill the contingent consecutively, but closing them seems not to decrease the used memory. For memory-intensive files, you have to shut down the program and re-call it.

You'll see the first bootstrap (BS) pseudoreplicate as an unrooted phylogram (the most appropriate image: all support analysis works unrooted, rooting is just a graphical post-analysis manipulation).

Then you choose "Networks" > "Consensus Network" and a mask like this pops up

"Processing pipeline" window for consensus networks.

You change to "count" and select a cut-off threshold. Large, complex data will have many different alternatives in the BS replicates (data with weak signal will have many alternatives in the Bayesian-sampled topologies, too), and you may run into memory capacity issues when selecting too low thresholds. I usually go for 0.15 or 0.2, which means that only splits (taxon bipartitions; branches in phylogenetic trees) are shown that occured in at least 15% or 20% of the BS replicates.

The product is a "bootstrap consensus network" (Schliep et al. 2017) in which the edge lengths are proportional to the BS support of the (alternative) phylogenetic splits (Please don't call it "SplitsTree network" or just "splits graph"; SplitsTree can generate many different networks, and splits graphs can be generated from very different data sources.)

The "Format nodes and edges" window in SplitsTree
Using the graphical options ("View" > "Format nodes and edges"; you can pop up the mask with CTRL-J), you can highlight (colour) individual edge bundles; show or hide confidence values etc. You should do this in SplitsTree because when you click on one edge, the entire corresponding edge bundle will be selected. You can also drag the edge bundles to make topological conflicts and ambiguities more visible. This is just a graphical manipulation, as the edge lengths remain fixed.

To get the nice bubbles seen in my graphs (a relative unique feature even in network-showing literature), you go to "Edit" > "Select labeled nodes"; call the graphical option mask (if not open already), and choose circle/square and a size of 7+ (otherwise the bubbles will turn out to be too small for the figure).

A neighbour-net calculated automatically by SplitsTree when opening a pairwise distance matrix

Same network, with bubbles highlighting the position of "taxa" (here: German political parties that competed in the 2017 election: GWoN; related posts)

When you use the filled bubbles (circles, squares), make sure that you change their outline to "invisible". Otherwise your export will have two objects per bubble, one for the filling and one for the outline. Which can be annoying for further processing (outlines are easily changed within the graphic programmes) 

I use EPS as export for further manipulation in Adobe Illustrator, and SVG for CorelDraw (here the export and import still works without odd elements, in contrast to Dendroscope). Note, that you have to add the proper suffix (e.g. "mynet.eps" as filename), because SplitsTree will not add it automatically. Don't use the PDF export option here, SplitsTree and Dendroscope have been made by the same group, but Dendroscope's PDF export includes elements, but SplitsTree's is just a down-scaled image.

Screenshot of a typical CorelDraw project for a pimped-up bootstrap consensus network (this one was for Khanum et al. 2016). Basic graph structure optimised using Splitstree, exported as SVG, and opened in CorelDraw for graphical enhancement (replacing terminal edge complexes by coloured triangles, fusing support information from different analysis (here: ML bootstrapping and Bayesian PP)

What if a tree is not enough

In case you have conflicting support patterns, a neighbour-net is always a good option to map them.

For this you need a pairwise distance matrix, ideally in PHYLIP format, and open it in SplitsTree (there are no length restriction regarding taxon names). SplitsTree can open NEXUS-files, but the only problem-free export/import is the "simple NEXUS" option of Mesquite. I have still a copy of PAUP*, so I use it for calculating the distances. In principle, SplitsTree can calculate distances directly from a character matrix, but in the past one could find inconsistencies, small errors. This maybe not a problem anymore. Still, I recommend using a programme you usally rely on to calculate the pairwise distances and export them as a PHYLIP-formatted matrix for import in SplisTree and auto-calculation of the neighbour-net (additional algorithms are implemented, trees and networks)

With the neighbour-net, BS consensus network(s) and a graphic software at hand, you just identify corresponding groups and map the BS support on the brackets or edge bundles referring to that split. And you can enrich your paper with something like this:

Happy drawing.

Some important links

Cited and selected literature
Bryant D, Moulton V. 2002. NeighborNet: an agglomerative method for the construction of planar phylogenetic networks. In: Guigó R, and Gusfield D, eds. Algorithms in Bioinformatics, Second International Workshop, WABI. Rome, Italy: Springer Verlag, Berlin, Heidelberg, New York, p. 375-391. — The original paper introducing the neighbour-net, apparently ignored by biologists.
Bryant D, Moulton V. 2004. Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21:255-265. — The re-publication in a biological journal.
Denk T, Grimm G, Stögerer K, Langer M, Hemleben V. 2002. The evolutionary history of Fagus in western Eurasia: Evidence from genes, morphology and the fossil record. Plant Systematics and Evolution 232:213-236.
Friis EM, Crane PR, Pedersen KR, Bengtson S, Donoghue PCJ, Grimm GW, Stampanoni M. 2007. Phase-contrast X-ray microtomography links Cretaceous seeds with Gnetales and Bennettitales. Nature 450:549-552.
Grimm GW. 1999. Phylogenie der Cycadales. Diploma thesis. Eberhard Karls Universität. [English abstract]
Grimm GW, Renner SS, Stamatakis A, Hemleben V. 2006. A nuclear ribosomal DNA phylogeny of Acer inferred with maximum likelihood, splits graphs, and motif analyses of 606 sequences. Evolutionary Bioinformatics 2:279–294.
Holland B, Moulton V. 2003. Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176 — This is the publication to quote when using the consensus network approach implemented in SplitsTree.
Quote for SplitsTree4 — Huson DH, Bryant D. 2006. Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23:254–267.
Khanum R, Surveswaran S, Meve U, Liede-Schumann S. 2016. Cynanchum (Apocynaceae: Asclepiadoideae): A pantropical Asclepiadoid genus revisited. Taxon 65:467–486. [PDF]
Schliep K, Potts AJ, Morrison DA, Grimm GW. 2017. Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.
Quote for bootstop criterion — Pattengale ND, Masoud A, Bininda-Emonds ORP, Moret BME, Stamatakis A. 2009. How many bootstrap replicates are necessary? In: Batzoglou S, ed. RECOMB 2009. Berlin, Heidelberg: Springer-Verlag, p. 184–200.
Quote for RAxML 8 — Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
Stamatakis A. 2015. Using RAxML to infer phylogenies. Current Protocols in Bioinformatics 51:6.14.11–16.14.14. doi: 10.1002/0471250953.bi0614s51 — Useful for freshlings and experienced tree-inferrers.

No comments:

Post a Comment

Enter your comment ...