Creating visualisations¶

We have moved visualisation tools into their own program poppunk_visualise, both to reinforce our commitment to UK spellings, and so that you can rerun visualisations with different outputs and settings without rerunning the other parts of the code.

Starting with either a full database where you have fitted a model (Fitting new models (--fit-model)), or output where you have assigned queries using an existing database (Query assignment (poppunk_assign)), you can create outputs for external interactive visualisation tools:

Microreact – a genomic epidemiology visualisation tool, displaying clusters, phylogeny and accessory clustering.
GrapeTree – a tool to visualise strains (designed for cgMLST).
Phandango – visualisation linking genomes and phylogenies.
Cytoscape – a network visualisation tool, which can be used to create, view and manipulate a layout of the graph.

At least one of these output types must be specified as a flag.

Important

If you run a visualisation on output from query assignment (Query assignment (poppunk_assign)) this will not contain all the necessary distances, and they will be calculated before the visualisation files are produced. You will see a message Note: Distance will be extended to full all-vs-all distances. If you are running multiple visualisations this calculation will be completed every time. To avoid this re-run your assignment with --update-db, which will add these distances in permanently.

If you are only interested in visualising sequences that are closely-related to those in a set of query sequences, then the quickest approach is to use the --save-partial-query-graph option when assigning sequences. This will generate a list of sequences spanning both your queries, and all the reference database isolates that are in the same network components (i.e., excluding those that are not related to your query isolates); this is contained in the file [prefix]_query.subset. This list can be passed to the visualisation command using the flag --use-partial-query-graph ./[prefix]/[prefix]_query.subset to only visualise the parts of the network that are relevant to your query.

Common options¶

Some typical commands for various input settings (with --microreact, but this can be altered to any output type) with a database example.

Visualisation of a full database:

poppunk_visualise --ref-db example_db --output example_viz --microreact

Visualisation after query assignment:

poppunk_visualise --ref-db example_db --query-db example_query --output example_viz --microreact

Visualisation when sketches and models are in different folders:

poppunk_visualise --ref-db example_db --previous-clustering example_lineages/example_lineages_lineages.csv \
 --model-dir example_lineages --output example_viz --microreact

Visualisation with a lineage model, which has been queried (query-query distances must be provided):

poppunk_visualise --distances example_query/example_query.dists --ref-db example_db \
 --model-dir example_lineages --query-db example_lineage_query \
 --output example_viz --microreact

Notable modifiers include:

--include-files – give a file with a subset of names to be included in the visualisation.
--external-clustering – other cluster names to map to strains (such as MLST, serotype etc), as described in model fitting and query assignment.
--info-csv – similar to the above, but a CSV which is simply (inner-)joined to the output on sample name.
--rapidnj – the location of a rapidnj binary, used to build the core NJ tree. We highly recommend using this for any tree-building (and is included with the conda install). This defaults to rapidnj on the PATH. Set blank to use dendropy instead (slower, especially for large datasets).
--core-only/--accessory-only – use the core or accessory fit from an individually refined model (see Using core/accessory only).
--threads, --gpu-dist, --deviceid, --strand-preserved – querying options used if extra distance calculations are needed. To avoid these, rerun your query with --update-db.

Microreact¶

Adding the --microreact flag will create the following files:

_microreact_clusters.csv – the strain or lineage assignments with headers formatted for microreact, plus anything from --info-csv.
_core_NJ.nwk – a neighbour-joining tree from the core distances.
_perplexity20.0_accessory_mandrake.dot – a 2D embedding of the accessory distances, produced using mandrake (in this case with Setting the perplexity parameter for mandrake 20).

From version 2.5.0 this will also include:

.microreact – a Microreact compatible JSON containing all of the above, which
can be uploaded directly.

If you add --api-key and provide your account’s API key <https://docs.microreact.org/api/access-tokens>__ this will automatically create an instance, and the URL will be output to the terminal.

Otherwise, open https://microreact.org/upload in your browser, and drag and drop these three files to create your visualisation. Here is the result of running the visualisation on the Listeria BGMM model:

poppunk_visualise --ref-db listeria --microreact --threads 8

Graph-tools OpenMP parallelisation enabled: with 8 threads
PopPUNK: visualise
Loading BGMM 2D Gaussian model
Completed model loading
Building phylogeny
Writing microreact output
Parsed data, now writing to CSV
Running mandrake
Running on CPU
Preprocessing 128 samples with perplexity = 20 took 0ms
Optimizing   Progress: 99.9%, eta=0.0010, Eq=0.2583546852, clashes=2.1%
Optimizing done in 30s
Provide --api-key to create microreact automatically

Done

This can be viewed at https://microreact.org/project/3JAZKqzJiaNyViWXindNLv-listeria-poppunk-perplexity-20:

Microreact page for Listeria monocytogenes

Useful controls include the tree shape, accessed with the control slider in the top right of the phylogeny page, and the metadata labels, accessed with the ‘eye’ on the right of the page. When visualising lineages, changing the ‘Colour by’ is useful to compare results from different ranks.

Setting the perplexity parameter for mandrake¶

In mandrake an embedding of the accessory genome distances is found which represents local structure of the data. Isolates with similar accessory content will visually appear in clusters together.

The perplexity sets a guess about the number of close neighbours each point has, and is a trade-off between local and global structure. t-SNE (and by extension mandrake) is reasonably robust to changes in the perplexity parameter (set with --perplexity when creating microreact output with --microreact), however we would recommend trying a few values to get a good embedding for the accessory distances.

There is a good discussion of the effect of perplexity here and the sklearn documentation shows some examples of the effect of changing perplexity. In mandrake, points will usually appear ‘tighter’ than in t-SNE, and form more obvious clusters.

In the example with Listeria monocytogenes above, a perplexity of 20 gives clear clustering of the accessory genome content, condordant with the core genome structure (data):

With a lower perplexity of 5, the clustering is not as tight, but it still looks ok (data):

Microreact plot of results with perplexity = 5

30 is a good default, but you may wish to try other values, particularly with larger or smaller datasets. You can re-run the mandrake using the poppunk_mandrake command, providing the distances from the previous run:

poppunk_mandrake --distances strain_db/strain_db.dists --output strain_db \
--perplexity 50

GrapeTree¶

Adding the --grapetree flag will create:

_microreact_clusters.csv – the strain or lineage assignments with headers formatted for grapetree, plus anything from --info-csv.
_core_NJ.nwk – a neighbour-joining tree from the core distances.

Open https://achtman-lab.github.io/GrapeTree/MSTree_holder.html in your browser, and use the ‘Load files’ button once for each of the files to add the tree and strain assignments to GrapeTree. This will display an unrooted tree with your clusters:

One of GrapeTree’s key features is the ability to collapse branches, and condense information into nodes. By going to Tree Layout -> Branch style -> Collapse branches, and setting the long branch to be shortened, one can obtain a view which shows strain prevalence and relationships:

There is also a handy ‘Export to Microreact’ button in GrapeTree, though this will not include the accessory embedding, so you may wish to add the --microreact flag and generate the files yourself.

Phandango¶

Adding the --phandango flag will create:

_phandango_clusters.csv – the strain or lineage assignments with headers formatted for phandango, plus anything from --info-csv.
_core_NJ.tree – a neighbour-joining tree from the core distances.

Open https://www.phandango.net in your browser, and use the ‘Load files’ button once for each of the files to add the tree and strain assignments to GrapeTree. This will display the tree with your clusters:

Press ‘s’ to access settings, and ‘p’ to create an .svg file. Phandango is most useful with a genome (.gff file), and either a plot of recombination, accessory genome analysis or GWAS results. See the documentation for more information.

Cytoscape¶

Cytoscape is different from the above modes as it creates a layout and visualisation of the graph used to create strains from distances. This can be useful for more detailed investigation of network scores, particularly in strains which have less than perfect transitivity.

Add the --cytoscape option, and also --network-file to point to the network you wish to visualise:

poppunk_visualise --ref-db listeria --cytoscape --network-file listeria/listeria_graph.gt

Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading BGMM 2D Gaussian model
Writing cytoscape output
Network loaded: 128 samples
Parsed data, now writing to CSV

Done

Which will create:

_cytoscape.csv – the strain or lineage assignments with headers formatted for cytoscape, plus anything from --info-csv.
_cytoscape.graphml – the network in graphml format.

The .graphml file is an XML file which contains definitions of the nodes (samples) and edges (within-strain distances) connecting them. If you used --graph-weights when you fitted your model the edges will be annotated with their Euclidean distances in the ‘weight’ attribute (which you will need to tell cytoscape). These can be added with the poppunk_add_weights script if this flag was not used.

Open cytoscape and drag and drop the .graphml file onto the window to import the network. Import -> table -> file to load the CSV. Click ‘Select None’ then add the ‘id’ column as a key, and any required metadata columns (at least the ‘Cluster’ column) as attributes. Make sure ‘Node Table Columns’ is selected as the data type.

The graphml file does not contain a layout for the graph, that is, positions of nodes and edges are not specified for a visualisation. These will be calculated by cytoscape, automatically for small graphs, and with the ‘Layout’ menu for larger graphs. The ‘Prefuse force directed layout’ or ‘yFiles Organic Layout’ work well. Select the ‘weight’ dropdown to use the edge-lengths when drawing the network.

Warning

We have found that making graphs with >10k nodes may exceed the memory on a typical laptop. To view larger graphs, first splitting into subgraphs of each connected component is very helpful. Older versions of cytoscape allowed you to split the graph into connected components, but newer versions have removed this feature. This can be done programmatically with networkx or graph-tool in python, or igraph in R.

Click on ‘Style’ and change the node fill colour to be by cluster, the mapping type as discrete, then right click to autogenerate a colour scheme (‘Random’ is usually best). You can also modify the node size and shape here. Here is the Listeria example, using edge weights in the layout:

If you used assign query mode you will also have a column with ‘Query’ or ‘Reference’, which can be used to map to different shapes or colours:

Adding an info CSV, or loading external tables directly into cytoscapes gives further options for investigating individual strains:

In some cases, edges which are between strain links may have been erroneously included in the network. This could be due to poor model fit, or a poor quality sequence. Use Tools -> NetworkAnalyzer -> Analyze Network to compute information for each node and edge. It may help to analyze connected components separately. They can be split under Tools -> NetworkAnalyzer -> Subnetwork Creation.

Here is an example where an errant node is connecting two clusters into one large cluster, which should be split:

The incorrect node in question has a low CluteringCoefficient and high Stress. The EdgeBetweeness of its connections are also high. Sorting the node and edge tables by these columns can find individual problems such as this.