Creating visualisations
=======================
We have moved visualisation tools into their own program ``poppunk_visualise``, both
to reinforce our commitment to UK spellings, and so that you can rerun visualisations
with different outputs and settings without rerunning the other parts of the code.

Starting with either a full database where you have fitted a model (:doc:`model_fitting`),
or output where you have assigned queries using an existing database (:doc:`query_assignment`),
you can create outputs for external interactive visualisation tools:

- `Microreact `__ -- a genomic epidemiology visualisation tool, displaying clusters, phylogeny
  and accessory clustering.
- `GrapeTree `__ -- a tool to visualise strains (designed for cgMLST).
- `Phandango `__ -- visualisation linking genomes and phylogenies.
- `Cytoscape `__ -- a network visualisation tool, which can be used to create, view and
  manipulate a layout of the graph.

At least one of these output types must be specified as a flag.

.. important::
    If you run a visualisation on output from query assignment (:doc:`query_assignment`)
    this will not contain all the necessary distances, and they will be calculated before
    the visualisation files are produced.
    You will see a message ``Note: Distance will be extended to full all-vs-all distances``.
    If you are running multiple visualisations this calculation will be completed every time.
    To avoid this re-run your assignment with ``--update-db``, which will add these distances in permanently.

If you are only interested in visualising sequences that are closely-related to those in a
set of query sequences, then the quickest approach is to use the ``--save-partial-query-graph``
option when assigning sequences.
This will generate a list of sequences spanning both your queries, and all the reference
database isolates that are in the same network components (i.e., excluding those that are
not related to your query isolates); this is contained in the file ``[prefix]_query.subset``.
This list can be passed to the visualisation command using the flag
``--use-partial-query-graph ./[prefix]/[prefix]_query.subset`` to only visualise the parts
of the network that are relevant to your query.

Common options
--------------
Some typical commands for various input settings (with ``--microreact``, but this can
be altered to any output type) with a database ``example``.

Visualisation of a full database::

    poppunk_visualise --ref-db example_db --output example_viz --microreact

Visualisation after query assignment::

    poppunk_visualise --ref-db example_db --query-db example_query --output example_viz --microreact

Visualisation when sketches and models are in different folders::

    poppunk_visualise --ref-db example_db --previous-clustering example_lineages/example_lineages_lineages.csv \
    --model-dir example_lineages --output example_viz --microreact

Visualisation with a lineage model, which has been queried (query-query distances must be provided)::

    poppunk_visualise --distances example_query/example_query.dists --ref-db example_db \
    --model-dir example_lineages --query-db example_lineage_query \
    --output example_viz --microreact

Notable modifiers include:

- ``--include-files`` -- give a file with a subset of names to be included in the visualisation.
- ``--external-clustering`` -- other cluster names to map to strains (such as MLST, serotype etc),
  as described in model fitting and query assignment.
- ``--info-csv`` -- similar to the above, but a CSV which is simply (inner-)joined to the output on sample name.
- ``--rapidnj`` -- the location of a `rapidnj `__ binary,
  used to build the core NJ tree. We highly recommend using this for any tree-building (it is
  included with the conda install). This defaults to ``rapidnj`` on the PATH. Set blank to use
  dendropy instead (slower, especially for large datasets).
- ``--core-only``/``--accessory-only`` -- use the core or accessory fit from an individually refined
  model (see :ref:`indiv-refine`).
- ``--threads``, ``--gpu-dist``, ``--deviceid``, ``--strand-preserved`` -- querying options used if extra
  distance calculations are needed. To avoid these, rerun your query with ``--update-db``.

Microreact
----------
Adding the ``--microreact`` flag will create the following files:

- _microreact_clusters.csv -- the strain or lineage assignments with headers formatted for microreact,
  plus anything from ``--info-csv``.
- _core_NJ.nwk -- a neighbour-joining tree from the core distances.
- _perplexity20.0_accessory_mandrake.dot -- a 2D embedding of the accessory distances, produced using
  `mandrake `__ (in this case with :ref:`perplexity` 20).

From version 2.5.0 this will also include:

- .microreact -- a Microreact compatible JSON containing all of the above, which can be uploaded directly.

If you add ``--api-key`` and provide `your account's API key __`
this will automatically create an instance, and the URL will be output to the terminal.

Otherwise, open https://microreact.org/upload in your browser, and drag and drop these
three files to create your visualisation.
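Because ``--info-csv`` is described above as an (inner-)join on sample name, it can be
worth checking that the names in your metadata match those in the cluster output before
uploading. The following is a minimal sketch of such a check using only the Python
standard library. The column name ``id`` and the example rows are assumptions made for
illustration; check the headers in your own files.

```python
import csv
import io


def sample_names(csv_text, column):
    """Return the set of values found in `column` of a CSV supplied as text."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def join_report(cluster_names, info_names):
    """Summarise how two sets of sample names overlap."""
    return {
        "kept": sorted(cluster_names & info_names),           # present in both
        "no_metadata": sorted(cluster_names - info_names),    # clusters only
        "not_in_output": sorted(info_names - cluster_names),  # metadata only
    }


# Hypothetical file contents; in practice read these from disk.
clusters_csv = "id,Cluster\nsample1,1\nsample2,1\nsample3,2\n"
info_csv = "id,country\nsample1,UK\nsample3,FR\nsample9,DE\n"

report = join_report(sample_names(clusters_csv, "id"),
                     sample_names(info_csv, "id"))
print(report)
```

Names listed under ``no_metadata`` or ``not_in_output`` usually point to a typo or a
file-extension mismatch in the sample names.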
Here is the result of running the visualisation on the *Listeria* BGMM model::

    poppunk_visualise --ref-db listeria --microreact --threads 8

    Graph-tools OpenMP parallelisation enabled: with 8 threads
    PopPUNK: visualise
    Loading BGMM 2D Gaussian model
    Completed model loading
    Building phylogeny
    Writing microreact output
    Parsed data, now writing to CSV
    Running mandrake
    Running on CPU
    Preprocessing 128 samples with perplexity = 20 took 0ms
    Optimizing
    Progress: 99.9%, eta=0.0010, Eq=0.2583546852, clashes=2.1%
    Optimizing done in 30s
    Provide --api-key to create microreact automatically
    Done

This can be viewed at https://microreact.org/project/3JAZKqzJiaNyViWXindNLv-listeria-poppunk-perplexity-20:

.. image:: images/microreact.png
   :alt: Microreact page for Listeria monocytogenes
   :align: center

Useful controls include the tree shape, accessed with the control slider in the
top right of the phylogeny page, and the metadata labels, accessed with the 'eye'
on the right of the page. When visualising lineages, changing the 'Colour by' is
useful to compare results from different ranks.

.. _perplexity:

Setting the perplexity parameter for mandrake
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In mandrake an embedding of the accessory genome distances is found which
represents local structure of the data. Isolates with similar accessory content
will visually appear in clusters together.

The perplexity sets a guess about the number of close neighbours each point
has, and is a trade-off between local and global structure. t-SNE (and by extension mandrake)
is reasonably robust to changes in the perplexity parameter (set with ``--perplexity``
when creating microreact output with ``--microreact``),
however we would recommend trying a few values to get
a good embedding for the accessory distances.

There is a good discussion of the effect of perplexity `here `__
and the sklearn documentation shows some examples of the effect of `changing perplexity `__.
In mandrake, points will usually appear 'tighter' than in t-SNE, and form more obvious clusters.

In the example with *Listeria monocytogenes* above, a perplexity of 20 gives clear
clustering of the accessory genome content, concordant with the core genome structure
(`data `__).

With a lower perplexity of 5, the clustering is not as tight, but it still looks ok
(`data `__):

.. image:: images/microreact_perplexity5.png
   :alt: Microreact plot of results with perplexity = 5
   :align: center

30 is a good default, but you may wish to try other values, particularly with
larger or smaller datasets. You can re-run mandrake using the ``poppunk_mandrake``
command, providing the distances from the previous run::

    poppunk_mandrake --distances strain_db/strain_db.dists --output strain_db \
    --perplexity 50

GrapeTree
---------
Adding the ``--grapetree`` flag will create:

- _microreact_clusters.csv -- the strain or lineage assignments with headers formatted for grapetree,
  plus anything from ``--info-csv``.
- _core_NJ.nwk -- a neighbour-joining tree from the core distances.

Open https://achtman-lab.github.io/GrapeTree/MSTree_holder.html in your browser, and use
the 'Load files' button once for each of the files to add the tree and strain assignments
to GrapeTree. This will display an unrooted tree with your clusters:

.. image:: images/grapetree.png
   :alt: Grapetree visualisation of results
   :align: center

One of GrapeTree's key features is the ability to collapse branches, and condense information
into nodes. By going to Tree Layout -> Branch style -> Collapse branches, and setting the long
branch to be shortened, one can obtain a view which shows strain prevalence and relationships:

.. image:: images/grapetree_collapse.png
   :alt: Grapetree visualisation of results
   :align: center

There is also a handy 'Export to Microreact' button in GrapeTree, though this will not include
the accessory embedding, so you may wish to add the ``--microreact`` flag and generate the files yourself.
Phandango
---------
Adding the ``--phandango`` flag will create:

- _phandango_clusters.csv -- the strain or lineage assignments with headers formatted for phandango,
  plus anything from ``--info-csv``.
- _core_NJ.tree -- a neighbour-joining tree from the core distances.

Open https://www.phandango.net in your browser, and use
the 'Load files' button once for each of the files to add the tree and strain assignments
to Phandango. This will display the tree with your clusters:

.. image:: images/phandango.png
   :alt: Phandango visualisation of results
   :align: center

Press 's' to access settings, and 'p' to create an .svg file. Phandango is most useful
with a genome (.gff file), and either a plot of recombination, accessory genome analysis
or GWAS results. See the documentation for more information.

.. _cytoscape-view:

Cytoscape
---------
Cytoscape is different from the above modes as it creates a layout and visualisation of
the graph used to create strains from distances. This can be useful for more detailed
investigation of network scores, particularly in strains which have less than perfect
transitivity.

Add the ``--cytoscape`` option, and also ``--network-file`` to point to the network you
wish to visualise::

    poppunk_visualise --ref-db listeria --cytoscape --network-file listeria/listeria_graph.gt

    Graph-tools OpenMP parallelisation enabled: with 1 threads
    PopPUNK: visualise
    Loading BGMM 2D Gaussian model
    Writing cytoscape output
    Network loaded: 128 samples
    Parsed data, now writing to CSV
    Done

Which will create:

- _cytoscape.csv -- the strain or lineage assignments with headers formatted for cytoscape,
  plus anything from ``--info-csv``.
- _cytoscape.graphml -- the network in graphml format.

The .graphml file is an XML file which contains definitions of the nodes (samples)
and edges (within-strain distances) connecting them.
If you used ``--graph-weights`` when you fitted your model the edges will be annotated
with their Euclidean distances in the 'weight' attribute (which you will need to tell
cytoscape to use). These can be added with the ``poppunk_add_weights`` script if this flag
was not used.

Open `cytoscape `_ and drag and drop the .graphml
file onto the window to import the network. Import -> table -> file to load the CSV.
Click 'Select None' then add the 'id' column as a key, and any required metadata
columns (at least the 'Cluster' column) as attributes. Make sure 'Node Table Columns'
is selected as the data type.

The graphml file does not contain a layout for the graph, that is, positions of
nodes and edges are not specified for a visualisation. These will be calculated by
cytoscape, automatically for small graphs, and with the 'Layout' menu for larger graphs.
The 'Prefuse force directed layout' or 'yFiles Organic Layout' work well. Select the
'weight' dropdown to use the edge-lengths when drawing the network.

.. warning::
    We have found that making graphs with >10k nodes may exceed the memory on a typical
    laptop. To view larger graphs, it is very helpful to first split the network into one
    subgraph per connected component. Older versions of cytoscape allowed you to split the
    graph into connected components, but newer versions have removed this feature. This can
    be done programmatically with ``networkx`` or ``graph-tool`` in python, or ``igraph`` in R.

Click on 'Style' and change the node fill colour to be by cluster, the mapping type
as discrete, then right click to autogenerate a colour scheme ('Random' is usually best).
You can also modify the node size and shape here. Here is the *Listeria* example, using
edge weights in the layout:

.. image:: images/cytoscape.png
   :alt: Cytoscape plot of network
   :align: center

If you used assign query mode you will also have a column with 'Query' or 'Reference',
which can be used to map to different shapes or colours:

.. image:: images/assign_network.png
   :alt: Network produced after query assignment
   :align: center

Adding an info CSV, or loading external tables directly into cytoscape, gives further
options for investigating individual strains:

.. image:: images/cytoscape_gpsc.png
   :alt: Network with added annotation
   :align: center

In some cases, edges which are between-strain links may have been erroneously included
in the network. This could be due to poor model fit, or a poor quality sequence. Use
Tools -> NetworkAnalyzer -> Analyze Network to compute information for each node and edge.
It may help to analyze connected components separately. They can be split under
Tools -> NetworkAnalyzer -> Subnetwork Creation.

Here is an example where an errant node is connecting two clusters into one large cluster,
which should be split:

.. image:: images/cytoscape_component.png
   :alt: Cytoscape plot of network
   :align: center

The incorrect node in question has a low ClusteringCoefficient and high Stress. The
EdgeBetweenness of its connections is also high. Sorting the node and edge tables by these
columns can find individual problems such as this.
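Both splitting a large network into connected components and searching for errant
bridging nodes can also be done outside cytoscape. The following is a minimal sketch
using ``networkx``, run on a toy graph rather than real PopPUNK output. The function
names and the file name in the comment are hypothetical, and ``networkx``'s clustering
and betweenness measures are used here as analogues of the NetworkAnalyzer columns;
their exact values may differ from those cytoscape reports.

```python
from itertools import combinations

import networkx as nx


def split_components(graph):
    """Return a separate copy of each connected component, largest first."""
    components = sorted(nx.connected_components(graph), key=len, reverse=True)
    return [graph.subgraph(nodes).copy() for nodes in components]


def rank_suspect_nodes(graph):
    """Order nodes by low clustering coefficient, then high betweenness."""
    clustering = nx.clustering(graph)
    betweenness = nx.betweenness_centrality(graph)
    return sorted(graph.nodes, key=lambda n: (clustering[n], -betweenness[n]))


def rank_suspect_edges(graph):
    """Order edges from highest to lowest (unweighted) edge betweenness."""
    betweenness = nx.edge_betweenness_centrality(graph)
    return sorted(betweenness, key=betweenness.get, reverse=True)


# Toy stand-in for a real network, which could be loaded with
# nx.read_graphml("example_viz_cytoscape.graphml") (hypothetical file name).
toy = nx.Graph()
toy.add_edges_from(combinations(["a1", "a2", "a3", "a4"], 2))  # tight cluster A
toy.add_edges_from(combinations(["b1", "b2", "b3", "b4"], 2))  # tight cluster B
toy.add_edges_from([("a1", "x"), ("x", "b1")])  # errant node bridging A and B
toy.add_edge("c1", "c2")  # unrelated small component

parts = split_components(toy)
print([part.number_of_nodes() for part in parts])
print(rank_suspect_nodes(parts[0])[0])
print(rank_suspect_edges(parts[0])[:2])
```

Each component could then be saved with ``nx.write_graphml`` and opened in cytoscape
on its own, which keeps memory use manageable for large networks.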