Query assignment (poppunk_assign)

This is the recommended way to run PopPUNK, as long as a database is available for your species. If no database is available, you can fit your own (see Fitting new models (--fit-model)).

Briefly, download your reference database and run:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8

Nomenclature

PopPUNK clusters are numbered from one upwards, in decreasing order of size in the initial dataset.

poppunk_assign will place your genomes into these existing clusters, using the same labels as the initial run, so cluster labels do not change when the tool is used as documented.

In some cases, due to undersampling of the initial dataset or the emergence of hybrids, some clusters may be merged. These merged clusters will be named with underscores separating the older clusters they were merged from. Use --external-clustering if you prefer alternative names for these.

If you require ‘stable nomenclature’ where clusters never merge, use the --stable option with poppunk_assign. Each query will be assigned based on its nearest neighbour’s cluster, though novel clusters will still be separately identified as ‘NA’.

Note that maintaining stable nomenclature in a dynamic population is not possible (for any nomenclature). If you are maintaining a database and want to add new queries in, you will need to use --update-db, which may merge clusters. Across two or more updates, there is no way to give consistent new names to merged clusters.

Downloading a database

Current PopPUNK databases can be found here: https://www.bacpop.org/poppunk/

We refer to sequences in the database as references, and those being added as queries. The clusters assigned by PopPUNK are variable-length-k-mer clusters (VLKCs).

A database called database will contain the following files, in database/:

  • database.h5 – the sketches of the reference sequences generated by pp-sketchlib.

  • database.dists.pkl – the order of the core and accessory distances for all pairwise comparisons in the sketch database.

  • database_fit.npz and database_fit.pkl – the model fit to the core and accessory distances.

  • database_graph.gt – the network defining the fit (loadable with graph_tool).

  • database_clusters.csv – the PopPUNK clusters for the reference sequences.

  • database.refs – a minimal list of references needed to produce correct clusters.

If the .refs file is missing, all of the samples in the sketch database will be used in the distance calculations.

You can use the following arguments to target these items individually if necessary, for example when using an alternative fit or when the files are split across different directories (see the example command after this list). The examples below refer to the default database name:

  • (required) --db database – the name of directory containing the .h5 file.

  • --distances database/database.dists – prefix of the distances.

  • --model-dir database – directory containing the model fit and network (dists + fit define the network).

  • --previous-clustering database – directory containing the PopPUNK clusters for the references.
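For example, a command along the following lines uses an alternative model fit stored in a separate directory; the lineage_fit directory name is hypothetical, so substitute your own:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8 \
--model-dir lineage_fit --previous-clustering lineage_fit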

Clustering your genomes

Create a file which lists your sample names and paths to their sequence data. This file has no header, is tab separated, and contains the sample name in the first column. Subsequent columns may contain paths to either assembled or raw read data (the type will automatically be inferred by checking for the presence of quality scores). Data may be gzipped or uncompressed:

MS1 ms1_assembled.fa
MS2 ms2_assembled.fa
SM14        SM14_1.fq.gz SM14_2.fq.gz
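If you have many samples, a small shell loop can generate this file for you; this is a minimal sketch which assumes your assemblies sit in an assemblies/ directory and end in .fa:

for f in assemblies/*.fa; do
    printf '%s\t%s\n' "$(basename "$f" .fa)" "$f"
done > qfile.txt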

Save this as qfile.txt. You’re now ready to cluster them! Run the following command:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8

This will first sketch your input genomes, saving them in poppunk_clusters/poppunk_clusters.h5. If you need to rerun part of the analysis with different options, these sketches will automatically be picked up and loaded.

Note

Data quality control (--qc-db) does not apply to query sequences. A test for maximum accessory distance will be made, but the program will only emit warnings and will run with all genomes regardless. Most options for sketching will be taken from the reference database, but you can still specify error filtering options for read input (--min-kmer-count and --exact-count) and specify your input as --strand-preserved. See Sketching (--create-db) for more information on these options.
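For example, if your queries are sequence reads, these filtering options can be added to the assignment command as follows; the count threshold of 10 is illustrative only, and --strand-preserved should only be added where appropriate for your input:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8 \
--min-kmer-count 10 --exact-count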

Next, core and accessory distances between your input sketches and those in the database will be computed. This has complexity \(O(RQ)\), where \(R\) is the number of samples in database.refs and \(Q\) is the number in qfile.txt. These distances are then fed into the model and used to update the network, and therefore the clusters.
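For example, assigning \(Q = 100\) queries against a database with \(R = 2000\) references requires roughly \(2 \times 10^{5}\) core/accessory distance pairs.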

The output will look something like this:

Graph-tools OpenMP parallelisation enabled: with 4 threads
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
        (with backend: sketchlib v1.5.1
         sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Mode: Assigning clusters of query sequences

Sketching genomes using 8 thread(s)
Calculating distances using 8 thread(s)
Loading previously refined model
Network loaded: 2007 samples
Found novel query clusters. Calculating distances between them.
Could not find random match chances in database, calculating assuming equal base frequencies
Calculating distances using 8 thread(s)

Your VLKCs will be written to poppunk_clusters/poppunk_clusters_clusters.csv:

Taxon,Cluster
21946_6_66,9
22695_3_148,9
22984_8_88,9
21946_6_245,116
21946_6_189,814
22695_3_73,814
21946_6_50,422
21903_8_95,148
21903_8_250,301
22984_8_47,70

These names are identical to those used in the reference database, so retain the same meaning between studies. If new clusters are found, they are numbered continuing on from the last reference cluster, in decreasing order of size.

Note

You may observe clusters merging (but never splitting). If your genomes do cause clusters to merge this will be noted in the output, and the new clusters will be named using the old ones. For example, if clusters 23 and 38 merged, the new cluster would be called 23_38.

By default, only the query genome clusters are included here. The reference genome clusters are considered unchanged from the input. If there are many merges and you wish to know their new cluster IDs, use --update-db (Updating the database).

You can use poppunk_visualise to look at your results. Here's an example output to Cytoscape, showing the clusters as colours, reference genomes as circles and queries as triangles (open in a new tab to zoom in on detail):

Network produced after query assignment
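A command along these lines can produce this kind of output; the output directory name is arbitrary, and the --cytoscape flag requests Cytoscape-compatible files (see Creating visualisations for the full set of options):

poppunk_visualise --ref-db database --query-db poppunk_clusters \
--output cytoscape_viz --cytoscape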

Adding external cluster labels (MLST, CC etc)

Add the --external-clustering argument to provide a CSV file of cluster definitions; the output will additionally be labelled with these, and written to database/database_external_clusters.csv. These can be any cluster definitions you wish, with as many columns as you like. A header row is required:

sample,GPSC,MLST
23430_1_186,1,22
17794_6_29,23,43
12291_4_13,1,2

For each PopPUNK cluster, all of the samples in that cluster are collected, the external cluster labels of those samples are gathered, and these labels are then assigned to every sample in the PopPUNK cluster. This may give you a one-to-one mapping between PopPUNK clusters and your external clusters, or you may find that multiple external clusters map to the same PopPUNK cluster, giving output such as 227;811;763;824.
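For example, assuming the definitions above are saved as external.csv (a filename chosen here for illustration):

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8 \
--external-clustering external.csv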

Using a model fitted with --indiv-refine

If the database was fitted with the refine fit mode and --indiv-refine, it may contain a core distance boundary, an accessory boundary and a combined core-accessory boundary fit. The default is to use the combined boundary; to use one of the others, add --core-only or --accessory-only.
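For example, to assign using only the core distance boundary:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --core-only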

Increasing speed

Query assignment is the most efficient mode in which to run PopPUNK, typically requiring \(Q\) sketches and \(RQ\) distances. If you are updating the database, this increases to \(Q^2 + RQ\) distances. If you are assigning a very large number of queries you can run poppunk_assign with --update-db repeatedly for batches of query input, as the \(Q^2\) term will be reduced by clique-pruning at each iteration.
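As a sketch, such a batching loop might look like the following; the batch size of 1000 and the file names are illustrative, and each iteration updates the database in place as described in Updating the database below:

split -l 1000 qfile.txt qbatch_
for batch in qbatch_*; do
    poppunk_assign --db database --query "$batch" \
    --output database --update-db --overwrite --threads 8
done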

Straightforward ways to increase speed include:

  • Add --gpu-dist, if you have a GPU available.

  • Add --gpu-sketch, if your input is all reads, and you have a GPU available. If your input is a mix of assemblies and reads, run in two separate batches, with the batch of reads using this option.

  • Increase --threads.

Updating the database

If you want to add your query genomes into the reference database so that they can be used to inform future cluster assignment, this is as simple as adding the --update-db option to the command above. This is particularly useful when novel query clusters have been found – their names will then remain consistent in future assignments:

poppunk_assign --db database --query qfile.txt \
--output poppunk_clusters --threads 8 --update-db

Graph-tools OpenMP parallelisation enabled: with 4 threads
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
    (with backend: sketchlib v1.5.1
    sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Mode: Assigning clusters of query sequences

Sketching 28 genomes using 4 thread(s)
Writing sketches to file
Calculating distances using 4 thread(s)
Loading BGMM 2D Gaussian model
Network loaded: 18 samples
Calculating all query-query distances
Could not find random match chances in database, calculating assuming equal base frequencies
Calculating distances using 4 thread(s)
Updating reference database to poppunk_clusters
Removing 27 sequences

Done

The new database contains all of the reference sequences, and all of your query sequences. The poppunk_clusters folder will now contain all of the files of a reference database listed above, except for the model. You can use --model-dir to target this for future assignment, or copy it over yourself. Alternatively, if you run with the same --output folder as --db, adding --overwrite, the original input folder will contain the updated database with everything needed.
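For example, a later assignment against the updated database could point back at the original model directory; new_qfile.txt and the output name here are placeholders:

poppunk_assign --db poppunk_clusters --model-dir database \
--query new_qfile.txt --output next_assignment --threads 8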

Note

This mode can take longer to run with large numbers of input query genomes, as it will calculate all \(Q^2\) query-query distances, rather than just those found in novel query clusters. Furthermore, query genomes that were assigned to novel clusters when run without --update-db may be assigned to existing clusters when this option is used. This is expected behaviour, and is a manifestation of cluster merging: comparing queries against all database genomes, not just the references, enables queries to be assigned to existing clusters. See Troubleshooting for more details.

Visualising results

If you wish to produce visualisations from query assignment results the best way to do this is to run with --update-db, and then run poppunk_visualise on the output directory, as if visualising a full reference fit.

However, it is possible to run directly on the outputs by adding a --ref-db pointing to the database used in the assign command, and a --query-db pointing to the --output directory used in the assign command. In this mode, isolates will be annotated according to whether they were a query or reference input.

Warning

Without --update-db, the visualisation will need to recalculate all query-query distances each time it is called. If your query set is large and you want repeated visualisations, run poppunk_assign with --update-db.

See Creating visualisations for more details.