Query assignment (``poppunk_assign``) ===================================== This is the recommended mode to use PopPUNK, as long as a database is available for your species. If there is no database available, you can fit your own (:doc:`model_fitting`). Briefly, `download your reference database `__ and run:: poppunk_assign --db database --query qfile.txt \ --output poppunk_clusters --threads 8 Nomenclature ------------ PopPUNK clusters are numbered from one upwards, in decreasing order of size in the initial dataset. ``poppunk_assign`` will assign your genomes into these existing clusters, with the same labels as the initial run. So cluster labels, when used as documented, **do not change**. In some cases, due to undersampling of the initial dataset or emergence of hybrids, some clusters may be merged. These merged clusters will be named with underscores separating the older clusters they were merges of. Use ``--external-clustering`` if you prefer other nicknames for these. If you require 'stable nomenclature' where clusters never merge, use the ``--stable`` option with ``poppunk_assign``. Each query will be assigned based on its nearest neighbour's cluster, though novel clusters will still be separately identified as 'NA'. Note that maintaining stable nomenclature in a dynamic population is not possible (for any nomenclature). If you are maintaining a database and want to add new queries in, you will need to use ``--update-db`` which may merge clusters. There is no way with two or more updates of giving consistent new names to merged clusters. Downloading a database ---------------------- Current PopPUNK databases can be found here: https://www.bacpop.org/poppunk/ We refer to sequences in the database as references, and those being added as queries. The clusters assigned by PopPUNK are variable-length-k-mer clusters (VLKCs). A database called ``database`` will contain the following files, in ``database/``: - ``database.h5`` -- the sketches of the reference sequences generated by ``pp-sketchlib``. - ``database.dists.pkl`` -- the order of the core and accessory distances for all pairwise comparisons in the sketch database. - ``database_fit.npz`` and ``database_fit.pkl`` -- the model fit to the core and accessory distances. - ``database_graph.gt`` -- the network defining the fit (loadable with ``graph_tool``). - ``database_clusters.csv`` -- the PopPUNK clusters for the reference sequences. - ``database.refs`` -- a minimal list of references needed to produce correct clusters. If the ``.refs`` file is missing, all of the samples in the sketch database will be used in the distance calculations. You can use the following arguments to individually target these items if necessary, for example when using an alternative fit, or if split across different directories. The examples below refer to the default database name: - (required) ``--db database`` -- the name of directory containing the .h5 file. - ``--distances database/database.dists`` -- prefix of the distances. - ``--model-dir database`` -- directory containing the model fit and network (dists + fit define the network). - ``--previous-clustering database`` -- directory containing the PopPUNK clusters for the references. Clustering your genomes ----------------------- Create a file which lists your sample names and paths to their sequence data. This file has no header, is tab separated, and contains the sample name in the first column. Subsequent columns may contain paths to either assembled or raw read data (the type will automatically be inferred by checking for the presence of quality scores). Data may be gzipped or uncompressed:: MS1 ms1_assembled.fa MS2 ms2_assembled.fa SM14 SM14_1.fq.gz SM14_2.fq.gz Save this as ``qfile.txt``. You're now ready to cluster them! Run the following command:: poppunk_assign --db database --query qfile.txt \ --output poppunk_clusters --threads 8 This will first of all sketch your input genomes, saving them in ``poppunk_clusters/poppunk_clusters.h5``. If you need to rerun part of the analysis with different options this will automatically be picked up and loaded. .. note:: :doc:`qc` does not apply to query sequences. A test for maximum accessory distance will be made, but the program will only emit warnings and will run with all genomes anyway. Most options for sketching will be taken from the reference database, but you can still specify error filtering options from read input (``--min-kmer-count`` and ``--exact-count``) and specify your input as ``--strand-preserved``. See :doc:`sketching` for more information on these options. Next, core and accessory distances between your input sketches and those in the database will be computed. This has complexity :math:`O(RQ)` where :math:`R` is the number of samples in ``database_references.refs`` and :math:`Q` is the number in ``qfile.txt``. These distances are then fed into the model and used to update the network, and therefore clusters. The output will look something like this:: Graph-tools OpenMP parallelisation enabled: with 4 threads PopPUNK (POPulation Partitioning Using Nucleotide Kmers) (with backend: sketchlib v1.5.1 sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so) Mode: Assigning clusters of query sequences Sketching genomes using 8 thread(s) Calculating distances using 8 thread(s) Loading previously refined model Network loaded: 2007 samples Found novel query clusters. Calculating distances between them. Could not find random match chances in database, calculating assuming equal base frequencies Calculating distances using 8 thread(s) Your VLKCs will be written to ``poppunk_clusters/poppunk_clusters_clusters.csv``:: Taxon,Cluster 21946_6_66,9 22695_3_148,9 22984_8_88,9 21946_6_245,116 21946_6_189,814 22695_3_73,814 21946_6_50,422 21903_8_95,148 21903_8_250,301 22984_8_47,70 These names are identical to those used in the reference database, so retain the same meaning between studies. If new clusters are found they will be numbered in ascending order from largest to smallest, beginning from the end of the reference clusters. .. note:: You may observe clusters merging (but never splitting). If your genomes do cause clusters to merge this will be noted in the output, and the new clusters will be named using the old ones. For example, if clusters 23 and 38 merged, the new cluster would be called 23_38. By default, only the query genome clusters are included here. The reference genome clusters are considered unchanged from the input. If there are many merges and you wish to know their new cluster IDs, use ``--update-db`` (:ref:`update-db`). You can use ``poppunk_visualise`` to look at your results. Here's an example output to cytoscape, showing the clusters as colours, reference genomes as circles and queries as triangles (open in a new tab to zoom on detail): .. image:: images/assign_network.png :alt: Network produced after query assignment :align: center Adding external cluster labels (MLST, CC etc) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Add the ``--external-clustering`` argument to add a CSV file of cluster definitions which the output will be additionally labelled with, and output to ``database/database_external_clusters.csv``. These can be any cluster definitions you wish, with as many columns as you like. A header row is required:: sample,GPSC,MLST 23430_1_186,1,22 17794_6_29,23,43 12291_4_13,1,2 For each PopPUNK cluster, all the samples found in said cluster will be accumulated. From these accumulated samples the external clusters will be collected, and assigned to all of these examples. This may give you a one-to-one mapping between PopPUNK clusters and your external cluster, or you may find multiple external clusters refer to the PopPUNK cluster giving output such as ``227;811;763;824``. Using a model fitted with ``--indiv-refine`` ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If the database was fitted with the refine fit mode, and ``indiv-refine`` you may have a core distance boundary, accessory boundary and combined core-accessory boundary fit. The default is to use the combined boundary, to use the others add ``--core-only`` or ``--accessory-only``. Increasing speed ---------------- Query assignment is the most efficient mode in which to run PopPUNK, typically requiring :math:`Q` sketches and :math:`RQ` distances. If you are updating the database, this increases to :math:`Q^2 + RQ` distances. If you are assigning a very large number of queries you can run ``poppunk_assign`` with ``--update-db`` repeatedly for batches of query input, as the :math:`Q^2` term will be reduced by clique-pruning at each iteration. Straightforward ways to increase speed include: - Add ``--gpu-dist``, if you have a GPU available. - Add ``--gpu-sketch``, if your input is all reads, and you have a GPU available. If your input is a mix of assemblies and reads, run in two separate batches, with the batch of reads using this option. - Increase ``--threads``. .. _update-db: Updating the database --------------------- If you want to add your query genomes into the reference database so that they can be used to inform future cluster assignment, this is as simple as adding the ``--update-db`` option to the command above. This is particularly useful when novel query clusters have been found -- they will then be the consistent name for future assignments:: poppunk_assign --db database --query qfile.txt \ --output poppunk_clusters --threads 8 --update-db Graph-tools OpenMP parallelisation enabled: with 4 threads PopPUNK (POPulation Partitioning Using Nucleotide Kmers) (with backend: sketchlib v1.5.1 sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so) Mode: Assigning clusters of query sequences Sketching 28 genomes using 4 thread(s) Writing sketches to file Calculating distances using 4 thread(s) Loading BGMM 2D Gaussian model Network loaded: 18 samples Calculating all query-query distances Could not find random match chances in database, calculating assuming equal base frequencies Calculating distances using 4 thread(s) Updating reference database to poppunk_clusters Removing 27 sequences Done The new database contains all of the reference sequences, and all of your query sequences. The ``poppunk_clusters`` folder will now contain all of the files of a reference database listed above, except for the model. You can use ``--model-dir`` to target this for future assignment, or copy it over yourself. Alternatively, if you run with the same ``--output`` folder as ``--ref-db``, adding ``--overwrite``, the original input folder will contain the updated database containing everything needed. .. note:: This mode can take longer to run with large numbers of input query genomes, as it will calculate all :math:`Q^2` query-query distances, rather than just those found in novel query clusters. Furthermore, you may observe query genomes previously assigned to novel clusters without ``--update-db`` being assigned to existing clusters when using this option. This is expected behaviour, and is a manifestation of cluster merging, whereby the comparison of all database genomes to queries, not just references, enables queries to be assigned to existing clusters. See :doc:`troubleshooting` for more details. Visualising results ------------------- If you wish to produce visualisations from query assignment results the best way to do this is to run with ``--update-db``, and then run ``poppunk_visualise`` on the output directory, as if visualising a full reference fit. However, it is possible to run directly on the outputs by adding a ``--ref-db`` as used in the assign command, and a ``--query-db`` which points to the ``--output`` directory used in the assign command. In this mode isolates will be annotated depending on whether they were a query or reference input. .. warning:: Without ``--update-db``, visualisation is required to recalculate all query-query distances each time it is called. If your query set is large and you want repeated visualisations, run ``poppunk_assign`` with ``--update-db``. See :doc:`visualisation` for more details.