Fitting new models (``--fit-model``)
====================================
.. |nbsp| unicode:: 0xA0
:trim:
If you cannot find an existing model for your species in the
`list `__, you will want to fit your own.
This process is flexible, and there are five different models you can use depending
on the population structure of your dataset.
.. note::
After fitting a model to a new species we would like to share it on our website,
so others can use it for assigning queries. If you are open to this, please read
:doc:`model_distribution` after this page.
Overview
--------
First, use ``poppunk --create-db`` to sketch your input data and calculate distances
between all samples. This is detailed in :doc:`sketching`.
Then, use ``poppunk --fit-model <model>`` with one of the following model names:
- ``bgmm`` -- Bayesian Gaussian Mixture Model. Best for small sample collections
with strain-structure. Works best when distance distribution components are clearly
separated.
- ``dbscan`` -- HDBSCAN. A good general method for larger sample collections with
strain-structure. Some points will always be designated as noise, so a subsequent run
of model refinement may help improve the fit.
- ``refine`` -- Model refinement. Requires a model already fitted with ``bgmm`` or ``dbscan``
and attempts to improve it by maximising the network score. Particularly useful when
components overlap significantly (often due to recombination), or when the strain boundary
is thought to lie somewhere within a component.
- ``threshold`` -- Apply a given core or accessory distance threshold to define VLKCs. Useful if
  a cutoff threshold is already known or can be estimated from a plot, or to compare a threshold
  between datasets or species.
- ``lineage`` -- Lineage clustering. To find lineages within a strain (subclustering), or
find clusters in a population without strain structure. Uses a simple nearest neighbour approach
so is more of a heuristic. Network scores are not meaningful in this mode.
The most useful guide to deciding which model to use is the ``_distanceDistribution.png`` file
showing the core and accessory distances. More details on each of these models are given
further down this page.
A completed fit will consist of:
- A ``_clusters.csv`` file, which gives the VLKC (strain) for each sample in the database.
- A ``_unword_clusters.csv`` file, which gives an English-pronounceable name instead of a number
to each VLKC.
- ``_fit.npz`` and ``_fit.pkl`` files, which contain numeric data and metadata for the fit.
- A ``_graph.gt`` file, which is the network defining the fit in graph-tool format.
- Some plots of the fit, which depend on the specific model used.
- A ``.refs`` file, which lists the samples kept as 'references' for assigning
future samples (see :doc:`model_distribution` for more details).
This page will use 128 *Listeria*\ |nbsp| \ *monocytogenes* genomes from `Kremer et al `__,
which can be downloaded from `figshare `__. The distribution of
core and accessory distances from the ``--create-db`` step is as follows:
.. image:: images/listeria_dists.png
:alt: Core and accessory distances for the example data
:align: center
We also show some examples with 616 *Streptococcus*\ |nbsp| \ *pneumoniae* genomes, which are more complex.
These genomes were collected from Massachusetts,
first reported `here `__ and can be accessed
`here `__.
Common arguments
----------------
- ``--ref-db``: the output prefix used with ``--create-db``, i.e. the directory where the ``.h5`` file is located.
- ``--output``: where to save the model. If not specified this defaults to ``ref-db``.
- ``--overwrite``: overwrite any existing files in the output directory.
- ``--external-clustering``: any additional labels to add to the cluster output.
- ``--graph-weights``: save the edge weights in the network as their Euclidean core-accessory
  distances, rather than as 0 or 1 (useful for visualising the network).
External clusters may be other cluster names, such as serotype, sequence type, cgMLST etc.
VLKCs are mapped as one-to-many, so that each strain is labelled with all of
the clusters any of its members is assigned to in this file. This input file must
be comma separated, one sample per line, with the sample name as the first column, and
other clusters as subsequent columns. A header line with 'sample' and the names of other cluster
types is required. Output is to ``output/output_external_clusters.csv``.
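
For example, a minimal external clustering file (the serotype and sequence type values here are hypothetical) might look like::

    sample,serotype,ST
    12673_8#24,4b,1
    12673_8#26,1/2a,9
    12673_8#27,1/2a,9
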
How good is my fit?
-------------------
We have found the best way to assess this is to use :doc:`visualisation` on your output
and look at your assigned VLKCs against a tree, to determine whether they have
the specificity required.
You can also compare models with their network score, and
whether the output plots look as expected. Typically the key thing is that
**your spatial component nearest the origin is accurate**. More detail is given for each model below.
Interpreting the network summary
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All fits will output a network summary which looks similar to this::
Network summary:
Components 59
Density 0.0531
Transitivity 0.9966
Mean betweenness 0.0331
Weighted-mean betweenness 0.0454
Score 0.9438
Score (w/ betweenness) 0.9126
Score (w/ weighted-betweenness) 0.9009
- Components are the number of VLKCs (strains) found using this model.
- Density is the proportion of distances assigned as 'within-strain'. Generally
  smaller is better as this gives more specific clusters, but a value too close to zero
  may indicate an over-specific model.
- Transitivity measures whether every member of each strain is connected to every
other member. Closer to 1 is better, but this can be achieved with very loose fits.
- Score synthesises the above as :math:`(1 - \mathrm{density}) \times \mathrm{transitivity}`,
  which gives a single number between 0 (bad) and 1 (good) that in many cases is
  maximised when the model accurately describes the strains in the data (a short sketch of
  this calculation follows this list).
- Two further scores are reported for larger networks; see :ref:`alt-scores` for more information
  on these.
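
To make the score concrete, here is a minimal sketch of how it could be recalculated from any
network with networkx (PopPUNK itself uses graph-tool; loading the ``_graph.gt`` file is not shown)::

    import networkx as nx

    def network_score(G):
        # Density: proportion of possible edges (within-strain distances) present
        density = nx.density(G)
        # Transitivity: global clustering coefficient of the network
        transitivity = nx.transitivity(G)
        # Score, as defined above
        return (1 - density) * transitivity
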
.. _bgmm:
bgmm
----
This mode fits a `Bayesian Gaussian mixture model `__
to the core and accessory distances. With few points, methods such as DBSCAN may struggle to find
clusters due to the sparsity, whereas a BGMM can often find a good fit. A further advantage
is that the equation for the posterior is known, so every point receives an assignment and the
non-linear boundary between components is found exactly.
However, when there are a very large number of points the likelihood has a tendency
to totally override the prior in the estimated posterior, meaning many overlapping components
may be fitted, which may give poor clusters, and is less robust to adding more data. It is possible
for this mode to fail to converge, but it is more likely to produce a bad fit in difficult cases.
The key parameter to specify is the maximum number of components, ``--K``. You should
choose a number based on the number of components you can see on your distance plot. This
may be automatically reduced if there is insufficient evidence for this many components. As a rule of thumb,
if you have under 150 samples, or under 1000 samples with clearly separated components, this mode should give
a good fit.
A better network score is evidence of a better fit, but the output files should also be used to
judge this. With the test dataset, four components are visible::
poppunk --fit-model bgmm --ref-db listeria --K 4
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting bgmm model to reference database
Fit summary:
Avg. entropy of assignment 0.0042
Number of components used 4
Scaled component means:
[0.9415286 0.90320047]
[0.11542755 0.24570244]
[0.20966101 0.37694884]
[0.00527421 0.07043826]
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Removing 97 sequences
Done
In the output to the terminal:
- The average entropy of assignment is a measure of the certainty of assignment
of each point. Lower is better. Higher values may indicate overlapping components,
perhaps due to high amounts of recombination between strains.
- Number of components used is how many components from ``K`` were actually used
in the spatial fit. This is usually equal to ``K``, but may be reduced in small datasets.
- Scaled component means are the centres of the fitted components in the model, where
the core and accessory distances have been rescaled between 0 and 1. These can be
used with :ref:`manual-start`.
The fit actually just uses the component closest to the origin -- any distances
assigned to this component are within-strain. This is the most important part of the
fit in this mode.
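
As an illustration of this idea, a sketch with scikit-learn (not PopPUNK's internal code; the
input file name is hypothetical) might fit the distances and pick the within-strain component like this::

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    # X: (n_pairs, 2) array of core and accessory distances, rescaled to [0, 1]
    X = np.load("scaled_distances.npy")  # hypothetical input

    bgmm = BayesianGaussianMixture(n_components=4, covariance_type="full").fit(X)
    assignments = bgmm.predict(X)

    # The within-strain component is the one whose mean is closest to the origin
    within = np.argmin(np.linalg.norm(bgmm.means_, axis=1))
    within_strain = assignments == within

Distance pairs flagged as within-strain become the edges of the network from which the VLKCs are defined.
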
You can see that this gives a good network score, and fits the data well:
.. image:: images/bgmm_k4_fit.png
:alt: BGMM fit with K = 4
:align: center
The position of the boundary is also produced (in red), along with contours of
the fitted mixture components:
.. image:: images/bgmm_k4_boundary.png
:alt: BGMM fit with K = 4
:align: center
If you make ``--K`` too low, some components will be merged, resulting in a less specific
fit with fewer clusters, which do not fully delineate all of the strains (in this
case finding just the two main lineages of *Listeria* in this data)::
poppunk --fit-model bgmm --ref-db listeria --K 2
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting bgmm model to reference database
Fit summary:
Avg. entropy of assignment 0.0007
Number of components used 2
Scaled component means:
[0.11627304 0.2432584 ]
[0.9415286 0.90320047]
Network summary:
Components 2
Density 0.5405
Transitivity 1.0000
Score 0.4595
Removing 126 sequences
Done
.. image:: images/bgmm_k2_fit.png
:alt: BGMM fit with K = 2
:align: center
If ``--K`` is set too high for a small dataset, the number of components is automatically reduced
to an appropriate number, recovering the same good fit as above::
poppunk --fit-model bgmm --ref-db listeria --K 10
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting bgmm model to reference database
Fit summary:
Avg. entropy of assignment 0.3195
Number of components used 4
Scaled component means:
[0.9415286 0.90320047]
[3.72458739e-07 4.73196248e-07]
[0.00527421 0.07043826]
[0.20966682 0.37695524]
[0.11542849 0.2457043 ]
[1.68940242e-11 2.14632815e-11]
[7.66987488e-16 9.74431443e-16]
[3.48211781e-20 4.42391191e-20]
[1.58087904e-24 2.00845290e-24]
[7.17717973e-29 9.11836205e-29]
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Removing 97 sequences
Done
In a dataset with more points, and less clear components, too many components can lead to
a bad fit:
.. image:: images/bgmm_fit_K10.png
:alt: BGMM fit with K = 10
:align: center
This is clearly a poor fit. The real issue is that the component whose mean is nearest
the origin is unclear, and doesn't include all of the smallest distances.
.. _dbscan:
dbscan
------
This mode uses `HDBSCAN `__ to find clusters
in the core and accessory distances. This is a versatile clustering algorithm capable of
finding non-linear structure in the data, and can represent irregularly shaped components
well. Possible drawbacks are that a fit cannot always be found (this can happen
for small datasets with sparse points, or for datasets without much structure in the core
and accessory), and that some points are classified as 'noise' so not all of their
edges are included in the network (these are the small black points).
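
To illustrate the approach, a sketch using the ``hdbscan`` package directly (not PopPUNK's own
fitting code; the input file name and ``min_cluster_size`` are hypothetical) might look like::

    import numpy as np
    import hdbscan

    # X: (n_pairs, 2) array of core and accessory distances
    X = np.load("scaled_distances.npy")  # hypothetical input

    clusterer = hdbscan.HDBSCAN(min_cluster_size=100).fit(X)
    labels = clusterer.labels_  # -1 marks noise points

    # Mean of each non-noise cluster; within-strain is the one nearest the origin
    means = np.array([X[labels == k].mean(axis=0) for k in range(labels.max() + 1)])
    within = np.argmin(np.linalg.norm(means, axis=1))
    within_strain = labels == within
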
.. warning::
HDBSCAN models are not backwards compatible from sklearn v1.0 onwards. We
would recommend using at least this version. Even better would be to then run
model refinement (:ref:`refine-models`) to get a simpler and faster model
for onward query assignment.
dbscan usually needs little modification to run::
poppunk --fit-model dbscan --ref-db listeria
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting dbscan model to reference database
Fit summary:
Number of clusters 5
Number of datapoints 8128
Number of assignments 7804
Scaled component means
[0.94155383 0.90322459]
[0.00527493 0.07044794]
[0.20945986 0.37491995]
[0.12876077 0.34294888]
[0.11413982 0.24224743]
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Removing 97 sequences
Done
In the output to the terminal:
- The number of clusters is the number of spatial components found in the data.
- Number of datapoints is the number of points used (all-vs-all distances), which
may have been subsampled from the maximum.
- Number of assignments is the number of points assigned to one of the spatial components,
  i.e. excluding noise points.
- Scaled component means are the centres of the fitted components in the model, where
the core and accessory distances have been rescaled between 0 and 1. These can be
used with :ref:`manual-start`.
The fit actually just uses the component closest to the origin -- any distances
assigned to this component are within-strain. This is the most important part of the
fit in this mode. In this case the identification of this component is identical to the bgmm
fit, so they produce the same strains. Note there is a small yellow cluster which is poorly
defined, but as it does not impact the within-strain cluster the fit is unaffected:
.. image:: images/dbscan_fit.png
:alt: DBSCAN fit
:align: center
You can alter the fit with ``--D``, which sets a maximum number of clusters, and
``--min-cluster-prop``, which sets the minimum number of points a cluster can have (as
a proportion of 'Number of datapoints'). If the core and accessory means are not both
strictly increasing between the within-strain component and the next nearest component, the
clustering fails. In this case the minimum number of samples per cluster is halved, and the fit is
tried again. If this value drops below ten, no fit can be found.
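
A rough sketch of that retry loop (the ``fit_dbscan`` and ``means_increasing`` helpers are
hypothetical placeholders, not PopPUNK functions)::

    def fit_with_retries(X, min_samples, fit_dbscan, means_increasing):
        # Halve the minimum samples per cluster until a valid fit is found,
        # giving up once the value drops below ten
        while min_samples >= 10:
            fit = fit_dbscan(X, min_samples)
            if means_increasing(fit):  # core and accessory means strictly increasing
                return fit
            min_samples //= 2
        raise RuntimeError("Failed to find distinct clusters in this dataset")
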
Increasing ``--min-cluster-prop`` or decreasing ``--D`` gets rid of the errant cluster above::
poppunk --fit-model dbscan --ref-db listeria --min-cluster-prop 0.01
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting dbscan model to reference database
Fit summary:
Number of clusters 4
Number of datapoints 8128
Number of assignments 7805
Scaled component means
[0.94155383 0.90322459]
[0.00522549 0.06876396]
[0.11515678 0.24488282]
[0.21152104 0.37635505]
Network summary:
Components 31
Density 0.0886
Transitivity 0.9953
Score 0.9071
Removing 95 sequences
Done
But note that a few more noise points are generated, and fewer samples are removed
when pruning cliques:
.. image:: images/dbscan_fit_min_prop.png
:alt: DBSCAN fit increasing assignments per cluster
:align: center
Setting ``--min-cluster-prop`` too high or ``--D`` too low can cause the fit to fail::
poppunk --fit-model dbscan --ref-db listeria --min-cluster-prop 0.05
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting dbscan model to reference database
Failed to find distinct clusters in this dataset
.. _refine-models:
refine
------
Model refinement is slightly different: it takes a model already fitted by :ref:`bgmm`
or :ref:`dbscan` and tries to improve it by optimising the network score. This starts
with a parallelised global optimisation step, followed by a serial local optimisation
step (which can be turned off with ``--no-local``). Use of multiple ``--cpus`` is
effective for these model fits.
Briefly, the method works as follows (a sketch of the within-strain test is given after this list):
* A line between the within- and between-strain means is constructed
* The point on this line where samples go from being assigned as within-strain to between-strain is used as the starting point
* A line normal to the first line, passing through this point, is constructed. The triangle formed by this line and the x- and y-axes is now the decision boundary; points within this triangle are within-strain.
* The starting point is shifted by a distance along the first line, and a new decision boundary formed in the same way. The network is reconstructed.
* The shift of the starting point is optimised, as judged by the network score. First globally by a grid search, then locally near the global optimum.
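
The geometry of the within-strain test can be sketched as follows (an illustration with
hypothetical variable names, not PopPUNK's implementation)::

    import numpy as np

    def within_strain(X, within_mean, between_mean, shift):
        # Unit vector along the line connecting the two component means
        direction = between_mean - within_mean
        direction = direction / np.linalg.norm(direction)
        # Boundary point on this line, moved by 'shift' from the within-strain mean
        start = within_mean + shift * direction
        # A distance pair is within-strain if it lies on the origin side of the
        # line through 'start' that is normal to 'direction'
        return (X - start) @ direction < 0
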
Applying this to the *Listeria* DBSCAN fit (noting that you may specify a separate
directory to load the model from with ``--model-dir``, if multiple model fits are available)::
poppunk --fit-model refine --ref-db listeria --model-dir dbscan
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Initial boundary based network construction
Decision boundary starts at (0.63,0.62)
Trying to optimise score globally
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Network summary:
Components 29
Density 0.0897
Transitivity 0.9984
Score 0.9088
Removing 93 sequences
Done
As this model was already well fitted, this doesn't change much, and finds very similar
VLKC assignments (though noise points are eliminated):
.. image:: images/listeria_refined.png
:alt: A refine fit on Listeria
:align: center
The default is to search along the entire range between the within- and between-strain clusters,
but sometimes this can include undesired optima, particularly near the origin. To exclude these,
use ``--pos-shift`` to alter the distance between the end of the search range and the origin
and ``--neg-shift`` for the start of the search range.
This mode is more useful in species with a relatively high recombination rate, where the distinction between
the within- and between-strain distributions may be blurred in core and
accessory space. This does not give the mixture model enough information to
draw a good boundary, as the likelihood is very flat in this region:
.. image:: images/pneumo_unrefined.png
:alt: A bad DPGMM fit
:align: center
Although the score of this fit looks ok (0.904), inspection of the network and
microreact reveals that it is too liberal and VLKCs/strains have been merged. This
is because some of the blur between the origin and the central distribution has
been included, erroneously connecting clusters together.
The likelihood of the model fit and the decision boundary looks like this:
.. image:: images/pneumo_likelihood.png
:alt: The likelihood and decision boundary of the above fit
:align: center
Using the core and accessory distributions alone does not give much information
about exactly where to put the boundary, and the only way to fix this would be
by specifying strong priors on the weights of the distributions. Fortunately
the network properties give information in this region, and we can use
``--fit-model refine`` to tweak the existing fit and pick a better boundary.
Here is the refined fit, which has a score of 0.939, and 62 rather than 32
components:
.. image:: images/pneumo_refined.png
:alt: The refined fit
:align: center
Which, looking at the `microreact output `__, is much better:
.. image:: images/refined_microreact.png
:alt: The refined fit, in microreact
:align: center
The calculation of the summary statistics used to optimise the network score may be slow with
large datasets. In this case the ``--summary-sample`` argument can be used to specify the number of network
nodes that are randomly subsampled from the overall network when calculating the optimal boundary position.

.. _alt-scores:

Alternative network scores
^^^^^^^^^^^^^^^^^^^^^^^^^^
Two additional network scores are now available using node betweenness. We have observed
that in some refined fits to large datasets, some clusters are merged with a single high-stress
edge at a relatively large distance. These scores aim to create a more conservative boundary that splits
these clusters.
For these scores:
- The network is split into :math:`S` connected components (the strains) each of size :math:`w_i`
- For each component with at least four nodes, the betweenness of each node is calculated
- Each component is summarised by the maximum betweenness of any member node :math:`b^{\mathrm{max}}_i`
.. math::
\mathrm{score}_1 &= \mathrm{score}_0 \cdot (1 - \frac{1}{S} \sum_{i = 1}^S b^{\mathrm{max}}_i) \\
\mathrm{score}_2 &= \mathrm{score}_0 \cdot (1 - \frac{1}{S \cdot \Sigma w_i} \sum_{i = 1}^S \left[ b^{\mathrm{max}}_i \cdot w_i \right])
Score 1 is printed as 'Score (w/ betweenness)' and score 2 as 'Score (w/ weighted-betweenness)'. Use ``--score-idx``
with 0 (default), 1 (betweenness) or 2 (weighted-betweenness) to choose which score to optimise in refine
mode. Note that scores 1 and 2 may take longer to compute due to
the betweenness calculation, though this can take advantage of multiple ``--threads``. If it is prohibitively
slow, the ``--betweenness-sample`` argument may be used to specify the size of a random sample of nodes
used in the betweenness calculation.
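
As an illustration of these formulas, a sketch with networkx (rather than PopPUNK's graph-tool
code; here only components with at least four nodes contribute to :math:`S`) might be::

    import numpy as np
    import networkx as nx

    def betweenness_scores(G):
        score0 = (1 - nx.density(G)) * nx.transitivity(G)
        b_max, sizes = [], []
        for comp in nx.connected_components(G):
            if len(comp) < 4:  # small components are skipped
                continue
            sub = G.subgraph(comp)
            b_max.append(max(nx.betweenness_centrality(sub).values()))
            sizes.append(len(comp))
        b_max, sizes = np.array(b_max), np.array(sizes)
        score1 = score0 * (1 - b_max.mean())
        score2 = score0 * (1 - (b_max * sizes).sum() / (len(b_max) * sizes.sum()))
        return score0, score1, score2
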
Unconstrained (two-dimensional) optimisation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the default mode described above, the boundary gradient is set from the identified
means in the input model, and the position of the intercept is optimised (one-dimensional optimisation).
In cases where the gradient of the boundary is not well set by the two means in the
plot, you can optimise both the intercept and the gradient by adding the ``--unconstrained`` option
(which is incompatible with ``--indiv-refine``). This will perform a global search
of 20 x 20 (400 total) x- and y-intercept positions, followed by a 1D local search
to further optimise the intercept (unless ``--no-local`` is added).
As this calculates the boundary at ten times as many positions, it is generally expected to
take ten times longer. However, you can effectively parallelise this with up to 20 ``--threads``::
poppunk --fit-model refine --ref-db listeria --model-dir dbscan --unconstrained --threads 4
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.2
sketchlib: /Users/jlees/Documents/Imperial/pp-sketchlib/build/lib.macosx-10.9-x86_64-3.8/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 4 threads
Mode: Fitting refine model to reference database
Loading BGMM 2D Gaussian model
Loaded previous model of type: bgmm
Initial model-based network construction based on Gaussian fit
Initial boundary based network construction
Decision boundary starts at (0.52,0.43)
Trying to optimise score globally
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Network summary:
Components 59
Density 0.0531
Transitivity 0.9966
Mean betweenness 0.0331
Weighted-mean betweenness 0.0454
Score 0.9438
Score (w/ betweenness) 0.9126
Score (w/ weighted-betweenness) 0.9009
Removing 545 sequences
Done
This gives a slightly higher network score, though overall similar clusters:
.. image:: images/unconstrained_refine.png
:alt: Refining fit with --unconstrained
:align: center
This is because the gradient from the 1D optimisation was well set. Unconstrained optimisation
can be useful with clusters which aren't parallel to the line that connects them. This is an
example in *E.*\ |nbsp| \ *coli*:
.. list-table::
* - .. figure:: images/ecoli_refine_constrained.png
1D refine fit between DBSCAN cluster centroids
- .. figure:: images/ecoli_refine_unconstrained.png
Unconstrained 2D fit over a greater range
The search range will always be defined by a trapezium in light red -- bounded by
the two axes, and two lines passing through the means which are normal to the line
which connects the means.
.. _manual-start:
Using fit refinement when the mixture model totally fails
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the mixture model does not give any sort of reasonable fit to the points,
you can manually provide a file with ``--manual-start`` to give the starting parameters for
the ``refine`` mode. The format of this file is as follows::
start 0,0
end 0.5,0.6
scaled True
Each line is a key, followed by its value (space separated).
``start`` and ``end`` define the points (x,y) to draw the search line between.
These become the two red points (and therefore the
search range) in the output plot.
``scaled`` defines whether these are on the [0,1] scale. If these have been set
using means from the terminal output this should be ``True``. Otherwise, if you
have set them based on the plot (unscaled space), set to ``False``.
.. _indiv-refine:
Using core/accessory only
^^^^^^^^^^^^^^^^^^^^^^^^^
In some cases, such as analysis within a lineage, it may be desirable to use
only core or accessory distances to classify further queries. This can be
achieved by adding the ``--indiv-refine both`` option, which places the core and accessory
boundaries independently, giving the best fit in each case::
poppunk --fit-model refine --ref-db listeria --model-dir dbscan --indiv-refine both
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting refine model to reference database
Loading DBSCAN model
Loaded previous model of type: dbscan
Initial model-based network construction based on DBSCAN fit
Initial boundary based network construction
Decision boundary starts at (0.63,0.62)
Trying to optimise score globally
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Refining core and accessory separately
Initial boundary based network construction
Decision boundary starts at (0.63,0.62)
Trying to optimise score globally
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Initial boundary based network construction
Decision boundary starts at (0.63,0.62)
Trying to optimise score globally
Trying to optimise score locally
Optimization terminated successfully;
The returned value satisfies the termination criteria
(using xtol = 1e-05 )
Network summary:
Components 29
Density 0.0897
Transitivity 0.9984
Score 0.9088
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Network summary:
Components 31
Density 0.0808
Transitivity 0.9862
Score 0.9064
Removing 93 sequences
Done
There are three different networks, and the core and accessory boundaries will
also be shown on the ``_refined_fit.png`` plot as dashed gray lines:
.. image:: images/indiv_refine.png
:alt: Refining fit with core and accessory individuals independently
:align: center
To use one of these for your saved model, rerun, but instead setting
``--indiv-refine core`` or ``--indiv-refine accessory``.
.. _multi-boundary:
Running with multiple boundary positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To create clusters at equally spaced positions across the refinement range, add
the ``--multi-boundary <n>`` argument, with the number of positions specified by
``<n>``. This will create up to ``<n>`` sets of clusters, with boundaries equally spaced
between the origin and the refined boundary position.

Trivial cluster sets, where every sample is in its own cluster, will be excluded, so
the final number of cluster sets may be less than ``<n>``.
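
The positions themselves are just equal subdivisions of the interval between the origin and the
optimised boundary; a rough sketch of the spacing (not PopPUNK's code) is::

    import numpy as np

    def boundary_positions(optimised_shift, n):
        # n equally spaced positions between the origin and the refined boundary,
        # excluding the origin itself
        return np.linspace(0, optimised_shift, n + 1)[1:]

    boundary_positions(0.63, 4)
    # array([0.1575, 0.315 , 0.4725, 0.63  ])
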
For a use of these cluster sets, see the :doc:`poppunk_iterate` section.
threshold
---------
In this mode no model is fitted. You provide the threshold at which the boundary between within- and
between-strain distances is drawn. This can be useful if ``refine`` cannot find a boundary
due to a poorly performing network score, but one can clearly be seen from the plot.
It may also be useful to compare with other fits from related species where a boundary
has been identified using one of the fitting procedures.
Currently only a core-distance boundary is supported (if you would like an accessory or
combined mode available, please `raise an issue `__).
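
Conceptually this mode is very simple: a pair of samples is within-strain whenever its core
distance falls below the cutoff, and those pairs form the edges of the network. As a minimal sketch::

    import numpy as np

    def within_strain(core_dists, threshold):
        # Pairs below the cutoff are within-strain and become network edges
        return np.asarray(core_dists) < threshold

    within_strain([0.001, 0.004, 0.0005], 0.003)
    # array([ True, False,  True])
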
Provide the cutoff with ``--threshold``::
poppunk --fit-model threshold --ref-db listeria --threshold 0.003
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting threshold model to reference database
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Removing 97 sequences
Done
.. image:: images/listeria_threshold.png
:alt: A threshold fit on Listeria
:align: center
.. _lineage-fit:
lineage
-------
This mode defines clusters by joining nearest neighbours. As this will typically
define subclusters within strains, we refer to these as 'lineages'. This can be used
to find subclusters in addition to one of the above models, or for species without
strain-structure (e.g. some viruses, *Neisseria gonorrhoeae*, *Mycobacterium tuberculosis*).
This is the highest resolution (most specific clusters) provided directly by PopPUNK. If it does
not meet your needs, take a look at :doc:`subclustering` for other options.
A model is not fitted; instead a simple data-driven heuristic is used. For each sample, the
nearest :math:`k` neighbours will be identified and joined in the network. Connected components
of the network define lineages, as in the other models. Only core distances are used (add ``--use-accessory`` to modify this),
and in the case of ties all tied distances are included. Note that these links are not necessarily
expected to be transitive, so network scores are not as informative of the optimum.
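
A minimal sketch of this construction (illustrative only, using networkx and a dense distance
matrix; tie handling is omitted)::

    import numpy as np
    import networkx as nx

    def lineage_clusters(core_dists, rank):
        # core_dists: symmetric (n, n) matrix of core distances
        n = core_dists.shape[0]
        G = nx.Graph()
        G.add_nodes_from(range(n))
        for i in range(n):
            others = np.delete(np.arange(n), i)
            # join each sample to its 'rank' nearest neighbours by core distance
            nearest = others[np.argsort(core_dists[i, others])[:rank]]
            G.add_edges_from((i, int(j)) for j in nearest)
        # connected components of the network are the lineages
        return list(nx.connected_components(G))
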
We refer to :math:`k` as the 'rank' of the model. Typically you won't know which rank
to use beforehand, so you can provide multiple integer values to the ``--ranks`` option, comma separated.
Clusters from all ranks will be output, and all can be used with :doc:`query_assignment`. :math:`k = 1` is the
most specific rank, and higher values will form looser clusters. With the *Listeria* example::
poppunk --fit-model lineage --ref-db listeria --ranks 1,2,3,5
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Fitting lineage model to reference database
Network for rank 1
Network summary:
Components 26
Density 0.0271
Transitivity 0.1834
Score 0.1785
Network for rank 2
Network summary:
Components 12
Density 0.0428
Transitivity 0.3528
Score 0.3377
Network for rank 3
Network summary:
Components 6
Density 0.0589
Transitivity 0.4191
Score 0.3944
Network for rank 5
Network summary:
Components 2
Density 0.0904
Transitivity 0.5319
Score 0.4838
Parsed data, now writing to CSV
Done
This has produced four fits, with ranks 1, 2, 3 and 5 (with fit information contained in
the ``.pkl`` file, and a ``.npz`` file for each rank). The ``_clusters.csv`` file will contain the clusters
from the lowest rank. The ``_lineages.csv`` file contains all of the assignments, plus a column
with all of the ranks hyphen-separated (which gives clusters identical to the lowest rank)::
id,Rank_1_Lineage,Rank_2_Lineage,Rank_3_Lineage,Rank_5_Lineage,overall_Lineage
12673_8#24,18,2,2,1,18-2-2-1
12673_8#26,4,2,2,1,4-2-2-1
12673_8#27,26,1,1,1,26-1-1-1
12673_8#28,1,1,1,1,1-1-1-1
12673_8#29,4,2,2,1,4-2-2-1
12673_8#31,18,2,2,1,18-2-2-1
12673_8#32,9,8,1,1,9-8-1-1
12673_8#34,7,7,1,1,7-7-1-1
12673_8#36,1,1,1,1,1-1-1-1
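
If you want to work with these assignments programmatically, the file can be read directly; for
example, counting the number of lineages found at each rank (a sketch, assuming the default output
location for this example)::

    import pandas as pd

    lineages = pd.read_csv("listeria/listeria_lineages.csv")
    rank_cols = [col for col in lineages.columns if col.startswith("Rank_")]
    print(lineages[rank_cols].nunique())  # number of lineages at each rank
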
The best way to assess the ranks is by visualising them (:doc:`visualisation`)::
poppunk_visualise --distances listeria/listeria.dists --ref-db listeria --microreact
Graph-tools OpenMP parallelisation enabled: with 1 threads
PopPUNK: visualise
Loading previously lineage cluster model
Writing microreact output
Parsed data, now writing to CSV
Building phylogeny
Running t-SNE
Done
This can be loaded in microreact: https://microreact.org/project/dVNMftmK6VXRvDxBfrH2y.
Rank 1 has the smallest clusters:
.. image:: images/listeria_lineage_rank_1.png
:alt: Rank 1 lineage fit for Listeria
:align: center
Rank 3 has larger clusters. Some of these clusters are polyphyletic on the core neighbour-joining
tree:
.. image:: images/listeria_lineage_rank_3.png
:alt: Rank 3 lineage fit for Listeria
:align: center
At the model fit stage, you will also get histograms showing the distances included
in the network, which are a useful comparison with the original distance distribution and between ranks:
.. list-table::
* - .. figure:: images/listeria_lineage_rank_1_histogram.png
Rank 1
- .. figure:: images/listeria_lineage_rank_3_histogram.png
Rank 3
Use an existing model with new data
-----------------------------------
There is also one further mode, ``--use-model``, which may be useful in limited circumstances. This
applies any of the above models to a new dataset without refitting it. This may be useful if a reference
dataset has changed (had samples added or removed) and you do not wish to refit the model, for example
because it is already in use. However, typically you would use :doc:`query_assignment` with ``--update-db``
to add to a model::
poppunk --use-model --ref-db new_db --model-dir old_db
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
(with backend: sketchlib v1.6.0
sketchlib: /Users/jlees/miniconda3/envs/pp-py38/lib/python3.8/site-packages/pp_sketchlib.cpython-38-darwin.so)
Graph-tools OpenMP parallelisation enabled: with 1 threads
Mode: Using previous model with a reference database
Loading BGMM 2D Gaussian model
Loaded previous model of type: bgmm
Network summary:
Components 31
Density 0.0897
Transitivity 1.0000
Score 0.9103
Removing 97 sequences
Done