Reference documentation¶
Documentation for module functions (for developers)
assign.py¶
poppunk_assign
main function
- PopPUNK.assign.assign_query(dbFuncs, ref_db, q_files, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_sketch, gpu_dist, gpu_graph, deviceid, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode for CLI
- PopPUNK.assign.assign_query_hdf5(dbFuncs, ref_db, qNames, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_dist, gpu_graph, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode taking hdf5 as input. Written as a separate function so it can be called by web APIs
bgmm.py¶
Functions used to fit the mixture model to a database. Access using BGMMFit.
BGMM using sklearn
- PopPUNK.bgmm.findBetweenLabel_bgmm(means, assignments)[source]¶
Identify between-strain links
Finds the component with the largest number of points assigned to it
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- Returns:
- between_label (int)
The cluster label with the most points assigned to it
- PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]¶
Identify within-strain links
Finds the component with mean closest to the origin and also makes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior some components are unused)
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- rank (int)
Which label to find, ordered by distance from origin. 0-indexed. (default = 0)
- Returns:
- within_label (int)
The cluster label for the within-strain assignments
- PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]¶
Main function to fit the BGMM model, called from fit()
Fits the specified mixture model, saves the model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
- dpgmm_max_K (int)
Maximum number of components to use with the EM fit. (default = 2)
- Returns:
- dpgmm (sklearn.mixture.BayesianGaussianMixture)
Fitted bgmm model
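As a rough illustration of the kind of sklearn call this wraps (a minimal sketch with assumed priors and settings, not PopPUNK's exact parameterisation):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    X = np.random.rand(1000, 2) * 0.1          # stand-in for subsampled core/accessory distances
    dpgmm = BayesianGaussianMixture(
        n_components=2,                         # corresponds to dpgmm_max_K
        covariance_type="full",
        weight_concentration_prior_type="dirichlet_process",
        max_iter=1000,
    ).fit(X)
    assignments = dpgmm.predict(X)              # component membership per sample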
- PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]¶
Modified sklearn GMM function predicting distribution membership
Returns the mixture LL for points X. Used by assign_samples() and plot_contours()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- weights (numpy.array)
Component weights from
fit2dMultiGaussian()
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- scale (numpy.array)
Scaling of core and accessory distances from
fit2dMultiGaussian()
- Returns:
- logprob (numpy.array)
The log of the probabilities under the mixture model
- lpr (numpy.array)
The components of the log probability from each mixture component
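A minimal sketch of how such a mixture log-likelihood can be computed with scipy (illustrative only; the scaling and exact numerics in PopPUNK may differ):

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def mixture_log_likelihood(X, weights, means, covars, scale):
        X = X / scale                                 # rescale distances into fit space
        lpr = np.stack([
            np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covars)
        ], axis=1)                                    # n x K per-component log probabilities
        logprob = logsumexp(lpr, axis=1)              # mixture log-likelihood per sample
        return logprob, lpr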
- PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]¶
Log likelihood of multivariate normal density distribution
Used to calculate per component Gaussian likelihood in
assign_samples()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- min_covar (float)
Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
- Returns:
- log_prob (numpy.array)
An n-vector with the log-likelihoods for each sample being in this component
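A sketch of a Cholesky-based per-component log-density, in the spirit of the old sklearn helper this mirrors (an assumption; the actual implementation may vectorise over components):

    import numpy as np
    from scipy import linalg

    def log_mvn_density_single(X, mean, covar, min_covar=1e-7):
        n_dim = X.shape[1]
        try:
            chol = linalg.cholesky(covar, lower=True)
        except linalg.LinAlgError:
            # add jitter when the covariance is not positive definite
            chol = linalg.cholesky(covar + min_covar * np.eye(n_dim), lower=True)
        sol = linalg.solve_triangular(chol, (X - mean).T, lower=True).T
        log_det = 2 * np.sum(np.log(np.diagonal(chol)))
        return -0.5 * (np.sum(sol ** 2, axis=1) + n_dim * np.log(2 * np.pi) + log_det)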
dbscan.py¶
Functions used to fit DBSCAN to a database. Access using DBSCANFit.
DBSCAN using hdbscan
- PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]¶
Evaluate whether fitted dbscan model contains non-overlapping clusters
- Args:
- model (DBSCANFit)
Fitted model from
fit()
- Returns:
- indistinct (bool)
Boolean indicating whether putative within- and between-strain clusters of points overlap
- PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]¶
Identify between-strain links from a DBSCAN model
Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.
- Args:
- assignments (numpy.array)
Sample cluster assignments
- within_cluster (int)
Cluster ID assigned to within-strain assignments, from
findWithinLabel()
- Returns:
- between_cluster (int)
The cluster label for the between-strain assignments
- PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out, use_gpu=False)[source]¶
Function to fit a DBSCAN model as an alternative to the Gaussian mixture model
Fits the DBSCAN model to the distances using hdbscan
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples
- min_samples (int)
Parameter for DBSCAN clustering ‘conservativeness’
- min_cluster_size (int)
Minimum number of points in a cluster for HDBSCAN
- cache_out (str)
Prefix for DBSCAN cache used for refitting
- use_gpu (bool)
Whether GPU algorithms should be used in DBSCAN fitting
- Returns:
- hdb (hdbscan.HDBSCAN or cuml.cluster.HDBSCAN)
Fitted HDBSCAN to subsampled data
- labels (list)
Cluster assignments of each sample
- n_clusters (int)
Number of clusters used
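A minimal CPU-only sketch of such an HDBSCAN fit (parameter values are illustrative, not PopPUNK's defaults):

    import numpy as np
    import hdbscan

    X = np.random.rand(5000, 2) * 0.1           # stand-in for subsampled distances
    hdb = hdbscan.HDBSCAN(
        min_samples=25,
        min_cluster_size=50,
        prediction_data=True,                    # keep data needed to assign new points
    ).fit(X)
    labels = hdb.labels_                         # -1 marks noise points
    n_clusters = int(labels.max()) + 1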
mandrake.py¶
- PopPUNK.mandrake.generate_embedding(seqLabels, accMat, perplexity, outPrefix, overwrite, kNN=50, maxIter=10000000, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate t-SNE projection using accessory distances
Writes a plot of t-SNE clustering of accessory distances (.dot)
- Args:
- seqLabels (list)
Processed names of sequences being analysed.
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to t-SNE
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- overwrite (bool)
Overwrite existing output if present (default = False)
- kNN (int)
Number of neighbours to use with SCE (cannot be > n_samples) (default = 50)
- maxIter (int)
Number of iterations to run (default = 10000000)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use GPU libraries
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- mandrake_filename (str)
Filename with .dot of embedding
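The embedding itself is produced by the mandrake package (stochastic cluster embedding). As a rough stand-in for the idea, a 2D embedding of a precomputed accessory distance matrix could be sketched with scikit-learn's t-SNE (a substitute technique, not the mandrake API):

    import numpy as np
    from sklearn.manifold import TSNE

    accMat = np.random.rand(200, 200)            # toy n x n accessory distances
    accMat = (accMat + accMat.T) / 2             # symmetrise
    np.fill_diagonal(accMat, 0.0)
    embedding = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",                           # required with a precomputed metric
        perplexity=30,
    ).fit_transform(accMat)                      # n x 2 coordinates for plotting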
models.py¶
Classes used for model fits
- class PopPUNK.models.BGMMFit(outPrefix, max_samples=100000, max_batch_size=100000, assign_points=True)[source]¶
Class for fits using the Gaussian mixture model. Inherits from ClusterFit.
Must first run either fit() or load() before calling other functions.
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- max_samples (int)
The number of subsamples to fit the model to (default = 100000)
- assign(X, max_batch_size=100000, values=False, progress=True)[source]¶
Assign the clustering of new samples using
assign_samples()
- Args:
- X (numpy.array)
Core and accessory distances
- values (bool)
Return the responsibilities of assignment rather than most likely cluster
- max_batch_size (int)
Size of batches to be assigned
- progress (bool)
Show progress bar
[default = True]
- Returns:
- y (numpy.array)
Cluster assignments or values by samples
- fit(X, max_components)[source]¶
Extends fit()
Fits the BGMM and returns assignments by calling fit2dMultiGaussian(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- max_components (int)
Maximum number of mixture components to use.
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (sklearn.mixture.BayesianGaussianMixture)
The saved fit object
- plot(X, y)[source]¶
Extends plot()
Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Cluster assignments from
assign()
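A hypothetical usage sketch of the BGMMFit fit/assign/plot cycle, based only on the signatures documented above (the file name is a placeholder):

    import numpy as np
    from PopPUNK.models import BGMMFit

    distMat = np.load("dists.npy")               # hypothetical n x 2 core/accessory array
    model = BGMMFit("bgmm_output")
    y = model.fit(distMat, max_components=2)     # fit and return cluster assignments
    model.plot(distMat, y)                       # write summary plots
    y_new = model.assign(distMat)                # assign distances with the fitted model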
- class PopPUNK.models.ClusterFit(outPrefix, default_dtype=numpy.float32)[source]¶
Parent class for all models used to cluster distances
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- fit(X=None)[source]¶
Initial steps for all fit functions.
Creates the output directory. If preprocess is set, then subsamples the passed X
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
(default = None)
- default_dtype (numpy dtype)
Type to use if no X provided
- class PopPUNK.models.DBSCANFit(outPrefix, use_gpu=False, max_batch_size=5000, max_samples=100000, assign_points=True)[source]¶
Class for fits using HDBSCAN. Inherits from ClusterFit.
Must first run either fit() or load() before calling other functions.
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- max_samples (int)
The number of subsamples to fit the model to (default = 100000)
- assign(X, no_scale=False, progress=True, max_batch_size=5000, use_gpu=False)[source]¶
Assign the clustering of new samples using
assign_samples_dbscan()
- Args:
- X (numpy.array or cupy.array)
Core and accessory distances
- no_scale (bool)
Do not scale X [default = False]
- progress (bool)
Show progress bar [default = True]
- max_batch_size (int)
Batch size used for assignments [default = 5000]
- use_gpu (bool)
Use GPU-enabled algorithms for clustering [default = False]
- Returns:
- y (numpy.array)
Cluster assignments by samples
- fit(X, max_num_clusters, min_cluster_prop, use_gpu=False)[source]¶
Extends fit()
Fits the distances with HDBSCAN and returns assignments by calling fitDbScan(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- max_num_clusters (int)
Maximum number of clusters in DBSCAN fitting
- min_cluster_prop (float)
Minimum proportion of points in a cluster in DBSCAN fitting
- use_gpu (bool)
Whether GPU algorithms should be used in DBSCAN fitting
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (hdbscan.HDBSCAN)
The saved fit object
- plot(X=None, y=None)[source]¶
Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_dbscan_results()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Cluster assignments from
assign()
- class PopPUNK.models.LineageFit(outPrefix, ranks, max_search_depth, reciprocal_only, count_unique_distances, dist_col=None, use_gpu=False)[source]¶
Class for fits using the lineage assignment model. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions- Args:
- outPrefix (str)
The output prefix used for reading/writing
- ranks (list)
The ranks used in the fit
- assign(rank)[source]¶
Get the edges for the network. A little different from other methods, as it doesn’t go through the long form distance vector (as coo_matrix is basically already in the correct gt format)
- Args:
- rank (int)
Rank to assign at
- Returns:
- y (list of tuples)
Edges to include in network
- edge_weights(rank)[source]¶
Get the distances for each edge returned by assign
- Args:
- rank (int)
Rank assigned at
- Returns:
- weights (list)
Distance for each assignment
- extend(qqDists, qrDists)[source]¶
Update the sparse distance matrix of nearest neighbours after querying
- Args:
- qqDists (numpy or cupy ndarray)
Two column array of query-query distances
- qrDists (numpy or cupy ndarray)
Two column array of reference-query distances
- Returns:
- y (list of tuples)
Edges to include in network
- fit(X, accessory)[source]¶
Extends fit()
Gets assignments by using nearest neighbours.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- accessory (bool)
Use accessory rather than core distances
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (sklearn.mixture.BayesianGaussianMixture)
The saved fit object
- plot(X, y=None)[source]¶
Extends plot()
Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()
- Args:
- X (numpy.array)
Core and accessory distances
- y (any)
Unused variable for compatibility with other plotting functions
- class PopPUNK.models.RefineFit(outPrefix)[source]¶
Class for fits using a triangular boundary and network properties. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions- Args:
- outPrefix (str)
The output prefix used for reading/writing
- apply_threshold(X, threshold)[source]¶
Applies a boundary threshold, given by user. Does not run optimisation.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- threshold (float)
The value along the x-axis (core distance) at which to draw the assignment boundary
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
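In effect this draws a vertical line on the core-distance axis. A plain numpy illustration of that idea (the labels used here are an assumption, not necessarily the values RefineFit returns):

    import numpy as np

    def threshold_assign(X, threshold):
        # X[:, 0] holds core distances; points left of the boundary are
        # treated as within-strain (assumed label 0), the rest between-strain (1)
        return np.where(X[:, 0] < threshold, 0, 1)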
- assign(X, slope=None)[source]¶
Assign the clustering of new samples
- Args:
- X (numpy.array)
Core and accessory distances
- slope (int)
Override self.slope: set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope (default = use self.slope)
- Returns:
- y (numpy.array)
Cluster assignments by samples
- fit(X, sample_names, model, max_move, min_move, startFile=None, indiv_refine=False, unconstrained=False, multi_boundary=0, score_idx=0, no_local=False, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Extends fit()
Fits the distances by optimising network score, by calling refineFit2D(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- sample_names (list)
Sample names in X (accessed by iterDistRows())
- model (ClusterFit)
The model fit to refine
- max_move (float)
Maximum distance to move away from start point
- min_move (float)
Minimum distance to move away from start point
- startFile (str)
A file defining an initial fit, rather than one from --fit-model. See documentation for format. (default = None)
- indiv_refine (str)
Run refinement for core or accessory distances separately (default = None).
- multi_boundary (int)
Produce cluster output at multiple boundary positions downward from the optimum. (default = 0).
- unconstrained (bool)
If True, search in 2D and change the slope of the boundary
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (None)
The saved fit object (not used)
- plot(X, y=None)[source]¶
Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_refined_results()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Assignments (unused)
- PopPUNK.models.assign_samples(chunk, X, y, model, scale, chunk_size, values=False)[source]¶
Runs a model's assignment on a chunk of input
- Args:
- chunk (int)
Index of chunk to process
- X (NumpyShared)
n x 2 array of core and accessory distances for n samples
- y (NumpyShared)
An n-vector to store results, with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.
- weights (numpy.array)
Component weights from
BGMMFit
- means (numpy.array)
Component means from
BGMMFit
- covars (numpy.array)
Component covariances from
BGMMFit
- scale (numpy.array)
Scaling of core and accessory distances from
BGMMFit
- chunk_size (int)
Size of each chunk in X
- values (bool)
Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation).
Default is False
- PopPUNK.models.loadClusterFit(pkl_file, npz_file, outPrefix='', max_samples=100000, use_gpu=False)[source]¶
Call this to load a fitted model
- Args:
- pkl_file (str)
Location of saved .pkl file on disk
- npz_file (str)
Location of saved .npz file on disk
- outPrefix (str)
Output prefix for model to save to (e.g. plots)
- max_samples (int)
Maximum samples if subsampling X [default = 100000]
- use_gpu (bool)
Whether to load npz file with GPU libraries for lineage models
- Returns:
- load_obj (model)
Loaded model
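A hypothetical loading sketch based on the signature above (paths are placeholders; the .pkl/.npz pair is written when a model is fitted):

    from PopPUNK.models import loadClusterFit

    model = loadClusterFit("db/db_fit.pkl", "db/db_fit.npz", outPrefix="db")
    # the returned object is one of the fit classes above, so the usual
    # methods apply, e.g. model.assign(distMat) for a BGMM or DBSCAN fit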
network.py¶
Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.
Network functions
- PopPUNK.network.addQueryToNetwork(dbFuncs, rList, qList, G, assignments, model, queryDB, kmers=None, distance_type='euclidean', queryQuery=False, strand_preserved=False, weights=None, threads=1, use_gpu=False)[source]¶
Finds edges between queries and items in the reference database, and modifies the network to include them.
- Args:
- dbFuncs (list)
List of backend functions from
setupDBFuncs()
- rList (list)
List of reference names
- qList (list)
List of query names
- G (graph)
Network to add to (mutated)
- assignments (numpy.array)
Cluster assignment of items in qlist
- model (ClusterModel)
Model fitted to reference database
- queryDB (str)
Query database location
- distances (str)
Prefix of distance files for extending network
- kmers (list)
List of k-mer sizes
- distance_type (str)
Distance type to use as weights in network
- queryQuery (bool)
Add in all query-query distances (default = False)
- strand_preserved (bool)
Whether to treat strand as known (i.e. ignore rc k-mers) when adding random distances. Only used if queryQuery = True [default = False]
- weights (numpy.array)
If passed, the core,accessory distances for each assignment, which will be annotated as an edge attribute
- threads (int)
Number of threads to use if new db created (default = 1)
- use_gpu (bool)
Whether to use cugraph for analysis
- Returns:
- distMat (numpy.array)
Query-query distances
- PopPUNK.network.checkNetworkVertexCount(seq_list, G, use_gpu)[source]¶
Checks the number of network vertices matches the number of sequence names.
- Args:
- seq_list (list)
The list of sequence names
- G (graph)
The network of sequences
- use_gpu (bool)
Whether to use cugraph for graph analyses
- PopPUNK.network.cliquePrune(component, graph, reference_indices, components_list)[source]¶
Wrapper function around
getCliqueRefs()
so it can be called by a multiprocessing pool
- PopPUNK.network.construct_dense_weighted_network(rlist, distMat, weights_type=None, use_gpu=False)[source]¶
Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Type of weight to use for network
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_assignments(rlist, qlist, assignments, within_label=1, int_offset=0, weights=None, distMat=None, weights_type=None, previous_network=None, old_ids=None, adding_qq_dists=False, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- assignments (numpy.array or int)
Labels of most likely cluster assignment
- within_label (int)
The label for the cluster representing within-strain distances
- int_offset (int)
Constant integer to add to each node index
- weights (list)
List of weights for each edge in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance
- previous_network (str)
Name of file containing a previous network to be integrated into this new network
- old_ids (list)
Ordered list of vertex names in previous network
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_df(rlist, qlist, G_df, weights=False, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a data frame of edges. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- G_df (cudf or pandas data frame)
Data frame in which the first two columns are the nodes linked by edges
- weights (bool)
Whether weights in the G_df data frame should be included in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_edge_list(rlist, qlist, edge_list, weights=None, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a list of edges as tuples. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- edge_list (list of tuples)
List of tuples describing the edges of the graph
- weights (list)
List of edge weights
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
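A simplified sketch of the graph-tool core of such a construction (CPU path, unweighted; the real function also handles previous networks, weights and summaries):

    import graph_tool.all as gt

    vertex_labels = ["s1", "s2", "s3", "s4"]     # rlist + qlist
    edge_list = [(0, 1), (1, 2)]                 # within-strain pairs as vertex indices

    G = gt.Graph(directed=False)
    G.add_vertex(len(vertex_labels))             # add all nodes, including singletons
    G.add_edge_list(edge_list)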
- PopPUNK.network.construct_network_from_sparse_matrix(rlist, qlist, sparse_input, weights=None, previous_network=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a sparse matrix. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- sparse_input (numpy.array)
Sparse distance matrix from lineage fit
- weights (list)
List of weights for each edge in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str)
Name of file containing a previous network to be integrated into this new network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.cugraph_to_graph_tool(G, rlist)[source]¶
Convert a cugraph network to a graph-tool network
- Args:
- G (cugraph network)
Cugraph network
- rlist (list)
List of sequence names
- Returns:
- G (graph-tool network)
Graph tool network
- PopPUNK.network.extractReferences(G, dbOrder, outPrefix, outSuffix='', type_isolate=None, existingRefs=None, threads=1, use_gpu=False)[source]¶
Extract references for each cluster based on cliques
Writes chosen references to file by calling
writeReferences()
- Args:
- G (graph)
A network used to define clusters
- dbOrder (list)
The order of files in the sketches, so returned references are in the same order
- outPrefix (str)
Prefix for output file
- outSuffix (str)
Suffix for output file (.refs will be appended)
- type_isolate (str)
Isolate to be included in set of references
- existingRefs (list)
References that should be used for each clique
- use_gpu (bool)
Use cugraph for graph analysis (default = False)
- Returns:
- refFileName (str)
The name of the file references were written to
- references (list)
An updated list of the reference names
- PopPUNK.network.fetchNetwork(network_dir, model, refList, ref_graph=False, core_only=False, accessory_only=False, use_gpu=False)[source]¶
Load the network based on input options
Returns the network as a graph-tool format graph, and sets the slope parameter of the passed model object.
- Args:
- network_dir (str)
A network used to define clusters
- model (ClusterFit)
A fitted model object
- refList (list)
Names of references that should be in the network
- ref_graph (bool)
Use ref only graph, if available [default = False]
- core_only (bool)
Return the network created using only core distances [default = False]
- accessory_only (bool)
Return the network created using only accessory distances [default = False]
- use_gpu (bool)
Use cugraph library to load graph
- Returns:
- genomeNetwork (graph)
The loaded network
- cluster_file (str)
The CSV of cluster assignments corresponding to this network
- PopPUNK.network.generate_cugraph(G_df, max_index, weights=False, renumber=True)[source]¶
Builds cugraph graph to ensure all nodes are included in the graph, even if singletons.
- Args:
- G_df (cudf)
cudf data frame containing edge list
- max_index (int)
The 0-indexed maximum of the node indices
- renumber (bool)
Whether to renumber the vertices when added to the graph
- Returns:
- G_new (graph)
The resulting cugraph network, including any singleton nodes
- PopPUNK.network.generate_minimum_spanning_tree(G, from_cugraph=False)[source]¶
Generate a minimum spanning tree from a network
- Args:
- G (network)
Graph tool network
- from_cugraph (bool)
If a pre-calculated MST from cugraph [default = False]
- Returns:
- mst_network (str)
Minimum spanning tree (as graph-tool graph)
- PopPUNK.network.getCliqueRefs(G, reference_indices={})[source]¶
Recursively prune a network of its cliques. Returns one vertex from a clique at each stage
- Args:
- G (graph)
The graph to get clique representatives from
- reference_indices (set)
The unique list of vertices being kept, to add to
- PopPUNK.network.get_vertex_list(G, use_gpu=False)[source]¶
Generate a list of node indices
- Args:
- G (network)
Graph tool network
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- Returns:
- vlist (list)
List of integers corresponding to nodes
- PopPUNK.network.load_network_file(fn, use_gpu=False)[source]¶
Load the network based on input options
Returns the network as a graph-tool format graph, or a cugraph graph if use_gpu is set.
- Args:
- fn (str)
Network file name
- use_gpu (bool)
Use cugraph library to load graph
- Returns:
- genomeNetwork (graph)
The loaded network
- PopPUNK.network.networkSummary(G, calc_betweenness=True, betweenness_sample=100, subsample=None, use_gpu=False)[source]¶
Provides summary values about the network
- Args:
- G (graph)
The network of strains
- calc_betweenness (bool)
Whether to calculate betweenness stats
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- subsample (int)
Number of vertices to randomly subsample from graph
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- metrics (list)
List with # components, density, transitivity, mean betweenness and weighted mean betweenness
- scores (list)
List of scores
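A sketch of the kind of graph-tool calls behind these summaries (assumed, CPU-only, and without the betweenness subsampling logic):

    import graph_tool.all as gt

    def summarise(G):
        components, _hist = gt.label_components(G)
        n_components = len(set(components.a))
        n = G.num_vertices()
        density = 2 * G.num_edges() / (n * (n - 1)) if n > 1 else 0.0
        transitivity = gt.global_clustering(G)[0]
        return n_components, density, transitivity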
- PopPUNK.network.network_to_edges(prev_G_fn, rlist, adding_qq_dists=False, old_ids=None, previous_pkl=None, weights=False, use_gpu=False)[source]¶
Load previous network, extract the edges to match the vertex order specified in rlist, and also return weights if specified.
- Args:
- prev_G_fn (str or graph object)
Path of file containing existing network, or already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- rlist (list)
List of reference sequence labels in new network
- old_ids (list)
List of IDs of vertices in existing network
- previous_pkl (str)
Path of pkl file containing names of sequences in previous network
- weights (bool)
Whether to return edge weights (default = False)
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- source_ids (list)
Source nodes for each edge
- target_ids (list)
Target nodes for each edge
- edge_weights (list)
Weights for each new edge
- PopPUNK.network.printClusters(G, rlist, outPrefix=None, oldClusterFile=None, externalClusterCSV=None, printRef=True, printCSV=True, clustering_type='combined', write_unwords=True, use_gpu=False)[source]¶
Get cluster assignments
Also writes assignments to a CSV file
- Args:
- G (graph)
Network used to define clusters
- rlist (list)
Names of samples
- outPrefix (str)
Prefix for output CSV (default = None)
- oldClusterFile (str)
CSV with previous cluster assignments. Pass to ensure consistency in cluster assignment names. (default = None)
- externalClusterCSV (str)
CSV with cluster assignments from any source. Will print a file relating these to new cluster assignments. (default = None)
- printRef (bool)
If false, print only query sequences in the output (default = True)
- printCSV (bool)
Print results to file (default = True)
- clustering_type (str)
Type of clustering network, used for comparison with old clusters (default = ‘combined’)
- write_unwords (bool)
Write clusters with a pronounceable name rather than a numerical index (default = True)
- use_gpu (bool)
Whether to use cugraph for network analysis
- Returns:
- clustering (dict)
Dictionary of cluster assignments (keys are sequence names)
- PopPUNK.network.printExternalClusters(newClusters, extClusterFile, outPrefix, oldNames, printRef=True)[source]¶
Prints cluster assignments with respect to previously defined clusters or labels.
- Args:
- newClusters (set iterable)
The components from the graph G, defining the PopPUNK clusters
- extClusterFile (str)
A CSV file containing definitions of the external clusters for each sample (does not need to contain all samples)
- outPrefix (str)
Prefix for output CSV (_external_clusters.csv)
- oldNames (list)
A list of the reference sequences
- printRef (bool)
If false, print only query sequences in the output (default = True)
- PopPUNK.network.print_network_summary(G, sample_size=None, betweenness_sample=100, use_gpu=False)[source]¶
Wrapper function for printing network information
- Args:
- G (graph)
The network to summarise
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- use_gpu (bool)
Whether to use GPUs for network construction
- PopPUNK.network.process_previous_network(previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, vertex_labels=None, weights=False, use_gpu=False)[source]¶
Extract edge types from an existing network
- Args:
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network ordered based on the original network construction
- vertex_labels (list)
Ordered list of sequence labels
- weights (bool)
Whether weights should be extracted from the previous network
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- extra_sources (list)
List of source node identifiers
- extra_targets (list)
List of destination node identifiers
- extra_weights (list or None)
List of edge weights
- PopPUNK.network.process_weights(distMat, weights_type)[source]¶
Calculate edge weights from the distance matrix
- Args:
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance
- Returns:
- processed_weights (list)
Edge weights
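The three weighting options can be illustrated directly from a two-column distance matrix with numpy (a sketch, not the exact implementation):

    import numpy as np

    def weights_from_distmat(distMat, weights_type):
        if weights_type == "core":
            return distMat[:, 0].tolist()
        elif weights_type == "accessory":
            return distMat[:, 1].tolist()
        else:                                   # "euclidean"
            return np.linalg.norm(distMat, axis=1).tolist()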
- PopPUNK.network.prune_graph(prefix, reflist, samples_to_keep, output_db_name, threads, use_gpu)[source]¶
Keep only the specified sequences in a graph
- Args:
- prefix (str)
Name of directory containing network
- reflist (list)
Ordered list of sequences of database
- samples_to_keep (list)
The names of samples to be retained in the graph
- output_db_name (str)
Name of output directory
- threads (int)
Number of CPU threads to use when recalculating random match chances [default = 1].
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- PopPUNK.network.remove_nodes_from_graph(G, reflist, samples_to_keep, use_gpu)[source]¶
Return a modified graph containing only the requested nodes
- Args:
- reflist (list)
Ordered list of sequences of database
- samples_to_keep (list)
The names of samples to be retained in the graph
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- Returns:
- G_new (graph)
Pruned graph
- PopPUNK.network.remove_non_query_components(G, rlist, qlist, use_gpu=False)[source]¶
Removes all components that do not contain a query sequence.
- Args:
- G (graph)
Network of queries linked to reference sequences
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- pruned_names (list)
The labels of the sequences in the pruned network
- PopPUNK.network.save_network(G, prefix=None, suffix=None, use_graphml=False, use_gpu=False)[source]¶
Save a network to disk
- Args:
- G (network)
Graph tool network
- prefix (str)
Prefix for output file
- use_graphml (bool)
Whether to output a graph-tool file in graphml format
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- PopPUNK.network.sparse_mat_to_network(sparse_mat, rlist, use_gpu=False)[source]¶
Generate a network from a lineage rank fit
- Args:
- sparse_mat (scipy or cupyx sparse matrix)
Sparse matrix of kNN from lineage fit
- rlist (list)
List of sequence names
- use_gpu (bool)
Whether GPU libraries should be used
- Returns:
- G (network)
Graph tool or cugraph network
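A sketch of the CPU path, turning a scipy sparse kNN matrix into a graph-tool network (an assumption about the general shape of the conversion only):

    import graph_tool.all as gt
    from scipy.sparse import coo_matrix

    def sparse_to_network(sparse_mat, rlist):
        sparse_mat = coo_matrix(sparse_mat)      # ensure COO format for row/col access
        G = gt.Graph(directed=False)
        G.add_vertex(len(rlist))                 # one node per sequence name
        G.add_edge_list(list(zip(sparse_mat.row, sparse_mat.col)))
        return G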
- PopPUNK.network.translate_network_indices(G_ref_df, reference_indices)[source]¶
Function for ensuring an updated reference network retains numbering consistent with sample names
- Args:
- G_ref_df (cudf data frame)
List of edges in reference network
- reference_indices (list)
The ordered list of reference indices in the original network
- Returns:
- G_ref (cugraph network)
Network of reference sequences
- PopPUNK.network.vertex_betweenness(graph, norm=True)[source]¶
Returns betweenness for nodes in the graph
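For the graph-tool case this is a thin wrapper around the centrality module; a minimal sketch (the cugraph path is omitted):

    import graph_tool.all as gt

    def betweenness_values(graph, norm=True):
        vertex_bt, _edge_bt = gt.betweenness(graph, norm=norm)
        return list(vertex_bt.a)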
- PopPUNK.network.writeReferences(refList, outPrefix, outSuffix='')[source]¶
Writes chosen references to file
- Args:
- refList (list)
Reference names to write
- outPrefix (str)
Prefix for output file
- outSuffix (str)
Suffix for output file (.refs will be appended)
- Returns:
- refFileName (str)
The name of the file references were written to
refine.py¶
Functions used to refine an existing model. Access using RefineFit.
Refine mixture model using network properties
- PopPUNK.refine.check_search_range(scale, mean0, mean1, lower_s, upper_s)[source]¶
Checks a search range is within a valid range
- Args:
- scale (np.array)
Rescaling factor to [0, 1] for each axis
- mean0 (np.array)
(x, y) of starting point defining line
- mean1 (np.array)
(x, y) of end point defining line
- lower_s (float)
distance along line to start search
- upper_s (float)
distance along line to end search
- Returns:
- min_x, max_x
minimum and maximum x-intercepts of the search range
- min_y, max_y
minimum and maximum y-intercepts of the search range
- PopPUNK.refine.expand_cugraph_network(G, G_extra_df)[source]¶
Reconstruct a cugraph network with additional edges.
- Args:
- G (cugraph network)
Original cugraph network
- extra_edges (cudf dataframe)
Data frame of edges to add
- Returns:
- G (cugraph network)
Expanded cugraph network
- PopPUNK.refine.growNetwork(sample_names, i_vec, j_vec, idx_vec, s_range, score_idx=0, thread_idx=0, betweenness_sample=100, write_clusters=None, sample_size=None, use_gpu=False)[source]¶
Construct a network, then add edges to it iteratively. Input is from pp_sketchlib.iterateBoundary1D or pp_sketchlib.iterateBoundary2D
- Args:
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- i_vec (list)
Ordered ref vertex index to add
- j_vec (list)
Ordered query (==ref) vertex index to add
- idx_vec (list)
For each i, j tuple, the index of the intercept at which these enter the network. These are sorted and increasing
- s_range (list)
Offsets which correspond to idx_vec entries
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- thread_idx (int)
Optional thread idx (if multithreaded) to offset progress bar by
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- write_clusters (str)
Set to a prefix to write the clusters from each position to files [default = None]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- scores (list)
-1 * network score for each of x_range. Where network score is from
networkSummary()
- PopPUNK.refine.likelihoodBoundary(s, model, start, end, within, between)[source]¶
Wrapper function around fit2dMultiGaussian() so that it can go into a root-finding function for probabilities between components
- Args:
- s (float)
Distance along line from mean0
- model (BGMMFit)
Fitted mixture model
- start (numpy.array)
The co-ordinates of the centre of the within-strain distribution
- end (numpy.array)
The co-ordinates of the centre of the between-strain distribution
- within (int)
Label of the within-strain distribution
- between (int)
Label of the between-strain distribution
- Returns:
- responsibility (float)
The difference between responsibilities of assignment to the within component and the between assignment
- PopPUNK.refine.multi_refine(distMat, sample_names, mean0, mean1, scale, s_max, n_boundary_points, output_prefix, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Move the refinement boundary between the optimum and where it meets an axis. In discrete steps, output the clusters at each step
- Args:
- distMat (numpy.array)
n x 2 array of core and accessory distances for n samples
- sample_names (list)
List of query sequence labels
- mean0 (numpy.array)
Start point to define search line
- mean1 (numpy.array)
End point to define search line
- scale (numpy.array)
Scaling factor of distMat
- s_max (float)
The optimal s position from refinement (refineFit())
- n_boundary_points (int)
Number of positions to try drawing the boundary at
- num_processes (int)
Number of threads to use in the global optimisation step. (default = 1)
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- PopPUNK.refine.newNetwork(s, sample_names, distMat, mean0, mean1, gradient, slope=2, score_idx=0, cpus=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Wrapper function for construct_network_from_edge_list() which is called by optimisation functions moving a triangular decision boundary.
Given the boundary parameterisation, constructs the network and returns its score, to be minimised.
- Args:
- s (float)
Distance along line between start_point and mean1 from start_point
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- distMat (numpy.array or NumpyShared)
Core and accessory distances or NumpyShared describing these in sharedmem
- mean0 (numpy.array)
Start point
- mean1 (numpy.array)
End point
- gradient (float)
Gradient of line to move along
- slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope [default = 2]
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- cpus (int)
Number of CPUs to use for calculating assignment
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- score (float)
-1 * network score. Where network score is from
networkSummary()
- PopPUNK.refine.newNetwork2D(y_idx, sample_names, distMat, x_range, y_range, score_idx=0, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Wrapper function for thresholdIterate2D and growNetwork().
For a given y_max, constructs networks across x_range and returns a list of scores
- Args:
- y_idx (float)
Maximum y-intercept of boundary, as index into y_range
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- distMat (numpy.array or NumpyShared)
Core and accessory distances or NumpyShared describing these in sharedmem
- x_range (list)
Sorted list of x-intercepts to search
- y_range (list)
Sorted list of y-intercepts to search
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- scores (list)
-1 * network score for each of x_range. Where network score is from
networkSummary()
- PopPUNK.refine.readManualStart(startFile)[source]¶
Reads a file to define a manual start point, rather than using --fit-model
Throws and exits if incorrectly formatted.
- Args:
- startFile (str)
Name of file with values to read
- Returns:
- mean0 (numpy.array)
Centre of within-strain distribution
- mean1 (numpy.array)
Centre of between-strain distribution
- scaled (bool)
True if means are scaled between [0,1]
- PopPUNK.refine.refineFit(distMat, sample_names, mean0, mean1, scale, max_move, min_move, slope=2, score_idx=0, unconstrained=False, no_local=False, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Try to refine a fit by maximising a network score based on transitivity and density.
Iteratively move the decision boundary to do this, using starting point from existing model.
- Args:
- distMat (numpy.array)
n x 2 array of core and accessory distances for n samples
- sample_names (list)
List of query sequence labels
- mean0 (numpy.array)
Start point to define search line
- mean1 (numpy.array)
End point to define search line
- scale (numpy.array)
Scaling factor of distMat
- max_move (float)
Maximum distance to move away from start point
- min_move (float)
Minimum distance to move away from start point
- slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- unconstrained (bool)
If True, search in 2D and change the slope of the boundary
- no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
- num_processes (int)
Number of threads to use in the global optimisation step. (default = 1)
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- optimal_x (float)
x-coordinate of refined fit
- optimal_y (float)
y-coordinate of refined fit
- optimised_s (float)
Position along search range of refined fit
plot.py¶
Plots of GMM results, k-mer fits, and microreact output
- PopPUNK.plot.createMicroreact(prefix, microreact_files, api_key=None)[source]¶
Creates a .microreact file, and an instance via the API
- Args:
- prefix (str)
Prefix for output file
- microreact_files (list)
List of Microreact files [clusters, dot, tree, mst_tree]
- api_key (str)
API key for your account
- PopPUNK.plot.distHistogram(dists, rank, outPrefix)[source]¶
Plot a histogram of distances (1D)
- Args:
- dists (np.array)
Distance vector
- rank (int)
Rank (used for name and title)
- outPrefix (str)
Full path prefix for plot file
- PopPUNK.plot.drawMST(mst, outPrefix, isolate_clustering, clustering_name, overwrite)[source]¶
Plot a layout of the minimum spanning tree
- Args:
- mst (graph_tool.Graph)
A minimum spanning tree
- outPrefix (str)
Output prefix for save files
- isolate_clustering (dict)
Dictionary of ID: cluster, used for colouring vertices
- clustering_name (str)
Name of clustering scheme to be used for colouring
- overwrite (bool)
Overwrite existing output files
- PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]¶
Get a square grid of points to evaluate a function across
Used for plot_scatter() and plot_contours()
- Args:
- minimum (float)
Minimum value for grid
- maximum (float)
Maximum value for grid
- resolution (int)
Number of points along each axis
- Returns:
- xx (numpy.array)
x values across n x n grid
- yy (numpy.array)
y values across n x n grid
- xy (numpy.array)
n x 2 pairs of x, y values grid is over
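An equivalent numpy construction of such a grid (a straightforward sketch):

    import numpy as np

    def square_grid(minimum, maximum, resolution):
        xx, yy = np.meshgrid(
            np.linspace(minimum, maximum, resolution),
            np.linspace(minimum, maximum, resolution),
        )
        xy = np.column_stack([xx.ravel(), yy.ravel()])   # n x 2 evaluation points
        return xx, yy, xy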
- PopPUNK.plot.outputsForCytoscape(G, G_mst, isolate_names, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True, use_partial_query_graph=None)[source]¶
Write outputs for cytoscape. A graphml of the network, and CSV with metadata
- Args:
- G (graph)
The network to write
- G_mst (graph)
The minimum spanning tree of G
- isolate_names (list)
Ordered list of sequence names
- clustering (dict)
Dictionary of cluster assignments (keys are nodeNames).
- outPrefix (str)
Prefix for files to be written
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters.
- queryList (list)
Optional list of isolates that have been added as a query. (default = None)
- suffix (string)
String to append to network file name. (default = None)
- writeCsv (bool)
Whether to print CSV file to accompany network
- use_partial_query_graph (str)
File listing sequences to be included in output graph
- PopPUNK.plot.outputsForGrapetree(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Grapetree
Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory.
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False).
- PopPUNK.plot.outputsForMicroreact(combined_list, clustering, nj_tree, mst_tree, accMat, perplexity, maxIter, outPrefix, epiCsv, queryList=None, overwrite=False, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate files for microreact
Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to mandrake
- maxIter (int)
Maximum iterations for mandrake
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use a GPU for t-SNE generation
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- outfiles (list)
List of output files created
- PopPUNK.plot.outputsForPhandango(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Phandango
Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- threads (int)
Number of threads to use with rapidnj
- PopPUNK.plot.plot_contours(model, assignments, title, out_prefix)[source]¶
Draw contours of mixture model assignments
Will draw the decision boundary for between/within in red
- Args:
- model (BGMMFit)
Model we are plotting from
- assignments (numpy.array)
n-vector of cluster assignments for the model
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- PopPUNK.plot.plot_database_evaluations(prefix, genome_lengths, ambiguous_bases)[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- prefix (str)
Prefix for output files
- genome_lengths (list)
Lengths of genomes in database
- ambiguous_bases (list)
Counts of ambiguous bases in genomes in database
- PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix, use_gpu)[source]¶
Draw a scatter plot (png) to show the DBSCAN model fit
A scatter plot of core and accessory distances, coloured by component membership. Black is noise
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- n_clusters (int)
Number of clusters used (excluding noise)
- out_prefix (str)
Prefix for output file (.png will be appended)
- use_gpu (bool)
Whether model was fitted with GPU-enabled code
- PopPUNK.plot.plot_evaluation_histogram(input_data, n_bins=100, prefix='hist', suffix='', plt_title='histogram', xlab='x')[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- input_data (list)
Input data (list of numbers)
- n_bins (int)
Number of bins to use for the histogram
- prefix (str)
Prefix of database
- suffix (str)
Suffix specifying plot type
- plt_title (str)
Title for plot
- xlab (str)
Title for the horizontal axis
- PopPUNK.plot.plot_fit(klist, raw_matching, raw_fit, corrected_matching, corrected_fit, out_prefix, title)[source]¶
Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance
K-mer sizes on the x-axis, log(pr(match)) on the y-axis: expect a straight-line fit, with the intercept representing the accessory distance and the slope the core distance
- Args:
- klist (list)
List of k-mer sizes
- raw_matching (list)
Proportion of matching k-mers at each klist value
- raw_fit (numpy.array)
Fit to klist and raw_matching from
fitKmerCurve()
- corrected_matching (list)
Corrected proportion of matching k-mers at each klist value
- corrected_fit (numpy.array)
Fit to klist and corrected_matching from
fitKmerCurve()
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, min_move, max_move, scale, threshold, indiv_boundaries, unconstrained, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the refined model fit
A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- x_boundary (float)
Intercept of boundary with x-axis, from
RefineFit
- y_boundary (float)
Intercept of boundary with y-axis, from
RefineFit
- core_boundary (float)
Intercept of 1D (core) boundary with x-axis, from
RefineFit
- accessory_boundary (float)
Intercept of 1D (accessory) boundary with y-axis, from
RefineFit
- mean0 (numpy.array)
Centre of within-strain distribution
- mean1 (numpy.array)
Centre of between-strain distribution
- min_move (float)
Minimum s range
- max_move (float)
Maximum s range
- scale (numpy.array)
Scaling factor from
RefineFit
- threshold (bool)
If fit was just from a simple thresholding
- indiv_boundaries (bool)
Whether to draw lines for core and accessory refinement
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the BGMM model fit
A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centres: means; axes: covariances).
This is based on the example in the sklearn documentation.
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- means (numpy.array)
Component means from
BGMMFit
- covars (numpy.array)
Component covariances from
BGMMFit
- scale (numpy.array)
Scaling factor from
BGMMFit
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_scatter(X, out_prefix, title, kde=True)[source]¶
Draws a 2D scatter plot (png) of the core and accessory distances
Also draws contours of the kernel density estimate
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- kde (bool)
Whether to draw kernel density estimate contours
(default = True)
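For example, a minimal usage sketch of plot_scatter (the distances are randomly generated and the output prefix and title are hypothetical):

import numpy as np
from PopPUNK.plot import plot_scatter

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 0.1, size=(500, 2))  # n x 2 array of core and accessory distances
plot_scatter(X, "example_distances", "Example distance distribution", kde=True)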
- PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None, suffix='_Cluster')[source]¶
Print CSV file of clustering and optionally epi data
Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.
The epiCsv, if provided, should have the node labels in the first column.
- Args:
- outfile (str)
File to write the CSV to.
- nodeNames (list)
Names of sequences in clustering (includes path).
- nodeLabels (list)
Names of sequences to write in CSV (usually has path removed).
- clustering (dict or dict of dicts)
Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
- output_format (str)
Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters (default = None).
- queryNames (list)
Optional list of isolates that have been added as a query.
(default = None)
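For example, a hedged sketch of writing a microreact-formatted CSV for three isolates (file names and cluster assignments below are made up for illustration):

from PopPUNK.plot import writeClusterCsv

nodeNames = ["data/sample1.fa", "data/sample2.fa", "data/sample3.fa"]
nodeLabels = ["sample1", "sample2", "sample3"]
# Keys of the clustering dict are the nodeNames, as described above
clustering = {"data/sample1.fa": 1, "data/sample2.fa": 1, "data/sample3.fa": 2}

writeClusterCsv("example_microreact.csv", nodeNames, nodeLabels, clustering,
                output_format="microreact")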
sparse_mst.py¶
sketchlib.py¶
Sketchlib functions for database construction
- PopPUNK.sketchlib.addRandom(oPrefix, sequence_names, klist, strand_preserved=False, overwrite=False, threads=1)[source]¶
Add chance of random match to a HDF5 sketch DB
- Args:
- oPrefix (str)
Sketch database prefix
- sequence_names (list)
Names of sequences to include in calculation
- klist (list)
List of k-mer sizes to sketch
- strand_preserved (bool)
Set true to ignore rc k-mers
- overwrite (bool)
Set true to overwrite existing random match chances
- threads (int)
Number of threads to use (default = 1)
- PopPUNK.sketchlib.checkSketchlibLibrary()[source]¶
Gets the location of the sketchlib library
- Returns:
- lib (str)
Location of sketchlib .so/.dyld
- PopPUNK.sketchlib.checkSketchlibVersion()[source]¶
Checks that sketchlib can be run, and returns version
- Returns:
- version (str)
Version string
- PopPUNK.sketchlib.constructDatabase(assemblyList, klist, sketch_size, oPrefix, threads, overwrite, strand_preserved, min_count, use_exact, calc_random=True, codon_phased=False, use_gpu=False, deviceid=0)[source]¶
Sketch the input assemblies at the requested k-mer lengths
A multithread wrapper around
runSketch()
. Threads are used to run multiple sketch processes, one for each klist value. Also calculates the random match probability based on the length of the first genome in assemblyList.
- Args:
- assemblyList (str)
File with locations of assembly files to be sketched
- klist (list)
List of k-mer sizes to sketch
- sketch_size (int)
Size of sketch (
-s
option)
- oPrefix (str)
Output prefix for resulting sketch files
- threads (int)
Number of threads to use (default = 1)
- overwrite (bool)
Whether to overwrite sketch DBs, if they already exist. (default = False)
- strand_preserved (bool)
Ignore reverse complement k-mers (default = False)
- min_count (int)
Minimum count of k-mer in reads to include (default = 0)
- use_exact (bool)
Use exact count of k-mer appearance in reads (default = False)
- calc_random (bool)
Add random match chances to DB (turn off for queries)
- codon_phased (bool)
Use codon phased seeds (default = False)
- use_gpu (bool)
Use GPU for read sketching (default = False)
- deviceid (int)
GPU device id (default = 0)
- Returns:
- names (list)
List of names included in the database (from rfile)
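For example, a sketch of building a sketch database from a list of assemblies (the r-file name, k-mer range and sketch size are illustrative, not recommended defaults):

from PopPUNK.sketchlib import constructDatabase

klist = list(range(15, 31, 2))  # k-mer sizes to sketch
names = constructDatabase("rfile.txt", klist, 10000, "example_db",
                          threads=4, overwrite=False, strand_preserved=False,
                          min_count=0, use_exact=False)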
- PopPUNK.sketchlib.createDatabaseDir(outPrefix, kmers)[source]¶
Creates the directory to write sketches to, removing old files if they are unnecessary
- Args:
- outPrefix (str)
output db prefix
- kmers (list)
k-mer sizes in db
- PopPUNK.sketchlib.fitKmerCurve(pairwise, klist, jacobian)[source]¶
Fit the function \(pr = (1-a)(1-c)^k\)
Supply
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
- Args:
- pairwise (numpy.array)
Proportion of shared k-mers at k-mer values in klist
- klist (list)
k-mer sizes used
- jacobian (numpy.array)
Should be set as above (set once to try and save memory)
- Returns:
- transformed_params (numpy.array)
Column with core and accessory distance
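For example, a sketch of fitting the curve to the shared k-mer proportions for one pair (the klist and pairwise values are made up for illustration):

import numpy as np
from PopPUNK.sketchlib import fitKmerCurve

klist = np.array([15, 17, 19, 21, 23, 25, 27, 29])
pairwise = np.array([0.80, 0.72, 0.65, 0.58, 0.52, 0.47, 0.42, 0.38])

# Jacobian constructed once, exactly as specified above
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
core_acc = fitKmerCurve(pairwise, klist, jacobian)  # column with core and accessory distance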
- PopPUNK.sketchlib.getKmersFromReferenceDatabase(dbPrefix)[source]¶
Get kmers lengths from existing database
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- PopPUNK.sketchlib.getSeqsInDb(dbname)[source]¶
Return an array with the sequences in the passed database
- Args:
- dbname (str)
Sketches database filename
- Returns:
- seqs (list)
List of sequence names in sketch DB
- PopPUNK.sketchlib.getSketchSize(dbPrefix)[source]¶
Determines the sketch size, and ensures it is consistent across the whole database
sys.exit(1)
is called if DBs have different sketch sizes
- Args:
- dbprefix (str)
Prefix for databases
- Returns:
- sketchSize (int)
sketch size (64x C++ definition)
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.get_database_statistics(prefix)[source]¶
Extract statistics for evaluating databases.
- Args:
- prefix (str)
Prefix of database
- PopPUNK.sketchlib.joinDBs(db1, db2, output, update_random=None, full_names=False)[source]¶
Join two sketch databases with the low-level HDF5 copy interface
- Args:
- db1 (str)
Prefix for db1
- db2 (str)
Prefix for db2
- output (str)
Prefix for joined output
- update_random (dict)
Whether to re-calculate the random object. May contain control arguments strand_preserved and threads (see addRandom())
- full_names (bool)
If True, db1, db2 and output are the full paths to h5 files
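For example, a sketch of merging two databases (the prefixes are hypothetical; the update_random keys follow the description above):

from PopPUNK.sketchlib import joinDBs

joinDBs("batch1_db", "batch2_db", "combined_db",
        update_random={"strand_preserved": False, "threads": 4})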
- PopPUNK.sketchlib.queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self=True, number_plot_fits=0, threads=1, use_gpu=False, deviceid=0)[source]¶
Calculate core and accessory distances between query sequences and a sketched database
For a reference database, runs the query against itself to find all pairwise core and accessory distances.
Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)
To get the ref and query name for each row of the returned distances, call the iterator
iterDistRows()
with the returned refList and queryList
- Args:
- rNames (list)
Names of references to query
- qNames (list)
Names of queries
- dbPrefix (str)
Prefix for reference sketch database created by
constructDatabase()
- queryPrefix (str)
Prefix for query sketch database created by
constructDatabase()
- klist (list)
K-mer sizes to use in the calculation
- self (bool)
Set true if query = ref (default = True)
- number_plot_fits (int)
If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls
plot_fit()
(default = 0)
- threads (int)
Number of threads to use in the process (default = 1)
- use_gpu (bool)
Use a GPU for querying (default = False)
- deviceid (int)
Index of the CUDA GPU device to use (default = 0)
- Returns:
- distMat (numpy.array)
Core distances (column 0) and accessory distances (column 1) between refList and queryList
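For example, a hedged sketch of an all-vs-all self query against an existing database (the prefix and the sketch file location are assumptions about the on-disk layout):

from PopPUNK.sketchlib import getKmersFromReferenceDatabase, getSeqsInDb, queryDatabase

db_prefix = "example_db"
klist = getKmersFromReferenceDatabase(db_prefix)
rlist = getSeqsInDb(db_prefix + "/" + db_prefix + ".h5")  # assumed location of the sketch h5 file

distMat = queryDatabase(rlist, rlist, db_prefix, db_prefix, klist,
                        self=True, threads=4)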
- PopPUNK.sketchlib.readDBParams(dbPrefix)[source]¶
Get kmers lengths and sketch sizes from existing database
Calls
getKmersFromReferenceDatabase()
and getSketchSize().
Uses passed values if db missing
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- sketch_sizes (list)
List of sketch sizes used in database
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.removeFromDB(db_name, out_name, removeSeqs, full_names=False)[source]¶
Remove sketches from the DB using the low-level HDF5 copy interface
- Args:
- db_name (str)
Prefix for hdf database
- out_name (str)
Prefix for output (pruned) database
- removeSeqs (list)
Names of sequences to remove from database
- full_names (bool)
If True, db_name and out_name are the full paths to h5 files
utils.py¶
General utility functions for data read/writing/manipulation in PopPUNK
- PopPUNK.utils.check_and_set_gpu(use_gpu, gpu_lib, quit_on_fail=False)[source]¶
Check GPU libraries can be loaded and set managed memory.
- Args:
- use_gpu (bool)
Whether GPU packages have been requested
- gpu_lib (bool)
Whether GPU packages are available
- Returns:
- use_gpu (bool)
Whether GPU packages can be used
- PopPUNK.utils.decisionBoundary(intercept, gradient, adj=0.0)[source]¶
Returns the co-ordinates where the triangle the decision boundary forms meets the x- and y-axes.
- Args:
- intercept (numpy.array)
Cartesian co-ordinates of point along line (
transformLine()
) which intercepts the boundary
- gradient (float)
Gradient of the line
- adj (float)
Fraction by which to shift the intercept up the y axis
- Returns:
- x (float)
The x-axis intercept
- y (float)
The y-axis intercept
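For example, a sketch combining transformLine() and decisionBoundary(), with made-up within- and between-strain centres and gradient:

import numpy as np
from PopPUNK.utils import transformLine, decisionBoundary

mean0 = np.array([0.01, 0.05])   # hypothetical within-strain centre
mean1 = np.array([0.10, 0.50])   # hypothetical between-strain centre

x_mid, y_mid = transformLine(0.5, mean0, mean1)   # point halfway along the line
x_max, y_max = decisionBoundary(np.array([x_mid, y_mid]), gradient=-1.0)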
- PopPUNK.utils.get_match_search_depth(rlist, rank_list)[source]¶
Return a default search depth for lineage model fitting.
- Args:
- rlist (list)
List of sequences in database
- rank_list (list)
List of ranks to be used to fit lineage models
- Returns:
- max_search_depth (int)
Maximum kNN used for lineage model fitting
- PopPUNK.utils.isolateNameToLabel(names)[source]¶
Function to process isolate names to labels appropriate for visualisation.
- Args:
- names (list)
List of isolate names.
- Returns:
- labels (list)
List of isolate labels.
- PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. (default = True)
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
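For example, a minimal sketch pairing rows of a long-form distance matrix with their sample names (names and distances are illustrative):

from PopPUNK.utils import iterDistRows

rlist = ["sample1", "sample2", "sample3"]
# One row per comparison, in the same order as iterDistRows yields (ref, query) pairs
dists = [[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]]

for (ref, query), (core, acc) in zip(iterDistRows(rlist, rlist, self=True), dists):
    print(ref, query, core, acc)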
- PopPUNK.utils.joinClusterDicts(d1, d2)[source]¶
Join two dictionaries returned by
readIsolateTypeFromCsv()
with return_dict = True. Useful for concatenating ref and query assignments
- Args:
- d1 (dict of dicts)
First dictionary to concat
- d2 (dict of dicts)
Second dictionary to concat
- Returns:
- d1 (dict of dicts)
d1 with d2 appended
- PopPUNK.utils.listDistInts(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. (default = True)
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
- PopPUNK.utils.readIsolateTypeFromCsv(clustCSV, mode='clusters', return_dict=False)[source]¶
Read cluster definitions from CSV file.
- Args:
- clustCSV (str)
File name of CSV with isolate assignments
- mode (str)
Type of file to read ‘clusters’, ‘lineages’, or ‘external’
- return_dict (bool)
If True, return a dict with sample->cluster instead of sets [default = False]
- Returns:
- clusters (dict)
Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
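For example (the CSV path is hypothetical):

from PopPUNK.utils import readIsolateTypeFromCsv

# With return_dict=True the result maps sample names to cluster assignments (see Returns above)
clusters = readIsolateTypeFromCsv("example_db/example_db_clusters.csv",
                                  mode="clusters", return_dict=True)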
- PopPUNK.utils.readPickle(pklName, enforce_self=False, distances=True)[source]¶
Loads core and accessory distances saved by
storePickle()
Called during
--fit-model
- Args:
- pklName (str)
Prefix for saved files
- enforce_self (bool)
Error if self == False
[default = False]
- distances (bool)
Read the distance matrix
[default = True]
- Returns:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
- PopPUNK.utils.readRfile(rFile, oneSeq=False)[source]¶
Reads in files for sketching. Names and sequence, tab separated
- Args:
- rFile (str)
File with locations of assembly files to be sketched
- oneSeq (bool)
Return only the first sequence listed, rather than a list (used with mash)
- Returns:
- names (list)
Array of sequence names
- sequences (list of lists)
Array of sequence files
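For example, given a hypothetical tab-separated r-file:

# rfile.txt (tab separated, one name and one sequence file per line):
#   sample1 <TAB> data/sample1.fa
#   sample2 <TAB> data/sample2.fa
from PopPUNK.utils import readRfile

names, sequences = readRfile("rfile.txt")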
- PopPUNK.utils.read_rlist_from_distance_pickle(fn, allow_non_self=True, include_queries=False)[source]¶
Return the list of reference sequences from a distance pickle.
- Args:
- fn (str)
Name of distance pickle
- allow_non_self (bool)
Whether non-self distance datasets are permissible
- include_queries (bool)
Whether queries should be included in the rlist
- Returns:
- rlist (list)
List of reference sequence names
- PopPUNK.utils.set_env(**environ)[source]¶
Temporarily set the process environment variables.
>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...     "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False
- PopPUNK.utils.setupDBFuncs(args)[source]¶
Wraps common database access functions from sketchlib and mash, so that their APIs are more similar
- Args:
- args (argparse.opts)
Parsed command lines options
- qc_dict (dict)
Table of parameters for QC function
- Returns:
- dbFuncs (dict)
Functions with consistent arguments to use as the database API
- PopPUNK.utils.stderr_redirected(to='/dev/null')[source]¶
Redirect stderr within a with block, e.g.:
import os
with stderr_redirected(to=filename):
    print("from Python")
    os.system("echo non-Python applications are also supported")
- PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]¶
Saves core and accessory distances in a .npy file, names in a .pkl
Called during
--create-db
- Args:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
If None, do not save
- pklName (str)
Prefix for output files
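For example, a round-trip sketch with storePickle() and readPickle() (the prefix and distances are made up):

import numpy as np
from PopPUNK.utils import storePickle, readPickle

rlist = ["sample1", "sample2", "sample3"]
X = np.array([[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]])

storePickle(rlist, rlist, True, X, "example_dists")
rlist2, qlist2, self_flag, X2 = readPickle("example_dists", enforce_self=True)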
- PopPUNK.utils.transformLine(s, mean0, mean1)[source]¶
Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.
- Args:
- s (float)
Distance along line from mean0
- mean0 (numpy.array)
Start position of line (x0, y0)
- mean1 (numpy.array)
End position of line (x1, y1)
- Returns:
- x (float)
The Cartesian x-coordinate
- y (float)
The Cartesian y-coordinate
- PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None, threads=1)[source]¶
Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.
- Args:
- refList (list)
List of references
- distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
- queryList (list)
List of queries
- query_ref_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
- query_query_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between query sequences
- threads (int)
Number of threads to use
- Returns:
- seqLabels (list)
Combined list of reference and query sequences
- coreMat (numpy.array)
NxN array of core distances for N sequences
- accMat (numpy.array)
NxN array of accessory distances for N sequences
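For example, converting a long-form self-comparison into square matrices (distances are illustrative):

import numpy as np
from PopPUNK.utils import update_distance_matrices

refList = ["sample1", "sample2", "sample3"]
# Rows ordered as iterDistRows(refList, refList, self=True)
distMat = np.array([[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]])

seqLabels, coreMat, accMat = update_distance_matrices(refList, distMat, threads=1)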
visualise.py¶
poppunk_visualise
main function
web.py¶
Functions used by the web API to convert a sketch to an h5 database, then generate visualisations and post results to PopPUNK-web.
- PopPUNK.web.calc_prevalence(cluster, cluster_list, num_samples)[source]¶
Cluster prevalences for Plotly.js