Reference documentation

Documentation for module functions (for developers)

assign.py

poppunk_assign main function

PopPUNK.assign.assign_query(dbFuncs, ref_db, q_files, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_sketch, gpu_dist, gpu_graph, deviceid, save_partial_query_graph, use_full_network)[source]

Code for assign query mode for CLI

PopPUNK.assign.assign_query_hdf5(dbFuncs, ref_db, qNames, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_dist, gpu_graph, save_partial_query_graph, use_full_network)[source]

Code for assign query mode taking hdf5 as input. Written as a separate function so it can be called by web APIs

PopPUNK.assign.main()[source]

Main function. Parses command-line arguments and runs in the specified mode.

bgmm.py

Functions used to fit the mixture model to a database. Access using BGMMFit.

BGMM using sklearn

PopPUNK.bgmm.findBetweenLabel_bgmm(means, assignments)[source]

Identify between-strain links

Finds the component with the largest number of points assigned to it

Args:
means (numpy.array)

K x 2 array of mixture component means

assignments (numpy.array)

Sample cluster assignments

Returns:
between_label (int)

The cluster label with the most points assigned to it
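As a rough illustration of the rule described above (pick the label with the most points), here is a minimal standard-library sketch. The helper name `most_populated_label` is hypothetical and this is not PopPUNK's implementation, which works on numpy arrays.

```python
from collections import Counter


def most_populated_label(assignments):
    """Return the label assigned to the largest number of points."""
    counts = Counter(assignments)
    # most_common(1) gives a one-element list [(label, count)]
    label, _ = counts.most_common(1)[0]
    return label


# Toy assignments: label 1 occurs three times, more than any other label
assignments = [0, 1, 1, 2, 1, 0]
between_label = most_populated_label(assignments)
```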

PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]

Identify within-strain links

Finds the component whose mean is closest to the origin, and checks that some samples are assigned to it (with a Dirichlet prior, components with small weights may be left unused)

Args:
means (numpy.array)

K x 2 array of mixture component means

assignments (numpy.array)

Sample cluster assignments

rank (int)

Which label to find, ordered by distance from origin. 0-indexed. (default = 0)

Returns:
within_label (int)

The cluster label for the within-strain assignments
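The selection described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `closest_used_component` is hypothetical, and it assumes unused components are dropped before `rank` is applied, which the text does not spell out.

```python
import math


def closest_used_component(means, assignments, rank=0):
    """Pick a component by distance of its mean from the origin.

    Components with no samples assigned are skipped (assumption:
    skipping happens before the rank is applied).
    """
    used = set(assignments)
    # Component indices ordered by Euclidean distance of the mean from (0, 0)
    order = sorted(range(len(means)),
                   key=lambda k: math.hypot(means[k][0], means[k][1]))
    candidates = [k for k in order if k in used]
    return candidates[rank]


# Component 1 is nearest the origin but has no samples, so it is skipped
means = [(0.5, 0.6), (0.01, 0.02), (0.1, 0.1)]
assignments = [0, 0, 2, 2, 2]
within_label = closest_used_component(means, assignments)
```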

PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]

Main function to fit BGMM model, called from fit()

Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.

Args:
X (np.array)

n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.

dpgmm_max_K (int)

Maximum number of components to use with the EM fit. (default = 2)

Returns:
dpgmm (sklearn.mixture.BayesianGaussianMixture)

Fitted bgmm model

PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]

Modified sklearn GMM function predicting distribution membership

Returns the mixture LL for points X. Used by assign_samples() and plot_contours()

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples

weights (numpy.array)

Component weights from fit2dMultiGaussian()

means (numpy.array)

Component means from fit2dMultiGaussian()

covars (numpy.array)

Component covariances from fit2dMultiGaussian()

scale (numpy.array)

Scaling of core and accessory distances from fit2dMultiGaussian()

Returns:
logprob (numpy.array)

The log of the probabilities under the mixture model

lpr (numpy.array)

The components of the log probability from each mixture component
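For orientation, the two return values can be related by a log-sum-exp over weighted component densities. The sketch below assumes the per-component log densities are already computed and ignores the `scale` argument; the name `mixture_log_likelihood` is hypothetical and the code is not taken from PopPUNK.

```python
import math


def mixture_log_likelihood(component_log_dens, weights):
    """Combine per-component log densities into a mixture log-likelihood.

    component_log_dens: one list per sample, one entry per component.
    Returns (logprob, lpr), where lpr holds log(weight) + log density.
    """
    logprob, lpr = [], []
    for row in component_log_dens:
        weighted = [ld + math.log(w) for ld, w in zip(row, weights)]
        # log-sum-exp, shifted by the maximum for numerical stability
        m = max(weighted)
        logprob.append(m + math.log(sum(math.exp(v - m) for v in weighted)))
        lpr.append(weighted)
    return logprob, lpr


# One sample, two equally weighted components with densities 0.2 and 0.4
logprob, lpr = mixture_log_likelihood(
    [[math.log(0.2), math.log(0.4)]], [0.5, 0.5]
)
```

With equal weights the mixture density is the average of the two component densities, so `logprob[0]` equals `log(0.3)` up to rounding.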

PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]

Log likelihood of multivariate normal density distribution

Used to calculate per component Gaussian likelihood in assign_samples()

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples

means (numpy.array)

Component means from fit2dMultiGaussian()

covars (numpy.array)

Component covariances from fit2dMultiGaussian()

min_covar (float)

Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)

Returns:
log_prob (numpy.array)

An n-vector with the log-likelihoods for each sample being in this component
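Because the distances are two-dimensional, the Gaussian log density can be written out in closed form. The following is a minimal sketch for a single point and a single component, not the library's code: `log_density_2d` is a hypothetical name, and regularising the diagonal when the determinant is not positive is an assumed stand-in for the Cholesky fallback mentioned above.

```python
import math


def log_density_2d(x, mean, cov, min_covar=1e-7):
    """Log density of a 2D normal at point x, with a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        # Assumption: add min_covar to the diagonal if cov is degenerate
        a += min_covar
        d += min_covar
        det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance using the explicit 2x2 inverse
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + maha)


# At the mean of a standard normal the log density is -log(2*pi)
value = log_density_2d((0.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```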

dbscan.py

Functions used to fit DBSCAN to a database. Access using DBSCANFit.

DBSCAN using hdbscan

PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]

Evaluate whether fitted dbscan model contains non-overlapping clusters

Args:
model (DBSCANFit)

Fitted model from fit()

Returns:
indistinct (bool)

Boolean indicating whether putative within- and between-strain clusters of points overlap

PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]

Identify between-strain links from a DBSCAN model

Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.

Args:
assignments (numpy.array)

Sample cluster assignments

within_cluster (int)

Cluster ID assigned to within-strain assignments, from findWithinLabel()

Returns:
between_cluster (int)

The cluster label for the between-strain assignments
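A minimal sketch of this selection, assuming (as in HDBSCAN output) that noise points carry the label -1 and should not be chosen. The helper name `largest_other_cluster` and the `noise_label` parameter are hypothetical, not part of PopPUNK.

```python
from collections import Counter


def largest_other_cluster(assignments, within_cluster, noise_label=-1):
    """Return the most populated cluster other than within_cluster.

    Assumption: points labelled noise_label are ignored.
    """
    counts = Counter(
        a for a in assignments
        if a != within_cluster and a != noise_label
    )
    label, _ = counts.most_common(1)[0]
    return label


# Cluster 0 is the within-strain cluster; cluster 2 is the largest remaining
assignments = [0, 0, 0, 0, 1, 1, 2, 2, 2, -1, -1, -1, -1, -1]
between_cluster = largest_other_cluster(assignments, within_cluster=0)
```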

PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out, use_gpu=False)[source]

Function to fit DBSCAN model as an alternative to the Gaussian mixture model

Fits the DBSCAN model to the distances using hdbscan

Args:
X (np.array)

n x 2 array of core and accessory distances for n samples

min_samples (int)

Parameter for DBSCAN clustering ‘conservativeness’

min_cluster_size (int)

Minimum number of points in a cluster for HDBSCAN

cache_out (str)

Prefix for DBSCAN cache used for refitting

use_gpu (bool)

Whether GPU algorithms should be used in DBSCAN fitting

Returns:
hdb (hdbscan.HDBSCAN or cuml.cluster.HDBSCAN)

Fitted HDBSCAN to subsampled data

labels (list)

Cluster assignments of each sample

n_clusters (int)

Number of clusters used

mandrake.py

PopPUNK.mandrake.generate_embedding(seqLabels, accMat, perplexity, outPrefix, overwrite, kNN=50, maxIter=10000000, n_threads=1, use_gpu=False, device_id=0)[source]

Generate t-SNE projection using accessory distances

Writes a plot of t-SNE clustering of accessory distances (.dot)

Args:
seqLabels (list)

Processed names of sequences being analysed.

accMat (numpy.array)

n x n array of accessory distances for n samples.

perplexity (int)

Perplexity parameter passed to t-SNE

outPrefix (str)

Prefix for all generated output files, which will be placed in outPrefix subdirectory

overwrite (bool)

Overwrite existing output if present (default = False)

kNN (int)

Number of neighbours to use with SCE (cannot be > n_samples) (default = 50)

maxIter (int)

Number of iterations to run (default = 10000000)

n_threads (int)

Number of CPU threads to use (default = 1)

use_gpu (bool)

Whether to use GPU libraries

device_id (int)

Device ID of GPU to be used (default = 0)

Returns:
mandrake_filename (str)

Filename with .dot of embedding

models.py

Classes used for model fits

class PopPUNK.models.BGMMFit(outPrefix, max_samples=100000, max_batch_size=100000, assign_points=True)[source]

Class for fits using the Gaussian mixture model. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)

The output prefix used for reading/writing

max_samples (int)

The number of subsamples to fit the model to (default = 100000)

assign(X, max_batch_size=100000, values=False, progress=True)[source]

Assign the clustering of new samples using assign_samples()

Args:
X (numpy.array)

Core and accessory distances

values (bool)

Return the responsibilities of assignment rather than most likely cluster

max_batch_size (int)

Size of batches to be assigned

progress (bool)

Show progress bar

[default = True]

Returns:
y (numpy.array)

Cluster assignments or values by samples

fit(X, max_components)[source]

Extends fit()

Fits the BGMM and returns assignments by calling fit2dMultiGaussian().

Fitted parameters are stored in the object.

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

max_components (int)

Maximum number of mixture components to use.

Returns:
y (numpy.array)

Cluster assignments of samples in X

load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)

Fit npz opened with numpy.load()

fit_obj (sklearn.mixture.BayesianGaussianMixture)

The saved fit object

plot(X, y)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()

Args:
X (numpy.array)

Core and accessory distances

y (numpy.array)

Cluster assignments from assign()

save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

class PopPUNK.models.ClusterFit(outPrefix, default_dtype=numpy.float32)[source]

Parent class for all models used to cluster distances

Args:
outPrefix (str)

The output prefix used for reading/writing

copy(prefix)[source]

Copy the model to a new directory

fit(X=None)[source]

Initial steps for all fit functions.

Creates output directory. If preprocess is set then subsamples passed X

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

(default = None)

default_dtype (numpy dtype)

Type to use if no X provided

no_scale()[source]

Turn off scaling (useful for refine, where optimization is done in the scaled space).

plot(X=None)[source]

Initial steps for all plot functions.

Ensures model has been fitted.

Args:
X (numpy.array)

The core and accessory distances to subsample.

(default = None)

class PopPUNK.models.DBSCANFit(outPrefix, use_gpu=False, max_batch_size=5000, max_samples=100000, assign_points=True)[source]

Class for fits using HDBSCAN. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)

The output prefix used for reading/writing

max_samples (int)

The number of subsamples to fit the model to (default = 100000)

assign(X, no_scale=False, progress=True, max_batch_size=5000, use_gpu=False)[source]

Assign the clustering of new samples using assign_samples_dbscan()

Args:
X (numpy.array or cupy.array)

Core and accessory distances

no_scale (bool)

Do not scale X [default = False]

progress (bool)

Show progress bar [default = True]

max_batch_size (int)

Batch size used for assignments [default = 5000]

use_gpu (bool)

Use GPU-enabled algorithms for clustering [default = False]

Returns:
y (numpy.array)

Cluster assignments by samples

fit(X, max_num_clusters, min_cluster_prop, use_gpu=False)[source]

Extends fit()

Fits the distances with HDBSCAN and returns assignments by calling fitDbScan().

Fitted parameters are stored in the object.

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

max_num_clusters (int)

Maximum number of clusters in DBSCAN fitting

min_cluster_prop (float)

Minimum proportion of points in a cluster in DBSCAN fitting

use_gpu (bool)

Whether GPU algorithms should be used in DBSCAN fitting

Returns:
y (numpy.array)

Cluster assignments of samples in X

load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)

Fit npz opened with numpy.load()

fit_obj (hdbscan.HDBSCAN)

The saved fit object

plot(X=None, y=None)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_dbscan_results()

Args:
X (numpy.array)

Core and accessory distances

y (numpy.array)

Cluster assignments from assign()

save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

class PopPUNK.models.LineageFit(outPrefix, ranks, max_search_depth, reciprocal_only, count_unique_distances, dist_col=None, use_gpu=False)[source]

Class for fits using the lineage assignment model. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)

The output prefix used for reading/writing

ranks (list)

The ranks used in the fit

assign(rank)[source]

Get the edges for the network. A little different from other methods, as it doesn’t go through the long form distance vector (the coo_matrix is essentially already in the format graph-tool expects)

Args:
rank (int)

Rank to assign at

Returns:
y (list of tuples)

Edges to include in network

edge_weights(rank)[source]

Get the distances for each edge returned by assign

Args:
rank (int)

Rank assigned at

Returns:
weights (list)

Distance for each assignment

extend(qqDists, qrDists)[source]

Update the sparse distance matrix of nearest neighbours after querying

Args:
qqDists (numpy or cupy ndarray)

Two column array of query-query distances

qrDists (numpy or cupy ndarray)

Two column array of reference-query distances

Returns:
y (list of tuples)

Edges to include in network

fit(X, accessory)[source]

Extends fit()

Gets assignments by using nearest neighbours.

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

accessory (bool)

Use accessory rather than core distances

Returns:
y (numpy.array)

Cluster assignments of samples in X

load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)

Fit npz opened with numpy.load()

fit_obj (sklearn.mixture.BayesianGaussianMixture)

The saved fit object

plot(X, y=None)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()

Args:
X (numpy.array)

Core and accessory distances

y (any)

Unused variable for compatibility with other plotting functions

save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

class PopPUNK.models.NumpyShared(name, shape, dtype)
dtype

Alias for field number 2

name

Alias for field number 0

shape

Alias for field number 1

class PopPUNK.models.RefineFit(outPrefix)[source]

Class for fits using a triangular boundary and network properties. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)

The output prefix used for reading/writing

apply_threshold(X, threshold)[source]

Applies a boundary threshold, given by user. Does not run optimisation.

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

threshold (float)

The value along the x-axis (core distance) at which to draw the assignment boundary

Returns:
y (numpy.array)

Cluster assignments of samples in X
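To make the idea of a fixed boundary on the core-distance axis concrete, here is a deliberately simplified sketch. It only flags which pairs fall below the threshold; the label values PopPUNK actually returns are not reproduced, and the helper name `below_core_threshold` is hypothetical.

```python
def below_core_threshold(X, threshold):
    """Flag each (core, accessory) pair whose core distance is below threshold."""
    return [core < threshold for core, _accessory in X]


# Three toy (core, accessory) distance pairs
X = [(0.001, 0.05), (0.02, 0.30), (0.004, 0.10)]
within = below_core_threshold(X, threshold=0.01)
```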

assign(X, slope=None)[source]

Assign the clustering of new samples

Args:
X (numpy.array)

Core and accessory distances

slope (int)

Override self.slope (default: use self.slope). Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope

Returns:
y (numpy.array)

Cluster assignments by samples

fit(X, sample_names, model, max_move, min_move, startFile=None, indiv_refine=False, unconstrained=False, multi_boundary=0, score_idx=0, no_local=False, betweenness_sample=100, sample_size=None, use_gpu=False)[source]

Extends fit()

Fits the distances by optimising network score, by calling refineFit2D().

Fitted parameters are stored in the object.

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

sample_names (list)

Sample names in X (accessed by iterDistRows())

model (ClusterFit)

The model fit to refine

max_move (float)

Maximum distance to move away from start point

min_move (float)

Minimum distance to move away from start point

startFile (str)

A file defining an initial fit, rather than one from --fit-model. See documentation for format. (default = None).

indiv_refine (str)

Run refinement for core or accessory distances separately (default = False).

multi_boundary (int)

Produce cluster output at multiple boundary positions downward from the optimum. (default = 0).

unconstrained (bool)

If True, search in 2D and change the slope of the boundary

score_idx (int)

Index of score from networkSummary() to use [default = 0]

no_local (bool)

Turn off the local optimisation step. Quicker, but may be less well refined.

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analyses

Returns:
y (numpy.array)

Cluster assignments of samples in X

load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)

Fit npz opened with numpy.load()

fit_obj (None)

The saved fit object (not used)

plot(X, y=None)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_refined_results()

Args:
X (numpy.array)

Core and accessory distances

y (numpy.array)

Assignments (unused)

save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

PopPUNK.models.assign_samples(chunk, X, y, model, scale, chunk_size, values=False)[source]

Runs a model's assignment on a chunk of input

Args:
chunk (int)

Index of chunk to process

X (NumpyShared)

n x 2 array of core and accessory distances for n samples

y (NumpyShared)

An n-vector to store results, with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.

weights (numpy.array)

Component weights from BGMMFit

means (numpy.array)

Component means from BGMMFit

covars (numpy.array)

Component covariances from BGMMFit

scale (numpy.array)

Scaling of core and accessory distances from BGMMFit

chunk_size (int)

Size of each chunk in X

values (bool)

Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation).

Default is False
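The chunk, chunk_size and values arguments suggest a pattern like the one below: each call handles one slice of the samples and writes into a shared result. This is a sketch under assumptions, using plain lists rather than shared numpy arrays and taking precomputed weighted log densities as input; `assign_chunk` and its parameters are hypothetical names.

```python
import math


def assign_chunk(chunk, log_rows, y, chunk_size, values=False):
    """Fill y for one chunk of samples.

    log_rows: per-sample lists of weighted log densities, one per component.
    y: pre-allocated list that receives the results in place.
    """
    start = chunk * chunk_size
    end = min(start + chunk_size, len(log_rows))  # the last chunk may be short
    for i in range(start, end):
        row = log_rows[i]
        if values:
            # Normalise to responsibilities that sum to one
            m = max(row)
            norm = m + math.log(sum(math.exp(v - m) for v in row))
            y[i] = [math.exp(v - norm) for v in row]
        else:
            # Index of the most likely component
            y[i] = max(range(len(row)), key=row.__getitem__)


log_rows = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.3), math.log(0.7)],
]
y = [None] * len(log_rows)
for chunk in range(2):  # two chunks of size 2 cover three samples
    assign_chunk(chunk, log_rows, y, chunk_size=2)
```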

PopPUNK.models.loadClusterFit(pkl_file, npz_file, outPrefix='', max_samples=100000, use_gpu=False)[source]

Call this to load a fitted model

Args:
pkl_file (str)

Location of saved .pkl file on disk

npz_file (str)

Location of saved .npz file on disk

outPrefix (str)

Output prefix for model to save to (e.g. plots)

max_samples (int)

Maximum samples if subsampling X [default = 100000]

use_gpu (bool)

Whether to load npz file with GPU libraries for lineage models

Returns:
load_obj (model)

Loaded model

network.py

Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.

Network functions

PopPUNK.network.addQueryToNetwork(dbFuncs, rList, qList, G, assignments, model, queryDB, kmers=None, distance_type='euclidean', queryQuery=False, strand_preserved=False, weights=None, threads=1, use_gpu=False)[source]

Finds edges between queries and items in the reference database, and modifies the network to include them.

Args:
dbFuncs (list)

List of backend functions from setupDBFuncs()

rList (list)

List of reference names

qList (list)

List of query names

G (graph)

Network to add to (mutated)

assignments (numpy.array)

Cluster assignment of items in qList

model (ClusterFit)

Model fitted to reference database

queryDB (str)

Query database location

distances (str)

Prefix of distance files for extending network

kmers (list)

List of k-mer sizes

distance_type (str)

Distance type to use as weights in network

queryQuery (bool)

Add in all query-query distances (default = False)

strand_preserved (bool)

Whether to treat strand as known (i.e. ignore rc k-mers) when adding random distances. Only used if queryQuery = True [default = False]

weights (numpy.array)

If passed, the core,accessory distances for each assignment, which will be annotated as an edge attribute

threads (int)

Number of threads to use if new db created (default = 1)

use_gpu (bool)

Whether to use cugraph for analysis (default = False)

Returns:
distMat (numpy.array)

Query-query distances

PopPUNK.network.checkNetworkVertexCount(seq_list, G, use_gpu)[source]

Checks the number of network vertices matches the number of sequence names.

Args:
seq_list (list)

The list of sequence names

G (graph)

The network of sequences

use_gpu (bool)

Whether to use cugraph for graph analyses

PopPUNK.network.cliquePrune(component, graph, reference_indices, components_list)[source]

Wrapper function around getCliqueRefs() so it can be called by a multiprocessing pool

PopPUNK.network.construct_dense_weighted_network(rlist, distMat, weights_type=None, use_gpu=False)[source]

Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges connect samples within the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)

List of reference sequence labels

distMat (2 column ndarray)

Numpy array of pairwise distances

weights_type (str)

Type of weight to use for network

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network

PopPUNK.network.construct_network_from_assignments(rlist, qlist, assignments, within_label=1, int_offset=0, weights=None, distMat=None, weights_type=None, previous_network=None, old_ids=None, adding_qq_dists=False, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]

Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges connect samples within the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)

List of reference sequence labels

qlist (list)

List of query sequence labels

assignments (numpy.array or int)

Labels of most likely cluster assignment

within_label (int)

The label for the cluster representing within-strain distances

int_offset (int)

Constant integer to add to each node index

weights (list)

List of weights for each edge in the network

distMat (2 column ndarray)

Numpy array of pairwise distances

weights_type (str)

Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance

previous_network (str)

Name of file containing a previous network to be integrated into this new network

old_ids (list)

Ordered list of vertex names in previous network

adding_qq_dists (bool)

Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered

previous_pkl (str)

Name of file containing the names of the sequences in the previous_network

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

summarise (bool)

Whether to calculate and print network summaries with networkSummary() (default = True)

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network
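The core idea, keeping a pair as an edge only when its distance was assigned to the within-strain cluster, can be sketched as below. This assumes a reference-only comparison whose assignments follow the upper-triangle pair order produced by `itertools.combinations`; that ordering and the helper name `edges_from_assignments` are assumptions for illustration, not PopPUNK's code.

```python
from itertools import combinations


def edges_from_assignments(rlist, assignments, within_label=1):
    """Return index pairs whose assignment equals within_label."""
    # Pairs in order (0, 1), (0, 2), ..., (1, 2), ... (assumed ordering)
    pairs = combinations(range(len(rlist)), 2)
    return [pair for pair, label in zip(pairs, assignments)
            if label == within_label]


rlist = ["sample_a", "sample_b", "sample_c"]
assignments = [1, 0, 1]  # one label per pair: a-b, a-c, b-c
edges = edges_from_assignments(rlist, assignments)
```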

PopPUNK.network.construct_network_from_df(rlist, qlist, G_df, weights=False, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]

Construct an undirected network using a data frame of edges. Nodes are samples, and edges connect samples within the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)

List of reference sequence labels

qlist (list)

List of query sequence labels

G_df (cudf or pandas data frame)

Data frame in which the first two columns are the nodes linked by edges

weights (bool)

Whether weights in the G_df data frame should be included in the network

distMat (2 column ndarray)

Numpy array of pairwise distances

previous_network (str or graph object)

Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object

adding_qq_dists (bool)

Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered

old_ids (list)

Ordered list of vertex names in previous network

previous_pkl (str)

Name of file containing the names of the sequences in the previous_network

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

summarise (bool)

Whether to calculate and print network summaries with networkSummary() (default = True)

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network

PopPUNK.network.construct_network_from_edge_list(rlist, qlist, edge_list, weights=None, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]

Construct an undirected network using a list of edges as tuples. Nodes are samples, and edges connect samples within the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)

List of reference sequence labels

qlist (list)

List of query sequence labels

edge_list (list of tuples)

List of tuples describing the edges of the graph

weights (list)

List of edge weights

distMat (2 column ndarray)

Numpy array of pairwise distances

previous_network (str or graph object)

Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object

adding_qq_dists (bool)

Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered

old_ids (list)

Ordered list of vertex names in previous network

previous_pkl (str)

Name of file containing the names of the sequences in the previous_network

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

summarise (bool)

Whether to calculate and print network summaries with networkSummary() (default = True)

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network

PopPUNK.network.construct_network_from_sparse_matrix(rlist, qlist, sparse_input, weights=None, previous_network=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]

Construct an undirected network using a sparse matrix. Nodes are samples, and edges connect samples within the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)

List of reference sequence labels

qlist (list)

List of query sequence labels

sparse_input (numpy.array)

Sparse distance matrix from lineage fit

weights (list)

List of weights for each edge in the network

distMat (2 column ndarray)

Numpy array of pairwise distances

previous_network (str)

Name of file containing a previous network to be integrated into this new network

previous_pkl (str)

Name of file containing the names of the sequences in the previous_network

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

summarise (bool)

Whether to calculate and print network summaries with networkSummary() (default = True)

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network

PopPUNK.network.cugraph_to_graph_tool(G, rlist)[source]

Convert a cugraph network to a graph-tool network

Args:
G (cugraph network)

Cugraph network

rlist (list)

List of sequence names

Returns:
G (graph-tool network)

Graph tool network

PopPUNK.network.extractReferences(G, dbOrder, outPrefix, outSuffix='', type_isolate=None, existingRefs=None, threads=1, use_gpu=False)[source]

Extract references for each cluster based on cliques

Writes chosen references to file by calling writeReferences()

Args:
G (graph)

A network used to define clusters

dbOrder (list)

The order of files in the sketches, so returned references are in the same order

outPrefix (str)

Prefix for output file

outSuffix (str)

Suffix for output file (.refs will be appended)

type_isolate (str)

Isolate to be included in set of references

existingRefs (list)

References that should be used for each clique

use_gpu (bool)

Use cugraph for graph analysis (default = False)

Returns:
refFileName (str)

The name of the file references were written to

references (list)

An updated list of the reference names

PopPUNK.network.fetchNetwork(network_dir, model, refList, ref_graph=False, core_only=False, accessory_only=False, use_gpu=False)[source]

Load the network based on input options

Returns the network as a graph-tool format graph, and sets the slope parameter of the passed model object.

Args:
network_dir (str)

Directory containing the network used to define clusters

model (ClusterFit)

A fitted model object

refList (list)

Names of references that should be in the network

ref_graph (bool)

Use ref only graph, if available [default = False]

core_only (bool)

Return the network created using only core distances [default = False]

accessory_only (bool)

Return the network created using only accessory distances [default = False]

use_gpu (bool)

Use cugraph library to load graph

Returns:
genomeNetwork (graph)

The loaded network

cluster_file (str)

The CSV of cluster assignments corresponding to this network

PopPUNK.network.generate_cugraph(G_df, max_index, weights=False, renumber=True)[source]

Builds a cugraph graph, ensuring all nodes are included in the graph, even if they are singletons.

Args:
G_df (cudf)

cudf data frame containing edge list

max_index (int)

The 0-indexed maximum of the node indices

renumber (bool)

Whether to renumber the vertices when added to the graph

Returns:
G_new (graph)

The resulting cugraph graph, including singleton nodes
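Why max_index matters can be shown without cugraph: an edge list alone never mentions singleton nodes, so the node set has to be created from the index range first. The sketch below uses a plain adjacency dictionary as a stand-in for a graph object; `adjacency_with_singletons` is a hypothetical helper.

```python
def adjacency_with_singletons(edges, max_index):
    """Build an adjacency map that contains every node 0..max_index."""
    # Create all nodes up front so unconnected ones are not lost
    adj = {i: set() for i in range(max_index + 1)}
    for source, target in edges:
        adj[source].add(target)
        adj[target].add(source)
    return adj


# Nodes 2 and 3 have no edges but must still be present
adj = adjacency_with_singletons([(0, 1)], max_index=3)
```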

PopPUNK.network.generate_minimum_spanning_tree(G, from_cugraph=False)[source]

Generate a minimum spanning tree from a network

Args:
G (network)

Graph tool network

from_cugraph (bool)

Whether G is a pre-calculated MST from cugraph [default = False]

Returns:
mst_network (graph)

Minimum spanning tree (as graph-tool graph)
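For readers unfamiliar with minimum spanning trees, the sketch below shows Kruskal's algorithm on a weighted edge list. It illustrates the concept only: PopPUNK works on graph-tool or cugraph objects, and the function name `kruskal_mst` and the edge tuple layout are assumptions made for this example.

```python
def kruskal_mst(n_nodes, weighted_edges):
    """Return MST edges as (source, target, weight) using Kruskal's algorithm.

    weighted_edges: iterable of (weight, source, target) tuples.
    For a disconnected graph the result is a spanning forest.
    """
    parent = list(range(n_nodes))

    def find(v):
        # Union-find root lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, source, target in sorted(weighted_edges):
        root_s, root_t = find(source), find(target)
        if root_s != root_t:  # the edge joins two separate components
            parent[root_s] = root_t
            tree.append((source, target, weight))
    return tree


edges = [(0.1, 0, 1), (0.5, 1, 2), (0.2, 0, 2), (0.9, 2, 3)]
mst = kruskal_mst(4, edges)
```

The edge between nodes 1 and 2 is dropped because both nodes are already connected through node 0 by lighter edges.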

PopPUNK.network.getCliqueRefs(G, reference_indices={})[source]

Recursively prune a network of its cliques. Returns one vertex from a clique at each stage

Args:
G (graph)

The graph to get clique representatives from

reference_indices (set)

The unique list of vertices being kept, to add to
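The recursive pruning can be pictured with the sketch below, which works on a plain adjacency dictionary. It is an approximation under assumptions: it grows one clique greedily from the lowest-numbered vertex rather than enumerating cliques with a graph library, and the helper names are hypothetical.

```python
def greedy_clique(adj, start):
    """Grow a clique from start by adding vertices adjacent to all members."""
    clique = [start]
    for v in sorted(adj[start]):
        if all(v in adj[u] for u in clique):
            clique.append(v)
    return clique


def clique_representatives(adj, refs=None):
    """Recursively keep one vertex per clique, then remove that clique."""
    refs = set() if refs is None else refs
    if not adj:
        return refs
    start = min(adj)
    removed = set(greedy_clique(adj, start))
    refs.add(start)  # one representative for this clique
    remaining = {v: nbrs - removed
                 for v, nbrs in adj.items() if v not in removed}
    return clique_representatives(remaining, refs)


# Triangle 0-1-2, a pendant vertex 3 attached to 2, and a singleton 4
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
refs = clique_representatives(adj)
```

The sketch defaults `refs` to `None` and creates a fresh set on each top-level call, which avoids sharing one mutable default object between calls.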

PopPUNK.network.get_vertex_list(G, use_gpu=False)[source]

Generate a list of node indices

Args:
G (network)

Graph tool network

use_gpu (bool)

Whether graph is a cugraph or not [default = False]

Returns:
vlist (list)

List of integers corresponding to nodes

PopPUNK.network.load_network_file(fn, use_gpu=False)[source]

Load a network from a file

Returns the network as a graph-tool format graph, or loads it with the cugraph library if use_gpu is set.

Args:
fn (str)

Network file name

use_gpu (bool)

Use cugraph library to load graph

Returns:
genomeNetwork (graph)

The loaded network

PopPUNK.network.networkSummary(G, calc_betweenness=True, betweenness_sample=100, subsample=None, use_gpu=False)[source]

Provides summary values about the network

Args:
G (graph)

The network of strains

calc_betweenness (bool)

Whether to calculate betweenness stats

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

subsample (int)

Number of vertices to randomly subsample from graph

use_gpu (bool)

Whether to use cugraph for graph analysis

Returns:
metrics (list)

List with # components, density, transitivity, mean betweenness and weighted mean betweenness

scores (list)

List of scores
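
The metrics listed above can be sketched on a plain edge list, which may help when interpreting the returned values. This is an illustrative pure-Python version, not the graph-tool or cugraph code PopPUNK uses; the function name summarise_network is hypothetical, betweenness is omitted, and the score shown (transitivity * (1 - density)) is an assumed form of the network score.

```python
from itertools import combinations


def summarise_network(n_nodes, edges):
    """Return (components, density, transitivity, score) for a simple graph."""
    neighbours = {v: set() for v in range(n_nodes)}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    # Count connected components with an iterative depth-first search
    seen, components = set(), 0
    for start in range(n_nodes):
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(neighbours[v] - seen)

    n_edges = sum(len(nb) for nb in neighbours.values()) // 2
    density = 2 * n_edges / (n_nodes * (n_nodes - 1)) if n_nodes > 1 else 0.0

    # Transitivity = closed triples / all connected triples
    triples = closed = 0
    for nb in neighbours.values():
        for a, b in combinations(nb, 2):
            triples += 1
            if b in neighbours[a]:
                closed += 1
    transitivity = closed / triples if triples else 0.0

    score = transitivity * (1 - density)  # assumed form of the score
    return components, density, transitivity, score


# Hypothetical example: a triangle (0, 1, 2) with one extra node attached
metrics = summarise_network(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```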

PopPUNK.network.network_to_edges(prev_G_fn, rlist, adding_qq_dists=False, old_ids=None, previous_pkl=None, weights=False, use_gpu=False)[source]

Load previous network, extract the edges to match the vertex order specified in rlist, and also return weights if specified.

Args:
prev_G_fn (str or graph object)

Path of file containing existing network, or already-loaded graph object

adding_qq_dists (bool)

Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered

rlist (list)

List of reference sequence labels in new network

old_ids (list)

List of IDs of vertices in existing network

previous_pkl (str)

Path of pkl file containing names of sequences in previous network

weights (bool)

Whether to return edge weights (default = False)

use_gpu (bool)

Whether to use cugraph for graph analyses

Returns:
source_ids (list)

Source nodes for each edge

target_ids (list)

Target nodes for each edge

edge_weights (list)

Weights for each new edge
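
The reordering step described above, matching edges from a previous network to the vertex order in rlist, can be sketched as a name lookup. The helper remap_edges and the plain-list inputs are assumptions for illustration; the real function reads the edges from a saved network and can also return weights.

```python
def remap_edges(old_edges, old_ids, rlist):
    """Translate edges indexed by old_ids into indices into rlist."""
    new_index = {name: i for i, name in enumerate(rlist)}
    sources, targets = [], []
    for old_source, old_target in old_edges:
        # Look up each endpoint by name, then by its position in rlist
        sources.append(new_index[old_ids[old_source]])
        targets.append(new_index[old_ids[old_target]])
    return sources, targets


# Hypothetical example: the new network reorders samples and adds s4
old_ids = ["s1", "s2", "s3"]
rlist = ["s3", "s1", "s2", "s4"]
source_ids, target_ids = remap_edges([(0, 1), (1, 2)], old_ids, rlist)
```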

PopPUNK.network.printClusters(G, rlist, outPrefix=None, oldClusterFile=None, externalClusterCSV=None, printRef=True, printCSV=True, clustering_type='combined', write_unwords=True, use_gpu=False)[source]

Get cluster assignments

Also writes assignments to a CSV file

Args:
G (graph)

Network used to define clusters

rlist (list)

Names of samples

outPrefix (str)

Prefix for output CSV (default = None)

oldClusterFile (str)

CSV with previous cluster assignments. Pass to ensure consistency in cluster assignment name (default = None)

externalClusterCSV (str)

CSV with cluster assignments from any source. Will print a file relating these to new cluster assignments (default = None)

printRef (bool)

If false, print only query sequences in the output (default = True)

printCSV (bool)

Print results to file (default = True)

clustering_type (str)

Type of clustering network, used for comparison with old clusters (default = ‘combined’)

write_unwords (bool)

Write clusters with a pronounceable name rather than numerical index (default = True)

use_gpu (bool)

Whether to use cugraph for network analysis

Returns:
clustering (dict)

Dictionary of cluster assignments (keys are sequence names)

PopPUNK.network.printExternalClusters(newClusters, extClusterFile, outPrefix, oldNames, printRef=True)[source]

Prints cluster assignments with respect to previously defined clusters or labels.

Args:
newClusters (set iterable)

The components from the graph G, defining the PopPUNK clusters

extClusterFile (str)

A CSV file containing definitions of the external clusters for each sample (does not need to contain all samples)

outPrefix (str)

Prefix for output CSV (_external_clusters.csv)

oldNames (list)

A list of the reference sequences

printRef (bool)

If false, print only query sequences in the output

Default = True

PopPUNK.network.print_network_summary(G, sample_size=None, betweenness_sample=100, use_gpu=False)[source]

Wrapper function for printing network information

Args:
G (graph)

The network to summarise

sample_size (int)

Number of nodes to subsample for graph statistic calculation

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

use_gpu (bool)

Whether to use GPUs for network construction

PopPUNK.network.process_previous_network(previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, vertex_labels=None, weights=False, use_gpu=False)[source]

Extract edge types from an existing network

Args:
previous_network (str or graph object)

Name of file containing a previous network to be integrated into this new network, or already-loaded graph object

adding_qq_dists (bool)

Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered

old_ids (list)

Ordered list of vertex names in previous network

previous_pkl (str)

Name of file containing the names of the sequences in the previous_network ordered based on the original network construction

vertex_labels (list)

Ordered list of sequence labels

weights (bool)

Whether weights should be extracted from the previous network

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
extra_sources (list)

List of source node identifiers

extra_targets (list)

List of destination node identifiers

extra_weights (list or None)

List of edge weights

PopPUNK.network.process_weights(distMat, weights_type)[source]

Calculate edge weights from the distance matrix

Args:
distMat (2 column ndarray)

Numpy array of pairwise distances

weights_type (str)

Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance

Returns:
processed_weights (list)

Edge weights
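
The three weighting options can be illustrated on plain (core, accessory) pairs. This is a minimal sketch and not the library code: the function name edge_weights is hypothetical, and it assumes the first column holds core distances and the second accessory distances.

```python
import math


def edge_weights(dist_pairs, weights_type):
    """Derive one weight per (core, accessory) distance pair."""
    if weights_type == "core":
        return [core for core, _ in dist_pairs]
    if weights_type == "accessory":
        return [acc for _, acc in dist_pairs]
    if weights_type == "euclidean":
        # Straight-line distance from the origin in core/accessory space
        return [math.hypot(core, acc) for core, acc in dist_pairs]
    raise ValueError(f"Unknown weights_type: {weights_type}")


# Hypothetical distances for two pairs of samples
dists = [(0.03, 0.04), (0.0, 0.2)]
core_weights = edge_weights(dists, "core")
euclidean_weights = edge_weights(dists, "euclidean")
```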

PopPUNK.network.prune_graph(prefix, reflist, samples_to_keep, output_db_name, threads, use_gpu)[source]

Keep only the specified sequences in a graph

Args:
prefix (str)

Name of directory containing network

reflist (list)

Ordered list of sequences of database

samples_to_keep (list)

The names of samples to be retained in the graph

output_db_name (str)

Name of output directory

threads (int)

Number of CPU threads to use when recalculating random match chances [default = 1].

use_gpu (bool)

Whether graph is a cugraph or not [default = False]

PopPUNK.network.remove_nodes_from_graph(G, reflist, samples_to_keep, use_gpu)[source]

Return a modified graph containing only the requested nodes

Args:
reflist (list)

Ordered list of sequences of database

samples_to_keep (list)

The names of samples to be retained in the graph

use_gpu (bool)

Whether graph is a cugraph or not [default = False]

Returns:
G_new (graph)

Pruned graph

PopPUNK.network.remove_non_query_components(G, rlist, qlist, use_gpu=False)[source]

Removes all components that do not contain a query sequence.

Args:
G (graph)

Network of queries linked to reference sequences

rlist (list)

List of reference sequence labels

qlist (list)

List of query sequence labels

use_gpu (bool)

Whether to use GPUs for network construction

Returns:
G (graph)

The resulting network

pruned_names (list)

The labels of the sequences in the pruned network
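
A minimal sketch of the pruning idea, assuming vertices are numbered with references first and queries after them (an assumption of this example, as is the helper name keep_query_components): flood-fill outwards from each query and keep only what is reached.

```python
def keep_query_components(edges, rlist, qlist):
    """Return names of samples in components that contain a query."""
    names = rlist + qlist  # vertex i is names[i]; queries follow references
    neighbours = {i: set() for i in range(len(names))}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    keep = set()
    # Flood-fill outwards from every query vertex
    stack = list(range(len(rlist), len(names)))
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(neighbours[v] - keep)
    return [names[i] for i in sorted(keep)]


# Hypothetical example: r1-r2 form a component without any query
pruned_names = keep_query_components(
    [(0, 1), (2, 3)], ["r1", "r2", "r3"], ["q1"]
)
```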

PopPUNK.network.save_network(G, prefix=None, suffix=None, use_graphml=False, use_gpu=False)[source]

Save a network to disk

Args:
G (network)

Graph tool network

prefix (str)

Prefix for output file

use_graphml (bool)

Whether to output a graph-tool file in graphml format

use_gpu (bool)

Whether graph is a cugraph or not [default = False]

PopPUNK.network.sparse_mat_to_network(sparse_mat, rlist, use_gpu=False)[source]

Generate a network from a lineage rank fit

Args:
sparse_mat (scipy or cupyx sparse matrix)

Sparse matrix of kNN from lineage fit

rlist (list)

List of sequence names

use_gpu (bool)

Whether GPU libraries should be used

Returns:
G (network)

Graph tool or cugraph network

PopPUNK.network.translate_network_indices(G_ref_df, reference_indices)[source]

Function for ensuring an updated reference network retains numbering consistent with sample names

Args:
G_ref_df (cudf data frame)

List of edges in reference network

reference_indices (list)

The ordered list of reference indices in the original network

Returns:
G_ref (cugraph network)

Network of reference sequences

PopPUNK.network.vertex_betweenness(graph, norm=True)[source]

Returns betweenness for nodes in the graph

PopPUNK.network.writeReferences(refList, outPrefix, outSuffix='')[source]

Writes chosen references to file

Args:
refList (list)

Reference names to write

outPrefix (str)

Prefix for output file

outSuffix (str)

Suffix for output file (.refs will be appended)

Returns:
refFileName (str)

The name of the file references were written to

refine.py

Functions used to refine an existing model. Access using RefineFit.

Refine mixture model using network properties

class PopPUNK.refine.NumpyShared(name, shape, dtype)
dtype

Alias for field number 2

name

Alias for field number 0

shape

Alias for field number 1

PopPUNK.refine.check_search_range(scale, mean0, mean1, lower_s, upper_s)[source]

Checks a search range is within a valid range

Args:
scale (np.array)

Rescaling factor to [0, 1] for each axis

mean0 (np.array)

(x, y) of starting point defining line

mean1 (np.array)

(x, y) of end point defining line

lower_s (float)

distance along line to start search

upper_s (float)

distance along line to end search

Returns:
min_x, max_x

minimum and maximum x-intercepts of the search range

min_y, max_y

minimum and maximum y-intercepts of the search range

PopPUNK.refine.expand_cugraph_network(G, G_extra_df)[source]

Reconstruct a cugraph network with additional edges.

Args:
G (cugraph network)

Original cugraph network

G_extra_df (cudf dataframe)

Data frame of edges to add

Returns:
G (cugraph network)

Expanded cugraph network

PopPUNK.refine.growNetwork(sample_names, i_vec, j_vec, idx_vec, s_range, score_idx=0, thread_idx=0, betweenness_sample=100, write_clusters=None, sample_size=None, use_gpu=False)[source]

Construct a network, then add edges to it iteratively. Input is from pp_sketchlib.iterateBoundary1D or pp_sketchlib.iterateBoundary2D

Args:
sample_names (list)

Sample names corresponding to distMat (accessed by iterator)

i_vec (list)

Ordered ref vertex index to add

j_vec (list)

Ordered query (==ref) vertex index to add

idx_vec (list)

For each i, j tuple, the index of the intercept at which these enter the network. These are sorted and increasing

s_range (list)

Offsets which correspond to idx_vec entries

score_idx (int)

Index of score from networkSummary() to use [default = 0]

thread_idx (int)

Optional thread idx (if multithreaded) to offset progress bar by

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

write_clusters (str)

Set to a prefix to write the clusters from each position to files [default = None]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analyses

Returns:
scores (list)

-1 * network score for each offset in s_range, where the network score is from networkSummary()

PopPUNK.refine.likelihoodBoundary(s, model, start, end, within, between)[source]

Wrapper function around fit2dMultiGaussian() so that it can go into a root-finding function for probabilities between components

Args:
s (float)

Distance along line from start

model (BGMMFit)

Fitted mixture model

start (numpy.array)

The co-ordinates of the centre of the within-strain distribution

end (numpy.array)

The co-ordinates of the centre of the between-strain distribution

within (int)

Label of the within-strain distribution

between (int)

Label of the between-strain distribution

Returns:
responsibility (float)

The difference between responsibilities of assignment to the within component and the between assignment
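
To show why a responsibility difference suits a root finder, here is a simplified one-dimensional sketch: two Gaussian components along the search line, with bisection locating the point where neither is favoured. All names and parameter values are assumptions for illustration; the real function evaluates the fitted 2D mixture model.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def responsibility_difference(s, within, between):
    """Responsibility of the within component minus that of the between
    component at position s; each component is (weight, mean, sd)."""
    p_within = within[0] * gaussian_pdf(s, within[1], within[2])
    p_between = between[0] * gaussian_pdf(s, between[1], between[2])
    return (p_within - p_between) / (p_within + p_between)


def bisect_root(func, low, high, tol=1e-10):
    """Find s in [low, high] where func changes sign."""
    f_low = func(low)
    while high - low > tol:
        mid = 0.5 * (low + high)
        f_mid = func(mid)
        if (f_mid > 0) == (f_low > 0):
            low, f_low = mid, f_mid  # root lies in the upper half
        else:
            high = mid               # root lies in the lower half
    return 0.5 * (low + high)


# Hypothetical components: equal weight and spread, means at 0 and 1
within, between = (0.5, 0.0, 0.1), (0.5, 1.0, 0.1)
boundary = bisect_root(
    lambda s: responsibility_difference(s, within, between), 0.0, 1.0
)
```

With symmetric components the boundary falls midway between the two means; unequal weights or spreads shift it towards the weaker component.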

PopPUNK.refine.multi_refine(distMat, sample_names, mean0, mean1, scale, s_max, n_boundary_points, output_prefix, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]

Move the refinement boundary between the optimum and where it meets an axis. Moves in discrete steps, outputting the clusters at each step

Args:
distMat (numpy.array)

n x 2 array of core and accessory distances for n samples

sample_names (list)

List of query sequence labels

mean0 (numpy.array)

Start point to define search line

mean1 (numpy.array)

End point to define search line

scale (numpy.array)

Scaling factor of distMat

s_max (float)

The optimal s position from refinement (refineFit())

n_boundary_points (int)

Number of positions to try drawing the boundary at

num_processes (int)

Number of threads to use in the global optimisation step. (default = 1)

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analyses

PopPUNK.refine.newNetwork(s, sample_names, distMat, mean0, mean1, gradient, slope=2, score_idx=0, cpus=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]

Wrapper function for construct_network_from_edge_list() which is called by optimisation functions moving a triangular decision boundary.

Given the boundary parameterisation, constructs the network and returns its score, to be minimised.

Args:
s (float)

Distance along the line from mean0 towards mean1

sample_names (list)

Sample names corresponding to distMat (accessed by iterator)

distMat (numpy.array or NumpyShared)

Core and accessory distances or NumpyShared describing these in sharedmem

mean0 (numpy.array)

Start point

mean1 (numpy.array)

End point

gradient (float)

Gradient of line to move along

slope (int)

Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope [default = 2]

score_idx (int)

Index of score from networkSummary() to use [default = 0]

cpus (int)

Number of CPUs to use for calculating assignment

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analysis

Returns:
score (float)

-1 * network score, where the network score is from networkSummary()
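
The boundary parameterisation can be sketched geometrically. The example below assumes the decision boundary is the line through the point at distance s along the search line, drawn perpendicular to it, so that it cuts off a triangle against the axes. The helper names are hypothetical and this is not the library's own code.

```python
import math


def point_on_line(s, mean0, mean1):
    """Point at distance s from mean0 in the direction of mean1."""
    dx, dy = mean1[0] - mean0[0], mean1[1] - mean0[1]
    length = math.hypot(dx, dy)
    return (mean0[0] + s * dx / length, mean0[1] + s * dy / length)


def boundary_intercepts(point, gradient):
    """Axis intercepts of the line through point that is perpendicular
    to a search line with the given (non-zero) gradient."""
    x, y = point
    x_intercept = x + y * gradient  # boundary slope is -1 / gradient
    y_intercept = y + x / gradient
    return x_intercept, y_intercept


# Hypothetical search line from the origin towards (0.3, 0.4)
mean0, mean1 = (0.0, 0.0), (0.3, 0.4)
gradient = (mean1[1] - mean0[1]) / (mean1[0] - mean0[0])
point = point_on_line(0.25, mean0, mean1)
x_max, y_max = boundary_intercepts(point, gradient)
```

Under this assumption, pairs of samples whose distances fall inside the triangle formed by the two intercepts and the origin would be the ones joined by an edge in the network that is then scored.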

PopPUNK.refine.newNetwork2D(y_idx, sample_names, distMat, x_range, y_range, score_idx=0, betweenness_sample=100, sample_size=None, use_gpu=False)[source]

Wrapper function for thresholdIterate2D and growNetwork().

For a given y_max, constructs networks across x_range and returns a list of scores

Args:
y_idx (float)

Maximum y-intercept of boundary, as index into y_range

sample_names (list)

Sample names corresponding to distMat (accessed by iterator)

distMat (numpy.array or NumpyShared)

Core and accessory distances or NumpyShared describing these in sharedmem

x_range (list)

Sorted list of x-intercepts to search

y_range (list)

Sorted list of y-intercepts to search

score_idx (int)

Index of score from networkSummary() to use [default = 0]

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analysis

Returns:
scores (list)

-1 * network score for each of x_range, where the network score is from networkSummary()

PopPUNK.refine.readManualStart(startFile)[source]

Reads a file to define a manual start point, rather than using --fit-model

Throws and exits if incorrectly formatted.

Args:
startFile (str)

Name of file with values to read

Returns:
mean0 (numpy.array)

Centre of within-strain distribution

mean1 (numpy.array)

Centre of between-strain distribution

scaled (bool)

True if means are scaled between [0,1]

PopPUNK.refine.refineFit(distMat, sample_names, mean0, mean1, scale, max_move, min_move, slope=2, score_idx=0, unconstrained=False, no_local=False, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]

Try to refine a fit by maximising a network score based on transitivity and density.

Iteratively move the decision boundary to do this, using starting point from existing model.

Args:
distMat (numpy.array)

n x 2 array of core and accessory distances for n samples

sample_names (list)

List of query sequence labels

mean0 (numpy.array)

Start point to define search line

mean1 (numpy.array)

End point to define search line

scale (numpy.array)

Scaling factor of distMat

max_move (float)

Maximum distance to move away from start point

min_move (float)

Minimum distance to move away from start point

slope (int)

Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope

score_idx (int)

Index of score from networkSummary() to use [default = 0]

unconstrained (bool)

If True, search in 2D and change the slope of the boundary

no_local (bool)

Turn off the local optimisation step. Quicker, but may be less well refined.

num_processes (int)

Number of threads to use in the global optimisation step. (default = 1)

betweenness_sample (int)

Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]

sample_size (int)

Number of nodes to subsample for graph statistic calculation

use_gpu (bool)

Whether to use cugraph for graph analyses

Returns:
optimal_x (float)

x-coordinate of refined fit

optimal_y (float)

y-coordinate of refined fit

optimised_s (float)

Position along search range of refined fit

plot.py

Plots of GMM results, k-mer fits, and microreact output

PopPUNK.plot.createMicroreact(prefix, microreact_files, api_key=None)[source]

Creates a .microreact file and a Microreact instance via the API

Args:
prefix (str)

Prefix for output file

microreact_files (list)

List of Microreact files [clusters, dot, tree, mst_tree]

api_key (str)

API key for your account

PopPUNK.plot.distHistogram(dists, rank, outPrefix)[source]

Plot a histogram of distances (1D)

Args:
dists (np.array)

Distance vector

rank (int)

Rank (used for name and title)

outPrefix (str)

Full path prefix for plot file

PopPUNK.plot.drawMST(mst, outPrefix, isolate_clustering, clustering_name, overwrite)[source]

Plot a layout of the minimum spanning tree

Args:
mst (graph_tool.Graph)

A minimum spanning tree

outPrefix (str)

Output prefix for save files

isolate_clustering (dict)

Dictionary of ID: cluster, used for colouring vertices

clustering_name (str)

Name of clustering scheme to be used for colouring

overwrite (bool)

Overwrite existing output files

PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]

Get a square grid of points to evaluate a function across

Used for plot_scatter() and plot_contours()

Args:
minimum (float)

Minimum value for grid

maximum (float)

Maximum value for grid

resolution (int)

Number of points along each axis

Returns:
xx (numpy.array)

x values across n x n grid

yy (numpy.array)

y values across n x n grid

xy (numpy.array)

n x 2 pairs of x, y values grid is over

PopPUNK.plot.outputsForCytoscape(G, G_mst, isolate_names, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True, use_partial_query_graph=None)[source]

Write outputs for cytoscape. A graphml of the network, and CSV with metadata

Args:
G (graph)

The network to write

G_mst (graph)

The minimum spanning tree of G

isolate_names (list)

Ordered list of sequence names

clustering (dict)

Dictionary of cluster assignments (keys are nodeNames).

outPrefix (str)

Prefix for files to be written

epiCsv (str)

Optional CSV of epi data to paste in the output in addition to the clusters.

queryList (list)

Optional list of isolates that have been added as a query. (default = None)

suffix (string)

String to append to network file name. (default = None)

writeCsv (bool)

Whether to print CSV file to accompany network

use_partial_query_graph (str)

File listing sequences to be included in output graph

PopPUNK.plot.outputsForGrapetree(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]

Generate files for Grapetree

Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)

Args:
combined_list (list)

Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output

clustering (dict or dict of dicts)

Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.

nj_tree (str or None)

String representation of a Newick-formatted NJ tree

mst_tree (str or None)

String representation of a Newick-formatted minimum-spanning tree

outPrefix (str)

Prefix for all generated output files, which will be placed in outPrefix subdirectory.

epiCsv (str)

A CSV containing other information, to include with the CSV of clusters

queryList (list)

Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)

overwrite (bool)

Overwrite existing output if present (default = False).

PopPUNK.plot.outputsForMicroreact(combined_list, clustering, nj_tree, mst_tree, accMat, perplexity, maxIter, outPrefix, epiCsv, queryList=None, overwrite=False, n_threads=1, use_gpu=False, device_id=0)[source]

Generate files for microreact

Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)

Args:
combined_list (list)

Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output

clustering (dict or dict of dicts)

Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.

nj_tree (str or None)

String representation of a Newick-formatted NJ tree

mst_tree (str or None)

String representation of a Newick-formatted minimum-spanning tree

accMat (numpy.array)

n x n array of accessory distances for n samples.

perplexity (int)

Perplexity parameter passed to mandrake

maxIter (int)

Maximum iterations for mandrake

outPrefix (str)

Prefix for all generated output files, which will be placed in outPrefix subdirectory

epiCsv (str)

A CSV containing other information, to include with the CSV of clusters

queryList (list)

Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)

overwrite (bool)

Overwrite existing output if present (default = False)

n_threads (int)

Number of CPU threads to use (default = 1)

use_gpu (bool)

Whether to use a GPU for t-SNE generation

device_id (int)

Device ID of GPU to be used (default = 0)

Returns:
outfiles (list)

List of output files created

PopPUNK.plot.outputsForPhandango(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]

Generate files for Phandango

Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)

Args:
combined_list (list)

Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output

clustering (dict or dict of dicts)

Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.

nj_tree (str or None)

String representation of a Newick-formatted NJ tree

mst_tree (str or None)

String representation of a Newick-formatted minimum-spanning tree

outPrefix (str)

Prefix for all generated output files, which will be placed in outPrefix subdirectory

epiCsv (str)

A CSV containing other information, to include with the CSV of clusters

queryList (list)

Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)

overwrite (bool)

Overwrite existing output if present (default = False)

threads (int)

Number of threads to use with rapidnj

PopPUNK.plot.plot_contours(model, assignments, title, out_prefix)[source]

Draw contours of mixture model assignments

Will draw the decision boundary for between/within in red

Args:
model (BGMMFit)

Model we are plotting from

assignments (numpy.array)

n-vectors of cluster assignments for model

title (str)

The title to display above the plot

out_prefix (str)

Prefix for output plot file (.pdf will be appended)

PopPUNK.plot.plot_database_evaluations(prefix, genome_lengths, ambiguous_bases)[source]

Plot histograms of sequence characteristics for database evaluation.

Args:
prefix (str)

Prefix for output files

genome_lengths (list)

Lengths of genomes in database

ambiguous_bases (list)

Counts of ambiguous bases in genomes in database

PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix, use_gpu)[source]

Draw a scatter plot (png) to show the DBSCAN model fit

A scatter plot of core and accessory distances, coloured by component membership. Black is noise

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples.

y (numpy.array)

n x 1 array of cluster assignments for n samples.

n_clusters (int)

Number of clusters used (excluding noise)

out_prefix (str)

Prefix for output file (.png will be appended)

use_gpu (bool)

Whether model was fitted with GPU-enabled code

PopPUNK.plot.plot_evaluation_histogram(input_data, n_bins=100, prefix='hist', suffix='', plt_title='histogram', xlab='x')[source]

Plot histograms of sequence characteristics for database evaluation.

Args:
input_data (list)

Input data (list of numbers)

n_bins (int)

Number of bins to use for the histogram

prefix (str)

Prefix of database

suffix (str)

Suffix specifying plot type

plt_title (str)

Title for plot

xlab (str)

Title for the horizontal axis

PopPUNK.plot.plot_fit(klist, raw_matching, raw_fit, corrected_matching, corrected_fit, out_prefix, title)[source]

Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance

K-mer sizes on x-axis, log(pr(match)) on y - expect a straight line fit with intercept representing accessory distance and slope core distance

Args:
klist (list)

List of k-mer sizes

raw_matching (list)

Proportion of matching k-mers at each klist value

raw_fit (numpy.array)

Fit to klist and raw_matching from fitKmerCurve()

corrected_matching (list)

Corrected proportion of matching k-mers at each klist value

corrected_fit (numpy.array)

Fit to klist and corrected_matching from fitKmerCurve()

out_prefix (str)

Prefix for output plot file (.pdf will be appended)

title (str)

The title to display above the plot

PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, min_move, max_move, scale, threshold, indiv_boundaries, unconstrained, title, out_prefix)[source]

Draw a scatter plot (png) to show the refined model fit

A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples.

Y (numpy.array)

n x 1 array of cluster assignments for n samples.

x_boundary (float)

Intercept of boundary with x-axis, from RefineFit

y_boundary (float)

Intercept of boundary with y-axis, from RefineFit

core_boundary (float)

Intercept of 1D (core) boundary with x-axis, from RefineFit

accessory_boundary (float)

Intercept of 1D (accessory) boundary with y-axis, from RefineFit

mean0 (numpy.array)

Centre of within-strain distribution

mean1 (numpy.array)

Centre of between-strain distribution

min_move (float)

Minimum s range

max_move (float)

Maximum s range

scale (numpy.array)

Scaling factor from RefineFit

threshold (bool)

If fit was just from a simple thresholding

indiv_boundaries (bool)

Whether to draw lines for core and accessory refinement

title (str)

The title to display above the plot

out_prefix (str)

Prefix for output plot file (.png will be appended)

PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]

Draw a scatter plot (png) to show the BGMM model fit

A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centre: means; axes: covariances).

This is based on the example in the sklearn documentation.

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples.

Y (numpy.array)

n x 1 array of cluster assignments for n samples.

means (numpy.array)

Component means from BGMMFit

covariances (numpy.array)

Component covariances from BGMMFit

scale (numpy.array)

Scaling factor from BGMMFit

out_prefix (str)

Prefix for output plot file (.png will be appended)

title (str)

The title to display above the plot

PopPUNK.plot.plot_scatter(X, out_prefix, title, kde=True)[source]

Draws a 2D scatter plot (png) of the core and accessory distances

Also draws contours of kernel density estimate

Args:
X (numpy.array)

n x 2 array of core and accessory distances for n samples.

out_prefix (str)

Prefix for output plot file (.png will be appended)

title (str)

The title to display above the plot

kde (bool)

Whether to draw kernel density estimate contours

(default = True)

PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None, suffix='_Cluster')[source]

Print CSV file of clustering and optionally epi data

Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.

The epiCsv, if provided, should have the node labels in the first column.

Args:
outfile (str)

File to write the CSV to.

nodeNames (list)

Names of sequences in clustering (includes path).

nodeLabels (list)

Names of sequences to write in CSV (usually has path removed).

clustering (dict or dict of dicts)

Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.

output_format (str)

Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)

epiCsv (str)

Optional CSV of epi data to paste in the output in addition to the clusters (default = None).

queryNames (list)

Optional list of isolates that have been added as a query.

(default = None)

sparse_mst.py

sketchlib.py

Sketchlib functions for database construction

PopPUNK.sketchlib.addRandom(oPrefix, sequence_names, klist, strand_preserved=False, overwrite=False, threads=1)[source]

Add chance of random match to a HDF5 sketch DB

Args:
oPrefix (str)

Sketch database prefix

sequence_names (list)

Names of sequences to include in calculation

klist (list)

List of k-mer sizes to sketch

strand_preserved (bool)

Set true to ignore rc k-mers

overwrite (bool)

Set true to overwrite existing random match chances

threads (int)

Number of threads to use (default = 1)

PopPUNK.sketchlib.checkSketchlibLibrary()[source]

Gets the location of the sketchlib library

Returns:
lib (str)

Location of sketchlib .so/.dylib

PopPUNK.sketchlib.checkSketchlibVersion()[source]

Checks that sketchlib can be run, and returns version

Returns:
version (str)

Version string

PopPUNK.sketchlib.constructDatabase(assemblyList, klist, sketch_size, oPrefix, threads, overwrite, strand_preserved, min_count, use_exact, calc_random=True, codon_phased=False, use_gpu=False, deviceid=0)[source]

Sketch the input assemblies at the requested k-mer lengths

A multithreaded wrapper around runSketch(). Threads are used to run multiple sketch processes for each klist value.

Also calculates random match probability based on length of first genome in assemblyList.

Args:
assemblyList (str)

File with locations of assembly files to be sketched

klist (list)

List of k-mer sizes to sketch

sketch_size (int)

Size of sketch (-s option)

oPrefix (str)

Output prefix for resulting sketch files

threads (int)

Number of threads to use (default = 1)

overwrite (bool)

Whether to overwrite sketch DBs, if they already exist. (default = False)

strand_preserved (bool)

Ignore reverse complement k-mers (default = False)

min_count (int)

Minimum count of k-mer in reads to include (default = 0)

use_exact (bool)

Use exact count of k-mer appearance in reads (default = False)

calc_random (bool)

Add random match chances to DB (turn off for queries)

codon_phased (bool)

Use codon phased seeds (default = False)

use_gpu (bool)

Use GPU for read sketching (default = False)

deviceid (int)

GPU device id (default = 0)

Returns:
names (list)

List of names included in the database (from rfile)

PopPUNK.sketchlib.createDatabaseDir(outPrefix, kmers)[source]

Creates the directory to write sketches to, removing old files if necessary

Args:
outPrefix (str)

output db prefix

kmers (list)

k-mer sizes in db

PopPUNK.sketchlib.fitKmerCurve(pairwise, klist, jacobian)[source]

Fit the function \(pr = (1-a)(1-c)^k\)

Supply jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))

Args:
pairwise (numpy.array)

Proportion of shared k-mers at k-mer values in klist

klist (list)

k-mer sizes used

jacobian (numpy.array)

Should be set as above (set once to try and save memory)

Returns:
transformed_params (numpy.array)

Column with core and accessory distance
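
As a rough illustration of the fit described above: taking logs gives log(pr) = log(1-a) + k log(1-c), which is linear in k. The sketch below solves that linear problem with NumPy. The name fit_kmer_curve_sketch is hypothetical, the use of np.linalg.lstsq is an assumption rather than the optimiser PopPUNK uses, and no bounds are placed on the parameters.

```python
import numpy as np

def fit_kmer_curve_sketch(pairwise, klist, jacobian):
    """Hypothetical stand-in: fit log(pr) = log(1-a) + k*log(1-c)."""
    # The residual is log(pr) - (x0 + x1*k), so its Jacobian is -[1, k].
    # Negating it recovers the design matrix of the linear problem.
    design = -jacobian
    coeffs, *_ = np.linalg.lstsq(design, np.log(pairwise), rcond=None)
    accessory = 1.0 - np.exp(coeffs[0])  # a, from the intercept
    core = 1.0 - np.exp(coeffs[1])       # c, from the slope
    return np.array([core, accessory])

klist = np.array([13, 17, 21, 25, 29])
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
# Synthetic proportions generated with a = 0.10 and c = 0.02
pairwise = (1 - 0.10) * (1 - 0.02) ** klist
core, accessory = fit_kmer_curve_sketch(pairwise, klist, jacobian)
```

Building the jacobian once and reusing it for every pair avoids reallocating the same array for each comparison, which is the memory saving the argument description refers to.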

PopPUNK.sketchlib.getKmersFromReferenceDatabase(dbPrefix)[source]

Get k-mer lengths from an existing database

Args:
dbPrefix (str)

Prefix for sketch DB files

Returns:
kmers (list)

List of k-mer lengths used in database

PopPUNK.sketchlib.getSeqsInDb(dbname)[source]

Return an array with the sequences in the passed database

Args:
dbname (str)

Sketches database filename

Returns:
seqs (list)

List of sequence names in sketch DB

PopPUNK.sketchlib.getSketchSize(dbPrefix)[source]

Determine sketch size, and ensure it is consistent across the whole database

sys.exit(1) is called if DBs have different sketch sizes

Args:
dbPrefix (str)

Prefix for databases

Returns:
sketchSize (int)

sketch size (64x C++ definition)

codonPhased (bool)

whether the DB used codon phased seeds

PopPUNK.sketchlib.get_database_statistics(prefix)[source]

Extract statistics for evaluating databases.

Args:
prefix (str)

Prefix of database

PopPUNK.sketchlib.joinDBs(db1, db2, output, update_random=None, full_names=False)[source]

Join two sketch databases with the low-level HDF5 copy interface

Args:
db1 (str)

Prefix for db1

db2 (str)

Prefix for db2

output (str)

Prefix for joined output

update_random (dict)

Whether to re-calculate the random object. May contain control arguments strand_preserved and threads (see addRandom())

full_names (bool)

If True, db1, db2 and output are the full paths to h5 files

PopPUNK.sketchlib.queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self=True, number_plot_fits=0, threads=1, use_gpu=False, deviceid=0)[source]

Calculate core and accessory distances between query sequences and a sketched database

For a reference database, runs the query against itself to find all pairwise core and accessory distances.

Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)

To get the ref and query name for each row of the returned distances, call the iterator iterDistRows() with the returned refList and queryList

Args:
rNames (list)

Names of references to query

qNames (list)

Names of queries

dbPrefix (str)

Prefix for reference sketch database created by constructDatabase()

queryPrefix (str)

Prefix for query sketch database created by constructDatabase()

klist (list)

K-mer sizes to use in the calculation

self (bool)

Set true if query = ref (default = True)

number_plot_fits (int)

If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls plot_fit() (default = 0)

threads (int)

Number of threads to use in the process (default = 1)

use_gpu (bool)

Use a GPU for querying (default = False)

deviceid (int)

Index of the CUDA GPU device to use (default = 0)

Returns:
distMat (numpy.array)

Core distances (column 0) and accessory distances (column 1) between refList and queryList

PopPUNK.sketchlib.readDBParams(dbPrefix)[source]

Get k-mer lengths and sketch sizes from an existing database

Calls getKmersFromReferenceDatabase() and getSketchSize(). Uses passed values if the db is missing

Args:
dbPrefix (str)

Prefix for sketch DB files

Returns:
kmers (list)

List of k-mer lengths used in database

sketch_sizes (list)

List of sketch sizes used in database

codonPhased (bool)

whether the DB used codon phased seeds

PopPUNK.sketchlib.removeFromDB(db_name, out_name, removeSeqs, full_names=False)[source]

Remove sketches from the DB using the low-level HDF5 copy interface

Args:
db_name (str)

Prefix for hdf database

out_name (str)

Prefix for output (pruned) database

removeSeqs (list)

Names of sequences to remove from database

full_names (bool)

If True, db_name and out_name are the full paths to h5 files

utils.py

General utility functions for data read/writing/manipulation in PopPUNK

PopPUNK.utils.check_and_set_gpu(use_gpu, gpu_lib, quit_on_fail=False)[source]

Check GPU libraries can be loaded and set managed memory.

Args:
use_gpu (bool)

Whether GPU packages have been requested

gpu_lib (bool)

Whether GPU packages are available

Returns:
use_gpu (bool)

Whether GPU packages can be used

PopPUNK.utils.decisionBoundary(intercept, gradient, adj=0.0)[source]

Returns the co-ordinates where the triangle the decision boundary forms meets the x- and y-axes.

Args:
intercept (numpy.array)

Cartesian co-ordinates of point along line (transformLine()) which intercepts the boundary

gradient (float)

Gradient of the line

adj (float)

Fraction by which to shift the intercept up the y axis

Returns:
x (float)

The x-axis intercept

y (float)

The y-axis intercept
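
One way to picture the two returned values is sketched below. It assumes the decision boundary is the line perpendicular to the line of the given gradient, passing through the intercept point. That geometry is an assumption for illustration, the adj argument is left out, and decision_boundary_sketch is a hypothetical name.

```python
def decision_boundary_sketch(intercept, gradient):
    """Hypothetical helper: axis crossings of a perpendicular boundary."""
    x0, y0 = intercept
    # Assumed boundary: y - y0 = -(x - x0) / gradient
    x_axis = x0 + y0 * gradient  # solve the boundary for y = 0
    y_axis = y0 + x0 / gradient  # solve the boundary for x = 0
    return x_axis, y_axis

x, y = decision_boundary_sketch((1.0, 1.0), 1.0)
x2, y2 = decision_boundary_sketch((0.5, 0.25), 2.0)
```

Together with the origin, the two crossing points form the triangle mentioned in the description.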

PopPUNK.utils.get_match_search_depth(rlist, rank_list)[source]

Return a default search depth for lineage model fitting.

Args:
rlist (list)

List of sequences in database

rank_list (list)

List of ranks to be used to fit lineage models

Returns:
max_search_depth (int)

Maximum kNN used for lineage model fitting

PopPUNK.utils.isolateNameToLabel(names)[source]

Function to process isolate names to labels appropriate for visualisation.

Args:
names (list)

List of isolate names.

Returns:
labels (list)

List of isolate labels.

PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]

Gets the ref and query ID for each row of the distance matrix

Returns an iterable with ref and query ID pairs by row.

Args:
refSeqs (list)

List of reference sequence names.

querySeqs (list)

List of query sequence names.

self (bool)

Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs (default = True)

Returns:
ref, query (str, str)

Iterable of tuples with ref and query names for each distMat row.
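
The pairing of names to distance rows can be sketched as follows. The row order used here (each unordered pair once for a self-comparison, otherwise every reference for each query in turn) is an assumption for illustration and may not match the order PopPUNK uses, so real distance matrices should be labelled with the library function itself. iter_dist_rows_sketch is a hypothetical name.

```python
from itertools import combinations, product

def iter_dist_rows_sketch(ref_seqs, query_seqs, self=True):
    """Hypothetical generator yielding (ref, query) names, one per row."""
    if self:
        if ref_seqs != query_seqs:
            raise ValueError("self=True requires ref_seqs == query_seqs")
        # Self-comparison: each unordered pair appears once (assumed order)
        yield from combinations(ref_seqs, 2)
    else:
        # Query vs reference: all references for each query (assumed order)
        for query, ref in product(query_seqs, ref_seqs):
            yield ref, query

self_rows = list(iter_dist_rows_sketch(["a", "b", "c"], ["a", "b", "c"]))
query_rows = list(iter_dist_rows_sketch(["a", "b"], ["q"], self=False))
```

A self-comparison of n sequences yields n(n-1)/2 rows, while a query comparison yields one row per reference and query combination.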

PopPUNK.utils.joinClusterDicts(d1, d2)[source]

Join two dictionaries returned by readIsolateTypeFromCsv() with return_dict = True. Useful for concatenating ref and query assignments

Args:
d1 (dict of dicts)

First dictionary to concat

d2 (dict of dicts)

Second dictionary to concat

Returns:
d1 (dict of dicts)

d1 with d2 appended
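
A minimal sketch of this kind of merge is shown below. It assumes the outer keys name the clustering and the inner dictionaries map samples to clusters. Both the layout and the name join_cluster_dicts_sketch are illustrative assumptions, not the library's code.

```python
def join_cluster_dicts_sketch(d1, d2):
    """Hypothetical merge of two dict-of-dicts cluster assignments."""
    if d1.keys() != d2.keys():
        raise ValueError("Cluster columns not compatible")
    for column in d1:
        # Add the second set of assignments to the first, in place
        d1[column].update(d2[column])
    return d1

refs = {"Cluster": {"ref1": "1", "ref2": "2"}}
queries = {"Cluster": {"query1": "1"}}
joined = join_cluster_dicts_sketch(refs, queries)
```

Note that the first dictionary is modified in place and returned, which matches the documented return value of d1 with d2 appended.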

PopPUNK.utils.listDistInts(refSeqs, querySeqs, self=True)[source]

Gets the ref and query ID for each row of the distance matrix

Returns an iterable with ref and query ID pairs by row.

Args:
refSeqs (list)

List of reference sequence names.

querySeqs (list)

List of query sequence names.

self (bool)

Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs (default = True)

Returns:
ref, query (str, str)

Iterable of tuples with ref and query names for each distMat row.

PopPUNK.utils.readIsolateTypeFromCsv(clustCSV, mode='clusters', return_dict=False)[source]

Read cluster definitions from CSV file.

Args:
clustCSV (str)

File name of CSV with isolate assignments

mode (str)

Type of file to read: ‘clusters’, ‘lineages’, or ‘external’

return_dict (bool)

If True, return a dict with sample->cluster instead of sets [default = False]

Returns:
clusters (dict)

Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
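
The two return shapes can be illustrated with a small standalone reader. The column names Taxon and Cluster, the in-memory CSV, and the name read_clusters_sketch are assumptions made for this example. The mode argument is not modelled.

```python
import csv
import io
from collections import defaultdict

def read_clusters_sketch(handle, return_dict=False):
    """Hypothetical reader showing both documented return shapes."""
    clusters = {} if return_dict else defaultdict(set)
    for row in csv.DictReader(handle):
        sample, cluster = row["Taxon"], row["Cluster"]
        if return_dict:
            clusters[sample] = cluster    # sample -> cluster
        else:
            clusters[cluster].add(sample)  # cluster -> set of samples
    return dict(clusters)

example_csv = "Taxon,Cluster\ns1,1\ns2,1\ns3,2\n"
by_cluster = read_clusters_sketch(io.StringIO(example_csv))
by_sample = read_clusters_sketch(io.StringIO(example_csv), return_dict=True)
```

The set-valued form suits questions about cluster membership, while the sample-keyed form suits looking up one isolate at a time.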

PopPUNK.utils.readPickle(pklName, enforce_self=False, distances=True)[source]

Loads core and accessory distances saved by storePickle()

Called during --fit-model

Args:
pklName (str)

Prefix for saved files

enforce_self (bool)

Error if self == False

[default = False]

distances (bool)

Read the distance matrix

[default = True]

Returns:
rlist (list)

List of reference sequence names (for iterDistRows())

qlist (list)

List of query sequence names (for iterDistRows())

self (bool)

Whether an all-vs-all self DB (for iterDistRows())

X (numpy.array)

n x 2 array of core and accessory distances

PopPUNK.utils.readRfile(rFile, oneSeq=False)[source]

Reads in files for sketching. Names and sequence, tab separated

Args:
rFile (str)

File with locations of assembly files to be sketched

oneSeq (bool)

Return only the first sequence listed, rather than a list (used with mash)

Returns:
names (list)

Array of sequence names

sequences (list of lists)

Array of sequence files
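
The input layout described above (a name followed by tab-separated sequence files) could be parsed along these lines. This is a sketch under assumptions: read_rfile_sketch is a hypothetical name, the file names are invented, and allowing several sequence files per line is inferred from the list-of-lists return type.

```python
import io

def read_rfile_sketch(handle, one_seq=False):
    """Hypothetical parser for 'name<TAB>file[<TAB>file...]' lines."""
    names, sequences = [], []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip().split("\t")
        if len(fields) < 2:
            raise ValueError("Expected a name and at least one sequence file")
        names.append(fields[0])
        # Keep only the first file when one_seq is set
        sequences.append(fields[1] if one_seq else fields[1:])
    return names, sequences

rfile = io.StringIO(
    "sample1\tsample1.fa\n"
    "sample2\tsample2_1.fq\tsample2_2.fq\n"
)
names, sequences = read_rfile_sketch(rfile)
```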

PopPUNK.utils.read_rlist_from_distance_pickle(fn, allow_non_self=True, include_queries=False)[source]

Return the list of reference sequences from a distance pickle.

Args:
fn (str)

Name of distance pickle

allow_non_self (bool)

Whether non-self distance datasets are permissible

include_queries (bool)

Whether queries should be included in the rlist

Returns:
rlist (list)

List of reference sequence names

PopPUNK.utils.set_env(**environ)[source]

Temporarily set the process environment variables.

>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...     "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False
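
A context manager with this behaviour can be written in a few lines with contextlib. The version below is a sketch, not the library's source. The names set_env_sketch and POPPUNK_SKETCH_DEMO_VAR are invented for the example.

```python
import os
from contextlib import contextmanager

@contextmanager
def set_env_sketch(**environ):
    """Hypothetical equivalent: set variables, then restore the old state."""
    old_environ = dict(os.environ)
    os.environ.update(environ)
    try:
        yield
    finally:
        # Restore the environment even if the body raised an exception
        os.environ.clear()
        os.environ.update(old_environ)

with set_env_sketch(POPPUNK_SKETCH_DEMO_VAR="test/plugins"):
    inside = "POPPUNK_SKETCH_DEMO_VAR" in os.environ
outside = "POPPUNK_SKETCH_DEMO_VAR" in os.environ
```

Restoring in a finally block means the variables are reset on both normal exit and error.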

PopPUNK.utils.setupDBFuncs(args)[source]

Wraps common database access functions from sketchlib and mash, to try and make their API more similar

Args:
args (argparse.opts)

Parsed command line options

qc_dict (dict)

Table of parameters for QC function

Returns:
dbFuncs (dict)

Functions with consistent arguments to use as the database API

PopPUNK.utils.stderr_redirected(to='/dev/null')[source]

import os

with stdout_redirected(to=filename):
    print("from Python")
    os.system("echo non-Python applications are also supported")

PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]

Saves core and accessory distances in a .npy file, names in a .pkl

Called during --create-db

Args:
rlist (list)

List of reference sequence names (for iterDistRows())

qlist (list)

List of query sequence names (for iterDistRows())

self (bool)

Whether an all-vs-all self DB (for iterDistRows())

X (numpy.array)

n x 2 array of core and accessory distances

If None, do not save

pklName (str)

Prefix for output files
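
The split between a .pkl file for names and a .npy file for distances can be sketched as a round trip. The exact file names (prefix plus extension), the list layout inside the pickle, and the function names store_pickle_sketch and read_pickle_sketch are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

def store_pickle_sketch(rlist, qlist, self, X, pkl_name):
    """Hypothetical writer: names to <prefix>.pkl, distances to <prefix>.npy."""
    with open(pkl_name + ".pkl", "wb") as handle:
        pickle.dump([rlist, qlist, self], handle)
    if X is not None:  # distances are optional
        np.save(pkl_name + ".npy", X)

def read_pickle_sketch(pkl_name):
    """Hypothetical reader for the files written above."""
    with open(pkl_name + ".pkl", "rb") as handle:
        rlist, qlist, self = pickle.load(handle)
    X = np.load(pkl_name + ".npy")
    return rlist, qlist, self, X

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "example.dists")
    dists = np.array([[0.01, 0.10], [0.02, 0.20], [0.03, 0.30]])
    store_pickle_sketch(["a", "b", "c"], ["a", "b", "c"], True, dists, prefix)
    rlist, qlist, self_flag, loaded = read_pickle_sketch(prefix)
```

Keeping the large numeric array out of the pickle lets the names be read without loading the distances.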

PopPUNK.utils.transformLine(s, mean0, mean1)[source]

Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.

Args:
s (float)

Distance along line from mean0

mean0 (numpy.array)

Start position of line (x0, y0)

mean1 (numpy.array)

End position of line (x1, y1)

Returns:
x (float)

The Cartesian x-coordinate

y (float)

The Cartesian y-coordinate
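
One geometric reading of this parameterisation is sketched below. It assumes s is a Euclidean distance along the unit vector from mean0 to mean1. That is an assumption made for illustration and the library's own scaling of s may differ. transform_line_sketch is a hypothetical name.

```python
import numpy as np

def transform_line_sketch(s, mean0, mean1):
    """Hypothetical helper: point at distance s from mean0 towards mean1."""
    mean0 = np.asarray(mean0, dtype=float)
    mean1 = np.asarray(mean1, dtype=float)
    direction = mean1 - mean0
    unit = direction / np.linalg.norm(direction)  # unit vector along the line
    x, y = mean0 + s * unit
    return x, y

# Moving 5 units along the line from (0, 0) towards (3, 4)
x, y = transform_line_sketch(5.0, [0.0, 0.0], [3.0, 4.0])
```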

PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None, threads=1)[source]

Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.

Args:
refList (list)

List of references

distMat (numpy.array)

Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences

queryList (list)

List of queries

query_ref_distMat (numpy.array)

Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences

query_query_distMat (numpy.array)

Two column long form list of core and accessory distances for pairwise comparisons between query sequences

threads (int)

Number of threads to use

Returns:
seqLabels (list)

Combined list of reference and query sequences

coreMat (numpy.array)

NxN array of core distances for N sequences

accMat (numpy.array)

NxN array of accessory distances for N sequences
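
The long-to-square conversion for references alone can be sketched with NumPy. The example assumes rows are ordered as the upper triangle of the matrix read row by row, which may not match PopPUNK's ordering, and it omits the merging of query distances. long_to_square_sketch is a hypothetical name.

```python
import numpy as np

def long_to_square_sketch(ref_list, dist_mat):
    """Hypothetical conversion of n(n-1)/2 x 2 distances to two n x n arrays."""
    n = len(ref_list)
    if dist_mat.shape != (n * (n - 1) // 2, 2):
        raise ValueError("dist_mat must have n(n-1)/2 rows and 2 columns")
    core_mat = np.zeros((n, n))
    acc_mat = np.zeros((n, n))
    upper = np.triu_indices(n, k=1)  # pairs (0,1), (0,2), ..., assumed order
    core_mat[upper] = dist_mat[:, 0]
    acc_mat[upper] = dist_mat[:, 1]
    # Mirror the upper triangle to make both matrices symmetric
    core_mat += core_mat.T
    acc_mat += acc_mat.T
    return list(ref_list), core_mat, acc_mat

long_form = np.array([[0.01, 0.10], [0.02, 0.20], [0.03, 0.30]])
labels, core_mat, acc_mat = long_to_square_sketch(["a", "b", "c"], long_form)
```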

visualise.py

poppunk_visualise main function

PopPUNK.visualise.main()[source]

Main function. Parses cmd line args and runs in the specified mode.

web.py

Functions used by the web API to convert a sketch to an h5 database, then generate visualisations and post results to PopPUNK-web.

PopPUNK.web.api(query, ref_db)[source]

Post cluster and tree information to Microreact

PopPUNK.web.calc_prevalence(cluster, cluster_list, num_samples)[source]

Cluster prevalences for Plotly.js

PopPUNK.web.graphml_to_json(network_dir)[source]

Converts full GraphML file to JSON subgraph

PopPUNK.web.highlight_cluster(query, cluster)[source]

Colour assigned cluster in Microreact output

PopPUNK.web.sketch_to_hdf5(sketches_dict, output)[source]

Convert dict of JSON sketches to query hdf5 database

PopPUNK.web.summarise_clusters(output, species, species_db, qNames)[source]

Retrieve assigned query and all cluster prevalences. Write list of all isolates in cluster for tree subsetting