Reference documentation¶
Documentation for module functions (for developers)
assign.py¶
poppunk_assign
main function
- PopPUNK.assign.assign_query(dbFuncs, ref_db, q_files, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_sketch, gpu_dist, gpu_graph, deviceid, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode for CLI
- PopPUNK.assign.assign_query_hdf5(dbFuncs, ref_db, qNames, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_dist, gpu_graph, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode taking hdf5 as input. Written as a separate function so it can be called by web APIs
bgmm.py¶
Functions used to fit the mixture model to a database. Access using BGMMFit.
BGMM using sklearn
- PopPUNK.bgmm.findBetweenLabel_bgmm(means, assignments)[source]¶
Identify between-strain links
Finds the component with the largest number of points assigned to it
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- Returns:
- between_label (int)
The cluster label with the most points assigned to it
- PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]¶
Identify within-strain links
Finds the component with mean closest to the origin and also makes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior some components are unused)
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- rank (int)
Which label to find, ordered by distance from origin. 0-indexed. (default = 0)
- Returns:
- within_label (int)
The cluster label for the within-strain assignments
- PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]¶
Main function to fit the BGMM model, called from fit()
Fits the specified mixture model, saves the model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
- dpgmm_max_K (int)
Maximum number of components to use with the EM fit. (default = 2)
- Returns:
- dpgmm (sklearn.mixture.BayesianGaussianMixture)
Fitted bgmm model
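As a rough illustration of the kind of sklearn call this wraps (a minimal sketch with assumed priors and settings, not PopPUNK's exact parameterisation):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    X = np.random.rand(1000, 2) * 0.1          # stand-in for subsampled core/accessory distances
    dpgmm = BayesianGaussianMixture(
        n_components=2,                         # corresponds to dpgmm_max_K
        covariance_type="full",
        weight_concentration_prior_type="dirichlet_process",
        max_iter=1000,
    ).fit(X)
    assignments = dpgmm.predict(X)              # component membership per sample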
- PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]¶
Modified sklearn GMM function predicting distribution membership
Returns the mixture LL for points X. Used by assign_samples() and plot_contours()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- weights (numpy.array)
Component weights from
fit2dMultiGaussian()
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- scale (numpy.array)
Scaling of core and accessory distances from
fit2dMultiGaussian()
- Returns:
- logprob (numpy.array)
The log of the probabilities under the mixture model
- lpr (numpy.array)
The components of the log probability from each mixture component
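A minimal sketch of how such a mixture log-likelihood can be computed with scipy (illustrative only; the scaling and exact numerics in PopPUNK may differ):

    import numpy as np
    from scipy.special import logsumexp
    from scipy.stats import multivariate_normal

    def mixture_log_likelihood(X, weights, means, covars, scale):
        X = X / scale                                 # rescale distances into fit space
        lpr = np.stack([
            np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covars)
        ], axis=1)                                    # n x K per-component log probabilities
        logprob = logsumexp(lpr, axis=1)              # mixture log-likelihood per sample
        return logprob, lpr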
- PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]¶
Log likelihood of multivariate normal density distribution
Used to calculate per component Gaussian likelihood in
assign_samples()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- min_covar (float)
Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
- Returns:
- log_prob (numpy.array)
An n-vector with the log-likelihoods for each sample being in this component
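A sketch of a Cholesky-based per-component log-density, in the spirit of the old sklearn helper this mirrors (an assumption; the actual implementation may vectorise over components):

    import numpy as np
    from scipy import linalg

    def log_mvn_density_single(X, mean, covar, min_covar=1e-7):
        n_dim = X.shape[1]
        try:
            chol = linalg.cholesky(covar, lower=True)
        except linalg.LinAlgError:
            # add jitter when the covariance is not positive definite
            chol = linalg.cholesky(covar + min_covar * np.eye(n_dim), lower=True)
        sol = linalg.solve_triangular(chol, (X - mean).T, lower=True).T
        log_det = 2 * np.sum(np.log(np.diagonal(chol)))
        return -0.5 * (np.sum(sol ** 2, axis=1) + n_dim * np.log(2 * np.pi) + log_det)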
dbscan.py¶
Functions used to fit DBSCAN to a database. Access using DBSCANFit.
DBSCAN using hdbscan
- PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]¶
Evaluate whether fitted dbscan model contains non-overlapping clusters
- Args:
- model (DBSCANFit)
Fitted model from
fit()
- Returns:
- indistinct (bool)
Boolean indicating whether putative within- and between-strain clusters of points overlap
- PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]¶
Identify between-strain links from a DBSCAN model
Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.
- Args:
- assignments (numpy.array)
Sample cluster assignments
- within_cluster (int)
Cluster ID assigned to within-strain assignments, from
findWithinLabel()
- Returns:
- between_cluster (int)
The cluster label for the between-strain assignments
- PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out, use_gpu=False)[source]¶
Function to fit a DBSCAN model as an alternative to the Gaussian mixture model
Fits the DBSCAN model to the distances using hdbscan
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples
- min_samples (int)
Parameter for DBSCAN clustering ‘conservativeness’
- min_cluster_size (int)
Minimum number of points in a cluster for HDBSCAN
- cache_out (str)
Prefix for DBSCAN cache used for refitting
- use_gpu (bool)
Whether GPU algorithms should be used in DBSCAN fitting
- Returns:
- hdb (hdbscan.HDBSCAN or cuml.cluster.HDBSCAN)
Fitted HDBSCAN to subsampled data
- labels (list)
Cluster assignments of each sample
- n_clusters (int)
Number of clusters used
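A minimal CPU-only sketch of such an HDBSCAN fit (parameter values are illustrative, not PopPUNK's defaults):

    import numpy as np
    import hdbscan

    X = np.random.rand(5000, 2) * 0.1           # stand-in for subsampled distances
    hdb = hdbscan.HDBSCAN(
        min_samples=25,
        min_cluster_size=50,
        prediction_data=True,                    # keep data needed to assign new points
    ).fit(X)
    labels = hdb.labels_                         # -1 marks noise points
    n_clusters = int(labels.max()) + 1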
mandrake.py¶
- PopPUNK.mandrake.generate_embedding(seqLabels, accMat, perplexity, outPrefix, overwrite, kNN=50, maxIter=10000000, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate t-SNE projection using accessory distances
Writes a plot of t-SNE clustering of accessory distances (.dot)
- Args:
- seqLabels (list)
Processed names of sequences being analysed.
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to t-SNE
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- overwrite (bool)
Overwrite existing output if present (default = False)
- kNN (int)
Number of neighbours to use with SCE (cannot be > n_samples) (default = 50)
- maxIter (int)
Number of iterations to run (default = 10000000)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use GPU libraries
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- mandrake_filename (str)
Filename with .dot of embedding
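The embedding itself is produced by the mandrake package (stochastic cluster embedding). As a rough stand-in for the idea, a 2D embedding of a precomputed accessory distance matrix could be sketched with scikit-learn's t-SNE (a substitute technique, not the mandrake API):

    import numpy as np
    from sklearn.manifold import TSNE

    accMat = np.random.rand(200, 200)            # toy n x n accessory distances
    accMat = (accMat + accMat.T) / 2             # symmetrise
    np.fill_diagonal(accMat, 0.0)
    embedding = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",                           # required with a precomputed metric
        perplexity=30,
    ).fit_transform(accMat)                      # n x 2 coordinates for plotting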
models.py¶
Classes used for model fits
- class PopPUNK.models.BGMMFit(outPrefix, max_samples=100000, max_batch_size=100000, assign_points=True)[source]¶
Class for fits using the Gaussian mixture model. Inherits from ClusterFit.
Must first run either fit() or load() before calling other functions.
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- max_samples (int)
The number of subsamples to fit the model to (default = 100000)
- assign(X, max_batch_size=100000, values=False, progress=True)[source]¶
Assign the clustering of new samples using
assign_samples()
- Args:
- X (numpy.array)
Core and accessory distances
- values (bool)
Return the responsibilities of assignment rather than most likely cluster
- max_batch_size (int)
Size of batches to be assigned
- progress (bool)
Show progress bar
[default = True]
- Returns:
- y (numpy.array)
Cluster assignments or values by samples
- fit(X, max_components)[source]¶
Extends fit()
Fits the BGMM and returns assignments by calling fit2dMultiGaussian(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- max_components (int)
Maximum number of mixture components to use.
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (sklearn.mixture.BayesianGaussianMixture)
The saved fit object
- plot(X, y)[source]¶
Extends plot()
Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Cluster assignments from
assign()
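A hypothetical usage sketch of the BGMMFit fit/assign/plot cycle, based only on the signatures documented above (the file name is a placeholder):

    import numpy as np
    from PopPUNK.models import BGMMFit

    distMat = np.load("dists.npy")               # hypothetical n x 2 core/accessory array
    model = BGMMFit("bgmm_output")
    y = model.fit(distMat, max_components=2)     # fit and return cluster assignments
    model.plot(distMat, y)                       # write summary plots
    y_new = model.assign(distMat)                # assign distances with the fitted model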
- class PopPUNK.models.ClusterFit(outPrefix, default_dtype=numpy.float32)[source]¶
Parent class for all models used to cluster distances
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- fit(X=None)[source]¶
Initial steps for all fit functions.
Creates the output directory. If preprocess is set, then subsamples the passed X
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
(default = None)
- default_dtype (numpy dtype)
Type to use if no X provided
- class PopPUNK.models.DBSCANFit(outPrefix, use_gpu=False, max_batch_size=5000, max_samples=100000, assign_points=True)[source]¶
Class for fits using HDBSCAN. Inherits from ClusterFit.
Must first run either fit() or load() before calling other functions.
- Args:
- outPrefix (str)
The output prefix used for reading/writing
- max_samples (int)
The number of subsamples to fit the model to (default = 100000)
- assign(X, no_scale=False, progress=True, max_batch_size=5000, use_gpu=False)[source]¶
Assign the clustering of new samples using
assign_samples_dbscan()
- Args:
- X (numpy.array or cupy.array)
Core and accessory distances
- no_scale (bool)
Do not scale X [default = False]
- progress (bool)
Show progress bar [default = True]
- max_batch_size (int)
Batch size used for assignments [default = 5000]
- use_gpu (bool)
Use GPU-enabled algorithms for clustering [default = False]
- Returns:
- y (numpy.array)
Cluster assignments by samples
- fit(X, max_num_clusters, min_cluster_prop, use_gpu=False)[source]¶
Extends fit()
Fits the distances with HDBSCAN and returns assignments by calling fitDbScan(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- max_num_clusters (int)
Maximum number of clusters in DBSCAN fitting
- min_cluster_prop (float)
Minimum proportion of points in a cluster in DBSCAN fitting
- use_gpu (bool)
Whether GPU algorithms should be used in DBSCAN fitting
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (hdbscan.HDBSCAN)
The saved fit object
- plot(X=None, y=None)[source]¶
Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_dbscan_results()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Cluster assignments from
assign()
- class PopPUNK.models.LineageFit(outPrefix, ranks, max_search_depth, reciprocal_only, count_unique_distances, dist_col=None, use_gpu=False)[source]¶
Class for fits using the lineage assignment model. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions- Args:
- outPrefix (str)
The output prefix used for reading/writing
- ranks (list)
The ranks used in the fit
- assign(rank)[source]¶
Get the edges for the network. A little different from other methods, as it doesn’t go through the long form distance vector (as coo_matrix is basically already in the correct gt format)
- Args:
- rank (int)
Rank to assign at
- Returns:
- y (list of tuples)
Edges to include in network
- edge_weights(rank)[source]¶
Get the distances for each edge returned by assign
- Args:
- rank (int)
Rank assigned at
- Returns:
- weights (list)
Distance for each assignment
- extend(qqDists, qrDists)[source]¶
Update the sparse distance matrix of nearest neighbours after querying
- Args:
- qqDists (numpy or cupy ndarray)
Two column array of query-query distances
- qrDists (numpy or cupy ndarray)
Two column array of reference-query distances
- Returns:
- y (list of tuples)
Edges to include in network
- fit(X, accessory)[source]¶
Extends fit()
Gets assignments by using nearest neighbours.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- accessory (bool)
Use accessory rather than core distances
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (sklearn.mixture.BayesianGaussianMixture)
The saved fit object
- plot(X, y=None)[source]¶
Extends plot()
Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()
- Args:
- X (numpy.array)
Core and accessory distances
- y (any)
Unused variable for compatibility with other plotting functions
- class PopPUNK.models.RefineFit(outPrefix)[source]¶
Class for fits using a triangular boundary and network properties. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions- Args:
- outPrefix (str)
The output prefix used for reading/writing
- apply_threshold(X, threshold)[source]¶
Applies a boundary threshold, given by user. Does not run optimisation.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- threshold (float)
The value along the x-axis (core distance) at which to draw the assignment boundary
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
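In effect this draws a vertical line on the core-distance axis. A plain numpy illustration of that idea (the labels used here are an assumption, not necessarily the values RefineFit returns):

    import numpy as np

    def threshold_assign(X, threshold):
        # X[:, 0] holds core distances; points left of the boundary are
        # treated as within-strain (assumed label 0), the rest between-strain (1)
        return np.where(X[:, 0] < threshold, 0, 1)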
- assign(X, slope=None)[source]¶
Assign the clustering of new samples
- Args:
- X (numpy.array)
Core and accessory distances
- slope (int)
Override self.slope: set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope (default = use self.slope)
- Returns:
- y (numpy.array)
Cluster assignments by samples
- fit(X, sample_names, model, max_move, min_move, startFile=None, indiv_refine=False, unconstrained=False, multi_boundary=0, score_idx=0, no_local=False, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Extends fit()
Fits the distances by optimising network score, by calling refineFit2D(). Fitted parameters are stored in the object.
- Args:
- X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
- sample_names (list)
Sample names in X (accessed by iterDistRows())
- model (ClusterFit)
The model fit to refine
- max_move (float)
Maximum distance to move away from start point
- min_move (float)
Minimum distance to move away from start point
- startFile (str)
A file defining an initial fit, rather than one from --fit-model. See documentation for format. (default = None)
- indiv_refine (str)
Run refinement for core or accessory distances separately (default = None).
- multi_boundary (int)
Produce cluster output at multiple boundary positions downward from the optimum. (default = 0).
- unconstrained (bool)
If True, search in 2D and change the slope of the boundary
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- y (numpy.array)
Cluster assignments of samples in X
- load(fit_npz, fit_obj)[source]¶
Load the model from disk. Called from
loadClusterFit()
- Args:
- fit_npz (dict)
Fit npz opened with
numpy.load()
- fit_obj (None)
The saved fit object (not used)
- plot(X, y=None)[source]¶
Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_refined_results()
- Args:
- X (numpy.array)
Core and accessory distances
- y (numpy.array)
Assignments (unused)
- PopPUNK.models.assign_samples(chunk, X, y, model, scale, chunk_size, values=False)[source]¶
Runs a model's assignment on a chunk of input
- Args:
- chunk (int)
Index of chunk to process
- X (NumpyShared)
n x 2 array of core and accessory distances for n samples
- y (NumpyShared)
An n-vector to store results, with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.
- weights (numpy.array)
Component weights from
BGMMFit
- means (numpy.array)
Component means from
BGMMFit
- covars (numpy.array)
Component covariances from
BGMMFit
- scale (numpy.array)
Scaling of core and accessory distances from
BGMMFit
- chunk_size (int)
Size of each chunk in X
- values (bool)
Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation).
Default is False
- PopPUNK.models.loadClusterFit(pkl_file, npz_file, outPrefix='', max_samples=100000, use_gpu=False)[source]¶
Call this to load a fitted model
- Args:
- pkl_file (str)
Location of saved .pkl file on disk
- npz_file (str)
Location of saved .npz file on disk
- outPrefix (str)
Output prefix for model to save to (e.g. plots)
- max_samples (int)
Maximum samples if subsampling X [default = 100000]
- use_gpu (bool)
Whether to load npz file with GPU libraries for lineage models
- Returns:
- load_obj (model)
Loaded model
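A hypothetical loading sketch based on the signature above (paths are placeholders; the .pkl/.npz pair is written when a model is fitted):

    from PopPUNK.models import loadClusterFit

    model = loadClusterFit("db/db_fit.pkl", "db/db_fit.npz", outPrefix="db")
    # the returned object is one of the fit classes above, so the usual
    # methods apply, e.g. model.assign(distMat) for a BGMM or DBSCAN fit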
network.py¶
Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.
Network functions
- PopPUNK.network.addQueryToNetwork(dbFuncs, rList, qList, G, assignments, model, queryDB, kmers=None, distance_type='euclidean', queryQuery=False, strand_preserved=False, weights=None, threads=1, use_gpu=False)[source]¶
Finds edges between queries and items in the reference database, and modifies the network to include them.
- Args:
- dbFuncs (list)
List of backend functions from
setupDBFuncs()
- rList (list)
List of reference names
- qList (list)
List of query names
- G (graph)
Network to add to (mutated)
- assignments (numpy.array)
Cluster assignment of items in qlist
- model (ClusterModel)
Model fitted to reference database
- queryDB (str)
Query database location
- distances (str)
Prefix of distance files for extending network
- kmers (list)
List of k-mer sizes
- distance_type (str)
Distance type to use as weights in network
- queryQuery (bool)
Add in all query-query distances (default = False)
- strand_preserved (bool)
Whether to treat strand as known (i.e. ignore rc k-mers) when adding random distances. Only used if queryQuery = True [default = False]
- weights (numpy.array)
If passed, the core,accessory distances for each assignment, which will be annotated as an edge attribute
- threads (int)
Number of threads to use if new db created (default = 1)
- use_gpu (bool)
Whether to use cugraph for analysis
- Returns:
- distMat (numpy.array)
Query-query distances
- PopPUNK.network.checkNetworkVertexCount(seq_list, G, use_gpu)[source]¶
Checks the number of network vertices matches the number of sequence names.
- Args:
- seq_list (list)
The list of sequence names
- G (graph)
The network of sequences
- use_gpu (bool)
Whether to use cugraph for graph analyses
- PopPUNK.network.cliquePrune(component, graph, reference_indices, components_list)[source]¶
Wrapper function around
getCliqueRefs()
so it can be called by a multiprocessing pool
- PopPUNK.network.construct_dense_weighted_network(rlist, distMat, weights_type=None, use_gpu=False)[source]¶
Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Type of weight to use for network
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_assignments(rlist, qlist, assignments, within_label=1, int_offset=0, weights=None, distMat=None, weights_type=None, previous_network=None, old_ids=None, adding_qq_dists=False, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using sequence lists, assignments of pairwise distances to clusters, and the identifier of the cluster assigned to within-strain distances. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- assignments (numpy.array or int)
Labels of most likely cluster assignment
- within_label (int)
The label for the cluster representing within-strain distances
- int_offset (int)
Constant integer to add to each node index
- weights (list)
List of weights for each edge in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance
- previous_network (str)
Name of file containing a previous network to be integrated into this new network
- old_ids (list)
Ordered list of vertex names in previous network
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_df(rlist, qlist, G_df, weights=False, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a data frame of edges. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- G_df (cudf or pandas data frame)
Data frame in which the first two columns are the nodes linked by edges
- weights (bool)
Whether weights in the G_df data frame should be included in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.construct_network_from_edge_list(rlist, qlist, edge_list, weights=None, distMat=None, previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a list of edges as tuples. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- edge_list (list of tuples)
List of tuples describing the edges of the graph
- weights (list)
List of edge weights
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or the already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
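A simplified sketch of the graph-tool core of such a construction (CPU path, unweighted; the real function also handles previous networks, weights and summaries):

    import graph_tool.all as gt

    vertex_labels = ["s1", "s2", "s3", "s4"]     # rlist + qlist
    edge_list = [(0, 1), (1, 2)]                 # within-strain pairs as vertex indices

    G = gt.Graph(directed=False)
    G.add_vertex(len(vertex_labels))             # add all nodes, including singletons
    G.add_edge_list(edge_list)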
- PopPUNK.network.construct_network_from_sparse_matrix(rlist, qlist, sparse_input, weights=None, previous_network=None, previous_pkl=None, betweenness_sample=100, summarise=True, sample_size=None, use_gpu=False)[source]¶
Construct an undirected network using a sparse matrix. Nodes are samples, and edges join samples within the same cluster
Will print summary statistics about the network to
STDERR
- Args:
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- sparse_input (numpy.array)
Sparse distance matrix from lineage fit
- weights (list)
List of weights for each edge in the network
- distMat (2 column ndarray)
Numpy array of pairwise distances
- previous_network (str)
Name of file containing a previous network to be integrated into this new network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- summarise (bool)
Whether to calculate and print network summaries with networkSummary() (default = True)
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- PopPUNK.network.cugraph_to_graph_tool(G, rlist)[source]¶
Convert a cugraph network to a graph-tool network
- Args:
- G (cugraph network)
Cugraph network
- rlist (list)
List of sequence names
- Returns:
- G (graph-tool network)
Graph tool network
- PopPUNK.network.extractReferences(G, dbOrder, outPrefix, outSuffix='', type_isolate=None, existingRefs=None, threads=1, use_gpu=False)[source]¶
Extract references for each cluster based on cliques
Writes chosen references to file by calling
writeReferences()
- Args:
- G (graph)
A network used to define clusters
- dbOrder (list)
The order of files in the sketches, so returned references are in the same order
- outPrefix (str)
Prefix for output file
- outSuffix (str)
Suffix for output file (.refs will be appended)
- type_isolate (str)
Isolate to be included in set of references
- existingRefs (list)
References that should be used for each clique
- use_gpu (bool)
Use cugraph for graph analysis (default = False)
- Returns:
- refFileName (str)
The name of the file references were written to
- references (list)
An updated list of the reference names
- PopPUNK.network.fetchNetwork(network_dir, model, refList, ref_graph=False, core_only=False, accessory_only=False, use_gpu=False)[source]¶
Load the network based on input options
Returns the network as a graph-tool format graph, and sets the slope parameter of the passed model object.
- Args:
- network_dir (str)
A network used to define clusters
- model (ClusterFit)
A fitted model object
- refList (list)
Names of references that should be in the network
- ref_graph (bool)
Use ref only graph, if available [default = False]
- core_only (bool)
Return the network created using only core distances [default = False]
- accessory_only (bool)
Return the network created using only accessory distances [default = False]
- use_gpu (bool)
Use cugraph library to load graph
- Returns:
- genomeNetwork (graph)
The loaded network
- cluster_file (str)
The CSV of cluster assignments corresponding to this network
- PopPUNK.network.generate_cugraph(G_df, max_index, weights=False, renumber=True)[source]¶
Builds cugraph graph to ensure all nodes are included in the graph, even if singletons.
- Args:
- G_df (cudf)
cudf data frame containing edge list
- max_index (int)
The 0-indexed maximum of the node indices
- renumber (bool)
Whether to renumber the vertices when added to the graph
- Returns:
- G_new (graph)
The resulting cugraph network, including any singleton nodes
- PopPUNK.network.generate_minimum_spanning_tree(G, from_cugraph=False)[source]¶
Generate a minimum spanning tree from a network
- Args:
- G (network)
Graph tool network
- from_cugraph (bool)
If a pre-calculated MST from cugraph [default = False]
- Returns:
- mst_network (str)
Minimum spanning tree (as graph-tool graph)
- PopPUNK.network.getCliqueRefs(G, reference_indices={})[source]¶
Recursively prune a network of its cliques. Returns one vertex from a clique at each stage
- Args:
- G (graph)
The graph to get clique representatives from
- reference_indices (set)
The unique list of vertices being kept, to add to
- PopPUNK.network.get_vertex_list(G, use_gpu=False)[source]¶
Generate a list of node indices
- Args:
- G (network)
Graph tool network
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- Returns:
- vlist (list)
List of integers corresponding to nodes
- PopPUNK.network.load_network_file(fn, use_gpu=False)[source]¶
Load the network based on input options
Returns the network as a graph-tool format graph, or a cugraph graph if use_gpu is set.
- Args:
- fn (str)
Network file name
- use_gpu (bool)
Use cugraph library to load graph
- Returns:
- genomeNetwork (graph)
The loaded network
- PopPUNK.network.networkSummary(G, calc_betweenness=True, betweenness_sample=100, subsample=None, use_gpu=False)[source]¶
Provides summary values about the network
- Args:
- G (graph)
The network of strains
- calc_betweenness (bool)
Whether to calculate betweenness stats
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- subsample (int)
Number of vertices to randomly subsample from graph
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- metrics (list)
List with # components, density, transitivity, mean betweenness and weighted mean betweenness
- scores (list)
List of scores
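A sketch of the kind of graph-tool calls behind these summaries (assumed, CPU-only, and without the betweenness subsampling logic):

    import graph_tool.all as gt

    def summarise(G):
        components, _hist = gt.label_components(G)
        n_components = len(set(components.a))
        n = G.num_vertices()
        density = 2 * G.num_edges() / (n * (n - 1)) if n > 1 else 0.0
        transitivity = gt.global_clustering(G)[0]
        return n_components, density, transitivity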
- PopPUNK.network.network_to_edges(prev_G_fn, rlist, adding_qq_dists=False, old_ids=None, previous_pkl=None, weights=False, use_gpu=False)[source]¶
Load previous network, extract the edges to match the vertex order specified in rlist, and also return weights if specified.
- Args:
- prev_G_fn (str or graph object)
Path of file containing existing network, or already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- rlist (list)
List of reference sequence labels in new network
- old_ids (list)
List of IDs of vertices in existing network
- previous_pkl (str)
Path of pkl file containing names of sequences in previous network
- weights (bool)
Whether to return edge weights (default = False)
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- source_ids (list)
Source nodes for each edge
- target_ids (list)
Target nodes for each edge
- edge_weights (list)
Weights for each new edge
- PopPUNK.network.printClusters(G, rlist, outPrefix=None, oldClusterFile=None, externalClusterCSV=None, printRef=True, printCSV=True, clustering_type='combined', write_unwords=True, use_gpu=False)[source]¶
Get cluster assignments
Also writes assignments to a CSV file
- Args:
- G (graph)
Network used to define clusters
- rlist (list)
Names of samples
- outPrefix (str)
Prefix for output CSV (default = None)
- oldClusterFile (str)
CSV with previous cluster assignments. Pass to ensure consistency in cluster assignment names. (default = None)
- externalClusterCSV (str)
CSV with cluster assignments from any source. Will print a file relating these to new cluster assignments. (default = None)
- printRef (bool)
If false, print only query sequences in the output (default = True)
- printCSV (bool)
Print results to file (default = True)
- clustering_type (str)
Type of clustering network, used for comparison with old clusters (default = ‘combined’)
- write_unwords (bool)
Write clusters with a pronounceable name rather than a numerical index (default = True)
- use_gpu (bool)
Whether to use cugraph for network analysis
- Returns:
- clustering (dict)
Dictionary of cluster assignments (keys are sequence names)
- PopPUNK.network.printExternalClusters(newClusters, extClusterFile, outPrefix, oldNames, printRef=True)[source]¶
Prints cluster assignments with respect to previously defined clusters or labels.
- Args:
- newClusters (set iterable)
The components from the graph G, defining the PopPUNK clusters
- extClusterFile (str)
A CSV file containing definitions of the external clusters for each sample (does not need to contain all samples)
- outPrefix (str)
Prefix for output CSV (_external_clusters.csv)
- oldNames (list)
A list of the reference sequences
- printRef (bool)
If false, print only query sequences in the output (default = True)
- PopPUNK.network.print_network_summary(G, sample_size=None, betweenness_sample=100, use_gpu=False)[source]¶
Wrapper function for printing network information
- Args:
- G (graph)
The network to summarise
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- use_gpu (bool)
Whether to use GPUs for network construction
- PopPUNK.network.process_previous_network(previous_network=None, adding_qq_dists=False, old_ids=None, previous_pkl=None, vertex_labels=None, weights=False, use_gpu=False)[source]¶
Extract edge types from an existing network
- Args:
- previous_network (str or graph object)
Name of file containing a previous network to be integrated into this new network, or already-loaded graph object
- adding_qq_dists (bool)
Boolean specifying whether query-query edges are being added to an existing network, such that not all the sequence IDs will be found in the old IDs, which should already be correctly ordered
- old_ids (list)
Ordered list of vertex names in previous network
- previous_pkl (str)
Name of file containing the names of the sequences in the previous_network ordered based on the original network construction
- vertex_labels (list)
Ordered list of sequence labels
- weights (bool)
Whether weights should be extracted from the previous network
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- extra_sources (list)
List of source node identifiers
- extra_targets (list)
List of destination node identifiers
- extra_weights (list or None)
List of edge weights
- PopPUNK.network.process_weights(distMat, weights_type)[source]¶
Calculate edge weights from the distance matrix
- Args:
- distMat (2 column ndarray)
Numpy array of pairwise distances
- weights_type (str)
Measure to calculate from the distMat to use as edge weights in network - options are core, accessory or euclidean distance
- Returns:
- processed_weights (list)
Edge weights
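The three weighting options can be illustrated directly from a two-column distance matrix with numpy (a sketch, not the exact implementation):

    import numpy as np

    def weights_from_distmat(distMat, weights_type):
        if weights_type == "core":
            return distMat[:, 0].tolist()
        elif weights_type == "accessory":
            return distMat[:, 1].tolist()
        else:                                   # "euclidean"
            return np.linalg.norm(distMat, axis=1).tolist()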
- PopPUNK.network.prune_graph(prefix, reflist, samples_to_keep, output_db_name, threads, use_gpu)[source]¶
Keep only the specified sequences in a graph
- Args:
- prefix (str)
Name of directory containing network
- reflist (list)
Ordered list of sequences of database
- samples_to_keep (list)
The names of samples to be retained in the graph
- output_db_name (str)
Name of output directory
- threads (int)
Number of CPU threads to use when recalculating random match chances [default = 1].
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- PopPUNK.network.remove_nodes_from_graph(G, reflist, samples_to_keep, use_gpu)[source]¶
Return a modified graph containing only the requested nodes
- Args:
- reflist (list)
Ordered list of sequences of database
- samples_to_keep (list)
The names of samples to be retained in the graph
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- Returns:
- G_new (graph)
Pruned graph
- PopPUNK.network.remove_non_query_components(G, rlist, qlist, use_gpu=False)[source]¶
Removes all components that do not contain a query sequence.
- Args:
- G (graph)
Network of queries linked to reference sequences
- rlist (list)
List of reference sequence labels
- qlist (list)
List of query sequence labels
- use_gpu (bool)
Whether to use GPUs for network construction
- Returns:
- G (graph)
The resulting network
- pruned_names (list)
The labels of the sequences in the pruned network
- PopPUNK.network.save_network(G, prefix=None, suffix=None, use_graphml=False, use_gpu=False)[source]¶
Save a network to disk
- Args:
- G (network)
Graph tool network
- prefix (str)
Prefix for output file
- use_graphml (bool)
Whether to output a graph-tool file in graphml format
- use_gpu (bool)
Whether graph is a cugraph or not [default = False]
- PopPUNK.network.sparse_mat_to_network(sparse_mat, rlist, use_gpu=False)[source]¶
Generate a network from a lineage rank fit
- Args:
- sparse_mat (scipy or cupyx sparse matrix)
Sparse matrix of kNN from lineage fit
- rlist (list)
List of sequence names
- use_gpu (bool)
Whether GPU libraries should be used
- Returns:
- G (network)
Graph tool or cugraph network
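A sketch of the CPU path, turning a scipy sparse kNN matrix into a graph-tool network (an assumption about the general shape of the conversion only):

    import graph_tool.all as gt
    from scipy.sparse import coo_matrix

    def sparse_to_network(sparse_mat, rlist):
        sparse_mat = coo_matrix(sparse_mat)      # ensure COO format for row/col access
        G = gt.Graph(directed=False)
        G.add_vertex(len(rlist))                 # one node per sequence name
        G.add_edge_list(list(zip(sparse_mat.row, sparse_mat.col)))
        return G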
- PopPUNK.network.translate_network_indices(G_ref_df, reference_indices)[source]¶
Function for ensuring an updated reference network retains numbering consistent with sample names
- Args:
- G_ref_df (cudf data frame)
List of edges in reference network
- reference_indices (list)
The ordered list of reference indices in the original network
- Returns:
- G_ref (cugraph network)
Network of reference sequences
- PopPUNK.network.vertex_betweenness(graph, norm=True)[source]¶
Returns betweenness for nodes in the graph
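For the graph-tool case this is a thin wrapper around the centrality module; a minimal sketch (the cugraph path is omitted):

    import graph_tool.all as gt

    def betweenness_values(graph, norm=True):
        vertex_bt, _edge_bt = gt.betweenness(graph, norm=norm)
        return list(vertex_bt.a)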
- PopPUNK.network.writeReferences(refList, outPrefix, outSuffix='')[source]¶
Writes chosen references to file
- Args:
- refList (list)
Reference names to write
- outPrefix (str)
Prefix for output file
- outSuffix (str)
Suffix for output file (.refs will be appended)
- Returns:
- refFileName (str)
The name of the file references were written to
refine.py¶
Functions used to refine an existing model. Access using RefineFit.
Refine mixture model using network properties
- PopPUNK.refine.check_search_range(scale, mean0, mean1, lower_s, upper_s)[source]¶
Checks a search range is within a valid range
- Args:
- scale (np.array)
Rescaling factor to [0, 1] for each axis
- mean0 (np.array)
(x, y) of starting point defining line
- mean1 (np.array)
(x, y) of end point defining line
- lower_s (float)
distance along line to start search
- upper_s (float)
distance along line to end search
- Returns:
- min_x, max_x
minimum and maximum x-intercepts of the search range
- min_y, max_y
minimum and maximum y-intercepts of the search range
- PopPUNK.refine.expand_cugraph_network(G, G_extra_df)[source]¶
Reconstruct a cugraph network with additional edges.
- Args:
- G (cugraph network)
Original cugraph network
- extra_edges (cudf dataframe)
Data frame of edges to add
- Returns:
- G (cugraph network)
Expanded cugraph network
- PopPUNK.refine.growNetwork(sample_names, i_vec, j_vec, idx_vec, s_range, score_idx=0, thread_idx=0, betweenness_sample=100, write_clusters=None, sample_size=None, use_gpu=False)[source]¶
Construct a network, then add edges to it iteratively. Input is from pp_sketchlib.iterateBoundary1D or pp_sketchlib.iterateBoundary2D
- Args:
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- i_vec (list)
Ordered ref vertex index to add
- j_vec (list)
Ordered query (==ref) vertex index to add
- idx_vec (list)
For each i, j tuple, the index of the intercept at which these enter the network. These are sorted and increasing
- s_range (list)
Offsets which correspond to idx_vec entries
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- thread_idx (int)
Optional thread idx (if multithreaded) to offset progress bar by
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- write_clusters (str)
Set to a prefix to write the clusters from each position to files [default = None]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- scores (list)
-1 * network score for each of x_range. Where network score is from
networkSummary()
- PopPUNK.refine.likelihoodBoundary(s, model, start, end, within, between)[source]¶
Wrapper function around fit2dMultiGaussian() so that it can go into a root-finding function for probabilities between components
- Args:
- s (float)
Distance along line from mean0
- model (BGMMFit)
Fitted mixture model
- start (numpy.array)
The co-ordinates of the centre of the within-strain distribution
- end (numpy.array)
The co-ordinates of the centre of the between-strain distribution
- within (int)
Label of the within-strain distribution
- between (int)
Label of the between-strain distribution
- Returns:
- responsibility (float)
The difference between responsibilities of assignment to the within component and the between assignment
- PopPUNK.refine.multi_refine(distMat, sample_names, mean0, mean1, scale, s_max, n_boundary_points, output_prefix, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Move the refinement boundary between the optimum and where it meets an axis. In discrete steps, output the clusters at each step
- Args:
- distMat (numpy.array)
n x 2 array of core and accessory distances for n samples
- sample_names (list)
List of query sequence labels
- mean0 (numpy.array)
Start point to define search line
- mean1 (numpy.array)
End point to define search line
- scale (numpy.array)
Scaling factor of distMat
- s_max (float)
The optimal s position from refinement (refineFit())
- n_boundary_points (int)
Number of positions to try drawing the boundary at
- num_processes (int)
Number of threads to use in the global optimisation step. (default = 1)
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- PopPUNK.refine.newNetwork(s, sample_names, distMat, mean0, mean1, gradient, slope=2, score_idx=0, cpus=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Wrapper function for construct_network_from_edge_list() which is called by optimisation functions moving a triangular decision boundary.
Given the boundary parameterisation, constructs the network and returns its score, to be minimised.
- Args:
- s (float)
Distance along line between start_point and mean1 from start_point
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- distMat (numpy.array or NumpyShared)
Core and accessory distances or NumpyShared describing these in sharedmem
- mean0 (numpy.array)
Start point
- mean1 (numpy.array)
End point
- gradient (float)
Gradient of line to move along
- slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope [default = 2]
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- cpus (int)
Number of CPUs to use for calculating assignment
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- score (float)
-1 * network score. Where network score is from
networkSummary()
- PopPUNK.refine.newNetwork2D(y_idx, sample_names, distMat, x_range, y_range, score_idx=0, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Wrapper function for thresholdIterate2D and growNetwork().
For a given y_max, constructs networks across x_range and returns a list of scores
- Args:
- y_idx (float)
Maximum y-intercept of boundary, as index into y_range
- sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
- distMat (numpy.array or NumpyShared)
Core and accessory distances or NumpyShared describing these in sharedmem
- x_range (list)
Sorted list of x-intercepts to search
- y_range (list)
Sorted list of y-intercepts to search
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analysis
- Returns:
- scores (list)
-1 * network score for each of x_range. Where network score is from
networkSummary()
- PopPUNK.refine.readManualStart(startFile)[source]¶
Reads a file to define a manual start point, rather than using --fit-model
Throws and exits if incorrectly formatted.
- Args:
- startFile (str)
Name of file with values to read
- Returns:
- mean0 (numpy.array)
Centre of within-strain distribution
- mean1 (numpy.array)
Centre of between-strain distribution
- scaled (bool)
True if means are scaled between [0,1]
- PopPUNK.refine.refineFit(distMat, sample_names, mean0, mean1, scale, max_move, min_move, slope=2, score_idx=0, unconstrained=False, no_local=False, num_processes=1, betweenness_sample=100, sample_size=None, use_gpu=False)[source]¶
Try to refine a fit by maximising a network score based on transitivity and density.
Iteratively move the decision boundary to do this, using starting point from existing model.
- Args:
- distMat (numpy.array)
n x 2 array of core and accessory distances for n samples
- sample_names (list)
List of query sequence labels
- mean0 (numpy.array)
Start point to define search line
- mean1 (numpy.array)
End point to define search line
- scale (numpy.array)
Scaling factor of distMat
- max_move (float)
Maximum distance to move away from start point
- min_move (float)
Minimum distance to move away from start point
- slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
- score_idx (int)
Index of score from networkSummary() to use [default = 0]
- unconstrained (bool)
If True, search in 2D and change the slope of the boundary
- no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
- num_processes (int)
Number of threads to use in the global optimisation step. (default = 1)
- betweenness_sample (int)
Number of sequences per component used to estimate betweenness using a GPU. Smaller numbers are faster but less precise [default = 100]
- sample_size (int)
Number of nodes to subsample for graph statistic calculation
- use_gpu (bool)
Whether to use cugraph for graph analyses
- Returns:
- optimal_x (float)
x-coordinate of refined fit
- optimal_y (float)
y-coordinate of refined fit
- optimised_s (float)
Position along search range of refined fit
plot.py¶
Plots of GMM results, k-mer fits, and microreact output
- PopPUNK.plot.createMicroreact(prefix, microreact_files, api_key=None)[source]¶
Creates a .microreact file, and an instance via the API
- Args:
- prefix (str)
Prefix for output file
- microreact_files (list)
List of Microreact files [clusters, dot, tree, mst_tree]
- api_key (str)
API key for your account
- PopPUNK.plot.distHistogram(dists, rank, outPrefix)[source]¶
Plot a histogram of distances (1D)
- Args:
- dists (np.array)
Distance vector
- rank (int)
Rank (used for name and title)
- outPrefix (str)
Full path prefix for plot file
- PopPUNK.plot.drawMST(mst, outPrefix, isolate_clustering, clustering_name, overwrite)[source]¶
Plot a layout of the minimum spanning tree
- Args:
- mst (graph_tool.Graph)
A minimum spanning tree
- outPrefix (str)
Output prefix for save files
- isolate_clustering (dict)
Dictionary of ID: cluster, used for colouring vertices
- clustering_name (str)
Name of clustering scheme to be used for colouring
- overwrite (bool)
Overwrite existing output files
- PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]¶
Get a square grid of points to evaluate a function across
Used for plot_scatter() and plot_contours()
- Args:
- minimum (float)
Minimum value for grid
- maximum (float)
Maximum value for grid
- resolution (int)
Number of points along each axis
- Returns:
- xx (numpy.array)
x values across n x n grid
- yy (numpy.array)
y values across n x n grid
- xy (numpy.array)
n x 2 pairs of x, y values grid is over
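An equivalent numpy construction of such a grid (a straightforward sketch):

    import numpy as np

    def square_grid(minimum, maximum, resolution):
        xx, yy = np.meshgrid(
            np.linspace(minimum, maximum, resolution),
            np.linspace(minimum, maximum, resolution),
        )
        xy = np.column_stack([xx.ravel(), yy.ravel()])   # n x 2 evaluation points
        return xx, yy, xy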
- PopPUNK.plot.outputsForCytoscape(G, G_mst, isolate_names, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True, use_partial_query_graph=None)[source]¶
Write outputs for cytoscape. A graphml of the network, and CSV with metadata
- Args:
- G (graph)
The network to write
- G_mst (graph)
The minimum spanning tree of G
- isolate_names (list)
Ordered list of sequence names
- clustering (dict)
Dictionary of cluster assignments (keys are nodeNames).
- outPrefix (str)
Prefix for files to be written
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters.
- queryList (list)
Optional list of isolates that have been added as a query. (default = None)
- suffix (string)
String to append to network file name. (default = None)
- writeCsv (bool)
Whether to print CSV file to accompany network
- use_partial_query_graph (str)
File listing sequences to be included in output graph
- PopPUNK.plot.outputsForGrapetree(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Grapetree
Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory.
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False).
- PopPUNK.plot.outputsForMicroreact(combined_list, clustering, nj_tree, mst_tree, accMat, perplexity, maxIter, outPrefix, epiCsv, queryList=None, overwrite=False, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate files for microreact
Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to mandrake
- maxIter (int)
Maximum iterations for mandrake
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use a GPU for t-SNE generation
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- outfiles (list)
List of output files created
- PopPUNK.plot.outputsForPhandango(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Phandango
Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- threads (int)
Number of threads to use with rapidnj
- PopPUNK.plot.plot_contours(model, assignments, title, out_prefix)[source]¶
Draw contours of mixture model assignments
Will draw the decision boundary for between/within in red
- Args:
- model (BGMMFit)
Model we are plotting from
- assignments (numpy.array)
n-vector of cluster assignments for the model
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- PopPUNK.plot.plot_database_evaluations(prefix, genome_lengths, ambiguous_bases)[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- prefix (str)
Prefix for output files
- genome_lengths (list)
Lengths of genomes in database
- ambiguous_bases (list)
Counts of ambiguous bases in genomes in database
- PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix, use_gpu)[source]¶
Draw a scatter plot (png) to show the DBSCAN model fit
A scatter plot of core and accessory distances, coloured by component membership. Black is noise
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- n_clusters (int)
Number of clusters used (excluding noise)
- out_prefix (str)
Prefix for output file (.png will be appended)
- use_gpu (bool)
Whether model was fitted with GPU-enabled code
- PopPUNK.plot.plot_evaluation_histogram(input_data, n_bins=100, prefix='hist', suffix='', plt_title='histogram', xlab='x')[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- input_data (list)
Input data (list of numbers)
- n_bins (int)
Number of bins to use for the histogram
- prefix (str)
Prefix of database
- suffix (str)
Suffix specifying plot type
- plt_title (str)
Title for plot
- xlab (str)
Title for the horizontal axis
- PopPUNK.plot.plot_fit(klist, raw_matching, raw_fit, corrected_matching, corrected_fit, out_prefix, title)[source]¶
Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance
K-mer sizes on the x-axis, log(pr(match)) on the y-axis: expect a straight-line fit, with the intercept representing the accessory distance and the slope the core distance
- Args:
- klist (list)
List of k-mer sizes
- raw_matching (list)
Proportion of matching k-mers at each klist value
- raw_fit (numpy.array)
Fit to klist and raw_matching from
fitKmerCurve()
- corrected_matching (list)
Corrected proportion of matching k-mers at each klist value
- corrected_fit (numpy.array)
Fit to klist and corrected_matching from
fitKmerCurve()
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, min_move, max_move, scale, threshold, indiv_boundaries, unconstrained, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the refined model fit
A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- x_boundary (float)
Intercept of boundary with x-axis, from
RefineFit
- y_boundary (float)
Intercept of boundary with y-axis, from
RefineFit
- core_boundary (float)
Intercept of 1D (core) boundary with x-axis, from
RefineFit
- accessory_boundary (float)
Intercept of 1D (accessory) boundary with y-axis, from
RefineFit
- mean0 (numpy.array)
Centre of within-strain distribution
- mean1 (numpy.array)
Centre of between-strain distribution
- min_move (float)
Minimum s range
- max_move (float)
Maximum s range
- scale (numpy.array)
Scaling factor from
RefineFit
- threshold (bool)
If fit was just from a simple thresholding
- indiv_boundaries (bool)
Whether to draw lines for core and accessory refinement
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the BGMM model fit
A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centres: means; axes: covariances).
This is based on the example in the sklearn documentation.
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- means (numpy.array)
Component means from
BGMMFit
- covars (numpy.array)
Component covariances from
BGMMFit
- scale (numpy.array)
Scaling factor from
BGMMFit
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_scatter(X, out_prefix, title, kde=True)[source]¶
Draws a 2D scatter plot (png) of the core and accessory distances
Also draws contours of the kernel density estimate
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- kde (bool)
Whether to draw kernel density estimate contours
(default = True)
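For example, a minimal usage sketch of plot_scatter (the distances are randomly generated and the output prefix and title are hypothetical):

import numpy as np
from PopPUNK.plot import plot_scatter

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 0.1, size=(500, 2))  # n x 2 array of core and accessory distances
plot_scatter(X, "example_distances", "Example distance distribution", kde=True)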
- PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None, suffix='_Cluster')[source]¶
Print CSV file of clustering and optionally epi data
Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.
The epiCsv, if provided, should have the node labels in the first column.
- Args:
- outfile (str)
File to write the CSV to.
- nodeNames (list)
Names of sequences in clustering (includes path).
- nodeLabels (list)
Names of sequences to write in CSV (usually has path removed).
- clustering (dict or dict of dicts)
Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
- output_format (str)
Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters (default = None).
- queryNames (list)
Optional list of isolates that have been added as a query.
(default = None)
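For example, a hedged sketch of writing a microreact-formatted CSV for three isolates (file names and cluster assignments below are made up for illustration):

from PopPUNK.plot import writeClusterCsv

nodeNames = ["data/sample1.fa", "data/sample2.fa", "data/sample3.fa"]
nodeLabels = ["sample1", "sample2", "sample3"]
# Keys of the clustering dict are the nodeNames, as described above
clustering = {"data/sample1.fa": 1, "data/sample2.fa": 1, "data/sample3.fa": 2}

writeClusterCsv("example_microreact.csv", nodeNames, nodeLabels, clustering,
                output_format="microreact")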
sparse_mst.py¶
sketchlib.py¶
Sketchlib functions for database construction
- PopPUNK.sketchlib.addRandom(oPrefix, sequence_names, klist, strand_preserved=False, overwrite=False, threads=1)[source]¶
Add chance of random match to a HDF5 sketch DB
- Args:
- oPrefix (str)
Sketch database prefix
- sequence_names (list)
Names of sequences to include in calculation
- klist (list)
List of k-mer sizes to sketch
- strand_preserved (bool)
Set true to ignore rc k-mers
- overwrite (bool)
Set true to overwrite existing random match chances
- threads (int)
Number of threads to use (default = 1)
- PopPUNK.sketchlib.checkSketchlibLibrary()[source]¶
Gets the location of the sketchlib library
- Returns:
- lib (str)
Location of sketchlib .so/.dyld
- PopPUNK.sketchlib.checkSketchlibVersion()[source]¶
Checks that sketchlib can be run, and returns version
- Returns:
- version (str)
Version string
- PopPUNK.sketchlib.constructDatabase(assemblyList, klist, sketch_size, oPrefix, threads, overwrite, strand_preserved, min_count, use_exact, calc_random=True, codon_phased=False, use_gpu=False, deviceid=0)[source]¶
Sketch the input assemblies at the requested k-mer lengths
A multithread wrapper around
runSketch()
. Threads are used to run multiple sketch processes, one for each klist value. Also calculates the random match probability based on the length of the first genome in assemblyList.
- Args:
- assemblyList (str)
File with locations of assembly files to be sketched
- klist (list)
List of k-mer sizes to sketch
- sketch_size (int)
Size of sketch (
-s
option)
- oPrefix (str)
Output prefix for resulting sketch files
- threads (int)
Number of threads to use (default = 1)
- overwrite (bool)
Whether to overwrite sketch DBs, if they already exist. (default = False)
- strand_preserved (bool)
Ignore reverse complement k-mers (default = False)
- min_count (int)
Minimum count of k-mer in reads to include (default = 0)
- use_exact (bool)
Use exact count of k-mer appearance in reads (default = False)
- calc_random (bool)
Add random match chances to DB (turn off for queries)
- codon_phased (bool)
Use codon phased seeds (default = False)
- use_gpu (bool)
Use GPU for read sketching (default = False)
- deviceid (int)
GPU device id (default = 0)
- Returns:
- names (list)
List of names included in the database (from rfile)
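For example, a sketch of building a sketch database from a list of assemblies (the r-file name, k-mer range and sketch size are illustrative, not recommended defaults):

from PopPUNK.sketchlib import constructDatabase

klist = list(range(15, 31, 2))  # k-mer sizes to sketch
names = constructDatabase("rfile.txt", klist, 10000, "example_db",
                          threads=4, overwrite=False, strand_preserved=False,
                          min_count=0, use_exact=False)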
- PopPUNK.sketchlib.createDatabaseDir(outPrefix, kmers)[source]¶
Creates the directory to write sketches to, removing old files if they are unnecessary
- Args:
- outPrefix (str)
output db prefix
- kmers (list)
k-mer sizes in db
- PopPUNK.sketchlib.fitKmerCurve(pairwise, klist, jacobian)[source]¶
Fit the function \(pr = (1-a)(1-c)^k\)
Supply
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
- Args:
- pairwise (numpy.array)
Proportion of shared k-mers at k-mer values in klist
- klist (list)
k-mer sizes used
- jacobian (numpy.array)
Should be set as above (set once to try and save memory)
- Returns:
- transformed_params (numpy.array)
Column with core and accessory distance
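For example, a sketch of fitting the curve to the shared k-mer proportions for one pair (the klist and pairwise values are made up for illustration):

import numpy as np
from PopPUNK.sketchlib import fitKmerCurve

klist = np.array([15, 17, 19, 21, 23, 25, 27, 29])
pairwise = np.array([0.80, 0.72, 0.65, 0.58, 0.52, 0.47, 0.42, 0.38])

# Jacobian constructed once, exactly as specified above
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
core_acc = fitKmerCurve(pairwise, klist, jacobian)  # column with core and accessory distance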
- PopPUNK.sketchlib.getKmersFromReferenceDatabase(dbPrefix)[source]¶
Get kmers lengths from existing database
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- PopPUNK.sketchlib.getSeqsInDb(dbname)[source]¶
Return an array with the sequences in the passed database
- Args:
- dbname (str)
Sketches database filename
- Returns:
- seqs (list)
List of sequence names in sketch DB
- PopPUNK.sketchlib.getSketchSize(dbPrefix)[source]¶
Determines the sketch size, and ensures it is consistent across the whole database
sys.exit(1)
is called if DBs have different sketch sizes
- Args:
- dbprefix (str)
Prefix for databases
- Returns:
- sketchSize (int)
sketch size (64x C++ definition)
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.get_database_statistics(prefix)[source]¶
Extract statistics for evaluating databases.
- Args:
- prefix (str)
Prefix of database
- PopPUNK.sketchlib.joinDBs(db1, db2, output, update_random=None, full_names=False)[source]¶
Join two sketch databases with the low-level HDF5 copy interface
- Args:
- db1 (str)
Prefix for db1
- db2 (str)
Prefix for db2
- output (str)
Prefix for joined output
- update_random (dict)
Whether to re-calculate the random object. May contain control arguments strand_preserved and threads (see addRandom())
- full_names (bool)
If True, db1, db2 and output are the full paths to h5 files
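For example, a sketch of merging two databases (the prefixes are hypothetical; the update_random keys follow the description above):

from PopPUNK.sketchlib import joinDBs

joinDBs("batch1_db", "batch2_db", "combined_db",
        update_random={"strand_preserved": False, "threads": 4})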
- PopPUNK.sketchlib.queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self=True, number_plot_fits=0, threads=1, use_gpu=False, deviceid=0)[source]¶
Calculate core and accessory distances between query sequences and a sketched database
For a reference database, runs the query against itself to find all pairwise core and accessory distances.
Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)
To get the ref and query name for each row of the returned distances, call the iterator
iterDistRows()
with the returned refList and queryList
- Args:
- rNames (list)
Names of references to query
- qNames (list)
Names of queries
- dbPrefix (str)
Prefix for reference sketch database created by
constructDatabase()
- queryPrefix (str)
Prefix for query sketch database created by
constructDatabase()
- klist (list)
K-mer sizes to use in the calculation
- self (bool)
Set true if query = ref (default = True)
- number_plot_fits (int)
If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls
plot_fit()
(default = 0)
- threads (int)
Number of threads to use in the process (default = 1)
- use_gpu (bool)
Use a GPU for querying (default = False)
- deviceid (int)
Index of the CUDA GPU device to use (default = 0)
- Returns:
- distMat (numpy.array)
Core distances (column 0) and accessory distances (column 1) between refList and queryList
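For example, a hedged sketch of an all-vs-all self query against an existing database (the prefix and the sketch file location are assumptions about the on-disk layout):

from PopPUNK.sketchlib import getKmersFromReferenceDatabase, getSeqsInDb, queryDatabase

db_prefix = "example_db"
klist = getKmersFromReferenceDatabase(db_prefix)
rlist = getSeqsInDb(db_prefix + "/" + db_prefix + ".h5")  # assumed location of the sketch h5 file

distMat = queryDatabase(rlist, rlist, db_prefix, db_prefix, klist,
                        self=True, threads=4)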
- PopPUNK.sketchlib.readDBParams(dbPrefix)[source]¶
Get kmers lengths and sketch sizes from existing database
Calls
getKmersFromReferenceDatabase()
and getSketchSize().
Uses passed values if db missing
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- sketch_sizes (list)
List of sketch sizes used in database
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.removeFromDB(db_name, out_name, removeSeqs, full_names=False)[source]¶
Remove sketches from the DB using the low-level HDF5 copy interface
- Args:
- db_name (str)
Prefix for hdf database
- out_name (str)
Prefix for output (pruned) database
- removeSeqs (list)
Names of sequences to remove from database
- full_names (bool)
If True, db_name and out_name are the full paths to h5 files
utils.py¶
General utility functions for data read/writing/manipulation in PopPUNK
- PopPUNK.utils.check_and_set_gpu(use_gpu, gpu_lib, quit_on_fail=False)[source]¶
Check GPU libraries can be loaded and set managed memory.
- Args:
- use_gpu (bool)
Whether GPU packages have been requested
- gpu_lib (bool)
Whether GPU packages are available
- Returns:
- use_gpu (bool)
Whether GPU packages can be used
- PopPUNK.utils.decisionBoundary(intercept, gradient, adj=0.0)[source]¶
Returns the co-ordinates where the triangle the decision boundary forms meets the x- and y-axes.
- Args:
- intercept (numpy.array)
Cartesian co-ordinates of point along line (
transformLine()
) which intercepts the boundary
- gradient (float)
Gradient of the line
- adj (float)
Fraction by which to shift the intercept up the y axis
- Returns:
- x (float)
The x-axis intercept
- y (float)
The y-axis intercept
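For example, a sketch combining transformLine() and decisionBoundary(), with made-up within- and between-strain centres and gradient:

import numpy as np
from PopPUNK.utils import transformLine, decisionBoundary

mean0 = np.array([0.01, 0.05])   # hypothetical within-strain centre
mean1 = np.array([0.10, 0.50])   # hypothetical between-strain centre

x_mid, y_mid = transformLine(0.5, mean0, mean1)   # point halfway along the line
x_max, y_max = decisionBoundary(np.array([x_mid, y_mid]), gradient=-1.0)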
- PopPUNK.utils.get_match_search_depth(rlist, rank_list)[source]¶
Return a default search depth for lineage model fitting.
- Args:
- rlist (list)
List of sequences in database
- rank_list (list)
List of ranks to be used to fit lineage models
- Returns:
- max_search_depth (int)
Maximum kNN used for lineage model fitting
- PopPUNK.utils.isolateNameToLabel(names)[source]¶
Function to process isolate names to labels appropriate for visualisation.
- Args:
- names (list)
List of isolate names.
- Returns:
- labels (list)
List of isolate labels.
- PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. (default = True)
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
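For example, a minimal sketch pairing rows of a long-form distance matrix with their sample names (names and distances are illustrative):

from PopPUNK.utils import iterDistRows

rlist = ["sample1", "sample2", "sample3"]
# One row per comparison, in the same order as iterDistRows yields (ref, query) pairs
dists = [[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]]

for (ref, query), (core, acc) in zip(iterDistRows(rlist, rlist, self=True), dists):
    print(ref, query, core, acc)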
- PopPUNK.utils.joinClusterDicts(d1, d2)[source]¶
Join two dictionaries returned by
readIsolateTypeFromCsv()
with return_dict = True. Useful for concatenating ref and query assignments
- Args:
- d1 (dict of dicts)
First dictionary to concat
- d2 (dict of dicts)
Second dictionary to concat
- Returns:
- d1 (dict of dicts)
d1 with d2 appended
- PopPUNK.utils.listDistInts(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. (default = True)
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
- PopPUNK.utils.readIsolateTypeFromCsv(clustCSV, mode='clusters', return_dict=False)[source]¶
Read cluster definitions from CSV file.
- Args:
- clustCSV (str)
File name of CSV with isolate assignments
- mode (str)
Type of file to read ‘clusters’, ‘lineages’, or ‘external’
- return_dict (bool)
If True, return a dict with sample->cluster instead of sets [default = False]
- Returns:
- clusters (dict)
Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
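For example (the CSV path is hypothetical):

from PopPUNK.utils import readIsolateTypeFromCsv

# With return_dict=True the result maps sample names to cluster assignments (see Returns above)
clusters = readIsolateTypeFromCsv("example_db/example_db_clusters.csv",
                                  mode="clusters", return_dict=True)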
- PopPUNK.utils.readPickle(pklName, enforce_self=False, distances=True)[source]¶
Loads core and accessory distances saved by
storePickle()
Called during
--fit-model
- Args:
- pklName (str)
Prefix for saved files
- enforce_self (bool)
Error if self == False
[default = False]
- distances (bool)
Read the distance matrix
[default = True]
- Returns:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
- PopPUNK.utils.readRfile(rFile, oneSeq=False)[source]¶
Reads in files for sketching. Names and sequence, tab separated
- Args:
- rFile (str)
File with locations of assembly files to be sketched
- oneSeq (bool)
Return only the first sequence listed, rather than a list (used with mash)
- Returns:
- names (list)
Array of sequence names
- sequences (list of lists)
Array of sequence files
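For example, given a hypothetical tab-separated r-file:

# rfile.txt (tab separated, one name and one sequence file per line):
#   sample1 <TAB> data/sample1.fa
#   sample2 <TAB> data/sample2.fa
from PopPUNK.utils import readRfile

names, sequences = readRfile("rfile.txt")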
- PopPUNK.utils.read_rlist_from_distance_pickle(fn, allow_non_self=True, include_queries=False)[source]¶
Return the list of reference sequences from a distance pickle.
- Args:
- fn (str)
Name of distance pickle
- allow_non_self (bool)
Whether non-self distance datasets are permissible
- include_queries (bool)
Whether queries should be included in the rlist
- Returns:
- rlist (list)
List of reference sequence names
- PopPUNK.utils.set_env(**environ)[source]¶
Temporarily set the process environment variables.
>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...     "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False
- PopPUNK.utils.setupDBFuncs(args)[source]¶
Wraps common database access functions from sketchlib and mash, so that their APIs are more similar
- Args:
- args (argparse.opts)
Parsed command lines options
- qc_dict (dict)
Table of parameters for QC function
- Returns:
- dbFuncs (dict)
Functions with consistent arguments to use as the database API
- PopPUNK.utils.stderr_redirected(to='/dev/null')[source]¶
Redirect stderr within a with block, e.g.:
import os
with stderr_redirected(to=filename):
    print("from Python")
    os.system("echo non-Python applications are also supported")
- PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]¶
Saves core and accessory distances in a .npy file, names in a .pkl
Called during
--create-db
- Args:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
If None, do not save
- pklName (str)
Prefix for output files
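For example, a round-trip sketch with storePickle() and readPickle() (the prefix and distances are made up):

import numpy as np
from PopPUNK.utils import storePickle, readPickle

rlist = ["sample1", "sample2", "sample3"]
X = np.array([[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]])

storePickle(rlist, rlist, True, X, "example_dists")
rlist2, qlist2, self_flag, X2 = readPickle("example_dists", enforce_self=True)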
- PopPUNK.utils.transformLine(s, mean0, mean1)[source]¶
Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.
- Args:
- s (float)
Distance along line from mean0
- mean0 (numpy.array)
Start position of line (x0, y0)
- mean1 (numpy.array)
End position of line (x1, y1)
- Returns:
- x (float)
The Cartesian x-coordinate
- y (float)
The Cartesian y-coordinate
- PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None, threads=1)[source]¶
Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.
- Args:
- refList (list)
List of references
- distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
- queryList (list)
List of queries
- query_ref_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
- query_query_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between query sequences
- threads (int)
Number of threads to use
- Returns:
- seqLabels (list)
Combined list of reference and query sequences
- coreMat (numpy.array)
NxN array of core distances for N sequences
- accMat (numpy.array)
NxN array of accessory distances for N sequences
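For example, converting a long-form self-comparison into square matrices (distances are illustrative):

import numpy as np
from PopPUNK.utils import update_distance_matrices

refList = ["sample1", "sample2", "sample3"]
# Rows ordered as iterDistRows(refList, refList, self=True)
distMat = np.array([[0.010, 0.10], [0.020, 0.12], [0.015, 0.11]])

seqLabels, coreMat, accMat = update_distance_matrices(refList, distMat, threads=1)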
visualise.py¶
poppunk_visualise
main function
web.py¶
Functions used by the web API to convert a sketch to an h5 database, then generate visualisations and post results to PopPUNK-web.
- PopPUNK.web.calc_prevalence(cluster, cluster_list, num_samples)[source]¶
Cluster prevalences for Plotly.js