API reference¶
This section documents the public API of coresg-graphhdbscan.
Main public API¶
The most important user-facing class is GraphCoreSGHDBSCAN.
Important public methods include:
fit
fit_predict
labels_for
model
plot_condensed_tree
interactive_condensed_tree
Important fitted result attributes include:
models_
condensed_trees_
labels_by_m_
coresg_
dist_matrix_
similarity_graph_
connected_graph_
Main classes¶
GraphCoreSGHDBSCAN | Graph-based CoreSG + HDBSCAN interface. |
CoreSGHDBSCAN | CoreSG-based hierarchical density clustering backend. |
CoreSGModel | Lightweight wrapper that mimics the HDBSCAN attributes used by this package. |
Main utility function¶
plot_condensed_tree_for_m | Plot the condensed tree for a selected fitted min_samples value. |
Module reference¶
Graph module¶
Graph-based wrapper around CoreSG-HDBSCAN.
- class coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)[source]¶
Bases: CoreSGHDBSCAN
Graph-based CoreSG + HDBSCAN interface.
This class constructs a similarity graph from feature data or accepts a precomputed graph representation, transforms that graph into a graph-derived distance representation, and then runs the CoreSG-HDBSCAN clustering pipeline.
- Parameters:
min_samples (int or iterable of int, default=10) – Main clustering hyperparameter. A single integer gives one fitted solution, while an iterable allows fitting multiple values in one run.
sim_graph_method ({"sc_umap", "sc_gauss", "jaccard_phenograph", "precomputed"}, default="sc_umap") – Graph-construction backend.
metric (str, callable, or None, default="euclidean") – Distance metric used during similarity graph construction. Supported string metrics are "cityblock", "cosine", "euclidean", "l1", "l2", "manhattan", "braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule", and the package-specific "hybrid_euclidean_cosine". The metric "kulsinski" is not supported. The combination metric="yule" with sim_graph_method="sc_gauss" is also not supported because it can produce non-finite graph weights.
metric_kwds (dict or None, default=None) – Additional keyword arguments passed to the selected distance metric.
add_neighbor (bool, default=True) – Controls how weighted structural similarity is expanded into graph edges.
no_noise (bool, default=True) – If True, points initially labeled as noise are reassigned after clustering.
n_neighbors (int, default=15) – Number of neighbors used during graph construction.
heuristic_connect (bool, default=False) – If True, increase n_neighbors until the WSS dissimilarity graph becomes connected, except in precomputed mode, where bridge edges are used instead. If False, disconnected components are connected with synthetic bridge edges.
min_cluster_size (int or None, default=None) – Minimum cluster size used in the clustering stage. If None, the package follows the selected min_samples value for each run.
save_models (bool, default=False) – If True, save the full HDBSCAN model for each min_samples value, which adds some memory overhead. If False, save only the labels and condensed tree for each min_samples value.
**kwargs – Additional keyword arguments passed to internal graph-construction helpers.
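The heuristic_connect=True behavior (grow n_neighbors until the graph is connected) can be illustrated with a minimal standard-library sketch. It assumes a plain Euclidean k-nearest-neighbor graph rather than the package's WSS dissimilarity graph, and the helper names knn_edges, is_connected, and grow_until_connected are hypothetical, not part of coresg-graphhdbscan.

```python
import math
from collections import deque


def knn_edges(points, k):
    """Undirected k-nearest-neighbor edges (Euclidean) as a set of (i, j) with i < j."""
    edges = set()
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in order[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def is_connected(n, edges):
    """Breadth-first search from node 0; connected if every node is reached."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, queue = {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


def grow_until_connected(points, n_neighbors):
    """Increase the neighbor count one step at a time until the graph is connected."""
    k = n_neighbors
    while k < len(points) - 1 and not is_connected(len(points), knn_edges(points, k)):
        k += 1
    return k


# Two tight pairs far apart: one neighbor per point leaves two components.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
k_used = grow_until_connected(points, n_neighbors=1)
```

In this toy input, one neighbor per point is not enough to join the two pairs, so the loop stops at the first larger value that does.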
- similarity_graph_¶
Initial similarity graph.
- Type:
- similarity_graph_WSS¶
Weighted structural similarity graph.
- Type:
- dissimilarity_graph_¶
Graph after conversion from similarity to dissimilarity.
- Type:
- connected_graph_¶
Final connected graph used by the clustering stage.
- Type:
- dist_matrix_¶
Dense matrix used by the CoreSG-HDBSCAN pipeline.
- Type:
- coresg_¶
Internal fitted CoreSG-HDBSCAN object.
- Type:
- models_¶
Dictionary of saved per-min_samples models. Populated only when save_models=True.
- Type:
- condensed_trees_¶
Dictionary of condensed tree objects keyed by fitted min_samples value.
- Type:
- compute_similarity_sparse(graph)[source]¶
Fast weighted structural similarity as a sparse matrix.
This is algebraically equivalent to the original compute_similarity implementation, but avoids Python-level all-pairs iteration. The weighted adjacency vector for each node includes an explicit self-loop of weight 1 before cosine normalization.
- Return type:
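As a rough illustration of the description above, the similarity between two nodes can be read as the cosine of their weighted adjacency vectors after a self-loop of weight 1 is added to each. The sketch below is a plain-Python reading of that sentence, not the package's sparse implementation; the function name wss and the dict-of-dicts graph layout are assumptions.

```python
import math


def wss(adj, u, v):
    """Cosine similarity of the self-loop-augmented weighted adjacency vectors of u and v."""
    a = dict(adj[u])
    a[u] = 1.0  # explicit self-loop of weight 1
    b = dict(adj[v])
    b[v] = 1.0
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


# Nodes 0, 1, 2 form a triangle; node 3 hangs off node 2.
adj = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
sim_inside_triangle = wss(adj, 0, 1)
sim_to_pendant = wss(adj, 0, 3)
```

Nodes 0 and 1 share an identical augmented neighborhood, so their similarity is close to 1, while node 3 only overlaps with node 0 through node 2.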
- compute_similarity(graph)[source]¶
Backward-compatible NetworkX wrapper over the sparse implementation.
- static similarity_to_dissimilarity_sparse(similarity_matrix)[source]¶
- Parameters:
similarity_matrix (scipy.sparse.csr_matrix)
- Return type:
- connect_graph_heuristically(graph, n_obs)[source]¶
Connect disconnected components with synthetic bridge edges.
This function assumes graph is already a dissimilarity graph:
smaller weight = closer
larger weight = farther
It does not rebuild the similarity graph. It only adds bridge edges of distance weight 1 between disconnected components.
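A minimal sketch of the bridging idea follows. The source states only that bridge edges of weight 1 are added between disconnected components; which nodes get linked is not documented, so chaining the first node of each component to the first node of the next is an assumption, and the helper names are hypothetical.

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into components with a small union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def add_bridges(n, edges, bridge_weight=1.0):
    """Return a copy of the edge dict with one bridge edge between consecutive components."""
    out = dict(edges)
    comps = connected_components(n, edges)
    for left, right in zip(comps, comps[1:]):
        out[(left[0], right[0])] = bridge_weight  # assumed pairing rule
    return out


# Dissimilarity edges; node 4 is isolated.
edges = {(0, 1): 0.2, (2, 3): 0.4}
bridged = add_bridges(5, edges)
```

The original edge weights are left untouched; only the new bridge edges carry the fixed weight.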
- static compute_full_distance_matrix(graph)[source]¶
Compute the full dense matrix of shortest path distances using Floyd–Warshall.
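For reference, Floyd–Warshall on a small undirected weighted graph can be sketched in plain Python as below. The edge-dict input format and the function name floyd_warshall are assumptions made for illustration; the package method takes a graph object instead.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths for an undirected graph given as {(u, v): weight}."""
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    # Allow each node k in turn to act as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


edges = {(0, 1): 0.5, (1, 2): 0.25, (0, 2): 1.0}
dist = floyd_warshall(3, edges)
```

Here the two-hop route from node 0 to node 2 is shorter than the direct edge, so the dense matrix stores the two-hop length. The triple loop costs O(n³) time and O(n²) memory, which is why a dense result is practical only for moderate n.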
- static compute_sparse_distance_dict(graph)[source]¶
Compute a dictionary-of-dictionaries of shortest path distances. For each node, run single_source_dijkstra_path_length and store the results.
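The dictionary-of-dictionaries layout can be sketched without NetworkX by running a heap-based Dijkstra from every node. This is a stand-in for single_source_dijkstra_path_length, not the package code; dijkstra_lengths and sparse_distance_dict are hypothetical names.

```python
import heapq


def dijkstra_lengths(adj, source):
    """Shortest path length from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def sparse_distance_dict(adj):
    """Map each node to its own {target: shortest distance} dictionary."""
    return {node: dijkstra_lengths(adj, node) for node in adj}


adj = {0: {1: 0.5}, 1: {0: 0.5, 2: 0.25}, 2: {1: 0.25}, 3: {}}
distances = sparse_distance_dict(adj)
```

Unreachable nodes simply do not appear in a source's inner dictionary, which keeps the structure sparse on disconnected graphs.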
- graph_metric(u, v)[source]¶
Custom distance metric that uses the precomputed sparse distance dictionary. The data points are mapped to node indices using self._point_to_index.
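One way to picture this lookup is sketched below. Only the attribute name _point_to_index comes from the description above; keying it by the point's coordinates as a tuple, the _distance_dict attribute, and the GraphMetricSketch class are assumptions for illustration.

```python
class GraphMetricSketch:
    """Toy stand-in showing a point -> node index -> precomputed distance lookup."""

    def __init__(self, points, distance_dict):
        # Assumed keying: each point's coordinates as a hashable tuple.
        self._point_to_index = {tuple(p): i for i, p in enumerate(points)}
        self._distance_dict = distance_dict  # hypothetical attribute name

    def graph_metric(self, u, v):
        i = self._point_to_index[tuple(u)]
        j = self._point_to_index[tuple(v)]
        return self._distance_dict[i][j]


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
distance_dict = {
    0: {0: 0.0, 1: 0.5, 2: 0.75},
    1: {0: 0.5, 1: 0.0, 2: 0.25},
    2: {0: 0.75, 1: 0.25, 2: 0.0},
}
sketch = GraphMetricSketch(points, distance_dict)
d02 = sketch.graph_metric((0.0, 0.0), (2.0, 0.0))
```

With this keying, duplicate points would collapse onto a single node index, so the sketch is only meaningful for distinct points.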
- static compute_custom_distance_matrix(graph)[source]¶
Compute the pairwise distance matrix used by the graph-based pipeline.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input feature matrix.
- Returns:
Pairwise distance matrix.
- Return type:
- static dense_from_sparse_edges_fill1(D_sparse)[source]¶
Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.
Non-edges are filled with 1, diagonal with 0, and sparse entries overwrite the corresponding distances.
- Parameters:
D_sparse (scipy.sparse.csr_matrix)
- Return type:
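The fill rule described for dense_from_sparse_edges_fill1 can be sketched with nested lists. The sketch assumes the sparse input is given as a dict of undirected (i, j) edges and mirrors each entry; the real method takes a scipy.sparse.csr_matrix, and dense_fill1 is a hypothetical name.

```python
def dense_fill1(n, sparse_edges):
    """Dense n x n matrix: 0 on the diagonal, 1 for non-edges, stored distance for edges."""
    dense = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for (i, j), d in sparse_edges.items():
        if i != j:  # keep the diagonal at 0
            dense[i][j] = d
            dense[j][i] = d
    return dense


sparse_edges = {(0, 1): 0.2, (1, 2): 0.7}
dense = dense_fill1(3, sparse_edges)
```

Since non-edges are filled with 1, that value acts as the largest base dissimilarity for any pair without a stored entry.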
- static reassign_noise_via_mst(mst_graph, labels0, c=5)[source]¶
Reassign noise labels by propagating labels over a precomputed MST.
- Parameters:
mst_graph (networkx.Graph) – Minimum spanning tree of the final connected WSS graph.
labels0 (ndarray) – Initial labels with noise marked as -1.
c (int, default=5) – Number of largest edge weights to keep in the lexicographic path signature during propagation.
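The propagation rule is not spelled out in this reference, so the following is only one plausible reading, offered as a sketch. It assumes that each noise point takes the label of the labeled node whose MST path has the lexicographically smallest signature, where the signature is the c largest edge weights on the path in descending order. The function names and the dict-of-dicts tree layout are hypothetical.

```python
def path_signature(weights, c):
    """The c largest edge weights on a path, in descending order."""
    return tuple(sorted(weights, reverse=True)[:c])


def reassign_noise(tree_adj, labels, c=5):
    """Give each noise node (-1) the label of the labeled node with the smallest signature."""
    new_labels = list(labels)
    for start, label in enumerate(labels):
        if label != -1:
            continue
        best = None  # (signature, label) of the best labeled node found so far
        stack = [(start, -1, [])]  # (node, parent, edge weights on the path so far)
        while stack:
            node, parent, weights = stack.pop()
            if labels[node] != -1:
                candidate = (path_signature(weights, c), labels[node])
                if best is None or candidate < best:
                    best = candidate
                continue  # a longer path through this node cannot score better
            for nxt, w in tree_adj[node].items():
                if nxt != parent:
                    stack.append((nxt, node, weights + [w]))
        if best is not None:
            new_labels[start] = best[1]
    return new_labels


# Path-shaped MST 0 - 1 - 2 - 3 - 4; node 2 is noise between two clusters.
tree_adj = {
    0: {1: 0.1},
    1: {0: 0.1, 2: 0.9},
    2: {1: 0.9, 3: 0.2},
    3: {2: 0.2, 4: 0.1},
    4: {3: 0.1},
}
labels0 = [0, 0, -1, 1, 1]
labels = reassign_noise(tree_adj, labels0, c=5)
```

Under this reading, node 2 joins the cluster reached through the lighter edge. The search is quadratic in the worst case, which is acceptable for a sketch but not for large trees.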
- fit(X, y=None)[source]¶
Fit the model on feature data or a precomputed graph.
- Parameters:
X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix when sim_graph_method is not "precomputed". In "precomputed" mode, this may be a networkx.Graph, a SciPy sparse adjacency matrix, or a square dense adjacency matrix.
- Returns:
self – Fitted estimator.
- Return type:
- fit_predict(X, y=None, m=None, c=5, **fit_params)[source]¶
Fit the model and return cluster labels.
- Parameters:
X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix or supported precomputed graph representation.
- Returns:
Cluster labels for the fitted solution.
- Return type:
- fit_coresg(X, m_list, coresg_kwargs=None)[source]¶
Build graph-derived distances and run CoreSGHDBSCAN on them.
- labels_for(m, no_noise=None, c=5)[source]¶
Return labels for a selected min_samples value.
- Parameters:
- Returns:
Cluster labels for the requested fitted solution.
- Return type:
Notes
labels_by_m_[m] stores the directly fitted labels. labels_for(m) may additionally apply noise reassignment.
- plot_condensed_tree(m, figsize=(10, 6), **kwargs)[source]¶
Plot the condensed tree for a selected min_samples value.
- Parameters:
- Returns:
Displays the condensed tree plot.
- Return type:
None
- Raises:
ValueError – If the model has not been fitted yet.
KeyError – If the requested m is not available in the stored results.
Notes
This method first looks for the condensed tree in self.coresg_.condensed_trees_. If it is not found there, it falls back to self.coresg_.models_[m].condensed_tree_ when full models have been saved.
Examples
>>> g.fit(X)
>>> g.plot_condensed_tree(10)
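The lookup order described in the Notes for plot_condensed_tree can be mimicked with plain objects. The sketch below uses SimpleNamespace stand-ins rather than real fitted models, and lookup_condensed_tree is a hypothetical helper, not a package function.

```python
from types import SimpleNamespace


def lookup_condensed_tree(coresg, m):
    """Prefer condensed_trees_[m]; fall back to models_[m].condensed_tree_."""
    if coresg is None:
        raise ValueError("The model has not been fitted yet.")
    trees = getattr(coresg, "condensed_trees_", None) or {}
    if m in trees:
        return trees[m]
    models = getattr(coresg, "models_", None) or {}
    if m in models:
        return models[m].condensed_tree_
    raise KeyError(f"No condensed tree stored for min_samples={m}")


# Stand-in for a fitted backend: m=10 has a stored tree, m=20 only a saved model.
coresg = SimpleNamespace(
    condensed_trees_={10: "tree-10"},
    models_={20: SimpleNamespace(condensed_tree_="tree-20")},
)
tree_direct = lookup_condensed_tree(coresg, 10)
tree_fallback = lookup_condensed_tree(coresg, 20)
```

A value of m that appears in neither dictionary raises KeyError, matching the documented behavior.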
- interactive_condensed_tree(figsize=(10, 6))[source]¶
Create an interactive condensed tree explorer across fitted min_samples values.
- Parameters:
figsize (tuple of float, optional) – Figure size passed to Matplotlib for each displayed condensed tree, by default (10, 6).
- Returns:
A selection slider widget for browsing condensed trees across available min_samples values.
- Return type:
ipywidgets.Widget
- Raises:
ImportError – If ipywidgets is not installed.
RuntimeError – If the model has not been fitted yet.
ValueError – If no condensed trees are available.
Notes
This method is intended for use in an interactive Jupyter environment. It uses the stored condensed trees in self.coresg_.condensed_trees_ and falls back to any available entries in self.coresg_.models_.
Examples
>>> g.fit(X)
>>> widget = g.interactive_condensed_tree()
Core module¶
CoreSG-HDBSCAN core implementation
- coresg_graphhdbscan.core.prim_mrd_mst_edges(X, core)[source]¶
Compute MST edges on a mutual-reachability graph using Prim’s algorithm.
- Parameters:
D (numpy.ndarray) – Dense pairwise distance matrix.
core (numpy.ndarray) – Core-distance vector.
eps (float, default=1e-12) – Numerical tolerance.
X (numpy.ndarray)
- Returns:
Array of undirected MST edges with shape (n_edges, 2).
- Return type:
- coresg_graphhdbscan.core.prim_mrd_mst_edges_from_D(D, core)[source]¶
Compute MST edges from a precomputed distance matrix.
- Parameters:
D (numpy.ndarray) – Dense pairwise distance matrix.
core (numpy.ndarray) – Core-distance vector.
eps (float, default=1e-12) – Numerical tolerance.
- Returns:
Array of undirected MST edges with shape (n_edges, 2).
- Return type:
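A minimal sketch of Prim's algorithm on the mutual-reachability graph follows, using the standard HDBSCAN definition MRD(i, j) = max(core[i], core[j], D[i][j]). It works on nested lists, ignores the eps tolerance, and returns a list of (parent, child) pairs; the function name prim_mrd_mst is hypothetical and this is not the package implementation.

```python
import math


def prim_mrd_mst(D, core):
    """MST edges of the complete mutual-reachability graph, grown from node 0."""
    n = len(D)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest MRD link from the current tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                mrd = max(core[u], core[v], D[u][v])
                if mrd < best[v]:
                    best[v] = mrd
                    parent[v] = u
    return edges


# Four points on a line at 0, 1, 2, 10, with core distances for two neighbors (self included).
xs = [0, 1, 2, 10]
D = [[abs(a - b) for b in xs] for a in xs]
core = [1, 1, 1, 8]
mst_edges = prim_mrd_mst(D, core)
```

The dense O(n²) form of Prim's algorithm suits this setting because the mutual-reachability graph is complete.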
- class coresg_graphhdbscan.core.CoreSGModel(labels, probabilities, stabilities, condensed_tree_array, single_linkage_tree)[source]¶
Bases: object
Lightweight wrapper that mimics the HDBSCAN attributes used by this package. Stored result object for one fitted min_samples value.
- Parameters:
labels (numpy.ndarray)
probabilities (numpy.ndarray)
stabilities (numpy.ndarray)
condensed_tree_array (numpy.recarray)
single_linkage_tree (numpy.ndarray)
- labels_¶
Cluster labels for each sample.
- Type:
- probabilities_¶
Membership strengths for each sample.
- Type:
- cluster_persistence_¶
Persistence score for each cluster, as returned by the HDBSCAN*-style cluster selection step.
- Type:
- condensed_tree_¶
Condensed tree object for plotting and inspection.
- Type:
hdbscan.plots.CondensedTree
- class coresg_graphhdbscan.core.CoreSGHDBSCAN(min_samples_list, metric='euclidean', eps=1e-12, min_cluster_size=None, save_models=False)[source]¶
Bases: object
CoreSG-based hierarchical density clustering backend.
This class implements the lower-level CoreSG-HDBSCAN pipeline operating on feature vectors or distance representations.
Workflow¶
1. Compute the full distance matrix once.
2. Compute self-inclusive core distances for all values in min_samples_list.
3. Build the CORE-SG graph from:
- the kmax nearest-neighbor graph with ties
- the MST on the complete MRD graph for kmax
4. Precompute a sparse neighbor table for fast edge distance lookup.
5. For each m:
- compute MRD edge weights
- build the sparse weighted graph
- compute the MST
- build the single-linkage tree
- condense the tree and extract clusters
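The core-distance step of the workflow can be sketched as follows. The sketch assumes "self-inclusive" means each point counts as its own first neighbor, so the core distance for a given m is the m-th smallest entry of its distance row, diagonal zero included. The function name is hypothetical.

```python
def self_inclusive_core_distances(D, min_samples_list):
    """Map each m to the per-point m-th smallest row entry, counting the point itself."""
    sorted_rows = [sorted(row) for row in D]  # sort once, reuse for every m
    return {m: [row[m - 1] for row in sorted_rows] for m in min_samples_list}


# Four points on a line at 0, 1, 2, 10.
xs = [0, 1, 2, 10]
D = [[abs(a - b) for b in xs] for a in xs]
core = self_inclusive_core_distances(D, [1, 2, 3])
```

Sorting each row once and indexing it for every m reflects the workflow's idea of sharing work across all min_samples values.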
- Parameters:
min_samples_list (list[int]) – List of min_samples values to evaluate.
metric (str, default="euclidean") – Distance metric mode.
eps (float, default=1e-12) – Numerical tolerance used in graph construction.
min_cluster_size (int or None, default=None) – Minimum cluster size. If None, the package default behavior is used.
- X_: numpy.ndarray | None = None¶
- D_: numpy.ndarray | None = None¶
- core_: Dict[int, numpy.ndarray]¶
- edges_ut_: numpy.ndarray | None = None¶
- idx_with_self_: numpy.ndarray | None = None¶
- dst_with_self_: numpy.ndarray | None = None¶
- idx_no_self_: numpy.ndarray | None = None¶
- dst_no_self_: numpy.ndarray | None = None¶
- A_knn_: scipy.sparse.csr_matrix | None = None¶
- msts_: Dict[int, Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]]¶
- models_: Dict[int, CoreSGModel]¶
- labels_by_m_: Dict[int, numpy.ndarray]¶
- fit(X)[source]¶
- Parameters:
X (numpy.ndarray)
- Return type:
- fit_from_distance_matrix(D)[source]¶
Build CORE-SG from a precomputed distance matrix D (NxN).
D[i,j] is the base dissimilarity between points i and j.
We compute self-inclusive core distances and kmax-NNG from D.
We build CORE-SG edges via kmax-NNG ∪ MST_kmax (on MRD_kmax).
After this, you can call self.run(…) exactly as usual.
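The union kmax-NNG ∪ MST_kmax can be illustrated with a small sketch. It assumes that "with ties" means every neighbor at or below a point's kmax-th smallest self-inclusive distance is kept, and it takes the MST edges as a given input instead of computing them. Both function names are hypothetical.

```python
def knn_with_ties_edges(D, k):
    """Undirected edges to every neighbor within each point's k-th smallest row entry."""
    edges = set()
    for i, row in enumerate(D):
        radius = sorted(row)[k - 1]  # self-inclusive: the diagonal zero counts
        for j, d in enumerate(row):
            if j != i and d <= radius:
                edges.add((min(i, j), max(i, j)))
    return edges


def core_sg_edges(D, kmax, mst_edges):
    """Union of the kmax-NNG (with ties) and a supplied MST edge list."""
    mst = {(min(i, j), max(i, j)) for i, j in mst_edges}
    return knn_with_ties_edges(D, kmax) | mst


# Two groups on a line: {0, 1, 2} and {10, 11}.
xs = [0, 1, 2, 10, 11]
D = [[abs(a - b) for b in xs] for a in xs]
knn = knn_with_ties_edges(D, 2)
mst_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # assumed MST on the MRD graph
core_sg = core_sg_edges(D, 2, mst_edges)
```

In this example the neighbor graph alone leaves the two groups disconnected, and the MST contributes the single edge that joins them.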
- Parameters:
D (numpy.ndarray)
- Return type:
- run(cluster_selection_method='eom', allow_single_cluster=False, match_reference_implementation=False, cluster_selection_epsilon=0.0)[source]¶
Run Core-SG clustering for all requested min_samples values.
Stores¶
- models_ (dict)
Saved per-m models when save_models=True.
- condensed_trees_ (dict)
Condensed tree objects for all fitted m values.
- labels_by_m_ (dict)
Stored labels for all fitted m values.
- Parameters:
- Return type:
- coresg_graphhdbscan.core.plot_condensed_tree_for_m(models_dict, m, title_prefix='', figsize=(8, 5))[source]¶
Plot the condensed tree for a selected fitted min_samples value.
- Parameters:
model (CoreSGHDBSCAN or GraphCoreSGHDBSCAN) – Fitted clustering object.
m (int) – Selected min_samples value.
title_prefix (str)
- Return type:
None
Metrics module¶
Clustering evaluation helpers.