API reference¶

This section documents the public API of coresg-graphhdbscan.

Main public API¶

The most important user-facing class is GraphCoreSGHDBSCAN.

Important public methods include:

fit
fit_predict
labels_for
model
plot_condensed_tree
interactive_condensed_tree

Important fitted result attributes include:

models_
condensed_trees_
labels_by_m_
coresg_
dist_matrix_
similarity_graph_
connected_graph_

Main classes¶

`coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN`	Graph-based CoreSG + HDBSCAN interface.
`coresg_graphhdbscan.core.CoreSGHDBSCAN`	CoreSG-based hierarchical density clustering backend.
`coresg_graphhdbscan.core.CoreSGModel`	Lightweight wrapper that mimics the HDBSCAN attributes used by this package.

Main utility function¶

coresg_graphhdbscan.core.plot_condensed_tree_for_m

Plot the condensed tree for a selected fitted min_samples value.

Module reference¶

Graph module¶

Graph-based wrapper around CoreSG-HDBSCAN.

class coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)[source]¶

Bases: CoreSGHDBSCAN

Graph-based CoreSG + HDBSCAN interface.

This class constructs a similarity graph from feature data or accepts a precomputed graph representation, transforms that graph into a graph-derived distance representation, and then runs the CoreSG-HDBSCAN clustering pipeline.

Parameters:

min_samples (int or iterable of int, default=10) – Main clustering hyperparameter. A single integer gives one fitted solution, while an iterable allows fitting multiple values in one run.
sim_graph_method ({"sc_umap", "sc_gauss", "jaccard_phenograph", "precomputed"}, default="sc_umap") – Graph-construction backend.
metric (str, callable, or None, default="euclidean") – Distance metric used during similarity graph construction. Supported string metrics are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “braycurtis”, “canberra”, “chebyshev”, “correlation”, “dice”, “hamming”, “jaccard”, “mahalanobis”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”, and the package-specific “hybrid_euclidean_cosine”. The metric “kulsinski” is not supported. The combination metric=”yule” with sim_graph_method=”sc_gauss” is also not supported because it can produce non-finite graph weights.
metric_kwds (dict or None, default=None) – Additional keyword arguments passed to the selected distance metric.
add_neighbor (bool, default=True) – Controls how weighted structural similarity is expanded into graph edges.
no_noise (bool, default=True) – If True, points initially labeled as noise are reassigned after clustering.
n_neighbors (int, default=15) – Number of neighbors used during graph construction.
heuristic_connect (bool, default=False) – If True, increase n_neighbors until the WSS dissimilarity graph becomes connected, except in precomputed mode, where bridge edges are used instead. If False, disconnected components are connected with synthetic bridge edges.
min_cluster_size (int or None, default=None) – Minimum cluster size used in the clustering stage. If None, the package follows the selected min_samples value for each run.
save_models (bool, default=False) – If True, save hdbscan models for different min_samples which can add some memory overhead. If False, just save labels and condensed trees for each min_samples.
**kwargs – Additional keyword arguments passed to internal graph-construction helpers.

similarity_graph_¶

Initial similarity graph.

Type:: networkx.Graph

similarity_graph_WSS¶

Weighted structural similarity graph.

Type:: networkx.Graph

dissimilarity_graph_¶

Graph after conversion from similarity to dissimilarity.

Type:: networkx.Graph

connected_graph_¶

Final connected graph used by the clustering stage.

Type:: networkx.Graph

dist_matrix_¶

Dense matrix used by the CoreSG-HDBSCAN pipeline.

Type:: numpy.ndarray

coresg_¶

Internal fitted CoreSG-HDBSCAN object.

Type:: CoreSGHDBSCAN

models_¶

Dictionary of saved per-min_samples models. Populated only when save_models=True.

Type:: dict

condensed_trees_¶

Dictionary of condensed tree objects keyed by fitted min_samples value.

Type:: dict

labels_by_m_¶

Dictionary of stored labels keyed by fitted min_samples value.

Type:: dict

compute_similarity_sparse(graph)[source]¶

Fast weighted structural similarity as a sparse matrix.

This is algebraically equivalent to the original compute_similarity implementation, but avoids Python-level all-pairs iteration. The weighted adjacency vector for each node includes an explicit self-loop of weight 1 before cosine normalization.

Return type:: scipy.sparse.csr_matrix

compute_similarity(graph)[source]¶: Backward-compatible NetworkX wrapper over the sparse implementation.

static similarity_to_dissimilarity_sparse(similarity_matrix)[source]¶

Parameters:: similarity_matrix (scipy.sparse.csr_matrix)
Return type:: scipy.sparse.csr_matrix

static similarity_to_dissimilarity(similarity_graph)[source]¶

static is_graph_connected(graph)[source]¶

create_similarity_graph(data)[source]¶

connect_graph_heuristically(graph, n_obs)[source]¶

Connect disconnected components with synthetic bridge edges.

This function assumes graph is already a dissimilarity graph:

smaller weight = closer larger weight = farther

It does not rebuild the similarity graph. It only adds bridge edges of distance weight 1 between disconnected components.

static compute_full_distance_matrix(graph)[source]¶: Compute the full dense matrix of shortest path distances using Floyd–Warshall.

static compute_sparse_distance_dict(graph)[source]¶: Compute a dictionary-of-dictionaries of shortest path distances. For each node, run single_source_dijkstra_path_length and store the results.

graph_metric(u, v)[source]¶: Custom distance metric that uses the precomputed sparse distance dictionary. The data points are mapped to node indices using self._point_to_index.

static compute_custom_distance_matrix(graph)[source]¶

Compute the pairwise distance matrix used by the graph-based pipeline.

Parameters:: X (array-like of shape (n_samples, n_features)) – Input feature matrix.
Returns:: Pairwise distance matrix.
Return type:: numpy.ndarray

static dense_from_sparse_edges_fill1(D_sparse)[source]¶

Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.

Non-edges are filled with 1, diagonal with 0, and sparse entries overwrite the corresponding distances.

Parameters:: D_sparse (scipy.sparse.csr_matrix)
Return type:: numpy.ndarray

static reassign_noise_via_mst(mst_graph, labels0, c=5)[source]¶

Reassign noise labels by propagating labels over a precomputed MST.

Parameters:

mst_graph (networkx.Graph) – Minimum spanning tree of the final connected WSS graph.
labels0 (ndarray) – Initial labels with noise marked as -1.
c (int, default=5) – Number of largest edge weights to keep in the lexicographic path signature during propagation.

fit(X, y=None)[source]¶

Fit the model on feature data or a precomputed graph.

Parameters:: X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix when sim_graph_method is not "precomputed". In "precomputed" mode, this may be a networkx.Graph, a SciPy sparse adjacency matrix, or a square dense adjacency matrix.
Returns:: self – Fitted estimator.
Return type:: GraphCoreSGHDBSCAN

fit_predict(X, y=None, m=None, c=5, **fit_params)[source]¶

Fit the model and return cluster labels.

Parameters:: X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix or supported precomputed graph representation.
Returns:: Cluster labels for the fitted solution.
Return type:: numpy.ndarray

fit_coresg(X, m_list, coresg_kwargs=None)[source]¶: Build graph-derived distances and run CoreSGHDBSCAN on them.

labels_for(m, no_noise=None, c=5)[source]¶

Return labels for a selected min_samples value.

Parameters:

m (int) – Selected min_samples value.
no_noise (bool or None, optional) – If True, apply MST-based noise reassignment. If None, use the instance-level no_noise setting.
c (int, optional) – Tie-breaking path length used during noise reassignment.

Returns:

Cluster labels for the requested fitted solution.

Return type:

numpy.ndarray

Notes

labels_by_m_[m] stores the directly fitted labels. labels_for(m) may additionally apply noise reassignment.

plot_condensed_tree(m, figsize=(10, 6), **kwargs)[source]¶

Plot the condensed tree for a selected min_samples value.

Parameters:

m (int) – The min_samples value whose condensed tree should be displayed.
figsize (tuple of float, optional) – Figure size passed to Matplotlib, by default (8, 5).
**kwargs – Additional keyword arguments forwarded to CondensedTree.plot().

Returns:

Displays the condensed tree plot.

Return type:

None

Raises:

ValueError – If the model has not been fitted yet.
KeyError – If the requested m is not available in the stored results.

Notes

This method first looks for the condensed tree in self.coresg_.condensed_trees_. If it is not found there, it falls back to self.coresg_.models_[m].condensed_tree_ when full models have been saved.

Examples

>>> g.fit(X)
>>> g.plot_condensed_tree(10)

interactive_condensed_tree(figsize=(10, 6))[source]¶

Create an interactive condensed tree explorer across fitted min_samples values.

Parameters:

figsize (tuple of float, optional) – Figure size passed to Matplotlib for each displayed condensed tree, by default (10, 6).

Returns:

A selection slider widget for browsing condensed trees across available min_samples values.

Return type:

ipywidgets.Widget

Raises:

ImportError – If ipywidgets is not installed.
RuntimeError – If the model has not been fitted yet.
ValueError – If no condensed trees are available.

Notes

This method is intended for use in an interactive Jupyter environment. It uses the stored condensed trees in self.coresg_.condensed_trees_ and falls back to any available entries in self.coresg_.models_.

Examples

>>> g.fit(X)
>>> widget = g.interactive_condensed_tree()

Core module¶

CoreSG-HDBSCAN core implementation

coresg_graphhdbscan.core.prim_mrd_mst_edges(X, core)[source]¶

Compute MST edges on a mutual-reachability graph using Prim’s algorithm.

Parameters:

D (numpy.ndarray) – Dense pairwise distance matrix.
core (numpy.ndarray) – Core-distance vector.
eps (float, default=1e-12) – Numerical tolerance.
X (numpy.ndarray)

Returns:

Array of undirected MST edges with shape (n_edges, 2).

Return type:

numpy.ndarray

coresg_graphhdbscan.core.prim_mrd_mst_edges_from_D(D, core)[source]¶

Compute MST edges from a precomputed distance matrix.

Parameters:

D (numpy.ndarray) – Dense pairwise distance matrix.
core (numpy.ndarray) – Core-distance vector.
eps (float, default=1e-12) – Numerical tolerance.

Returns:

Array of undirected MST edges with shape (n_edges, 2).

Return type:

numpy.ndarray

class coresg_graphhdbscan.core.CoreSGModel(labels, probabilities, stabilities, condensed_tree_array, single_linkage_tree)[source]¶

Bases: object

Lightweight wrapper that mimics the HDBSCAN attributes used by this package. Stored result object for one fitted min_samples value.

Parameters:

labels (numpy.ndarray)
probabilities (numpy.ndarray)
stabilities (numpy.ndarray)
condensed_tree_array (numpy.recarray)
single_linkage_tree (numpy.ndarray)

labels_¶

Cluster labels for each sample.

Type:: numpy.ndarray

probabilities_¶

Membership strengths for each sample.

Type:: numpy.ndarray

cluster_persistence_¶

Persistence score for each cluster.

Type:: numpy.ndarray

single_linkage_tree_¶

Single-linkage tree wrapper.

Type:: object

cluster_persistence_¶

Cluster persistence values returned by the HDBSCAN*-style cluster selection step.

Type:: numpy.ndarray

condensed_tree_¶

Condensed tree object for plotting and inspection.

Type:: hdbscan.plots.CondensedTree

class coresg_graphhdbscan.core.CoreSGHDBSCAN(min_samples_list, metric='euclidean', eps=1e-12, min_cluster_size=None, save_models=False)[source]¶

Bases: object

CoreSG-based hierarchical density clustering backend.

This class implements the lower-level CoreSG-HDBSCAN pipeline operating on feature vectors or distance representations.

Workflow¶

Compute the full distance matrix once.
Compute self-inclusive core distances for all values in min_samples_list.
Build the CORE-SG graph from: - the kmax nearest-neighbor graph with ties - the MST on the complete MRD graph for kmax
Precompute a sparse neighbor table for fast edge distance lookup.
For each m: - compute MRD edge weights - build the sparse weighted graph - compute the MST - build the single-linkage tree - condense the tree and extract clusters

param min_samples_list:: List of min_samples values to evaluate.
type min_samples_list:: list[int]
param metric:: Distance metric mode.
type metric:: str, default=”euclidean”
param eps:: Numerical tolerance used in graph construction.
type eps:: float, default=1e-12
param min_cluster_size:: Minimum cluster size. If None, the package default behavior is used.
type min_cluster_size:: int or None, default=None

min_samples_list: List[int]¶

metric: str = 'euclidean'¶

eps: float = 1e-12¶

min_cluster_size: int | None = None¶

save_models: bool = False¶

X_: numpy.ndarray | None = None¶

N_: int | None = None¶

D_: numpy.ndarray | None = None¶

core_: Dict[int, numpy.ndarray]¶

kmax_: int | None = None¶

edges_ut_: numpy.ndarray | None = None¶

idx_with_self_: numpy.ndarray | None = None¶

dst_with_self_: numpy.ndarray | None = None¶

idx_no_self_: numpy.ndarray | None = None¶

dst_no_self_: numpy.ndarray | None = None¶

A_knn_: scipy.sparse.csr_matrix | None = None¶

msts_: Dict[int, Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]]¶

mst_times_: Dict[int, float]¶

models_: Dict[int, CoreSGModel]¶

condensed_trees_: Dict[int, object]¶

labels_by_m_: Dict[int, numpy.ndarray]¶

times_: Dict[int, float]¶

fit(X)[source]¶

Parameters:: X (numpy.ndarray)
Return type:: CoreSGHDBSCAN

fit_from_distance_matrix(D)[source]¶

Build CORE-SG from a precomputed distance matrix D (NxN).

D[i,j] is the base dissimilarity between points i and j.
We compute self-inclusive core distances and kmax-NNG from D.
We build CORE-SG edges via kmax-NNG ∪ MST_kmax (on MRD_kmax).

After this, you can call self.run(…) exactly as usual.

Parameters:: D (numpy.ndarray)
Return type:: CoreSGHDBSCAN

model(min_samples)[source]¶

run(cluster_selection_method='eom', allow_single_cluster=False, match_reference_implementation=False, cluster_selection_epsilon=0.0)[source]¶

Run Core-SG clustering for all requested min_samples values.

Stores¶

models_dict: Saved per-m models when save_models=True.
condensed_trees_dict: Condensed tree objects for all fitted m values.
labels_by_m_dict: Stored labels for all fitted m values.

Parameters:

cluster_selection_method (str)
allow_single_cluster (bool)
match_reference_implementation (bool)
cluster_selection_epsilon (float)

Return type:

CoreSGHDBSCAN

plot_condensed_tree(m, figsize=(8, 5))[source]¶

Parameters:: m (int)

Parameters:

min_samples_list (List[int])
metric (str)
eps (float)
min_cluster_size (int | None)
save_models (bool)

coresg_graphhdbscan.core.plot_condensed_tree_for_m(models_dict, m, title_prefix='', figsize=(8, 5))[source]¶

Plot the condensed tree for a selected fitted min_samples value.

Parameters:

model (CoreSGHDBSCAN or GraphCoreSGHDBSCAN) – Fitted clustering object.
m (int) – Selected min_samples value.
title_prefix (str)

Return type:

None

Metrics module¶

Clustering evaluation helpers.

coresg_graphhdbscan.metrics.evaluate_clustering(true_labels, predicted_labels)[source]¶

Compute AMI and ARI between true and predicted labels.

Returns:: (ami, ari)
Return type:: tuple