API reference

This section documents the public API of coresg-graphhdbscan.

Main public API

The most important user-facing class is GraphCoreSGHDBSCAN.

Important public methods include:

  • fit

  • fit_predict

  • labels_for

  • model

  • plot_condensed_tree

  • interactive_condensed_tree

Important fitted result attributes include:

  • models_

  • condensed_trees_

  • labels_by_m_

  • coresg_

  • dist_matrix_

  • similarity_graph_

  • connected_graph_

Main classes

coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN

Graph-based CoreSG + HDBSCAN interface.

coresg_graphhdbscan.core.CoreSGHDBSCAN

CoreSG-based hierarchical density clustering backend.

coresg_graphhdbscan.core.CoreSGModel

Lightweight wrapper that mimics the HDBSCAN attributes used by this package.

Main utility function

coresg_graphhdbscan.core.plot_condensed_tree_for_m

Plot the condensed tree for a selected fitted min_samples value.

Module reference

Graph module

Graph-based wrapper around CoreSG-HDBSCAN.

class coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)[source]

Bases: CoreSGHDBSCAN

Graph-based CoreSG + HDBSCAN interface.

This class constructs a similarity graph from feature data or accepts a precomputed graph representation, transforms that graph into a graph-derived distance representation, and then runs the CoreSG-HDBSCAN clustering pipeline.

Parameters:
  • min_samples (int or iterable of int, default=10) – Main clustering hyperparameter. A single integer gives one fitted solution, while an iterable allows fitting multiple values in one run.

  • sim_graph_method ({"sc_umap", "sc_gauss", "jaccard_phenograph", "precomputed"}, default="sc_umap") – Graph-construction backend.

  • metric (str, callable, or None, default="euclidean") – Distance metric used during similarity graph construction. Supported string metrics are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “braycurtis”, “canberra”, “chebyshev”, “correlation”, “dice”, “hamming”, “jaccard”, “mahalanobis”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”, and the package-specific “hybrid_euclidean_cosine”. The metric “kulsinski” is not supported. The combination metric=”yule” with sim_graph_method=”sc_gauss” is also not supported because it can produce non-finite graph weights.

  • metric_kwds (dict or None, default=None) – Additional keyword arguments passed to the selected distance metric.

  • add_neighbor (bool, default=True) – Controls how weighted structural similarity is expanded into graph edges.

  • no_noise (bool, default=True) – If True, points initially labeled as noise are reassigned after clustering.

  • n_neighbors (int, default=15) – Number of neighbors used during graph construction.

  • heuristic_connect (bool, default=False) – If True, increase n_neighbors until the WSS dissimilarity graph becomes connected, except in precomputed mode, where bridge edges are used instead. If False, disconnected components are connected with synthetic bridge edges.

  • min_cluster_size (int or None, default=None) – Minimum cluster size used in the clustering stage. If None, the package follows the selected min_samples value for each run.

  • save_models (bool, default=False) – If True, save hdbscan models for different min_samples which can add some memory overhead. If False, just save labels and condensed trees for each min_samples.

  • **kwargs – Additional keyword arguments passed to internal graph-construction helpers.

similarity_graph_

Initial similarity graph.

Type:

networkx.Graph

similarity_graph_WSS

Weighted structural similarity graph.

Type:

networkx.Graph

dissimilarity_graph_

Graph after conversion from similarity to dissimilarity.

Type:

networkx.Graph

connected_graph_

Final connected graph used by the clustering stage.

Type:

networkx.Graph

dist_matrix_

Dense matrix used by the CoreSG-HDBSCAN pipeline.

Type:

numpy.ndarray

coresg_

Internal fitted CoreSG-HDBSCAN object.

Type:

CoreSGHDBSCAN

models_

Dictionary of saved per-min_samples models. Populated only when save_models=True.

Type:

dict

condensed_trees_

Dictionary of condensed tree objects keyed by fitted min_samples value.

Type:

dict

labels_by_m_

Dictionary of stored labels keyed by fitted min_samples value.

Type:

dict

compute_similarity_sparse(graph)[source]

Fast weighted structural similarity as a sparse matrix.

This is algebraically equivalent to the original compute_similarity implementation, but avoids Python-level all-pairs iteration. The weighted adjacency vector for each node includes an explicit self-loop of weight 1 before cosine normalization.

Return type:

scipy.sparse.csr_matrix

compute_similarity(graph)[source]

Backward-compatible NetworkX wrapper over the sparse implementation.

static similarity_to_dissimilarity_sparse(similarity_matrix)[source]
Parameters:

similarity_matrix (scipy.sparse.csr_matrix)

Return type:

scipy.sparse.csr_matrix

static similarity_to_dissimilarity(similarity_graph)[source]
static is_graph_connected(graph)[source]
create_similarity_graph(data)[source]
connect_graph_heuristically(graph, n_obs)[source]

Connect disconnected components with synthetic bridge edges.

This function assumes graph is already a dissimilarity graph:

smaller weight = closer larger weight = farther

It does not rebuild the similarity graph. It only adds bridge edges of distance weight 1 between disconnected components.

static compute_full_distance_matrix(graph)[source]

Compute the full dense matrix of shortest path distances using Floyd–Warshall.

static compute_sparse_distance_dict(graph)[source]

Compute a dictionary-of-dictionaries of shortest path distances. For each node, run single_source_dijkstra_path_length and store the results.

graph_metric(u, v)[source]

Custom distance metric that uses the precomputed sparse distance dictionary. The data points are mapped to node indices using self._point_to_index.

static compute_custom_distance_matrix(graph)[source]

Compute the pairwise distance matrix used by the graph-based pipeline.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input feature matrix.

Returns:

Pairwise distance matrix.

Return type:

numpy.ndarray

static dense_from_sparse_edges_fill1(D_sparse)[source]

Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.

Non-edges are filled with 1, diagonal with 0, and sparse entries overwrite the corresponding distances.

Parameters:

D_sparse (scipy.sparse.csr_matrix)

Return type:

numpy.ndarray

static reassign_noise_via_mst(mst_graph, labels0, c=5)[source]

Reassign noise labels by propagating labels over a precomputed MST.

Parameters:
  • mst_graph (networkx.Graph) – Minimum spanning tree of the final connected WSS graph.

  • labels0 (ndarray) – Initial labels with noise marked as -1.

  • c (int, default=5) – Number of largest edge weights to keep in the lexicographic path signature during propagation.

fit(X, y=None)[source]

Fit the model on feature data or a precomputed graph.

Parameters:

X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix when sim_graph_method is not "precomputed". In "precomputed" mode, this may be a networkx.Graph, a SciPy sparse adjacency matrix, or a square dense adjacency matrix.

Returns:

self – Fitted estimator.

Return type:

GraphCoreSGHDBSCAN

fit_predict(X, y=None, m=None, c=5, **fit_params)[source]

Fit the model and return cluster labels.

Parameters:

X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix or supported precomputed graph representation.

Returns:

Cluster labels for the fitted solution.

Return type:

numpy.ndarray

fit_coresg(X, m_list, coresg_kwargs=None)[source]

Build graph-derived distances and run CoreSGHDBSCAN on them.

labels_for(m, no_noise=None, c=5)[source]

Return labels for a selected min_samples value.

Parameters:
  • m (int) – Selected min_samples value.

  • no_noise (bool or None, optional) – If True, apply MST-based noise reassignment. If None, use the instance-level no_noise setting.

  • c (int, optional) – Tie-breaking path length used during noise reassignment.

Returns:

Cluster labels for the requested fitted solution.

Return type:

numpy.ndarray

Notes

labels_by_m_[m] stores the directly fitted labels. labels_for(m) may additionally apply noise reassignment.

plot_condensed_tree(m, figsize=(10, 6), **kwargs)[source]

Plot the condensed tree for a selected min_samples value.

Parameters:
  • m (int) – The min_samples value whose condensed tree should be displayed.

  • figsize (tuple of float, optional) – Figure size passed to Matplotlib, by default (8, 5).

  • **kwargs – Additional keyword arguments forwarded to CondensedTree.plot().

Returns:

Displays the condensed tree plot.

Return type:

None

Raises:
  • ValueError – If the model has not been fitted yet.

  • KeyError – If the requested m is not available in the stored results.

Notes

This method first looks for the condensed tree in self.coresg_.condensed_trees_. If it is not found there, it falls back to self.coresg_.models_[m].condensed_tree_ when full models have been saved.

Examples

>>> g.fit(X)
>>> g.plot_condensed_tree(10)
interactive_condensed_tree(figsize=(10, 6))[source]

Create an interactive condensed tree explorer across fitted min_samples values.

Parameters:

figsize (tuple of float, optional) – Figure size passed to Matplotlib for each displayed condensed tree, by default (10, 6).

Returns:

A selection slider widget for browsing condensed trees across available min_samples values.

Return type:

ipywidgets.Widget

Raises:

Notes

This method is intended for use in an interactive Jupyter environment. It uses the stored condensed trees in self.coresg_.condensed_trees_ and falls back to any available entries in self.coresg_.models_.

Examples

>>> g.fit(X)
>>> widget = g.interactive_condensed_tree()

Core module

CoreSG-HDBSCAN core implementation

coresg_graphhdbscan.core.prim_mrd_mst_edges(X, core)[source]

Compute MST edges on a mutual-reachability graph using Prim’s algorithm.

Parameters:
Returns:

Array of undirected MST edges with shape (n_edges, 2).

Return type:

numpy.ndarray

coresg_graphhdbscan.core.prim_mrd_mst_edges_from_D(D, core)[source]

Compute MST edges from a precomputed distance matrix.

Parameters:
Returns:

Array of undirected MST edges with shape (n_edges, 2).

Return type:

numpy.ndarray

class coresg_graphhdbscan.core.CoreSGModel(labels, probabilities, stabilities, condensed_tree_array, single_linkage_tree)[source]

Bases: object

Lightweight wrapper that mimics the HDBSCAN attributes used by this package. Stored result object for one fitted min_samples value.

Parameters:
labels_

Cluster labels for each sample.

Type:

numpy.ndarray

probabilities_

Membership strengths for each sample.

Type:

numpy.ndarray

cluster_persistence_

Persistence score for each cluster.

Type:

numpy.ndarray

single_linkage_tree_

Single-linkage tree wrapper.

Type:

object

cluster_persistence_

Cluster persistence values returned by the HDBSCAN*-style cluster selection step.

Type:

numpy.ndarray

condensed_tree_

Condensed tree object for plotting and inspection.

Type:

hdbscan.plots.CondensedTree

class coresg_graphhdbscan.core.CoreSGHDBSCAN(min_samples_list, metric='euclidean', eps=1e-12, min_cluster_size=None, save_models=False)[source]

Bases: object

CoreSG-based hierarchical density clustering backend.

This class implements the lower-level CoreSG-HDBSCAN pipeline operating on feature vectors or distance representations.

Workflow

  1. Compute the full distance matrix once.

  2. Compute self-inclusive core distances for all values in min_samples_list.

  3. Build the CORE-SG graph from: - the kmax nearest-neighbor graph with ties - the MST on the complete MRD graph for kmax

  4. Precompute a sparse neighbor table for fast edge distance lookup.

  5. For each m: - compute MRD edge weights - build the sparse weighted graph - compute the MST - build the single-linkage tree - condense the tree and extract clusters

param min_samples_list:

List of min_samples values to evaluate.

type min_samples_list:

list[int]

param metric:

Distance metric mode.

type metric:

str, default=”euclidean”

param eps:

Numerical tolerance used in graph construction.

type eps:

float, default=1e-12

param min_cluster_size:

Minimum cluster size. If None, the package default behavior is used.

type min_cluster_size:

int or None, default=None

min_samples_list: List[int]
metric: str = 'euclidean'
eps: float = 1e-12
min_cluster_size: int | None = None
save_models: bool = False
X_: numpy.ndarray | None = None
N_: int | None = None
D_: numpy.ndarray | None = None
core_: Dict[int, numpy.ndarray]
kmax_: int | None = None
edges_ut_: numpy.ndarray | None = None
idx_with_self_: numpy.ndarray | None = None
dst_with_self_: numpy.ndarray | None = None
idx_no_self_: numpy.ndarray | None = None
dst_no_self_: numpy.ndarray | None = None
A_knn_: scipy.sparse.csr_matrix | None = None
msts_: Dict[int, Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]]
mst_times_: Dict[int, float]
models_: Dict[int, CoreSGModel]
condensed_trees_: Dict[int, object]
labels_by_m_: Dict[int, numpy.ndarray]
times_: Dict[int, float]
fit(X)[source]
Parameters:

X (numpy.ndarray)

Return type:

CoreSGHDBSCAN

fit_from_distance_matrix(D)[source]

Build CORE-SG from a precomputed distance matrix D (NxN).

  • D[i,j] is the base dissimilarity between points i and j.

  • We compute self-inclusive core distances and kmax-NNG from D.

  • We build CORE-SG edges via kmax-NNG ∪ MST_kmax (on MRD_kmax).

After this, you can call self.run(…) exactly as usual.

Parameters:

D (numpy.ndarray)

Return type:

CoreSGHDBSCAN

model(min_samples)[source]
run(cluster_selection_method='eom', allow_single_cluster=False, match_reference_implementation=False, cluster_selection_epsilon=0.0)[source]

Run Core-SG clustering for all requested min_samples values.

Stores
models_dict

Saved per-m models when save_models=True.

condensed_trees_dict

Condensed tree objects for all fitted m values.

labels_by_m_dict

Stored labels for all fitted m values.

Parameters:
  • cluster_selection_method (str)

  • allow_single_cluster (bool)

  • match_reference_implementation (bool)

  • cluster_selection_epsilon (float)

Return type:

CoreSGHDBSCAN

plot_condensed_tree(m, figsize=(8, 5))[source]
Parameters:

m (int)

Parameters:
coresg_graphhdbscan.core.plot_condensed_tree_for_m(models_dict, m, title_prefix='', figsize=(8, 5))[source]

Plot the condensed tree for a selected fitted min_samples value.

Parameters:
Return type:

None

Metrics module

Clustering evaluation helpers.

coresg_graphhdbscan.metrics.evaluate_clustering(true_labels, predicted_labels)[source]

Compute AMI and ARI between true and predicted labels.

Returns:

(ami, ari)

Return type:

tuple