coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN

class coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)[source]

Bases: CoreSGHDBSCAN

Graph-based CoreSG + HDBSCAN interface.

This class constructs a similarity graph from feature data or accepts a precomputed graph representation, transforms that graph into a graph-derived distance representation, and then runs the CoreSG-HDBSCAN clustering pipeline.

Parameters:
  • min_samples (int or iterable of int, default=10) – Main clustering hyperparameter. A single integer gives one fitted solution, while an iterable allows fitting multiple values in one run.

  • sim_graph_method ({"sc_umap", "sc_gauss", "jaccard_phenograph", "precomputed"}, default="sc_umap") – Graph-construction backend.

  • metric (str, callable, or None, default="euclidean") – Distance metric used during similarity graph construction. Supported string metrics are "cityblock", "cosine", "euclidean", "l1", "l2", "manhattan", "braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule", and the package-specific "hybrid_euclidean_cosine". The metric "kulsinski" is not supported. The combination metric="yule" with sim_graph_method="sc_gauss" is also not supported because it can produce non-finite graph weights.

  • metric_kwds (dict or None, default=None) – Additional keyword arguments passed to the selected distance metric.

  • add_neighbor (bool, default=True) – Controls how weighted structural similarity is expanded into graph edges.

  • no_noise (bool, default=True) – If True, points initially labeled as noise are reassigned after clustering.

  • n_neighbors (int, default=15) – Number of neighbors used during graph construction.

  • heuristic_connect (bool, default=False) – If True, increase n_neighbors until the WSS dissimilarity graph becomes connected, except in precomputed mode, where bridge edges are used instead. If False, disconnected components are connected with synthetic bridge edges.

  • min_cluster_size (int or None, default=None) – Minimum cluster size used in the clustering stage. If None, the package follows the selected min_samples value for each run.

  • save_models (bool, default=False) – If True, keep the fitted HDBSCAN model for each min_samples value, at the cost of additional memory. If False, keep only the labels and condensed trees for each min_samples value.

  • **kwargs – Additional keyword arguments passed to internal graph-construction helpers.
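
The interaction between min_samples and min_cluster_size described above can be sketched as follows. This is a minimal illustration, not the package's code: resolve_runs is a hypothetical helper, and it assumes that a non-integer min_samples is a plain iterable of integers.

```python
def resolve_runs(min_samples=10, min_cluster_size=None):
    """Return one (min_samples, min_cluster_size) pair per requested run.

    Hypothetical helper for illustration; it is not part of the package.
    """
    # A single integer means one run; any other iterable means one run per value.
    if isinstance(min_samples, int):
        values = [min_samples]
    else:
        values = [int(m) for m in min_samples]
    # With min_cluster_size=None, each run reuses its own min_samples value.
    return [(m, m if min_cluster_size is None else min_cluster_size) for m in values]


single = resolve_runs(10)
several = resolve_runs([5, 10, 20])
fixed_size = resolve_runs([5, 10], min_cluster_size=30)
```

Here single holds one run, several holds three runs whose cluster size tracks min_samples, and fixed_size keeps min_cluster_size at 30 for every run.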

similarity_graph_

Initial similarity graph.

Type:

networkx.Graph

similarity_graph_WSS

Weighted structural similarity graph.

Type:

networkx.Graph

dissimilarity_graph_

Graph after conversion from similarity to dissimilarity.

Type:

networkx.Graph

connected_graph_

Final connected graph used by the clustering stage.

Type:

networkx.Graph

dist_matrix_

Dense matrix used by the CoreSG-HDBSCAN pipeline.

Type:

numpy.ndarray

coresg_

Internal fitted CoreSG-HDBSCAN object.

Type:

CoreSGHDBSCAN

models_

Dictionary of saved per-min_samples models. Populated only when save_models=True.

Type:

dict

condensed_trees_

Dictionary of condensed tree objects keyed by fitted min_samples value.

Type:

dict

labels_by_m_

Dictionary of stored labels keyed by fitted min_samples value.

Type:

dict

__init__(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)[source]

Methods

__init__([min_samples, sim_graph_method, ...])

compute_custom_distance_matrix(graph)

Compute the pairwise distance matrix used by the graph-based pipeline.

compute_full_distance_matrix(graph)

Compute the full dense matrix of shortest path distances using Floyd–Warshall.

compute_similarity(graph)

Backward-compatible NetworkX wrapper over the sparse implementation.

compute_similarity_sparse(graph)

Fast weighted structural similarity as a sparse matrix.

compute_sparse_distance_dict(graph)

Compute a dictionary-of-dictionaries of shortest path distances.

connect_graph_heuristically(graph, n_obs)

Connect disconnected components with synthetic bridge edges.

create_similarity_graph(data)

dense_from_sparse_edges_fill1(D_sparse)

Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.

fit(X[, y])

Fit the model on feature data or a precomputed graph.

fit_coresg(X, m_list[, coresg_kwargs])

Build graph-derived distances and run CoreSGHDBSCAN on them.

fit_from_distance_matrix(D)

Build CORE-SG from a precomputed distance matrix D (NxN).

fit_predict(X[, y, m, c])

Fit the model and return cluster labels.

graph_metric(u, v)

Custom distance metric that uses the precomputed sparse distance dictionary.

interactive_condensed_tree([figsize])

Create an interactive condensed tree explorer across fitted min_samples values.

is_graph_connected(graph)

labels_for(m[, no_noise, c])

Return labels for a selected min_samples value.

model(min_samples)

plot_condensed_tree(m[, figsize])

Plot the condensed tree for a selected min_samples value.

reassign_noise_via_mst(mst_graph, labels0[, c])

Reassign noise labels by propagating labels over a precomputed MST.

run([cluster_selection_method, ...])

Run Core-SG clustering for all requested min_samples values.

similarity_to_dissimilarity(similarity_graph)

similarity_to_dissimilarity_sparse(...)

Attributes

A_knn_

D_

N_

X_

dst_no_self_

dst_with_self_

edges_ut_

eps

idx_no_self_

idx_with_self_

kmax_

metric

min_cluster_size

save_models

min_samples_list

core_

msts_

mst_times_

models_

condensed_trees_

labels_by_m_

times_

compute_similarity_sparse(graph)[source]

Fast weighted structural similarity as a sparse matrix.

This is algebraically equivalent to the original compute_similarity implementation, but avoids Python-level all-pairs iteration. The weighted adjacency vector for each node includes an explicit self-loop of weight 1 before cosine normalization.

Return type:

scipy.sparse.csr_matrix
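
The computation described above (adjacency vectors with a self-loop of weight 1, compared by cosine similarity) can be sketched in plain Python. This is an illustration only: the function name is hypothetical, the graph is a dict of dicts rather than a sparse matrix, and scoring only existing edges is an assumption that the text above does not confirm.

```python
import math


def weighted_structural_similarity(adj):
    """adj maps node -> {neighbor: weight} for an undirected graph."""
    # Each node's vector is its weighted adjacency row plus a self-loop of weight 1.
    vectors = {u: {**nbrs, u: 1.0} for u, nbrs in adj.items()}
    norms = {u: math.sqrt(sum(w * w for w in vec.values()))
             for u, vec in vectors.items()}
    sim = {}
    for u, nbrs in adj.items():
        for v in nbrs:  # assumption: only pairs joined by an edge are scored
            dot = sum(w * vectors[v].get(k, 0.0) for k, w in vectors[u].items())
            sim[(u, v)] = dot / (norms[u] * norms[v])
    return sim


# Path graph 0 - 1 - 2 with unit weights.
path = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
sim = weighted_structural_similarity(path)
```

For the path graph, nodes 0 and 1 share two nonzero coordinates (their own self-loops and the edge between them), so their similarity is 2 / sqrt(2 * 3).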

compute_similarity(graph)[source]

Backward-compatible NetworkX wrapper over the sparse implementation.

static similarity_to_dissimilarity_sparse(similarity_matrix)[source]

Parameters:

similarity_matrix (scipy.sparse.csr_matrix)

Return type:

scipy.sparse.csr_matrix

static similarity_to_dissimilarity(similarity_graph)[source]

static is_graph_connected(graph)[source]

create_similarity_graph(data)[source]

connect_graph_heuristically(graph, n_obs)[source]

Connect disconnected components with synthetic bridge edges.

This function assumes graph is already a dissimilarity graph:

  • smaller weight = closer

  • larger weight = farther

It does not rebuild the similarity graph. It only adds bridge edges of distance weight 1 between disconnected components.
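
A minimal sketch of bridging disconnected components with edges of distance weight 1. The helper names are hypothetical, and the choice of which nodes to join (the first node found in each consecutive pair of components) is an assumption; the method's actual bridge selection is not documented here.

```python
def connected_components(adj):
    """Return the components of an undirected graph given as node -> {neighbor: weight}."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components


def add_bridge_edges(adj, bridge_weight=1.0):
    """Return a copy of adj in which consecutive components are joined by one edge."""
    out = {u: dict(nbrs) for u, nbrs in adj.items()}  # leave the input untouched
    comps = connected_components(out)
    for a, b in zip(comps, comps[1:]):
        u, v = a[0], b[0]  # assumption: any representative node may carry the bridge
        out[u][v] = bridge_weight
        out[v][u] = bridge_weight
    return out


# Three components: {0, 1}, {2, 3} and the isolated node 4.
graph = {0: {1: 0.2}, 1: {0: 0.2}, 2: {3: 0.4}, 3: {2: 0.4}, 4: {}}
bridged = add_bridge_edges(graph)
```

Joining k components this way adds k - 1 edges, the minimum needed to make the graph connected.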

static compute_full_distance_matrix(graph)[source]

Compute the full dense matrix of shortest path distances using Floyd–Warshall.
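
For reference, the Floyd–Warshall recurrence can be sketched in plain Python. The function below is illustrative and takes an edge list on integer nodes 0..n-1 rather than a networkx.Graph; the name and input format are assumptions.

```python
import math


def floyd_warshall(n, edges):
    """edges: iterable of (u, v, weight) for an undirected graph on nodes 0..n-1."""
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Allow paths through intermediate node k, one k at a time.
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            for j in range(n):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]
    return dist


D = floyd_warshall(4, [(0, 1, 0.5), (1, 2, 0.25), (0, 2, 1.0), (2, 3, 1.0)])
```

The algorithm takes O(n^3) time and O(n^2) memory, so the dense result is practical only for moderately sized graphs.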

static compute_sparse_distance_dict(graph)[source]

Compute a dictionary-of-dictionaries of shortest path distances. For each node, run single_source_dijkstra_path_length and store the results.
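
The per-node Dijkstra pattern can be sketched with the standard library alone. The code below is a stand-in for calling networkx's single_source_dijkstra_path_length once per node; the helper names and the dict-of-dicts graph format are assumptions.

```python
import heapq


def dijkstra_lengths(adj, source):
    """Shortest path lengths from source; adj maps node -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def all_pairs_distance_dict(adj):
    """Dictionary of dictionaries: result[u][v] is the distance from u to v."""
    return {u: dijkstra_lengths(adj, u) for u in adj}


adj = {
    0: {1: 0.5, 2: 1.0},
    1: {0: 0.5, 2: 0.25},
    2: {0: 1.0, 1: 0.25, 3: 1.0},
    3: {2: 1.0},
    4: {},  # isolated node
}
dd = all_pairs_distance_dict(adj)
```

Unreachable pairs are simply absent from the inner dictionaries, which is what keeps this representation sparse on disconnected graphs.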

graph_metric(u, v)[source]

Custom distance metric that uses the precomputed sparse distance dictionary. The data points are mapped to node indices using self._point_to_index.

static compute_custom_distance_matrix(graph)[source]

Compute the pairwise distance matrix used by the graph-based pipeline.

Parameters:

graph – Input graph for which pairwise distances are computed.

Returns:

Pairwise distance matrix.

Return type:

numpy.ndarray

static dense_from_sparse_edges_fill1(D_sparse)[source]

Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.

Non-edges are filled with 1, diagonal with 0, and sparse entries overwrite the corresponding distances.

Parameters:

D_sparse (scipy.sparse.csr_matrix)

Return type:

numpy.ndarray
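
The fill rule above can be sketched without NumPy or SciPy. In this illustration the sparse input is a dict of (i, j) -> distance and the result is a nested list; both are stand-ins for the scipy.sparse.csr_matrix input and numpy.ndarray output, and the function name is hypothetical.

```python
def dense_fill1(n, entries):
    """entries maps (i, j) -> distance for the stored sparse entries."""
    # Start with 1 for every non-edge and 0 on the diagonal.
    dense = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    # Stored entries overwrite the defaults. Only the given (i, j) positions
    # are written, so a symmetric result needs both directions in the input.
    for (i, j), d in entries.items():
        dense[i][j] = d
    return dense


dense = dense_fill1(3, {(0, 1): 0.2, (1, 0): 0.2})
```

Filling non-edges with 1 matches the bridge-edge weight used by connect_graph_heuristically.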

static reassign_noise_via_mst(mst_graph, labels0, c=5)[source]

Reassign noise labels by propagating labels over a precomputed MST.

Parameters:
  • mst_graph (networkx.Graph) – Minimum spanning tree of the final connected WSS graph.

  • labels0 (ndarray) – Initial labels with noise marked as -1.

  • c (int, default=5) – Number of largest edge weights to keep in the lexicographic path signature during propagation.
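
One possible reading of the propagation is sketched below; it is not the library's implementation. It assumes that a path's signature is the tuple of its c largest edge weights in descending order, that signatures are compared lexicographically, and that each noise node takes the label reachable with the smallest signature. The function name and the dict-of-dicts MST format are hypothetical.

```python
import heapq


def propagate_noise_labels(mst_adj, labels0, c=5):
    """mst_adj maps node index -> {neighbor: weight}; labels0 marks noise as -1."""
    out = list(labels0)
    # Every labeled node starts as a source with an empty path signature.
    heap = [((), node, lab) for node, lab in enumerate(labels0) if lab != -1]
    heapq.heapify(heap)
    done = set()
    while heap:
        sig, u, lab = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        out[u] = lab
        for v, w in mst_adj[u].items():
            if v in done or labels0[v] != -1:
                continue  # labels only flow into noise nodes
            # Keep the c largest weights seen along the path, largest first.
            new_sig = tuple(sorted(sig + (w,), reverse=True)[:c])
            heapq.heappush(heap, (new_sig, v, lab))
    return out


# Path 0 - 1 - 2 - 3 with a heavy middle edge; nodes 1 and 2 start as noise.
mst = {0: {1: 1.0}, 1: {0: 1.0, 2: 5.0}, 2: {1: 5.0, 3: 1.0}, 3: {2: 1.0}}
labels0 = [0, -1, -1, 1]
relabeled = propagate_noise_labels(mst, labels0)
```

In this example the heavy edge between nodes 1 and 2 acts as the boundary: node 1 joins cluster 0 and node 2 joins cluster 1. Noise nodes with no labeled node in their tree keep the label -1.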

fit(X, y=None)[source]

Fit the model on feature data or a precomputed graph.

Parameters:

X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix when sim_graph_method is not "precomputed". In "precomputed" mode, this may be a networkx.Graph, a SciPy sparse adjacency matrix, or a square dense adjacency matrix.

Returns:

self – Fitted estimator.

Return type:

GraphCoreSGHDBSCAN

fit_predict(X, y=None, m=None, c=5, **fit_params)[source]

Fit the model and return cluster labels.

Parameters:

X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix or supported precomputed graph representation.

Returns:

Cluster labels for the fitted solution.

Return type:

numpy.ndarray

fit_coresg(X, m_list, coresg_kwargs=None)[source]

Build graph-derived distances and run CoreSGHDBSCAN on them.

labels_for(m, no_noise=None, c=5)[source]

Return labels for a selected min_samples value.

Parameters:
  • m (int) – Selected min_samples value.

  • no_noise (bool or None, optional) – If True, apply MST-based noise reassignment. If None, use the instance-level no_noise setting.

  • c (int, default=5) – Length of the tie-breaking path signature used during noise reassignment, i.e. the number of largest edge weights kept (see reassign_noise_via_mst).

Returns:

Cluster labels for the requested fitted solution.

Return type:

numpy.ndarray

Notes

labels_by_m_[m] stores the directly fitted labels. labels_for(m) may additionally apply noise reassignment.

plot_condensed_tree(m, figsize=(10, 6), **kwargs)[source]

Plot the condensed tree for a selected min_samples value.

Parameters:
  • m (int) – The min_samples value whose condensed tree should be displayed.

  • figsize (tuple of float, optional) – Figure size passed to Matplotlib, by default (10, 6).

  • **kwargs – Additional keyword arguments forwarded to CondensedTree.plot().

Returns:

Displays the condensed tree plot.

Return type:

None

Raises:
  • ValueError – If the model has not been fitted yet.

  • KeyError – If the requested m is not available in the stored results.

Notes

This method first looks for the condensed tree in self.coresg_.condensed_trees_. If it is not found there, it falls back to self.coresg_.models_[m].condensed_tree_ when full models have been saved.
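
The lookup order in the note above can be sketched as follows. find_condensed_tree is a hypothetical helper, and the stand-in objects only mimic the condensed_trees_, models_ and condensed_tree_ attributes named in this section.

```python
from types import SimpleNamespace


def find_condensed_tree(coresg, m):
    """Return the condensed tree for min_samples=m, preferring stored trees."""
    if coresg is None:
        raise ValueError("The model has not been fitted yet.")
    trees = getattr(coresg, "condensed_trees_", None) or {}
    if m in trees:
        return trees[m]
    # Fall back to full models, which exist only when save_models=True.
    models = getattr(coresg, "models_", None) or {}
    if m in models:
        return models[m].condensed_tree_
    raise KeyError(m)


# Stand-in for a fitted object: one stored tree and one saved model.
fitted = SimpleNamespace(
    condensed_trees_={5: "tree for m=5"},
    models_={10: SimpleNamespace(condensed_tree_="tree for m=10")},
)
from_trees = find_condensed_tree(fitted, 5)
from_models = find_condensed_tree(fitted, 10)
```

The two error paths mirror the Raises entries above: ValueError before fitting, KeyError for a value of m with no stored result.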

Examples

>>> g.fit(X)
>>> g.plot_condensed_tree(10)

interactive_condensed_tree(figsize=(10, 6))[source]

Create an interactive condensed tree explorer across fitted min_samples values.

Parameters:

figsize (tuple of float, optional) – Figure size passed to Matplotlib for each displayed condensed tree, by default (10, 6).

Returns:

A selection slider widget for browsing condensed trees across available min_samples values.

Return type:

ipywidgets.Widget

Raises:

Notes

This method is intended for use in an interactive Jupyter environment. It uses the stored condensed trees in self.coresg_.condensed_trees_ and falls back to any available entries in self.coresg_.models_.

Examples

>>> g.fit(X)
>>> widget = g.interactive_condensed_tree()