coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN
- class coresg_graphhdbscan.graph.GraphCoreSGHDBSCAN(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)
Bases: CoreSGHDBSCAN
Graph-based CoreSG + HDBSCAN interface.
This class constructs a similarity graph from feature data or accepts a precomputed graph representation, transforms that graph into a graph-derived distance representation, and then runs the CoreSG-HDBSCAN clustering pipeline.
- Parameters:
min_samples (int or iterable of int, default=10) – Main clustering hyperparameter. A single integer gives one fitted solution, while an iterable allows fitting multiple values in one run.
sim_graph_method ({"sc_umap", "sc_gauss", "jaccard_phenograph", "precomputed"}, default="sc_umap") – Graph-construction backend.
metric (str, callable, or None, default="euclidean") – Distance metric used during similarity graph construction. Supported string metrics are “cityblock”, “cosine”, “euclidean”, “l1”, “l2”, “manhattan”, “braycurtis”, “canberra”, “chebyshev”, “correlation”, “dice”, “hamming”, “jaccard”, “mahalanobis”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, “yule”, and the package-specific “hybrid_euclidean_cosine”. The metric “kulsinski” is not supported. The combination metric="yule" with sim_graph_method="sc_gauss" is also not supported because it can produce non-finite graph weights.
metric_kwds (dict or None, default=None) – Additional keyword arguments passed to the selected distance metric.
add_neighbor (bool, default=True) – Controls how weighted structural similarity is expanded into graph edges.
no_noise (bool, default=True) – If True, points initially labeled as noise are reassigned after clustering.
n_neighbors (int, default=15) – Number of neighbors used during graph construction.
heuristic_connect (bool, default=False) – If True, increase n_neighbors until the WSS dissimilarity graph becomes connected, except in precomputed mode, where bridge edges are used instead. If False, disconnected components are connected with synthetic bridge edges.
min_cluster_size (int or None, default=None) – Minimum cluster size used in the clustering stage. If None, the package follows the selected min_samples value for each run.
save_models (bool, default=False) – If True, save the fitted hdbscan model for each min_samples value, which adds some memory overhead. If False, save only the labels and condensed trees for each min_samples value.
**kwargs – Additional keyword arguments passed to internal graph-construction helpers.
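As a minimal sketch of how min_samples and min_cluster_size interact according to the descriptions above, the following resolves one (min_samples, min_cluster_size) pair per run. The helper name resolve_runs is hypothetical and is not part of the package; the package's internal handling may differ.

```python
from collections.abc import Iterable


def resolve_runs(min_samples=10, min_cluster_size=None):
    """Return one (min_samples, min_cluster_size) pair per fitted run.

    Hypothetical helper for illustration only; not part of the package API.
    """
    # Accept a single integer or any iterable of integers.
    if isinstance(min_samples, Iterable) and not isinstance(min_samples, (str, bytes)):
        m_values = [int(m) for m in min_samples]
    else:
        m_values = [int(min_samples)]
    # When min_cluster_size is None, each run follows its own min_samples value.
    return [
        (m, m if min_cluster_size is None else int(min_cluster_size))
        for m in m_values
    ]


runs_single = resolve_runs(10)
runs_multi = resolve_runs([5, 10, 15])
runs_fixed = resolve_runs(range(5, 8), min_cluster_size=20)
```

A single integer yields one run, while an iterable yields one run per value, each paired with the cluster size that would apply to it.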
- similarity_graph_
Initial similarity graph.
- Type:
- similarity_graph_WSS
Weighted structural similarity graph.
- Type:
- dissimilarity_graph_
Graph after conversion from similarity to dissimilarity.
- Type:
- connected_graph_
Final connected graph used by the clustering stage.
- Type:
- dist_matrix_
Dense matrix used by the CoreSG-HDBSCAN pipeline.
- Type:
- coresg_
Internal fitted CoreSG-HDBSCAN object.
- Type:
- models_
Dictionary of saved per-min_samples models. Populated only when save_models=True.
- Type:
- condensed_trees_
Dictionary of condensed tree objects keyed by fitted min_samples value.
- Type:
- __init__(min_samples=10, sim_graph_method='sc_umap', metric='euclidean', metric_kwds=None, add_neighbor=True, no_noise=True, n_neighbors=15, heuristic_connect=False, min_cluster_size=None, save_models=False, similarity_backend='auto', **kwargs)
Methods
__init__([min_samples, sim_graph_method, ...]) – Compute the pairwise distance matrix used by the graph-based pipeline.
compute_full_distance_matrix(graph) – Compute the full dense matrix of shortest path distances using Floyd–Warshall.
compute_similarity(graph) – Backward-compatible NetworkX wrapper over the sparse implementation.
compute_similarity_sparse(graph) – Fast weighted structural similarity as a sparse matrix.
compute_sparse_distance_dict(graph) – Compute a dictionary-of-dictionaries of shortest path distances.
connect_graph_heuristically(graph, n_obs) – Connect disconnected components with synthetic bridge edges.
create_similarity_graph(data)
dense_from_sparse_edges_fill1(D_sparse) – Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.
fit(X[, y]) – Fit the model on feature data or a precomputed graph.
fit_coresg(X, m_list[, coresg_kwargs]) – Build graph-derived distances and run CoreSGHDBSCAN on them.
fit_from_distance_matrix(D) – Build CORE-SG from a precomputed distance matrix D (NxN).
fit_predict(X[, y, m, c]) – Fit the model and return cluster labels.
graph_metric(u, v) – Custom distance metric that uses the precomputed sparse distance dictionary.
interactive_condensed_tree([figsize]) – Create an interactive condensed tree explorer across fitted min_samples values.
is_graph_connected(graph)
labels_for(m[, no_noise, c]) – Return labels for a selected min_samples value.
model(min_samples)
plot_condensed_tree(m[, figsize]) – Plot the condensed tree for a selected min_samples value.
reassign_noise_via_mst(mst_graph, labels0[, c]) – Reassign noise labels by propagating labels over a precomputed MST.
run([cluster_selection_method, ...]) – Run Core-SG clustering for all requested min_samples values.
similarity_to_dissimilarity(similarity_graph)
Attributes
A_knn_
D_
N_
X_
dst_no_self_
dst_with_self_
edges_ut_
eps
idx_no_self_
idx_with_self_
kmax_
metric
min_cluster_size
save_models
min_samples_list
core_msts_
mst_times_
times_
- compute_similarity_sparse(graph)
Fast weighted structural similarity as a sparse matrix.
This is algebraically equivalent to the original compute_similarity implementation, but avoids Python-level all-pairs iteration. The weighted adjacency vector for each node includes an explicit self-loop of weight 1 before cosine normalization.
- Return type:
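To illustrate the weighted structural similarity described for compute_similarity_sparse, here is a small pure-Python sketch: each node's weighted adjacency vector gets a self-loop of weight 1, and the similarity of two nodes is the cosine of their vectors. It assumes an undirected graph given as (u, v, weight) triples and evaluates similarity only for node pairs that share an edge; both are assumptions, and the function name is hypothetical. The package itself works on sparse matrices.

```python
import math


def weighted_structural_similarity(edges, n_nodes):
    """Cosine similarity of weighted adjacency vectors with unit self-loops.

    Illustrative sketch only; evaluating edge pairs only is an assumption.
    """
    # Each node starts with an explicit self-loop of weight 1.
    vectors = [{i: 1.0} for i in range(n_nodes)]
    for u, v, w in edges:
        vectors[u][v] = w
        vectors[v][u] = w
    norms = [math.sqrt(sum(x * x for x in vec.values())) for vec in vectors]

    similarity = {}
    for u, v, _ in edges:
        a, b = vectors[u], vectors[v]
        # Dot product over the entries the two sparse vectors share.
        dot = sum(x * b[k] for k, x in a.items() if k in b)
        similarity[(min(u, v), max(u, v))] = dot / (norms[u] * norms[v])
    return similarity


edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 0.5)]
sim = weighted_structural_similarity(edges, 4)
```

Nodes 0 and 1 have identical neighborhoods once self-loops are included, so their similarity is 1, while the weakly attached node 3 scores lower against node 2.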
- compute_similarity(graph)
Backward-compatible NetworkX wrapper over the sparse implementation.
- static similarity_to_dissimilarity_sparse(similarity_matrix)
- Parameters:
similarity_matrix (scipy.sparse.csr_matrix)
- Return type:
- connect_graph_heuristically(graph, n_obs)
Connect disconnected components with synthetic bridge edges.
This function assumes graph is already a dissimilarity graph:
smaller weight = closer
larger weight = farther
It does not rebuild the similarity graph. It only adds bridge edges of distance weight 1 between disconnected components.
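The bridging step can be sketched in pure Python as follows. The graph is a dict-of-dicts of dissimilarity weights, components are found with a depth-first search, and consecutive components are joined by an edge of weight 1. Which nodes are chosen as bridge endpoints is not stated above, so linking the first node of each component is an assumption, as is the function name.

```python
def connect_components(adj):
    """Join disconnected components with bridge edges of weight 1.

    adj maps node -> {neighbor: dissimilarity}. Returns a new graph and the
    number of components found. Endpoint choice is an illustrative assumption.
    """
    graph = {u: dict(nbrs) for u, nbrs in adj.items()}  # leave the input untouched
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)

    # Add one synthetic bridge between each pair of consecutive components.
    for left, right in zip(components, components[1:]):
        u, v = left[0], right[0]
        graph[u][v] = 1.0
        graph[v][u] = 1.0
    return graph, len(components)


adj = {0: {1: 0.2}, 1: {0: 0.2}, 2: {3: 0.4}, 3: {2: 0.4}, 4: {}}
connected, n_components = connect_components(adj)
```

Existing edges keep their weights; only the bridge edges are new.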
- static compute_full_distance_matrix(graph)
Compute the full dense matrix of shortest path distances using Floyd–Warshall.
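For reference, a minimal pure-Python Floyd–Warshall sketch is shown below. It assumes an undirected graph with nodes numbered 0..n-1 and edges given as (u, v, weight) triples; the function name and input format are illustrative and do not reflect the package's internals.

```python
import math


def floyd_warshall(n, edges):
    """Return the dense n x n matrix of shortest path distances."""
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        # Undirected graph: store each edge in both directions.
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    # Allow each node k in turn as an intermediate stop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
    return dist


dist = floyd_warshall(4, [(0, 1, 0.5), (1, 2, 0.25), (2, 3, 1.0), (0, 3, 2.0)])
```

The triple loop costs O(n^3) time and the result needs O(n^2) memory, which is why a dense all-pairs matrix is practical only for moderately sized graphs.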
- static compute_sparse_distance_dict(graph)
Compute a dictionary-of-dictionaries of shortest path distances. For each node, run single_source_dijkstra_path_length and store the results.
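The dictionary-of-dictionaries layout can be sketched with a standard-library Dijkstra run from every node. This is a stand-in for the NetworkX call named above, not the package's code; the function names and the dict-of-dicts input are assumptions made for illustration.

```python
import heapq


def dijkstra_lengths(adj, source):
    """Shortest path lengths from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def sparse_distance_dict(adj):
    """Run Dijkstra from each node: result[u][v] is the u-to-v distance."""
    return {u: dijkstra_lengths(adj, u) for u in adj}


adj = {
    0: {1: 0.5, 3: 2.0},
    1: {0: 0.5, 2: 0.25},
    2: {1: 0.25, 3: 1.0},
    3: {2: 1.0, 0: 2.0},
}
distances = sparse_distance_dict(adj)
```

Unreachable nodes are simply absent from a node's inner dictionary, which is what keeps the structure sparse on disconnected graphs.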
- graph_metric(u, v)
Custom distance metric that uses the precomputed sparse distance dictionary. The data points are mapped to node indices using self._point_to_index.
- static compute_custom_distance_matrix(graph)
Compute the pairwise distance matrix used by the graph-based pipeline.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input feature matrix.
- Returns:
Pairwise distance matrix.
- Return type:
- static dense_from_sparse_edges_fill1(D_sparse)
Create the dense edge-distance matrix expected by CoreSG/HDBSCAN.
Non-edges are filled with 1, diagonal with 0, and sparse entries overwrite the corresponding distances.
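The fill rule can be illustrated with plain Python lists. Here a dict keyed by (row, column) stands in for the scipy.sparse.csr_matrix input, and the sketch assumes the sparse input has no diagonal entries; the function name is hypothetical.

```python
def dense_fill1(n, sparse_entries):
    """Build an n x n matrix: 0 on the diagonal, 1 for non-edges,
    and the stored distance wherever a sparse entry exists."""
    dense = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for (i, j), distance in sparse_entries.items():
        dense[i][j] = distance  # sparse entries overwrite the default fill
    return dense


# Stand-in for a symmetric sparse distance matrix with two edges.
D_sparse = {(0, 1): 0.2, (1, 0): 0.2, (1, 2): 0.7, (2, 1): 0.7}
D = dense_fill1(3, D_sparse)
```

Nodes 0 and 2 share no edge, so their entry keeps the default distance of 1.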
- Parameters:
D_sparse (scipy.sparse.csr_matrix)
- Return type:
- static reassign_noise_via_mst(mst_graph, labels0, c=5)
Reassign noise labels by propagating labels over a precomputed MST.
- Parameters:
mst_graph (networkx.Graph) – Minimum spanning tree of the final connected WSS graph.
labels0 (ndarray) – Initial labels with noise marked as -1.
c (int, default=5) – Number of largest edge weights to keep in the lexicographic path signature during propagation.
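One possible reading of the path signature is sketched below: for each noise point, walk the MST to the nearest labelled points, summarize each path by its c largest edge weights in descending order, and adopt the label whose signature is lexicographically smallest. The traversal, the tie-breaking, and the function names are assumptions for illustration; the package's actual propagation may differ.

```python
def path_signature(weights, c):
    """The c largest edge weights on a path, in descending order."""
    return tuple(sorted(weights, reverse=True)[:c])


def reassign_noise(mst_adj, labels, c=5):
    """Give each noise point (-1) the label reached by the best path.

    mst_adj maps node -> {neighbor: weight} for a tree over nodes 0..n-1.
    Illustrative sketch only; not the package implementation.
    """
    new_labels = list(labels)
    for start, label in enumerate(labels):
        if label != -1:
            continue
        best = None  # (signature, label) of the best labelled node so far
        stack = [(start, -1, [])]  # (node, parent, edge weights on the path)
        while stack:
            node, parent, weights = stack.pop()
            if labels[node] != -1:
                candidate = (path_signature(weights, c), labels[node])
                if best is None or candidate < best:
                    best = candidate
                continue  # paths beyond a labelled node cannot score better
            for neighbor, w in mst_adj[node].items():
                if neighbor != parent:
                    stack.append((neighbor, node, weights + [w]))
        if best is not None:
            new_labels[start] = best[1]
    return new_labels


mst_adj = {
    0: {1: 0.1},
    1: {0: 0.1, 2: 0.3},
    2: {1: 0.3, 3: 0.9},
    3: {2: 0.9, 4: 0.2},
    4: {3: 0.2},
}
relabelled = reassign_noise(mst_adj, [0, 0, -1, -1, 1], c=5)
```

In this chain, the heavy edge between nodes 2 and 3 separates the two noise points, so each joins the cluster on its own side.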
- fit(X, y=None)
Fit the model on feature data or a precomputed graph.
- Parameters:
X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix when sim_graph_method is not "precomputed". In "precomputed" mode, this may be a networkx.Graph, a SciPy sparse adjacency matrix, or a square dense adjacency matrix.
- Returns:
self – Fitted estimator.
- Return type:
- fit_predict(X, y=None, m=None, c=5, **fit_params)
Fit the model and return cluster labels.
- Parameters:
X (array-like of shape (n_samples, n_features) or graph-like) – Input feature matrix or supported precomputed graph representation.
- Returns:
Cluster labels for the fitted solution.
- Return type:
- fit_coresg(X, m_list, coresg_kwargs=None)
Build graph-derived distances and run CoreSGHDBSCAN on them.
- labels_for(m, no_noise=None, c=5)
Return labels for a selected min_samples value.
- Parameters:
- Returns:
Cluster labels for the requested fitted solution.
- Return type:
Notes
labels_by_m_[m] stores the directly fitted labels. labels_for(m) may additionally apply noise reassignment.
- plot_condensed_tree(m, figsize=(10, 6), **kwargs)
Plot the condensed tree for a selected min_samples value.
- Parameters:
- Returns:
Displays the condensed tree plot.
- Return type:
None
- Raises:
ValueError – If the model has not been fitted yet.
KeyError – If the requested m is not available in the stored results.
Notes
This method first looks for the condensed tree in self.coresg_.condensed_trees_. If it is not found there, it falls back to self.coresg_.models_[m].condensed_tree_ when full models have been saved.
Examples
>>> g.fit(X)
>>> g.plot_condensed_tree(10)
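The lookup order described in the Notes can be sketched as below. The stand-in objects built with SimpleNamespace only mimic the attribute names mentioned above, and find_condensed_tree is a hypothetical helper, not a method of the class.

```python
from types import SimpleNamespace


def find_condensed_tree(coresg, m):
    """Look in condensed_trees_ first, then fall back to models_[m]."""
    if coresg is None:
        raise ValueError("The model has not been fitted yet.")
    trees = getattr(coresg, "condensed_trees_", None) or {}
    if m in trees:
        return trees[m]
    models = getattr(coresg, "models_", None) or {}
    if m in models:
        return models[m].condensed_tree_
    raise KeyError(m)


# Stand-in for a fitted object: m=5 has a stored tree, m=10 only a saved model.
coresg = SimpleNamespace(
    condensed_trees_={5: "tree-5"},
    models_={10: SimpleNamespace(condensed_tree_="tree-10")},
)
tree_direct = find_condensed_tree(coresg, 5)
tree_fallback = find_condensed_tree(coresg, 10)
```

Requesting a value of m present in neither dictionary raises KeyError, matching the Raises section above.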
- interactive_condensed_tree(figsize=(10, 6))
Create an interactive condensed tree explorer across fitted min_samples values.
- Parameters:
figsize (tuple of float, optional) – Figure size passed to Matplotlib for each displayed condensed tree, by default (10, 6).
- Returns:
A selection slider widget for browsing condensed trees across available min_samples values.
- Return type:
ipywidgets.Widget
- Raises:
ImportError – If ipywidgets is not installed.
RuntimeError – If the model has not been fitted yet.
ValueError – If no condensed trees are available.
Notes
This method is intended for use in an interactive Jupyter environment. It uses the stored condensed trees in self.coresg_.condensed_trees_ and falls back to any available entries in self.coresg_.models_.
Examples
>>> g.fit(X)
>>> widget = g.interactive_condensed_tree()