Overview¶
coresg-graphhdbscan is a Python package for density-based clustering on similarity graphs derived from feature-vector data or directly provided by the user.
The package combines two main ideas:
construction of a weighted graph that reflects local similarity structure
application of a specialized CoreSG-HDBSCAN* pipeline designed to operate on graph data representations
This package is designed for settings where clustering directly on feature vectors works poorly, for example when Euclidean distances become uninformative in very high-dimensional spaces. In such cases, clustering can instead be guided by a learned or hand-crafted similarity graph, which serves as an intrinsically lower-dimensional representation of the data.
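The high-dimensional motivation can be illustrated with a small experiment that is independent of this package. For uniformly random points, the relative gap between the farthest and the nearest point to a query tends to shrink as the dimension grows, so raw Euclidean distance says less about local structure. The sketch below uses only the standard library; the function name `relative_contrast`, the sample size, and the seed are illustrative choices, not part of the package.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    query point to n_points uniform random points in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Compare a low-dimensional and a high-dimensional setting.
contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=1000)
print(f"dim=2: {contrast_low:.2f}, dim=1000: {contrast_high:.2f}")
```

A small contrast in high dimensions means that nearest and farthest neighbors look almost equally distant, which is one reason to prefer a similarity graph over raw distances.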
Main features¶
The package currently supports the following capabilities:
graph-based clustering through GraphCoreSGHDBSCAN
multiple min_samples values in a single model run
three graph-construction backends plus a precomputed graph mode
support for multiple distance metrics during similarity graph construction, including euclidean, cosine, correlation, manhattan, jaccard, minkowski, mahalanobis, seuclidean, and the package-specific hybrid_euclidean_cosine
support for metric_kwds when a distance metric requires additional arguments
optional relabeling of noise points by density-based label propagation
compatibility with NetworkX graphs, dense adjacency matrices, and sparse adjacency matrices in precomputed mode
access to HDBSCAN*-style outputs such as labels, probabilities, and condensed trees
direct access to stored labels, condensed trees, and optional per-m models
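The choice of metric can change which points end up as neighbors in the similarity graph. The following sketch, written with the standard library only, compares Euclidean, Manhattan, and cosine distance on a toy example. The helper functions are illustrative and are not the package's internal implementations; the package-specific hybrid_euclidean_cosine metric is not reproduced here.

```python
import math


def euclidean(u, v):
    return math.dist(u, v)


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, candidates, metric):
    """Return the name of the candidate closest to query under metric."""
    return min(candidates, key=lambda name: metric(query, candidates[name]))


query = [1.0, 0.0]
candidates = {"same_direction": [10.0, 0.0], "nearby": [0.0, 1.0]}

# The nearest neighbor depends on the metric used to build the graph.
nearest_euclidean = nearest(query, candidates, euclidean)
nearest_manhattan = nearest(query, candidates, manhattan)
nearest_cosine = nearest(query, candidates, cosine_distance)
print(nearest_euclidean, nearest_manhattan, nearest_cosine)
```

Euclidean and Manhattan distance favor the point that is close in absolute position, while cosine distance favors the point with the same direction regardless of magnitude.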
Package structure¶
The package is centered around two public classes:
CoreSGHDBSCAN: the lower-level CoreSG implementation operating on feature vectors or distance matrices.
GraphCoreSGHDBSCAN: the graph-oriented wrapper that constructs a similarity graph, converts it to a weighted structural dissimilarity graph, and then runs CoreSGHDBSCAN.
For most users, GraphCoreSGHDBSCAN is the main entry point.
When to use this package¶
This package is especially useful when:
you work with very high-dimensional data, where a simple metric such as Euclidean distance alone is not the best description of local structure
a similarity graph is more meaningful than a raw feature-space view
you want to compare several min_samples values efficiently in one run
you want HDBSCAN*-style hierarchical clustering behavior on top of a graph-based representation
you already have a graph or adjacency matrix and want to cluster directly from it
Typical workflow¶
A typical graph-based clustering workflow in this package is:
construct or provide a similarity graph
convert it into a weighted structural similarity graph
convert similarity to dissimilarity
ensure graph connectivity
run CoreSGHDBSCAN for one or more min_samples values
inspect the condensed tree and choose a solution
optionally reassign noise points if no_noise=True
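Two of the steps above, converting similarity to dissimilarity and ensuring graph connectivity, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the mapping d = 1 - s, the bridging rule, and all function names are hypothetical stand-ins, and the package's actual conversion and connection strategy may differ.

```python
def to_dissimilarity(sim_edges):
    """Map similarity weights in [0, 1] to dissimilarities via d = 1 - s.
    The mapping is an assumption made for illustration."""
    return {edge: 1.0 - s for edge, s in sim_edges.items()}


def connected_components(n_nodes, edges):
    """Return the components as sorted lists of nodes, using union-find."""
    parent = list(range(n_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_v] = root_u
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def ensure_connected(n_nodes, dis_edges, bridge_weight=1.0):
    """Link consecutive components with one edge of weight bridge_weight
    so that the graph becomes connected (hypothetical bridging rule)."""
    components = connected_components(n_nodes, dis_edges)
    connected = dict(dis_edges)
    for left, right in zip(components, components[1:]):
        connected[(left[0], right[0])] = bridge_weight
    return connected


# Toy similarity graph with two components: {0, 1, 2} and {3, 4}.
similarity = {(0, 1): 0.9, (1, 2): 0.6, (3, 4): 0.8}
dissimilarity = to_dissimilarity(similarity)
graph = ensure_connected(5, dissimilarity)
print(len(connected_components(5, graph)))
```

Bridging with the largest possible dissimilarity keeps the added edges from creating artificially dense regions, which is why the sketch uses a weight of 1.0 for them.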
Typical entry point¶
For most users, the main entry point is:
from coresg_graphhdbscan import GraphCoreSGHDBSCAN
A simple starting example is:
from coresg_graphhdbscan import GraphCoreSGHDBSCAN
model = GraphCoreSGHDBSCAN(
min_samples=10,
sim_graph_method="sc_umap",
metric="euclidean",
n_neighbors=15,
no_noise=True,
heuristic_connect=False,
)
# X: feature matrix of shape (n_samples, n_features)
labels = model.fit_predict(X)  # fits the model once and returns the cluster labels
Stored results after fitting¶
After fitting, the package stores results for each tested
min_samples value.
The main user-facing object is GraphCoreSGHDBSCAN. Internally, it
runs a Core-SG engine across one or more min_samples values and
stores per-m results.
The most important fitted result containers are:
labels_by_m_: dictionary mapping each fitted min_samples value to its stored cluster labels.
condensed_trees_: dictionary mapping each fitted min_samples value to its condensed tree object.
models_: dictionary mapping each fitted min_samples value to a saved model object when save_models=True.
This makes it possible to inspect a fitted solution directly without re-running the model.
Typical post-fit access looks like:
g.fit(X)
labels_10 = g.labels_by_m_[10]
tree_10 = g.condensed_trees_[10]
# only available when save_models=True
model_10 = g.models_[10]
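Because the labels are stored per min_samples value, solutions can be compared without refitting. The sketch below summarizes a dictionary shaped like labels_by_m_ by counting clusters and noise points for each value. The helper `summarize_labels` is hypothetical, the example labels are made up, and the use of -1 as the noise label follows the usual HDBSCAN* convention, which is assumed rather than confirmed for this package.

```python
from collections import Counter


def summarize_labels(labels_by_m, noise_label=-1):
    """Return {m: (n_clusters, n_noise)} for each min_samples value m."""
    summary = {}
    for m, labels in sorted(labels_by_m.items()):
        counts = Counter(labels)
        n_noise = counts.pop(noise_label, 0)  # noise is not a cluster
        summary[m] = (len(counts), n_noise)
    return summary


# Stand-in for g.labels_by_m_ after fitting; the values are made up.
example_labels_by_m = {
    5: [0, 0, 1, 1, 2, -1],
    10: [0, 0, 0, 1, 1, -1],
    20: [0, 0, 0, 0, -1, -1],
}

summary = summarize_labels(example_labels_by_m)
for m, (n_clusters, n_noise) in summary.items():
    print(f"min_samples={m}: {n_clusters} clusters, {n_noise} noise points")
```

With a fitted model, the same helper could be called as summarize_labels(g.labels_by_m_); when no_noise=True is used, the noise count would be expected to be zero.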