Overview¶

coresg-graphhdbscan is a Python package for density-based clustering on similarity graphs derived from feature-vector data or directly provided by the user.

The package combines two main ideas:

construction of a weighted graph that reflects local similarity structure
application of a specialized CoreSG-HDBSCAN* pipeline designed to operate on graph data representations

This package is designed for settings where clustering directly on feature vectors is unsuitable, for example when Euclidean geometry becomes uninformative in very high-dimensional spaces. In such cases, clustering is instead guided by a learned or hand-crafted similarity graph, which serves as an intrinsically lower-dimensional representation of the data.

Main features¶

The package currently supports the following capabilities:

graph-based clustering through GraphCoreSGHDBSCAN
multiple min_samples values in a single model run
three graph-construction backends plus a precomputed graph mode
support for multiple distance metrics during similarity graph construction, including euclidean, cosine, correlation, manhattan, jaccard, minkowski, mahalanobis, seuclidean, and the package-specific hybrid_euclidean_cosine
support for metric_kwds when a distance metric requires additional arguments
optional relabeling of noise points by density-based label propagation
compatibility with NetworkX graphs, dense adjacency matrices, and sparse adjacency matrices in precomputed mode
access to HDBSCAN*-style outputs such as labels, probabilities, and condensed trees
direct access to stored labels, condensed trees, and optional per-m models
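
The noise-relabeling feature listed above can be pictured with a small sketch. This is not the package's implementation: it is a simplified stand-in in which each noise point repeatedly takes the label of its most similar already-labeled neighbor. The function name propagate_labels, the dictionary-based graph format, and the use of -1 as the noise label are assumptions made for illustration only.

```python
def propagate_labels(similarity, labels, noise=-1):
    """Give each noise node the label of its most similar labeled neighbor.

    similarity: dict mapping node -> {neighbor: similarity weight}
    labels: dict mapping node -> cluster label (noise marked with `noise`)
    Hypothetical helper for illustration; not part of coresg-graphhdbscan.
    """
    labels = dict(labels)  # work on a copy, leave the input untouched
    while True:
        updates = {}
        for node, label in labels.items():
            if label != noise:
                continue
            # Collect (weight, label) pairs for neighbors that already have a label.
            candidates = [
                (weight, labels[nbr])
                for nbr, weight in similarity.get(node, {}).items()
                if labels[nbr] != noise
            ]
            if candidates:
                # Highest similarity wins; ties fall back to the larger label.
                updates[node] = max(candidates)[1]
        if not updates:
            # Nothing changed: remaining noise nodes have no labeled neighbor.
            return labels
        labels.update(updates)


# Toy graph: "c" touches both clusters, "d" is reachable only through "c".
similarity = {
    "a": {"b": 0.9},
    "b": {"a": 0.9, "c": 0.4},
    "c": {"b": 0.4, "d": 0.2, "e": 0.8},
    "d": {"c": 0.2},
    "e": {"c": 0.8},
}
labels = {"a": 0, "b": 0, "c": -1, "d": -1, "e": 1}
relabeled = propagate_labels(similarity, labels)
```

In this toy graph, "c" joins cluster 1 in the first pass because its strongest labeled neighbor is "e", and "d" follows in the second pass once "c" has a label.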

Package structure¶

The package is centered around two public classes:

CoreSGHDBSCAN: The lower-level CoreSG implementation operating on feature vectors or distance matrices.
GraphCoreSGHDBSCAN: The graph-oriented wrapper that constructs a similarity graph, converts it to a weighted structural dissimilarity graph, and then runs CoreSGHDBSCAN.

For most users, GraphCoreSGHDBSCAN is the main entry point.

When to use this package¶

This package is especially useful when:

you work with very high-dimensional data, where a simple metric such as Euclidean distance alone is not the best description of local structure
a similarity graph is more meaningful than a raw feature-space view
you want to compare several min_samples values efficiently in one run
you want HDBSCAN*-style hierarchical clustering behavior on top of a graph-based representation
you already have a graph or adjacency matrix and want to cluster directly from it
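
The first point in the list above, that Euclidean distance alone describes local structure poorly in very high dimensions, can be illustrated with a small standard-library experiment. The sketch below is independent of the package: it draws uniform random points and measures the relative contrast (d_max - d_min) / d_min between the farthest and nearest neighbor of a query point. The function name relative_contrast and the chosen sample sizes are illustrative assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query point."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# Compare a low-dimensional and a high-dimensional setting.
low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"relative contrast in 2 dimensions:    {low:.3f}")
print(f"relative contrast in 1000 dimensions: {high:.3f}")
```

The contrast is much smaller in 1000 dimensions than in 2, meaning the nearest and farthest points sit at almost the same Euclidean distance. This is the situation in which a similarity graph can carry more useful neighborhood information than raw distances.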

Typical workflow¶

A typical graph-based clustering workflow in this package is:

construct or provide a similarity graph
convert it into a weighted structural similarity graph
convert similarity to dissimilarity
ensure graph connectivity
run CoreSGHDBSCAN for one or more min_samples values
inspect the condensed tree and choose a solution
optionally reassign noise points if no_noise=True
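
The graph-preparation steps of this workflow (building a similarity graph, converting similarity to dissimilarity, and ensuring connectivity) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's pipeline: the similarity 1 / (1 + distance), the 1 - s conversion, and the naive bridging of components are all choices made here for clarity, and every function name below is hypothetical.

```python
import math


def knn_similarity_graph(points, k):
    """Symmetric k-nearest-neighbor graph with similarity 1 / (1 + distance)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        others = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in others[:k]:
            sim = 1.0 / (1.0 + d)
            graph[i][j] = sim
            graph[j][i] = sim  # keep the graph undirected
    return graph


def to_dissimilarity(graph):
    """Map each similarity s in (0, 1] to the dissimilarity 1 - s."""
    return {i: {j: 1.0 - s for j, s in nbrs.items()} for i, nbrs in graph.items()}


def connected_components(graph):
    """Return the connected components as a list of sorted node lists."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            comp.append(node)
            for nbr in graph[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(sorted(comp))
    return components


def connect_components(graph, bridge_weight):
    """Chain consecutive components together with one bridging edge each."""
    comps = connected_components(graph)
    for left, right in zip(comps, comps[1:]):
        a, b = left[0], right[0]
        graph[a][b] = bridge_weight
        graph[b][a] = bridge_weight
    return graph


# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
sim = knn_similarity_graph(points, k=2)
dis = to_dissimilarity(sim)
n_before = len(connected_components(dis))
dis = connect_components(dis, bridge_weight=1.0)  # 1.0 = maximal dissimilarity
n_after = len(connected_components(dis))
```

With k=2 the two groups form separate components, so a bridging edge with the largest possible dissimilarity is added to obtain a single connected graph before any hierarchy is built.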

Typical entry point¶

For most users, the main entry point is:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

A simple starting example is:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=15,
    no_noise=True,
    heuristic_connect=False,
)

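# X is the feature-vector data to cluster.
# fit_predict fits the model and returns the labels in one call,
# so a separate model.fit(X) beforehand is redundant.
labels = model.fit_predict(X)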

Stored results after fitting¶

After fitting, the package stores results for each tested min_samples value.

The main user-facing object is GraphCoreSGHDBSCAN. Internally, it runs the CoreSG engine across one or more min_samples values and stores the results separately for each value (the per-m results).

The most important fitted result containers are:

labels_by_m_: dictionary mapping each fitted min_samples value to its stored cluster labels.
condensed_trees_: dictionary mapping each fitted min_samples value to its condensed tree object.
models_: dictionary mapping each fitted min_samples value to a saved model object when save_models=True.

This makes it possible to inspect a fitted solution directly without re-running the model.

Typical post-fit access looks like:

# g is a GraphCoreSGHDBSCAN instance whose min_samples values include 10
g.fit(X)

labels_10 = g.labels_by_m_[10]
tree_10 = g.condensed_trees_[10]

# only available when save_models=True
model_10 = g.models_[10]
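
Because labels_by_m_ is a dictionary keyed by min_samples, comparing solutions can be reduced to a loop over its items. The sketch below summarizes the number of clusters and the noise fraction for each value. It runs on a hand-written stand-in dictionary rather than a fitted model, and it assumes that noise is labeled -1 (the usual HDBSCAN* convention, not confirmed here for this package). The helper name summarize_labels is hypothetical.

```python
from collections import Counter


def summarize_labels(labels_by_m, noise=-1):
    """Return {m: (n_clusters, noise_fraction)} for each min_samples value.

    Assumes noise points carry the label `noise` (-1 by assumption).
    """
    summary = {}
    for m, labels in sorted(labels_by_m.items()):
        labels = [int(x) for x in labels]  # accepts lists or array-like labels
        counts = Counter(labels)
        n_noise = counts.pop(noise, 0)  # remove noise before counting clusters
        summary[m] = (len(counts), n_noise / len(labels))
    return summary


# Stand-in for g.labels_by_m_ after fitting with two min_samples values.
fake_labels_by_m = {
    5: [0, 0, 1, 1, 2, -1],
    10: [0, 0, 0, 1, -1, -1],
}
summary = summarize_labels(fake_labels_by_m)
for m, (n_clusters, noise_fraction) in summary.items():
    print(f"min_samples={m}: {n_clusters} clusters, {noise_fraction:.0%} noise")
```

With a fitted model, the same helper could be called as summarize_labels(g.labels_by_m_), provided the noise-label assumption holds.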