Overview

coresg-graphhdbscan is a Python package for density-based clustering on similarity graphs derived from feature-vector data or directly provided by the user.

The package combines two main ideas:

  1. construction of a weighted graph that reflects local similarity structure

  2. application of a specialized CoreSG-HDBSCAN* pipeline designed to operate on graph data representations
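
The first idea, building a weighted graph from local similarity, can be illustrated with a minimal pure-Python sketch. It is not the package's graph-construction backend: the helper name knn_similarity_graph, the symmetric k-nearest-neighbor rule, and the similarity formula 1 / (1 + distance) are all assumptions chosen for illustration.

```python
import math


def knn_similarity_graph(points, k):
    """Return {node: {neighbor: weight}} for a symmetric k-nearest-neighbor graph.

    Hypothetical helper for illustration only; not part of coresg-graphhdbscan.
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Euclidean distance from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = 1.0 / (1.0 + d)  # assumed similarity in (0, 1]; closer means larger
            # Symmetrize: keep the edge if either endpoint selects the other.
            graph[i][j] = w
            graph[j][i] = w
    return graph


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
graph = knn_similarity_graph(points, k=2)
```

Each node ends up with at least k neighbors, and nearby points receive larger weights than distant ones, which is the local structure a density-based method can then exploit.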

This package is designed for settings where clustering directly on feature vectors is unreliable, for example when Euclidean distances become less informative in very high-dimensional spaces. In such cases, clustering can instead be guided by a learned or hand-crafted similarity graph that serves as an intrinsically lower-dimensional representation of the data.

Main features

The package currently supports the following capabilities:

  • graph-based clustering through GraphCoreSGHDBSCAN

  • multiple min_samples values in a single model run

  • three graph-construction backends plus a precomputed graph mode

  • support for multiple distance metrics during similarity graph construction, including euclidean, cosine, correlation, manhattan, jaccard, minkowski, mahalanobis, seuclidean, and the package-specific hybrid_euclidean_cosine

  • support for metric_kwds when a distance metric requires additional arguments

  • optional relabeling of noise points by density-based label propagation

  • compatibility with NetworkX graphs, dense adjacency matrices, and sparse adjacency matrices in precomputed mode

  • access to HDBSCAN*-style outputs such as labels, probabilities, and condensed trees

  • direct access to stored labels, condensed trees, and optional saved models for each fitted min_samples value
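
To make the noise-relabeling feature more concrete, here is a minimal sketch of one way label propagation over a similarity graph could work. It is an assumption for illustration, not the package's density-based procedure: the helper propagate_noise_labels is hypothetical, noise is assumed to be marked with -1, and each noise node simply adopts the label of its most similar labeled neighbor.

```python
def propagate_noise_labels(sim_graph, labels, noise=-1):
    """Give each noise node the label of its most similar labeled neighbor.

    Hypothetical helper; the package's own propagation rule may differ.
    """
    labels = dict(labels)  # work on a copy so the input is left unchanged
    changed = True
    while changed:
        changed = False
        updates = {}
        for u, lab in labels.items():
            if lab != noise:
                continue
            # Neighbors that already carry a cluster label, with edge weights.
            candidates = [
                (w, v) for v, w in sim_graph[u].items() if labels[v] != noise
            ]
            if candidates:
                _, best = max(candidates)
                updates[u] = labels[best]
        if updates:
            # Apply all updates of this round at once, then iterate again.
            labels.update(updates)
            changed = True
    return labels


# Path graph 0-1-2-3-4 plus an isolated node 5.
sim_graph = {
    0: {1: 0.8},
    1: {0: 0.8, 2: 0.5},
    2: {1: 0.5, 3: 0.9},
    3: {2: 0.9, 4: 0.7},
    4: {3: 0.7},
    5: {},
}
labels = {0: 0, 1: -1, 2: -1, 3: 1, 4: 1, 5: -1}
relabeled = propagate_noise_labels(sim_graph, labels)
```

In this sketch a noise node with no path to any labeled node (node 5) stays noise, which is one reason graph connectivity matters before relabeling.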

Package structure

The package is centered around two public classes:

CoreSGHDBSCAN

The lower-level CoreSG implementation operating on feature vectors or distance matrices.

GraphCoreSGHDBSCAN

The graph-oriented wrapper that constructs a similarity graph, converts it to a weighted structural dissimilarity graph, and then runs CoreSGHDBSCAN.

For most users, GraphCoreSGHDBSCAN is the main entry point.

When to use this package

This package is especially useful when:

  • you work with very high-dimensional data, and a simple metric such as Euclidean distance alone does not describe local structure well

  • a similarity graph is more meaningful than a raw feature-space view

  • you want to compare several min_samples values efficiently in one run

  • you want HDBSCAN*-style hierarchical clustering behavior on top of a graph-based representation

  • you already have a graph or adjacency matrix and want to cluster directly from it

Typical workflow

A typical graph-based clustering workflow in this package is:

  1. construct or provide a similarity graph

  2. convert it into a weighted structural similarity graph

  3. convert similarity to dissimilarity

  4. ensure graph connectivity

  5. run CoreSGHDBSCAN for one or more min_samples values

  6. inspect the condensed tree and choose a solution

  7. reassign noise points to clusters when no_noise=True
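
Steps 3 and 4 of the workflow above can be sketched in plain Python. This is a hedged illustration under stated assumptions rather than the package's implementation: similarities are assumed to lie in [0, 1] and are mapped to 1 - w, and disconnected components are bridged with a single edge of maximal dissimilarity. The helper names are hypothetical.

```python
from collections import deque


def to_dissimilarity(sim_graph):
    """Map each similarity weight w in [0, 1] to the dissimilarity 1 - w (assumed rule)."""
    return {
        u: {v: 1.0 - w for v, w in nbrs.items()}
        for u, nbrs in sim_graph.items()
    }


def connected_components(graph):
    """Return the connected components of an undirected graph as sorted node lists."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:  # breadth-first search from the start node
            u = queue.popleft()
            component.append(u)
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(sorted(component))
    return components


def connect_components(dis_graph, bridge_weight=1.0):
    """Link consecutive components in place with one high-dissimilarity edge each."""
    comps = connected_components(dis_graph)
    for a, b in zip(comps, comps[1:]):
        u, v = a[0], b[0]
        dis_graph[u][v] = bridge_weight
        dis_graph[v][u] = bridge_weight
    return dis_graph


sim = {
    0: {1: 0.9},
    1: {0: 0.9},
    2: {3: 0.6},
    3: {2: 0.6},
}
dis = connect_components(to_dissimilarity(sim))
```

Bridging with the largest possible dissimilarity keeps the graph connected for the hierarchy construction while making the artificial edge the last one a density-based method would merge across.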

Typical entry point

For most users, the main entry point is:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

A simple starting example is:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=15,
    no_noise=True,
    heuristic_connect=False,
)

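# X: the feature-vector data to cluster
# fit_predict fits the model and returns the labels, so a separate fit call is not needed
labels = model.fit_predict(X)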

Stored results after fitting

After fitting, the package stores results for each tested min_samples value.

The main user-facing object is GraphCoreSGHDBSCAN. Internally, it runs the CoreSG engine across one or more min_samples values and stores a separate set of results for each value (the per-m results, where m denotes min_samples).

The most important fitted result containers are:

  • labels_by_m_: dictionary mapping each fitted min_samples value to its stored cluster labels.

  • condensed_trees_: dictionary mapping each fitted min_samples value to its condensed tree object.

  • models_: dictionary mapping each fitted min_samples value to a saved model object when save_models=True.

This makes it possible to inspect a fitted solution directly without re-running the model.

Typical post-fit access looks like:

# g is a GraphCoreSGHDBSCAN instance whose min_samples values include 10
g.fit(X)

labels_10 = g.labels_by_m_[10]
tree_10 = g.condensed_trees_[10]

# only available when save_models=True
model_10 = g.models_[10]
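
When several min_samples values have been fitted, a quick summary of each solution can help with choosing one. The sketch below works on any dictionary shaped like labels_by_m_ and does not call the package itself. The helper summarize_labels_by_m is hypothetical, and it assumes that noise points carry the label -1, following the usual HDBSCAN* convention.

```python
def summarize_labels_by_m(labels_by_m, noise=-1):
    """Return {min_samples: (n_clusters, n_noise)} for a labels_by_m_-style dict.

    Hypothetical helper; assumes noise points are labeled with `noise` (-1).
    """
    summary = {}
    for m, labels in sorted(labels_by_m.items()):
        labels = list(labels)
        n_clusters = len({lab for lab in labels if lab != noise})
        n_noise = sum(1 for lab in labels if lab == noise)
        summary[m] = (n_clusters, n_noise)
    return summary


# Stand-in for g.labels_by_m_ after fitting with min_samples values 5 and 10.
example = {
    5: [0, 0, 1, 1, -1, 2],
    10: [0, 0, 1, 1, -1, -1],
}
summary = summarize_labels_by_m(example)
```

With a fitted model, the same call would be summarize_labels_by_m(g.labels_by_m_), assuming the stored labels are array-like sequences of integers.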