Installation and quick start¶

This page explains how to install coresg-graphhdbscan and run a first clustering example.

Installation¶

Local installation¶

Install the package from a local checkout in editable mode:

git clone https://github.com/Campello-Lab/GraphHDBSCAN.git
cd GraphHDBSCAN
pip install -e .

This is the most convenient option during development because changes in the source tree are picked up without reinstalling the package each time.

GitHub installation¶

If you want to install directly from the repository:

pip install git+https://github.com/Campello-Lab/GraphHDBSCAN.git

Typical dependencies¶

Depending on the selected graph backend, the following libraries may be required:

numpy
scipy
scikit-learn
networkx
hdbscan
scanpy (for sc_gauss and sc_umap)
scanpy.external, a submodule of scanpy (for jaccard_phenograph)

These dependencies are normally handled through the package installation, but they are useful to know when setting up a fresh environment or troubleshooting imports.
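When setting up a fresh environment, a quick way to see which of the libraries above are present is to ask Python whether each import name can be found. The sketch below uses only the standard library; the list of import names is taken from the dependency list above (scikit-learn is imported as sklearn), and the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec

# Import names for the libraries listed above. Note that scikit-learn is
# imported as "sklearn"; scanpy is only needed for some graph backends.
DEPENDENCIES = ["numpy", "scipy", "sklearn", "networkx", "hdbscan", "scanpy"]


def missing_modules(names):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(DEPENDENCIES)
    print("missing:", ", ".join(missing) if missing else "none")
```

A module that is found can still fail at import time, for example because of a binary mismatch, so this check only rules out the simplest case.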

Recommended environment setup¶

A clean Python environment is recommended, especially when working with scientific Python packages.

For example:

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .

If you use Conda, an equivalent workflow is:

conda create -n graphhdbscan python=3.11
conda activate graphhdbscan
pip install -e .
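To confirm that the interpreter you are running actually belongs to the environment you just created, you can inspect it from Python. This is a minimal sketch: it relies on the standard sys.prefix / sys.base_prefix comparison for venv and on the CONDA_DEFAULT_ENV variable that conda sets on activation; the function name active_environment is hypothetical.

```python
import os
import sys


def active_environment():
    """Describe the isolated environment the interpreter runs in, if any."""
    # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
    if sys.prefix != sys.base_prefix:
        return f"venv at {sys.prefix}"
    # `conda activate` exports the name of the active environment.
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda environment '{conda_env}'"
    return None


env = active_environment()
print(env if env else "no virtual environment detected")
```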

Minimal import test¶

After installation, verify that the package imports correctly:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

If this import works, the package is installed and the public entry point is available.
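For an editable install it can also help to check where the module is loaded from, since the path should normally point into your local checkout rather than into site-packages. The sketch below uses only importlib from the standard library; module_location is a hypothetical helper, and the exact path reported depends on how your installer implements editable mode.

```python
from importlib.util import find_spec


def module_location(name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = find_spec(name)
    return None if spec is None else spec.origin


location = module_location("coresg_graphhdbscan")
print(location or "coresg_graphhdbscan is not installed in this environment")
```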

Quick start¶

Simple example¶

A minimal clustering workflow looks like this:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=15,
    no_noise=True,
    heuristic_connect=False,
    save_models=False,
)

# X is the input feature data; fit_predict fits the model and returns the labels
labels = model.fit_predict(X)

This example uses:

min_samples=10 as a default clustering setting
sim_graph_method="sc_umap" as the default graph builder
metric="euclidean" as the default metric strategy
n_neighbors=15 as the default local graph size
no_noise=True to reassign noise points after clustering
save_models=False to avoid storing full per-min_samples model objects
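The example above assumes that X already exists. If you only want to try the workflow, a small synthetic dataset is enough. The sketch below builds one with NumPy, assuming the usual array layout of shape (n_samples, n_features); the blob centers and sizes are arbitrary illustrative choices, not values recommended by the package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three Gaussian blobs of 100 points each in two dimensions.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers])

print(X.shape)  # (300, 2)
```

The resulting X can then be passed to fit or fit_predict as in the example.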

Single `min_samples` value¶

If you want one clustering solution only:

model = GraphCoreSGHDBSCAN(min_samples=10)
labels = model.fit_predict(X)

Multiple `min_samples` values¶

One strength of the package is that you can fit several min_samples values in a single run and inspect them later.

model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
model.fit(X)

labels_5 = model.labels_for(5)
labels_10 = model.labels_for(10)
labels_15 = model.labels_for(15)

This is useful when you want to compare clustering solutions across several density settings without repeating the full workflow from scratch.
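One way to compare the resulting solutions is to measure how strongly each pair of labelings agrees. The sketch below uses the adjusted Rand index from scikit-learn, which is invariant to how clusters are numbered. The label arrays are stand-ins so the block runs on its own; in practice you would fill the dictionary from model.labels_for(...). The helper name pairwise_agreement is hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import adjusted_rand_score


def pairwise_agreement(labels_by_setting):
    """Adjusted Rand index for every pair of labelings in a dict."""
    return {
        (a, b): adjusted_rand_score(labels_by_setting[a], labels_by_setting[b])
        for a, b in combinations(sorted(labels_by_setting), 2)
    }


# Stand-in label arrays; in practice use model.labels_for(5) and so on.
solutions = {
    5: np.array([0, 0, 1, 1, 2, 2]),
    10: np.array([0, 0, 1, 1, 1, 1]),
    15: np.array([1, 1, 0, 0, 0, 0]),
}
for pair, score in pairwise_agreement(solutions).items():
    print(pair, round(score, 3))
```

A score of 1.0 means two settings produced the same partition up to relabeling; lower scores indicate that the clustering changes between those settings.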

Inspecting stored results¶

After fitting, the model stores the labels and condensed tree for each fitted min_samples value in the labels_by_m_ and condensed_trees_ attributes, keyed by that value.

model = GraphCoreSGHDBSCAN(
    min_samples=range(2, 20),
    sim_graph_method="sc_gauss",
    metric="euclidean",
    n_neighbors=16,
    save_models=False,
)
model.fit(X)

labels_10 = model.labels_by_m_[10]
tree_10 = model.condensed_trees_[10]

If you want full saved per-min_samples model objects as well, enable save_models=True:

model = GraphCoreSGHDBSCAN(
    min_samples=range(2, 20),
    sim_graph_method="sc_gauss",
    metric="euclidean",
    n_neighbors=16,
    save_models=True,
)
model.fit(X)

labels_10 = model.labels_by_m_[10]
tree_10 = model.condensed_trees_[10]
model_10 = model.models_[10]

labels_by_m_[m] stores the directly fitted labels. By contrast, labels_for(m) may apply post-processing depending on the no_noise setting.
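To see how much the post-processing changes a solution, you can count the points that are noise in the directly fitted labels but belong to a cluster in the post-processed labels. The sketch below assumes that noise is marked with the label -1, as in the hdbscan library, and that both label sets are array-like; neither assumption is confirmed by this page. The arrays are stand-ins and count_reassigned is a hypothetical helper.

```python
import numpy as np


def count_reassigned(raw_labels, final_labels, noise_label=-1):
    """Count points that are noise in `raw_labels` but clustered in `final_labels`."""
    raw = np.asarray(raw_labels)
    final = np.asarray(final_labels)
    if raw.shape != final.shape:
        raise ValueError("label arrays must have the same shape")
    return int(np.sum((raw == noise_label) & (final != noise_label)))


# Stand-in arrays; in practice compare model.labels_by_m_[m] with model.labels_for(m).
raw = np.array([0, 0, -1, 1, -1, 1])
final = np.array([0, 0, 0, 1, 1, 1])
print(count_reassigned(raw, final))  # 2
```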

Precomputed graph input¶

If you already have a graph or adjacency representation, use sim_graph_method="precomputed":

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="precomputed",
    no_noise=True,
)
model.fit(my_graph)

In precomputed mode, the input to fit(...) may be:

a networkx.Graph
a SciPy sparse adjacency matrix
a square dense adjacency matrix
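The three input forms listed above can all describe the same graph. The sketch below builds a small undirected, weighted graph as a dense matrix and converts it to the other two representations using standard NumPy, SciPy, and networkx calls. Whether the package interprets edge weights as similarities or distances is not stated here, so the weights are purely illustrative.

```python
import networkx as nx
import numpy as np
from scipy import sparse

# A small undirected, weighted graph as a square dense adjacency matrix.
dense = np.array(
    [
        [0.0, 1.0, 0.5, 0.0],
        [1.0, 0.0, 0.0, 0.2],
        [0.5, 0.0, 0.0, 0.8],
        [0.0, 0.2, 0.8, 0.0],
    ]
)

sparse_adj = sparse.csr_matrix(dense)  # SciPy sparse adjacency matrix
graph = nx.from_numpy_array(dense)     # networkx.Graph with "weight" edge attributes

# Basic sanity checks: square shape and symmetry (undirected graph).
assert dense.shape[0] == dense.shape[1]
assert np.allclose(dense, dense.T)
print(graph.number_of_nodes(), graph.number_of_edges(), sparse_adj.nnz)
```

Any of dense, sparse_adj, or graph could then play the role of my_graph in the example above.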

Inspecting the hierarchy¶

After fitting, you can inspect the hierarchical clustering structure through the condensed tree:

model.fit(X)
model.plot_condensed_tree(10)

If you fit multiple min_samples values, pass the one you want to inspect as the argument, as with plot_condensed_tree(10) above.

You can also browse condensed trees interactively in a notebook environment:

model.fit(X)
model.interactive_condensed_tree()

Typical workflow¶

A common workflow is:

install the package in a clean environment
import GraphCoreSGHDBSCAN
choose a graph builder and metric
fit the model on feature data or a precomputed graph
inspect the condensed tree
retrieve labels for the selected min_samples value

A good exploratory run looks like:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

g = GraphCoreSGHDBSCAN(
    min_samples=range(2, 20),
    sim_graph_method="sc_gauss",
    n_neighbors=16,
    no_noise=True,
    metric="euclidean",
    heuristic_connect=True,
    save_models=True,
)

g.fit(X)
g.plot_condensed_tree(18)
labels_18 = g.labels_for(18)
tree_18 = g.condensed_trees_[18]
model_18 = g.models_[18]

Troubleshooting installation¶

If imports fail, the cause is usually one of these:

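the environment is missing one of the scientific dependencies listed above
binary packages such as NumPy, SciPy, or scikit-learn are installed in mutually incompatible versions
optional graph-construction dependencies, such as scanpy for the sc_gauss and sc_umap backends, are not installed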

Helpful check:

python -c "from coresg_graphhdbscan import GraphCoreSGHDBSCAN; print('ok')"
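When a binary mismatch is suspected, it helps to record the installed versions of the packages involved before changing anything. The sketch below reads them from package metadata with the standard library, without importing the packages themselves, so it still works when an import would crash. The helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in distributions:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, ver in installed_versions(["numpy", "scipy", "scikit-learn"]).items():
    print(f"{name}: {ver or 'not installed'}")
```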