Installation and quick start¶
This page explains how to install coresg-graphhdbscan and run a first
clustering example.
Installation¶
Local installation¶
Install the package from a local checkout in editable mode:
git clone https://github.com/Campello-Lab/GraphHDBSCAN.git
cd GraphHDBSCAN
pip install -e .
This is the most convenient option during development because changes in the source tree are picked up without reinstalling the package each time.
GitHub installation¶
If you want to install directly from the repository:
pip install git+https://github.com/Campello-Lab/GraphHDBSCAN.git
Typical dependencies¶
Depending on the selected graph backend, the following libraries may be required:
numpyscipyscikit-learnnetworkxhdbscanscanpyforsc_gaussandsc_umapscanpy.externalforjaccard_phenograph
These dependencies are normally handled through the package installation, but they are useful to know when setting up a fresh environment or troubleshooting imports.
Recommended environment setup¶
A clean Python environment is recommended, especially when working with scientific Python packages.
For example:
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
If you use Conda, an equivalent workflow is:
conda create -n graphhdbscan python=3.11
conda activate graphhdbscan
pip install -e .
Minimal import test¶
After installation, verify that the package imports correctly:
from coresg_graphhdbscan import GraphCoreSGHDBSCAN
If this import works, the package is installed and the public entry point is available.
Quick start¶
Simple example¶
A minimal clustering workflow looks like this:
from coresg_graphhdbscan import GraphCoreSGHDBSCAN
model = GraphCoreSGHDBSCAN(
min_samples=10,
sim_graph_method="sc_umap",
metric="euclidean",
n_neighbors=15,
no_noise=True,
heuristic_connect=False,
save_models=False,
)
model.fit(X)
labels = model.fit_predict(X)
This example uses:
min_samples=10as a default clustering settingsim_graph_method="sc_umap"as the default graph buildermetric="euclidean"as the default metric strategyn_neighbors=15as the default local graph sizeno_noise=Trueto reassign noise points after clusteringsave_models=Falseto avoid storing full per-min_samplesmodel objects
Single min_samples value¶
If you want one clustering solution only:
model = GraphCoreSGHDBSCAN(min_samples=10)
labels = model.fit_predict(X)
Multiple min_samples values¶
One strength of the package is that you can fit several min_samples values
in a single run and inspect them later.
model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
model.fit(X)
labels_5 = model.labels_for(5)
labels_10 = model.labels_for(10)
labels_15 = model.labels_for(15)
This is useful when you want to compare clustering solutions across several density settings without repeating the full workflow from scratch.
Inspecting stored results¶
After fitting, the package stores labels and condensed trees for each fitted
min_samples value.
model = GraphCoreSGHDBSCAN(
min_samples=range(2, 20),
sim_graph_method="sc_gauss",
metric="euclidean",
n_neighbors=16,
save_models=False,
)
model.fit(X)
labels_10 = model.labels_by_m_[10]
tree_10 = model.condensed_trees_[10]
If you want full saved per-min_samples model objects as well, enable
save_models=True:
model = GraphCoreSGHDBSCAN(
min_samples=range(2, 20),
sim_graph_method="sc_gauss",
metric="euclidean",
n_neighbors=16,
save_models=True,
)
model.fit(X)
labels_10 = model.labels_by_m_[10]
tree_10 = model.condensed_trees_[10]
model_10 = model.models_[10]
labels_by_m_[m] stores the directly fitted labels. By contrast,
labels_for(m) may apply post-processing depending on the no_noise
setting.
Precomputed graph input¶
If you already have a graph or adjacency representation, use
sim_graph_method="precomputed":
model = GraphCoreSGHDBSCAN(
min_samples=10,
sim_graph_method="precomputed",
no_noise=True,
)
model.fit(my_graph)
In precomputed mode, the input to fit(...) may be:
a
networkx.Grapha SciPy sparse adjacency matrix
a square dense adjacency matrix
Inspecting the hierarchy¶
After fitting, you can inspect the hierarchical clustering structure through the condensed tree:
model.fit(X)
model.plot_condensed_tree(10)
If you fit multiple min_samples values, inspect a specific one by passing
the selected value.
You can also browse condensed trees interactively in a notebook environment:
model.fit(X)
model.interactive_condensed_tree()
Typical workflow¶
A common workflow is:
install the package in a clean environment
import
GraphCoreSGHDBSCANchoose a graph builder and metric
fit the model on feature data or a precomputed graph
inspect the condensed tree
retrieve labels for the selected
min_samplesvalue
A good exploratory run looks like:
from coresg_graphhdbscan import GraphCoreSGHDBSCAN
g = GraphCoreSGHDBSCAN(
min_samples=range(2, 20),
sim_graph_method="sc_gauss",
n_neighbors=16,
no_noise=True,
metric="euclidean",
heuristic_connect=True,
save_models=True,
)
g.fit(X)
g.plot_condensed_tree(4)
labels_18 = g.labels_for(18)
tree_18 = g.condensed_trees_[18]
model_18 = g.models_[18]
Troubleshooting installation¶
If imports fail, the cause is usually one of these:
the environment is missing optional scientific dependencies
binary packages such as NumPy, SciPy, or scikit-learn are mismatched
optional graph-construction dependencies are not installed
Helpful check:
python -c "from coresg_graphhdbscan import GraphCoreSGHDBSCAN; print('ok')"