Parameter selection¶

This page explains how to choose the main parameters of GraphCoreSGHDBSCAN in practice.

For most users, the most important decisions are:

min_samples
sim_graph_method
metric
n_neighbors
heuristic_connect
no_noise
min_cluster_size
save_models
similarity_backend

Advanced users may also use metric_kwds when the selected distance metric requires additional arguments.

Start here¶

A good default starting point for many datasets is:

from coresg_graphhdbscan import GraphCoreSGHDBSCAN

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=15,
    no_noise=True,
    heuristic_connect=False,
)

A simple way to think about the main settings is:

use min_samples to control the smoothness of the density estimates
use sim_graph_method to choose how the similarity graph is built
use metric to choose the geometry used during graph construction
use n_neighbors to control local graph density
use heuristic_connect to decide how disconnected graphs are handled
use no_noise to decide whether noise points should be reassigned
use save_models to decide whether full per-min_samples model objects should be stored
use similarity_backend to choose whether accelerated graph-construction backends are used when available

Constructor¶

The public constructor is:

GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="euclidean",
    metric_kwds=None,
    add_neighbor=True,
    no_noise=True,
    n_neighbors=15,
    heuristic_connect=False,
    min_cluster_size=None,
    save_models=False,
    similarity_backend="auto",
    **kwargs
)

At-a-glance reference¶

Parameter	Default	Practical meaning
`min_samples`	`10`	Controls the smoothness of the density estimates. Accepts a single value or an iterable of values to fit in one run.
`sim_graph_method`	`"sc_umap"`	Chooses how the similarity graph is built.
`metric`	`"euclidean"`	Chooses the distance metric used during similarity graph construction.
`metric_kwds`	`None`	Optional keyword arguments passed to the selected distance metric.
`add_neighbor`	`True`	Controls how weighted structural similarity is expanded into graph edges.
`no_noise`	`True`	Reassigns points initially labeled `-1` after clustering.
`n_neighbors`	`15`	Controls local graph density.
`heuristic_connect`	`False`	Chooses how originally disconnected graphs are handled.
`min_cluster_size`	`None`	(Optional) Minimum cluster size in the clustering stage.
`save_models`	`False`	Stores full saved models for each fitted `min_samples` value.
`similarity_backend`	`"auto"`	Chooses the backend used for similarity graph construction when alternative implementations are available.

How to choose each parameter¶

`min_samples`¶

Default: 10

This is the main clustering hyperparameter. It may be:

a single integer, such as 10
an iterable of integers, such as [5, 10, 15] or range(2, 10)

Internally, the package converts it into a list of values used by CoreSGHDBSCAN.

Examples:

min_samples=10 gives [10]
min_samples=7 gives [7]
min_samples=[5, 10, 15] gives [5, 10, 15]
min_samples=range(2, 10) gives [2, 3, 4, 5, 6, 7, 8, 9]
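
The conversion shown in these examples can be pictured with a small helper. This is a minimal sketch, not the package's internal code: the name `normalize_min_samples` is hypothetical, and the real implementation may validate or order the values differently.

```python
from collections.abc import Iterable


def normalize_min_samples(min_samples):
    """Return min_samples as a list of ints (hypothetical helper, for illustration)."""
    # A single integer becomes a one-element list.
    if isinstance(min_samples, int):
        return [min_samples]
    # Any other iterable (list, tuple, range, ...) is materialized as a list.
    if isinstance(min_samples, Iterable):
        return [int(m) for m in min_samples]
    raise TypeError("min_samples must be an int or an iterable of ints")


print(normalize_min_samples(10))
print(normalize_min_samples([5, 10, 15]))
print(normalize_min_samples(range(2, 10)))
```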

Practical interpretation:

smaller values usually produce finer, more local cluster structure
larger values usually produce more conservative and more stable clusters
multiple values are useful when you want to compare density settings in one run

Recommended workflow:

Start with 10.
If clusters seem too coarse, try smaller values.
If clusters seem unstable or fragmented, try larger values.
When in doubt, fit several values and compare the condensed trees.

Example:

model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
model.fit(X)

labels_5 = model.labels_for(5)
labels_10 = model.labels_for(10)
labels_15 = model.labels_for(15)
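
When comparing several settings, a quick summary of each labeling often helps before looking at the condensed trees. The sketch below uses made-up label lists in place of the fitted results, and `summarize_labels` is a hypothetical helper, not part of the package. Noise counts are only informative when noise labels (`-1`) are preserved in the labels you inspect.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (number of clusters, fraction of points labeled -1)."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 marks noise and is not a cluster
    n_clusters = len(counts)
    return n_clusters, n_noise / len(labels)


# Made-up labelings standing in for the labels obtained at three min_samples values.
labels_by_setting = {
    5: [0, 0, 1, 1, 2, 2, -1, 3, 3, -1],
    10: [0, 0, 0, 0, 1, 1, -1, 1, 1, -1],
    15: [0, 0, 0, 0, 0, 0, -1, -1, -1, -1],
}

for m, labels in labels_by_setting.items():
    n_clusters, noise_fraction = summarize_labels(labels)
    print(f"min_samples={m}: {n_clusters} clusters, {noise_fraction:.0%} noise")
```

With real results, array-like labels can be converted with `list(...)` before being passed to the helper.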

`sim_graph_method`¶

Default: "sc_umap"

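This parameter chooses the method used to construct the similarity graph. It is separate from similarity_backend, which selects the implementation used for a given method.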

Supported values are:

"sc_gauss"
"sc_umap"
"jaccard_phenograph"
"precomputed"

Choosing a method:

sc_umap: Good default choice. Uses Scanpy’s UMAP-style connectivity routine.
sc_gauss: Useful when you want Scanpy’s Gaussian connectivity construction.
jaccard_phenograph: Useful when you want a PhenoGraph-style Jaccard neighborhood graph. The backend used for this graph can be controlled with similarity_backend. With similarity_backend="auto", the package uses the accelerated numba backend when available and otherwise falls back to the default PhenoGraph-based path.
precomputed: Use this when you already have a graph or adjacency representation and do not want the package to build a graph from raw features.

Supported inputs in precomputed mode:

a networkx.Graph
a SciPy sparse adjacency matrix
a square dense adjacency matrix

When using "precomputed", the input to fit(...) is treated as an already constructed graph representation rather than raw feature data.
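
As an illustration of the dense-matrix case, the sketch below builds a square, symmetric adjacency matrix from a weighted edge list using plain Python lists. The helper names are hypothetical, and whether the package expects a NumPy array rather than nested lists is an assumption to check; converting with `numpy.asarray` before calling `fit(...)` is the likely route.

```python
def edges_to_adjacency(n_nodes, weighted_edges):
    """Build a symmetric n_nodes x n_nodes adjacency matrix as nested lists."""
    adjacency = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, weight in weighted_edges:
        if i == j:
            continue  # self-loops are ignored in this sketch
        adjacency[i][j] = weight
        adjacency[j][i] = weight  # mirror the entry so the graph is undirected
    return adjacency


def is_square_and_symmetric(matrix):
    """Check the two basic properties of an undirected adjacency matrix."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))


adjacency = edges_to_adjacency(4, [(0, 1, 0.9), (1, 2, 0.5), (2, 3, 0.7)])
print(is_square_and_symmetric(adjacency))
```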

Practical recommendation:

start with "sc_umap"
try "sc_gauss" if you prefer Gaussian connectivity
use "jaccard_phenograph" for PhenoGraph-style neighborhood structure
use "precomputed" when your graph is part of the experimental design

`similarity_backend`¶

Default: "auto"

This parameter controls which backend is used for similarity graph construction when alternative implementations are available.

Supported values are:

"auto"
"default"
"numba"

Currently, this option mainly affects sim_graph_method="jaccard_phenograph".

auto: Uses the accelerated numba implementation when numba is available. If numba is not available, the package falls back to the default implementation.
default: Uses the original default implementation. For sim_graph_method="jaccard_phenograph", this means using the Scanpy/PhenoGraph graph-construction path.
numba: Always uses the numba-accelerated implementation. For sim_graph_method="jaccard_phenograph", this computes the PhenoGraph-style Jaccard graph using a compiled implementation. If numba is not installed, an import error is raised instead of falling back to the default implementation.

For jaccard_phenograph, the numba backend is designed to reproduce the same PhenoGraph-style undirected Jaccard graph as the default backend, while reducing the time spent in Jaccard graph construction.

The undirected graph is constructed in the same style as PhenoGraph: directed Jaccard weights are computed first, both directions are averaged, and the lower-triangular sparse graph is retained internally before conversion to the package graph representation.
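
The three steps described above (directed weights, averaging both directions, keeping the lower triangle) can be sketched in pure Python. This is an illustrative reading of that description, not the package's or PhenoGraph's actual code: the function name is hypothetical, and the input format (one list of neighbor indices per point, excluding the point itself) is an assumption.

```python
def phenograph_style_jaccard(knn):
    """Return {(row, col): weight} with row > col for a kNN index list."""
    neighbor_sets = [set(neighbors) for neighbors in knn]

    # Step 1: one directed Jaccard weight per (point, neighbor) pair.
    directed = {}
    for i, neighbors in enumerate(knn):
        for j in neighbors:
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            total = len(neighbor_sets[i] | neighbor_sets[j])
            directed[(i, j)] = shared / total if total else 0.0

    # Steps 2 and 3: average both directions (a missing direction counts as 0)
    # and store each pair once, under the lower-triangular key (row > col).
    undirected = {}
    for i, j in directed:
        if i == j:
            continue
        row, col = max(i, j), min(i, j)
        forward = directed.get((row, col), 0.0)
        backward = directed.get((col, row), 0.0)
        undirected[(row, col)] = 0.5 * (forward + backward)
    return undirected


# Four points with k=2; point 3 lists 2 and 0 as neighbors, but neither lists 3.
knn = [[1, 2], [0, 2], [0, 1], [2, 0]]
for edge, weight in sorted(phenograph_style_jaccard(knn).items()):
    print(edge, round(weight, 4))
```

In this toy input, edges that appear in only one direction end up with half the weight of mutual edges, which is the practical effect of the averaging step.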

Practical recommendation:

keep similarity_backend="auto" for normal use
use similarity_backend="default" when you want the original backend for comparison or debugging
use similarity_backend="numba" when you specifically want the accelerated implementation and want an error if numba is unavailable

Example:

model = GraphCoreSGHDBSCAN(
    sim_graph_method="jaccard_phenograph",
    similarity_backend="auto",
    n_neighbors=15,
    metric="euclidean",
)

To force the accelerated backend:

model = GraphCoreSGHDBSCAN(
    sim_graph_method="jaccard_phenograph",
    similarity_backend="numba",
    n_neighbors=15,
)

To force the original PhenoGraph-based backend:

model = GraphCoreSGHDBSCAN(
    sim_graph_method="jaccard_phenograph",
    similarity_backend="default",
    n_neighbors=15,
)

`metric`¶

Default: "euclidean"

This controls the distance metric used during similarity graph construction.

Supported distance metrics are:

"cityblock"
"cosine"
"euclidean"
"l1"
"l2"
"manhattan"
"braycurtis"
"canberra"
"chebyshev"
"correlation"
"dice"
"hamming"
"jaccard"
"mahalanobis"
"minkowski"
"rogerstanimoto"
"russellrao"
"seuclidean"
"sokalmichener"
"sokalsneath"
"sqeuclidean"
"yule"
"hybrid_euclidean_cosine"

Choosing a metric:

euclidean: Default choice for standard continuous feature spaces.
cosine: Useful when angular similarity is more meaningful than raw magnitude.
correlation: Useful when similarity should depend on the shape or pattern of the feature vector rather than absolute scale.
manhattan or l1: Useful when L1 geometry is preferred.
jaccard and other binary metrics: Useful for binary or boolean feature representations.
minkowski: Supports custom p values through metric_kwds.
mahalanobis: Requires an inverse covariance matrix VI through metric_kwds.
seuclidean: Requires a variance vector V through metric_kwds.
hybrid_euclidean_cosine: Package-specific mode. Full pairwise distances remain Euclidean, but neighborhood graph construction uses cosine geometry.
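
To make the difference between magnitude, direction, and pattern concrete, the following self-contained sketch computes Euclidean, cosine, and correlation distances by hand for three toy vectors. The function names are illustrative and independent of the package; they follow the standard textbook definitions.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 minus the cosine of the angle between a and b.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def correlation_distance(a, b):
    # Cosine distance after centering each vector on its own mean.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])


base = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]      # same direction, larger magnitude
shifted = [11.0, 12.0, 13.0]  # same pattern, different offset

for name, other in [("scaled", scaled), ("shifted", shifted)]:
    print(
        name,
        "euclidean:", round(euclidean_distance(base, other), 3),
        "cosine:", round(cosine_distance(base, other), 3),
        "correlation:", round(correlation_distance(base, other), 3),
    )
```

Cosine distance ignores the rescaling but not the offset, while correlation distance ignores both; Euclidean distance is sensitive to each.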

Practical recommendation:

use "euclidean" as a default starting point
use "cosine" or "correlation" when direction or pattern matters more than magnitude
use "minkowski", "mahalanobis", or "seuclidean" only when their assumptions match your data
use "hybrid_euclidean_cosine" when you want Euclidean full distances but cosine-based local neighborhoods

The metric "kulsinski" is not supported because it is not available in current versions of scikit-learn’s pairwise_distances.

The combination metric="yule" with sim_graph_method="sc_gauss" is intentionally not supported because it can produce non-finite graph weights. Use metric="yule" with sim_graph_method="sc_umap" or sim_graph_method="jaccard_phenograph" instead.

Examples:

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="correlation",
    n_neighbors=15,
)

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="minkowski",
    metric_kwds={"p": 1.5},
    n_neighbors=15,
)

import numpy as np

VI = np.linalg.pinv(np.cov(X, rowvar=False))

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="mahalanobis",
    metric_kwds={"VI": VI},
    n_neighbors=15,
)

`metric_kwds`¶

Default: None

This optional dictionary is passed to the selected distance metric during similarity graph construction.

It is mainly needed for metrics that require additional parameters.

Examples:

use metric_kwds={"p": 1.5} with metric="minkowski"
use metric_kwds={"VI": VI} with metric="mahalanobis"
use metric_kwds={"V": V} with metric="seuclidean"

Example:

import numpy as np

V = np.var(X, axis=0, ddof=1)

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="seuclidean",
    metric_kwds={"V": V},
    n_neighbors=15,
)

`n_neighbors`¶

Default: 15

This is the number of neighbors used during similarity graph construction.

Practical interpretation:

smaller values make the graph more local and sparse
larger values make the graph denser and may improve connectivity
increasing this value is often the first thing to try when the graph is too fragmented

Practical recommendation:

start with 15
increase it if connectivity is poor
decrease it if neighborhoods become too broad and the graph looks over-smoothed

Example:

model = GraphCoreSGHDBSCAN(
    sim_graph_method="sc_gauss",
    n_neighbors=20,
)
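
The link between n_neighbors and connectivity can be seen on a toy example. The sketch below is independent of the package: it builds a plain k-nearest-neighbor graph on 1-D points and counts connected components for two values of k. The helper names are hypothetical, and the package's own graph builders weight and prune edges differently.

```python
def knn_edges(points, k):
    """Undirected edge set of a k-nearest-neighbor graph on 1-D points."""
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: abs(points[j] - p),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def count_components(n_nodes, edges):
    """Count connected components with a small union-find."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(x) for x in range(n_nodes)})


# Two well-separated groups of three points each.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
for k in (2, 3):
    n_components = count_components(len(points), knn_edges(points, k))
    print(f"k={k}: {n_components} component(s)")
```

With k=2 every neighbor lies inside the same group, so the graph stays split; with k=3 each point must reach across the gap, which joins the two groups.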

`add_neighbor`¶

Default: True

This controls how weighted structural similarity is expanded into graph edges.

When enabled, an edge may still be added even when two nodes do not already share a direct edge, as long as their weighted structural similarity is greater than zero.

Practical recommendation:

keep the default unless you are specifically studying graph-construction behavior
change it only when you want to examine the effect of this edge-expansion step

`heuristic_connect`¶

Default: False

The final graph used for clustering must be connected. This parameter controls how originally disconnected graphs are handled.

heuristic_connect=False: Default behavior. If the graph has multiple connected components, the package connects consecutive components by adding edges with maximum distance, equivalent to weight 1 in the dissimilarity graph.
heuristic_connect=True: The package repeatedly increases n_neighbors until the graph becomes connected.

Example fitting log:

Trying n_neighbors = 16
Trying n_neighbors = 17

Practical recommendation:

use False when you want a simple and predictable fallback
use True when you prefer connectivity to come from a denser neighborhood graph rather than from synthetic bridge edges
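
The heuristic_connect=False fallback can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper names are hypothetical, components are ordered by their smallest node index, and the bridge is placed between the first node of each consecutive pair of components. Which nodes the package actually joins is not specified here.

```python
def connected_components(n_nodes, edges):
    """Return components as sorted node lists, ordered by smallest node index."""
    adjacency = {node: set() for node in range(n_nodes)}
    for i, j, _ in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                stack.append(neighbor)
        components.append(sorted(component))
    return components


def bridge_components(n_nodes, edges, bridge_weight=1.0):
    """Join consecutive components with synthetic maximum-dissimilarity edges."""
    components = connected_components(n_nodes, edges)
    bridged = list(edges)
    for left, right in zip(components, components[1:]):
        # Assumption: connect the first node of each component.
        bridged.append((left[0], right[0], bridge_weight))
    return bridged


# Dissimilarity edges (i, j, weight) forming three separate components.
edges = [(0, 1, 0.2), (2, 3, 0.4), (4, 5, 0.1)]
bridged = bridge_components(6, edges)
print(len(connected_components(6, edges)), "->", len(connected_components(6, bridged)))
```

Because the bridge edges carry the maximum dissimilarity, they make the graph connected without suggesting that the joined components are close.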

`no_noise`¶

Default: True

If enabled, points initially labeled as -1 are reassigned by an MST-based label propagation step after clustering.

Conceptually, this post-processing step:

builds a mutual-reachability view from graph-derived distances and core distances
computes an MST
propagates labels from labeled points to unlabeled points in increasing edge-weight order
resolves competition using a top-c path comparison rule
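
The steps above can be sketched in simplified form. The code below is a hedged illustration, not the package's implementation: all function names are hypothetical, the core distances are supplied by hand, and the top-c path comparison rule is not reproduced. Instead, each unlabeled point simply takes the label reached through the lightest available MST edge.

```python
def mutual_reachability(distances, core_distances):
    """max(core_i, core_j, d_ij) for every pair, 0 on the diagonal."""
    n = len(distances)
    return [
        [
            max(core_distances[i], core_distances[j], distances[i][j]) if i != j else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric matrix; returns (weight, i, j) edges."""
    n = len(weights)
    in_tree = [False] * n
    in_tree[0] = True
    best = [(weights[0][j], 0) for j in range(n)]
    edges = []
    for _ in range(n - 1):
        nxt = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k][0])
        edges.append((best[nxt][0], best[nxt][1], nxt))
        in_tree[nxt] = True
        for k in range(n):
            if not in_tree[k] and weights[nxt][k] < best[k][0]:
                best[k] = (weights[nxt][k], nxt)
    return edges


def propagate_labels(labels, mst_edges):
    """Repeatedly label across the lightest edge with exactly one labeled end."""
    labels = list(labels)
    ordered = sorted(mst_edges)
    while True:
        for _, i, j in ordered:
            if (labels[i] == -1) != (labels[j] == -1):
                if labels[i] == -1:
                    labels[i] = labels[j]
                else:
                    labels[j] = labels[i]
                break  # restart from the lightest edge
        else:
            return labels  # no edge can propagate a label any further


positions = [0.0, 1.0, 2.0, 5.0, 10.0, 11.0, 12.0]
distances = [[abs(a - b) for b in positions] for a in positions]
core_distances = [1.0] * len(positions)  # assumed values for the example
initial = [0, 0, 0, -1, 1, 1, 1]

mst = minimum_spanning_tree(mutual_reachability(distances, core_distances))
print(propagate_labels(initial, mst))
```

In this toy input the noise point at position 5 is closer to the first cluster in mutual-reachability terms, so it inherits that cluster's label.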

Practical recommendation:

use True if you prefer a full assignment with no final noise labels
use False if you want to preserve the original HDBSCAN*-style noise behavior

`min_cluster_size`¶

Default: None

This is the minimum cluster size used in the clustering stage.

When left as None, the package uses the selected min_samples value for each run, so the effective minimum cluster size becomes m for each fitted min_samples = m.

If you set min_cluster_size explicitly, that fixed value is used for all selected min_samples values.

Practical recommendation:

leave it as None if you want cluster size to track min_samples
set it explicitly if you want a fixed minimum cluster size independent of the selected min_samples values
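
The rule described in this section can be written as a short sketch. The helper `effective_min_cluster_sizes` is hypothetical and only mirrors the behavior stated above; it is not part of the package.

```python
def effective_min_cluster_sizes(min_samples_values, min_cluster_size=None):
    """Map each fitted min_samples value m to the minimum cluster size used for it."""
    return {
        # None means "track min_samples"; otherwise the fixed value applies to every run.
        m: (m if min_cluster_size is None else min_cluster_size)
        for m in min_samples_values
    }


print(effective_min_cluster_sizes([5, 10, 15]))
print(effective_min_cluster_sizes([5, 10, 15], min_cluster_size=25))
```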

`save_models`¶

Default: False

This controls whether full per-min_samples model objects are stored after fitting.

save_models=False: The package still stores labels and condensed trees for each fitted min_samples value, but it does not keep full saved model objects.
save_models=True: The package also stores full per-m models in models_.

Practical recommendation:

use False if you mainly want labels and condensed trees with lower memory usage
use True if you want direct access to saved per-m model objects

Example:

model = GraphCoreSGHDBSCAN(
    min_samples=range(2, 20),
    sim_graph_method="sc_gauss",
    metric="euclidean",
    save_models=True,
)
model.fit(X)

labels_10 = model.labels_by_m_[10]
tree_10 = model.condensed_trees_[10]
model_10 = model.models_[10]

Notes:

labels_by_m_ and condensed_trees_ are available after fitting regardless of save_models.
models_ is mainly useful when you want to inspect the full saved result object for a specific min_samples value.

Practical selection workflow¶

A useful tuning order is:

choose sim_graph_method based on how you want the graph to be built
choose metric based on the geometry that makes sense for your data
start with n_neighbors=15
tune min_samples
decide whether you want no_noise=True
only then adjust heuristic_connect and add_neighbor if needed

A good exploratory run looks like:

g = GraphCoreSGHDBSCAN(
    min_samples=range(2, 20),
    sim_graph_method="sc_gauss",
    n_neighbors=16,
    no_noise=True,
    metric="euclidean",
    heuristic_connect=True,
    save_models=True,
)
g.fit(X)

After fitting several values, the package stores results by min_samples value. Labels are available in labels_by_m_, condensed trees are available in condensed_trees_, and full saved models are available in models_ when save_models=True.

Then inspect the hierarchy and choose a specific solution:

g.plot_condensed_tree(4)
labels_18 = g.labels_for(18)
tree_18 = g.condensed_trees_[18]
model_18 = g.models_[18]

Ready-to-use presets¶

Default baseline¶

model = GraphCoreSGHDBSCAN()
labels = model.fit_predict(X)

More conservative clustering¶

model = GraphCoreSGHDBSCAN(
    min_samples=20,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=20,
)

Finer local structure¶

model = GraphCoreSGHDBSCAN(
    min_samples=5,
    sim_graph_method="sc_umap",
    metric="euclidean",
    n_neighbors=12,
)

Cosine-based graph construction¶

model = GraphCoreSGHDBSCAN(
    min_samples=[5, 10],
    sim_graph_method="sc_gauss",
    metric="cosine",
    n_neighbors=20,
)
model.fit(X)
labels_10 = model.labels_for(10)

Correlation-based graph construction¶

model = GraphCoreSGHDBSCAN(
    min_samples=[5, 10],
    sim_graph_method="sc_umap",
    metric="correlation",
    n_neighbors=20,
)

model.fit(X)
labels_10 = model.labels_for(10)

Minkowski distance with custom p¶

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="sc_umap",
    metric="minkowski",
    metric_kwds={"p": 1.5},
    n_neighbors=15,
)

model.fit(X)

Hybrid Euclidean-cosine mode¶

model = GraphCoreSGHDBSCAN(
    min_samples=range(2, 10),
    sim_graph_method="sc_umap",
    metric="hybrid_euclidean_cosine",
    n_neighbors=16,
)
model.fit(X)

Precomputed graph input¶

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="precomputed",
    no_noise=True,
)
model.fit(my_graph)

PhenoGraph-style Jaccard graph¶

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="jaccard_phenograph",
    similarity_backend="auto",
    metric="euclidean",
    n_neighbors=15,
)
model.fit(X)

For reproducibility checks against the original backend, use:

model = GraphCoreSGHDBSCAN(
    min_samples=10,
    sim_graph_method="jaccard_phenograph",
    similarity_backend="default",
    n_neighbors=15,
)

Troubleshooting by symptom¶

Too many tiny clusters¶

Try:

increasing min_samples
increasing n_neighbors
using metric="euclidean" if cosine-based neighborhoods are too fine

Clusters are too coarse¶

Try:

decreasing min_samples
decreasing n_neighbors
checking whether no_noise=True is absorbing points you would rather keep as noise

Graph is disconnected¶

Try:

increasing n_neighbors
setting heuristic_connect=True
checking whether the selected metric is making neighborhoods too sparse

Too many noise points¶

Try:

lowering min_samples
increasing n_neighbors
using no_noise=True if a full assignment is desired

Jaccard graph construction is slow¶

Try:

using similarity_backend="auto" or similarity_backend="numba"
reducing n_neighbors if the neighborhood graph is unnecessarily dense
using similarity_backend="default" only when you need the original PhenoGraph-based backend for comparison or debugging

Practical notes¶

If the graph is disconnected and heuristic_connect=False, the package connects components with synthetic edges of weight 1. This is simple and effective, but it is a design choice worth reporting in experiments.
min_cluster_size=None means that the package sets the minimum cluster size equal to each selected min_samples value.
When several min_samples values are passed, fit once and retrieve labels later for the requested value.
Some graph builders depend on optional packages and will raise a clear import error if those packages are not installed.
labels_by_m_[m] stores the directly fitted labels for a selected min_samples value.
labels_for(m) may additionally apply noise reassignment depending on the no_noise setting.
condensed_trees_[m] gives direct access to the condensed tree for a selected min_samples value.
models_[m] is available when save_models=True.
similarity_backend="auto" uses accelerated similarity-graph construction when available. Currently this mainly affects sim_graph_method="jaccard_phenograph".
The numba backend may have a one-time compilation cost on first use, but can substantially reduce the time spent constructing PhenoGraph-style Jaccard graphs for larger datasets.
