Parameter selection
===================

This page explains how to choose the main parameters of
``GraphCoreSGHDBSCAN`` in practice.

For most users, the most important decisions are:

- ``min_samples``
- ``sim_graph_method``
- ``metric``
- ``n_neighbors``
- ``heuristic_connect``
- ``no_noise``
- ``min_cluster_size``
- ``save_models``
- ``similarity_backend``

Advanced users may also use ``metric_kwds`` when the selected distance
metric requires additional arguments.

Start here
----------

A good default starting point for many datasets is:

.. code-block:: python

   from coresg_graphhdbscan import GraphCoreSGHDBSCAN

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="euclidean",
       n_neighbors=15,
       no_noise=True,
       heuristic_connect=False,
   )

A simple way to think about the main settings is:

- use ``min_samples`` to control the smoothness of the density estimates
- use ``sim_graph_method`` to choose how the similarity graph is built
- use ``metric`` to choose the geometry used during graph construction
- use ``n_neighbors`` to control local graph density
- use ``heuristic_connect`` to decide how disconnected graphs are handled
- use ``no_noise`` to decide whether noise points should be reassigned
- use ``save_models`` to decide whether full per-``min_samples`` model objects should be stored
- use ``similarity_backend`` to choose whether accelerated graph-construction backends are used when available

Constructor
-----------

The public constructor is:

.. code-block:: python

   GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="euclidean",
       metric_kwds=None,
       add_neighbor=True,
       no_noise=True,
       n_neighbors=15,
       heuristic_connect=False,
       min_cluster_size=None,
       save_models=False,
       similarity_backend="auto",
       **kwargs
   )

At-a-glance reference
---------------------

.. list-table::
   :header-rows: 1
   :widths: 18 14 68

   * - Parameter
     - Default
     - Practical meaning
   * - ``min_samples``
     - ``10``
     - Controls the smoothness of the density estimates as a single level or within an entire range.
   * - ``sim_graph_method``
     - ``"sc_umap"``
     - Chooses how the similarity graph is built.
   * - ``metric``
     - ``"euclidean"``
     - Chooses the distance metric used during similarity graph construction.
   * - ``metric_kwds``
     - ``None``
     - Optional keyword arguments passed to the selected distance metric.
   * - ``add_neighbor``
     - ``True``
     - Controls how weighted structural similarity is expanded into graph edges.
   * - ``no_noise``
     - ``True``
     - Reassigns points initially labeled ``-1`` after clustering.
   * - ``n_neighbors``
     - ``15``
     - Controls local graph density.
   * - ``heuristic_connect``
     - ``False``
     - Chooses how originally disconnected graphs are handled.
   * - ``min_cluster_size``
     - ``None``
     - (Optional) Minimum cluster size in the clustering stage.
   * - ``save_models``
     - ``False``
     - Stores full saved models for each fitted ``min_samples`` value.
   * - ``similarity_backend``
     - ``"auto"``
     - Chooses the backend used for similarity graph construction when alternative implementations are available.

How to choose each parameter
----------------------------

``min_samples``
^^^^^^^^^^^^^^^

Default: ``10``

This is the main clustering hyperparameter. It may be:

- a single integer, such as ``10``
- an iterable of integers, such as ``[5, 10, 15]`` or ``range(2, 10)``

Internally, the package converts it into an internal list of values used by
CoreSGHDBSCAN.

Examples:

- ``min_samples=10`` gives ``[10]``
- ``min_samples=7`` gives ``[7]``
- ``min_samples=[5, 10, 15]`` gives ``[5, 10, 15]``
- ``min_samples=range(2, 10)`` gives ``[2, 3, 4, 5, 6, 7, 8, 9]``

Practical interpretation:

- smaller values usually produce finer, more local cluster structure
- larger values usually produce more conservative and more stable clusters
- multiple values are useful when you want to compare density settings in one run

Recommended workflow:

1. Start with ``10``.
2. If clusters seem too coarse, try smaller values.
3. If clusters seem unstable or fragmented, try larger values.
4. When in doubt, fit several values and compare the condensed trees.

Example:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
   model.fit(X)

   labels_5 = model.labels_for(5)
   labels_10 = model.labels_for(10)
   labels_15 = model.labels_for(15)

``sim_graph_method``
^^^^^^^^^^^^^^^^^^^^

Default: ``"sc_umap"``

This parameter chooses the graph-construction backend.

Supported values are:

- ``"sc_gauss"``
- ``"sc_umap"``
- ``"jaccard_phenograph"``
- ``"precomputed"``

Choosing a method:

``sc_umap``
   Good default choice. Uses Scanpy's UMAP-style connectivity routine.

``sc_gauss``
   Useful when you want Scanpy's Gaussian connectivity construction.

``jaccard_phenograph``
   Useful when you want a PhenoGraph-style Jaccard neighborhood graph. The
   backend used for this graph can be controlled with
   ``similarity_backend``. With ``similarity_backend="auto"``, the package uses
   the accelerated ``numba`` backend when available and otherwise falls back to
   the default PhenoGraph-based path.

``precomputed``
   Use this when you already have a graph or adjacency representation and do
   not want the package to build a graph from raw features.

Supported inputs in ``precomputed`` mode:

- a ``networkx.Graph``
- a SciPy sparse adjacency matrix
- a square dense adjacency matrix

When using ``"precomputed"``, the input to ``fit(...)`` is treated as an
already constructed graph representation rather than raw feature data.

Practical recommendation:

- start with ``"sc_umap"``
- try ``"sc_gauss"`` if you prefer Gaussian connectivity
- use ``"jaccard_phenograph"`` for PhenoGraph-style neighborhood structure
- use ``"precomputed"`` when your graph is part of the experimental design

``similarity_backend``
^^^^^^^^^^^^^^^^^^^^^^

Default: ``"auto"``

This parameter controls which backend is used for similarity graph construction
when alternative implementations are available.

Supported values are:

- ``"auto"``
- ``"default"``
- ``"numba"``

Currently, this option mainly affects ``sim_graph_method="jaccard_phenograph"``.

``auto``
   Uses the accelerated ``numba`` implementation when ``numba`` is available.
   If ``numba`` is not available, the package falls back to the default
   implementation.

``default``
   Uses the original default implementation. For
   ``sim_graph_method="jaccard_phenograph"``, this means using the
   Scanpy/PhenoGraph graph-construction path.

``numba``
   Uses the ``numba``-accelerated implementation when available. For
   ``sim_graph_method="jaccard_phenograph"``, this computes the
   PhenoGraph-style Jaccard graph using a compiled implementation. If
   ``numba`` is not installed, an import error is raised.

For ``jaccard_phenograph``, the ``numba`` backend is designed to reproduce the
same PhenoGraph-style undirected Jaccard graph as the default backend, while
reducing the time spent in Jaccard graph construction.

The undirected graph is constructed in the same style as PhenoGraph: directed
Jaccard weights are computed first, both directions are averaged, and the
lower-triangular sparse graph is retained internally before conversion to the
package graph representation.

Practical recommendation:

- keep ``similarity_backend="auto"`` for normal use
- use ``similarity_backend="default"`` when you want the original backend for
  comparison or debugging
- use ``similarity_backend="numba"`` when you specifically want the accelerated
  implementation and want an error if ``numba`` is unavailable

Example:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       sim_graph_method="jaccard_phenograph",
       similarity_backend="auto",
       n_neighbors=15,
       metric="euclidean",
   )

To force the accelerated backend:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       sim_graph_method="jaccard_phenograph",
       similarity_backend="numba",
       n_neighbors=15,
   )

To force the original PhenoGraph-based backend:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       sim_graph_method="jaccard_phenograph",
       similarity_backend="default",
       n_neighbors=15,
   )

``metric``
^^^^^^^^^^

Default: ``"euclidean"``

This controls the distance metric used during similarity graph
construction.

Supported distance metrics are:

- ``"cityblock"``
- ``"cosine"``
- ``"euclidean"``
- ``"l1"``
- ``"l2"``
- ``"manhattan"``
- ``"braycurtis"``
- ``"canberra"``
- ``"chebyshev"``
- ``"correlation"``
- ``"dice"``
- ``"hamming"``
- ``"jaccard"``
- ``"mahalanobis"``
- ``"minkowski"``
- ``"rogerstanimoto"``
- ``"russellrao"``
- ``"seuclidean"``
- ``"sokalmichener"``
- ``"sokalsneath"``
- ``"sqeuclidean"``
- ``"yule"``
- ``"hybrid_euclidean_cosine"``

Choosing a metric:

``euclidean``
    Default choice for standard continuous feature spaces.

``cosine``
    Useful when angular similarity is more meaningful than raw magnitude.

``correlation``
    Useful when similarity should depend on the shape or pattern of the
    feature vector rather than absolute scale.

``manhattan`` or ``l1``
    Useful when L1 geometry is preferred.

``jaccard`` and other binary metrics
    Useful for binary or boolean feature representations.

``minkowski``
    Supports custom ``p`` values through ``metric_kwds``.

``mahalanobis``
    Requires an inverse covariance matrix ``VI`` through ``metric_kwds``.

``seuclidean``
    Requires a variance vector ``V`` through ``metric_kwds``.

``hybrid_euclidean_cosine``
    Package-specific mode. Full pairwise distances remain Euclidean, but
    neighborhood graph construction uses cosine geometry.

Practical recommendation:

- use ``"euclidean"`` as a default starting point
- use ``"cosine"`` or ``"correlation"`` when direction or pattern matters more than magnitude
- use ``"minkowski"``, ``"mahalanobis"``, or ``"seuclidean"`` only when their assumptions match your data
- use ``"hybrid_euclidean_cosine"`` when you want Euclidean full distances but cosine-based local neighborhoods

The metric ``"kulsinski"`` is not supported because it is not available
in current versions of ``scikit-learn``'s ``pairwise_distances``.

The combination ``metric="yule"`` with ``sim_graph_method="sc_gauss"``
is intentionally not supported because it can produce non-finite graph
weights. Use ``metric="yule"`` with ``sim_graph_method="sc_umap"`` or
``sim_graph_method="jaccard_phenograph"`` instead.

Examples:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="correlation",
       n_neighbors=15,
   )

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="minkowski",
       metric_kwds={"p": 1.5},
       n_neighbors=15,
   )

.. code-block:: python

   import numpy as np

   VI = np.linalg.pinv(np.cov(X, rowvar=False))

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="mahalanobis",
       metric_kwds={"VI": VI},
       n_neighbors=15,
   )

``metric_kwds``
^^^^^^^^^^^^^^^

Default: ``None``

This optional dictionary is passed to the selected distance metric during
similarity graph construction.

It is mainly needed for metrics that require additional parameters.

Examples:

- use ``metric_kwds={"p": 1.5}`` with ``metric="minkowski"``
- use ``metric_kwds={"VI": VI}`` with ``metric="mahalanobis"``
- use ``metric_kwds={"V": V}`` with ``metric="seuclidean"``

Example:

.. code-block:: python

   import numpy as np

   V = np.var(X, axis=0, ddof=1)

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="seuclidean",
       metric_kwds={"V": V},
       n_neighbors=15,
   )

``n_neighbors``
^^^^^^^^^^^^^^^

Default: ``15``

This is the number of neighbors used during similarity graph construction.

Practical interpretation:

- smaller values make the graph more local and sparse
- larger values make the graph denser and may improve connectivity
- increasing this value is often the first thing to try when the graph is too fragmented

Practical recommendation:

- start with ``15``
- increase it if connectivity is poor
- decrease it if the graph becomes overly broad or too smoothed

Example:

.. code-block:: python

   GraphCoreSGHDBSCAN(
       sim_graph_method="sc_gauss",
       n_neighbors=20,
   )

``add_neighbor``
^^^^^^^^^^^^^^^^

Default: ``True``

This controls how weighted structural similarity is computed.

When enabled, an edge may still be added even when two nodes do not already
share a direct edge, as long as their weighted structural similarity is greater
than zero.

Practical recommendation:

- keep the default unless you are specifically studying graph-construction behavior
- change it only when you want to examine the effect of this edge-expansion step

``heuristic_connect``
^^^^^^^^^^^^^^^^^^^^^

Default: ``False``

The final graph used for clustering must be connected. This parameter
controls how originally disconnected graphs are handled.

``heuristic_connect=False``
   Default behavior. If the graph has multiple connected components, the
   package connects consecutive components by adding edges with maximum
   distance, equivalent to weight ``1`` in the dissimilarity graph.

``heuristic_connect=True``
   The package repeatedly increases ``n_neighbors`` until the graph becomes
   connected.

Example fitting log:

.. code-block:: text

   Trying n_neighbors = 16
   Trying n_neighbors = 17

Practical recommendation:

- use ``False`` when you want a simple and predictable fallback
- use ``True`` when you prefer connectivity to come from a denser neighborhood graph
  rather than from synthetic bridge edges

``no_noise``
^^^^^^^^^^^^

Default: ``True``

If enabled, points initially labeled as ``-1`` are reassigned by an
MST-based label propagation step after clustering.

Conceptually, this post-processing step:

1. builds a mutual-reachability view from graph-derived distances and core distances
2. computes an MST
3. propagates labels from labeled points to unlabeled points in increasing edge-weight order
4. resolves competition using a top-``c`` path comparison rule

Practical recommendation:

- use ``True`` if you prefer a full assignment with no final noise labels
- use ``False`` if you want to preserve the original HDBSCAN*-style noise behavior

``min_cluster_size``
^^^^^^^^^^^^^^^^^^^^

Default: ``None``

This is the minimum cluster size used in the clustering stage.

When left as ``None``, the package uses the selected ``min_samples`` value
for each run, so the effective minimum cluster size becomes ``m`` for each
fitted ``min_samples = m``.

If you set ``min_cluster_size`` explicitly, that fixed value is used for all
selected ``min_samples`` values.

Practical recommendation:

- leave it as ``None`` if you want cluster size to track ``min_samples``
- set it explicitly if you want a fixed minimum cluster size independent of
  the selected ``min_samples`` values


``save_models``
^^^^^^^^^^^^^^^

Default: ``False``

This controls whether full per-``min_samples`` model objects are stored after
fitting.

``save_models=False``
   The package still stores labels and condensed trees for each fitted
   ``min_samples`` value, but it does not keep full saved model objects.

``save_models=True``
   The package also stores full per-``m`` models in ``models_``.

Practical recommendation:

- use ``False`` if you mainly want labels and condensed trees with lower
  memory usage
- use ``True`` if you want direct access to saved per-``m`` model objects

Example:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=range(2, 20),
       sim_graph_method="sc_gauss",
       metric="euclidean",
       save_models=True,
   )
   model.fit(X)

   labels_10 = model.labels_by_m_[10]
   tree_10 = model.condensed_trees_[10]
   model_10 = model.models_[10]

Notes:

- ``labels_by_m_`` and ``condensed_trees_`` are available after fitting
  regardless of ``save_models``.
- ``models_`` is mainly useful when you want to inspect the full saved result
  object for a specific ``min_samples`` value.

Practical selection workflow
----------------------------

A useful tuning order is:

1. choose ``sim_graph_method`` based on how you want the graph to be built
2. choose ``metric`` based on the geometry that makes sense for your data
3. start with ``n_neighbors=15``
4. tune ``min_samples``
5. decide whether you want ``no_noise=True``
6. only then adjust ``heuristic_connect`` and ``add_neighbor`` if needed

A good exploratory run looks like:

.. code-block:: python

   g = GraphCoreSGHDBSCAN(
       min_samples=range(2, 20),
       sim_graph_method="sc_gauss",
       n_neighbors=16,
       no_noise=True,
       metric="euclidean",
       heuristic_connect=True,
       save_models=True,
   )
   g.fit(X)

After fitting several values, the package stores results by
``min_samples`` value. Labels are available in ``labels_by_m_``, condensed
trees are available in ``condensed_trees_``, and full saved models are
available in ``models_`` when ``save_models=True``.

Then inspect the hierarchy and choose a specific solution:

.. code-block:: python

   g.plot_condensed_tree(4)
   labels_18 = g.labels_for(18)
   tree_18 = g.condensed_trees_[18]
   model_18 = g.models_[18]


Ready-to-use presets
--------------------

Default baseline
^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN()
   model.fit(X)
   labels = model.fit_predict(X)

More conservative clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=20,
       sim_graph_method="sc_umap",
       metric="euclidean",
       n_neighbors=20,
   )

Finer local structure
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=5,
       sim_graph_method="sc_umap",
       metric="euclidean",
       n_neighbors=12,
   )

Cosine-based graph construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=[5, 10],
       sim_graph_method="sc_gauss",
       metric="cosine",
       n_neighbors=20,
   )
   model.fit(X)
   labels_10 = model.labels_for(10)

Correlation-based graph construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=[5, 10],
       sim_graph_method="sc_umap",
       metric="correlation",
       n_neighbors=20,
   )

   model.fit(X)
   labels_10 = model.labels_for(10)

Minkowski distance with custom p
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="minkowski",
       metric_kwds={"p": 1.5},
       n_neighbors=15,
   )

   model.fit(X)

Hybrid Euclidean-cosine mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=range(2, 10),
       sim_graph_method="sc_umap",
       metric="hybrid_euclidean_cosine",
       n_neighbors=16,
   )
   model.fit(X)

Precomputed graph input
^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="precomputed",
       no_noise=True,
   )
   model.fit(my_graph)

PhenoGraph-style Jaccard graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="jaccard_phenograph",
       similarity_backend="auto",
       metric="euclidean",
       n_neighbors=15,
   )
   model.fit(X)

For reproducibility checks against the original backend, use:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="jaccard_phenograph",
       similarity_backend="default",
       n_neighbors=15,
   )

Troubleshooting by symptom
--------------------------

Too many tiny clusters
^^^^^^^^^^^^^^^^^^^^^^

Try:

- increasing ``min_samples``
- increasing ``n_neighbors``
- using ``metric="euclidean"`` if cosine-based neighborhoods are too fine

Clusters are too coarse
^^^^^^^^^^^^^^^^^^^^^^^

Try:

- decreasing ``min_samples``
- decreasing ``n_neighbors``
- checking whether ``no_noise=True`` is absorbing points you would rather keep as noise

Graph is disconnected
^^^^^^^^^^^^^^^^^^^^^

Try:

- increasing ``n_neighbors``
- setting ``heuristic_connect=True``
- checking whether the selected metric is making neighborhoods too sparse

Too many noise points
^^^^^^^^^^^^^^^^^^^^^

Try:

- lowering ``min_samples``
- increasing ``n_neighbors``
- using ``no_noise=True`` if a full assignment is desired

Jaccard graph construction is slow
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Try:

- using ``similarity_backend="auto"`` or ``similarity_backend="numba"``
- reducing ``n_neighbors`` if the neighborhood graph is unnecessarily dense
- using ``similarity_backend="default"`` only when you need the original
  PhenoGraph-based backend for comparison or debugging

Practical notes
---------------

- If the graph is disconnected and ``heuristic_connect=False``, the package
  connects components with synthetic edges of weight ``1``. This is simple and
  effective, but it is a design choice worth reporting in experiments.
- ``min_cluster_size=None`` means that the package matches cluster size to each
  selected ``min_samples`` value.
- When several ``min_samples`` values are passed, fit once and retrieve labels
  later for the requested value.
- Some graph builders depend on optional packages and will raise a clear import
  error if those packages are not installed.
- ``labels_by_m_[m]`` stores the directly fitted labels for a selected
  ``min_samples`` value.
- ``labels_for(m)`` may additionally apply noise reassignment depending on
  the ``no_noise`` setting.
- ``condensed_trees_[m]`` gives direct access to the condensed tree for a
  selected ``min_samples`` value.
- ``models_[m]`` is available when ``save_models=True``.
- ``similarity_backend="auto"`` uses accelerated similarity-graph construction
  when available. Currently this mainly affects
  ``sim_graph_method="jaccard_phenograph"``.
- The ``numba`` backend may have a one-time compilation cost on first use, but
  can substantially reduce the time spent constructing PhenoGraph-style
  Jaccard graphs for larger datasets.