Installation and quick start
============================

This page explains how to install ``coresg-graphhdbscan`` and run a first
clustering example.

Installation
------------

Local installation
^^^^^^^^^^^^^^^^^^

Install the package from a local checkout in editable mode:

.. code-block:: bash

   git clone https://github.com/Campello-Lab/GraphHDBSCAN.git
   cd GraphHDBSCAN
   pip install -e .

This is the most convenient option during development because changes in the
source tree are picked up without reinstalling the package each time.

GitHub installation
^^^^^^^^^^^^^^^^^^^

To install directly from the repository without cloning it first:

.. code-block:: bash

   pip install git+https://github.com/Campello-Lab/GraphHDBSCAN.git

Typical dependencies
--------------------

Depending on the selected graph backend, the following libraries may be
required:

- ``numpy``
- ``scipy``
- ``scikit-learn``
- ``networkx``
- ``hdbscan``
- ``scanpy`` for ``sc_gauss`` and ``sc_umap``
- ``scanpy.external`` for ``jaccard_phenograph``

These dependencies are normally handled through the package installation, but
they are useful to know when setting up a fresh environment or troubleshooting
imports.
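
To see which of the libraries listed above are importable in the current
environment, a small standard-library check can help. This is only a sketch:
the names below are the usual import names for those packages (for example
``sklearn`` for ``scikit-learn``), and the helper ``check_modules`` is
hypothetical, not part of ``coresg-graphhdbscan``.

.. code-block:: python

   from importlib.util import find_spec

   # Usual import names for the libraries listed above
   # (scikit-learn is imported as "sklearn").
   MODULES = ["numpy", "scipy", "sklearn", "networkx", "hdbscan", "scanpy"]


   def check_modules(names):
       """Map each top-level module name to True if it can be located."""
       return {name: find_spec(name) is not None for name in names}


   if __name__ == "__main__":
       for name, found in check_modules(MODULES).items():
           print(f"{name:10s} {'ok' if found else 'missing'}")

A ``missing`` entry only matters if the graph backend you plan to use needs
that library.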

Recommended environment setup
-----------------------------

A clean Python environment is recommended, especially when working with
scientific Python packages.

For example, from the root of the cloned repository:

.. code-block:: bash

   python -m venv .venv
   source .venv/bin/activate
   pip install --upgrade pip
   pip install -e .

If you use Conda, an equivalent workflow is:

.. code-block:: bash

   conda create -n graphhdbscan python=3.11
   conda activate graphhdbscan
   pip install -e .

Minimal import test
-------------------

After installation, verify that the package imports correctly:

.. code-block:: python

   from coresg_graphhdbscan import GraphCoreSGHDBSCAN

If this import works, the package is installed and the public entry point is
available.

Quick start
-----------

Simple example
^^^^^^^^^^^^^^

A minimal clustering workflow looks like this:

.. code-block:: python

   from coresg_graphhdbscan import GraphCoreSGHDBSCAN

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="sc_umap",
       metric="euclidean",
       n_neighbors=15,
       no_noise=True,
       heuristic_connect=False,
       save_models=False,
   )

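   # X is the feature data to cluster.
   # fit_predict fits the model and returns the labels in a single call,
   # so a separate model.fit(X) beforehand is not needed.
   labels = model.fit_predict(X)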

This example uses:

- ``min_samples=10`` as a default clustering setting
- ``sim_graph_method="sc_umap"`` as the default graph builder
- ``metric="euclidean"`` as the default metric strategy
- ``n_neighbors=15`` as the default local graph size
- ``no_noise=True`` to reassign noise points after clustering
- ``save_models=False`` to avoid storing full per-``min_samples`` model objects
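
The example above assumes that a feature matrix ``X`` already exists. For a
first experiment, a toy matrix can be generated with NumPy. The sketch below is
illustrative only: the blob centers, spread, and sample counts are arbitrary
values chosen here, not recommendations from the package.

.. code-block:: python

   import numpy as np

   # Three Gaussian blobs in 2D, 100 points each (arbitrary toy values).
   rng = np.random.default_rng(0)
   centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
   X = np.vstack(
       [rng.normal(loc=center, scale=0.5, size=(100, 2)) for center in centers]
   )

   print(X.shape)  # rows are samples, columns are features

The resulting array has one row per sample and can be passed wherever ``X``
appears in the examples on this page.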

Single ``min_samples`` value
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To obtain a single clustering solution, pass one integer:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(min_samples=10)
   labels = model.fit_predict(X)

Multiple ``min_samples`` values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One strength of the package is that you can fit several ``min_samples`` values
in a single run and inspect them later.

.. code-block:: python

   model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
   model.fit(X)

   labels_5 = model.labels_for(5)
   labels_10 = model.labels_for(10)
   labels_15 = model.labels_for(15)

This is useful when you want to compare clustering solutions across several
density settings without repeating the full workflow from scratch.
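
One simple way to compare the solutions is to count clusters and noise points
for each label vector. The helper below is a hypothetical sketch, not part of
the package. It assumes noise points are labeled ``-1``, which is the
convention of the ``hdbscan`` library; check whether that holds for your
configuration. The toy lists stand in for the outputs of ``labels_for``.

.. code-block:: python

   from collections import Counter


   def summarize_labels(labels, noise_label=-1):
       """Count distinct clusters and noise points in one label vector."""
       counts = Counter(int(label) for label in labels)
       n_noise = counts.pop(noise_label, 0)  # assumption: noise is labeled -1
       return {"n_clusters": len(counts), "n_noise": n_noise}


   # Toy label vectors standing in for model.labels_for(5) and model.labels_for(10).
   toy_solutions = {5: [0, 0, 1, 1, 2, -1], 10: [0, 0, 0, 1, 1, -1]}
   for m, labels in toy_solutions.items():
       print(m, summarize_labels(labels))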

Inspecting stored results
^^^^^^^^^^^^^^^^^^^^^^^^^

After fitting, the package stores labels and condensed trees for each fitted
``min_samples`` value.

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=range(2, 20),
       sim_graph_method="sc_gauss",
       metric="euclidean",
       n_neighbors=16,
       save_models=False,
   )
   model.fit(X)

   labels_10 = model.labels_by_m_[10]
   tree_10 = model.condensed_trees_[10]

If you want full saved per-``min_samples`` model objects as well, enable
``save_models=True``:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=range(2, 20),
       sim_graph_method="sc_gauss",
       metric="euclidean",
       n_neighbors=16,
       save_models=True,
   )
   model.fit(X)

   labels_10 = model.labels_by_m_[10]
   tree_10 = model.condensed_trees_[10]
   model_10 = model.models_[10]

``labels_by_m_[m]`` stores the directly fitted labels. By contrast,
``labels_for(m)`` may apply post-processing depending on the ``no_noise``
setting.
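
To see how much post-processing changed a solution, the two label vectors can
be compared element by element. The function below is a hypothetical helper
written for this page. The toy inputs stand in for ``labels_by_m_[m]`` and
``labels_for(m)``, and the use of ``-1`` for noise in the toy data is an
assumption borrowed from the ``hdbscan`` convention.

.. code-block:: python

   def count_changed(raw_labels, final_labels):
       """Return how many points have different labels in the two vectors."""
       if len(raw_labels) != len(final_labels):
           raise ValueError("label vectors must have the same length")
       return sum(int(a) != int(b) for a, b in zip(raw_labels, final_labels))


   # Toy stand-ins: two noise points (-1) were reassigned to clusters.
   raw = [0, -1, 1, -1]
   final = [0, 0, 1, 1]
   print(count_changed(raw, final))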

Precomputed graph input
^^^^^^^^^^^^^^^^^^^^^^^

If you already have a graph or adjacency representation, use
``sim_graph_method="precomputed"``:

.. code-block:: python

   model = GraphCoreSGHDBSCAN(
       min_samples=10,
       sim_graph_method="precomputed",
       no_noise=True,
   )
   model.fit(my_graph)

In precomputed mode, the input to ``fit(...)`` may be:

- a ``networkx.Graph``
- a SciPy sparse adjacency matrix
- a square dense adjacency matrix
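
As an illustration of the dense form, the sketch below builds a symmetric,
unweighted k-nearest-neighbor adjacency matrix with NumPy. The helper
``knn_adjacency`` is hypothetical and written for this page. This page does
not state whether the package expects binary or weighted edges, or how weights
are interpreted, so treat the unweighted choice as an assumption and check the
parameter documentation.

.. code-block:: python

   import numpy as np


   def knn_adjacency(X, k=5):
       """Return a dense, symmetric 0/1 k-nearest-neighbor adjacency matrix."""
       X = np.asarray(X, dtype=float)
       n = X.shape[0]
       # Pairwise squared Euclidean distances.
       diff = X[:, None, :] - X[None, :, :]
       d2 = np.einsum("ijk,ijk->ij", diff, diff)
       np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
       neighbors = np.argsort(d2, axis=1)[:, :k]
       A = np.zeros((n, n))
       rows = np.repeat(np.arange(n), k)
       A[rows, neighbors.ravel()] = 1.0
       # Keep an edge if either endpoint lists the other as a neighbor.
       return np.maximum(A, A.T)


   rng = np.random.default_rng(0)
   my_graph = knn_adjacency(rng.normal(size=(20, 3)), k=4)
   print(my_graph.shape)

The pairwise-distance step uses memory quadratic in the number of samples, so
this approach is only suitable for small inputs.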

Inspecting the hierarchy
^^^^^^^^^^^^^^^^^^^^^^^^

After fitting, you can inspect the hierarchical clustering structure through the
condensed tree:

.. code-block:: python

   model.fit(X)
   model.plot_condensed_tree(10)

The argument selects which fitted ``min_samples`` value to plot. If you fitted
several values, pass the one you want to inspect.

You can also browse condensed trees interactively in a notebook environment:

.. code-block:: python

   model.fit(X)
   model.interactive_condensed_tree()

Typical workflow
----------------

A common workflow is:

1. install the package in a clean environment
2. import ``GraphCoreSGHDBSCAN``
3. choose a graph builder and metric
4. fit the model on feature data or a precomputed graph
5. inspect the condensed tree
6. retrieve labels for the selected ``min_samples`` value

A typical exploratory run looks like this:

.. code-block:: python

   from coresg_graphhdbscan import GraphCoreSGHDBSCAN

   g = GraphCoreSGHDBSCAN(
       min_samples=range(2, 20),
       sim_graph_method="sc_gauss",
       n_neighbors=16,
       no_noise=True,
       metric="euclidean",
       heuristic_connect=True,
       save_models=True,
   )

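   g.fit(X)

   # Inspect the hierarchy and retrieve results for the same min_samples value.
   g.plot_condensed_tree(18)
   labels_18 = g.labels_for(18)
   tree_18 = g.condensed_trees_[18]
   model_18 = g.models_[18]  # available because save_models=True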

Troubleshooting installation
----------------------------

If imports fail, the cause is usually one of these:

- a required scientific dependency is missing from the active environment
- binary packages such as NumPy, SciPy, or scikit-learn have mismatched versions
- an optional graph-construction dependency, such as ``scanpy`` for
  ``sc_gauss`` and ``sc_umap``, is not installed

Helpful check:

.. code-block:: bash

   python -c "from coresg_graphhdbscan import GraphCoreSGHDBSCAN; print('ok')"
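
When a version mismatch is suspected, listing the installed versions of the
relevant distributions is a useful next step. The sketch below uses only the
standard library. The helper ``installed_versions`` is hypothetical, and using
``coresg-graphhdbscan`` as the distribution name is an assumption based on the
package name used on this page.

.. code-block:: python

   from importlib.metadata import PackageNotFoundError, version


   def installed_versions(distributions):
       """Map each distribution name to its installed version, or None."""
       report = {}
       for dist in distributions:
           try:
               report[dist] = version(dist)
           except PackageNotFoundError:
               report[dist] = None
       return report


   if __name__ == "__main__":
       # Distribution names as used by pip; the last one is an assumption.
       names = ["numpy", "scipy", "scikit-learn", "coresg-graphhdbscan"]
       for dist, ver in installed_versions(names).items():
           print(f"{dist:22s} {ver or 'not installed'}")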

Related pages
-------------

For more detail, continue with:

- :doc:`overview`
- :doc:`parameters`
- :doc:`examples`
- :doc:`api`