Usage
=====

This page shows how to work with ``GraphCoreSGHDBSCAN`` in practice after the package is installed.

The package supports two main workflows:

- build a similarity graph from feature data and cluster from that graph
- provide a graph or adjacency representation directly in ``precomputed`` mode

For most users, ``GraphCoreSGHDBSCAN`` is the main entry point.

Basic workflow
--------------

A typical workflow is:

1. prepare a feature matrix ``X`` or a precomputed graph
2. create a ``GraphCoreSGHDBSCAN`` model
3. call ``fit(...)`` or ``fit_predict(...)``
4. inspect labels
5. inspect the condensed tree
6. compare several ``min_samples`` values if needed

Minimal example
---------------

The simplest usage pattern is:

.. code-block:: python

    from coresg_graphhdbscan import GraphCoreSGHDBSCAN

    model = GraphCoreSGHDBSCAN()
    labels = model.fit_predict(X)

This uses the default configuration:

- ``min_samples=10``
- ``sim_graph_method="sc_umap"``
- ``metric="euclidean"``
- ``metric_kwds=None``
- ``add_neighbor=True``
- ``no_noise=True``
- ``n_neighbors=15``
- ``heuristic_connect=False``
- ``min_cluster_size=None``

fit vs. fit_predict
-------------------

Use ``fit_predict(X)`` when you want labels immediately for a single requested configuration.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(min_samples=10)
    labels = model.fit_predict(X)

Use ``fit(X)`` when you want to inspect the fitted object, view the hierarchy, or retrieve results for several ``min_samples`` values later.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(min_samples=10)
    model.fit(X)

Single ``min_samples`` value
----------------------------

If you want one clustering solution only, pass a single integer:

.. code-block:: python

    model = GraphCoreSGHDBSCAN(min_samples=10)
    labels = model.fit_predict(X)

This is the simplest and most common starting point.

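
The snippets on this page assume that a feature matrix ``X`` already exists. For trying them out, a small synthetic matrix is often enough. The sketch below builds one with NumPy; the helper name ``make_toy_features`` and the blob layout are illustrative assumptions and are not part of the package.

.. code-block:: python

    import numpy as np


    def make_toy_features(n_per_blob=50, n_features=4, seed=0):
        """Return a (3 * n_per_blob, n_features) matrix of three Gaussian blobs.

        Hypothetical helper for experimentation only.
        """
        rng = np.random.default_rng(seed)
        centers = (-5.0, 0.0, 5.0)  # each blob is centered at this value on every axis
        blobs = [
            rng.normal(loc=c, scale=1.0, size=(n_per_blob, n_features))
            for c in centers
        ]
        return np.vstack(blobs)


    X = make_toy_features()
    print(X.shape)  # (150, 4)

The resulting ``X`` has one row per sample and one column per feature, so it can stand in for ``X`` in calls such as ``fit_predict(X)`` shown above.
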
Multiple ``min_samples`` values in one run
------------------------------------------

One of the package's main strengths is the ability to fit several ``min_samples`` values in one model run.

.. code-block:: python

    from coresg_graphhdbscan import GraphCoreSGHDBSCAN

    model = GraphCoreSGHDBSCAN(min_samples=[5, 10, 15])
    model.fit(X)

    labels_5 = model.labels_for(5)
    labels_10 = model.labels_for(10)
    labels_15 = model.labels_for(15)

This is useful when you want to compare clustering behavior across several density settings without rebuilding the full workflow from scratch.

You can also use ranges:

.. code-block:: python

    model = GraphCoreSGHDBSCAN(min_samples=range(2, 10))
    model.fit(X)

Inspecting the hierarchy
------------------------

After fitting, you can inspect the hierarchical structure through the condensed tree.

Static condensed tree
^^^^^^^^^^^^^^^^^^^^^

The most reliable visualization method is the static condensed tree:

.. code-block:: python

    model.fit(X)
    model.plot_condensed_tree(10)

If you fit several ``min_samples`` values, pass the specific value you want to inspect.

Interactive condensed tree
^^^^^^^^^^^^^^^^^^^^^^^^^^

For live notebook work, the package also provides an interactive condensed tree:

.. code-block:: python

    widget = model.interactive_condensed_tree()
    widget

This view lets you change ``min_samples`` interactively and inspect the corresponding condensed tree without refitting the model. It is useful when you fit several ``min_samples`` values in one run and want to compare the resulting hierarchies visually.

.. image:: ../_static/interactive_condensed_tree.png
   :alt: Interactive condensed tree widget with a slider for min_samples and a condensed tree display.
   :align: center
   :width: 85%

In this interface, the slider controls the selected value of ``min_samples``. As the selected value changes, the displayed condensed tree updates so that you can compare hierarchical structure across different density settings.

A typical workflow is:

.. code-block:: python

    model = GraphCoreSGHDBSCAN(min_samples=range(2, 20))
    model.fit(X)

    widget = model.interactive_condensed_tree()
    widget

This feature is most useful in a live Jupyter environment.

Choosing a graph-construction backend
-------------------------------------

The package supports several graph-construction backends through ``sim_graph_method``.

Default UMAP-style graph
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        n_neighbors=15,
    )
    model.fit(X)

This is the default and a good starting point for many datasets.

Gaussian connectivity graph
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_gauss",
        n_neighbors=15,
    )
    model.fit(X)

This uses Scanpy's Gaussian connectivity routine.

PhenoGraph-style graph
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="jaccard_phenograph",
        n_neighbors=15,
    )
    model.fit(X)

This uses a PhenoGraph-style graph construction.

Using different distance metrics
--------------------------------

The ``metric`` parameter controls the distance measure used during similarity graph construction.

The default is:

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="euclidean",
    )
    model.fit(X)

Other supported distance metrics include:

- ``"cityblock"``
- ``"cosine"``
- ``"euclidean"``
- ``"l1"``
- ``"l2"``
- ``"manhattan"``
- ``"braycurtis"``
- ``"canberra"``
- ``"chebyshev"``
- ``"correlation"``
- ``"dice"``
- ``"hamming"``
- ``"jaccard"``
- ``"mahalanobis"``
- ``"minkowski"``
- ``"rogerstanimoto"``
- ``"russellrao"``
- ``"seuclidean"``
- ``"sokalmichener"``
- ``"sokalsneath"``
- ``"sqeuclidean"``
- ``"yule"``
- ``"hybrid_euclidean_cosine"``

Cosine distance
^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_gauss",
        metric="cosine",
    )
    model.fit(X)

Correlation distance
^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="correlation",
    )
    model.fit(X)

Minkowski distance with metric_kwds
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some metrics require additional keyword arguments. These can be passed through ``metric_kwds``.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="minkowski",
        metric_kwds={"p": 1.5},
    )
    model.fit(X)

Mahalanobis distance
^^^^^^^^^^^^^^^^^^^^

For Mahalanobis distance, pass the inverse covariance matrix ``VI``:

.. code-block:: python

    import numpy as np

    VI = np.linalg.pinv(np.cov(X, rowvar=False))

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="mahalanobis",
        metric_kwds={"VI": VI},
    )
    model.fit(X)

Standardized Euclidean distance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For standardized Euclidean distance, pass the variance vector ``V``:

.. code-block:: python

    import numpy as np

    V = np.var(X, axis=0, ddof=1)

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="seuclidean",
        metric_kwds={"V": V},
    )
    model.fit(X)

Hybrid Euclidean-cosine mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        sim_graph_method="sc_umap",
        metric="hybrid_euclidean_cosine",
    )
    model.fit(X)

In this mode, full distances remain Euclidean while neighborhood graph construction uses cosine geometry.

Unsupported metric and combination
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The metric ``"kulsinski"`` is not supported because it is not available in current versions of ``scikit-learn``'s ``pairwise_distances``.

The following combination is intentionally not supported:

.. code-block:: python

    GraphCoreSGHDBSCAN(
        sim_graph_method="sc_gauss",
        metric="yule",
    )

This combination can produce non-finite graph weights. Use ``metric="yule"`` with ``sim_graph_method="sc_umap"`` or ``sim_graph_method="jaccard_phenograph"`` instead.

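
When metric and backend names come from a configuration file or a parameter sweep, it can help to reject the unsupported cases described above before any model is constructed. The following is a minimal sketch of such a guard in user code. The helper ``check_metric_choice`` and the two constants are hypothetical and not part of the package; how the package itself reports these cases is not covered here.

.. code-block:: python

    # Hypothetical user-side guard; the names below are illustrative only.
    UNSUPPORTED_METRICS = {"kulsinski"}
    UNSUPPORTED_COMBINATIONS = {("sc_gauss", "yule")}


    def check_metric_choice(sim_graph_method, metric):
        """Raise ValueError for the unsupported cases listed on this page."""
        if metric in UNSUPPORTED_METRICS:
            raise ValueError(f"metric={metric!r} is not supported")
        if (sim_graph_method, metric) in UNSUPPORTED_COMBINATIONS:
            raise ValueError(
                f"metric={metric!r} is not supported with "
                f"sim_graph_method={sim_graph_method!r}"
            )


    # Accepted: "yule" is allowed with the UMAP-style backend.
    check_metric_choice("sc_umap", "yule")

    # Rejected: the combination that can produce non-finite graph weights.
    try:
        check_metric_choice("sc_gauss", "yule")
    except ValueError as exc:
        print(exc)

Keeping the unsupported cases in plain sets makes the guard easy to extend if a sweep needs to exclude further combinations.
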
Using precomputed graphs
------------------------

If you already have a graph or adjacency representation, use ``sim_graph_method="precomputed"``.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        min_samples=10,
        sim_graph_method="precomputed",
        no_noise=True,
    )
    model.fit(my_graph)

In precomputed mode, the input to ``fit(...)`` may be:

- a ``networkx.Graph``
- a SciPy sparse adjacency matrix
- a square dense adjacency matrix

This mode is useful when the graph is already part of the experimental design or has been built by another method.

Connectivity handling
---------------------

The final graph used by clustering must be connected.

Default behavior
^^^^^^^^^^^^^^^^

With ``heuristic_connect=False``, disconnected components are connected using a simple fallback that adds synthetic bridge edges.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        heuristic_connect=False,
    )

Heuristic connectivity
^^^^^^^^^^^^^^^^^^^^^^

With ``heuristic_connect=True``, the package increases ``n_neighbors`` until the graph becomes connected.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        heuristic_connect=True,
        n_neighbors=15,
    )

During fitting, the package may report messages such as:

.. code-block:: text

    Trying n_neighbors = 16
    Trying n_neighbors = 17

Noise reassignment
------------------

With ``no_noise=True``, points initially labeled ``-1`` are reassigned by an MST-based propagation step after clustering.

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        no_noise=True,
    )

If you want to preserve the original HDBSCAN*-style noise labels, disable it:

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        no_noise=False,
    )

Useful outputs after fitting
----------------------------

After fitting, users commonly inspect:

- ``model.coresg_`` for the internal CoreSG object
- ``model.similarity_graph_`` for the initial similarity graph
- ``model.similarity_graph_WSS`` for the weighted structural similarity graph
- ``model.dissimilarity_graph_`` for the graph after similarity-to-dissimilarity conversion
- ``model.connected_graph_`` for the final connected graph
- ``model.dist_matrix_`` for the dense matrix used by CoreSGHDBSCAN

If you fit multiple ``min_samples`` values, retrieve a selected solution with:

.. code-block:: python

    labels = model.labels_for(10)

For lower-level access, labels are also stored inside the internal model objects associated with each fitted ``min_samples`` value.

Common usage patterns
---------------------

Default configuration
^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from coresg_graphhdbscan import GraphCoreSGHDBSCAN

    model = GraphCoreSGHDBSCAN()
    labels = model.fit_predict(X)

Explore several density settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        min_samples=range(2, 20),
        sim_graph_method="sc_gauss",
        n_neighbors=16,
        no_noise=True,
        metric="euclidean",
        heuristic_connect=True,
    )
    model.fit(X)

    model.plot_condensed_tree(4)
    labels_18 = model.labels_for(18)

Cosine-based graph construction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        min_samples=[5, 10],
        sim_graph_method="sc_gauss",
        metric="cosine",
        n_neighbors=20,
    )
    model.fit(X)

    labels_10 = model.labels_for(10)

Precomputed graph workflow
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    model = GraphCoreSGHDBSCAN(
        min_samples=10,
        sim_graph_method="precomputed",
        no_noise=True,
    )
    model.fit(my_graph)

    model.plot_condensed_tree(10)

Working with fitted results
---------------------------

After calling ``fit()``, the package stores results for each fitted ``min_samples`` value.

.. code-block:: python

    g = GraphCoreSGHDBSCAN(
        min_samples=range(2, 20),
        sim_graph_method="sc_gauss",
        metric="euclidean",
        n_neighbors=16,
        no_noise=True,
        save_models=True,
    )
    g.fit(X)

Stored labels
^^^^^^^^^^^^^

.. code-block:: python

    labels_5 = g.labels_by_m_[5]

You can also use:

.. code-block:: python

    labels_5 = g.labels_for(5)

Note that ``labels_for(m)`` may apply noise reassignment depending on the ``no_noise`` setting, while ``labels_by_m_[m]`` is the directly stored fitted result.

Stored condensed trees
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    tree_5 = g.condensed_trees_[5]

Plot a selected condensed tree:

.. code-block:: python

    g.plot_condensed_tree(5)

Interactive condensed tree browser:

.. code-block:: python

    g.interactive_condensed_tree()

Stored models
^^^^^^^^^^^^^

If ``save_models=True``, full per-``m`` models are available:

.. code-block:: python

    model_5 = g.models_[5]

Practical notes
---------------

- ``min_samples`` is the main clustering hyperparameter and is often the first thing to tune.
- ``min_cluster_size=None`` means that the package follows the selected ``min_samples`` value for each run.
- ``plot_condensed_tree(...)`` is the most reliable visualization for static documentation and reports.
- ``interactive_condensed_tree()`` is best suited for live notebooks.
- Some graph builders depend on optional packages and will raise an import error if those packages are not installed.

Notebook tip
^^^^^^^^^^^^

In notebook environments, use a trailing semicolon to suppress the display of the fitted object representation:

.. code-block:: python

    g.fit(X);

Related pages
-------------

For more detail, see:

- :doc:`overview`
- :doc:`installation`
- :doc:`parameters`
- :doc:`examples`
- :doc:`api`