Overview ======== ``coresg-graphhdbscan`` is a Python package for density-based clustering on similarity graphs derived from feature-vector data or directly provided by the user. The package combines two main ideas: 1. construction of a weighted graph that reflects local similarity structure 2. application of a specialized CoreSG-HDBSCAN* pipeline designed to operate on graph data representations This package is designed for settings where feature-vector representations are not suitable for clustering, such as Euclidean geometry in very high-dimensional spaces, where clustering should instead be guided by a learned or hand-crafted similarity graph as an intrinsically lower-dimensional representation of the data. Main features ------------- The package currently supports the following capabilities: - graph-based clustering through ``GraphCoreSGHDBSCAN`` - multiple ``min_samples`` values in a single model run - three graph-construction backends plus a precomputed graph mode - support for multiple distance metrics during similarity graph construction, including ``euclidean``, ``cosine``, ``correlation``, ``manhattan``, ``jaccard``, ``minkowski``, ``mahalanobis``, ``seuclidean``, and the package-specific ``hybrid_euclidean_cosine`` - support for ``metric_kwds`` when a distance metric requires additional arguments - optional relabeling of noise points by density-based label propagation - compatibility with NetworkX graphs, dense adjacency matrices, and sparse adjacency matrices in precomputed mode - access to HDBSCAN*-style outputs such as labels, probabilities, and condensed trees - direct access to stored labels, condensed trees, and optional per-``m`` models Package structure ----------------- The package is centered around two public classes: ``CoreSGHDBSCAN`` The lower-level CoreSG implementation operating on feature vectors or distance matrices. ``GraphCoreSGHDBSCAN`` The graph-oriented wrapper that constructs a similarity graph, converts it to a weighted structutal dissimilarity graph, and then runs CoreSGHDBSCAN. For most users, ``GraphCoreSGHDBSCAN`` is the main entry point. When to use this package ------------------------ This package is especially useful when: - you work with very high-dimensional data, and simple metrics like Euclidean distance alone is not the best description of local structure - a similarity graph is more meaningful than a raw feature-space view - you want to compare several ``min_samples`` values efficiently in one run - you want HDBSCAN*-style hierarchical clustering behavior on top of a graph-based representation - you already have a graph or adjacency matrix and want to cluster directly from it Typical workflow ---------------- A typical graph-based clustering workflow in this package is: 1. construct or provide a similarity graph 2. convert it into a weighted structural similarity graph 3. convert similarity to dissimilarity 4. ensure graph connectivity 5. run CoreSGHDBSCAN for one or more ``min_samples`` values 6. inspect the condensed tree and choose a solution 7. optionally reassign noise points if ``no_noise=True`` Typical entry point ------------------- For most users, the main entry point is: .. code-block:: python from coresg_graphhdbscan import GraphCoreSGHDBSCAN A simple starting example is: .. code-block:: python from coresg_graphhdbscan import GraphCoreSGHDBSCAN model = GraphCoreSGHDBSCAN( min_samples=10, sim_graph_method="sc_umap", metric="euclidean", n_neighbors=15, no_noise=True, heuristic_connect=False, ) model.fit(X) labels = model.fit_predict(X) Stored results after fitting ---------------------------- After fitting, the package stores results for each tested ``min_samples`` value. The main user-facing object is ``GraphCoreSGHDBSCAN``. Internally, it runs a Core-SG engine across one or more ``min_samples`` values and stores per-``m`` results. The most important fitted result containers are: - ``labels_by_m_``: dictionary mapping each fitted ``min_samples`` value to its stored cluster labels. - ``condensed_trees_``: dictionary mapping each fitted ``min_samples`` value to its condensed tree object. - ``models_``: dictionary mapping each fitted ``min_samples`` value to a saved model object when ``save_models=True``. This makes it possible to inspect a fitted solution directly without re-running the model. Typical post-fit access looks like: .. code-block:: python g.fit(X) labels_10 = g.labels_by_m_[10] tree_10 = g.condensed_trees_[10] # only available when save_models=True model_10 = g.models_[10] Related pages ------------- For more detail, see: - :doc:`installation` - :doc:`usage` - :doc:`parameters` - :doc:`api`