ENH Add `HDBSCAN` as a new estimator in `sklearn.cluster` by Micky774 · Pull Request #26385 · scikit-learn/scikit-learn

Merged
merged 24 commits on May 31, 2023

Changes from all commits (24 commits)
bd04bda
Include `HDBSCAN` as a sub-module for `sklearn.cluster` (#22616)
Micky774 Oct 12, 2022
488fb1f
CLN Cleaned `cluster/_hdbscan/_linkage.pyx` (#24857)
Micky774 Dec 19, 2022
df7cf3a
DOC Adds `HDBSCAN.dbscan_clustering` section to `plot_hdbscan.py` (#2…
Micky774 Feb 7, 2023
800c2a6
ENH Extends outlier encoding scheme to `HDBSCAN.dbscan_clustering` (#…
Micky774 Feb 9, 2023
29c53b7
CLN Cleaned `cluster/_hdbscan/_reachability.pyx` (#24701)
Micky774 Feb 14, 2023
70ae320
CLN Update `cluster/_hdbscan/_tree.pyx` style and syntax (#25768)
Micky774 Mar 7, 2023
a6e061e
CLN Cleaned `TreeUnionFind` in `_hdbscan/_tree.pyx` (#25827)
Micky774 Mar 14, 2023
d013d38
ENH `_hdbscan` `HIERARCHY` data type introduction (#25826)
Micky774 Mar 28, 2023
792833a
CLN BFS Style Improvement (#26096)
Micky774 Apr 5, 2023
c02c455
CLN `_hdbscan/_tree.pyx` algorithms refactor (#26011)
Micky774 Apr 19, 2023
607dea8
CLN HDBSCAN `_tree.pyx::do_labelling` refactor (#26101)
Micky774 May 16, 2023
792485a
Updated changelog
Micky774 May 16, 2023
0661984
Updated typing
Micky774 May 16, 2023
4da655c
Updated typing, moved away from deprecated numpy API
Micky774 May 16, 2023
7215ff0
Updated shape semantics based on underlying data structure
Micky774 May 19, 2023
861b143
Doc fixes
Micky774 May 20, 2023
87a43e5
Updated shape semantics in remaining Cython code
Micky774 May 20, 2023
15513df
Merge branch 'main' into hdbscan_sync
Micky774 May 21, 2023
3d46b14
Use cimport for PyArrayObject and add noexcept for UnionFind
Micky774 May 22, 2023
495e21f
Updated cimport strategy
Micky774 May 22, 2023
b4bdfbd
Added copy of license
Micky774 May 24, 2023
3e9ed2e
Updated documentation of helper functions/methods
Micky774 May 24, 2023
fe11344
Improved testing documentation
Micky774 May 24, 2023
04c8b69
Updated documentation to explicitly include example
Micky774 May 31, 2023
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -104,6 +104,7 @@
Classes
cluster.AgglomerativeClustering
cluster.Birch
cluster.DBSCAN
cluster.HDBSCAN
cluster.FeatureAgglomeration
cluster.KMeans
cluster.BisectingKMeans
123 changes: 118 additions & 5 deletions doc/modules/clustering.rst
@@ -93,6 +93,13 @@
Overview of clustering methods
transductive
- Distances between nearest points

* - :ref:`HDBSCAN <hdbscan>`
- minimum cluster membership, minimum point neighbors
- large ``n_samples``, medium ``n_clusters``
- Non-flat geometry, uneven cluster sizes, outlier removal,
transductive, hierarchical, variable cluster density
- Distances between nearest points

* - :ref:`OPTICS <optics>`
- minimum cluster membership
- Very large ``n_samples``, large ``n_clusters``
@@ -392,8 +399,8 @@
for centroids to be the mean of the points within a given region. These
candidates are then filtered in a post-processing stage to eliminate
near-duplicates to form the final set of centroids.

The position of centroid candidates is iteratively adjusted using a technique called hill
climbing, which finds local maxima of the estimated probability density.
Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate
is updated according to the following equation:

@@ -412,14 +419,14 @@
its neighborhood:

m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x

In general, the equation for :math:`m` depends on a kernel used for density estimation.
The generic formula is:

.. math::

m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x

In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in
the neighborhood of :math:`x`.
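
As a small illustration, a single hill-climbing update with this flat kernel
reduces to the mean of the neighbors minus the candidate itself. A minimal
sketch, assuming Euclidean distances and a hypothetical `bandwidth` radius
defining :math:`N(x)` (this is not scikit-learn's implementation):

.. code-block:: python

    import numpy as np

    def mean_shift_update(x, X, bandwidth):
        # Flat kernel: K(x_j - x) = 1 inside the bandwidth, 0 outside,
        # so m(x) is just the mean of the neighbors minus x.
        neighbors = X[np.linalg.norm(X - x, axis=1) <= bandwidth]
        return neighbors.mean(axis=0) - x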

@@ -961,6 +968,112 @@
by black points below.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.

.. _hdbscan:

HDBSCAN
=======

The :class:`HDBSCAN` algorithm can be seen as an extension of :class:`DBSCAN`
and :class:`OPTICS`. Specifically, :class:`DBSCAN` assumes that the clustering
criterion (i.e. density requirement) is *globally homogeneous*.
In other words, :class:`DBSCAN` may struggle to successfully capture clusters
with different densities.
:class:`HDBSCAN` alleviates this assumption and explores all possible density
scales by building an alternative representation of the clustering problem.

.. note::

This implementation is adapted from the original implementation of HDBSCAN,
`scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_, which is based on [LJ2017]_.

Mutual Reachability Graph
-------------------------

HDBSCAN first defines :math:`d_c(x_p)`, the *core distance* of a sample :math:`x_p`, as the
distance to its `min_samples`-th nearest neighbor, counting itself. For example,
if `min_samples=5` and :math:`x_*` is the 5th-nearest neighbor of :math:`x_p`,
then the core distance is:

.. math:: d_c(x_p)=d(x_p, x_*).

Next it defines :math:`d_m(x_p, x_q)`, the *mutual reachability distance* of two points
:math:`x_p, x_q`, as:

.. math:: d_m(x_p, x_q) = \max\{d_c(x_p), d_c(x_q), d(x_p, x_q)\}

These two notions allow us to construct the *mutual reachability graph*
:math:`G_{ms}`, defined for a fixed choice of `min_samples`, by associating each
sample :math:`x_p` with a vertex of the graph, and weighting the edge between
points :math:`x_p, x_q` by their mutual reachability distance :math:`d_m(x_p, x_q)`.
We may build subsets of this graph, denoted as :math:`G_{ms,\varepsilon}`, by
removing any edges with weight greater than :math:`\varepsilon` from the original
graph. Any points whose core distance is greater than :math:`\varepsilon` are at
this stage marked as noise. The remaining points are then clustered by finding
the connected components of this trimmed graph.
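
A minimal sketch of these two quantities, using brute-force Euclidean distances
with NumPy (scikit-learn's actual implementation is considerably more
efficient; the helper name `mutual_reachability` is hypothetical):

.. code-block:: python

    import numpy as np

    def mutual_reachability(X, min_samples=5):
        # Pairwise distances d(x_p, x_q).
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        # Core distance: distance to the min_samples-th nearest neighbor,
        # counting the point itself (index min_samples - 1 after sorting).
        core = np.sort(dist, axis=1)[:, min_samples - 1]
        # d_m(x_p, x_q) = max{d_c(x_p), d_c(x_q), d(x_p, x_q)}
        return np.maximum(np.maximum(core[:, None], core[None, :]), dist)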

.. note::

Taking the connected components of a trimmed graph :math:`G_{ms,\varepsilon}` is
equivalent to running DBSCAN* with `min_samples` and :math:`\varepsilon`. DBSCAN* is a
slightly modified version of DBSCAN mentioned in [CM2013]_.

Hierarchical Clustering
-----------------------
HDBSCAN can be seen as an algorithm which performs DBSCAN* clustering across all
values of :math:`\varepsilon`. As mentioned prior, this is equivalent to finding the connected
components of the mutual reachability graphs for all values of :math:`\varepsilon`. To do this
efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the
fully-connected mutual reachability graph, then greedily cuts the edges with
highest weight. An outline of the HDBSCAN algorithm is as follows (a rough
code sketch is given after the steps):

1. Extract the MST of :math:`G_{ms}`
2. Extend the MST by adding a "self edge" for each vertex, with weight equal
to the core distance of the underlying sample.
3. Initialize a single cluster and label for the MST.
4. Remove the edge with the greatest weight from the MST (ties are
removed simultaneously).
5. Assign cluster labels to the connected components which contain the
end points of the now-removed edge. If the component does not have at least
one edge it is instead assigned a "null" label marking it as noise.
6. Repeat 4-5 until there are no more connected components.
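
The following is a rough sketch of this procedure cut at a single scale
:math:`\varepsilon`, i.e. steps 1, 4 and 5 for one fixed threshold, using
brute-force NumPy distances and SciPy graph routines in place of
scikit-learn's optimized Cython implementation; the helper name
`single_level_labels` is hypothetical:

.. code-block:: python

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components, minimum_spanning_tree

    def single_level_labels(X, eps, min_samples=5):
        # Mutual reachability weights of the fully-connected graph G_{ms}.
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        core = np.sort(dist, axis=1)[:, min_samples - 1]
        G = np.maximum(np.maximum(core[:, None], core[None, :]), dist)
        np.fill_diagonal(G, 0)  # drop self loops before building the MST
        mst = minimum_spanning_tree(csr_matrix(G)).toarray()  # step 1
        mst[mst > eps] = 0  # steps 4-5: cut every edge heavier than eps
        _, labels = connected_components(csr_matrix(mst), directed=False)
        # A point whose core distance exceeds eps keeps no edges at this
        # scale (cf. the self edges of step 2) and is marked as noise.
        labels[core > eps] = -1
        return labels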

HDBSCAN is therefore able to obtain all possible partitions achievable by
DBSCAN* for a fixed choice of `min_samples` in a hierarchical fashion.
Indeed, this allows HDBSCAN to perform clustering across multiple densities
and as such it no longer needs :math:`\varepsilon` to be given as a hyperparameter. Instead
it relies solely on the choice of `min_samples`, which tends to be a more robust
hyperparameter.
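
Once fit, the estimator can also re-cut the computed hierarchy at any fixed
:math:`\varepsilon` via its `dbscan_clustering` method, without refitting. A
usage sketch on synthetic data (the data and parameter values are illustrative
only):

.. code-block:: python

    from sklearn.cluster import HDBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.5, 0.5],
                      random_state=42)
    hdb = HDBSCAN(min_samples=5).fit(X)
    print(set(hdb.labels_))  # hierarchy-based labels; -1 marks noise
    # DBSCAN*-style flat clusterings at several scales, from one fit:
    for eps in (0.5, 1.0, 2.0):
        print(eps, set(hdb.dbscan_clustering(cut_distance=eps)))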

.. |hdbscan_ground_truth| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_005.png
:target: ../auto_examples/cluster/plot_hdbscan.html
:scale: 75
.. |hdbscan_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_007.png
:target: ../auto_examples/cluster/plot_hdbscan.html
:scale: 75

.. centered:: |hdbscan_ground_truth|
.. centered:: |hdbscan_results|

HDBSCAN can be smoothed with an additional hyperparameter `min_cluster_size`,
which specifies that during the hierarchical clustering, components with fewer
than `min_cluster_size` many samples are considered noise. In practice, one
can set `min_cluster_size = min_samples` to couple the parameters and
simplify the hyperparameter space.
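
As a brief sketch of that coupling (the parameter value is chosen arbitrarily):

.. code-block:: python

    from sklearn.cluster import HDBSCAN

    # Treat components with fewer than 10 samples as noise, coupling
    # min_cluster_size to min_samples to shrink the search space.
    hdb = HDBSCAN(min_samples=10, min_cluster_size=10)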

.. topic:: References:

.. [CM2013] Campello, R.J.G.B., Moulavi, D., Sander, J. (2013). Density-Based Clustering
Based on Hierarchical Density Estimates. In: Pei, J., Tseng, V.S., Cao, L.,
Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining.
PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin,
Heidelberg.
:doi:`Density-Based Clustering Based on Hierarchical Density Estimates <10.1007/978-3-642-37456-2_14>`

.. [LJ2017] L. McInnes and J. Healy, (2017). Accelerated Hierarchical Density Based
Clustering. In: IEEE International Conference on Data Mining Workshops (ICDMW),
2017, pp. 33-42.
:doi:`Accelerated Hierarchical Density Based Clustering <10.1109/ICDMW.2017.12>`

.. _optics:

OPTICS
@@ -1033,7 +1146,7 @@
represented as children of a larger parent cluster.
Different distance metrics can be supplied via the ``metric`` keyword.

For large datasets, similar (but not identical) results can be obtained via
:class:`HDBSCAN`. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS,
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain :math:`n` (as opposed
13 changes: 13 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -192,6 +192,19 @@
Changelog
:user:`Jérémie du Boisberranger <jeremiedbb>`,
:user:`Guillaume Lemaitre <glemaitre>`.

- |MajorFeature| Added :class:`cluster.HDBSCAN`, a modern hierarchical density-based
clustering algorithm. Similarly to :class:`cluster.OPTICS`, it can be seen as a
generalization of :class:`cluster.DBSCAN` by allowing for hierarchical instead of
flat clustering; however, it varies in its approach from :class:`cluster.OPTICS`.
This algorithm is very robust with respect to its hyperparameters' values and can
be used on a wide variety of data without much, if any, tuning.

This implementation is an adaptation from the original implementation of HDBSCAN in
`scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_,
by :user:`Leland McInnes <lmcinnes>` et al.

:pr:`26385` by :user:`Meekail Zain <micky774>`.

:mod:`sklearn.covariance`
.........................

9 changes: 9 additions & 0 deletions examples/cluster/plot_cluster_comparison.py
@@ -79,6 +79,9 @@
"min_samples": 7,
"xi": 0.05,
"min_cluster_size": 0.1,
"allow_single_cluster": True,
"hdbscan_min_cluster_size": 15,
"hdbscan_min_samples": 3,
}

datasets = [
@@ -161,6 +164,11 @@
affinity="nearest_neighbors",
)
dbscan = cluster.DBSCAN(eps=params["eps"])
hdbscan = cluster.HDBSCAN(
min_samples=params["hdbscan_min_samples"],
min_cluster_size=params["hdbscan_min_cluster_size"],
allow_single_cluster=params["allow_single_cluster"],
)
optics = cluster.OPTICS(
min_samples=params["min_samples"],
xi=params["xi"],
@@ -188,6 +196,7 @@
("Ward", ward),
("Agglomerative\nClustering", average_linkage),
("DBSCAN", dbscan),
("HDBSCAN", hdbscan),
("OPTICS", optics),
("BIRCH", birch),
("Gaussian\nMixture", gmm),