FEAT Add `HDBSCAN` as a new estimator in `sklearn.cluster` (#26385) · scikit-learn/scikit-learn@415b11a
Commit 415b11a

Micky774, thomasjpfan, jjerphan, and glemaitre authored
FEAT Add HDBSCAN as a new estimator in sklearn.cluster (#26385)
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
1 parent fecfb34 commit 415b11a

19 files changed: +3297 −10 lines

doc/modules/classes.rst

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ Classes
    cluster.AgglomerativeClustering
    cluster.Birch
    cluster.DBSCAN
+   cluster.HDBSCAN
    cluster.FeatureAgglomeration
    cluster.KMeans
    cluster.BisectingKMeans

doc/modules/clustering.rst

Lines changed: 118 additions & 5 deletions
@@ -93,6 +93,13 @@ Overview of clustering methods
        transductive
      - Distances between nearest points
 
+   * - :ref:`HDBSCAN <hdbscan>`
+     - minimum cluster membership, minimum point neighbors
+     - large ``n_samples``, medium ``n_clusters``
+     - Non-flat geometry, uneven cluster sizes, outlier removal,
+       transductive, hierarchical, variable cluster density
+     - Distances between nearest points
+
    * - :ref:`OPTICS <optics>`
      - minimum cluster membership
      - Very large ``n_samples``, large ``n_clusters``

@@ -392,8 +399,8 @@ for centroids to be the mean of the points within a given region. These
 candidates are then filtered in a post-processing stage to eliminate
 near-duplicates to form the final set of centroids.
 
-The position of centroid candidates is iteratively adjusted using a technique called hill
-climbing, which finds local maxima of the estimated probability density.
+The position of centroid candidates is iteratively adjusted using a technique called hill
+climbing, which finds local maxima of the estimated probability density.
 Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate
 is updated according to the following equation:

@@ -412,14 +419,14 @@ its neighborhood:
 
     m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)}x_j - x
 
-In general, the equation for :math:`m` depends on a kernel used for density estimation.
+In general, the equation for :math:`m` depends on a kernel used for density estimation.
 The generic formula is:
 
 .. math::
 
     m(x) = \frac{\sum_{x_j \in N(x)}K(x_j - x)x_j}{\sum_{x_j \in N(x)}K(x_j - x)} - x
 
-In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
+In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
 equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in
 the neighborhood of :math:`x`.
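As a side note on the mean-shift passage retouched above, the flat-kernel update is short enough to spell out in code. The following is a minimal NumPy sketch (the function name and bandwidth handling are illustrative, not scikit-learn's `MeanShift` internals):

    # One mean-shift step with the flat kernel described above:
    # K(x_j - x) = 1 if x_j is within `bandwidth` of x, and 0 otherwise,
    # so m(x) reduces to the mean of the neighborhood minus x.
    import numpy as np

    def mean_shift_step(x, X, bandwidth):
        neighbors = X[np.linalg.norm(X - x, axis=1) <= bandwidth]  # N(x)
        return neighbors.mean(axis=0) - x  # m(x); the update is x^{t+1} = x^t + m(x^t)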

@@ -961,6 +968,112 @@ by black points below.
     Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
     In ACM Transactions on Database Systems (TODS), 42(3), 19.
 
+.. _hdbscan:
+
+HDBSCAN
+=======
+
+The :class:`HDBSCAN` algorithm can be seen as an extension of :class:`DBSCAN`
+and :class:`OPTICS`. Specifically, :class:`DBSCAN` assumes that the clustering
+criterion (i.e. density requirement) is *globally homogeneous*.
+In other words, :class:`DBSCAN` may struggle to successfully capture clusters
+with different densities.
+:class:`HDBSCAN` alleviates this assumption and explores all possible density
+scales by building an alternative representation of the clustering problem.
+
+.. note::
+
+   This implementation is adapted from the original implementation of HDBSCAN,
+   `scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_ based on [LJ2017]_.
+
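Before the formal construction below, a minimal usage sketch of the new estimator may help (the dataset and parameter values are illustrative, not taken from the commit):

    # Two blobs with very different densities: a single global eps (DBSCAN)
    # handles this poorly, while HDBSCAN clusters it directly.
    import numpy as np
    from sklearn.cluster import HDBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(
        n_samples=[200, 200],
        centers=[[0, 0], [5, 5]],
        cluster_std=[0.2, 1.5],  # uneven densities
        random_state=0,
    )
    labels = HDBSCAN(min_cluster_size=10).fit_predict(X)
    print(np.unique(labels))  # cluster ids; -1 marks noise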
+Mutual Reachability Graph
+-------------------------
+
+HDBSCAN first defines :math:`d_c(x_p)`, the *core distance* of a sample :math:`x_p`, as the
+distance to its `min_samples`-th nearest neighbor, counting itself. For example,
+if `min_samples=5` and :math:`x_*` is the 5th-nearest neighbor of :math:`x_p`
+then the core distance is:
+
+.. math:: d_c(x_p)=d(x_p, x_*).
+
+Next it defines :math:`d_m(x_p, x_q)`, the *mutual reachability distance* of two points
+:math:`x_p, x_q`, as:
+
+.. math:: d_m(x_p, x_q) = \max\{d_c(x_p), d_c(x_q), d(x_p, x_q)\}
+
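These two definitions are compact enough to state in code. The following is a brute-force NumPy sketch (`mutual_reachability` is a hypothetical helper for illustration, not the optimized routine the estimator uses):

    import numpy as np
    from scipy.spatial.distance import cdist

    def mutual_reachability(X, min_samples=5):
        D = cdist(X, X)  # pairwise distances d(x_p, x_q)
        # Core distance: distance to the min_samples-th nearest neighbor,
        # counting the point itself (each sorted row starts with the 0
        # self-distance, so the core distance sits at index min_samples - 1).
        core = np.sort(D, axis=1)[:, min_samples - 1]
        # d_m(x_p, x_q) = max{d_c(x_p), d_c(x_q), d(x_p, x_q)}
        return np.maximum(np.maximum.outer(core, core), D)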
+These two notions allow us to construct the *mutual reachability graph*
+:math:`G_{ms}` defined for a fixed choice of `min_samples` by associating each
+sample :math:`x_p` with a vertex of the graph, with the edge between points
+:math:`x_p, x_q` weighted by the mutual reachability distance :math:`d_m(x_p, x_q)`
+between them. We may build subsets of this graph, denoted as
+:math:`G_{ms,\varepsilon}`, by removing any edges with value greater than :math:`\varepsilon`
+from the original graph. Any points whose core distance is greater than :math:`\varepsilon`
+are at this stage marked as noise. The remaining points are then clustered by
+finding the connected components of this trimmed graph.
+
+.. note::
+
+   Taking the connected components of a trimmed graph :math:`G_{ms,\varepsilon}` is
+   equivalent to running DBSCAN* with `min_samples` and :math:`\varepsilon`. DBSCAN* is a
+   slightly modified version of DBSCAN mentioned in [CM2013]_.
+
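To make the note concrete, a DBSCAN*-style labeling can be sketched directly from the trimmed graph, reusing the hypothetical `mutual_reachability` helper above (SciPy's `connected_components` stands in for the real traversal):

    # DBSCAN*-style clustering of the trimmed graph G_{ms,eps}: keep edges
    # with weight <= eps, label connected components, mark non-core points
    # as noise.
    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def dbscan_star(X, eps, min_samples=5):
        d_m = mutual_reachability(X, min_samples)  # hypothetical helper above
        adjacency = d_m <= eps                     # trim edges heavier than eps
        _, labels = connected_components(adjacency, directed=False)
        core = np.diag(d_m)       # the diagonal of d_m recovers core distances
        labels[core > eps] = -1   # core distance > eps: marked as noise
        return labels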
+Hierarchical Clustering
+-----------------------
+
+HDBSCAN can be seen as an algorithm which performs DBSCAN* clustering across all
+values of :math:`\varepsilon`. As mentioned previously, this is equivalent to finding the connected
+components of the mutual reachability graphs for all values of :math:`\varepsilon`. To do this
+efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the fully
+connected mutual reachability graph, then greedily cuts the edges with highest
+weight. An outline of the HDBSCAN algorithm is as follows:
+
+1. Extract the MST of :math:`G_{ms}`.
+2. Extend the MST by adding a "self edge" for each vertex, with weight equal
+   to the core distance of the underlying sample.
+3. Initialize a single cluster and label for the MST.
+4. Remove the edge with the greatest weight from the MST (ties are
+   removed simultaneously).
+5. Assign cluster labels to the connected components which contain the
+   end points of the now-removed edge. If the component does not have at least
+   one edge it is instead assigned a "null" label marking it as noise.
+6. Repeat 4-5 until there are no more connected components.
+
+HDBSCAN is therefore able to obtain all possible partitions achievable by
+DBSCAN* for a fixed choice of `min_samples` in a hierarchical fashion.
+Indeed, this allows HDBSCAN to perform clustering across multiple densities
+and as such it no longer needs :math:`\varepsilon` to be given as a hyperparameter. Instead
+it relies solely on the choice of `min_samples`, which tends to be a more robust
+hyperparameter.
+
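Step 1 of the outline above can be sketched with SciPy's MST routine, again reusing the hypothetical helper (scikit-learn's implementation builds the MST with dedicated, more memory-efficient code, and step 2's self edges are omitted in this sketch):

    # Step 1: MST of the mutual reachability graph. Sorting the MST edges by
    # decreasing weight gives the removal order used in steps 4-6.
    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    rng = np.random.RandomState(0)
    X = rng.normal(size=(50, 2))                 # illustrative data
    d_m = mutual_reachability(X, min_samples=5)  # hypothetical helper above
    mst = minimum_spanning_tree(d_m)             # sparse matrix with n - 1 edges
    removal_order = np.argsort(mst.data)[::-1]   # heaviest edges are cut first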
+.. |hdbscan_ground_truth| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_005.png
+   :target: ../auto_examples/cluster/plot_hdbscan.html
+   :scale: 75
+.. |hdbscan_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_007.png
+   :target: ../auto_examples/cluster/plot_hdbscan.html
+   :scale: 75
+
+.. centered:: |hdbscan_ground_truth|
+.. centered:: |hdbscan_results|
+
+HDBSCAN can be smoothed with an additional hyperparameter `min_cluster_size`
+which specifies that during the hierarchical clustering, components with fewer
+than `min_cluster_size` samples are considered noise. In practice, one
+can set `min_cluster_size = min_samples` to couple the parameters and
+simplify the hyperparameter space.
+
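In estimator terms, the coupling suggested above is a one-liner (the value is illustrative):

    # Couple the two hyperparameters to shrink the search space.
    from sklearn.cluster import HDBSCAN

    hdb = HDBSCAN(min_cluster_size=20, min_samples=20)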
+.. topic:: References:
+
+    .. [CM2013] Campello, R.J.G.B., Moulavi, D., Sander, J. (2013). Density-Based Clustering
+       Based on Hierarchical Density Estimates. In: Pei, J., Tseng, V.S., Cao, L.,
+       Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining.
+       PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin,
+       Heidelberg.
+       :doi:`Density-Based Clustering Based on Hierarchical Density Estimates <10.1007/978-3-642-37456-2_14>`
+
+    .. [LJ2017] L. McInnes and J. Healy (2017). Accelerated Hierarchical Density Based
+       Clustering. In: IEEE International Conference on Data Mining Workshops (ICDMW),
+       2017, pp. 33-42.
+       :doi:`Accelerated Hierarchical Density Based Clustering <10.1109/ICDMW.2017.12>`
+
 .. _optics:
 
 OPTICS
@@ -1033,7 +1146,7 @@ represented as children of a larger parent cluster.
 Different distance metrics can be supplied via the ``metric`` keyword.
 
 For large datasets, similar (but not identical) results can be obtained via
-`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
+:class:`HDBSCAN`. The HDBSCAN implementation is
 multithreaded, and has better algorithmic runtime complexity than OPTICS,
 at the cost of worse memory scaling. For extremely large datasets that
 exhaust system memory using HDBSCAN, OPTICS will maintain :math:`n` (as opposed

doc/whats_new/v1.3.rst

Lines changed: 13 additions & 0 deletions
@@ -202,6 +202,19 @@ Changelog
   :user:`Jérémie du Boisberranger <jeremiedbb>`,
   :user:`Guillaume Lemaitre <glemaitre>`.
 
+- |MajorFeature| Added :class:`cluster.HDBSCAN`, a modern hierarchical density-based
+  clustering algorithm. Similarly to :class:`cluster.OPTICS`, it can be seen as a
+  generalization of :class:`cluster.DBSCAN` by allowing for hierarchical instead of flat
+  clustering; however, it varies in its approach from :class:`cluster.OPTICS`. This
+  algorithm is very robust with respect to its hyperparameters' values and can
+  be used on a wide variety of data without much, if any, tuning.
+
+  This implementation is an adaptation from the original implementation of HDBSCAN in
+  `scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_,
+  by :user:`Leland McInnes <lmcinnes>` et al.
+
+  :pr:`26385` by :user:`Meekail Zain <micky774>`.
+
 :mod:`sklearn.covariance`
 .........................

examples/cluster/plot_cluster_comparison.py

Lines changed: 9 additions & 0 deletions
@@ -79,6 +79,9 @@
     "min_samples": 7,
     "xi": 0.05,
     "min_cluster_size": 0.1,
+    "allow_single_cluster": True,
+    "hdbscan_min_cluster_size": 15,
+    "hdbscan_min_samples": 3,
 }
 
 datasets = [

@@ -161,6 +164,11 @@
         affinity="nearest_neighbors",
     )
     dbscan = cluster.DBSCAN(eps=params["eps"])
+    hdbscan = cluster.HDBSCAN(
+        min_samples=params["hdbscan_min_samples"],
+        min_cluster_size=params["hdbscan_min_cluster_size"],
+        allow_single_cluster=params["allow_single_cluster"],
+    )
     optics = cluster.OPTICS(
         min_samples=params["min_samples"],
         xi=params["xi"],

@@ -188,6 +196,7 @@
         ("Ward", ward),
         ("Agglomerative\nClustering", average_linkage),
         ("DBSCAN", dbscan),
+        ("HDBSCAN", hdbscan),
         ("OPTICS", optics),
         ("BIRCH", birch),
         ("Gaussian\nMixture", gmm),
