@@ -93,6 +93,13 @@ Overview of clustering methods
    transductive
  - Distances between nearest points

+ * - :ref:`HDBSCAN <hdbscan>`
+   - minimum cluster membership, minimum point neighbors
+   - large ``n_samples``, medium ``n_clusters``
+   - Non-flat geometry, uneven cluster sizes, outlier removal,
+     transductive, hierarchical, variable cluster density
+   - Distances between nearest points
+
* - :ref:`OPTICS <optics>`
  - minimum cluster membership
  - Very large ``n_samples``, large ``n_clusters``
@@ -392,8 +399,8 @@ for centroids to be the mean of the points within a given region. These
candidates are then filtered in a post-processing stage to eliminate
near-duplicates to form the final set of centroids.

- The position of centroid candidates is iteratively adjusted using a technique called hill
- climbing, which finds local maxima of the estimated probability density.
+ The position of centroid candidates is iteratively adjusted using a technique called hill
+ climbing, which finds local maxima of the estimated probability density.
Given a candidate centroid :math:`x` for iteration :math:`t`, the candidate
is updated according to the following equation:
@@ -412,14 +419,14 @@ its neighborhood:
    m(x) = \frac{1}{|N(x)|} \sum_{x_j \in N(x)} x_j - x

- In general, the equation for :math:`m` depends on a kernel used for density estimation.
+ In general, the equation for :math:`m` depends on a kernel used for density estimation.
The generic formula is:

.. math::

    m(x) = \frac{\sum_{x_j \in N(x)} K(x_j - x) x_j}{\sum_{x_j \in N(x)} K(x_j - x)} - x

- In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
+ In our implementation, :math:`K(x)` is equal to 1 if :math:`x` is small enough and is
equal to 0 otherwise. Effectively :math:`K(y - x)` indicates whether :math:`y` is in
the neighborhood of :math:`x`.
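
+ As a toy illustration of this update (a sketch, not scikit-learn's actual
+ implementation), one mean shift iteration with the flat kernel can be written
+ as follows, where the hypothetical ``bandwidth`` argument plays the role of
+ the neighborhood radius defining :math:`N(x)`:
+
+ .. code-block:: python
+
+     import numpy as np
+
+     def mean_shift_step(x, X, bandwidth):
+         # N(x): all samples within `bandwidth` of the current candidate x.
+         neighbors = X[np.linalg.norm(X - x, axis=1) <= bandwidth]
+         # With the flat kernel, x_{t+1} = x_t + m(x_t) is simply the mean
+         # of the neighborhood.
+         return neighbors.mean(axis=0)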
@@ -961,6 +968,112 @@ by black points below.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.

+ .. _hdbscan:
+
+ HDBSCAN
+ =======
+
+ The :class:`HDBSCAN` algorithm can be seen as an extension of :class:`DBSCAN`
+ and :class:`OPTICS`. Specifically, :class:`DBSCAN` assumes that the clustering
+ criterion (i.e. density requirement) is *globally homogeneous*.
+ In other words, :class:`DBSCAN` may struggle to successfully capture clusters
+ with different densities.
+ :class:`HDBSCAN` alleviates this assumption and explores all possible density
+ scales by building an alternative representation of the clustering problem.
+
+ .. note::
+
+    This implementation is adapted from the original implementation of HDBSCAN,
+    `scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_ based on [LJ2017]_.
+
+ Mutual Reachability Graph
+ -------------------------
+
+ HDBSCAN first defines :math:`d_c(x_p)`, the *core distance* of a sample :math:`x_p`, as the
+ distance to its `min_samples`-th nearest neighbor, counting itself. For example,
+ if `min_samples=5` and :math:`x_*` is the 5th-nearest neighbor of :math:`x_p`
+ then the core distance is:
+
+ .. math:: d_c(x_p) = d(x_p, x_*).
+
+ Next it defines :math:`d_m(x_p, x_q)`, the *mutual reachability distance* of two points
+ :math:`x_p, x_q`, as:
+
+ .. math:: d_m(x_p, x_q) = \max\{d_c(x_p), d_c(x_q), d(x_p, x_q)\}
+
+ These two notions allow us to construct the *mutual reachability graph*
+ :math:`G_{ms}`, defined for a fixed choice of `min_samples` by associating each
+ sample :math:`x_p` with a vertex of the graph, with edges between points
+ :math:`x_p, x_q` weighted by the mutual reachability distance :math:`d_m(x_p, x_q)`
+ between them. We may build subsets of this graph, denoted as
+ :math:`G_{ms,\varepsilon}`, by removing any edges with value greater than :math:`\varepsilon`
+ from the original graph. Any points whose core distance is greater than :math:`\varepsilon`
+ are at this stage marked as noise. The remaining points are then clustered by
+ finding the connected components of this trimmed graph.
+
+ .. note::
+
+    Taking the connected components of a trimmed graph :math:`G_{ms,\varepsilon}` is
+    equivalent to running DBSCAN* with `min_samples` and :math:`\varepsilon`. DBSCAN* is a
+    slightly modified version of DBSCAN mentioned in [CM2013]_.
+
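+ As a rough sketch of these definitions (illustrative only, not the actual
+ scikit-learn implementation; `dbscan_star` is a hypothetical helper):
+
+ .. code-block:: python
+
+     import numpy as np
+     from scipy.sparse.csgraph import connected_components
+     from sklearn.metrics import pairwise_distances
+
+     def dbscan_star(X, min_samples=5, eps=1.0):
+         D = pairwise_distances(X)
+         # Core distance: distance to the min_samples-th nearest neighbor,
+         # counting the point itself.
+         core = np.sort(D, axis=1)[:, min_samples - 1]
+         # Mutual reachability distance between all pairs of samples.
+         mreach = np.maximum(D, np.maximum(core[:, None], core[None, :]))
+         # Trim edges heavier than eps and take connected components.
+         adjacency = (mreach <= eps) & ~np.eye(len(X), dtype=bool)
+         _, labels = connected_components(adjacency, directed=False)
+         labels[core > eps] = -1  # noise at this density level
+         return labels
+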
+ Hierarchical Clustering
+ -----------------------
+
+ HDBSCAN can be seen as an algorithm which performs DBSCAN* clustering across all
+ values of :math:`\varepsilon`. As mentioned prior, this is equivalent to finding the connected
+ components of the mutual reachability graphs for all values of :math:`\varepsilon`. To do this
+ efficiently, HDBSCAN first extracts a minimum spanning tree (MST) from the
+ fully-connected mutual reachability graph, then greedily cuts the edges with highest
+ weight. An outline of the HDBSCAN algorithm is as follows (a small sketch of the
+ MST step follows the outline):
+
+ 1. Extract the MST of :math:`G_{ms}`.
+ 2. Extend the MST by adding a "self edge" for each vertex, with weight equal
+    to the core distance of the underlying sample.
+ 3. Initialize a single cluster and label for the MST.
+ 4. Remove the edge with the greatest weight from the MST (ties are
+    removed simultaneously).
+ 5. Assign cluster labels to the connected components which contain the
+    end points of the now-removed edge. If the component does not have at least
+    one edge it is instead assigned a "null" label marking it as noise.
+ 6. Repeat 4-5 until there are no more connected components.
+
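+ A minimal sketch of step 1, reusing ``np`` and the ``mreach`` matrix from the
+ sketch above (the self-edge extension of step 2 is omitted here):
+
+ .. code-block:: python
+
+     from scipy.sparse.csgraph import minimum_spanning_tree
+
+     # MST of the mutual reachability graph, as a sparse matrix of edges.
+     mst = minimum_spanning_tree(mreach)
+     # Cutting edges in order of decreasing weight (steps 4-6) produces
+     # the cluster hierarchy.
+     edge_weights_descending = np.sort(mst.data)[::-1]
+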
+ HDBSCAN is therefore able to obtain all possible partitions achievable by
+ DBSCAN* for a fixed choice of `min_samples` in a hierarchical fashion.
+ Indeed, this allows HDBSCAN to perform clustering across multiple densities
+ and as such it no longer needs :math:`\varepsilon` to be given as a hyperparameter. Instead
+ it relies solely on the choice of `min_samples`, which tends to be a more robust
+ hyperparameter.
+
+ .. |hdbscan_ground_truth| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_005.png
+         :target: ../auto_examples/cluster/plot_hdbscan.html
+         :scale: 75
+ .. |hdbscan_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_hdbscan_007.png
+         :target: ../auto_examples/cluster/plot_hdbscan.html
+         :scale: 75
+
+ .. centered:: |hdbscan_ground_truth|
+ .. centered:: |hdbscan_results|
+
+ HDBSCAN can be smoothed with an additional hyperparameter `min_cluster_size`
+ which specifies that during the hierarchical clustering, components with fewer
+ than `min_cluster_size` many samples are considered noise. In practice, one
+ can set `min_cluster_size = min_samples` to couple the parameters and
+ simplify the hyperparameter space.
+
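+ For instance, a minimal usage sketch coupling the two hyperparameters
+ (assuming the estimator follows the standard scikit-learn API):
+
+ .. code-block:: python
+
+     from sklearn.cluster import HDBSCAN
+     from sklearn.datasets import make_blobs
+
+     X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
+     hdb = HDBSCAN(min_samples=5, min_cluster_size=5).fit(X)
+     labels = hdb.labels_  # noise samples are labeled -1
+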
+ .. topic:: References:
+
+    .. [CM2013] Campello, R.J.G.B., Moulavi, D., Sander, J. (2013). Density-Based Clustering
+       Based on Hierarchical Density Estimates. In: Pei, J., Tseng, V.S., Cao, L.,
+       Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining.
+       PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin,
+       Heidelberg.
+       :doi:`Density-Based Clustering Based on Hierarchical Density Estimates <10.1007/978-3-642-37456-2_14>`
+
+    .. [LJ2017] L. McInnes and J. Healy, (2017). Accelerated Hierarchical Density Based
+       Clustering. In: IEEE International Conference on Data Mining Workshops (ICDMW),
+       2017, pp. 33-42.
+       :doi:`Accelerated Hierarchical Density Based Clustering <10.1109/ICDMW.2017.12>`
+
.. _optics:

OPTICS
@@ -1033,7 +1146,7 @@ represented as children of a larger parent cluster.
Different distance metrics can be supplied via the ``metric`` keyword.

For large datasets, similar (but not identical) results can be obtained via
- `HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
+ :class:`HDBSCAN`. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS,
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain :math:`n` (as opposed