8000 Revert "MNT remove optics from clustering.rst (#13381)" · xhluca/scikit-learn@3d8e2fb · GitHub
[go: up one dir, main page]

Skip to content

Commit 3d8e2fb

Browse files
author
Xing
authored
Revert "MNT remove optics from clustering.rst (scikit-learn#13381)"
This reverts commit 45fa7d2.
1 parent c518ded commit 3d8e2fb

File tree

1 file changed

+111
-0
lines changed

1 file changed

+111
-0
lines changed

doc/modules/clustering.rst

Lines changed: 111 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,12 @@ Overview of clustering methods
9191
- Non-flat geometry, uneven cluster sizes
9292
- Distances between nearest points
9393

94+
* - :ref:`OPTICS <optics>`
95+
- minimum cluster membership
96+
- Very large ``n_samples``, large ``n_clusters``
97+
- Non-flat geometry, uneven cluster sizes, variable cluster density
98+
- Distances between points
99+
94100
* - :ref:`Gaussian mixtures <mixture>`
95101
- many
96102
- Not scalable
@@ -796,6 +802,11 @@ by black points below.
796802
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
797803
A couple of mechanisms for getting around this are:
798804

805+
- Use :ref:`OPTICS <optics>` clustering in conjunction with the
806+
`extract_dbscan` method. OPTICS clustering also calculates the full
807+
pairwise matrix, but only keeps one row in memory at a time (memory
808+
complexity n).
809+
799810
- A sparse radius neighborhood graph (where missing entries are presumed to
800811
be out of eps) can be precomputed in a memory-efficient way and dbscan
801812
can be run over this with ``metric='precomputed'``. See
@@ -814,6 +825,106 @@ by black points below.
814825
In Proceedings of the 2nd International Conference on Knowledge Discovery
815826
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
816827

828+
.. _optics:
829+
830+
OPTICS
831+
======
832+
833+
The :class:`OPTICS` algorithm shares many similarities with the
834+
:class:`DBSCAN` algorithm, and can be considered a generalization of
835+
DBSCAN that relaxes the ``eps`` requirement from a single value to a value
836+
range. The key difference between DBSCAN and OPTICS is that the OPTICS
837+
algorithm builds a *reachability* graph, which assigns each sample both a
838+
``reachability_`` distance, and a spot within the cluster ``ordering_``
839+
attribute; these two attributes are assigned when the model is fitted, and are
840+
used to determine cluster membership. If OPTICS is run with the default value
841+
of *inf* set for ``max_eps``, then DBSCAN style cluster extraction can be
842+
performed in linear time for any given ``eps`` value using the
843+
``extract_dbscan`` method. Setting ``max_eps`` to a lower value will result
844+
in shorter run times, and can be thought of as the maximum cluster object size
845+
(in diameter) that OPTICS will be able to extract.
846+
847+
.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
848+
:target: ../auto_examples/cluster/plot_optics.html
849+
:scale: 50
850+
851+
.. centered:: |optics_results|
852+
853+
The *reachability* distances generated by OPTICS allow for variable density
854+
extraction of clusters within a single data set. As shown in the above plot,
855+
combining *reachability* distances and data set ``ordering_`` produces a
856+
*reachability plot*, where point density is represented on the Y-axis, and
857+
points are ordered such that nearby points are adjacent. 'Cutting' the
858+
reachability plot at a single value produces DBSCAN like results; all points
859+
above the 'cut' are classified as noise, and each time that there is a break
860+
when reading from left to right signifies a new cluster. The default cluster
861+
extraction with OPTICS looks at changes in slope within the graph to guess at
862+
natural clusters. There are also other possibilities for analysis on the graph
863+
itself, such as generating hierarchical representations of the data through
864+
reachability-plot dendrograms. The plot above has been color-coded so that
865+
cluster colors in planar space match the linear segment clusters of the
866+
reachability plot-- note that the blue and red clusters are adjacent in the
867+
reachability plot, and can be hierarchically represented as children of a
868+
larger parent cluster.
869+
870+
.. topic:: Examples:
871+
872+
* :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`
873+
874+
875+
.. topic:: Comparison with DBSCAN
876+
877+
The results from OPTICS ``extract_dbscan`` method and DBSCAN are not quite
878+
identical. Specifically, while *core_samples* returned from both OPTICS
879+
and DBSCAN are guaranteed to be identical, labeling of periphery and noise
880+
points is not. This is in part because the first sample processed by
881+
OPTICS will always have a reachability distance that is set to ``inf``,
882+
and will thus generally be marked as noise rather than periphery. This
883+
affects adjacent points when they are considered as candidates for being
884+
marked as either periphery or noise. While this effect is quite local to
885+
the starting point of the dataset and is unlikely to be noticed on even
886+
moderately large datasets, it is worth also noting that non-core boundry
887+
points may switch cluster labels on the rare occasion that they are
888+
equidistant to a competeing cluster due to how the graph is read from left
889+
to right when assigning labels.
890+
891+
Note that for any single value of ``eps``, DBSCAN will tend to have a
892+
shorter run time than OPTICS; however, for repeated runs at varying ``eps``
893+
values, a single run of OPTICS may require less cumulative runtime than
894+
DBSCAN. It is also important to note that OPTICS output can be unstable at
895+
``eps`` values very close to the initial ``max_eps`` value. OPTICS seems
896+
to produce near identical results to DBSCAN provided that ``eps`` passed to
897+
``extract_dbscan`` is a half order of magnitude less than the inital
898+
``max_eps`` that was used to fit; using a value close to ``max_eps``
899+
will throw a warning, and using a value larger will result in an exception.
900+
901+
.. topic:: Computational Complexity
902+
903+
Spatial indexing trees are used to avoid calculating the full distance
904+
matrix, and allow for efficient memory usage on large sets of samples.
905+
Different distance metrics can be supplied via the ``metric`` keyword.
906+
907+
For large datasets, similar (but not identical) results can be obtained via
908+
`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
909+
multithreaded, and has better algorithmic runtime complexity than OPTICS--
910+
at the cost of worse memory scaling. For extremely large datasets that
911+
exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
912+
to *n^2* memory scaling); however, tuning of the ``max_eps`` parameter
913+
will likely need to be used to give a solution in a reasonable amount of
914+
wall time.
915+
916+
.. topic:: References:
917+
918+
* "OPTICS: ordering points to identify the clustering structure."
919+
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
920+
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
921+
922+
* "Automatic extraction of clusters from hierarchical clustering
923+
representations."
924+
Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
925+
In Advances in knowledge discovery and data mining,
926+
pp. 75-87. Springer Berlin Heidelberg, 2003.
927+
817928
.. _birch:
818929

819930
Birch

0 commit comments

Comments
 (0)
0