8000 MNT remove optics from clustering.rst (#13381) · xhluca/scikit-learn@45fa7d2 · GitHub
[go: up one dir, main page]

Skip to content

Commit 45fa7d2

Browse files
adrinjalaliXing
authored and
Xing
committed
MNT remove optics from clustering.rst (scikit-learn#13381)
1 parent e168bd7 commit 45fa7d2

File tree

1 file changed

+0
-111
lines changed

1 file changed

+0
-111
lines changed

doc/modules/clustering.rst

Lines changed: 0 additions & 111 deletions
Original file line numberDiff line numberDiff line change
@@ -91,12 +91,6 @@ Overview of clustering methods
9191
- Non-flat geometry, uneven cluster sizes
9292
- Distances between nearest points
9393

94-
* - :ref:`OPTICS <optics>`
95-
- minimum cluster membership
96-
- Very large ``n_samples``, large ``n_clusters``
97-
- Non-flat geometry, uneven cluster sizes, variable cluster density
98-
- Distances between points
99-
10094
* - :ref:`Gaussian mixtures <mixture>`
10195
- many
10296
- Not scalable
@@ -802,11 +796,6 @@ by black points below.
802796
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
803797
A couple of mechanisms for getting around this are:
804798

805-
- Use :ref:`OPTICS <optics>` clustering in conjunction with the
806-
`extract_dbscan` method. OPTICS clustering also calculates the full
807-
pairwise matrix, but only keeps one row in memory at a time (memory
808-
complexity n).
809-
810799
- A sparse radius neighborhood graph (where missing entries are presumed to
811800
be out of eps) can be precomputed in a memory-efficient way and dbscan
812801
can be run over this with ``metric='precomputed'``. See
@@ -825,106 +814,6 @@ by black points below.
825814
In Proceedings of the 2nd International Conference on Knowledge Discovery
826815
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
827816

828-
.. _optics:
829-
830-
OPTICS
831-
======
832-
833-
The :class:`OPTICS` algorithm shares many similarities with the
834-
:class:`DBSCAN` algorithm, and can be considered a generalization of
835-
DBSCAN that relaxes the ``eps`` requirement from a single value to a value
836-
range. The key difference between DBSCAN and OPTICS is that the OPTICS
837-
algorithm builds a *reachability* graph, which assigns each sample both a
838-
``reachability_`` distance, and a spot within the cluster ``ordering_``
839-
attribute; these two attributes are assigned when the model is fitted, and are
840-
used to determine cluster membership. If OPTICS is run with the default value
841-
of *inf* set for ``max_eps``, then DBSCAN style cluster extraction can be
842-
performed in linear time for any given ``eps`` value using the
843-
``extract_dbscan`` method. Setting ``max_eps`` to a lower value will result
844-
in shorter run times, and can be thought of as the maximum cluster object size
845-
(in diameter) that OPTICS will be able to extract.
846-
847-
.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
848-
:target: ../auto_examples/cluster/plot_optics.html
849-
:scale: 50
850-
851-
.. centered:: |optics_results|
852-
853-
The *reachability* distances generated by OPTICS allow for variable density
854-
extraction of clusters within a single data set. As shown in the above plot,
855-
combining *reachability* distances and data set ``ordering_`` produces a
856-
*reachability plot*, where point density is represented on the Y-axis, and
857-
points are ordered such that nearby points are adjacent. 'Cutting' the
858-
reachability plot at a single value produces DBSCAN like results; all points
859-
above the 'cut' are classified as noise, and each time that there is a break
860-
when reading from left to right signifies a new cluster. The default cluster
861-
extraction with OPTICS looks at changes in slope within the graph to guess at
862-
natural clusters. There are also other possibilities for analysis on the graph
863-
itself, such as generating hierarchical representations of the data through
864-
reachability-plot dendrograms. The plot above has been color-coded so that
865-
cluster colors in planar space match the linear segment clusters of the
866-
reachability plot-- note that the blue and red clusters are adjacent in the
867-
reachability plot, and can be hierarchically represented as children of a
868-
larger parent cluster.
869-
870-
.. topic:: Examples:
871-
872-
* :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`
873-
874-
875-
.. topic:: Comparison with DBSCAN
876-
877-
The results from OPTICS ``extract_dbscan`` method and DBSCAN are not quite
878-
identical. Specifically, while *core_samples* returned from both OPTICS
879-
and DBSCAN are guaranteed to be identical, labeling of periphery and noise
880-
points is not. This is in part because the first sample processed by
881-
OPTICS will always have a reachability distance that is set to ``inf``,
882-
and will thus generally be marked as noise rather than periphery. This
883-
affects adjacent points when they are considered as candidates for being
884-
marked as either periphery or noise. While this effect is quite local to
885-
the starting point of the dataset and is unlikely to be noticed on even
886-
moderately large datasets, it is worth also noting that non-core boundry
887-
points may switch cluster labels on the rare occasion that they are
888-
equidistant to a competeing cluster due to how the graph is read from left
889-
to right when assigning labels.
890-
891-
Note that for any single value of ``eps``, DBSCAN will tend to have a
892-
shorter run time than OPTICS; however, for repeated runs at varying ``eps``
893-
values, a single run of OPTICS may require less cumulative runtime than
894-
DBSCAN. It is also important to note that OPTICS output can be unstable at
895-
``eps`` values very close to the initial ``max_eps`` value. OPTICS seems
896-
to produce near identical results to DBSCAN provided that ``eps`` passed to
897-
``extract_dbscan`` is a half order of magnitude less than the inital
898-
``max_eps`` that was used to fit; using a value close to ``max_eps``
899-
will throw a warning, and using a value larger will result in an exception.
900-
901-
.. topic:: Computational Complexity
902-
903-
Spatial indexing trees are used to avoid calculating the full distance
904-
matrix, and allow for efficient memory usage on large sets of samples.
905-
Different distance metrics can be supplied via the ``metric`` keyword.
906-
907-
For large datasets, similar (but not identical) results can be obtained via
908-
`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
909-
multithreaded, and has better algorithmic runtime complexity than OPTICS--
910-
at the cost of worse memory scaling. For extremely large datasets that
911-
exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
912-
to *n^2* memory scaling); however, tuning of the ``max_eps`` parameter
913-
will likely need to be used to give a solution in a reasonable amount of
914-
wall time.
915-
916-
.. topic:: References:
917-
918-
* "OPTICS: ordering points to identify the clustering structure."
919-
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
920-
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
921-
922-
* "Automatic extraction of clusters from hierarchical clustering
923-
representations."
924-
Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
925-
In Advances in knowledge discovery and data mining,
926-
pp. 75-87. Springer Berlin Heidelberg, 2003.
927-
928817
.. _birch:
929818

930819
Birch

0 commit comments

Comments
 (0)
0