@@ -91,12 +91,6 @@ Overview of clustering methods
91
91
- Non-flat geometry, uneven cluster sizes
92
92
- Distances between nearest points
93
93
94
- * - :ref: `OPTICS <optics >`
95
- - minimum cluster membership
96
- - Very large ``n_samples ``, large ``n_clusters ``
97
- - Non-flat geometry, uneven cluster sizes, variable cluster density
98
- - Distances between points
99
-
100
94
* - :ref: `Gaussian mixtures <mixture >`
101
95
- many
102
96
- Not scalable
@@ -812,11 +806,6 @@ by black points below.
812
806
be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
813
807
A couple of mechanisms for getting around this are:
814
808
815
- - Use :ref: `OPTICS <optics >` clustering in conjunction with the
816
- `extract_dbscan ` method. OPTICS clustering also calculates the full
817
- pairwise matrix, but only keeps one row in memory at a time (memory
818
- complexity n).
819
-
820
809
- A sparse radius neighborhood graph (where missing entries are presumed to
821
810
be out of eps) can be precomputed in a memory-efficient way and dbscan
822
811
can be run over this with ``metric='precomputed' ``. See
@@ -839,92 +828,6 @@ by black points below.
839
828
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
840
829
In ACM Transactions on Database Systems (TODS), 42(3), 19.
841
830
842
- .. _optics :
843
-
844
- OPTICS
845
- ======
846
-
847
- The :class: `OPTICS ` algorithm shares many similarities with the :class: `DBSCAN `
848
- algorithm, and can be considered a generalization of DBSCAN that relaxes the
849
- ``eps `` requirement from a single value to a value range. The key difference
850
- between DBSCAN and OPTICS is that the OPTICS algorithm builds a *reachability *
851
- graph, which assigns each sample both a ``reachability_ `` distance, and a spot
852
- within the cluster ``ordering_ `` attribute; these two attributes are assigned
853
- when the model is fitted, and are used to determine cluster membership. If
854
- OPTICS is run with the default value of *inf * set for ``max_eps ``, then DBSCAN
855
- style cluster extraction can be performed repeatedly in linear time for any
856
- given ``eps `` value using the ``cluster_optics_dbscan `` method. Setting
857
- ``max_eps `` to a lower value will result in shorter run times, and can be
858
- thought of as the maximum neighborhood radius from each point to find other
859
- potential reachable points.
860
-
861
- .. |optics_results | image :: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
862
- :target: ../auto_examples/cluster/plot_optics.html
863
- :scale: 50
864
-
865
- .. centered :: |optics_results|
866
-
867
- The *reachability * distances generated by OPTICS allow for variable density
868
- extraction of clusters within a single data set. As shown in the above plot,
869
- combining *reachability * distances and data set ``ordering_ `` produces a
870
- *reachability plot *, where point density is represented on the Y-axis, and
871
- points are ordered such that nearby points are adjacent. 'Cutting' the
872
- reachability plot at a single value produces DBSCAN like results; all points
873
- above the 'cut' are classified as noise, and each time that there is a break
874
- when reading from left to right signifies a new cluster. The default cluster
875
- extraction with OPTICS looks at the steep slopes within the graph to find
876
- clusters, and the user can define what counts as a steep slope using the
877
- parameter ``xi ``. There are also other possibilities for analysis on the graph
878
- itself, such as generating hierarchical representations of the data through
879
- reachability-plot dendrograms, and the hierarchy of clusters detected by the
880
- algorithm can be accessed through the ``cluster_hierarchy_ `` parameter. The
881
- plot above has been color-coded so that cluster colors in planar space match
882
- the linear segment clusters of the reachability plot. Note that the blue and
883
- red clusters are adjacent in the reachability plot, and can be hierarchically
884
- represented as children of a larger parent cluster.
885
-
886
- .. topic :: Examples:
887
-
888
- * :ref: `sphx_glr_auto_examples_cluster_plot_optics.py `
889
-
890
-
891
- .. topic :: Comparison with DBSCAN
892
-
893
- The results from OPTICS ``cluster_optics_dbscan `` method and DBSCAN are
894
- very similar, but not always identical; specifically, labeling of periphery
895
- and noise points. This is in part because the first samples of each dense
896
- area processed by OPTICS have a large reachability value while being close
897
- to other points in their area, and will thus sometimes be marked as noise
898
- rather than periphery. This affects adjacent points when they are
899
- considered as candidates for being marked as either periphery or noise.
900
-
901
- Note that for any single value of ``eps ``, DBSCAN will tend to have a
902
- shorter run time than OPTICS; however, for repeated runs at varying ``eps ``
903
- values, a single run of OPTICS may require less cumulative runtime than
904
- DBSCAN. It is also important to note that OPTICS' output is close to
905
- DBSCAN's only if ``eps `` and ``max_eps `` are close.
906
-
907
- .. topic :: Computational Complexity
908
-
909
- Spatial indexing trees are used to avoid calculating the full distance
910
- matrix, and allow for efficient memory usage on large sets of samples.
911
- Different distance metrics can be supplied via the ``metric `` keyword.
912
-
913
- For large datasets, similar (but not identical) results can be obtained via
914
- `HDBSCAN <https://hdbscan.readthedocs.io >`_. The HDBSCAN implementation is
915
- multithreaded, and has better algorithmic runtime complexity than OPTICS,
916
- at the cost of worse memory scaling. For extremely large datasets that
917
- exhaust system memory using HDBSCAN, OPTICS will maintain *n * (as opposed
918
- to *n^2 *) memory scaling; however, tuning of the ``max_eps `` parameter
919
- will likely need to be used to give a solution in a reasonable amount of
920
- wall time.
921
-
922
- .. topic :: References:
923
-
924
- * "OPTICS: ordering points to identify the clustering structure."
925
- Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
926
- In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
927
-
928
831
.. _birch :
929
832
930
833
Birch
0 commit comments