@@ -91,12 +91,6 @@ Overview of clustering methods
91
91
- Non-flat geometry, uneven cluster sizes
92
92
- Distances between nearest points
93
93
94
- * - :ref: `OPTICS <optics >`
95
- - minimum cluster membership
96
- - Very large ``n_samples ``, large ``n_clusters ``
97
- - Non-flat geometry, uneven cluster sizes, variable cluster density
98
- - Distances between points
99
-
100
94
* - :ref: `Gaussian mixtures <mixture >`
101
95
- many
102
96
- Not scalable
@@ -802,11 +796,6 @@ by black points below.
802
796
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
803
797
A couple of mechanisms for getting around this are:
804
798
805
- - Use :ref: `OPTICS <optics >` clustering in conjunction with the
806
- `extract_dbscan ` method. OPTICS clustering also calculates the full
807
- pairwise matrix, but only keeps one row in memory at a time (memory
808
- complexity n).
809
-
810
799
- A sparse radius neighborhood graph (where missing entries are presumed to
811
800
be out of eps) can be precomputed in a memory-efficient way and dbscan
812
801
can be run over this with ``metric='precomputed' ``. See
@@ -825,106 +814,6 @@ by black points below.
825
814
In Proceedings of the 2nd International Conference on Knowledge Discovery
826
815
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
827
816
828
- .. _optics :
829
-
830
- OPTICS
831
- ======
832
-
833
- The :class: `OPTICS ` algorithm shares many similarities with the
834
- :class: `DBSCAN ` algorithm, and can be considered a generalization of
835
- DBSCAN that relaxes the ``eps `` requirement from a single value to a value
836
- range. The key difference between DBSCAN and OPTICS is that the OPTICS
837
- algorithm builds a *reachability * graph, which assigns each sample both a
838
- ``reachability_ `` distance, and a spot within the cluster ``ordering_ ``
839
- attribute; these two attributes are assigned when the model is fitted, and are
840
- used to determine cluster membership. If OPTICS is run with the default value
841
- of *inf * set for ``max_eps ``, then DBSCAN style cluster extraction can be
842
- performed in linear time for any given ``eps `` value using the
843
- ``extract_dbscan `` method. Setting ``max_eps `` to a lower value will result
844
- in shorter run times, and can be thought of as the maximum cluster object size
845
- (in diameter) that OPTICS will be able to extract.
846
-
847
- .. |optics_results | image :: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
848
- :target: ../auto_examples/cluster/plot_optics.html
849
- :scale: 50
850
-
851
- .. centered :: |optics_results|
852
-
853
- The *reachability * distances generated by OPTICS allow for variable density
854
- extraction of clusters within a single data set. As shown in the above plot,
855
- combining *reachability * distances and data set ``ordering_ `` produces a
856
- *reachability plot *, where point density is represented on the Y-axis, and
857
- points are ordered such that nearby points are adjacent. 'Cutting' the
858
- reachability plot at a single value produces DBSCAN like results; all points
859
- above the 'cut' are classified as noise, and each time that there is a break
860
- when reading from left to right signifies a new cluster. The default cluster
861
- extraction with OPTICS looks at changes in slope within the graph to guess at
862
- natural clusters. There are also other possibilities for analysis on the graph
863
- itself, such as generating hierarchical representations of the data through
864
- reachability-plot dendrograms. The plot above has been color-coded so that
865
- cluster colors in planar space match the linear segment clusters of the
866
- reachability plot-- note that the blue and red clusters are adjacent in the
867
- reachability plot, and can be hierarchically represented as children of a
868
- larger parent cluster.
869
-
870
- .. topic :: Examples:
871
-
872
- * :ref: `sphx_glr_auto_examples_cluster_plot_optics.py `
873
-
874
-
875
- .. topic :: Comparison with DBSCAN
876
-
877
- The results from OPTICS ``extract_dbscan `` method and DBSCAN are not quite
878
- identical. Specifically, while *core_samples * returned from both OPTICS
879
- and DBSCAN are guaranteed to be identical, labeling of periphery and noise
880
- points is not. This is in part because the first sample processed by
881
- OPTICS will always have a reachability distance that is set to ``inf ``,
882
- and will thus generally be marked as noise rather than periphery. This
883
- affects adjacent points when they are considered as candidates for being
884
- marked as either periphery or noise. While this effect is quite local to
885
- the starting point of the dataset and is unlikely to be noticed on even
886
- moderately large datasets, it is worth also noting that non-core boundry
887
- points may switch cluster labels on the rare occasion that they are
888
- equidistant to a competeing cluster due to how the graph is read from left
889
- to right when assigning labels.
890
-
891
- Note that for any single value of ``eps ``, DBSCAN will tend to have a
892
- shorter run time than OPTICS; however, for repeated runs at varying ``eps ``
893
- values, a single run of OPTICS may require less cumulative runtime than
894
- DBSCAN. It is also important to note that OPTICS output can be unstable at
895
- ``eps `` values very close to the initial ``max_eps `` value. OPTICS seems
896
- to produce near identical results to DBSCAN provided that ``eps `` passed to
897
- ``extract_dbscan `` is a half order of magnitude less than the inital
898
- ``max_eps `` that was used to fit; using a value close to ``max_eps ``
899
- will throw a warning, and using a value larger will result in an exception.
900
-
901
- .. topic :: Computational Complexity
902
-
903
- Spatial indexing trees are used to avoid calculating the full distance
904
- matrix, and allow for efficient memory usage on large sets of samples.
905
- Different distance metrics can be supplied via the ``metric `` keyword.
906
-
907
- For large datasets, similar (but not identical) results can be obtained via
908
- `HDBSCAN <https://hdbscan.readthedocs.io >`_. The HDBSCAN implementation is
909
- multithreaded, and has better algorithmic runtime complexity than OPTICS--
910
- at the cost of worse memory scaling. For extremely large datasets that
911
- exhaust system memory using HDBSCAN, OPTICS will maintain *n * (as opposed
912
- to *n^2 * memory scaling); however, tuning of the ``max_eps `` parameter
913
- will likely need to be used to give a solution in a reasonable amount of
914
- wall time.
915
-
916
- .. topic :: References:
917
-
918
- * "OPTICS: ordering points to identify the clustering structure."
919
- Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
920
- In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
921
-
922
- * "Automatic extraction of clusters from hierarchical clustering
923
- representations."
924
- Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
925
- In Advances in knowledge discovery and data mining,
926
- pp. 75-87. Springer Berlin Heidelberg, 2003.
927
-
928
817
.. _birch :
929
818
930
819
Birch
0 commit comments