@@ -91,6 +91,12 @@ Overview of clustering methods
91
91
- Non-flat geometry, uneven cluster sizes
92
92
- Distances between nearest points
93
93
94
+ * - :ref: `OPTICS <optics >`
95
+ - minimum cluster membership
96
+ - Very large ``n_samples ``, large ``n_clusters ``
97
+ - Non-flat geometry, uneven cluster sizes, variable cluster density
98
+ - Distances between points
99
+
94
100
* - :ref: `Gaussian mixtures <mixture >`
95
101
- many
96
102
- Not scalable
@@ -796,6 +802,11 @@ by black points below.
796
802
be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
797
803
A couple of mechanisms for getting around this are:
798
804
805
+ - Use :ref: `OPTICS <optics >` clustering in conjunction with the
806
+ `extract_dbscan ` method. OPTICS clustering also calculates the full
807
+ pairwise matrix, but only keeps one row in memory at a time (memory
808
+ complexity n).
809
+
799
810
- A sparse radius neighborhood graph (where missing entries are presumed to
800
811
be out of eps) can be precomputed in a memory-efficient way and dbscan
801
812
can be run over this with ``metric='precomputed' ``. See
@@ -814,6 +825,106 @@ by black points below.
814
825
In Proceedings of the 2nd International Conference on Knowledge Discovery
815
826
and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
816
827
828
+ .. _optics :
829
+
830
+ OPTICS
831
+ ======
832
+
833
+ The :class: `OPTICS ` algorithm shares many similarities with the
834
+ :class: `DBSCAN ` algorithm, and can be considered a generalization of
835
+ DBSCAN that relaxes the ``eps `` requirement from a single value to a value
836
+ range. The key difference between DBSCAN and OPTICS is that the OPTICS
837
+ algorithm builds a *reachability * graph, which assigns each sample both a
838
+ ``reachability_ `` distance, and a spot within the cluster ``ordering_ ``
839
+ attribute; these two attributes are assigned when the model is fitted, and are
840
+ used to determine cluster membership. If OPTICS is run with the default value
841
+ of *inf * set for ``max_eps ``, then DBSCAN style cluster extraction can be
842
+ performed in linear time for any given ``eps `` value using the
843
+ ``extract_dbscan `` method. Setting ``max_eps `` to a lower value will result
844
+ in shorter run times, and can be thought of as the maximum cluster object size
845
+ (in diameter) that OPTICS will be able to extract.
846
+
847
+ .. |optics_results | image :: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
848
+ :target: ../auto_examples/cluster/plot_optics.html
849
+ :scale: 50
850
+
851
+ .. centered :: |optics_results|
852
+
853
+ The *reachability * distances generated by OPTICS allow for variable density
854
+ extraction of clusters within a single data set. As shown in the above plot,
855
+ combining *reachability * distances and data set ``ordering_ `` produces a
856
+ *reachability plot *, where point density is represented on the Y-axis, and
857
+ points are ordered such that nearby points are adjacent. 'Cutting' the
858
+ reachability plot at a single value produces DBSCAN like results; all points
859
+ above the 'cut' are classified as noise, and each time that there is a break
860
+ when reading from left to right signifies a new cluster. The default cluster
861
+ extraction with OPTICS looks at changes in slope within the graph to guess at
862
+ natural clusters. There are also other possibilities for analysis on the graph
863
+ itself, such as generating hierarchical representations of the data through
864
+ reachability-plot dendrograms. The plot above has been color-coded so that
865
+ cluster colors in planar space match the linear segment clusters of the
866
+ reachability plot-- note that the blue and red clusters are adjacent in the
867
+ reachability plot, and can be hierarchically represented as children of a
868
+ larger parent cluster.
869
+
870
+ .. topic :: Examples:
871
+
872
+ * :ref: `sphx_glr_auto_examples_cluster_plot_optics.py `
873
+
874
+
875
+ .. topic :: Comparison with DBSCAN
876
+
877
+ The results from OPTICS ``extract_dbscan `` method and DBSCAN are not quite
878
+ identical. Specifically, while *core_samples * returned from both OPTICS
879
+ and DBSCAN are guaranteed to be identical, labeling of periphery and noise
880
+ points is not. This is in part because the first sample processed by
881
+ OPTICS will always have a reachability distance that is set to ``inf ``,
882
+ and will thus generally be marked as noise rather than periphery. This
883
+ affects adjacent points when they are considered as candidates for being
884
+ marked as either periphery or noise. While this effect is quite local to
885
+ the starting point of the dataset and is unlikely to be noticed on even
886
+ moderately large datasets, it is worth also noting that non-core boundry
887
+ points may switch cluster labels on the rare occasion that they are
888
+ equidistant to a competeing cluster due to how the graph is read from left
889
+ to right when assigning labels.
890
+
891
+ Note that for any single value of ``eps ``, DBSCAN will tend to have a
892
+ shorter run time than OPTICS; however, for repeated runs at varying ``eps ``
893
+ values, a single run of OPTICS may require less cumulative runtime than
894
+ DBSCAN. It is also important to note that OPTICS output can be unstable at
895
+ ``eps `` values very close to the initial ``max_eps `` value. OPTICS seems
896
+ to produce near identical results to DBSCAN provided that ``eps `` passed to
897
+ ``extract_dbscan `` is a half order of magnitude less than the inital
898
+ ``max_eps `` that was used to fit; using a value close to ``max_eps ``
899
+ will throw a warning, and using a value larger will result in an exception.
900
+
901
+ .. topic :: Computational Complexity
902
+
903
+ Spatial indexing trees are used to avoid calculating the full distance
904
+ matrix, and allow for efficient memory usage on large sets of samples.
905
+ Different distance metrics can be supplied via the ``metric `` keyword.
906
+
907
+ For large datasets, similar (but not identical) results can be obtained via
908
+ `HDBSCAN <https://hdbscan.readthedocs.io >`_. The HDBSCAN implementation is
909
+ multithreaded, and has better algorithmic runtime complexity than OPTICS--
910
+ at the cost of worse memory scaling. For extremely large datasets that
911
+ exhaust system memory using HDBSCAN, OPTICS will maintain *n * (as opposed
912
+ to *n^2 * memory scaling); however, tuning of the ``max_eps `` parameter
913
+ will likely need to be used to give a solution in a reasonable amount of
914
+ wall time.
915
+
916
+ .. topic :: References:
917
+
918
+ * "OPTICS: ordering points to identify the clustering structure."
919
+ Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
920
+ In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
921
+
922
+ * "Automatic extraction of clusters from hierarchical clustering
923
+ representations."
924
+ Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
925
+ In Advances in knowledge discovery and data mining,
926
+ pp. 75-87. Springer Berlin Heidelberg, 2003.
927
+
817
928
.. _birch :
818
929
819
930
Birch
0 commit comments