xhluca
diff --git a/‎doc/modules/clustering.rst
Lines changed: 111 additions & 0 deletions b/‎doc/modules/clustering.rst
Lines changed: 111 additions & 0 deletions
@@ -91,6 +91,12 @@ Overview of clustering methods
      - Non-flat geometry, uneven cluster sizes
      - Distances between nearest points
 
+   * - :ref:`OPTICS <optics>`
+     - minimum cluster membership
+     - Very large ``n_samples``, large ``n_clusters``
+     - Non-flat geometry, uneven cluster sizes, variable cluster density 
+     - Distances between points
+   
    * - :ref:`Gaussian mixtures <mixture>`
      - many
      - Not scalable
@@ -796,6 +802,11 @@ by black points below.
     be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
     A couple of mechanisms for getting around this are:
 
+    - Use :ref:`OPTICS <optics>` clustering in conjunction with the
+      `extract_dbscan` method. OPTICS clustering also calculates the full
+      pairwise matrix, but only keeps one row in memory at a time (memory
+      complexity n).
+
     - A sparse radius neighborhood graph (where missing entries are presumed to
       be out of eps) can be precomputed in a memory-efficient way and dbscan
       can be run over this with ``metric='precomputed'``.  See
@@ -814,6 +825,106 @@ by black points below.
    In Proceedings of the 2nd International Conference on Knowledge Discovery
    and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
 
+.. _optics:
+
+OPTICS
+======
+
+The :class:`OPTICS` algorithm shares many similarities with the
+:class:`DBSCAN` algorithm, and can be considered a generalization of
+DBSCAN that relaxes the ``eps`` requirement from a single value to a value
+range. The key difference between DBSCAN and OPTICS is that the OPTICS
+algorithm builds a *reachability* graph, which assigns each sample both a
+``reachability_`` distance, and a spot within the cluster ``ordering_``
+attribute; these two attributes are assigned when the model is fitted, and are
+used to determine cluster membership. If OPTICS is run with the default value
+of *inf* set for ``max_eps``, then DBSCAN style cluster extraction can be
+performed in linear time for any given ``eps`` value using the
+``extract_dbscan`` method. Setting ``max_eps`` to a lower value will result
+in shorter run times, and can be thought of as the maximum cluster object size
+(in diameter) that OPTICS will be able to extract.
+
+.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
+        :target: ../auto_examples/cluster/plot_optics.html
+        :scale: 50
+
+.. centered:: |optics_results|
+
+The *reachability* distances generated by OPTICS allow for variable density
+extraction of clusters within a single data set. As shown in the above plot,
+combining *reachability* distances and data set ``ordering_`` produces a
+*reachability plot*, where point density is represented on the Y-axis, and
+points are ordered such that nearby points are adjacent. 'Cutting' the
+reachability plot at a single value produces DBSCAN like results; all points
+above the 'cut' are classified as noise, and each time that there is a break
+when reading from left to right signifies a new cluster. The default cluster
+extraction with OPTICS looks at changes in slope within the graph to guess at
+natural clusters. There are also other possibilities for analysis on the graph
+itself, such as generating hierarchical representations of the data through
+reachability-plot dendrograms. The plot above has been color-coded so that
+cluster colors in planar space match the linear segment clusters of the
+reachability plot-- note that the blue and red clusters are adjacent in the
+reachability plot, and can be hierarchically represented as children of a
+larger parent cluster.
+
+.. topic:: Examples:
+
+     * :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`
+
+
+.. topic:: Comparison with DBSCAN
+    
+    The results from OPTICS ``extract_dbscan`` method and DBSCAN are not quite
+    identical. Specifically, while *core_samples* returned from both OPTICS
+    and DBSCAN are guaranteed to be identical, labeling of periphery and noise
+    points is not. This is in part because the first sample processed by
+    OPTICS will always have a reachability distance that is set to ``inf``,
+    and will thus generally be marked as noise rather than periphery. This
+    affects adjacent points when they are considered as candidates for being
+    marked as either periphery or noise. While this effect is quite local to
+    the starting point of the dataset and is unlikely to be noticed on even
+    moderately large datasets, it is worth also noting that non-core boundry
+    points may switch cluster labels on the rare occasion that they are
+    equidistant to a competeing cluster due to how the graph is read from left
+    to right when assigning labels. 
+
+    Note that for any single value of ``eps``, DBSCAN will tend to have a
+    shorter run time than OPTICS; however, for repeated runs at varying ``eps``
+    values, a single run of OPTICS may require less cumulative runtime than
+    DBSCAN. It is also important to note that OPTICS output can be unstable at
+    ``eps`` values very close to the initial ``max_eps`` value. OPTICS seems
+    to produce near identical results to DBSCAN provided that ``eps`` passed to
+    ``extract_dbscan`` is a half order of magnitude less than the inital
+    ``max_eps`` that was used to fit; using a value close to ``max_eps``
+    will throw a warning, and using a value larger will result in an exception. 
+
+.. topic:: Computational Complexity
+
+    Spatial indexing trees are used to avoid calculating the full distance
+    matrix, and allow for efficient memory usage on large sets of samples.
+    Different distance metrics can be supplied via the ``metric`` keyword.
+    
+    For large datasets, similar (but not identical) results can be obtained via
+    `HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
+    multithreaded, and has better algorithmic runtime complexity than OPTICS--
+    at the cost of worse memory scaling. For extremely large datasets that
+    exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
+    to *n^2* memory scaling); however, tuning of the ``max_eps`` parameter
+    will likely need to be used to give a solution in a reasonable amount of
+    wall time.
+
+.. topic:: References:
+
+ *  "OPTICS: ordering points to identify the clustering structure."
+    Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
+    In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.
+
+ *  "Automatic extraction of clusters from hierarchical clustering
+    representations."
+    Sander, Jörg, Xuejie Qin, Zhiyong Lu, Nan Niu, and Alex Kovarsky.
+    In Advances in knowledge discovery and data mining,
+    pp. 75-87. Springer Berlin Heidelberg, 2003. 
+
 .. _birch:
 
 Birch