8000 ENH Fix processing order in OPTICS. See #12090 · scikit-learn/scikit-learn@b17e351 · GitHub
[go: up one dir, main page]

Skip to content

Commit b17e351

Browse files
committed
ENH Fix processing order in OPTICS. See #12090
The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.
1 parent dd95eba commit b17e351

File tree

5 files changed

+146
-280
lines changed

5 files changed

+146
-280
lines changed

doc/whats_new/v0.21.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -45,8 +45,8 @@ Support for Python 3.4 and below has been officially dropped.
4545

4646
- |MajorFeature| A new clustering algorithm: :class:`cluster.OPTICS`: an
4747
algoritm related to :class:`cluster.DBSCAN`, that has hyperparameters easier
48-
to set and that scales better, by :user:`Shane <espg>` and
49-
:user:`Adrin Jalali <adrinjalali>`.
48+
to set and that scales better, by :user:`Shane <espg>`,
49+
:user:`Adrin Jalali <adrinjalali>`, and :user:`Erich Schubert <kno10>`.
5050

5151
:mod:`sklearn.preprocessing`
5252
............................

sklearn/cluster/_optics_inner.pyx

Lines changed: 0 additions & 38 deletions
This file was deleted.

sklearn/cluster/optics_.py

Lines changed: 15 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,6 @@
2020
from ..neighbors import NearestNeighbors
2121
from ..base import BaseEstimator, ClusterMixin
2222
from ..metrics import pairwise_distances
23-
from ._optics_inner import quick_scan
2423

2524

2625
def optics(X, min_samples=5, max_eps=np.inf, metric='euclidean',
@@ -37,6 +36,11 @@ def optics(X, min_samples=5, max_eps=np.inf, metric='euclidean',
3736
neighborhood radius. Better suited for usage on large point datasets than
3837
the current sklearn implementation of DBSCAN.
3938
39+
This implementation deviates from the original OPTICS by first performing
40+
k-nearest-neighborhood searches on all points to identify core sizes, then
41+
computing only the distances to unprocessed points when constructing the
42+
cluster order.
43+
4044
Read more in the :ref:`User Guide <optics>`.
4145
4246
Parameters
@@ -190,6 +194,11 @@ class OPTICS(BaseEstimator, ClusterMixin):
190194
neighborhood radius. Better suited for usage on large point datasets than
191195
the current sklearn implementation of DBSCAN.
192196
197+
This implementation deviates from the original OPTICS by first performing
198+
k-nearest-neighborhood searches on all points to identify core sizes, then
199+
computing only the distances to unprocessed points when constructing the
200+
cluster order.
201+
193202
Read more in the :ref:`User Guide <optics>`.
194203
195204
Parameters
@@ -313,15 +322,15 @@ class OPTICS(BaseEstimator, ClusterMixin):
313322
``clust.reachability_[clust.ordering_]`` to access in cluster order.
314323
315324
ordering_ : array, shape (n_samples,)
316-
The cluster ordered list of sample indices
325+
The cluster ordered list of sample indices.
317326
318327
core_distances_ : array, shape (n_samples,)
319328
Distance at which each sample becomes a core point, indexed by object
320329
order. Points which will never be core have a distance of inf. Use
321330
``clust.core_distances_[clust.ordering_]`` to access in cluster order.
322331
323332
predecessor_ : array, shape (n_samples,)
324-
Point that a sample was reached from.
333+
Point that a sample was reached from, indexed by object order.
325334
Seed points have a predecessor of -1.
326335
327336
See also
@@ -516,9 +525,9 @@ def _set_reach_dist(self, point_index, processed, X, nbrs):
516525
self.reachability_[unproc[improved]] = rdists[improved]
517526
self.predecessor_[unproc[improved]] = point_index
518527

519-
# Define return order based on reachability distance
520-
return (unproc[quick_scan(np.take(self.reachability_, unproc),
521-
dists)])
528+
# Choose next based on smallest reachability distance
529+
return (np.ma.array(self.reachability_, mask=processed)
530+
.argmin(fill_value=np.inf))
522531

523532
def extract_dbscan(self, eps):
524533
"""Performs DBSCAN extraction for an arbitrary epsilon.

sklearn/cluster/setup.py

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -23,10 +23,6 @@ def configuration(parent_package='', top_path=None):
2323
sources=['_dbscan_inner.pyx'],
2424
include_dirs=[numpy.get_include()],
2525
language="c++")
26-
config.add_extension('_optics_inner',
27-
sources=['_optics_inner.pyx'],
28-
include_dirs=[numpy.get_include()],
29-
libraries=libraries)
3026

3127
config.add_extension('_hierarchical',
3228
sources=['_hierarchical.pyx'],

0 commit comments

Comments
 (0)
0