8000 [MRG+2] OPTICS: add extract_xi method by adrinjalali · Pull Request #12077 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG+2] OPTICS: add extract_xi method #12077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 107 commits into from
Apr 27, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
3ab2e63
add extract method parameter
adrinjalali Sep 15, 2018
7190098
change fancy to sqlnk
adrinjalali Sep 16, 2018
eb12095
test for invalid extract_method value
adrinjalali Sep 16, 2018
58913d8
test easy dbscan
adrinjalali Sep 16, 2018
4b06ba8
pep8
adrinjalali Sep 16, 2018
c9819d2
add extract_sqlnk
adrinjalali Sep 16, 2018
61b440f
mention which parameter is used for which extact method
adrinjalali Sep 16, 2018
b8fcc9c
extractXi in progress
adrinjalali Sep 13, 2018
8af9364
remove R inspired implementation, starting over
adrinjalali Sep 13, 2018
6e4a467
first draft ready
adrinjalali Sep 13, 2018
b791606
improved, but not detecting the last cluster
adrinjalali Sep 13, 2018
2695f2b
some edge cases solved
adrinjalali Sep 14, 2018
2b4cd24
regions are almost correct, global mib is not
adrinjalali Sep 14, 2018
b5830e1
edges between clusters fixed
ad 8000 rinjalali Sep 14, 2018
08a0f9f
edges between clusters fixed
adrinjalali Sep 14, 2018
8b182b8
issues of the first test case fixed
adrinjalali Sep 14, 2018
0f0f157
add extract_xi to OPTICS
adrinjalali Sep 14, 2018
41244a0
fix label order issue
adrinjalali Sep 14, 2018
2bb2c77
add docstrings
adrinjalali Sep 15, 2018
8ec6df2
missing - sign
adrinjalali Sep 15, 2018
5ce8952
merge optics/choose_extractor
adrinjalali Sep 16, 2018
2655edb
continue merge
adrinjalali Sep 16, 2018
f2fe4e7
add xi parameter to OPTICS
adrinjalali Sep 16, 2018
9a65082
add tests for extract methods with no params given
adrinjalali Sep 16, 2018
bfbe759
add xi to docstrings
adrinjalali Sep 16, 2018
d599a03
fix a few more issues with extract_xi integration
adrinjalali Sep 16, 2018
baa94f9
add a xi test
adrinjalali Sep 16, 2018
bff04ee
merge master
adrinjalali Sep 17, 2018
f1c6cf1
merge master
adrinjalali Sep 18, 2018
c16abdc
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Sep 19, 2018
df261c0
merge master and choose_extractor
adrinjalali Sep 19, 2018
02058e9
trying to base on the latest
adrinjalali Feb 26, 2019
309f4a2
trying to base on the latest
adrinjalali Feb 26, 2019
a9b3c7e
fix docstring
adrinjalali Feb 27, 2019
76f7101
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 4, 2019
f2b50bd
pep8
adrinjalali Mar 4, 2019
a72c67e
pep8
adrinjalali Mar 4, 2019
93640e0
docstring fixes
adrinjalali Mar 4, 2019
e80d17f
better same `xi` docstring everywhere
adrinjalali Mar 4, 2019
d36076f
address some of Joel's comments
adrinjalali Mar 4, 2019
17de182
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 4, 2019
9bd42ab
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 6, 2019
a7ba939
fixes and tests for smaller functions
adrinjalali Mar 6, 2019
7c091c4
fixing some minor issues and adding more low level tests
adrinjalali Mar 7, 2019
52b409f
fix test
adrinjalali Mar 8, 2019
613bd5c
test unorered labeling
adrinjalali Mar 8, 2019
15e5dd7
add and pass more tests
adrinjalali Mar 8, 2019
00a2c6f
add comment
adrinjalali Mar 8, 2019
29c63d4
pep8
adrinjalali Mar 8, 2019
edd94ff
randomize input in test
adrinjalali Mar 10, 2019
86217ce
add predecessor correction
adrinjalali Mar 11, 2019
a077a67
pep8
adrinjalali Mar 11, 2019
392d6cc
docstring cleanup and removal of extra flawed 3.b condition
adrinjalali Mar 11, 2019
fd39b77
merge min_cluster_size and min_samples
adrinjalali Mar 14, 2019
6092a45
add edge case tests and fix an mib issue
adrinjalali Mar 14, 2019
29b61b8
set the min_samples for OPTICS in common tests
8000 adrinjalali Mar 14, 2019
c4629b5
pep8
adrinjalali Mar 14, 2019
4c930c2
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 14, 2019
b9d8582
add back optics to docs
adrinjalali Mar 14, 2019
08d4b58
add optics example
adrinjalali Mar 14, 2019
c41da4c
fine tune examples, remove prints
adrinjalali Mar 14, 2019
c876e1f
put back min_cluster_size
adrinjalali Mar 14, 2019
175c074
fix min_samples to NN
adrinjalali Mar 14, 2019
992c81e
fix examples
adrinjalali Mar 14, 2019
13bc11b
expose clusters_ as array
adrinjalali Mar 17, 2019
14021ad
use utils.shuffle
adrinjalali Mar 17, 2019
4bd96ab
add clusters_ test
adrinjalali Mar 17, 2019
da1cbec
restructure and use numpy vector operations
adrinjalali Mar 17, 2019
ec2e379
iterate only through steep points
adrinjalali Mar 17, 2019
b480842
cleanup the prints
adrinjalali Mar 17, 2019
10f6ec2
k, c -> klass, color
adrinjalali Mar 25, 2019
9fc9935
address optics_.py comments
adrinjalali Mar 25, 2019
1d924c9
fix documentation and remove *5 factor from eps
adrinjalali Mar 25, 2019
5bfc27b
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 25, 2019
a4533fa
pep8
adrinjalali Mar 25, 2019
5705daa
address some of Joel's comments
adrinjalali Mar 27, 2019
09e125b
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 27, 2019
c79820b
more fixes and tests pass
adrinjalali Mar 27, 2019
1df4a75
address more comments
adrinjalali Mar 27, 2019
3c03e99
finalize fixes and address remaining comments
adrinjalali Mar 27, 2019
4d84f94
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 27, 2019
c031196
address comments
adrinjalali Mar 28, 2019
17e7784
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Mar 28, 2019
a974376
more addressing comments
adrinjalali Mar 28, 2019
0b8e161
fix min_cluster_size as 1
adrinjalali Mar 28, 2019
293ace7
apply Hanmin's comments
adrinjalali Apr 1, 2019
68ade1b
merge upstream/master
adrinjalali Apr 4, 2019
f61516f
fix predecessor issue and apply other comments
adrinjalali Apr 4, 2019
1ffa5d4
pep8
adrinjalali Apr 4, 2019
ebe0da8
one of the common tests is already fixed
adrinjalali Apr 4, 2019
8470539
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Apr 5, 2019
f6fcfc8
update OPTICS's min_samples docstring
adrinjalali Apr 5, 2019
4e2e2e7
fix docstring
adrinjalali Apr 9, 2019
34f954a
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Apr 14, 2019
d778dbe
apply Hanmin's review
adrinjalali Apr 14, 2019
9f20ebc
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Apr 17, 2019
d76b2a5
apply more comments
adrinjalali Apr 18, 2019
e1c6086
Merge remote-tracking branch 'upstream/master' into optics/extractXi
adrinjalali Apr 22, 2019
1792605
apply hanmin's fix for predecessor correction
adrinjalali Apr 22, 2019
44240d5
apply some of hanmin's style
adrinjalali Apr 22, 2019
310c2a0
Revert "apply hanmin's fix for predecessor correction"
adrinjalali Apr 23, 2019
d83c85d
fix predecessor correction (again!)
adrinjalali Apr 24, 2019
c76dee7
add comment to test
adrinjalali Apr 24, 2019
c96a32e
rename predecessor to predecessor_plot where needed
adrinjalali Apr 24, 2019
ab34b53
apply more of Hanmin's style
adrinjalali Apr 25, 2019
be9812b
credit in whats_new
adrinjalali Apr 25, 2019
c1e5cde
fix assia's name
adrinjalali Apr 25, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/modules/classes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,7 @@ Functions

cluster.affinity_propagation
cluster.cluster_optics_dbscan
cluster.cluster_optics_xi
cluster.compute_optics_graph
cluster.dbscan
cluster.estimate_bandwidth
Expand Down
97 changes: 97 additions & 0 deletions doc/modules/clustering.rst
8000
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,12 @@ Overview of clustering methods
- Non-flat geometry, uneven cluster sizes
- Distances between nearest points

* - :ref:`OPTICS <optics>`
- minimum cluster membership
- Very large ``n_samples``, large ``n_clusters``
- Non-flat geometry, uneven cluster sizes, variable cluster density
- Distances between points

* - :ref:`Gaussian mixtures <mixture>`
- many
- Not scalable
Expand Down Expand Up @@ -806,6 +812,11 @@ by black points below.
be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
A couple of mechanisms for getting around this are:

- Use :ref:`OPTICS <optics>` clustering in conjunction with the
`extract_dbscan` method. OPTICS clustering also calculates the full
pairwise matrix, but only keeps one row in memory at a time (memory
complexity n).

- A sparse radius neighborhood graph (where missing entries are presumed to
be out of eps) can be precomputed in a memory-efficient way and dbscan
can be run over this with ``metric='precomputed'``. See
Expand All @@ -828,6 +839,92 @@ by black points below.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
In ACM Transactions on Database Systems (TODS), 42(3), 19.

.. _optics:

OPTICS
======

The :class:`OPTICS` algorithm shares many similarities with the :class:`DBSCAN`
algorithm, and can be considered a generalization of DBSCAN that relaxes the
``eps`` requirement from a single value to a value range. The key difference
between DBSCAN and OPTICS is that the OPTICS algorithm builds a *reachability*
graph, which assigns each sample both a ``reachability_`` distance, and a spot
within the cluster ``ordering_`` attribute; these two attributes are assigned
when the model is fitted, and are used to determine cluster membership. If
OPTICS is run with the default value of *inf* set for ``max_eps``, then DBSCAN
style cluster extraction can be performed repeatedly in linear time for any
given ``eps`` value using the ``cluster_optics_dbscan`` method. Setting
``max_eps`` to a lower value will result in shorter run times, and can be
thought of as the maximum neighborhood radius from each point to find other
potential reachable points.

.. |optics_results| image:: ../auto_examples/cluster/images/sphx_glr_plot_optics_001.png
:target: ../auto_examples/cluster/plot_optics.html
:scale: 50

.. centered:: |optics_results|

The *reachability* distances generated by OPTICS allow for variable density
extraction of clusters within a single data set. As shown in the above plot,
combining *reachability* distances and data set ``ordering_`` produces a
*reachability plot*, where point density is represented on the Y-axis, and
points are ordered such that nearby points are adjacent. 'Cutting' the
reachability plot at a single value produces DBSCAN like results; all points
above the 'cut' are classified as noise, and each time that there is a break
when reading from left to right signifies a new cluster. The default cluster
extraction with OPTICS looks at the steep slopes within the graph to find
clusters, and the user can define what counts as a steep slope using the
parameter ``xi``. There are also other possibilities for analysis on the graph
itself, such as generating hierarchical representations of the data through
reachability-plot dendrograms, and the hierarchy of clusters detected by the
algorithm can be accessed through the ``cluster_hierarchy_`` parameter. The
plot above has been color-coded so that cluster colors in planar space match
the linear segment clusters of the reachability plot. Note that the blue and
red clusters are adjacent in the reachability plot, and can be hierarchically
represented as children of a larger parent cluster.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_optics.py`


.. topic:: Comparison with DBSCAN

The results from OPTICS ``cluster_optics_dbscan`` method and DBSCAN are
very similar, but not always identical; specifically, labeling of periphery
and noise points. This is in part because the first samples of each dense
area processed by OPTICS have a large reachability value while being close
to other points in their area, and will thus sometimes be marked as noise
rather than periphery. This affects adjacent points when they are
considered as candidates for being marked as either periphery or noise.

Note that for any single value of ``eps``, DBSCAN will tend to have a
shorter run time than OPTICS; however, for repeated runs at varying ``eps``
values, a single run of OPTICS may require less cumulative runtime than
DBSCAN. It is also important to note that OPTICS' output is close to
DBSCAN's only if ``eps`` and ``max_eps`` are close.

.. topic:: Computational Complexity

Spatial indexing trees are used to avoid calculating the full distance
matrix, and allow for efficient memory usage on large sets of samples.
Different distance metrics can be supplied via the ``metric`` keyword.

For large datasets, similar (but not identical) results can be obtained via
`HDBSCAN <https://hdbscan.readthedocs.io>`_. The HDBSCAN implementation is
multithreaded, and has better algorithmic runtime complexity than OPTICS,
at the cost of worse memory scaling. For extremely large datasets that
exhaust system memory using HDBSCAN, OPTICS will maintain *n* (as opposed
to *n^2*) memory scaling; however, tuning of the ``max_eps`` parameter
will likely need to be used to give a solution in a reasonable amount of
wall time.

.. topic:: References:

* "OPTICS: ordering points to identify the clustering structure."
Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander.
In ACM Sigmod Record, vol. 28, no. 2, pp. 49-60. ACM, 1999.

.. _birch:

Birch
Expand Down
3 changes: 2 additions & 1 deletion doc/whats_new/v0.21.rst
Original file line number Diff line number Diff line change
Expand Up @@ -83,7 +83,8 @@ Support for Python 3.4 and below has been officially dropped.
- |MajorFeature| A new clustering algorithm: :class:`cluster.OPTICS`: an
algoritm related to :class:`cluster.DBSCAN`, that has hyperparameters easier
to set and that scales better, by :user:`Shane <espg>`,
:user:`Adrin Jalali <adrinjalali>`, and :user:`Erich Schubert <kno10>`.
`Adrin Jalali`_, :user:`Erich Schubert <kno10>`, `Hanmin Qin`_, and
:user:`Assia Benbihi <assiaben>`.

- |Fix| Fixed a bug where :class:`cluster.Birch` could occasionally raise an
AttributeError. :issue:`13651` by `Joel Nothman`_.
Expand Down
18 changes: 14 additions & 4 deletions examples/cluster/plot_cluster_comparison.py
Original file line number Diff line number Diff line change
Expand Up @@ -74,14 +74,20 @@
'damping': .9,
'preference': -200,
'n_neighbors': 10,
'n_clusters': 3}
'n_clusters': 3,
'min_samples': 20,
'xi': 0.05,
'min_cluster_size': 0.1}

datasets = [
(noisy_circles, {'damping': .77, 'preference': -240,
'quantile': .2, 'n_clusters': 2}),
'quantile': .2, 'n_clusters': 2,
'min_samples': 20, 'xi': 0.25}),
(noisy_moons, {'damping': .75, 'preference': -220, 'n_clusters': 2}),
(varied, {'eps': .18, 'n_neighbors': 2}),
(aniso, {'eps': .15, 'n_neighbors': 2}),
(varied, {'eps': .18, 'n_neighbors': 2,
'min_samples': 5, 'xi': 0.035, 'min_cluster_size': .2}),
(aniso, {'eps': .15, 'n_neighbors': 2,
'min_samples': 20, 'xi': 0.1, 'min_cluster_size': .2}),
(blobs, {}),
(no_structure, {})]

Expand Down Expand Up @@ -116,6 +122,9 @@
n_clusters=params['n_clusters'], eigen_solver='arpack',
affinity="nearest_neighbors")
dbscan = cluster.DBSCAN(eps=params['eps'])
optics = cluster.OPTICS(min_samples=params['min_samples'],
xi=params['xi'],
min_cluster_size=params['min_cluster_size'])
affinity_propagation = cluster.AffinityPropagation(
damping=params['damping'], preference=params['preference'])
average_linkage = cluster.AgglomerativeClustering(
Expand All @@ -133,6 +142,7 @@
('Ward', ward),
('AgglomerativeClustering', average_linkage),
('DBSCAN', dbscan),
('OPTICS', optics),
('Birch', birch),
('GaussianMixture', gmm)
)
Expand Down
98 changes: 98 additions & 0 deletions examples/cluster/plot_optics.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
"""
===================================
Demo of OPTICS clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
This example uses data that is generated so that the clusters have
different densities.
The :class:`sklearn.cluster.OPTICS` is first used with its Xi cluster detection
method, and then setting specific thresholds on the reachability, which
corresponds to :class:`sklearn.cluster.DBSCAN`. We can see that the different
clusters of OPTICS's Xi method can be recovered with different choices of
thresholds in DBSCAN.
"""

# Authors: Shane Grigsby <refuge@rocktalus.com>
# Adrin Jalali <adrin.jalali@gmail.com>
# License: BSD 3 clause


from sklearn.cluster import OPTICS, cluster_optics_dbscan
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data

np.random.seed(0)
n_points_per_cluster = 250

C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))

clust = OPTICS(min_samples=50, xi=.05, min_cluster_size=.05)

# Run the fit
clust.fit(X)

labels_050 = cluster_optics_dbscan(reachability=clust.reachability_,
core_distances=clust.core_distances_,
ordering=clust.ordering_, eps=0.5)
labels_200 = cluster_optics_dbscan(reachability=clust.reachability_,
core_distances=clust.core_distances_,
ordering=clust.ordering_, eps=2)

space = np.arange(len(X))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]

plt.figure(figsize=(10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])

# Reachability plot
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
Xk = space[labels == klass]
Rk = reachability[labels == klass]
ax1.plot(Xk, Rk, color, alpha=0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha=0.3)
ax1.plot(space, np.full_like(space, 2., dtype=float), 'k-', alpha=0.5)
ax1.plot(space, np.full_like(space, 0.5, dtype=float), 'k-.', alpha=0.5)
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')

# OPTICS
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
Xk = X[clust.labels_ == klass]
ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax2.plot(X[clust.labels_ == -1, 0], X[clust.labels_ == -1, 1], 'k+', alpha=0.1)
ax2.set_title('Automatic Clustering\nOPTICS')

# DBSCAN at 0.5
colors = ['g', 'greenyellow', 'olive', 'r', 'b', 'c']
for klass, color in zip(range(0, 6), colors):
Xk = X[labels_050 == klass]
ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3, marker='.')
ax3.plot(X[labels_050 == -1, 0], X[labels_050 == -1, 1], 'k+', alpha=0.1)
ax3.set_title('Clustering at 0.5 epsilon cut\nDBSCAN')

# DBSCAN at 2.
colors = ['g.', 'm.', 'y.', 'c.']
for klass, color in zip(range(0, 4), colors):
Xk = X[labels_200 == klass]
ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
ax4.plot(X[labels_200 == -1, 0], X[labels_200 == -1, 1], 'k+', alpha=0.1)
ax4.set_title('Clustering at 2.0 epsilon cut\nDBSCAN')

plt.tight_layout()
plt.show()
4 changes: 3 additions & 1 deletion sklearn/cluster/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@
FeatureAgglomeration)
from .k_means_ import k_means, KMeans, MiniBatchKMeans
from .dbscan_ import dbscan, DBSCAN
from .optics_ import OPTICS, cluster_optics_dbscan, compute_optics_graph
from .optics_ import (OPTICS, cluster_optics_dbscan, compute_optics_graph,
cluster_optics_xi)
from .bicluster import SpectralBiclustering, SpectralCoclustering
from .birch import Birch

Expand All @@ -21,6 +22,7 @@
'DBSCAN',
'OPTICS',
'cluster_optics_dbscan',
'cluster_optics_xi',
'compute_optics_graph',
'KMeans',
'FeatureAgglomeration',
Expand Down
Loading
0