[MRG] Implementation of OPTICS #1984
Conversation
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances.
Shows example usage of OPTICS to extract clustering structure
examples/cluster/plot_optics
Outdated
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

##############################################################################
I tend to prefer not having these separation lines... A blank line is fine :)
Your code isn't PEP 8, and the documentation doesn't follow numpydoc's conventions.
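For reference, a sketch of the numpydoc layout being requested; the method and parameter shown here are hypothetical stand-ins, not code from this PR:

def extract(self, epsilon_prime):
    """Extract DBSCAN-equivalent cluster labels at a new epsilon.

    Parameters
    ----------
    epsilon_prime : float
        Maximum reachability distance at which to extract clusters.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
        Cluster labels for each point, with -1 marking noise.
    """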
sklearn/cluster/optics.py
Outdated
## Main Class ##

class setOfObjects(BallTree):
Make sure to follow PEP 8's convention for class names.
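A quick sketch of the requested rename; under PEP 8, class names use CapWords, and the BallTree base class follows the line quoted above:

from sklearn.neighbors import BallTree

class SetOfObjects(BallTree):
    """CapWords spelling of the class flagged above."""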
Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through. The other thing I'll mention is that scikit-learn generally follows this API style. Hope that helps for now. Thanks again 👍
Mainly issues in long lines, long comments
@espg Since this PR has stalled, I think it would be a good idea to extract the source into a GitHub repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API? I was hiking for two months on the Pacific Crest Trail and had shoulder surgery after I got back, so the PR had fallen off my radar, but I can fix the remaining issues in the next two weeks if there's still interest in having an OPTICS implementation in scikits-learn.
This is great work!
@espg : I am very interested in an OPTICS implementation in scikits-learn :)
What's the status of this?
Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.
I want to jump in and take this over if it's stalled. I can start this weekend :)
I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.
OPTICS object, with fit method. Documentation updates. extract method
@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been expanded as well. I believe that fit within the sklearn API was the main remaining issue. The code is PEP 8 with the exception of a single long line, which is hard to rewrite without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4-year-old laptop), and memory use is small.

I know scikit-learn doesn't have an OPTICS implementation, so I'd like to see it included and am happy to work with the maintainers if there are any outstanding issues. While the API is as close as possible to sklearn, there doesn't seem to be a natural way to 'rescan' for a different eps distance. The sklearn methods fit, predict, and transform all expect data as input, and the whole point of OPTICS is that you don't need to recompute anything for a lower eps after the first run, so there isn't really a sklearn-ish way to implement that. I've included an 'extract' method that takes a new eps distance and nothing else to do this.

I think there are additional things that could be done to enhance the module, specifically a save-and-restore function for dealing with large datasets where you want persistence beyond a single Python session, and also some additional performance tweaks. That said, this algorithm doesn't currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations I've seen.
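To make the fit-once, extract-many pattern concrete, here is a minimal usage sketch; the import path and the extract signature follow this PR's draft API and are assumptions, not a settled interface:

from sklearn.datasets import make_blobs
from sklearn.cluster.optics_ import OPTICS  # import path assumed from this PR

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)

clust = OPTICS(min_samples=10).fit(X)  # the expensive step, run once
labels_a = clust.extract(0.5)          # quick re-scan at eps=0.5
labels_b = clust.extract(0.3)          # a lower eps needs no recomputation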
@amueller : I hadn't refreshed the page, so I didn't see your message till just now. I ran some benchmarks a while back; I'm not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands. As for having other people jump in: this was my first attempt at contributing to any software package, and I get the distinct feeling that I'm pretty green at it. I'd love any help with getting this accepted into the master branch; it does no good to anyone just sitting around :/
@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not, that should be fixed.

For the benchmarks: the results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current DBSCAN implementation the memory would blow up).

For rescanning: maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure, which could be accessed by the user. That is obviously not super convenient.

There are also tests missing; for clustering, usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line coverage, though.

For the documentation, we also want narrative documentation. Maybe also add it to the clustering comparison example.
I know this is a lot to ask for, but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).
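To illustrate the pickling point above, a fitted estimator can be persisted across Python sessions; a minimal sketch, assuming a fitted OPTICS instance named clust as in the earlier example:

import pickle

with open('optics_model.pkl', 'wb') as f:
    pickle.dump(clust, f)         # persist the fitted estimator to disk

with open('optics_model.pkl', 'rb') as f:
    restored = pickle.load(f)     # reload later without refitting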
It might also be interesting to compare to #3802.
I can also try reviewing, once you feel that it is in a reviewable state.
joblib.Memory (look for it in FeatureAgglomeration).
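For context, a minimal sketch of the joblib.Memory caching pattern used by FeatureAgglomeration; the cached helper below is a hypothetical stand-in for the expensive OPTICS computation, and assumes X is already defined:

from joblib import Memory
from sklearn.cluster.optics_ import OPTICS  # import path assumed from this PR

memory = Memory('/tmp/optics_cache', verbose=0)

@memory.cache
def fit_optics(X, min_samples):
    # recomputed only when X or min_samples change; otherwise read from cache
    return OPTICS(min_samples=min_samples).fit(X)

clust = fit_optics(X, 10)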
So I've tried:

import numpy as np
import pytest
from numpy.testing import assert_array_equal
from sklearn.cluster import DBSCAN
from sklearn.cluster.optics_ import OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix


@pytest.mark.parametrize('eps', [0.1, .3, .5])
@pytest.mark.parametrize('min_samples', [1, 10, 20])
def test_dbscan_optics_parity(eps, min_samples):
    # Test that OPTICS clustering labels differ from DBSCAN by < 5 points
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)
    # calculate optics with dbscan extraction at the parametrized epsilon
    op = OPTICS(min_samples=min_samples).fit(X)
    core_optics, labels_optics = op.extract_dbscan(eps)
    # calculate dbscan labels
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    contingency = contingency_matrix(db.labels_, labels_optics)
    # agreement is at best the sum of the max value on either axis
    agree = min(np.sum(np.max(contingency, axis=0)),
                np.sum(np.max(contingency, axis=1)))
    disagree = X.shape[0] - agree
    # verify core labels match
    assert_array_equal(core_optics, db.core_sample_indices_)
    # verify label mismatch is < 5 labels
    assert disagree < 5

It fails for:
Otherwise it passes. Is this what we should expect?
(Note that I changed the calculation of …)
The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement between what is/isn't noise to stay proportionally the same based on cluster size, or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test so that it passes at 5%, since that feels slightly less arbitrary.
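A rough sketch of the proportional check being proposed, reusing the names from the test above; the 5% threshold is the value under discussion, not a settled constant:

# only non-core ("periphery") points can legitimately differ between the two
non_core_count = len(labels_optics) - len(core_optics)
# require mismatch within ~5% of the periphery, not a fixed point count
assert disagree / max(non_core_count, 1) <= 0.05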
Well we can explicitly check for how many points are not reachable from core points, etc. Or should we be using a count of dbscan.labels_ == -1 as a basis for assessing discrepancy?
Vectorized the extract dbscan function to see if it would improve performance of periphery point labeling; it did not, but the function is now vectorized. Changed the unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible; all points are marked core). Parameterized the parity test. Changed the parity test to assert ~5% or better mismatch instead of 5 points (this is needed for larger clusters, as the starting-point mismatch effect scales with cluster size).
I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:
I think the takeaway is that the core points will be exactly identical (and the unit tests check for this), but the periphery point classification will be "nearly indistinguishable". In practice, from the unit testing, "nearly indistinguishable" seems to mean the same labeling within ~5% of the portion of the data that isn't processed as core.
Thanks for that!
I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.
I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.
Thanks @espg!!
assert_array_equal(core_optics, db.core_sample_indices_)

non_core_count = len(labels_optics) - len(core_optics)
percent_mismatch = np.round((disagree - 1) / non_core_count, 2)
I prefer fraction when we're not multiplying by 100
Why do we need to round?
I'll try to review and hopefully merge this guy.
I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).
I am not succeeding in pushing to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984. I might merge from it.
nbrs.fit(X)
self.core_distances_[:] = nbrs.kneighbors(X,
                                          self.min_samples)[0][:, -1]
self._nbrs = nbrs
I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.
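A sketch of the shape of the change being suggested, wrapped in a hypothetical helper; the NearestNeighbors construction is assumed from the surrounding diff:

from sklearn.neighbors import NearestNeighbors

def _compute_core_distances(self, X):
    # keep nbrs local: nothing needs it after fit returns
    nbrs = NearestNeighbors(n_neighbors=self.min_samples).fit(X)
    self.core_distances_[:] = nbrs.kneighbors(X, self.min_samples)[0][:, -1]
    # no `self._nbrs = nbrs`: storing it would bloat memory and the pickle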
Merged via #11547. Hurray!! Thank you
Wow. That's been a long time coming!!
Is it planned to go out with the next release or is it already available?
Already available in master, and will be in a versioned release in the coming weeks.
It's available in master. It will be in the next release.
Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)
Thanks a lot for sticking around!!! This is an awesome PR!
Congrats @espg! Glad this finally made it!
It was my first PR, definitely learned a lot about the process...
And you stuck around. Congratulations. We know it's not easy, and we are grateful.
So are the users!
In the meantime, how can we install GitHub's version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub and running setup.py (still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we be able to install it?

Edit: never mind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS
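Putting the pieces from that comment together, the working sequence appears to be as follows; the private optics_ module path is as reported above and predates any public export:

# first install the development version, e.g.:
#   pip3 install git+https://github.com/scikit-learn/scikit-learn.git
# then import from the private module, since OPTICS was not yet publicly exported
from sklearn.cluster.optics_ import OPTICS

clust = OPTICS(min_samples=10)  # ready to use as in the examples above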
If you don't give us a sense of how it broke, we can't help. However, we hope to release within a few weeks.
Algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending)
First commit; would appreciate any advice on changes to help code conform to the scikit-learn API
Example usage in examples folder