[MRG] Implementation of OPTICS #1984
Conversation
Equivalent results to DBSCAN, but allows execution on arbitrarily large datasets. After initial construction, allows multiple 'scans' to quickly extract DBSCAN clusters at variable epsilon distances.
Shows example usage of OPTICS to extract clustering structure
examples/cluster/plot_optics
Outdated
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

##############################################################################
I tend to prefer not having these separation lines... A blank line is fine :)
Your code isn't PEP 8, and the documentation doesn't follow numpydoc's conventions.
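For reference, a sketch of the numpydoc layout being requested; the method and parameter shown here are hypothetical stand-ins, not code from this PR:

def extract(self, epsilon_prime):
    """Extract DBSCAN-equivalent cluster labels at a new epsilon.

    Parameters
    ----------
    epsilon_prime : float
        Maximum reachability distance at which to extract clusters.

    Returns
    -------
    labels : ndarray of shape (n_samples,)
        Cluster labels for each point, with -1 marking noise.
    """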
sklearn/cluster/optics.py
Outdated
## Main Class ##

class setOfObjects(BallTree):
Make sure to follow PEP 8's convention for class names.
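A quick sketch of the requested rename; under PEP 8, class names use CapWords, and the BallTree base class follows the line quoted above:

from sklearn.neighbors import BallTree

class SetOfObjects(BallTree):
    """CapWords spelling of the class flagged above."""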
Thanks for the PR. I'm not that familiar with OPTICS myself, but I had a look through. The other thing I'll mention is that scikit-learn generally follows this API style. Hope that helps for now. Thanks again 👍
Mainly issues in long lines, long comments
@espg Since this PR has stalled, I think it would be a good idea to extract the source into a GitHub repo or gist and add it to https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets
@mblondel Do you mind if I keep this PR open until the end of the month to make some changes that bring it more in line with the scikit-learn API? I was hiking for two months on the Pacific Crest Trail and had shoulder surgery after I got back, so the PR had fallen off my radar, but I can fix the remaining issues in the next two weeks if there's still interest in having an OPTICS implementation in scikits-learn.
This is great work!
@espg : I am very interested in an OPTICS implementation in scikits-learn :)
What's the status of this?
Stalled ;) Feel free to pick it up, I'd say. Needs tests, documentation, benchmarks, I'd think.
I want to jump in and take this over if it's stalled. I can start this weekend :)
I'm pretty sure no-one would mind if it provides a good improvement over DBSCAN.
OPTICS object, with fit method. Documentation updates. extract method
@michaelaye : Good question. This had fallen off my radar a bit, but I just pushed a fairly large update, the main purpose of which was to fit in with the scikit-learn API; documentation has been expanded as well. I believe that fit within the sklearn API was the main remaining issue. The code is PEP 8 with the exception of a single long line, which is hard to rewrite without losing readability. The algorithm has been tested on points in three dimensions up to order 10^8 with no ill effects (~40 hours on my 4-year-old laptop), and memory use is small.

I know scikit-learn doesn't have an OPTICS implementation, so I'd like to see it included and am happy to work with the maintainers if there are any outstanding issues. While the API is as close as possible to sklearn, there doesn't seem to be a natural way to 'rescan' for a different eps distance. The sklearn methods fit, predict, and transform all expect data as input, and the whole point of OPTICS is that you don't need to recompute anything for a lower eps after the first run, so there isn't really a sklearn-ish way to implement that. I've included an 'extract' method that takes a new eps distance and nothing else to do this.

I think there are additional things that could be done to enhance the module, specifically a save-and-restore function for dealing with large datasets where you want persistence beyond a single Python session, and also some additional performance tweaks. That said, this algorithm doesn't currently exist in sklearn, and it works and matches the sklearn API fairly well. It is also, due to aggressive pruning, several orders of magnitude faster than other implementations I've seen.
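To make the fit-once, extract-many pattern concrete, here is a minimal usage sketch; the import path and the extract signature follow this PR's draft API and are assumptions, not a settled interface:

from sklearn.datasets import make_blobs
from sklearn.cluster.optics_ import OPTICS  # import path assumed from this PR

centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)

clust = OPTICS(min_samples=10).fit(X)  # the expensive step, run once
labels_a = clust.extract(0.5)          # quick re-scan at eps=0.5
labels_b = clust.extract(0.3)          # a lower eps needs no recomputation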
@amueller : I hadn't refreshed the page, so I didn't see your message till just now. I ran some benchmarks a while back; I'm not sure if those work or if you have a specific methodology that you expect people to follow. Let me know about the documentation as it stands. As for having other people jump in: this was my first attempt at contributing to any software package, and I get the distinct feeling that I'm pretty green at it. I'd love any help with getting this accepted into the master branch; it does no good to anyone just sitting around :/
@espg For persistence, that is basically automatic with pickling. The estimator should be pickle-able (there is actually a test for that afaik), and if not, that should be fixed.

For the benchmarks: the results are supposed to be comparable to DBSCAN, right? So I think a performance comparison against DBSCAN, showing how it is more scalable, would be great (or showing that with the current DBSCAN implementation the memory would blow up).

For rescanning: maybe @GaelVaroquaux can say a word about how he solved a similar problem with agglomerative approaches. My first idea would be to have some attribute that stores the whole hierarchical structure, which could be accessed by the user. That is obviously not super convenient.

There are also tests missing; for clustering, usually some sanity tests on synthetic data and maybe some that are specific to the algorithm. I can't really say much about that as I am not super familiar with the algorithm. Usually we want near-complete line coverage, though.

For the documentation, we also want narrative documentation. Maybe also add it to the clustering comparison example.
I know this is a lot to ask for, but we have to uphold our standards ;) Let me know if you need more info or any help. I'll try to see if I can review the code in more detail (or maybe someone who is more familiar with the algorithm can).
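To illustrate the pickling point above, a fitted estimator can be persisted across Python sessions; a minimal sketch, assuming a fitted OPTICS instance named clust as in the earlier example:

import pickle

with open('optics_model.pkl', 'wb') as f:
    pickle.dump(clust, f)         # persist the fitted estimator to disk

with open('optics_model.pkl', 'rb') as f:
    restored = pickle.load(f)     # reload later without refitting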
It might also be interesting to compare to #3802.
I can also try reviewing, once you feel that it is in a reviewable state.
joblib.Memory (look for it in FeatureAgglomeration).
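For context, a minimal sketch of the joblib.Memory caching pattern used by FeatureAgglomeration; the cached helper below is a hypothetical stand-in for the expensive OPTICS computation, and assumes X is already defined:

from joblib import Memory
from sklearn.cluster.optics_ import OPTICS  # import path assumed from this PR

memory = Memory('/tmp/optics_cache', verbose=0)

@memory.cache
def fit_optics(X, min_samples):
    # recomputed only when X or min_samples change; otherwise read from cache
    return OPTICS(min_samples=min_samples).fit(X)

clust = fit_optics(X, 10)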
So I've tried:

import numpy as np
import pytest
from numpy.testing import assert_array_equal
from sklearn.cluster import DBSCAN
from sklearn.cluster.optics_ import OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics.cluster import contingency_matrix


@pytest.mark.parametrize('eps', [0.1, .3, .5])
@pytest.mark.parametrize('min_samples', [1, 10, 20])
def test_dbscan_optics_parity(eps, min_samples):
    # Test that OPTICS clustering labels differ from DBSCAN by < 5 points
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)
    # calculate optics with dbscan extraction at the parametrized epsilon
    op = OPTICS(min_samples=min_samples).fit(X)
    core_optics, labels_optics = op.extract_dbscan(eps)
    # calculate dbscan labels
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    contingency = contingency_matrix(db.labels_, labels_optics)
    # agreement is at best the sum of the max value on either axis
    agree = min(np.sum(np.max(contingency, axis=0)),
                np.sum(np.max(contingency, axis=1)))
    disagree = X.shape[0] - agree
    # verify core labels match
    assert_array_equal(core_optics, db.core_sample_indices_)
    # verify label mismatch is < 5 labels
    assert disagree < 5

It fails for:
Otherwise it passes. Is this what we should expect?
(Note that I changed the calculation of …)
The mismatch should probably be checked as a function of the number of periphery points. The core points are always the same, but depending on how much noise is present, we would expect the disagreement between what is/isn't noise to stay proportionally the same based on cluster size, or on how well the algorithm is extracting clusters in general. Testing that there is no more than 6% disagreement between the non-core points works, but I may see if there's a way to write the test so that it passes at 5%, since that feels slightly less arbitrary.
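A rough sketch of the proportional check being proposed, reusing the names from the test above; the 5% threshold is the value under discussion, not a settled constant:

# only non-core ("periphery") points can legitimately differ between the two
non_core_count = len(labels_optics) - len(core_optics)
# require mismatch within ~5% of the periphery, not a fixed point count
assert disagree / max(non_core_count, 1) <= 0.05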
Well we can explicitly check for how many points are not reachable from core points, etc. Or should we be using a count of dbscan.labels_ == -1 as a basis for assessing discrepancy?
Vectorized the extract dbscan function to see if it would improve performance of periphery point labeling; it did not, but the function is now vectorized. Changed the unit test with min_samples=1 to min_samples=3, as at min_samples=1 the test isn't meaningful (no noise is possible; all points are marked core). Parameterized the parity test. Changed the parity test to assert ~5% or better mismatch instead of 5 points (this is needed for larger clusters, as the starting-point mismatch effect scales with cluster size).
I looked at the original OPTICS paper to see if they claim exact correspondence with DBSCAN; here's the relevant passage:
I think the takeaway is that the core points will be exactly identical (and the unit tests check for this), but the periphery point classification will be "nearly indistinguishable". In practice, from the unit testing, "nearly indistinguishable" seems to mean the same labeling within ~5% of the portion of the data that isn't processed as core.
Thanks for that!
I'm a little concerned that we've still missed something, but we've certainly improved a lot over the last months. I'm going to be bold and approve.
I think @agramfort's approval here is a bit stale. It would be good for someone to have another quick look over tests and/or implementation.
Thanks @espg!!
assert_array_equal(core_optics, db.core_sample_indices_)

non_core_count = len(labels_optics) - len(core_optics)
percent_mismatch = np.round((disagree - 1) / non_core_count, 2)
I prefer fraction when we're not multiplying by 100
Why do we need to round?
I'll try to review and hopefully merge this guy.
I am pushing cosmetic changes to this branch, rather than asking for them (with the goal of merging this ASAP).
I am not succeeding in pushing to your fork (maybe a permission thing), so I've created a branch on my fork: https://github.com/GaelVaroquaux/scikit-learn/tree/pr_1984. I might merge from it.
nbrs.fit(X)
self.core_distances_[:] = nbrs.kneighbors(X,
                                          self.min_samples)[0][:, -1]
self._nbrs = nbrs
I do not think that it is necessary to store this object. It is used only in the fit. Storing it adds weight to the memory and to the pickle.
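A sketch of the shape of the change being suggested, wrapped in a hypothetical helper; the NearestNeighbors construction is assumed from the surrounding diff:

from sklearn.neighbors import NearestNeighbors

def _compute_core_distances(self, X):
    # keep nbrs local: nothing needs it after fit returns
    nbrs = NearestNeighbors(n_neighbors=self.min_samples).fit(X)
    self.core_distances_[:] = nbrs.kneighbors(X, self.min_samples)[0][:, -1]
    # no `self._nbrs = nbrs`: storing it would bloat memory and the pickle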
Merged via #11547. Hurray!! Thank you
Wow. That's been a long time coming!!
Is it planned to go out with the next release or is it already available?
Already available in master, and will be in a versioned release in the coming weeks.
It's available in master. It will be in the next release.
Hooray! Thanks @jnothman and @GaelVaroquaux for reviewing to get this over the finish line :-)
Thanks a lot for sticking around!!! This is an awesome PR!
Congrats @espg! Glad this finally made it!
It was my first PR, definitely learned a lot about the process...
And you stuck around. Congratulations. We know it's not easy, and we are grateful.
So are the users!
In the meantime, how can we install GitHub's version of sklearn so we can access OPTICS? @GaelVaroquaux I tried cloning from GitHub and running setup.py (still could not import it), as well as running pip3 install git+https://github.com/scikit-learn/scikit-learn.git; is there something I'm missing? How should we be able to install it?

Edit: never mind, just got it hahaha. from sklearn.cluster.optics_ import OPTICS
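Putting the pieces from that comment together, the working sequence appears to be as follows; the private optics_ module path is as reported above and predates any public export:

# first install the development version, e.g.:
#   pip3 install git+https://github.com/scikit-learn/scikit-learn.git
# then import from the private module, since OPTICS was not yet publicly exported
from sklearn.cluster.optics_ import OPTICS

clust = OPTICS(min_samples=10)  # ready to use as in the examples above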
If you don't give us a sense of how it broke, we can't help. However, we hope to release within a few weeks.
Algorithm is tested and validated, but not yet optimized. Tested on 70K points (successful); currently testing on 2 x 10^9 LiDAR points (pending)
First commit; would appreciate any advice on changes to help code conform to the scikit-learn API
Example usage in examples folder