-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG+2] OPTICS: add extract_xi method #12077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1375b73
to
baa94f9
Compare
Ahh, I see. Thanks a lot @adrinjalali @kno10. So our method actually produce the same result as R's. That's fine. |
@adrinjalali please mention how we assign the labels. |
what do you want me to change? I think I've applied to comments there, sorry if I've missed something, could you please remind me? |
sklearn/cluster/optics_.py
Outdated
@@ -719,7 +719,7 @@ def _correct_predecessor(reachability_plot, predecessor, ordering, s, e): | |||
while s < e: | |||
if reachability_plot[s] > reachability_plot[e]: | |||
return s, e | |||
p_e = predecessor[e] | |||
p_e = ordering[predecessor[e]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. That comes a bit unexpected. I would have expected the predecessor to be an object index, not a plot order index. I'd assume that users will usually expect the predecessor to be in data order, what do you think?
In particular, this is the "externally" used ordering:
scikit-learn/sklearn/cluster/optics_.py
Lines 163 to 164 in d83c85d
predecessor_ : array, shape (n_samples,) | |
Point that a sample was reached from, indexed by object order. |
I'd rather try to get rid of the reindexing here:
scikit-learn/sklearn/cluster/optics_.py
Line 624 in d83c85d
clusters = _xi_cluster(reachability[ordering], predecessor[ordering], |
or at least use some explicit naming, such as
predecessor_plot
. Otherwise, this is a maintenance nightmare, and the next author trying to modify this will be similarly confused and also need 4-5 attempts to get the indexing right...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's great, thanks a lot @adrinjalali @kno10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's why I renamed the parameters to predecessor_plot
if the order was the same os reachability_plot
. We can/should better document this and clarity in our docs later. But I think this is fine for now.
@adrinjalali we need to update what's new and mention how we assign the labels, but I'm OK to merge this one first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ping @jnothman @espg @GaelVaroquaux
Ridiculously great effort everyone. I think we should make sure to credit
Shane, Adrin, Erich, Assia and Hanmin in what's new... And then hope this
thing survives its release.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, ping @jnothman is your approval still valid? I think we can merge.
Seems like this is ready for merge :) |
Really good to see this merged-- stellar work everyone! =) |
I'd like to wait for @jnothman at least. We've introduced couple of changes since his approval. |
I've been following the conversation more-or-less. Even if I'm a little ashamed at missing lots, I agree with the subsequent changes and am very glad that @qinhanmin2014 and @kno10 have treated this thoroughly. |
Done. |
Thanks everyone for your great work! |
This reverts commit 14724c7.
This reverts commit 14724c7.
Related: #12036, #12053, #11677
Fixes #12376
Adding the Xi extraction method to OPTICS.
@espg, I'll need the
predecessor_
to add that final touch to it. I'm still testing the implementation, but it'll be nice if I could get some feedback. (done)Differences with the paper:
r(e_U + 1)
, but section 4.3.2 compares values withr(e_U)
, this implementation takesr(e_U + 1)
which seems more consistent.min_samples
as the minimum size of clusters, which is the default behavior of this implementation, but we also have amin_cluster_size
to give more freedom to the user.