-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] MNT rename min_cluster_size_ratio to min_cluster_size #11913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sklearn/cluster/optics_.py
Outdated
neighborhood_size = int(min_maxima_ratio * len(ordering)) | ||
|
||
# Should this check for < min_samples? Should this be public? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it should be min_samples. The minimum cluster size in dbscan is less than min_samples. But I don't see why it should be 5 either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have suggested minimum value of 2 in my PR, and changing the name to min_neighbors or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm accepting your min cluster size. Not so much the min_samples name. See there.
@jnothman it makes sense. I like the change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably be adding some tests here
sklearn/cluster/optics_.py
Outdated
neighborhood_size = int(min_maxima_ratio * len(ordering)) | ||
|
||
# Should this check for < min_samples? Should this be public? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm accepting your min cluster size. Not so much the min_samples name. See there.
This should be merged for release. Seeking reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small typos
sklearn/cluster/optics_.py
Outdated
@@ -221,8 +223,10 @@ class OPTICS(BaseEstimator, ClusterMixin): | |||
significant_min : float, optional | |||
Sets a lower threshold on how small a significant maxima can be. | |||
|
|||
min_cluster_size_ratio : float, optional | |||
Minimum percentage of dataset expected for cluster membership. | |||
min_cluster_size : int > 1 or loat between 0 and 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loat -> float
sklearn/cluster/optics_.py
Outdated
@@ -528,8 +532,10 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75, | |||
significant_min : float, optional | |||
Sets a lower threshold on how small a significant maxima can be. | |||
|
|||
min_cluster_size_ratio : float, optional | |||
Minimum percentage of dataset expected for cluster membership. | |||
min_cluster_size : int > 1 or loat between 0 and 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float
Thanks @albertcthomas! Does the implementation appear correct? Does the naming seem to be improved by this change? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very familiar with OPTICS but if I wanted to use it min_cluster_size
and its description would totally make sense to me.
neighborhood_size = int(min_maxima_ratio * len(ordering)) | ||
|
||
# Should this check for < min_samples? Should this be public? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason for removing this in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I'm changing 5 to 2 above (using max
instead of if
). @adrinjalali and I seem to agree that min_samples
is not the relevant criterion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The role of min_samples
is kinda different in OPTICS
than DBSCAN
. In OPTICS you can have the generator epsilon set to a large number (it's inf
by default), to generate a smooth reachability_
array. Similarly, min_samples
can also be set relatively large. Still, since the clusters are detected hierarchically, it can be the case that a cluster is smaller than the value for min_samples
, and that's fine. I agree that it doesn't make much sense probably to do so, but they're independent parameters IMO.
I think we need some input validation (maybe some tests) here. Otherwise this LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
raise ValueError('min_cluster_size must be a positive integer or ' | ||
'a float between 0 and 1. Got %r' % | ||
self.min_cluster_size) | ||
elif self.min_cluster_size > n_samples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that min_cluster_size == n_samples
is possible if all the samples are in the same unique cluster
There's a |
Agreed @adrinjalali? Any objections @espg?