8000 [MRG] MNT rename min_cluster_size_ratio to min_cluster_size by jnothman · Pull Request #11913 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

[MRG] MNT rename min_cluster_size_ratio to min_cluster_size #11913

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Sep 9, 2018

Conversation

jnothman
Copy link
Member

Agreed @adrinjalali? Any objections @espg?

@jnothman jnothman added this to the 0.20 milestone Aug 26, 2018
neighborhood_size = int(min_maxima_ratio * len(ordering))

# Should this check for < min_samples? Should this be public?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be min_samples. The minimum cluster size in dbscan is less than min_samples. But I don't see why it should be 5 either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have suggested minimum value of 2 in my PR, and changing the name to min_neighbors or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm accepting your min cluster size. Not so much the min_samples name. See there.

@jnothman jnothman changed the title MNT rename min_cluster_size_ratio to min_cluster_size [MRG] MNT rename min_cluster_size_ratio to min_cluster_size Aug 27, 2018
@adrinjalali
Copy link
Member

@jnothman it makes sense. I like the change.

Copy link
Member Author
@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably be adding some tests here

neighborhood_size = int(min_maxima_ratio * len(ordering))

# Should this check for < min_samples? Should this be public?
Copy link
< 10000 /details-menu>
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm accepting your min cluster size. Not so much the min_samples name. See there.

@jnothman
Copy link
Member Author

This should be merged for release. Seeking reviews.

Copy link
Contributor
@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typos

@@ -221,8 +223,10 @@ class OPTICS(BaseEstimator, ClusterMixin):
significant_min : float, optional
Sets a lower threshold on how small a significant maxima can be.

min_cluster_size_ratio : float, optional
Minimum percentage of dataset expected for cluster membership.
min_cluster_size : int > 1 or loat between 0 and 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loat -> float

@@ -528,8 +532,10 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75,
significant_min : float, optional
Sets a lower threshold on how small a significant maxima can be.

min_cluster_size_ratio : float, optional
Minimum percentage of dataset expected for cluster membership.
min_cluster_size : int > 1 or loat between 0 and 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float

@jnothman
Copy link
Member Author
jnothman commented Sep 2, 2018

Thanks @albertcthomas! Does the implementation appear correct? Does the naming seem to be improved by this change?

Copy link
Contributor
@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not very familiar with OPTICS but if I wanted to use it min_cluster_size and its description would totally make sense to me.

neighborhood_size = int(min_maxima_ratio * len(ordering))

# Should this check for < min_samples? Should this be public?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason for removing this in this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I'm changing 5 to 2 above (using max instead of if). @adrinjalali and I seem to agree that min_samples is not the relevant criterion here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The role of min_samples is kinda different in OPTICS than DBSCAN. In OPTICS you can have the generator epsilon set to a large number (it's inf by default), to generate a smooth reachability_ array. Similarly, min_samples can also be set relatively large. Still, since the clusters are detected hierarchically, it can be the case that a cluster is smaller than the value for min_samples, and that's fine. I agree that it doesn't make much sense probably to do so, but they're independent parameters IMO.

@qinhanmin2014
Copy link
Member

I think we need some input validation (maybe some tests) here. Otherwise this LGTM.

Copy link
Contributor
@albertcthomas albertcthomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

raise ValueError('min_cluster_size must be a positive integer or '
'a float between 0 and 1. Got %r' %
self.min_cluster_size)
elif self.min_cluster_size > n_samples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that min_cluster_size == n_samples is possible if all the samples are in the same unique cluster

@adrinjalali
Copy link
Member

There's a flake8 error, otherwise LGTM.

@jnothman jnothman merged commit a86709f into scikit-learn:master Sep 9, 2018
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Sep 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0