Clarify DBSCAN eps parameter misunderstanding
#13563
thanks @kno10
sklearn/cluster/dbscan_.py
Outdated
as in the same neighborhood.
The maximum distance between two samples for one to be considered
as in the neighborhood of the other. This is not a maximum bound
on the distances of points within a cluster, and the most important
"the most important" -> "one of the most important" / "an important", since min_samples is also important?
I agree with @kno10 that min_samples is not as important, because having more or fewer core samples in a region is less essential than determining whether samples are in the same (or any) cluster.
I don't like "and" here as it's not clear what "the most important ..." applies to. Start a new sentence "This is the most important"
You might note the importance of tuning eps in the user guide (doc/modules/cluster.rst) or summary section of the docstring. Do we have an example illustrating the effect of this parameter? What would be a good dataset to illustrate with?
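As one possible answer to the question about an illustrative dataset, here is a minimal sketch of how `eps` alone changes the result. The toy data (two groups of five points on a line, separated by a gap of 6.0), the `summarize` helper, and the three `eps` values are assumptions made up for illustration; they are not part of this PR.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: points 0..4 and 10..14 on a line.
# Spacing inside each group is 1.0; the gap between groups is 6.0.
X = np.concatenate([np.arange(0, 5), np.arange(10, 15)]).astype(float).reshape(-1, 1)


def summarize(eps, min_samples=2):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # label -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


# Too small: no point has a neighbor, so everything is noise.
# Moderate: neighbors within each group connect, giving two clusters.
# Too large: the gap is bridged and both groups merge into one cluster.
for eps in (0.5, 1.5, 6.5):
    print(eps, summarize(eps))
```

With `min_samples` held fixed, the same data goes from all noise, to two clusters, to a single merged cluster, which is the kind of sensitivity a user guide example could show.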
As seen here: https://stackoverflow.com/a/55388827/1939754, the old description of the eps parameter can be misunderstood as a maximum distance between any two points in a cluster. Also add a reference that discusses parameterization.
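To make the misreading concrete, here is a minimal sketch showing that `eps` bounds only the neighborhood radius, not the extent of a cluster. The chain of points and the parameter values are illustrative assumptions, not taken from the linked answer.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Ten points on a line, spaced 1.0 apart: 0, 1, 2, ..., 9.
X = np.arange(10, dtype=float).reshape(-1, 1)

# With eps=1.5, each point's neighborhood reaches only its adjacent points.
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

# Largest distance between two points that ended up in cluster 0.
members = X[labels == 0]
cluster_span = float(members.max() - members.min())

print(labels)        # every point gets the same cluster label
print(cluster_span)  # 9.0, far larger than eps
```

Every point is a core sample and each one is chained to the next, so all ten points form a single cluster whose two ends are 9.0 apart even though `eps` is 1.5.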
Made this two sentences, added a reference with discussion of parameterization, and added a paragraph to the user guide on parameterization, too.
Great, thanks!
Great improvement!
Also, people really need to tune this parameter, not rely on the bad default value.
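One common way to pick `eps` from the data instead of relying on the default is the sorted k-distance heuristic described in the original DBSCAN paper. The sketch below is only an illustration of that heuristic: the synthetic data, the seed, and the choice of `min_samples = 5` are assumptions, and reading off the "knee" of the curve still requires judgment.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical data: two dense blobs plus a few scattered points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 2)),
    rng.normal(5.0, 0.3, size=(50, 2)),
    rng.uniform(-2.0, 7.0, size=(10, 2)),
])

min_samples = 5  # assumed value; DBSCAN counts the point itself

# Querying the training data returns each point as its own first neighbor
# (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist against its index and looking for the sharp bend ("knee")
# suggests a candidate eps; a few quantiles give a rough picture without a plot.
print(np.quantile(k_dist, [0.5, 0.9, 0.95, 1.0]))
```

Points in dense regions have small k-distances and scattered points have large ones, so a value near the bend separates the two regimes better than a fixed default would.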