8000 `min_samples` in HDSCAN · Issue #28976 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

min_samples in HDSCAN #28976

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
psl-schaefer opened this issue May 8, 2024 · 3 comments · Fixed by #29263
Closed

min_samples in HDSCAN #28976

psl-schaefer opened this issue May 8, 2024 · 3 comments · Fixed by #29263

Comments

@psl-schaefer
Copy link

Describe the issue linked to the documentation

I find the description of the min_samples argument in sklearn.cluster.HDBSCAN confusing.

It says "The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself."

But if I understand everything correctly min_samples corresponds to the $k$ used to compute the core distance $\text{core}_k\left(x\right)$ for every sample $x$ where the $k$'th core distance for some sample $x$ is defined as the distance to the $k$'th nearest-neighbor of $x$ (counting itself). (-> which exactly what is happening in the code here: https://github.com/scikit-learn-contrib/hdbscan/blob/fc94241a4ecf5d3668cbe33b36ef03e6160d7ab7/hdbscan/_hdbscan_reachability.pyx#L45-L47, where it is called min_points)

I don't understand how both of these descriptions are equivalent. I would assume that other people might find that confusing as well.

Link in Code:

min_samples : int, default=None
The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
When `None`, defaults to `min_cluster_size`.

Link in Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

Suggest a potential alternative/fix

No response

@psl-schaefer psl-schaefer added Documentation Needs Triage Issue requires triage labels May 8, 2024
@adrinjalali
Copy link
Member

cc @Micky774

@glemaitre
Copy link
Member

Indeed, we used the docstring of the original implementation that reused the DBSCAN information. However, the parameter here have a different meaning: it define the core distance.

So we should make sure to change the different docstrings from the file.

@glemaitre glemaitre added help wanted and removed Needs Triage Issue requires triage labels May 15, 2024
m-maggi added a commit to m-maggi/scikit-learn that referenced this issue Jun 14, 2024
@m-maggi
Copy link
Contributor
m-maggi commented Jun 14, 2024

@glemaitre I opened a draft PR for this. I have a proposal for the description of min_samples, if that's fine I can write that also in other place in sklearn/cluster/_hdbscan/hdbscan.py, e.g. here

def _brute_mst(mutual_reachability, min_samples):
"""
Builds a minimum spanning tree (MST) from the provided mutual-reachability
values. This function dispatches to a custom Cython implementation for
dense arrays, and `scipy.sparse.csgraph.minimum_spanning_tree` for sparse
arrays/matrices.
Parameters
----------
mututal_reachability_graph: {ndarray, sparse matrix} of shape \
(n_samples, n_samples)
Weighted adjacency matrix of the mutual reachability graph.
min_samples : int, default=None
The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.

or here

Notes
-----
The `min_samples` parameter includes the point itself, whereas the implementation in
`scikit-learn-contrib/hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_
does not. To get the same results in both versions, the value of `min_samples` here
must be 1 greater than the value used in `scikit-learn-contrib/hdbscan
<https://github.com/scikit-learn-contrib/hdbscan>`_.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0