`min_samples` in HDBSCAN · Issue #28976 · scikit-learn/scikit-learn · GitHub

Closed
@psl-schaefer

Description

Describe the issue linked to the documentation

I find the description of the min_samples argument in sklearn.cluster.HDBSCAN confusing.

It says "The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself."

But if I understand everything correctly, min_samples corresponds to the $k$ used to compute the core distance $\text{core}_k\left(x\right)$ for every sample $x$, where the core distance of a sample $x$ is defined as the distance from $x$ to its $k$-th nearest neighbor (counting $x$ itself). This is exactly what happens in the code here: https://github.com/scikit-learn-contrib/hdbscan/blob/fc94241a4ecf5d3668cbe33b36ef03e6160d7ab7/hdbscan/_hdbscan_reachability.pyx#L45-L47, where the parameter is called min_points.
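
To make my reading of the core distance concrete, here is a minimal sketch in plain Python. It is not the scikit-learn or hdbscan implementation; `core_distances` is a hypothetical helper and the four points are made up for illustration. It assumes Euclidean distance and counts each point as its own first neighbor.

```python
import math


def core_distances(points, k):
    """Return, for each point, the distance to its k-th nearest neighbor.

    The point itself counts as the first neighbor (distance 0.0),
    so k=1 always yields 0.0. Hypothetical helper, for illustration only.
    """
    result = []
    for p in points:
        # Distances from p to every point, including p itself.
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k - 1])
    return result


# Made-up 2D samples lying on a line.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
core = core_distances(points, k=3)
print(core)
```

Under this reading, a larger `k` can only increase each sample's core distance, and isolated samples such as the last one get the largest values.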

I don't see how these two descriptions are equivalent, and I assume other people might find this confusing as well.

Link in Code:

min_samples : int, default=None
The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
When `None`, defaults to `min_cluster_size`.

Link in Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

Suggest a potential alternative/fix

No response
