8000 ENH Add `HDBSCAN` as a new estimator in `sklearn.cluster` by Micky774 · Pull Request #26385 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Add HDBSCAN as a new estimator in sklearn.cluster #26385

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 31, 2023

Conversation

Micky774
Copy link
Contributor
@Micky774 Micky774 commented May 16, 2023

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Each change has been separately reviewed (see #24686 for details).

Edit: due to git shenanigans, some changes needed to be made within this PR. The novel changes included w/ this PR are:

  1. Replaced cnp.*_t typing with *_t from _typedefs.pxd
  2. Replaced *.shape[0] pattern with len(*) for ndarray objects
  3. Trimmed unused variables (thanks to Cython linting pre-commit)

Any other comments?

cc: @thomasjpfan @jjerphan @glemaitre

Micky774 and others added 11 commits May 16, 2023 12:35
…#22616)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…ikit-learn#25538)

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
…cikit-learn#24698)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@jjerphan
Copy link
Member
jjerphan commented May 16, 2023

Since all PRs target the branch of this PR have been reviewed, merged, and are referenced in the issue, I thinking making the CI pass is sufficient to accept this PR.

What do you think?

@Micky774 Micky774 marked this pull request as draft May 16, 2023 20:59
@Micky774
Copy link
Contributor Author

I am going to mark it as a draft for now to resolve CI issues (it seems some changes weren't carried over due to git shenanigans) but will reopen once those are addressed.

@Micky774
Copy link
Contributor Author
Micky774 commented May 21, 2023

Should be good for review now @jjerphan @thomasjpfan!

Since all PRs target the branch of this PR have been reviewed, merged, and are referenced in the issue, I thinking making the CI pass is sufficient to accept this PR.

There are some not-yet-reviewed changes. For your convenience, here are the novel changes (all of which are included in these five commits)

These changes are comprised of:

  1. Deletion of submodule-level setup.py files corresponding to old build strategy
  2. Update to setup.py for building _hdbscan submodule
  3. Sphinx doc appeasement in clustering.rst, plot_hdbscan.py, and hdbscan.py
  4. Updated sklearn/cluster/_hierarchical_fast.pyx::UnionFind to avoid redeclaration of attributes and use new typings
  5. Updated shape semantics in Cython files (len for lists/sets, .shape[i] for memoryviews, PyArray_SHAPE for ndarray objects)

@Micky774 Micky774 marked this pull request as ready for review May 21, 2023 00:41
Copy link
Member
@jjerphan jjerphan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thank you for bringing HDBSCAN in scikit-learn, @Micky774!

Here are just minor comments.

@jjerphan
Copy link
Member

I think it is reasonable to wait for a third reviewer to approve this PR before merging it.

@Micky774
Copy link
Contributor Author

cc: @scikit-learn/core-devs in case anyone would like to donate a third review :)

Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copyright here: https://github.com/scikit-learn-contrib/hdbscan/blob/master/LICENSE isn't really BSD, and doesn't allow to just use it, to me it seems we need a copy of that file here? Unless that copyright is changed or we somehow get a written permission from all contributors to that package.

I left a few comments just to make a few things more readable, otherwise I'm happy to merge.

@Micky774
Copy link
Contributor Author

@adrinjalali Should be ready for a second-pass :)

Copy link
Member
@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRETTY KEWL!!!

@adrinjalali adrinjalali merged commit 415b11a into scikit-learn:main May 31, 2023
@jjerphan
Copy link
Member

Thank you for working on integrating HDBSCAN in scikit-learn, @Micky774! 🎉

@Micky774 Micky774 deleted the hdbscan_sync branch May 31, 2023 14:05
@Micky774
Copy link
Contributor Author

Thank you to everyone for all your help with this! This was a massive collaboration and I greatly appreciate your persistence and patience with review and API discussions <3

@lmcinnes Wanted to ping you as well! Thanks again to you and your team for the fantastic original implementation.

The first major hurdle is done, now to slowly reintroduce some functionality (e.g. boruvka), optimize, and extend support (e.g. float32 data).

@Micky774 Micky774 mentioned this pull request May 31, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
…arn#26385)

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0