add HDBSCAN · Issue #14331 · scikit-learn/scikit-learn · GitHub

add HDBSCAN #14331

Closed
amueller opened this issue Jul 13, 2019 · 11 comments · Fixed by #22616

Comments

@amueller
Member

I think we should add HDBSCAN. The original paper is from 2013 and has 300 citations; @lmcinnes's accelerated version is from 2017, and the 2017 JOSS paper about the implementation has 100.
I think that should fulfill our requirements, and it's commonly asked for.

@lmcinnes said he might not have time to move it, so maybe someone else can pick it up.

For reference:
https://github.com/scikit-learn-contrib/hdbscan

@lmcinnes
Contributor

I will be happy to provide assistance with moving it over -- there are some changes that will be required, mostly related to differences in how the internals of scikit-learn's kd-trees are accessed via Cython. I will also be happy to help with reviewing.

@jnothman
Member
jnothman commented Jul 14, 2019 via email

@amueller
Member Author

I'm really not that familiar with OPTICS.
Looks like it might make the OPTICS implementation obsolete?
https://datascience.stackexchange.com/a/11630

@amueller
Member Author

Btw I like the demo dataset for hdbscan, maybe it could replace some of the other ones we have in the comparison? https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

@amueller
Member Author

@jnothman I know you're catching up with a lot, but maybe this is worth looking into, given that there are still issues with OPTICS and this is actually a pretty well-tested implementation?

@rth
Member
rth commented Jul 22, 2019

Looks like it might make the OPTICS implementation obsolete?
[..]
given that there are still issues with OPTICS

OPTICS was announced as a "major feature" in v0.21, so I guess it's now here to stay in any case? The 1999 paper also has 3.5k citations. Unless there are significant issues with it? I haven't found that many on the issue tracker, but I haven't followed the development either.

I think it would be good to include HDBSCAN; I'm just saying that, purely following the inclusion criteria (independently of any technical merits of the algorithms), it made sense to include OPTICS first. What impact that may have on a future HDBSCAN inclusion, I'm not sure.

@amueller
Member Author

@rth not sure I follow your logic. Are you talking about the class or the implementation or both?
I am not very familiar with either algorithm, but it looks to me as if an implementation of HDBSCAN would also implement OPTICS, and having a redundant implementation of OPTICS seems unnecessary?

@rth
Member
rth commented Jul 22, 2019

I meant the OPTICS algorithm, not so much the implementation. I was not aware that OPTICS results could be obtained with HDBSCAN exactly. As long as we don't break backward compatibility of OPTICS I don't really have an opinion, and will let people who have worked on this decide.

@adrinjalali
Member

I haven't read the HDBSCAN paper in detail, but as I understand it, HDBSCAN is not strictly a superset of OPTICS, though the community seems to have accepted that it is the better of the two.

I don't think it'd be too hard to refactor the code so that both algorithms can use the core part.

@lmcinnes
Contributor

HDBSCAN and OPTICS share the same computational core (though HDBSCAN is a little more general); the post-processing can be a little different. I do think you want to look to re-use/integrate the core code if possible to improve stability, debugging, and maintenance.
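
To illustrate what a shared core could look like, here is a minimal pure-Python sketch of the two distance notions, assuming the usual textbook definitions: a core distance (distance to the k-th nearest neighbour), the asymmetric OPTICS reachability distance, and the symmetric HDBSCAN mutual reachability distance. All function names are hypothetical, the point itself is excluded when counting neighbours (conventions differ between implementations), and the OPTICS maximum radius is treated as infinite. This is not the code from scikit-learn or the hdbscan package.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour.

    Assumption: the point itself is not counted as a neighbour.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def optics_reachability(points, cores, o, p):
    """OPTICS-style reachability of p from o: max(core(o), d(o, p)).

    Asymmetric: only the core distance of the source point o is used.
    """
    return max(cores[o], math.dist(points[o], points[p]))


def mutual_reachability(points, cores, a, b):
    """HDBSCAN-style mutual reachability: max(core(a), core(b), d(a, b)).

    Symmetric in a and b.
    """
    return max(cores[a], cores[b], math.dist(points[a], points[b]))


# Three points on a line plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (10.0, 0.0)]
cores = core_distances(points, k=2)
```

Both quantities are built from the same core distances and pairwise distances, which is the part that could plausibly be shared; the two algorithms then differ in how they post-process the result (an ordering for OPTICS, a cluster hierarchy for HDBSCAN).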

@Micky774
Contributor

Finally closed in #26385 😄
