KMeans significantly slower on 0.23 #17208
Comments
Thanks for the report. Can you tell how many cores you have?
Thank you for the fast reaction. I have 4 cores.
I can reproduce the slowdown on my laptop and the changes I made in #17210 solve the issue for me. We are going to merge it and it will be available in the nightly builds. Here are the instructions to install them: https://scikit-learn.org/stable/developers/advanced_installation.html#installing-nightly-builds. We'd be very interested if you could check whether it fixes your issue (you'd have to wait a day after we merge it).
I checked out your PR and installed the package. Now the times are:
I did cleaner measurements and indeed the proposed fix is still ~3x slower. Profiling showed that a new helper, which is called at each iteration, takes 90% of the execution time. I'll make a PR to move it outside of the iteration loop. Out of curiosity, is performance critical for you for problems that take ~10 ms to run?
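A sketch of the kind of change described here; the helper and loop bodies below are illustrative stand-ins, not the actual scikit-learn internals:

```python
import numpy as np
from sklearn.utils.validation import check_array


def expensive_helper(X):
    # Stand-in for the per-iteration helper that dominated the profile;
    # the real helper in scikit-learn is different.
    return check_array(X, dtype=np.float64)


def fit_before(X, n_iter=300):
    for _ in range(n_iter):
        X_checked = expensive_helper(X)  # repeated work inside the loop
        ...  # one k-means iteration on X_checked


def fit_after(X, n_iter=300):
    X_checked = expensive_helper(X)  # hoisted: done once, before the loop
    for _ in range(n_iter):
        ...  # one k-means iteration on X_checked
```

Hoisting the helper out of the loop removes a per-iteration cost that is negligible for large datasets but can dominate fits that only take a few milliseconds.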
@jeremiedbb thank you. That would be great. We use scikit-learn as a dependency of Orange (Orange is a graphical tool for data analysis), and on the other problem (data with shape (300, 25)) clustering that took a few hundred milliseconds before (0.1-0.5 s) now takes a few seconds. Most of Orange's users work with smaller datasets, and for them it is quite a difference. It is not the most critical issue, but if possible we would prefer that things be computed as fast as before.
I opened #17235. It's better but still not as fast as scikit-learn 0.22.
EDIT: the conclusions below are wrong, please ignore.
The reason is the overhead of the deprecation of positional args in 0.23, as we can see in this profiling. Using public functions that are now wrapped by the positional-args deprecation wrapper in tight loops may have a non-negligible overhead. The solution would be to keep an unwrapped private function and call it in the tight loop, along the lines of the sketch below.
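A minimal sketch of that idea, assuming a simplified stand-in decorator and helper (the names and implementations are illustrative, not the actual scikit-learn code):

```python
import functools
import warnings


def _deprecate_positional_args(f):
    # Simplified stand-in for the 0.23 decorator that warns when soon-to-be
    # keyword-only arguments are passed positionally; the real implementation
    # inspects the signature and is more involved.
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if len(args) > 2:
            warnings.warn("Pass keyword-only arguments by keyword.", FutureWarning)
        return f(*args, **kwargs)
    return wrapper


def _demo_metric(x, y, squared=True):
    # Private implementation: not wrapped, so it stays cheap in tight loops.
    d2 = (x - y) ** 2
    return d2 if squared else d2 ** 0.5


@_deprecate_positional_args
def demo_metric(x, y, *, squared=True):
    # Public entry point: pays the deprecation-check cost once per user call.
    return _demo_metric(x, y, squared=squared)
```

Internal callers (for example a per-iteration loop) would call the private `_demo_metric` directly, so the wrapper overhead is only paid at the public API boundary.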
Another alternative is to disable the wrapping with a context manager. For instance:

```python
def _deprecate_positional_args(f):
    if not get_config()["wrap_positional"]:
        return f
    ...
```

and at the call site:

```python
with config_context(wrap_positional=False):
    public_method(...)
```

Would this improve the overhead? WDYT @thomasjpfan?
This could work. I have another idea I want to try out ;) @jeremiedbb can you provide the code snippet you used to profile?
I take back what I said; I misinterpreted the profile. The overhead of the decorator is actually the very small bars right underneath. As a confirmation, I tried Adrin's suggestion and it did not improve the timings. @thomasjpfan, here is what I used to profile in this specific case:
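As an illustration of how such a run can be profiled (the dataset shape and parameters below are assumptions, not the exact snippet from this thread), a cProfile-based sketch:

```python
import cProfile
import pstats

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 10)  # assumed small dataset, similar in spirit to the report

profiler = cProfile.Profile()
profiler.enable()
KMeans(n_clusters=3, random_state=0).fit(X)
profiler.disable()

# Show the 15 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```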
That's exactly what I said in the previous comment :D
I think I managed to fix the issue now in #17235. @PrimozGodec would you mind checking it out to see if you recover the performance of 0.22?
@jeremiedbb thank you for your help. I tested the PR and it now works normally.
Thanks @PrimozGodec! Closing.
Describe the bug
With the latest changes, KMeans is significantly slower on small datasets. The time needed to compute clusters is around ten times longer.
Steps/Code to Reproduce
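A minimal timing sketch in the spirit of the report; the dataset shape, number of clusters, and timing harness here are assumptions rather than the reporter's exact code:

```python
from time import perf_counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 10)  # assumed small dataset; the report also mentions (300, 25)

start = perf_counter()
KMeans(n_clusters=3, random_state=0).fit(X)
print(f"fit time: {perf_counter() - start:.3f} s")
```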
The measured times are roughly:
scikit-learn 0.22: ~0.015 s
scikit-learn 0.23: ~0.15 s
I also tried a bigger dataset, with shape (300, 25), where clustering with the new version needed 3-4 s while it previously finished in milliseconds.
Expected Results
Clusters should be computed as fast as they were in 0.22.
Versions