ENH Avoid repeated input checks in kmeans++#19002
ENH Avoid repeated input checks in kmeans++#19002thomasjpfan merged 11 commits intoscikit-learn:masterfrom
Conversation
|
Future improvement could be to merge the computation of the euclidean distances with the computation of the min in a single function parallelized over data chunks in cython (a bit like it's done for the kmeans iteration loop. In fact maybe some part of the code could be shared). |
There was a problem hiding this comment.
This looks great. This is a nice and simple perf improvement.
|
Please document it in whats new (for 1.0 as suppose as the previous perf was not catastrophic enough to be considered a bug). Although, since 0.24.0 is not yet out we could also consider it for inclusion in 0.24.0, as you wish. |
|
For reference, I made a summary of our current understanding of the perf issues with k-means++ init in #10924 (comment). |
There was a problem hiding this comment.
The simplification in _euclidean_distances is quite nice.
Minor comments, otherwise LGTM
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
kmeans++ calls
euclidean_distancesmany times and each call repeat the check of the input which is not necessary since it has already been checked.partially take over #7383
The time wasted in the input check can be huge on machines with many cores. For instance I tested on a machine with 44 cores. Here's the time spent during KMeans.fit (in seconds):
cc @ogrisel