ENH Avoid repeated input checks in kmeans++ #19002

jeremiedbb · 2020-12-14T15:37:35Z

kmeans++ calls euclidean_distances many times, and each call repeats the input check. This is unnecessary since the input has already been checked.

Partially takes over #7383.
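
To illustrate the idea, here is a minimal NumPy sketch of the pattern: validate the data once, then call an unchecked distance helper inside the seeding loop. It is not the scikit-learn implementation. The names `validate_once`, `sq_dists_unchecked` and `kmeanspp_sketch` are hypothetical, and the seeding is simplified (one candidate per step).

```python
import numpy as np


def validate_once(X):
    """Hypothetical stand-in for a full input check (shape, finiteness)."""
    X = np.asarray(X, dtype=np.float64)
    if X.ndim != 2:
        raise ValueError("expected a 2D array")
    if not np.isfinite(X).all():
        raise ValueError("input contains NaN or infinity")
    return X


def sq_dists_unchecked(X, Y, X_norm_sq):
    """Squared Euclidean distances; assumes the caller already validated X and Y."""
    Y_norm_sq = np.einsum("ij,ij->i", Y, Y)
    d = X_norm_sq[:, None] - 2.0 * (X @ Y.T) + Y_norm_sq[None, :]
    np.maximum(d, 0.0, out=d)  # clip tiny negatives caused by rounding
    return d


def kmeanspp_sketch(X, n_clusters, seed=0):
    """Simplified k-means++ seeding: one validation, then unchecked calls."""
    rng = np.random.default_rng(seed)
    X = validate_once(X)                     # the only input check
    X_norm_sq = np.einsum("ij,ij->i", X, X)  # row norms computed once, reused
    centers = np.empty((n_clusters, X.shape[1]))
    centers[0] = X[rng.integers(X.shape[0])]
    closest = sq_dists_unchecked(X, centers[:1], X_norm_sq)[:, 0]
    for k in range(1, n_clusters):
        total = closest.sum()
        # Sample proportionally to the squared distance to the nearest center;
        # fall back to uniform sampling if all distances are zero.
        probs = closest / total if total > 0 else None
        idx = rng.choice(X.shape[0], p=probs)
        centers[k] = X[idx]
        new = sq_dists_unchecked(X, centers[k:k + 1], X_norm_sq)[:, 0]
        np.minimum(closest, new, out=closest)
    return centers
```

In this sketch the cost of `validate_once` is paid once per call to `kmeanspp_sketch` rather than once per selected center.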

The time wasted in the input check can be huge on machines with many cores. For instance, I tested on a machine with 44 cores. Here is the time spent in KMeans.fit (in seconds):

| | master | PR |
|---|---|---|
| total time | 252 | 165 |
| in kmeans++ | 166 | 79 |
| in lloyd | 83 | 84 |

cc @ogrisel

jeremiedbb · 2020-12-14T15:47:04Z

A future improvement could be to merge the computation of the Euclidean distances with the computation of the min into a single function, parallelized over data chunks in Cython (a bit like what is done for the kmeans iteration loop). In fact, some part of the code could maybe be shared.
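
As a rough illustration of that idea, the following serial NumPy sketch fuses the distance computation and the running min, processing the data chunk by chunk. It is only an assumption of what such a function could look like, not the proposed Cython code. The name `update_min_sq_dists` and the `chunk_size` parameter are hypothetical.

```python
import numpy as np


def update_min_sq_dists(X, center, closest, chunk_size=256):
    """In place: closest[i] = min(closest[i], ||X[i] - center||^2), per chunk."""
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        diff = X[start:stop] - center
        d = np.einsum("ij,ij->i", diff, diff)  # squared distances for this chunk
        # Fuse the min update with the distance computation for the chunk,
        # so the full distance vector is never materialized.
        np.minimum(closest[start:stop], d, out=closest[start:stop])
    return closest
```

Each chunk only reads and writes its own slice of `closest`, so the chunks are independent and could in principle be dispatched to separate threads.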

ogrisel

This looks great. This is a nice and simple perf improvement.

sklearn/metrics/pairwise.py

ogrisel · 2020-12-16T09:32:59Z

Please document it in what's new (for 1.0, I suppose, as the previous perf was not catastrophic enough to be considered a bug).

Although, since 0.24.0 is not yet out we could also consider it for inclusion in 0.24.0, as you wish.

ogrisel · 2020-12-16T09:34:12Z

For reference, I made a summary of our current understanding of the perf issues with k-means++ init in #10924 (comment).

thomasjpfan

The simplification in _euclidean_distances is quite nice.

Minor comments, otherwise LGTM

sklearn/metrics/pairwise.py

doc/whats_new/v1.0.rst

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

jeremiedbb added 2 commits December 14, 2020 14:20

check_input in euclidean_distances

7b7a6b9

fix

45c4ad5

github-actions bot added module:cluster module:metrics labels Dec 14, 2020

jeremiedbb added 4 commits December 14, 2020 17:29

fix shape

b325f5b

from check_input=False to dedicated private func

2edb0f5

better coverage

6a99592

lint

eb6dec4

ogrisel mentioned this pull request Dec 16, 2020

Simplify Elkan k-means? #10924

Open

ogrisel approved these changes Dec 16, 2020

ogrisel added the Performance label Dec 16, 2020

jeremiedbb added 2 commits December 16, 2020 11:50

Merge branch 'master' into kmeans_pp_check_input

6b7f944

what's new

4fffdde

jeremiedbb mentioned this pull request Dec 16, 2020

[MRG] Added flag to disable l2-dist finite check #7383

Open

better error message

083cddf

jeremiedbb changed the title ~~[WIP] Avoid repeated input checks in kmeans++~~ [MRG] Avoid repeated input checks in kmeans++ Dec 16, 2020

ogrisel added the Waiting for Reviewer label Dec 16, 2020

ogrisel requested a review from thomasjpfan December 16, 2020 14:31

thomasjpfan approved these changes Dec 17, 2020

jeremiedbb and others added 2 commits December 19, 2020 13:43

Update sklearn/metrics/pairwise.py

3bb6322

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

Update doc/whats_new/v1.0.rst

ea237ed

Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>

thomasjpfan changed the title ~~[MRG] Avoid repeated input checks in kmeans++~~ ENH Avoid repeated input checks in kmeans++ Dec 20, 2020

thomasjpfan merged commit dc1ea27 into scikit-learn:master Dec 20, 2020

glemaitre mentioned this pull request Dec 22, 2020

Release 0.24.0 #19058

Merged

14 tasks

glemaitre mentioned this pull request Apr 22, 2021

Release 0.24.2 #19954

Merged

12 tasks
