[MRG] Fix KMeans convergence when tol==0 #17959
Conversation
LGTM, just a minor remark:
Thank you for the PR @jeremiedbb!

I changed to always check convergence based on the labels first and then on tol. I tested the impact on performance in the worst case, i.e. many samples, few centers, few features, which makes the labels comparison as costly as possible compared to the other computations. Even in that case the drop in performance is rather small, see below. I think it's acceptable (recalling that this is the worst situation; in many situations the drop is undetectable).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

X = np.random.RandomState(0).random_sample((1000000, 3))
km = KMeans(n_clusters=3, tol=1e-16, n_init=1, init='random', random_state=0)

# master
%timeit km.fit(X)
2.19 s ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
km.n_iter_
71

# this pr
%timeit km.fit(X)
2.26 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
km.n_iter_
71
```
LGTM again, I agree with your perf analysis and I think it's ok to always do the check. Some nitpicks:

Another suggestion: the coverage report highlights that the new convergence messages are not covered. It might be interesting to extend the tests using the capsys fixture to make some assertions on those messages, in particular with https://docs.pytest.org/en/stable/capture.html#accessing-captured-output-from-a-test-function
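A minimal sketch of what such a capsys-based test could look like; the test name and the exact wording of the convergence message are assumptions, not taken from this PR's diff:

```python
import numpy as np
from sklearn.cluster import KMeans

def test_kmeans_verbose_convergence_message(capsys):
    # Fit with verbose output and tol=0 so that the label-based (strict)
    # convergence path is exercised.
    X = np.random.RandomState(0).random_sample((100, 3))
    KMeans(n_clusters=3, tol=0, n_init=1, verbose=1, random_state=0).fit(X)
    captured = capsys.readouterr()
    # Hypothetical assertion: the exact message text would need to match
    # whatever the PR actually prints on strict convergence.
    assert "strict convergence" in captured.out
```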
```python
assert km.n_iter_ < 300
assert km_full.n_iter_ == km_elkan.n_iter_ < max_iter
```
Will this assertion be robust to rounding errors that can affect Lloyd and Elkan slightly differently?
That is a badly broken line anyway that is likely equivalent to `assert int(km_full.n_iter_ == km_elkan.n_iter_) < max_iter`, isn't it? And because `int(boolean)` is either 0 or 1, this will likely always be True (unless you use `max_iter=1` in some test).

As for numerical issues: in theory, because of rounding issues, Lloyd and Elkan may in a rare case produce different results. But I would rather test that they do the same number of iterations and then study the situation if they ever disagree, if ever. That would be an interesting case to see.
It's not broken. In Python you can chain comparison operators: `a == b < c` is equivalent to `(a == b) and (b < c)`.
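An illustrative snippet (values made up) showing how the chained reading differs from the `int(...)` reading:

```python
>>> a, b, c = 71, 71, 300
>>> a == b < c       # chained: (a == b) and (b < c)
True
>>> int(a == b) < c  # the "broken" reading: 1 < 300
True
>>> a, b = 71, 72
>>> a == b < c       # chaining correctly fails when the counts differ
False
>>> int(a == b) < c  # whereas this reading would still pass: 0 < 300
True
```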
It's not robust to rounding errors indeed. While I agree that it's interesting to check when such a situation happens, it's painful to have that in the test suite, because it's hard to debug when it suddenly fails on the CI because of some hardware change or some other reason. It happened with another test and we just ended up xfailing it...
Also, there's already a test which checks that Elkan and Lloyd have the same `n_iter_` in a situation where rounding errors can't interfere (`test_kmeans_elkan_results` iirc).
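For reference, a sketch of what such a check can look like; this is not the actual `test_kmeans_elkan_results`, and the data-generation details are assumptions:

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def test_lloyd_elkan_agree_sketch():
    # Well-separated blobs, so rounding errors cannot flip any assignment.
    X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.1, random_state=0)
    km_full = KMeans(algorithm="full", n_clusters=5, n_init=1, random_state=0).fit(X)
    km_elkan = KMeans(algorithm="elkan", n_clusters=5, n_init=1, random_state=0).fit(X)
    assert_array_equal(km_full.labels_, km_elkan.labels_)
    assert_allclose(km_full.cluster_centers_, km_elkan.cluster_centers_)
    assert km_full.n_iter_ == km_elkan.n_iter_
```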
LGTM!
Please add an entry to the change log at `doc/whats_new/v0.24.rst` with tag |Fix|. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.
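For illustration, an entry along these lines; the wording is a placeholder and not the entry that was actually merged:

```rst
- |Fix| Fixed a bug in :class:`cluster.KMeans` where, with ``tol=0``,
  rounding errors could prevent convergence from being detected before
  ``max_iter``. Convergence is now also checked via a strict comparison
  of the labels between two consecutive iterations.
  :pr:`17959` by :user:`jeremiedbb`.
```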
Should this be in 0.23.2?
Seems easy to backport. Let's do it.
Fixes #17428
When `tol == 0`, convergence based on the norm of the difference of the centers between 2 iterations can fail due to rounding errors that keep the norm from being exactly 0, letting the iterations run until `max_iter` is reached. This PR implements convergence based on the difference of the labels between 2 iterations when `tol == 0`.
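A minimal sketch of the idea, assuming a plain Lloyd loop; this is illustrative and not scikit-learn's actual implementation:

```python
import numpy as np

def lloyd_kmeans_sketch(X, centers, max_iter=300, tol=0.0):
    """Lloyd iterations with the label-based strict convergence check."""
    labels = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # E-step: assign each sample to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        # Strict convergence: identical labels on two consecutive iterations.
        # This comparison is exact and immune to floating-point rounding.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # M-step: move each center to the mean of its assigned samples
        # (assumes no cluster becomes empty, for brevity).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        center_shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        # tol-based convergence: with tol == 0, rounding errors can keep
        # center_shift marginally above 0 even at a fixed point, which is
        # exactly the failure mode this PR addresses.
        if center_shift <= tol:
            break
    return labels, centers, n_iter
```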
It does not impact performance. Here's a timing comparison between setting `tol=0` and `tol=1e-16` (both converge after 60 iterations to the same clustering):