[MRG] Fix KMeans convergence when tol==0 #17959
Conversation
LGTM, just a minor remark:
Thank you for the PR @jeremiedbb!

I changed to always check convergence based on the labels first and then on tol. I tested the impact on performance in the worst case, i.e. many samples, few centers, few features, which makes the labels comparison as costly as possible compared to the other computations. Even in that case the drop in performance is rather small, see below. I think it's acceptable (recalling that this is the worst situation; in many situations the drop is undetectable).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np

X = np.random.RandomState(0).random_sample((1000000, 3))
km = KMeans(n_clusters=3, tol=1e-16, n_init=1, init='random', random_state=0)

# master
%timeit km.fit(X)
2.19 s ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
km.n_iter_
71

# this pr
%timeit km.fit(X)
2.26 s ± 25.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
km.n_iter_
71
```
LGTM again, I agree with your perf analysis and I think it's ok to always do the check. Some nitpicks:

Another suggestion: the coverage report highlights that the new convergence messages are not covered. It might be interesting to extend the tests using the capsys fixture to make some assertions on those messages, in particular with https://docs.pytest.org/en/stable/capture.html#accessing-captured-output-from-a-test-function
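A minimal sketch of what such a capsys-based test could look like; the test name and the exact wording of the convergence message are assumptions, not taken from this PR's diff:

```python
import numpy as np
from sklearn.cluster import KMeans

def test_kmeans_verbose_convergence_message(capsys):
    # Fit with verbose output and tol=0 so that the label-based (strict)
    # convergence path is exercised.
    X = np.random.RandomState(0).random_sample((100, 3))
    KMeans(n_clusters=3, tol=0, n_init=1, verbose=1, random_state=0).fit(X)
    captured = capsys.readouterr()
    # Hypothetical assertion: the exact message text would need to match
    # whatever the PR actually prints on strict convergence.
    assert "strict convergence" in captured.out
```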
```python
assert km.n_iter_ < 300
assert km_full.n_iter_ == km_elkan.n_iter_ < max_iter
```
Will this assertion be robust to rounding errors that can affect Lloyd and Elkan slightly differently?
That is a badly broken line anyway that is likely equivalent to `assert int(km_full.n_iter_ == km_elkan.n_iter_) < max_iter`, isn't it? And because `int(boolean)` is either 0 or 1, this will likely always be True (unless you use `max_iter=1` in some test).

As for numerical issues: in theory, because of rounding issues, Lloyd and Elkan may in a rare case produce different results. But I would rather test that they do the same number of iterations and then study the situation if they ever disagree, if ever. That would be an interesting case to see.
It's not broken. In Python you can chain comparison operators: `a == b < c` is equivalent to `(a == b) and (b < c)`.
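An illustrative snippet (values made up) showing how the chained reading differs from the `int(...)` reading:

```python
>>> a, b, c = 71, 71, 300
>>> a == b < c       # chained: (a == b) and (b < c)
True
>>> int(a == b) < c  # the "broken" reading: 1 < 300
True
>>> a, b = 71, 72
>>> a == b < c       # chaining correctly fails when the counts differ
False
>>> int(a == b) < c  # whereas this reading would still pass: 0 < 300
True
```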
It's not robust to rounding errors indeed. While I agree that it's interesting to check when such a situation happens, it's painful to have that in the test suite, because it's hard to debug when it suddenly fails on the CI because of some hardware change or some other reason. It happened with another test and we just ended up xfailing it...
Also, there's already a test which checks that Elkan and Lloyd have the same `n_iter_` in a situation where rounding errors can't interfere (`test_kmeans_elkan_results` iirc).
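For reference, a sketch of what such a check can look like; this is not the actual `test_kmeans_elkan_results`, and the data-generation details are assumptions:

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def test_lloyd_elkan_agree_sketch():
    # Well-separated blobs, so rounding errors cannot flip any assignment.
    X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.1, random_state=0)
    km_full = KMeans(algorithm="full", n_clusters=5, n_init=1, random_state=0).fit(X)
    km_elkan = KMeans(algorithm="elkan", n_clusters=5, n_init=1, random_state=0).fit(X)
    assert_array_equal(km_full.labels_, km_elkan.labels_)
    assert_allclose(km_full.cluster_centers_, km_elkan.cluster_centers_)
    assert km_full.n_iter_ == km_elkan.n_iter_
```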
LGTM!
Please add an entry to the change log at `doc/whats_new/v0.24.rst` with tag |Fix|. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.
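For illustration, an entry along these lines; the wording is a placeholder and not the entry that was actually merged:

```rst
- |Fix| Fixed a bug in :class:`cluster.KMeans` where, with ``tol=0``,
  rounding errors could prevent convergence from being detected before
  ``max_iter``. Convergence is now also checked via a strict comparison
  of the labels between two consecutive iterations.
  :pr:`17959` by :user:`jeremiedbb`.
```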
Should this be in 0.23.2?
Seems easy to backport. Let's do it.
Fixes #17428
When `tol == 0`, convergence based on the norm of the difference of the centers between 2 iterations can fail due to rounding errors that keep the norm from being exactly 0, letting the iterations run until `max_iter` is reached. This PR implements convergence based on the difference of the labels between 2 iterations when `tol == 0`.
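A minimal sketch of the idea, assuming a plain Lloyd loop; this is illustrative and not scikit-learn's actual implementation:

```python
import numpy as np

def lloyd_kmeans_sketch(X, centers, max_iter=300, tol=0.0):
    """Lloyd iterations with the label-based strict convergence check."""
    labels = None
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # E-step: assign each sample to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        # Strict convergence: identical labels on two consecutive iterations.
        # This comparison is exact and immune to floating-point rounding.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # M-step: move each center to the mean of its assigned samples
        # (assumes no cluster becomes empty, for brevity).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        center_shift = ((new_centers - centers) ** 2).sum()
        centers = new_centers
        # tol-based convergence: with tol == 0, rounding errors can keep
        # center_shift marginally above 0 even at a fixed point, which is
        # exactly the failure mode this PR addresses.
        if center_shift <= tol:
            break
    return labels, centers, n_iter
```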
It does not impact performance. Here's a timing comparison between setting `tol=0` and `tol=1e-16` (both converge after 60 iterations to the same clustering):