locally failing kmeans convergence test (WSL) #17428
cc @jeremiedbb
Inertia basically stays the same after iteration 50, but there seem to be some floating point issues, maybe?
Showing the center shift, we can see:
I see two solutions: remove the test because it's unstable, or add an extra condition into the convergence check for tol=0.
It does not come from relabelling indefinitely. I already added a comment in the docstring that using tol=0 is not advised.
Not sure if you saw my comment just before yours: what I mean is that either we should make sure the check that's actually performed when using tol=0 is more stable than a floating point comparison, or we shouldn't use that setting in the tests.
Initially this test was added in a PR whose goal was to make sure that when tol=0, the iteration loop doesn't run until max_iter (due to a strict inequality check before). Since we can't guarantee that because of floating point errors, I think we should just remove the test.
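For illustration, a minimal sketch (not the actual scikit-learn source) of why a strict inequality never fires when tol=0 and the centers stop moving exactly:

```python
# Minimal sketch: with exact convergence the total center shift is 0.0,
# and a strict comparison against tol=0 never triggers, so the loop
# runs until max_iter.
center_shift_tot = 0.0
tol = 0.0

print(center_shift_tot < tol)   # False: strict check misses exact convergence
print(center_shift_tot <= tol)  # True: non-strict check detects it
```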
I'm fine with that, but then maybe we should also raise a warning if someone provides tol=0.
I don't have a preference; I'm fine with both. I'd also be fine with silently switching to tol=eps when the user provides tol=0 :)
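A minimal sketch of that option, assuming a hypothetical helper (`_effective_tol` is not part of the scikit-learn API):

```python
import numpy as np

def _effective_tol(tol):
    # Hypothetical helper: silently map tol=0 to machine epsilon so the
    # usual center-shift criterion still terminates despite float noise.
    return np.finfo(np.float64).eps if tol == 0 else tol

print(_effective_tol(0))     # ~2.22e-16
print(_effective_tol(1e-4))  # unchanged
```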
The standard approach is simply to set a flag when any point is reassigned while computing the new assignment, and to declare convergence only when no point is relabeled.
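A minimal sketch of that stopping rule in a plain Lloyd loop (illustrative only, not scikit-learn's implementation):

```python
import numpy as np

def lloyd_until_stable(X, centers, max_iter=300):
    """Lloyd's algorithm with the textbook stopping rule: iterate until
    no point changes its label (illustrative sketch only)."""
    labels = np.full(X.shape[0], -1)
    for n_iter in range(1, max_iter + 1):
        # Assignment step: nearest center for every point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            return centers, labels, n_iter  # stable: nothing was relabeled
        labels = new_labels
        # Update step: each center moves to the mean of its points.
        for k in range(centers.shape[0]):
            mask = labels == k
            if mask.any():
                centers[k] = X[mask].mean(axis=0)
    return centers, labels, max_iter
```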
@kno10 I agree that having the textbook definition seems useful, though it is mostly academic. With a dataset of any common size, waiting for full convergence is probably very slow and not what you want. The 'finishing' also just means 'found a local optimum with a particular set of local moves'. But indeed the stable state is a nice property. I'm not sure what you mean by the "If the user wants a floating-point tolerance" statement. I think the question is whether it's acceptable to rewrite tol=0 as tol=eps and then use the current criterion, which is equivalent to no more change for all practical purposes, or whether we should actually track cluster re-assignment, which we could do for the case of tol=0.
It is not that slow to do full convergence. First of all, few users actually have data big enough for k-means to make sense in the first place. Big data is a big lie most of the time; most users are still in the megabyte range. And even if someone has gigabytes of data, that is usually before preparation and "vectorization". Once you get the data into shape for clustering, it tends to shrink (or you are probably doing something wrong anyway). So for most users, the number of iterations is of little concern, and the others can set a tolerance.

For good algorithms (Elkan and beyond), the first iteration is expensive, but the later iterations get cheaper and cheaper because fewer points are recomputed. I've had benchmarks where the last 100 iterations took as long as the first iteration, because only single points were recomputed, much fewer than N/100. Try it on the MNIST data and plot runtime vs. number of iterations (see the sketch below).

Note that a boolean "something changed" flag can even be cheaper than the current tolerance computation, and it only adds O(1) memory. Since Lloyd's algorithm, for example, does not need the center shift at all, you "waste" O(k*d) operations computing it.

With floating point, I would be more concerned that in some rare corner cases an oscillation could arise due to bad rounding, so that you never reach a stable state. But I have no proof that this can happen, and it would likely happen independently of whether you use a tolerance or fixed-point convergence.
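A rough way to try the runtime-vs-iterations experiment (a sketch: it assumes network access for fetch_openml and current KMeans parameter names, and the exact timings will vary by machine):

```python
import time
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_openml

# Fetch MNIST and subsample so the experiment stays quick.
X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X[:10000]

for max_iter in (10, 50, 100, 300):
    km = KMeans(n_clusters=10, algorithm="elkan", max_iter=max_iter,
                tol=0, n_init=1, random_state=0)
    t0 = time.perf_counter()
    km.fit(X)
    print(f"max_iter={max_iter:3d}  n_iter_={km.n_iter_:3d}  "
          f"fit time={time.perf_counter() - t0:.2f}s")
```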
#16083 |
@kno10 I opened the issue because I had an oscillation in the test that you added. Oscillation in the change, not in the labels, of course. It's very hard to assess what the 'average use case' is for a library like sklearn, and I can see an argument being made in either direction. I don't think we'll change the default behavior for now, but I'd be happy to see the fixes from #16083 be incorporated as far as they still apply. It's unfortunate that your work there conflicted with a major rewrite.
Btw, do you have thoughts on algorithm selection, and on other algorithms we should implement? The last time I checked, the trade-offs were not super clear for all the algorithms. Elkan seems mostly problematic because of its memory requirement, and Yinyang is better there, but there have also been some other modifications since then, IIRC.
Oscillation in the change raises largely the same question: why and how does this happen? How can the change oscillate but not the labels? When the centers are computed using the arithmetic mean, the only way they could oscillate is if the labels changed. Or is it because the center shift computation itself is numerically unstable (cf. #9354), so that identical centers do not yield a distance of 0? Can you replace the distance computations there to not use the dot hack? (A sketch of the instability follows below.) Yinyang k-means is a bit tricky for users because it adds additional tuning parameters: you then have to run k-means on the initial cluster centers, for example. In experiments here, the performance would usually be between Elkan and Hamerly (unsurprisingly), so the method is a tradeoff; once you tune parameters, you'll likely want to use one of the original methods.
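To make the dot-hack concern concrete, a small demonstration of why the expanded form can misbehave in float32 (an illustrative sketch; the exact values depend on the platform):

```python
import numpy as np

# ||a - b||^2 expanded as ||a||^2 - 2 a.b + ||b||^2 suffers catastrophic
# cancellation: the three terms are ~1e8 while the true result is ~0.01,
# so float32 rounding noise dominates (and can even come out negative).
a = np.array([10000.0], dtype=np.float32)
b = np.array([10000.1], dtype=np.float32)

direct = np.sum((a - b) ** 2)           # ~0.01, as expected
expanded = a @ a - 2 * (a @ b) + b @ b  # rounding garbage on the order of ulp(1e8)
print(direct, expanded)
```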
When building the 0.23.1 packages for Debian, I also encountered these test failures (both 'full' and 'elkan' failed). Reading the exchange above, the issue seems to have already been identified, but I thought it couldn't hurt to add another data point. We've locally disabled these tests for now.
Since we now use OpenMP (thread-based) parallelism by default, I would not be surprised if we observe some non-deterministic rounding errors.
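The underlying mechanism is that floating point addition is not associative, so summing per-thread partial results in a different order changes the rounding. A deterministic sketch of the effect (np.split stands in for an OpenMP work split):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

sequential = np.sum(x)                             # one summation order
chunked = sum(np.sum(c) for c in np.split(x, 8))   # eight "per-thread" partial sums
print(sequential, chunked, sequential == chunked)  # totals typically differ slightly
```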
Fails for me for both algorithms:
show_versions: