KMeans Elkan algorithm (the default) is generally slower than Full #21729
Comments
@jeremiedbb, continuing here the conversation we started in #20642:

Sure, I'm happy to submit a PR. However, I'm not sure about the process in this case: the parameter's default value remains `"auto"`. So should I "just do it" and label the change as a performance improvement? Or should this be considered an API change, since existing models will give different results? Should I first put a warning that the default will change in Scikit-Learn 1.3? Or perhaps we want to drop …? What do you think?
To me "auto" means let scikit-learn decide what's best, so if we find that something is better we should be able to change the behavior without the 2 releases deprecation cycle. Others might disagree :) In this case, since elkan and full give the same result, switching the behavior of "auto" is just a performance improvement. I think it's fine to directly switch the behavior of "auto" for the next major release (1.1.0). |
Thanks @jeremiedbb, I just pushed a tiny PR for this. Let me know what you think. Note that I did not add …
I think I would be in favor of:
Document all of this in the "Changed models" section of the 1.1 changelog. Since Elkan and Lloyd should converge to the same solution (up to numerical rounding errors), and since we do not expect any significant degradation in either memory usage or CPU time (quite the opposite), I don't think we need to go through a deprecation cycle.
As explained here, those benchmark results might be biased in favor of Lloyd, because the Elkan implementation can suffer from the fact that it iteratively calls into two distinct threadpools (OpenBLAS's native threadpool and OpenMP) when SciPy is installed from the wheels.
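[Editor's note: one way to probe that hypothesis is to pin the BLAS threadpool to a single thread and re-time Elkan. This is a rough sketch, not code from the thread; it assumes threadpoolctl is available (it is a scikit-learn dependency), and the dataset sizes are arbitrary.]

```python
from time import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from threadpoolctl import threadpool_limits

X, _ = make_blobs(n_samples=100_000, centers=50, n_features=128, random_state=0)

def bench(algorithm):
    t0 = time()
    KMeans(n_clusters=50, algorithm=algorithm, n_init=1, random_state=0).fit(X)
    return time() - t0

# Baseline: OpenBLAS and OpenMP each manage their own threads.
print("elkan, default threading:    ", bench("elkan"))

# Pin the BLAS threadpool to one thread so it cannot fight with OpenMP.
with threadpool_limits(limits=1, user_api="blas"):
    print("elkan, single-threaded BLAS: ", bench("elkan"))
```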
While it is true that k-means mostly makes sense on such data, it's not interesting to use k-means on this. Instead, maybe benchmark with some embedding vectors etc. without a clear structure. Make sure to set …

The reason is as follows: if you have well-separated blobs and initialize with, e.g., k-means++, then the algorithms can converge in just 1 or 2 iterations; if your cluster guesses are somewhat fine, all points will be assigned to the "correct" cluster, and then the means will be perfect in the next step. The improved algorithms such as Elkan, Hamerly, Yin-Yang, Exponion, Shallot, ... are beneficial when you need to do many more than 5 iterations, because they have to do more work in the first iteration, but much less once the result changes only gradually. If you choose a too-simple problem, they never get to benefit from their bounds.

So don't draw conclusions from `make_blobs` data alone. Even with …
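[Editor's note: a minimal illustration of the convergence point above; my own example, not code from the thread. On well-separated blobs, `n_iter_` stays tiny, so the extra bookkeeping done by bound-based algorithms never pays off.]

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Well-separated blobs: k-means++ already lands close to the true centers,
# so Lloyd/Elkan converge after only a handful of iterations.
X, _ = make_blobs(n_samples=50_000, centers=20, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=20, init="k-means++", n_init=1, tol=0, random_state=0).fit(X)
print(km.n_iter_)  # typically a single-digit number on data this easy
```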
Mmh, you make really good points, @kno10. 👍 A better benchmark should probably tackle a variety of real-life problems. May I suggest that you update the Colab I shared above to run the benchmark on a few such datasets instead of just `make_blobs()`?
Well, right now it seems that the naive algorithm is not consistently outperformed by the scikit-learn implementation of Elkan's, because of the code-optimization issues mentioned above. So for now, it is probably okay to default to the naive algorithm. Maybe have a look at the mlpack benchmarking system, which might include some data sets and suitable parameters: https://github.com/mlpack/benchmarks
We haven't touched that repository in quite a while; it's a lot of work to keep a benchmark system up to date! I agree entirely with @kno10 that …
Improved benchmark, with real data (color histograms, cf. https://doi.org/10.5281/zenodo.6355684), but still effectively only a single data set:

```python
import pandas as pd
from time import time
from sklearn.cluster import KMeans


def time_kmeans(X, n_clusters, algorithm, random_state):
    kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm,
                    random_state=random_state, tol=0, max_iter=1000)
    t0 = time()
    kmeans.fit(X)
    return time() - t0, kmeans.n_iter_


iteration = 0
for d in (8, 27, 64, 125, 216, 343, 512, 729, 1000):
    df = pd.read_csv("https://zenodo.org/record/6355684/files/aloi-%dd.csv.gz" % d,
                     delimiter=" ", header=None)
    X = df.iloc[:, :-2].values
    assert X.shape == (110250, d), X.shape
    for k in (2, 5, 10, 20, 50, 100):
        iteration += 1
        time_elkan, iter_elkan = time_kmeans(X, k, algorithm="elkan", random_state=iteration)
        time_full, iter_full = time_kmeans(X, k, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_iter_elkan={iter_elkan:-4}, n_iter_full={iter_full:-4}, "
            f"k={k:-4}, d={d:-4}, "
            f"time_elkan={time_elkan:-6.2f}, time_full={time_full:-6.2f}, "
            f"winner={winner}")
```

On Colab, this gives (still running for the larger data sets):
which appears to align with earlier results that Elkan is better for high dimensionality and low k (because of the many bounds). For larger k, Hamerly and its improvements are often better than Elkan.
Describe the bug
I just ran some tests on `KMeans`, and using `algorithm="elkan"` is generally slower (and sometimes much slower) than `algorithm="full"` on various datasets generated using `make_blobs()` (I tried different `n_samples`, `n_features`, `cluster_std`, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full.

Steps/Code to Reproduce
The experiments were done on Colab, like this:
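[Editor's note: the snippet itself did not survive the copy into this page; below is a rough reconstruction of that kind of `make_blobs` timing loop. The parameter values are my guesses, not the author's exact code.]

```python
from time import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Compare the "elkan" and "full" solvers on a couple of synthetic datasets.
for n_samples, n_features, centers in [(10_000, 2, 10), (100_000, 10, 50)]:
    X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                      centers=centers, random_state=42)
    for algorithm in ("elkan", "full"):
        t0 = time()
        KMeans(n_clusters=centers, algorithm=algorithm, random_state=42).fit(X)
        print(f"n_samples={n_samples}, n_features={n_features}, k={centers}, "
              f"algorithm={algorithm}: {time() - t0:.2f}s")
```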
And the output is:
Expected Results
Elkan should be generally faster, especially on large datasets with many well-defined clusters.
Actual Results
Versions
Note: I first mentioned these results in #20642, but it turned out to be a different issue, which is why I'm copying these results into a new issue.