KMeans Elkan algorithm (the default) is generally slower than Full · Issue #21729 · scikit-learn/scikit-learn · GitHub


Closed
ageron opened this issue Nov 21, 2021 · 11 comments · Fixed by #21735

Comments

@ageron
Contributor
ageron commented Nov 21, 2021

Describe the bug

I just ran some tests on KMeans, and using algorithm="elkan" is generally slower (and sometimes much slower) than algorithm="full" on various datasets generated using make_blobs() (I tried different n_samples, n_features, cluster_std, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full.

Steps/Code to Reproduce

The experiments were done on Colab, like this:

from time import time

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
  t0 = time()
  kmeans.fit(X)
  return time() - t0

iteration = 0
for n_samples in (10**3, 10**4, 10**5):
  for n_features in (2, 10, 100):
    for n_clusters in (2, 5, 10, 20, 50, 100):
      for cluster_std in (1e-3, 0.1, 1, 5):
        n_samples_per_cluster = n_samples // n_clusters
        X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
                          n_features=n_features,
                          cluster_std=cluster_std)
        iteration += 1
        time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
        time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_samples={n_samples}, n_features={n_features}, "
            f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
            f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
            f"winner={winner}")

And the output is:

n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full
n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full
n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full
n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full
n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full
n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan
n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan
n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full
n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full
n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan
n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full
n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan
n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full
n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full
n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full
n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan
n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan
n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full
n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan
n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full
n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan
n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan
n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan
n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full
...

Expected Results

Elkan should be generally faster, especially on large datasets with many well-defined clusters.

Actual Results

(screenshot of the timing output omitted; see the results above)

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.24
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

Note: I first mentioned these results in #20642, but it turned out to be a different issue which is why I'm copying these results in a new issue.

@ageron
Contributor Author
ageron commented Nov 21, 2021

@jeremiedbb , continuing here the conversation we started in #20642:

Thanks for these exhaustive benchmarks @ageron!
It confirms some intuition we had, but I imagined elkan would be better in more cases. In addition, when elkan is faster it's usually only slightly faster, while "full" can be a lot faster than "elkan". Given these results, I think it's definitely time for us to change the default value of the algorithm parameter. Would you be interested in submitting a PR for that?

Sure, I'm happy to submit a PR. However, I'm not sure about the process in this case: the parameter's default value remains "auto", right? But I would change its behavior to be "full" rather than "elkan". This should not change the API; it would only improve performance (in general).

So should I "just do it" and label the change as a performance improvement?

Or should this be considered an API change since existing models will give different results?

Should I first put a warning that in Sklearn 1.3, the default "auto" will change to using "full" instead of "elkan"?

Or perhaps we want to drop "auto" altogether and just default to "full"?

What do you think?

@jeremiedbb
Member

To me "auto" means letting scikit-learn decide what's best, so if we find that something is better, we should be able to change the behavior without the two-release deprecation cycle. Others might disagree :)

In this case, since elkan and full give the same result, switching the behavior of "auto" is just a performance improvement. I think it's fine to directly switch the behavior of "auto" for the next major release (1.1.0).

@ageron
Contributor Author
ageron commented Nov 21, 2021

Thanks @jeremiedbb , I just pushed a tiny PR for this. Let me know what you think. Note that I did not add .. versionchanged:: 1.1 for the algorithm parameter, as I'm not changing the API, just improving performance.

@ogrisel
Member
ogrisel commented Nov 22, 2021

I think I would be in favor of:

  • switching the default to "lloyd",
  • keeping "full" as an alias of "lloyd" but deprecating it,
  • deprecating "auto" because it's not needed anymore.

Document all of this in the "Changed models" section of the 1.1 changelog.

Since Elkan and Lloyd should converge to the same solution (up to numerical rounding errors), and since we do not expect any significant performance degradation in either memory usage or CPU time (quite the opposite), I don't think we need to go through a deprecation cycle.
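The proposed renaming could be sketched roughly like this. This is only an illustration: the function name resolve_algorithm and the exact warning texts are hypothetical, not scikit-learn's actual implementation.

```python
import warnings


def resolve_algorithm(algorithm="lloyd"):
    """Map the user-facing `algorithm` value to a concrete implementation.

    Hypothetical sketch of the proposed handling: "lloyd" becomes the
    canonical name, while "full" and "auto" are kept as deprecated aliases.
    """
    if algorithm == "auto":
        warnings.warn(
            'algorithm="auto" is deprecated; it now behaves like "lloyd".',
            FutureWarning,
        )
        return "lloyd"
    if algorithm == "full":
        warnings.warn(
            'algorithm="full" is deprecated; use "lloyd" instead.',
            FutureWarning,
        )
        return "lloyd"
    if algorithm not in ("lloyd", "elkan"):
        raise ValueError(
            f"algorithm must be 'lloyd' or 'elkan', got {algorithm!r}"
        )
    return algorithm
```

Under this sketch, existing code passing "full" or "auto" keeps working (with a warning) while new code uses "lloyd" directly, which is what avoids a hard break without a full deprecation cycle.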

@ageron
Contributor Author
ageron commented Nov 22, 2021

Sounds good @ogrisel , done, see #21735.

@ogrisel
Member
ogrisel commented Dec 15, 2021

As explained here, those benchmark results might be biased in favor of Lloyd, because the Elkan implementation can suffer from iteratively calling into two distinct thread pools (OpenBLAS's native thread pool and OpenMP) when SciPy is installed from the wheels.
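One way to probe that hypothesis is to cap the BLAS thread pool before timing Elkan, using threadpoolctl (already a scikit-learn dependency, listed in the versions above). A minimal sketch, with the actual fit-and-time call left as a placeholder:

```python
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools loaded in this process (OpenMP, OpenBLAS, ...).
apis = sorted({info["user_api"] for info in threadpool_info()})
print(apis)

# Re-running the Elkan timings with BLAS capped at one thread would isolate
# the cost of bouncing between the OpenBLAS and OpenMP thread pools.
with threadpool_limits(limits=1, user_api="blas"):
    pass  # e.g. the KMeans(algorithm="elkan").fit(X) timing would go here
```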

@kno10
Contributor
kno10 commented Aug 12, 2022

make_blobs is a poor source of benchmark data.

While it is true that k-means mostly makes sense on such data, it's not interesting to use k-means on this.

Instead, maybe benchmark with some embedding vectors etc. without a clear structure. Make sure to set tol=0.

The reason is as follows:

If you have well-separated blobs and initialize with, e.g., k-means++, then the algorithms can converge in just 1 or 2 iterations: if your cluster guesses are somewhat reasonable, all points will be assigned to the "correct" cluster, and then the means will be perfect in the next step.

Improved algorithms such as Elkan, Hamerly, Yin-Yang, Exponion, Shallot, ... are beneficial when you need many more than 5 iterations, because they do more work in the first iteration but much less once the result changes only gradually. If you choose too simple a problem, they never get to benefit from their bounds.

So don't draw conclusions from make_blobs! Benchmark on more difficult data instead, and with tol=0 (which should in my opinion be the default anyway), and keep an eye on the number of iterations used to detect too easy cases.

Even with tol=0 and a much higher standard deviation (to get more difficult data), full seems to win most of the time; this may well be attributable to how well SciPy's dgemm is optimized compared with how Elkan is currently implemented in scikit-learn, e.g. the threading issue mentioned above.
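The easy-case detection kno10 describes can be sketched like this: fit with tol=0 and inspect n_iter_; if the run converges in only a handful of iterations, the dataset is too easy for bound-based algorithms to pay off. The parameter values here are arbitrary examples:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Well-separated blobs: k-means typically converges in very few iterations,
# so Elkan-style bounds never amortize their extra first-iteration work.
X, _ = make_blobs(n_samples=2000, centers=10, cluster_std=0.1, random_state=0)

# tol=0 disables the early tolerance-based stop; n_iter_ then reflects how
# many iterations the problem actually needs.
km = KMeans(n_clusters=10, tol=0, max_iter=1000, n_init=10,
            random_state=0).fit(X)
print(km.n_iter_)  # a small count flags a "too easy" benchmark case
```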

@ageron
Contributor Author
ageron commented Aug 12, 2022

Mmh, you make really good points, @kno10. 👍 A better benchmark should probably tackle a variety of real-life problems. May I suggest that you update the Colab I shared above to run the benchmark on a few datasets instead of just make_blobs? Note that I got similar results with the digits dataset, IIRC, which may be because it's also fairly easy to cluster. The question is: should the default algorithm favour easy-to-cluster datasets or hard-to-cluster datasets?

@kno10
Contributor
kno10 commented Aug 13, 2022

Well, right now it seems that the naive algorithm is not consistently outperformed by the scikit-learn implementation of Elkan's algorithm, because of the code optimization issues mentioned above. So for now, it is probably okay to default to the naive algorithm.
It may be worth including alternate implementations in the benchmark, e.g. ELKI, shogun, mlpack, R.

Maybe have a look at the mlpack benchmarking system, which might include some data sets and suitable parameters: https://github.com/mlpack/benchmarks

  • the results on the web page appear to be pretty old; not sure if @rcurtin @zoq @mlpack are still working on it.

@rcurtin
rcurtin commented Aug 13, 2022

We haven't touched that repository in quite a while; it's a lot of work to keep a benchmark system up to date! I agree entirely with @kno10 that make_blobs is not a great benchmark. In my experiments on various data and my uses of k-means in practice, I find that a tree-based approach works well for low-dimensional data; and as was pointed out earlier in the thread, Elkan's algorithm (and also Hamerly's) really shines when there are many iterations of k-means. This often happens with "difficult" data, where the first handful of iterations move the clusters a lot, followed by tons of "fine tuning" iterations in which very little actually changes but the algorithm has not yet converged.

@kno10
Contributor
kno10 commented Aug 29, 2022

Improved benchmark, with real data (color histograms; cf. https://doi.org/10.5281/zenodo.6355684), but still effectively only a single data set:

import pandas as pd
from time import time
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state, tol=0, max_iter=1000)
  t0 = time()
  kmeans.fit(X)
  return time() - t0, kmeans.n_iter_

iteration = 0
for d in (8, 27, 64, 125, 216, 343, 512, 729, 1000):
  df = pd.read_csv("https://zenodo.org/record/6355684/files/aloi-%dd.csv.gz" % d, delimiter=" ", header=None)
  X = df.iloc[:, :-2].values
  assert X.shape == (110250, d), X.shape
  for k in (2, 5, 10, 20, 50, 100):
    iteration += 1
    time_elkan, iter_elkan = time_kmeans(X, k, algorithm="elkan", random_state=iteration)
    time_full, iter_full = time_kmeans(X, k, algorithm="full", random_state=iteration)
    winner = "Full" if time_full < time_elkan else "Elkan"
    print(
        f"n_iter={iter_elkan:-4}, n_iter={iter_full:-4}, "
        f"k={k:-4}, d={d:-4}, "
        f"time_elkan={time_elkan:-6.2f}, time_full={time_full:-6.2f}, "
        f"winner={winner}")

On Colab, this gives (still running for the larger data sets):

n_iter=  16, n_iter=  16, k=   2, d=   8, time_elkan=  1.12, time_full=  1.13, winner=Elkan
n_iter=  91, n_iter=  91, k=   5, d=   8, time_elkan=  2.60, time_full=  2.35, winner=Full
n_iter=  78, n_iter=  78, k=  10, d=   8, time_elkan=  4.99, time_full=  3.84, winner=Full
n_iter=  80, n_iter=  80, k=  20, d=   8, time_elkan=  9.55, time_full=  6.26, winner=Full
n_iter= 160, n_iter= 160, k=  50, d=   8, time_elkan= 20.37, time_full= 12.33, winner=Full
n_iter= 101, n_iter= 101, k= 100, d=   8, time_elkan= 71.28, time_full= 26.39, winner=Full
n_iter=  21, n_iter=  21, k=   2, d=  27, time_elkan=  1.69, time_full=  2.95, winner=Elkan
n_iter=  55, n_iter=  55, k=   5, d=  27, time_elkan=  3.27, time_full=  4.12, winner=Elkan
n_iter=  73, n_iter=  73, k=  10, d=  27, time_elkan=  6.85, time_full=  7.89, winner=Elkan
n_iter=  81, n_iter=  81, k=  20, d=  27, time_elkan= 14.78, time_full= 14.79, winner=Elkan
n_iter= 199, n_iter= 199, k=  50, d=  27, time_elkan= 25.07, time_full= 22.89, winner=Full
n_iter= 132, n_iter= 132, k= 100, d=  27, time_elkan= 92.91, time_full= 48.75, winner=Full
n_iter=  20, n_iter=  20, k=   2, d=  64, time_elkan=  2.25, time_full=  3.52, winner=Elkan
n_iter=  40, n_iter=  40, k=   5, d=  64, time_elkan=  6.51, time_full= 11.48, winner=Elkan
n_iter=  88, n_iter=  88, k=  10, d=  64, time_elkan=  9.45, time_full= 14.38, winner=Elkan
n_iter= 125, n_iter= 125, k=  20, d=  64, time_elkan= 17.72, time_full= 22.30, winner=Elkan
n_iter= 114, n_iter= 114, k=  50, d=  64, time_elkan= 34.08, time_full= 43.57, winner=Elkan
n_iter= 134, n_iter= 134, k= 100, d=  64, time_elkan=105.27, time_full= 77.91, winner=Full
n_iter=  19, n_iter=  19, k=   2, d= 125, time_elkan=  3.17, time_full=  5.51, winner=Elkan
n_iter=  54, n_iter=  54, k=   5, d= 125, time_elkan=  9.96, time_full= 19.17, winner=Elkan
n_iter= 169, n_iter= 169, k=  10, d= 125, time_elkan= 16.45, time_full= 31.44, winner=Elkan
n_iter=  60, n_iter=  60, k=  20, d= 125, time_elkan= 21.45, time_full= 35.11, winner=Elkan
n_iter= 130, n_iter= 130, k=  50, d= 125, time_elkan= 49.37, time_full= 81.40, winner=Elkan
n_iter=  82, n_iter=  82, k= 100, d= 125, time_elkan=135.25, time_full=148.29, winner=Elkan
n_iter=  22, n_iter=  22, k=   2, d= 216, time_elkan=  5.58, time_full=  8.99, winner=Elkan
n_iter=  61, n_iter=  61, k=   5, d= 216, time_elkan= 16.80, time_full= 34.09, winner=Elkan
n_iter= 108, n_iter= 108, k=  10, d= 216, time_elkan= 22.85, time_full= 45.63, winner=Elkan
n_iter=  78, n_iter=  78, k=  20, d= 216, time_elkan= 33.35, time_full= 61.57, winner=Elkan
n_iter= 238, n_iter= 238, k=  50, d= 216, time_elkan= 57.04, time_full=113.56, winner=Elkan
n_iter= 240, n_iter= 240, k= 100, d= 216, time_elkan=140.52, time_full=207.79, winner=Elkan
n_iter=  22, n_iter=  22, k=   2, d= 343, time_elkan=  6.98, time_full= 13.22, winner=Elkan
n_iter= 116, n_iter= 116, k=   5, d= 343, time_elkan= 23.74, time_full= 52.03, winner=Elkan
n_iter=  52, n_iter=  52, k=  10, d= 343, time_elkan= 33.98, time_full= 72.24, winner=Elkan
n_iter= 137, n_iter= 137, k=  20, d= 343, time_elkan= 41.94, time_full= 84.31, winner=Elkan
n_iter= 103, n_iter= 103, k=  50, d= 343, time_elkan= 70.77, time_full=151.32, winner=Elkan
n_iter= 102, n_iter= 102, k= 100, d= 343, time_elkan=171.56, time_full=303.35, winner=Elkan
n_iter=  18, n_iter=  18, k=   2, d= 512, time_elkan=  9.29, time_full= 17.43, winner=Elkan
n_iter=  64, n_iter=  64, k=   5, d= 512, time_elkan= 24.56, time_full= 52.91, winner=Elkan
n_iter=  59, n_iter=  59, k=  10, d= 512, time_elkan= 56.67, time_full=128.65, winner=Elkan
n_iter= 140, n_iter= 140, k=  20, d= 512, time_elkan= 74.82, time_full=165.59, winner=Elkan
n_iter= 133, n_iter= 133, k=  50, d= 512, time_elkan= 98.23, time_full=232.96, winner=Elkan
n_iter=  98, n_iter=  98, k= 100, d= 512, time_elkan=242.41, time_full=512.89, winner=Elkan

which appears to align with earlier results that Elkan is better for high dimensionality and low k (it maintains one lower bound per cluster for every point, which becomes expensive for large k). For larger k, Hamerly's algorithm and its improvements are often better than Elkan's.
