8000 KMeans Elkan algorithm (the default) is generally slower than Full · Issue #21729 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
KMeans Elkan algorithm (the default) is generally slower than Full #21729
Closed
@ageron

Description

@ageron

Describe the bug

I just ran some tests on KMeans, and using algorithm="elkan" is generally slower (and sometimes much slower) than algorithm="full" on various datasets generated using make_blobs() (I tried different n_samples, n_features, cluster_std, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full.

Steps/Code to Reproduce

The experiments were done on Colab, like this:

from time import time

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
  t0 = time()
  kmeans.fit(X)
  return time() - t0

iteration = 0
for n_samples in (10**3, 10**4, 10**5):
  for n_features in (2, 10, 100):
    for n_clusters in (2, 5, 10, 20, 50, 100):
      for cluster_std in (1e-3, 0.1, 1, 5):
        n_samples_per_cluster = n_samples // n_clusters
        X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
                          n_features=n_features,
                          cluster_std=cluster_std)
        iteration += 1
        time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
        time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_samples={n_samples}, n_features={n_features}, "
            f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
            f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
            f"winner={winner}")

And the output is:

n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full
n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full
n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full
n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full
n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full
n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan
n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan
n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full
n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full
n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan
n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full
n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan
n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full
n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full
n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full
n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan
n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan
n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full
n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan
n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full
n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan
n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan
n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan
n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full
...

Expected Results

Elkan should be generally faster, especially on large datasets with many well-defined clusters.

Actual Results

image

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.24
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

Note: I first mentioned these results in #20642, but it turned out to be a different issue which is why I'm copying these results in a new issue.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0