Description
Describe the bug
I just ran some tests on KMeans
, and using algorithm="elkan"
is generally slower (and sometimes much slower) than algorithm="full"
on various datasets generated using make_blobs()
(I tried different n_samples
, n_features
, cluster_std
, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full.
Steps/Code to Reproduce
The experiments were done on Colab, like this:
from time import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
def time_kmeans(X, n_clusters, algorithm, random_state):
kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
t0 = time()
kmeans.fit(X)
return time() - t0
iteration = 0
for n_samples in (10**3, 10**4, 10**5):
for n_features in (2, 10, 100):
for n_clusters in (2, 5, 10, 20, 50, 100):
for cluster_std in (1e-3, 0.1, 1, 5):
n_samples_per_cluster = n_samples // n_clusters
X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
n_features=n_features,
cluster_std=cluster_std)
iteration += 1
time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
winner = "Full" if time_full < time_elkan else "Elkan"
print(
f"n_samples={n_samples}, n_features={n_features}, "
f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
f"winner={winner}")
And the output is:
n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full ...
Expected Results
Elkan should be generally faster, especially on large datasets with many well-defined clusters.
Actual Results
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
Note: I first mentioned these results in #20642, but it turned out to be a different issue which is why I'm copying these results in a new issue.