
Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0 #20642


Closed
gittar opened this issue Jul 31, 2021 · 25 comments

@gittar
Contributor
gittar commented Jul 31, 2021

Describe the bug

While doing experiments with the KMeans class I observed hugely varying running times for the same problem when using different versions of numpy. A systematic check revealed the following:

  • for numpy 1.19.0 and later, KMeans was 100-500% slower than for numpy 1.18.5 and earlier

Statistics (figures are repeatable with little variance):

time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0

time:  7.74, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.00, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.47, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.37, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.80, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.54, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.21.1

time:  7.82, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.54, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.29, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.25, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.06, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.20, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.21.1

time: 14.45, Err: 0.01360200 python= 3.9.6  sklearn= 0.23.2 numpy 1.21.1
time: 17.44, Err: 0.01361583 python= 3.9.6  sklearn= 0.24.2 numpy 1.21.1

Steps/Code to Reproduce

Here is the python benchmark I used to produce each of the above lines. It generates three random data sets and runs k-means++ with k in {50,100}. The time for running the fit() method is aggregated, as well as the normalized error (inertia per sample).

# file 'bench.py'
from sklearn.cluster import KMeans
import numpy as np
from time import time
import sys
import sklearn

def bench():
    np.random.seed(2) # fix random generator
    # generate datasets of size 1000,5000,10000
    # run k-means++ with k= 50,100
    Ds=[]
    for n in [1000,5000,10000]:
        Ds.append(np.random.random(size=(n,2)))
    ks=[50,100]
    t_total=0
    err=0
    for D in Ds:
        for k in ks:
            km=KMeans(n_clusters=k)
            t0=time()
            km.fit(D)
            t=time()-t0
            t_total+=t
            err+=km.inertia_/len(D)
        
    return f"time: {t_total:5.2f}, Err: {err:.8f}"
    
print(bench(),"python=",sys.version.split()[0],
"sklearn=",sklearn.__version__, "numpy",np.__version__)

The environments were produced with conda, e.g. the one for Python 3.9 and scikit-learn 0.24.2 as follows:

conda create --name py3.9-sk0.24.2 python=3.9
conda activate py3.9-sk0.24.2
pip install scikit-learn==0.24.2
pip install numpy==1.18.5

Then for each environment the above benchmark was run:

python src/bench.py

Expected Results

I was expecting similar execution times for the different numpy versions or even better performance for higher versions.

Actual Results

The performance dropped sharply when switching from numpy 1.18.5 to 1.19.0. This was the case for Python 3.6, 3.7, and 3.8 and for scikit-learn 0.23.2 and 0.24.2. The performance hit was larger for scikit-learn 0.23.2 than for scikit-learn 0.24.2.

With numpy 1.18.5 the performance of scikit-learn 0.23.2 was better than that of scikit-learn 0.24.2.

The computed error values were identical for a given version of scikit-learn and differed only slightly between scikit-learn 0.23.2 and 0.24.2.

Question: What could be the reason for the huge performance deterioration for all numpy versions starting with 1.19.0?

Versions

(see table above)

@gittar gittar changed the title from "Poor performance of sklearn.cluster.KMeans for Python > 3.6.13" to "Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0" on Jul 31, 2021
@glemaitre
Member

Could you provide the versions as well? It would be nice to have the OS info.

@gittar
Contributor Author
gittar commented Aug 1, 2021

Sure,
OS: Ubuntu 20.04.2 LTS
CPU: AMD FX(tm)-8300 Eight-Core Processor

@ogrisel
Member
ogrisel commented Aug 3, 2021

Can you please check that you get the same value for km.n_iter_ in the end?
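For reference, a minimal way to collect that value with the original benchmark data (a sketch only; it reuses the same seed and sizes as bench.py above):

# check that the iteration counts match across numpy versions
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(2)
D = np.random.random(size=(10000, 2))
km = KMeans(n_clusters=100).fit(D)
# if n_iter_ is identical across environments, only the per-iteration cost changed
print("n_iter:", km.n_iter_, "normalized inertia:", km.inertia_ / len(D))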

@ogrisel
Member
ogrisel commented Aug 3, 2021

Can you please check the version of openblas used by sklearn (it should be the openblas shipped with scipy), for instance for the python 3.8 runs:

python -m threadpoolctl --import sklearn

@glemaitre
Member

I just reproduced the benchmark and see the same behaviour.

NumPy 1.18

scikit-learn: 0.24.2, numpy: 1.18.5, scipy: 1.7.1
(n_samples, n_features): (1000, 2), n_clusters: 50, time:  0.34, n_iter: 18, Err: 0.00280174
(n_samples, n_features): (1000, 2), n_clusters: 100, time:  0.84, n_iter: 9, Err: 0.00404925
(n_samples, n_features): (5000, 2), n_clusters: 50, time:  3.07, n_iter: 23, Err: 0.00726103
(n_samples, n_features): (5000, 2), n_clusters: 100, time:  5.46, n_iter: 23, Err: 0.00879794
(n_samples, n_features): (10000, 2), n_clusters: 50, time:  8.47, n_iter: 29, Err: 0.01202810
(n_samples, n_features): (10000, 2), n_clusters: 100, time: 11.88, n_iter: 23, Err: 0.01361594

openblas version:

[
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 4
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-34a18dc3.3.7.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.7",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]

NumPy 1.21

scikit-learn: 0.24.2, numpy: 1.21.1, scipy: 1.7.1
(n_samples, n_features): (1000, 2), n_clusters: 50, time:  0.35, n_iter: 18, Err: 0.00280174
(n_samples, n_features): (1000, 2), n_clusters: 100, time:  4.28, n_iter: 9, Err: 0.00404925
(n_samples, n_features): (5000, 2), n_clusters: 50, time:  6.46, n_iter: 23, Err: 0.00726103
(n_samples, n_features): (5000, 2), n_clusters: 100, time: 19.63, n_iter: 23, Err: 0.00879794
(n_samples, n_features): (10000, 2), n_clusters: 50, time: 24.15, n_iter: 29, Err: 0.01202810
(n_samples, n_features): (10000, 2), n_clusters: 100, time: 41.98, n_iter: 23, Err: 0.01361594

openblas version:

[
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 4
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.17",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]

@ogrisel
Member
ogrisel commented Aug 3, 2021

Interesting: scikit-learn's KMeans inner operations use a GEMM routine from scipy, and therefore the OpenBLAS 0.3.9 shipped with SciPy, which is the same in both cases.

numpy should not be used in the inner loops of KMeans. So the difference in performance is really mysterious to me.

@ogrisel
Member
ogrisel commented Aug 4, 2021

Based on interactive debugging session with @glemaitre it seems that:

  • this problem only happens when OMP_NUM_THREADS is unset or set to a high value (e.g. 6 or more on a machine with 4 physical cores and 8 hyper-threads)
  • this problem only happens with algorithm="elkan" which is the default for this problem. Setting algorithm="full" makes the problem disappear.

Profiler output with native threads is difficult to interpret but it seems that the openmp parallelism of the elkan loops is involved. I do not understand how it relates to the numpy version though.

Even more surprising, this problem also happens when numpy, scipy and openblas are all installed from conda-forge. And this time the version of numpy and openblas do not seem to matter... It's always slow.
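A quick way to check the algorithm observation locally (a sketch only, reusing the benchmark's random data; run it once with OMP_NUM_THREADS unset and once with it set to a small value to see the effect):

import numpy as np
from time import time
from sklearn.cluster import KMeans

np.random.seed(2)
X = np.random.random(size=(10000, 2))
# compare the default Elkan path with the plain Lloyd ("full") path
for algo in ("elkan", "full"):
    t0 = time()
    KMeans(n_clusters=100, algorithm=algo).fit(X)
    print(f"{algo}: {time() - t0:.2f}s")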

@gittar
Contributor Author
gittar commented Aug 4, 2021 via email

@glemaitre
Member

Using a machine with many cores (48 physical cores, hyper-threading deactivated), I don't see any dramatic performance drop between the NumPy versions installed from PyPI. I used up to 36 cores (I could not use more since some other processes were running on the machine and were taking 8-10 cores).

I am under the impression that accounting for the hyper-threads might be one of the factors in the performance drop. However, it is mysterious why NumPy 1.18.5 installed from PyPI would limit itself to the physical cores while the conda-forge version would not.

@ogrisel
Member
ogrisel commented Aug 5, 2021

It could be a case of OpenMathLib/OpenBLAS#3187 (thanks @jeremiedbb, I had completely forgotten about this). So it could be related to how OpenBLAS assigns threads to specific CPUs. However I am still unsure why the use of OpenMP threads impacts OpenBLAS thread placement since they are not nested calls but sequential calls.

@ogrisel
Member
ogrisel commented Aug 5, 2021

Hum, maybe OpenBLAS is setting an affinity mask on the main thread to tie it to a specific CPU and then subsequent calls to OpenMP spawn threads all tied to the same CPU causing oversubscription. We indeed need to update the C reproducer of OpenMathLib/OpenBLAS#3187 to monitor the value of the cpus_allowed_list key in /proc/<pid>/status both for the main thread and for each thread inside the openmp loop.
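On Linux this can also be inspected from Python without touching the C reproducer; a rough sketch (the helper below just reads the standard /proc status files for every thread of the current process):

import glob

def print_cpus_allowed():
    # one status file per thread of the current process (Linux only)
    for status in sorted(glob.glob("/proc/self/task/*/status")):
        with open(status) as f:
            for line in f:
                if line.startswith("Cpus_allowed_list"):
                    print(status, line.strip())

print_cpus_allowed()  # call before and after KMeans.fit() to compare the affinity masks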

@jeremiedbb
Member
jeremiedbb commented Sep 17, 2021

Interesting: scikit-learn's KMeans inner operations use a GEMM routine from scipy, and therefore the OpenBLAS 0.3.9 shipped with SciPy, which is the same in both cases.

numpy should not be used in the inner loops of KMeans. So the difference in performance is really mysterious to me.

this problem only happens with algorithm="elkan"

Actually with elkan there's no BLAS call inside the inner loop. Instead there's a BLAS call between each iteration. So the relevant BLAS is indeed the one shipped with numpy. Therefore there might be a regression in OpenBLAS between 0.3.7 and 0.3.17. I'll experiment with the reproducer from OpenMathLib/OpenBLAS#3187 using these versions to see if this is indeed the same issue.
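To double-check which OpenBLAS that GEMM call resolves to, threadpoolctl can be used from Python as well (a small sketch; threadpool_info is the public introspection function of that package):

from threadpoolctl import threadpool_info
import numpy  # make sure numpy's OpenBLAS is loaded before inspecting

for lib in threadpool_info():
    if lib.get("internal_api") == "openblas":
        print(lib["filepath"], lib["version"], lib.get("threading_layer"))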

@jeremiedbb
Member
jeremiedbb commented Oct 8, 2021

I ran the same snippet with the latest versions of numpy, scipy and scikit-learn, just changing the OpenBLAS version and rebuilding everything from source, and I can reproduce the timings. It actually appeared with this change in OpenBLAS: OpenMathLib/OpenBLAS#2375.

However, it's not really a regression of OpenBLAS. The linked PR essentially changes the matrix size threshold below which OpenBLAS runs sequentially. The poor performance reported here was already there before, just for bigger matrices (in particular, a higher number of centers).

In conclusion this issue is the same as I reported in OpenMathLib/OpenBLAS#3187. The bad news is that it might not be a bug but an expected behavior :/
I'll discuss further with OpenBLAS folks to try to learn more about that. For now I don't think there's anything we can do about it on the sklearn side.

@gittar As a workaround, you can try to set the env vars OPENBLAS_NUM_THREADS and OMP_NUM_THREADS to half the number of cores each.
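For example, on the 8-core machine from the original report that would look like this (same invocation as the benchmark above):

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 python src/bench.py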

@gittar
Contributor Author
gittar commented Oct 11, 2021

This seems to be a much harder problem than I expected when I reported it. Thanks for your effort in investigating this. If a fix will take long (or is not even possible for some reason), you might consider informing users that they can avoid a huge performance hit by sticking with numpy 1.18.5. Of course it would be way better if the issue could be solved.

@ogrisel
Member
ogrisel commented Oct 19, 2021

I think you can also change the number of threads used by OpenBLAS with OPENBLAS_NUM_THREADS=1 or https://github.com/joblib/threadpoolctl, or just install numpy, scipy and scikit-learn from conda-forge, in which case both scikit-learn and OpenBLAS can/will link to the same OpenMP runtime and the problem disappears.
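Programmatically, the threadpoolctl route would look roughly like this (a sketch; threadpool_limits is that package's public context manager, and here only the BLAS pools are capped):

import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans

X = np.random.random(size=(10000, 2))
# cap OpenBLAS to a single thread around the fit, leaving OpenMP untouched
with threadpool_limits(limits=1, user_api="blas"):
    KMeans(n_clusters=100).fit(X)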

@ageron
Contributor
ageron commented Nov 20, 2021

Not sure if this is related, but I just ran some tests, and using algorithm="elkan" is generally slower (and sometimes much slower) than algorithm="full" on various datasets generated using make_blobs() (I tried different n_samples, n_features, cluster_std, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full:

[figure: histogram of Elkan vs. Full fit times over the benchmark configurations]

The experiments were done on Colab, like this:

from time import time

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
  t0 = time()
  kmeans.fit(X)
  return time() - t0

iteration = 0
for n_samples in (10**3, 10**4, 10**5):
  for n_features in (2, 10, 100):
    for n_clusters in (2, 5, 10, 20, 50, 100):
      for cluster_std in (1e-3, 0.1, 1, 5):
        n_samples_per_cluster = n_samples // n_clusters
        X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
                          n_features=n_features,
                          cluster_std=cluster_std)
        iteration += 1
        time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
        time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_samples={n_samples}, n_features={n_features}, "
            f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
            f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
            f"winner={winner}")

And the output is:

n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full
n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full
n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full
n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full
n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full
n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan
n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan
n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full
n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full
n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan
n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full
n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan
n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full
n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full
n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full
n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan
n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan
n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full
n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan
n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full
n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan
n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan
n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan
n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full
...

@jeremiedbb
Member
jeremiedbb commented Nov 20, 2021

Thanks for these exhaustive benchmarks @ageron!
It confirms some intuition we had, but I imagined elkan would be better in more cases. In addition, when elkan is faster it's usually only slightly faster, while "full" can be a lot faster than "elkan". Given these results, I think it's definitely time for us to change the default value of the algorithm parameter. Would you be interested in submitting a PR for that?

@gittar
Contributor Author
gittar commented Nov 21, 2021

These are interesting results, @ageron. However, I think they cannot explain the huge performance differences I observed. I did my simulations mainly with one specific version of scikit-learn (0.23.2) and did not set the "algorithm" parameter, so it kept its default "auto", which is mapped to "elkan". Your results suggest that "auto" should instead be mapped to "full", but according to your experiments that would only bring a moderate improvement (visually estimated: 30%), and presumably for both older and newer versions of NumPy, which would leave the performance gap as is.

@ageron
Contributor
ageron commented Nov 21, 2021

@gittar , you're right. I'll open another issue regarding Elkan vs Full.

@ogrisel
Member
ogrisel commented Dec 15, 2021

Actually @ageron's benchmark results might be biased against Elkan because the Lloyd solver does not have the threadpool interaction problem.

It would be possible to check this by re-running the benchmarks in a pure conda-forge environment where scipy's OpenBLAS and scikit-learn's compiled extensions are both linked to the same OpenMP runtime to use a single threadpool.

conda create -n all-openmp -y -c conda-forge scikit-learn "libopenblas=*=*openmp*"
conda activate all-openmp

We can check that the env has a single threadpool by checking that there is a single OpenBLAS runtime and that its threading layer is "openmp":

python -m threadpoolctl -i scipy.linalg -i sklearn

EDIT: added missing -i before sklearn

For instance, on my laptop I get the following:

[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/all-openmp/lib/libopenblas_vortexp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "VORTEX",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libomp",
    "filepath": "/Users/ogrisel/mambaforge/envs/all-openmp/lib/libomp.dylib",
    "version": null,
    "num_threads": 8
  }
]

@ogrisel
Member
ogrisel commented Dec 15, 2021

Note that on Linux, when building scikit-learn from source with gcc, scikit-learn will use the GNU libgomp OpenMP runtime library instead of the LLVM libomp used by default in conda-forge packages (see https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp).

To make conda-forge use libgomp instead of libomp, it should be possible to install the env as follows:

conda create -n all-libgomp -y -c conda-forge scikit-learn "libopenblas=*=*openmp*" "_openmp_mutex=*=*_gnu" compilers
conda activate all-libgomp

and then compile scikit-learn using the gcc compiler from that conda env instead of the system compiler. I have not tested this setup yet.

@ageron
Contributor
ageron commented Dec 15, 2021

Hi @ogrisel,

I just ran the benchmark again, this time on my laptop, in a conda environment, and Elkan was still generally slower than Lloyd (it produced roughly the same histogram as above). It was based on OpenBLAS with OpenMP as the threading layer:

[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ageron/miniconda3/envs/test/lib/libopenblasp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "Haswell",
    "num_threads": 16
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libomp",
    "filepath": "/Users/ageron/miniconda3/envs/test/lib/libomp.dylib",
    "version": null,
    "num_threads": 16
  }
]

@ogrisel
Member
ogrisel commented Dec 15, 2021

Thank you very much for checking. We won't have to revert your PR ;)

@gittar
Contributor Author
gittar commented Jan 7, 2022

I made an experiment with nearly current versions of all packages and that looks quite promising:

time:  8.60, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  8.94, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  7.02, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  6.75, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2

However, with the most recent versions the test program runs a lot slower:

time: 18.05, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 18.05, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 17.42, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 17.53, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0

And with older versions I still get the observed very poor performance:

time: 39.49, Err: 0.01360200 python= 3.8.12 sklearn= 0.23.2 numpy 1.19.0

At least this seemingly makes it possible to get rid of the numpy 1.18.5 dependency for good performance, which already helps in my case.

@jeremiedbb
Member

The default algorithm was changed to "lloyd", which does not suffer from this performance issue.

The issue with "elkan" has always been there and does not depend on the version of numpy (only the sizes of the matrices that trigger it depend on the version of numpy). It's actually an OpenBLAS / OpenMP issue that won't be fixed on the scikit-learn side. The best fix will probably come from a unification of the BLAS/OpenMP packaging in the scientific Python ecosystem.
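For users on releases where the slow path is still the default, selecting Lloyd's algorithm explicitly side-steps it (a usage sketch; the value is spelled "full" up to scikit-learn 1.0 and "lloyd" from 1.1 on):

from sklearn.cluster import KMeans

# "lloyd" in scikit-learn >= 1.1; use algorithm="full" on older versions
km = KMeans(n_clusters=100, algorithm="lloyd")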
