
Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0 #20642


Closed
gittar opened this issue Jul 31, 2021 · 25 comments

@gittar
Contributor
gittar commented Jul 31, 2021

Describe the bug

While doing experiments with the KMeans class I observed hugely varying running times for the same problem when using different versions of numpy. A systematic check revealed the following:

  • for numpy 1.19.0 and later, KMeans was 100-500% slower than for numpy 1.18.5 and earlier

Statistics (figures are repeatable with little variance):

time:  7.67, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.16.4
time:  8.20, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.18.5
time: 40.85, Err: 0.01360200 python= 3.6.13 sklearn= 0.23.2 numpy 1.19.0
time:  9.69, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.16.4
time: 10.21, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.18.5
time: 17.68, Err: 0.01361583 python= 3.6.13 sklearn= 0.24.2 numpy 1.19.0

time:  7.74, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.00, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.47, Err: 0.01360200 python= 3.7.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.37, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.80, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.54, Err: 0.01361583 python= 3.7.10 sklearn= 0.24.2 numpy 1.21.1

time:  7.82, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.18.5
time: 40.54, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.19.0
time: 40.29, Err: 0.01360200 python= 3.8.10 sklearn= 0.23.2 numpy 1.21.1
time: 10.25, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.18.5
time: 17.06, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.19.0
time: 17.20, Err: 0.01361583 python= 3.8.10 sklearn= 0.24.2 numpy 1.21.1

time: 14.45, Err: 0.01360200 python= 3.9.6  sklearn= 0.23.2 numpy 1.21.1
time: 17.44, Err: 0.01361583 python= 3.9.6  sklearn= 0.24.2 numpy 1.21.1

Steps/Code to Reproduce

Here is the python benchmark I used to produce each of the above lines. It generates three random data sets and runs k-means++ with k in {50,100}. The time for running the fit() method is aggregated, as well as the normalized error (inertia per sample).

# file 'bench.py'
from sklearn.cluster import KMeans
import numpy as np
from time import time
import sys
import sklearn

def bench():
    np.random.seed(2) # fix random generator
    # generate datasets of size 1000,5000,10000
    # run k-means++ with k= 50,100
    Ds=[]
    for n in [1000,5000,10000]:
        Ds.append(np.random.random(size=(n,2)))
    ks=[50,100]
    t_total=0
    err=0
    for D in Ds:
        for k in ks:
            km=KMeans(n_clusters=k)
            t0=time()
            km.fit(D)
            t=time()-t0
            t_total+=t
            err+=km.inertia_/len(D)
        
    return f"time: {t_total:5.2f}, Err: {err:.8f}"
    
print(bench(),"python=",sys.version.split()[0],
"sklearn=",sklearn.__version__, "numpy",np.__version__)

The environments were produced with conda, e.g. the one for Python 3.9 and scikit-learn 0.24.2 as follows:

conda create --name py3.9-sk0.24.2 python=3.9
conda activate py3.9-sk0.24.2
pip install scikit-learn==0.24.2
pip install numpy==1.18.5

Then for each environment the above benchmark was run:

python src/bench.py

Expected Results

I was expecting similar execution times for the different numpy versions or even better performance for higher versions.

Actual Results

The performance dropped sharply when switching from numpy 1.18.5 to 1.19.0. This was the case for Python 3.6, 3.7, and 3.8 and for scikit-learn 0.23.2 and 0.24.2. The performance hit was larger for scikit-learn 0.23.2 than for scikit-learn 0.24.2.

With numpy 1.18.5 the performance of scikit-learn 0.23.2 was better than that of scikit-learn 0.24.2.

The computed error values were identical for a given version of scikit-learn and differed only slightly between scikit-learn 0.23.2 and 0.24.2.

Question: What could be the reason for the huge performance deterioration for all numpy versions starting with 1.19.0?

Versions

(see table above)

@gittar gittar changed the title from "Poor performance of sklearn.cluster.KMeans for Python > 3.6.13" to "Poor performance of sklearn.cluster.KMeans for numpy >= 1.19.0" on Jul 31, 2021
@glemaitre
Member

Could you provide the versions as well? It would be nice to have the OS info.

@gittar
Contributor Author
gittar commented Aug 1, 2021

Sure,
OS: Ubuntu 20.04.2 LTS
CPU: AMD FX(tm)-8300 Eight-Core Processor

@ogrisel
Member
ogrisel commented Aug 3, 2021

Can you please check that you get the same value for km.n_iter_ in the end?
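For reference, a minimal way to collect that value with the original benchmark data (a sketch only; it reuses the same seed and sizes as bench.py above):

# check that the iteration counts match across numpy versions
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(2)
D = np.random.random(size=(10000, 2))
km = KMeans(n_clusters=100).fit(D)
# if n_iter_ is identical across environments, only the per-iteration cost changed
print("n_iter:", km.n_iter_, "normalized inertia:", km.inertia_ / len(D))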

@ogrisel
Member
ogrisel commented Aug 3, 2021

Can you please check the version of openblas used by sklearn (it should be the openblas shipped with scipy), for instance for the python 3.8 runs:

python -m threadpoolctl --import sklearn

@glemaitre
Member

I just reproduced the benchmark and see the same behaviour.

NumPy 1.18

scikit-learn: 0.24.2, numpy: 1.18.5, scipy: 1.7.1
(n_samples, n_features): (1000, 2), n_clusters: 50, time:  0.34, n_iter: 18, Err: 0.00280174
(n_samples, n_features): (1000, 2), n_clusters: 100, time:  0.84, n_iter: 9, Err: 0.00404925
(n_samples, n_features): (5000, 2), n_clusters: 50, time:  3.07, n_iter: 23, Err: 0.00726103
(n_samples, n_features): (5000, 2), n_clusters: 100, time:  5.46, n_iter: 23, Err: 0.00879794
(n_samples, n_features): (10000, 2), n_clusters: 50, time:  8.47, n_iter: 29, Err: 0.01202810
(n_samples, n_features): (10000, 2), n_clusters: 100, time: 11.88, n_iter: 23, Err: 0.01361594

openblas version:

[
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 4
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-34a18dc3.3.7.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.7",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_118/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]

NumPy 1.21

scikit-learn: 0.24.2, numpy: 1.21.1, scipy: 1.7.1
(n_samples, n_features): (1000, 2), n_clusters: 50, time:  0.35, n_iter: 18, Err: 0.00280174
(n_samples, n_features): (1000, 2), n_clusters: 100, time:  4.28, n_iter: 9, Err: 0.00404925
(n_samples, n_features): (5000, 2), n_clusters: 50, time:  6.46, n_iter: 23, Err: 0.00726103
(n_samples, n_features): (5000, 2), n_clusters: 100, time: 19.63, n_iter: 23, Err: 0.00879794
(n_samples, n_features): (10000, 2), n_clusters: 50, time: 24.15, n_iter: 29, Err: 0.01202810
(n_samples, n_features): (10000, 2), n_clusters: 100, time: 41.98, n_iter: 23, Err: 0.01361594

openblas version:

[
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/scikit_learn.libs/libgomp-f7e03b3e.so.1.0.0",
    "prefix": "libgomp",
    "user_api": "openmp",
    "internal_api": "openmp",
    "version": null,
    "num_threads": 4
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-2d23e62b.3.17.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.17",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  },
  {
    "filepath": "/home/glemaitre/miniconda3/envs/npy_latest/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 4,
    "threading_layer": "pthreads",
    "architecture": "Haswell"
  }
]

@ogrisel
Member
ogrisel commented Aug 3, 2021

Interesting: scikit-learn's KMeans inner operations use a GEMM routine from scipy, and therefore the OpenBLAS 0.3.9 shipped with SciPy, which is the same in both cases.

numpy should not be used in the inner loops of KMeans. So the difference in performance is really mysterious to me.

@ogrisel
Member
ogrisel commented Aug 4, 2021

Based on interactive debugging session with @glemaitre it seems that:

  • this problem only happens when OMP_NUM_THREADS is unset or set to a high value (e.g. 6 or more on a machine with 4 physical cores and 8 hyper-threads)
  • this problem only happens with algorithm="elkan" which is the default for this problem. Setting algorithm="full" makes the problem disappear.

Profiler output with native threads is difficult to interpret but it seems that the openmp parallelism of the elkan loops is involved. I do not understand how it relates to the numpy version though.

Even more surprising, this problem also happens when numpy, scipy and openblas are all installed from conda-forge. And this time the version of numpy and openblas do not seem to matter... It's always slow.
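A quick way to check the algorithm observation locally (a sketch only, reusing the benchmark's random data; run it once with OMP_NUM_THREADS unset and once with it set to a small value to see the effect):

import numpy as np
from time import time
from sklearn.cluster import KMeans

np.random.seed(2)
X = np.random.random(size=(10000, 2))
# compare the default Elkan path with the plain Lloyd ("full") path
for algo in ("elkan", "full"):
    t0 = time()
    KMeans(n_clusters=100, algorithm=algo).fit(X)
    print(f"{algo}: {time() - t0:.2f}s")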

@gittar
Contributor Author
gittar commented Aug 4, 2021 via email

@glemaitre
Member

Using a machine with many cores (48 physical cores, hyper-threading deactivated), I don't see any dramatic performance drop between the NumPy versions installed from PyPI. I used up to 36 cores (I could not use more since some other processes were running on the machine and were taking 8-10 cores).

I am under the impression that accounting for the hyper-threads might be one of the factors in the performance drop. However, it is mysterious why NumPy 1.18.5 installed from PyPI would limit itself to the physical cores while the conda-forge version would not.

@ogrisel
Member
ogrisel commented Aug 5, 2021

It could be a case of OpenMathLib/OpenBLAS#3187 (thanks @jeremiedbb, I had completely forgotten about this). So it could be related to how OpenBLAS assigns threads to specific CPUs. However I am still unsure why the use of OpenMP threads impacts OpenBLAS thread placement since they are not nested calls but sequential calls.

@ogrisel
Member
ogrisel commented Aug 5, 2021

Hum, maybe OpenBLAS is setting an affinity mask on the main thread to tie it to a specific CPU and then subsequent calls to OpenMP spawn threads all tied to the same CPU causing oversubscription. We indeed need to update the C reproducer of OpenMathLib/OpenBLAS#3187 to monitor the value of the cpus_allowed_list key in /proc/<pid>/status both for the main thread and for each thread inside the openmp loop.
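On Linux this can also be inspected from Python without touching the C reproducer; a rough sketch (the helper below just reads the standard /proc status files for every thread of the current process):

import glob

def print_cpus_allowed():
    # one status file per thread of the current process (Linux only)
    for status in sorted(glob.glob("/proc/self/task/*/status")):
        with open(status) as f:
            for line in f:
                if line.startswith("Cpus_allowed_list"):
                    print(status, line.strip())

print_cpus_allowed()  # call before and after KMeans.fit() to compare the affinity masks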

@jeremiedbb
Member
jeremiedbb commented Sep 17, 2021

Interesting: scikit-learn's KMeans inner operations use a GEMM routine from scipy, and therefore the OpenBLAS 0.3.9 shipped with SciPy, which is the same in both cases.

numpy should not be used in the inner loops of KMeans. So the difference in performance is really mysterious to me.

this problem only happens with algorithm="elkan"

Actually with elkan there's no BLAS call inside the inner loop. Instead there's a BLAS call between each iteration. So the relevant BLAS is indeed the one shipped with numpy. Therefore there might be a regression in OpenBLAS between 0.3.7 and 0.3.17. I'll experiment with the reproducer from OpenMathLib/OpenBLAS#3187 using these versions to see if this is indeed the same issue.
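To double-check which OpenBLAS that GEMM call resolves to, threadpoolctl can be used from Python as well (a small sketch; threadpool_info is the public introspection function of that package):

from threadpoolctl import threadpool_info
import numpy  # make sure numpy's OpenBLAS is loaded before inspecting

for lib in threadpool_info():
    if lib.get("internal_api") == "openblas":
        print(lib["filepath"], lib["version"], lib.get("threading_layer"))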

@jeremiedbb
Member
jeremiedbb commented Oct 8, 2021

I ran the same snippet with the latest versions of numpy, scipy and scikit-learn, just changing the OpenBLAS version and rebuilding everything from source, and I can reproduce the timings. It actually appeared with this change in OpenBLAS: OpenMathLib/OpenBLAS#2375.

However, it's not really a regression of OpenBLAS. The linked PR essentially changes the matrix size threshold below which OpenBLAS runs sequentially. The poor performance reported here was already there before, just for bigger matrices (in particular, a higher number of centers).

In conclusion this issue is the same as I reported in OpenMathLib/OpenBLAS#3187. The bad news is that it might not be a bug but an expected behavior :/
I'll discuss further with OpenBLAS folks to try to learn more about that. For now I don't think there's anything we can do about it on the sklearn side.

@gittar As a workaround, you can try to set the env vars OPENBLAS_NUM_THREADS and OMP_NUM_THREADS to half the number of cores each.
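For example, on the 8-core machine from the original report that would look like this (same invocation as the benchmark above):

OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 python src/bench.py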

@gittar
Contributor Author
gittar commented Oct 11, 2021

This seems to be a much harder problem than I expected when I reported it. Thanks for your effort in investigating this. If a fix will take long (or is not even possible for some reason), you might consider informing users that they can avoid a huge performance hit by sticking with numpy 1.18.5. Of course it would be way better if the issue could be solved.

@ogrisel
Member
ogrisel commented Oct 19, 2021

I think you can also change the number of threads used by OpenBLAS with OPENBLAS_NUM_THREADS=1 or https://github.com/joblib/threadpoolctl, or just install numpy, scipy and scikit-learn from conda-forge, in which case both scikit-learn and OpenBLAS can/will link to the same OpenMP runtime and the problem disappears.
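Programmatically, the threadpoolctl route would look roughly like this (a sketch; threadpool_limits is that package's public context manager, and here only the BLAS pools are capped):

import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.cluster import KMeans

X = np.random.random(size=(10000, 2))
# cap OpenBLAS to a single thread around the fit, leaving OpenMP untouched
with threadpool_limits(limits=1, user_api="blas"):
    KMeans(n_clusters=100).fit(X)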

@ageron
Contributor
ageron commented Nov 20, 2021

Not sure if this is related, but I just ran some tests, and using algorithm="elkan" is generally slower (and sometimes much slower) than algorithm="full" on various datasets generated using make_blobs() (I tried different n_samples, n_features, cluster_std, and different numbers of clusters). I basically couldn't find a dataset where Elkan was reliably faster than Full. I also tried both NumPy 1.18.5 and 1.19.5, with Scikit-Learn 1.0.1, but Elkan remained generally slower than Full:

[figure: histogram of Elkan vs. Full fit times over the benchmark configurations]

The experiments were done on Colab, like this:

from time import time

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def time_kmeans(X, n_clusters, algorithm, random_state):
  kmeans = KMeans(n_clusters=n_clusters, algorithm=algorithm, random_state=random_state)
  t0 = time()
  kmeans.fit(X)
  return time() - t0

iteration = 0
for n_samples in (10**3, 10**4, 10**5):
  for n_features in (2, 10, 100):
    for n_clusters in (2, 5, 10, 20, 50, 100):
      for cluster_std in (1e-3, 0.1, 1, 5):
        n_samples_per_cluster = n_samples // n_clusters
        X, y = make_blobs(n_samples=[n_samples_per_cluster] * n_clusters,
                          n_features=n_features,
                          cluster_std=cluster_std)
        iteration += 1
        time_elkan = time_kmeans(X, n_clusters, algorithm="elkan", random_state=iteration)
        time_full = time_kmeans(X, n_clusters, algorithm="full", random_state=iteration)
        winner = "Full" if time_full < time_elkan else "Elkan"
        print(
            f"n_samples={n_samples}, n_features={n_features}, "
            f"n_clusters={n_clusters}, cluster_std={cluster_std}, "
            f"time_elkan={time_elkan:.2f}, time_full={time_full:.2f}, "
            f"winner={winner}")

And the output is:

n_samples=1000, n_features=2, n_clusters=2, cluster_std=0.001, time_elkan=0.01, time_full=0.01, winner=Full
n_samples=1000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=0.11, time_full=0.07, winner=Full
n_samples=1000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=0.19, time_full=0.15, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=0.49, time_full=0.31, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=0.73, time_full=0.29, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=1.17, time_full=0.32, winner=Full
n_samples=1000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=1.21, time_full=0.30, winner=Full
n_samples=1000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.12, time_full=0.12, winner=Full
n_samples=1000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.09, time_full=0.08, winner=Full
n_samples=1000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.13, time_full=0.23, winner=Elkan
n_samples=1000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=0.49, time_full=0.63, winner=Elkan
n_samples=1000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.48, time_full=0.24, winner=Full
n_samples=1000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=0.48, time_full=0.26, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=0.75, time_full=0.42, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=0.61, time_full=0.43, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=0.88, time_full=0.53, winner=Full
n_samples=1000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=1.36, time_full=0.46, winner=Full
n_samples=1000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.08, time_full=0.15, winner=Elkan
n_samples=1000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.16, time_full=0.07, winner=Full
n_samples=1000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.27, time_full=0.28, winner=Elkan
n_samples=1000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.21, time_full=0.19, winner=Full
n_samples=1000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=0.25, time_full=0.26, winner=Elkan
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=0.76, time_full=0.61, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.43, winner=Full
n_samples=1000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=0.59, time_full=0.61, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=0.93, time_full=0.77, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=0.84, time_full=0.96, winner=Elkan
n_samples=1000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=1.37, time_full=0.94, winner=Full
n_samples=1000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=1.31, time_full=0.79, winner=Full
n_samples=10000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=0.14, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=0.31, time_full=0.18, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.29, time_full=0.16, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=0.53, time_full=0.21, winner=Full
n_samples=10000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=1.27, time_full=0.83, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=0.25, time_full=0.27, winner=Elkan
n_samples=10000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=1.17, time_full=0.58, winner=Full
n_samples=10000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=1.60, time_full=0.86, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=0.79, time_full=0.84, winner=Elkan
n_samples=10000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=0.82, time_full=0.70, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=1.99, time_full=1.34, winner=Full
n_samples=10000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=2.35, time_full=1.47, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=1.12, time_full=1.06, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=1.34, time_full=1.08, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=7.00, time_full=1.95, winner=Full
n_samples=10000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=7.06, time_full=1.92, winner=Full
n_samples=10000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.13, time_full=0.24, winner=Elkan
n_samples=10000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=0.40, time_full=0.32, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=0.30, time_full=0.17, winner=Full
n_samples=10000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=0.70, time_full=0.43, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=0.46, time_full=0.27, winner=Full
n_samples=10000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=0.48, time_full=0.51, winner=Elkan
n_samples=10000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=1.39, time_full=1.01, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=0.78, time_full=0.59, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=0.69, time_full=0.57, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=1.00, time_full=0.90, winner=Full
n_samples=10000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=3.47, time_full=2.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=1.29, time_full=1.15, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=1.45, time_full=1.36, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=2.58, time_full=1.63, winner=Full
n_samples=10000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=11.02, time_full=3.39, winner=Full
n_samples=10000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=0.34, time_full=0.22, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=0.34, time_full=0.37, winner=Elkan
n_samples=10000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=0.61, time_full=0.33, winner=Full
n_samples=10000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=0.50, time_full=0.56, winner=Elkan
n_samples=10000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=0.59, time_full=0.47, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=0.63, time_full=0.53, winner=Full
n_samples=10000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=1.02, time_full=1.10, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=0.82, time_full=0.81, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=0.82, time_full=0.89, winner=Elkan
n_samples=10000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=0.92, time_full=0.78, winner=Full
n_samples=10000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=1.77, time_full=1.56, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=2.10, time_full=1.71, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=1.93, time_full=1.82, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=1, time_elkan=2.10, time_full=1.77, winner=Full
n_samples=10000, n_features=100, n_clusters=50, cluster_std=5, time_elkan=2.89, time_full=2.49, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.001, time_elkan=4.13, time_full=3.24, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=0.1, time_elkan=3.97, time_full=3.15, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=1, time_elkan=4.20, time_full=3.46, winner=Full
n_samples=10000, n_features=100, n_clusters=100, cluster_std=5, time_elkan=5.92, time_full=4.10, winner=Full
n_samples=100000, n_features=2, n_clusters=2, cluster_std=0.1, time_elkan=0.30, time_full=0.38, winner=Elkan
n_samples=100000, n_features=2, n_clusters=2, cluster_std=5, time_elkan=1.31, time_full=1.18, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=0.1, time_elkan=0.46, time_full=0.47, winner=Elkan
n_samples=100000, n_features=2, n_clusters=5, cluster_std=1, time_elkan=1.16, time_full=0.63, winner=Full
n_samples=100000, n_features=2, n_clusters=5, cluster_std=5, time_elkan=2.77, time_full=1.77, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.001, time_elkan=0.97, time_full=0.71, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=0.1, time_elkan=0.80, time_full=0.73, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=1, time_elkan=2.18, time_full=1.16, winner=Full
n_samples=100000, n_features=2, n_clusters=10, cluster_std=5, time_elkan=4.82, time_full=2.35, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.001, time_elkan=1.36, time_full=1.23, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=0.1, time_elkan=1.66, time_full=1.44, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=1, time_elkan=4.42, time_full=2.82, winner=Full
n_samples=100000, n_features=2, n_clusters=20, cluster_std=5, time_elkan=11.02, time_full=4.66, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.001, time_elkan=3.61, time_full=3.55, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=0.1, time_elkan=3.89, time_full=3.49, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=1, time_elkan=18.86, time_full=9.43, winner=Full
n_samples=100000, n_features=2, n_clusters=50, cluster_std=5, time_elkan=27.77, time_full=12.31, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.001, time_elkan=8.33, time_full=7.79, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=0.1, time_elkan=9.55, time_full=8.30, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=1, time_elkan=59.08, time_full=20.13, winner=Full
n_samples=100000, n_features=2, n_clusters=100, cluster_std=5, time_elkan=70.57, time_full=23.11, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=0.1, time_elkan=0.51, time_full=0.42, winner=Full
n_samples=100000, n_features=10, n_clusters=2, cluster_std=1, time_elkan=0.44, time_full=0.61, winner=Elkan
n_samples=100000, n_features=10, n_clusters=2, cluster_std=5, time_elkan=0.85, time_full=0.64, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.001, time_elkan=0.55, time_full=0.63, winner=Elkan
n_samples=100000, n_features=10, n_clusters=5, cluster_std=0.1, time_elkan=0.77, time_full=0.58, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=1, time_elkan=0.80, time_full=0.61, winner=Full
n_samples=100000, n_features=10, n_clusters=5, cluster_std=5, time_elkan=1.70, time_full=1.00, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.001, time_elkan=1.05, time_full=0.87, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=0.1, time_elkan=1.06, time_full=0.80, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=1, time_elkan=1.23, time_full=1.14, winner=Full
n_samples=100000, n_features=10, n_clusters=10, cluster_std=5, time_elkan=3.28, time_full=2.13, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.001, time_elkan=1.96, time_full=1.55, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=0.1, time_elkan=2.17, time_full=2.00, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=1, time_elkan=2.31, time_full=1.86, winner=Full
n_samples=100000, n_features=10, n_clusters=20, cluster_std=5, time_elkan=6.73, time_full=4.03, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.001, time_elkan=4.87, time_full=4.19, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=0.1, time_elkan=4.92, time_full=4.53, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=1, time_elkan=6.05, time_full=5.50, winner=Full
n_samples=100000, n_features=10, n_clusters=50, cluster_std=5, time_elkan=35.26, time_full=18.50, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.001, time_elkan=10.04, time_full=9.49, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=0.1, time_elkan=10.54, time_full=9.01, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=1, time_elkan=16.26, time_full=12.68, winner=Full
n_samples=100000, n_features=10, n_clusters=100, cluster_std=5, time_elkan=140.66, time_full=44.06, winner=Full
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.001, time_elkan=1.19, time_full=1.25, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=0.1, time_elkan=1.45, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=1, time_elkan=1.47, time_full=1.49, winner=Elkan
n_samples=100000, n_features=100, n_clusters=2, cluster_std=5, time_elkan=1.70, time_full=1.55, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.001, time_elkan=2.12, time_full=1.95, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=0.1, time_elkan=2.24, time_full=2.16, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=1, time_elkan=2.19, time_full=2.06, winner=Full
n_samples=100000, n_features=100, n_clusters=5, cluster_std=5, time_elkan=2.74, time_full=2.13, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.001, time_elkan=3.53, time_full=3.39, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=0.1, time_elkan=3.61, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=1, time_elkan=3.67, time_full=3.42, winner=Full
n_samples=100000, n_features=100, n_clusters=10, cluster_std=5, time_elkan=5.12, time_full=4.50, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.001, time_elkan=6.02, time_full=5.40, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=0.1, time_elkan=6.02, time_full=5.48, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=1, time_elkan=6.07, time_full=5.51, winner=Full
n_samples=100000, n_features=100, n_clusters=20, cluster_std=5, time_elkan=16.14, time_full=17.33, winner=Elkan
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.001, time_elkan=14.73, time_full=13.29, winner=Full
n_samples=100000, n_features=100, n_clusters=50, cluster_std=0.1, time_elkan=14.86, time_full=13.29, winner=Full
...

@jeremiedbb
Member
jeremiedbb commented Nov 20, 2021

Thanks for these exhaustive benchmarks @ageron!
It confirms some intuition we had, but I imagined elkan would be better in more cases. In addition, when elkan is faster it's usually only slightly faster, while "full" can be a lot faster than "elkan". Given these results, I think it's definitely time for us to change the default value of the algorithm parameter. Would you be interested in submitting a PR for that?

@gittar
Contributor Author
gittar commented Nov 21, 2021

These are interesting results, @ageron. However, I think they cannot explain the huge performance differences I observed. I did my simulations mainly with one specific version of scikit-learn (0.23.2) and did not set the "algorithm" parameter, so it kept its default "auto", which is mapped to "elkan". Your results suggest that "auto" should instead be mapped to "full", but according to your experiments that would only bring a moderate improvement (visually estimated: 30%), and presumably for both older and newer versions of NumPy, which would leave the performance gap as is.

@ageron
Contributor
ageron commented Nov 21, 2021

@gittar , you're right. I'll open another issue regarding Elkan vs Full.

@ogrisel
Member
ogrisel commented Dec 15, 2021

Actually @ageron's benchmark results might be biased against Elkan because the Lloyd solver does not have the threadpool interaction problem.

It would be possible to check this by re-running the benchmarks in a pure conda-forge environment where scipy's OpenBLAS and scikit-learn's compiled extensions are both linked to the same OpenMP runtime to use a single threadpool.

conda create -n all-openmp -y -c conda-forge scikit-learn "libopenblas=*=*openmp*"
conda activate all-openmp

We can check that the env has a single threadpool by checking that there is a single OpenBLAS runtime and that its threading layer is "openmp":

python -m threadpoolctl -i scipy.linalg -i sklearn

EDIT: added missing -i before sklearn

For instance, on my laptop I get the following:

[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ogrisel/mambaforge/envs/all-openmp/lib/libopenblas_vortexp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "VORTEX",
    "num_threads": 8
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libomp",
    "filepath": "/Users/ogrisel/mambaforge/envs/all-openmp/lib/libomp.dylib",
    "version": null,
    "num_threads": 8
  }
]

@ogrisel
Member
ogrisel commented Dec 15, 2021

Note that on Linux, when building scikit-learn from source with gcc, scikit-learn will use the GNU libgomp OpenMP runtime library instead of the LLVM libomp used by default in conda-forge packages (see https://conda-forge.org/docs/maintainer/knowledge_base.html#openmp).

To make conda-forge use libgomp instead of libomp, it should be possible to install the env as follows:

conda create -n all-libgomp -y -c conda-forge scikit-learn "libopenblas=*=*openmp*" "_openmp_mutex=*=*_gnu" compilers
conda activate all-libgomp

and then compile scikit-learn using the gcc compiler from that conda env instead of the system compiler. I have not tested this setup yet.

@ageron
Contributor
ageron commented Dec 15, 2021

Hi @ogrisel,

I just ran the benchmark again, this time on my laptop, in a conda environment, and Elkan was still generally slower than Lloyd (it produced roughly the same histogram as above). It was based on OpenBLAS with OpenMP as the threading layer:

[
  {
    "user_api": "blas",
    "internal_api": "openblas",
    "prefix": "libopenblas",
    "filepath": "/Users/ageron/miniconda3/envs/test/lib/libopenblasp-r0.3.18.dylib",
    "version": "0.3.18",
    "threading_layer": "openmp",
    "architecture": "Haswell",
    "num_threads": 16
  },
  {
    "user_api": "openmp",
    "internal_api": "openmp",
    "prefix": "libomp",
    "filepath": "/Users/ageron/miniconda3/envs/test/lib/libomp.dylib",
    "version": null,
    "num_threads": 16
  }
]

@ogrisel
Member
ogrisel commented Dec 15, 2021

Thank you very much for checking. We won't have to revert your PR ;)

@gittar
Contributor Author
gittar commented Jan 7, 2022

I made an experiment with nearly current versions of all packages and that looks quite promising:

time:  8.60, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  8.94, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  7.02, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2
time:  6.75, Err: 0.01361583 python= 3.9.7 sklearn= 1.0.1 numpy 1.21.2

However, with the most recent versions the test program runs a lot slower:

time: 18.05, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 18.05, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 17.42, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0
time: 17.53, Err: 0.01361583 python= 3.10.0 sklearn= 1.0.2 numpy 1.22.0

And with older versions I still get the observed very poor performance:

time: 39.49, Err: 0.01360200 python= 3.8.12 sklearn= 0.23.2 numpy 1.19.0

At least this seemingly makes it possible to get rid of the numpy 1.18.5 dependency for good performance, which already helps in my case.

@jeremiedbb
Member

The default algorithm was changed to "lloyd", which does not suffer from this performance issue.

The issue with "elkan" has always been there and does not depend on the version of numpy (only the sizes of the matrices that trigger it depend on the version of numpy). It's actually an OpenBLAS / OpenMP issue that won't be fixed on the scikit-learn side. The best fix will probably come from a unification of the BLAS/OpenMP packaging in the scientific Python ecosystem.
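For users on releases where the slow path is still the default, selecting Lloyd's algorithm explicitly side-steps it (a usage sketch; the value is spelled "full" up to scikit-learn 1.0 and "lloyd" from 1.1 on):

from sklearn.cluster import KMeans

# "lloyd" in scikit-learn >= 1.1; use algorithm="full" on older versions
km = KMeans(n_clusters=100, algorithm="lloyd")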
