Benchmark NMF and SGDClassifier on MKL vs OpenBLAS #9429 · scikit-learn/scikit-learn

Open · amueller opened this issue Jul 21, 2017 · 27 comments
Labels: help wanted · module:decomposition · module:linear_model · Needs Benchmarks

Comments

@amueller (Member) commented Jul 21, 2017

At SciPy, the Intel Python distribution people cited a speedup of ~10x for NMF and SGDClassifier on MKL vs OpenBLAS. It would be great if someone could try to reproduce this and find the bottlenecks.
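For anyone attempting a reproduction, a first sanity check is to confirm which BLAS numpy is actually linked against in each environment. A minimal sketch (the exact output format varies across numpy versions):

import numpy as np

# Print the BLAS/LAPACK build configuration; look for "mkl" or
# "openblas" among the reported library names.
np.__config__.show()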

amueller added the Easy and Need Contributor labels on Jul 21, 2017
@piyush0609

@amueller I am interested. Do you have the citation link for this?

@souravsingh (Contributor) commented Jul 31, 2017

@amueller I have tested the matrix decomposition example on the Olivetti faces dataset with both MKL and OpenBLAS. Here are the results.

Under MKL and the Intel Python distribution:

Dataset consists of 400 faces
Extracting the top 6 Eigenfaces - PCA using randomized SVD...
done in 0.062s
Extracting the top 6 Non-negative components - NMF...
done in 0.339s
Extracting the top 6 Independent components - FastICA...
done in 0.125s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 0.842s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 0.618s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.071s
Extracting the top 6 Factor Analysis components - FA...
done in 0.065s

And under stock Python and OpenBLAS:

Dataset consists of 400 faces
Extracting the top 6 Eigenfaces - PCA using randomized SVD...
done in 0.079s
Extracting the top 6 Non-negative components - NMF...
done in 0.646s
Extracting the top 6 Independent components - FastICA...
done in 0.190s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 0.799s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 0.643s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.079s
Extracting the top 6 Factor Analysis components - FA...
done in 0.103s

Only MiniBatchSparsePCA shows better performance with OpenBLAS. For the benchmark I used a Google Cloud instance in us-west with 4 cores and Ubuntu 16.04 as the OS.

Would you like me to do more benchmarks?

@piyush0609

@souravsingh can you try it with a more powerful instance?

@souravsingh (Contributor)

Here are the benchmarks done on a Google Cloud HighCPU-8 instance (I couldn't get a HighCPU-16 instance).

For MKL:

Extracting the top 6 Eigenfaces - PCA using randomized SVD...
done in 0.046s
Extracting the top 6 Non-negative components - NMF...
done in 0.260s
Extracting the top 6 Independent components - FastICA...
done in 0.137s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 0.812s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 0.589s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.067s
Extracting the top 6 Factor Analysis components - FA...
done in 0.048s

And for OpenBLAS:

Extracting the top 6 Eigenfaces - PCA using randomized SVD...
done in 0.111s
Extracting the top 6 Non-negative components - NMF...
done in 0.536s
Extracting the top 6 Independent components - FastICA...
done in 0.155s
Extracting the top 6 Sparse comp. - MiniBatchSparsePCA...
done in 0.784s
Extracting the top 6 MiniBatchDictionaryLearning...
done in 0.870s
Extracting the top 6 Cluster centers - MiniBatchKMeans...
done in 0.108s
Extracting the top 6 Factor Analysis components - FA...
done in 0.145s

@rth (Member) commented Jul 31, 2017

@souravsingh Could you please provide more information about your setup in both environments? In particular, did you use the official versions of scikit-learn and numpy linked against MKL, or builds from the Intel conda channel, where they mention significant speedups?

In my understanding, in addition to linking against MKL, the Intel channel also adds some optimizations/solvers to the source code of both numpy and scikit-learn, which might explain the difference in performance. Below is the diff between the scikit-learn from the default conda channel (linked against MKL) and the one from Intel, for linux-64/scikit-learn-0.18.1-py35*.tar.bz2, excluding binary files:

diff -r anconda-lib/python3.5/site-packages/sklearn/__init__.py intel-lib/python3.5/site-packages/sklearn/__init__.py
69c69
<                'svm', 'tree', 'discriminant_analysis',
---
>                'svm', 'tree', 'discriminant_analysis', 'daal4sklearn',
72a73,74
>     from .daal4sklearn import dispatcher
>     dispatcher.enable()
87a90
> 
diff -r anconda-lib/python3.5/site-packages/sklearn/cluster/tests/test_k_means.py intel-lib/python3.5/site-packages/sklearn/cluster/tests/test_k_means.py
7a8
> from sklearn.utils.testing import assert_allclose
550c551
<     # centers must not been collapsed
---
>     # centers must not have been collapsed
807,812c808,813
<             assert_array_almost_equal(inertia[np.float32], inertia[np.float64],
<                                       decimal=4)
<             assert_array_almost_equal(X_new[np.float32], X_new[np.float64],
<                                       decimal=4)
<             assert_array_almost_equal(centers[np.float32], centers[np.float64],
<                                       decimal=4)
---
>             assert_allclose(inertia[np.float32], inertia[np.float64],
>                                       atol=1e-4, rtol=1e-4)
>             assert_allclose(X_new[np.float32], X_new[np.float64],
>                                       atol=1e-3, rtol=1e-4)
>             assert_allclose(centers[np.float32], centers[np.float64],
>                                       atol=1e-4, rtol=1e-4)
Only in intel-lib/python3.5/site-packages/sklearn: daal4sklearn
diff -r anconda-lib/python3.5/site-packages/sklearn/mixture/tests/test_gaussian_mixture.py intel-lib/python3.5/site-packages/sklearn/mixture/tests/test_gaussian_mixture.py
983c983
<     assert_greater(gmm2.lower_bound_, gmm1.lower_bound_)
---
>     assert_greater_equal(gmm2.lower_bound_, gmm1.lower_bound_)
diff -r anconda-lib/python3.5/site-packages/sklearn/setup.py intel-lib/python3.5/site-packages/sklearn/setup.py
42a43
>     config.add_subpackage('daal4sklearn')
diff -r anconda-lib/python3.5/site-packages/sklearn/tests/test_common.py intel-lib/python3.5/site-packages/sklearn/tests/test_common.py
146a147
>                                       \.daal4sklearn(\.|$)|

together with additional files in:

ls intel-lib/python3.5/site-packages/sklearn/daal4sklearn  
__init__.py  dispatcher.py  k_means.py  linear.py  pairwise.py  pca.py  ridge.py  utils.py

@rth (Member) commented Jul 31, 2017

@souravsingh Also, could you please provide the script you used for benchmarking?

@amueller (Member, Author)

I would run something that takes significantly more time. If your benchmark only runs for a fraction of a second, it will most likely be dominated by overhead, not the actual computation. Try running a benchmark that takes closer to a minute.

@souravsingh (Contributor) commented Aug 1, 2017

For the script, I used the Olivetti faces decomposition example from the scikit-learn docs.

@souravsingh (Contributor)

@amueller I ran the topic extraction example in both environments. Here are the timings.

On OpenBLAS

Loading dataset...
done in 2.202s.
Extracting tf-idf features for NMF...
done in 1.326s.
Fitting the NMF model with tf-idf features, n_samples=5000 and n_features=5000...
done in 74.643s.

For Intel MKL

Loading dataset...
done in 1.898s.
Extracting tf-idf features for NMF...
done in 1.340s.
Fitting the NMF model with tf-idf features, n_samples=5000 and n_features=5000...
done in 73.420s.

@amueller (Member, Author) commented Aug 4, 2017

Nice, lol. Maybe try with dense data and different numbers of features and samples (not the text data; maybe subsets of MNIST, or random data, or whatever).

@amueller (Member, Author) commented Aug 4, 2017

How did you set up the system, @souravsingh?

@souravsingh (Contributor) commented Aug 4, 2017

@amueller I created two separate conda environments: one with the Intel Python distribution and a second one with OpenBLAS. Here are the packages in both environments.

Intel Python conda environment:

icc_rt                    16.0.3                  intel_6  [intel]  intel
intelpython               2017.0.3                      4    intel
mkl                       2017.0.3                intel_6  [intel]  intel
numpy                     1.12.1             py27_intel_8  [intel]  intel
openmp                    2017.0.3                intel_8    intel
openssl                   1.0.2k                  intel_3  [intel]  intel
pandas                    0.20.1          np112py27_intel_1  [intel]  intel
pip                       9.0.1              py27_intel_0  [intel]  intel
pydaal                    2017.0.3.20170412 np112py27_intel_3  [intel]  intel
python                    2.7.13                  intel_1  [intel]  intel
python-dateutil           2.6.0              py27_intel_0  [intel]  intel
pytz                      2017.2             py27_intel_0  [intel]  intel
readline                  6.2                           2  
scikit-learn              0.18.1             py27_intel_6  [intel]  intel
scipy                     0.19.0          np112py27_intel_2  [intel]  intel
setuptools                27.2.0             py27_intel_0  [intel]  intel
six                       1.10.0             py27_intel_7  [intel]  intel
sqlite                    3.13.0                 intel_14  [intel]  intel
tbb                       2017.0.7           py27_intel_2  [intel]  intel
tcl                       8.6.4                  intel_16  [intel]  intel
tk                        8.6.4                  intel_26  [intel]  intel
wheel                     0.29.0             py27_intel_5  [intel]  intel
zlib                      1.2.11                  intel_2  [intel]  intel

And the OpenBLAS conda environment:

blas                      1.1                    openblas    conda-forge
ca-certificates           2017.7.27.1                   0    conda-forge
distribute                0.6.45                   py27_1  
libgfortran               3.0.0                         1  
ncurses                   5.9                          10    conda-forge
numpy                     1.13.1          py27_blas_openblas_200  [blas_openblas]  conda-forge
openblas                  0.2.19                        2    conda-forge
openssl                   1.0.2l                        0    conda-forge
pandas                    0.20.3                   py27_1    conda-forge
pip                       1.4.1                    py27_0  
python                    2.7.13                        1    conda-forge
python-dateutil           2.6.1                    py27_0    conda-forge
pytz                      2017.2                   py27_0    conda-forge
readline                  6.2                           0    conda-forge
scikit-learn              0.18.2          np113py27_blas_openblas_200  [blas_openblas]  conda-forge
scipy                     0.19.1          py27_blas_openblas_201  [blas_openblas]  conda-forge
six                       1.10.0                   py27_1    conda-forge
sqlite                    3.13.0                        1    conda-forge
tk                        8.5.19                        2    conda-forge
zlib                      1.2.11                        0    conda-forge

Do you want me to make any changes to either of the two environments, or are they fine?

@TomDLT (Member) commented Aug 4, 2017

I would suggest using random data with very different sizes for (n_samples, n_features, n_components): different n_samples/n_features ratios, and especially very different values of n_components. IIRC this can change the bottleneck of NMF from the Cython coordinate descent to the numpy dot product, which should strongly affect the MKL speedup.

Also, I hope the ~10x speedup for NMF was not with the (removed) projected gradient solver. We might also want to check the multiplicative update solver, which relies more intensively on numpy operations; a grid like the sketch below would cover these cases.
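A minimal sketch of such a grid, assuming scikit-learn >= 0.19 (where both the 'cd' and 'mu' solvers are available); the shapes, component counts and max_iter are illustrative choices, not prescribed values:

import numpy as np
from time import time
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)

# Sweep the n_samples/n_features ratio and n_components; small
# n_components keeps the Cython coordinate descent dominant, large
# n_components shifts more work onto BLAS-backed dot products.
for n_samples, n_features in [(2000, 10000), (10000, 2000), (5000, 5000)]:
    X = rng.random_sample((n_samples, n_features))
    for n_components in [5, 50, 200]:
        for solver in ["cd", "mu"]:
            model = NMF(n_components=n_components, solver=solver,
                        init="random", random_state=0, max_iter=50)
            t0 = time()
            model.fit(X)
            print("solver=%s (%d, %d) k=%d: %.3fs"
                  % (solver, n_samples, n_features, n_components,
                     time() - t0))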

@amueller (Member, Author) commented Aug 4, 2017

Great, the environments look good, and I think @TomDLT has the right idea ;)

You can try completely random positive data or low-rank data + noise.

@souravsingh (Contributor)

@amueller I have run a small benchmark using matrices of random values with varying n_samples and n_features and a constant n_components.

Here are the results on Intel MKL:

X done in 4.363s.
Y done in 13.370s.
Z done in 38.817s.

Results on OpenBLAS:

X done in 19.249s.
Y done in 68.057s.
Z done in 206.880s.

Here is the code I used:

import numpy as np
from sklearn.decomposition import NMF
from time import time

# Random dense non-negative matrices of increasing size.
X = np.random.random((5000, 3000))
Y = np.random.random((10000, 5000))
Z = np.random.random((15000, 10000))

model = NMF(n_components=5, init='random', random_state=0)
t0 = time()
model.fit(X)
print("X done in %0.3fs." % (time() - t0))

model1 = NMF(n_components=5, init='random', random_state=0)
t1 = time()
model1.fit(Y)
print("Y done in %0.3fs." % (time() - t1))

model2 = NMF(n_components=5, init='random', random_state=0)
t2 = time()
model2.fit(Z)
print("Z done in %0.3fs." % (time() - t2))

@amueller (Member, Author) commented Aug 9, 2017

I think their benchmarks are here: https://github.com/dvnagorny/sklearn_benchs (though I'm not entirely sure these are the relevant ones).

@souravsingh (Contributor)

True, these benchmarks aren't relevant to the issue.

@amueller (Member, Author) commented Aug 9, 2017

@souravsingh thanks for providing the benchmarks. Looks like about a 5x speedup with MKL. Do you know whether that's due to using multiple cores?

@souravsingh (Contributor)

@amueller I don't think the speedup could be due to multiple cores, since NMF doesn't have an n_jobs parameter. But I could be wrong.

@jnothman (Member) commented Aug 10, 2017 via email

@lesshaste

It would be worth simply observing top during execution to be sure what is happening in practice. https://askubuntu.com/q/257248/183825 has a number of interesting tips on this topic too.

@souravsingh (Contributor) commented Aug 10, 2017

@lesshaste I was able to confirm that the program uses multiple cores. The run on MKL used up to 4 cores for the computation, but the run on OpenBLAS used only up to 2 cores, with one core (cpu4) at 100% usage.

@ogrisel (Member) commented Sep 1, 2017

Note: you can control the number of threads used by OpenBLAS or MKL by setting the OPENBLAS_NUM_THREADS=4 and MKL_NUM_THREADS=4 environment variables, for instance; a sketch of doing this from Python follows.
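A minimal sketch; the key assumption is that the variables are set before numpy (and hence the BLAS) is first imported:

import os

# Must be set before numpy is imported, otherwise the BLAS thread
# pool has already been initialized with its default size.
os.environ["OPENBLAS_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import numpy as np

# Any BLAS-backed operation below now uses at most 4 threads,
# e.g. the GEMM behind this dot product.
a = np.random.random((2000, 2000))
b = np.dot(a, a)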

@ogrisel (Member) commented Sep 1, 2017

When using the Anaconda MKL on a laptop with 2 physical Intel cores, I get the following timings for the above benchmark script (I removed the long-running one). I used scikit-learn 0.19.0 with MKL from Anaconda and OpenBLAS from conda-forge in two different conda environments:

(mkl-from-anaconda) 0 [~]$ python /tmp/bench_nmf.py
X done in 7.607s.
Y done in 24.316s.
(openblas-from-conda-forge) 0 [~]$ python /tmp/bench_nmf.py
X done in 11.835s.
Y done in 39.979s.

So MKL is a bit faster, but not by such a big margin. I have not tried the Intel Python distribution yet. Maybe they patched scikit-learn in their distribution to make it work faster and it's not just about MKL vs OpenBLAS.

@rth (Member) commented Sep 1, 2017

> I have not tried the Intel Python distribution yet. Maybe they patched scikit-learn in their distribution to make it work faster and it's not just about MKL vs OpenBLAS.

@ogrisel They have patched scikit-learn to use their Data Analytics Acceleration Library (DAAL); see the diff in #9429 (comment). Also related to #9430 ...

@souravsingh (Contributor)

@ogrisel Can you show the code that was used for the benchmark? I can try running the same benchmarks on the Intel Python distribution.

@amueller (Member, Author)

Here's what they used:

For the non-negative matrix factorization benchmark we used

    https://github.com/scikit-learn/scikit-learn/blob/0.16.X/benchmarks/bench_plot_nmf.py#L149

with the following data sizes:

    samples_range = np.linspace(2000, 3001, 2, dtype=np.int)
    features_range = np.linspace(2000, 3001, 2, dtype=np.int)
    timeset, err = benchmark(samples_range, features_range)

For SGD we used

    https://github.com/scikit-learn/scikit-learn/blob/0.16.X/benchmarks/bench_sgd_regression.py#L26

with the following data sizes:

    list_n_samples = np.linspace(10000, 100000, 5).astype(np.int)
    list_n_features = [100, 1000, 10000]
    n_test = 1000
    noise = 0.1
    alpha = 0.01
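For anyone who wants to rerun the NMF part against a current scikit-learn rather than the 0.16.X script, here is a rough self-contained approximation using the same problem sizes; make_low_rank_matrix, the abs() to keep the data non-negative, effective_rank=50 and n_components=30 are my assumptions, not Intel's settings:

import numpy as np
from time import time
from sklearn.datasets import make_low_rank_matrix
from sklearn.decomposition import NMF

samples_range = np.linspace(2000, 3001, 2).astype(int)
features_range = np.linspace(2000, 3001, 2).astype(int)

for n_samples in samples_range:
    for n_features in features_range:
        # abs() keeps the low-rank data non-negative, as NMF requires.
        X = np.abs(make_low_rank_matrix(n_samples, n_features,
                                        effective_rank=50,
                                        random_state=0))
        model = NMF(n_components=30, init="random", random_state=0)
        t0 = time()
        model.fit(X)
        print("(%d, %d) done in %.3fs"
              % (n_samples, n_features, time() - t0))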

cmarmo added the module:decomposition, module:linear_model and Needs Benchmarks labels, and removed the Easy label, on Dec 20, 2021