Benchmark NMF and SGDClassifier on MKL vs OpenBlas #9429
At SciPy, the Intel Python distribution people cited a speedup of ~10x for NMF and SGDClassifier on MKL vs OpenBLAS. It would be great if someone could try to reproduce this and find the bottlenecks.

Comments
@amueller I am interested. Do you have the citation link for this?
@amueller I have tested the matrix decomposition on the Olivetti Faces dataset with both MKL and OpenBLAS. Here are the results. Under MKL and the Intel Python distribution:

And under stock Python with OpenBLAS:
Would you like me to do more benchmarks?
@souravsingh Can you try it with a more powerful instance?
Here are the benchmarks run on a Google Cloud HighCPU-8 instance (I couldn't get a HighCPU-16 instance). For MKL:

And for OpenBLAS:
@souravsingh Could you please provide more information about your setup in both environments? In particular, did you use the official builds of scikit-learn & numpy linked against MKL, or builds from the intel conda distribution, where they mention significant speedups? In my understanding, in addition to linking against MKL, the intel channel also adds some optimizations / solvers to the source code of both numpy and scikit-learn, which might explain the difference in performance. Below is the diff between scikit-learn from the default conda channel (linked against MKL) and the one from intel:

```diff
diff -r anconda-lib/python3.5/site-packages/sklearn/__init__.py intel-lib/python3.5/site-packages/sklearn/__init__.py
69c69
< 'svm', 'tree', 'discriminant_analysis',
---
> 'svm', 'tree', 'discriminant_analysis', 'daal4sklearn',
72a73,74
> from .daal4sklearn import dispatcher
> dispatcher.enable()
87a90
>
diff -r anconda-lib/python3.5/site-packages/sklearn/cluster/tests/test_k_means.py intel-lib/python3.5/site-packages/sklearn/cluster/tests/test_k_means.py
7a8
> from sklearn.utils.testing import assert_allclose
550c551
< # centers must not been collapsed
---
> # centers must not have been collapsed
807,812c808,813
< assert_array_almost_equal(inertia[np.float32], inertia[np.float64],
< decimal=4)
< assert_array_almost_equal(X_new[np.float32], X_new[np.float64],
< decimal=4)
< assert_array_almost_equal(centers[np.float32], centers[np.float64],
< decimal=4)
---
> assert_allclose(inertia[np.float32], inertia[np.float64],
> atol=1e-4, rtol=1e-4)
> assert_allclose(X_new[np.float32], X_new[np.float64],
> atol=1e-3, rtol=1e-4)
> assert_allclose(centers[np.float32], centers[np.float64],
> atol=1e-4, rtol=1e-4)
Only in intel-lib/python3.5/site-packages/sklearn: daal4sklearn
diff -r anconda-lib/python3.5/site-packages/sklearn/mixture/tests/test_gaussian_mixture.py intel-lib/python3.5/site-packages/sklearn/mixture/tests/test_gaussian_mixture.py
983c983
< assert_greater(gmm2.lower_bound_, gmm1.lower_bound_)
---
> assert_greater_equal(gmm2.lower_bound_, gmm1.lower_bound_)
diff -r anconda-lib/python3.5/site-packages/sklearn/setup.py intel-lib/python3.5/site-packages/sklearn/setup.py
42a43
> config.add_subpackage('daal4sklearn')
diff -r anconda-lib/python3.5/site-packages/sklearn/tests/test_common.py intel-lib/python3.5/site-packages/sklearn/tests/test_common.py
146a147
> \.daal4sklearn(\.|$)|
```

together with additional files in `intel-lib/python3.5/site-packages/sklearn/daal4sklearn` (output of `ls`):

```
__init__.py  dispatcher.py  k_means.py  linear.py  pairwise.py  pca.py  ridge.py  utils.py
```
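The `dispatcher.enable()` call shown in the diff reroutes selected estimators to DAAL-backed implementations at import time. A schematic sketch of that kind of enable/disable dispatch pattern (a hypothetical illustration, not the actual daal4sklearn code):

```python
# Hypothetical sketch of an enable()/disable() dispatcher in the spirit
# of daal4sklearn -- not the actual daal4sklearn code.
from sklearn.cluster import KMeans

_original_kmeans_fit = KMeans.fit

def _accelerated_kmeans_fit(self, X, y=None):
    # A DAAL-backed routine would be called here; this illustration
    # simply falls back to the stock scikit-learn implementation.
    return _original_kmeans_fit(self, X, y)

def enable():
    """Reroute KMeans.fit through the accelerated wrapper."""
    KMeans.fit = _accelerated_kmeans_fit

def disable():
    """Restore the stock scikit-learn implementation."""
    KMeans.fit = _original_kmeans_fit
```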
@souravsingh Also, could you please provide the script you used for benchmarking?
I would run something that takes significantly more time. If your benchmark only runs for a fraction of a second, the timing will most likely be dominated by overhead, not the actual computation. Try running a benchmark that takes closer to a minute.
For the script, I used the Olivetti Faces decomposition example from the scikit-learn docs.
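For reference, the core of that example boils down to roughly this (a sketch, not the exact script from the docs; the component count and tolerance are illustrative):

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF

# 400 face images of 64x64 pixels, flattened to a (400, 4096) matrix.
faces = fetch_olivetti_faces(shuffle=True, random_state=0).data
NMF(n_components=6, tol=5e-3).fit(faces)
```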
@amueller I ran the Topic Extraction example in both environments. Here are the timings. On OpenBLAS:

For Intel MKL:
Nice, lol. Maybe try with dense data and different numbers of features and samples (not the text data; maybe subsets of MNIST, or random data, or whatever).
How did you set up the system, @souravsingh?
@amueller I created two separate conda environments: one with the Intel Python distribution and one with OpenBLAS. Here are the packages in both environments. The Intel Python environment:

And the OpenBLAS environment:

Do you want me to make any changes to either environment, or is this fine?
I would suggest using random data with very different sizes for (n_samples, n_features, n_components): different ratios of n_samples/n_features, and especially very different values of n_components. Also, I hope the ~10x speedup for NMF was not with the (removed) projected gradient solver.
Great, the environments look good, and I think @TomDLT has the right idea ;) You can try completely random positive data or low-rank data + noise.
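A minimal sketch of such a benchmark over completely random positive data and low-rank data + noise (all shapes and component counts below are illustrative):

```python
import time

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)

# Vary the n_samples/n_features ratio and n_components independently.
for n_samples, n_features, n_components in [
        (2000, 200, 10), (200, 2000, 10), (2000, 2000, 100)]:
    # Completely random positive data.
    X_rand = rng.rand(n_samples, n_features)
    # Low-rank positive data plus a little noise.
    W = rng.rand(n_samples, n_components)
    H = rng.rand(n_components, n_features)
    X_lowrank = np.dot(W, H) + 0.01 * rng.rand(n_samples, n_features)

    for name, X in (("random", X_rand), ("low-rank", X_lowrank)):
        tic = time.time()
        NMF(n_components=n_components, random_state=0).fit(X)
        print("%8s data, shape=(%d, %d), k=%d: %.2f s"
              % (name, n_samples, n_features, n_components,
                 time.time() - tic))
```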
@amueller I have done a small benchmark using matrices of random values of varying sizes. Here are the results on Intel MKL:

Results on OpenBLAS:

Here is the code I used:
I think their benchmarks are here: https://github.com/dvnagorny/sklearn_benchs (though I'm not entirely sure these are the relevant ones).
True, these benchmarks aren't relevant to the issue.
@souravsingh thanks for providing the benchmarks. Looks like about a 5x speedup with MKL. Do you know whether that's due to using multiple cores?
@amueller I don't think the speedup could be due to multiple cores, since NMF doesn't have an n_jobs parameter. But I could be wrong.
Some parallel processing occurs due to the numpy configuration: the BLAS library numpy is linked against can use multiple threads internally, independently of n_jobs.
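To see which BLAS numpy is linked against (and hence whose threading is in play), `numpy.show_config()` is a quick check:

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration, e.g. mkl_rt for MKL
# or openblas for OpenBLAS.
np.show_config()
```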
It would be worth simply observing "top" during the execution to be sure what is happening in practice. https://askubuntu.com/q/257248/183825 has a number of interesting tips on this topic too.
@lesshaste I was able to confirm the usage of multiple cores. The program run with MKL used up to 4 cores for computation, but the program run with OpenBLAS used only up to 2 cores, with one core (cpu4) at 100% usage.
Note: you can control the number of threads used by OpenBLAS or MKL by setting the `OPENBLAS_NUM_THREADS` or `MKL_NUM_THREADS` environment variables, respectively.
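For example, to pin both libraries to a single thread for an apples-to-apples comparison, the variables must be set before numpy loads the BLAS:

```python
import os

# Set before importing numpy, otherwise the BLAS thread pool
# is already initialized.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"  # for OpenMP-based builds

import numpy as np  # imported after setting the environment variables
```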
When using the anaconda MKL on a laptop with 2 physical Intel cores, I get the following timings for the above benchmark script (I removed the long-running one). I use scikit-learn 0.19.0 with MKL from anaconda and OpenBLAS from conda-forge, in two different conda environments:
So MKL is a bit faster, but not by a large margin. I have not yet tried the Intel Python distribution. Maybe they patched scikit-learn in their distribution to make it faster, and it's not just about MKL vs OpenBLAS.
@ogrisel They have patched scikit-learn to use their Data Analytics Acceleration Library (DAAL); see the diff in #9429 (comment). Also related to #9430...
@ogrisel Can you show the code which was used for conducting the benchmark? I can try to run the same benchmarks on the Intel Python distribution.
Here's what they used: