[MRG] Use nomkl on CircleCI. by lesteve · Pull Request #12636 · scikit-learn/scikit-learn · GitHub

[MRG] Use nomkl on CircleCI. #12636


Closed · wants to merge 3 commits

Conversation

@lesteve (Member) commented Nov 21, 2018

This should work around the segmentation fault we see in master and in the 0.20.1 PR.

Fix #12630 and maybe the segfault seen in #12383.

For some reason there is a segmentation fault with MKL on CircleCI.

[doc build]
@lesteve mentioned this pull request Nov 21, 2018
@rth (Member) left a comment

Assuming CI passes..

@rth (Member) commented Nov 21, 2018

Now we get a segfault on Python 2, still in plot_stock_market, while it passed on Python 3.

At least, consistency is reassuring :)

@ogrisel (Member) commented Nov 21, 2018

So please also add it for Python 2, and add a comment with the URL that explains why we have this workaround:

https://github.com/scikit-learn/scikit-learn/pull/12383#issuecomment-440612975

@rth (Member) commented Nov 21, 2018

I wonder what would happen with dependencies installed from the conda-forge channel..

@lesteve (Member, Author) commented Nov 21, 2018

I wonder what would happen with dependencies installed from the conda-forge channel..

I just pushed a commit trying on conda-forge.

If someone has a suggestion about what the underlying problem is, I am interested. It does not seem to make a lot of sense right now!

@rth (Member) commented Nov 21, 2018

Note that master also started to segfault with Python 3 on Nov 20, which corresponds to the MKL update on the default conda channel: https://anaconda.org/anaconda/mkl/files.

So it might be worth reporting even if we don't have a very reproducible example.

@lesteve (Member, Author) commented Nov 21, 2018

Interesting, although looking at a failing build on master, e.g. https://circleci.com/gh/scikit-learn/scikit-learn/38702 (6 hours ago), the MKL version used is mkl-2018.0.3, which is from 5 months ago, not one of the 2019.* builds which were added very recently.

So it might be worth reporting even if we don't have a very reproducible example.

Where would be the best place to report, given the elusiveness of the problem?

@qinhanmin2014 (Member) commented

Seems that the Python 2 job passes.
Why is the Python 3 job so slow? Coincidence, or something specific to conda-forge?

@ogrisel (Member) commented Nov 21, 2018

Why is the Python 3 job so slow? Coincidence, or something specific to conda-forge?

It does not seem to be related to conda-forge, because it also took too long on Python 3 with the nomkl commit that was using the anaconda packages for numpy: 49445f2

https://circleci.com/gh/scikit-learn/scikit-learn/38731?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@amueller (Member) commented

@lesteve I would report to both numpy and anaconda-issues.

@ogrisel (Member) commented Nov 21, 2018

I am not sure this is related to numpy: nobody could reproduce it locally. Also, we now have timeouts in the Python 3 builds on CircleCI (both with nomkl and with conda-forge numpy). This was not the case previously:

generating gallery for auto_examples/decomposition... [ 50%] plot_pca_3d.py
generating gallery for auto_examples/decomposition... [ 58%] plot_ica_vs_pca.py
generating gallery for auto_examples/decomposition... [ 66%] plot_kernel_pca.py
generating gallery for auto_examples/decomposition... [ 75%] plot_pca_vs_fa_model_selection.py
Too long with no output (exceeded 10m0s)

@ogrisel (Member) commented Nov 21, 2018

plot_pca_vs_fa_model_selection.py takes less than 20s on a fresh conda env with numpy from anaconda with MKL (created today).

@ogrisel (Member) commented Nov 21, 2018

I have also tried with nomkl (which downgraded my conda env from Python 3.7 to 3.6) and this example still runs in less than 20s on my box.

@ogrisel (Member) commented Nov 21, 2018

I have also tried with a new conda-forge env and this example also runs in the order of 20s. No problem either. I think there is a problem with CircleCI itself.

@rth (Member) commented Nov 21, 2018

Could we be hitting the RAM limit on CircleCI while generating the docs?

@lesteve (Member, Author) commented Nov 21, 2018

By SSHing into CircleCI for the Python 3 build, I can reproduce the slowness by running

python examples/decomposition/plot_pca_vs_fa_model_selection.py

(I killed it after 2 minutes.) Setting OPENBLAS_NUM_THREADS=1 brings it back to a reasonable 10-20s.

I'll try that in the PR to see if that makes the build green.
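
For reference, a minimal sketch of that workaround (assuming the usual OpenBLAS behaviour that the variable must be set before NumPy loads the OpenBLAS runtime, e.g. in the CI environment or at the very top of a script):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before numpy is imported

import numpy as np

# The slow examples boil down to repeated linear-algebra calls such as this
# one; with the variable set, OpenBLAS stays single-threaded.
np.linalg.svd(np.random.randn(512, 512))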

@lesteve (Member, Author) commented Nov 21, 2018

FYI, here is the htop output when leaving OPENBLAS_NUM_THREADS unset:
[Image: plot_pca_vs_fa_model_selection_htop]

@amueller (Member) commented

How about we use MKL and my hotfix? Or do you think the slowdown is something worth investigating for the release?

@lesteve (Member, Author) commented Nov 21, 2018

I commented in your hotfix PR #12637. I have to admit, at this point in time I am just trying to get the examples running on CircleCI ...

I am not sure where the slowdown comes from, but it's likely to be specific to CircleCI. For some reason, OpenBLAS is trying to use all the cores on the machine, although CircleCI limits you to 2 CPUs from what I gathered.
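
A small sketch of the mismatch described above (the cgroup paths and numbers are assumptions about the CircleCI containers): os.cpu_count() reports the host's cores, while the cgroup CPU quota reflects what the container is actually allowed to use, and OpenBLAS appears to size its thread pool from the former.

import os

print("os.cpu_count():", os.cpu_count())  # host core count, e.g. 30+

# On cgroup-v1 systems, the quota/period ratio gives the real CPU allowance.
try:
    with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
        quota = int(f.read())
    with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
        period = int(f.read())
    if quota > 0:
        print("cgroup CPU allowance:", quota / period)  # ~2 on CircleCI
except OSError:
    print("no cgroup CPU quota found")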

@amueller (Member) commented

Yeah I figured our goal was to just make the doc build work. I'm relatively confident my PR will do that. Thanks for pushing the doc build :)

@amueller (Member) commented

this oversubscription issue is probably also slowing down our CI in general, right? Should we open an issue for that?

@rth (Member) commented Nov 21, 2018

this oversubscription issue is probably also slowing down our CI in general, right?

Loky handles this case correctly, I think, with loky.effective_n_jobs (which will be 2 here), while on other CIs we already had OPENBLAS_NUM_THREADS=1 last time I checked.
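
For context, a hedged sketch of those helpers (names as exposed by recent joblib/loky releases; whether the versions used in 2018 behaved identically is an assumption): they try to account for the container CPU limit rather than the raw host core count.

from joblib import effective_n_jobs
from loky import cpu_count

print(cpu_count())           # container-aware CPU count, expected to be 2 on CircleCI
print(effective_n_jobs(-1))  # how joblib resolves n_jobs=-1 against that count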

@ogrisel (Member) commented Nov 21, 2018

@lesteve interesting analysis, we should report the problem upstream to OpenBLAS. I think it's easy to reproduce by running

import numpy as np
np.linalg.svd(np.random.randn(1024, 1024))

with different values of OPENBLAS_NUM_THREADS on a machine with many cores.
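
One possible way to script that comparison (a sketch, not something run as part of this PR; each measurement is a separate subprocess because the variable has to be set before NumPy loads OpenBLAS):

import os
import subprocess
import sys

SNIPPET = (
    "import time, numpy as np; "
    "t0 = time.time(); "
    "np.linalg.svd(np.random.randn(1024, 1024)); "
    "print(round(time.time() - t0, 2))"
)

for n_threads in ("1", "2", "4", "8"):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=n_threads)
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET], env=env, capture_output=True, text=True
    )
    print("OPENBLAS_NUM_THREADS=%s -> %ss" % (n_threads, result.stdout.strip()))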

Anyway, +1 for @amueller's bugfix instead of using nomkl / OpenBLAS on CircleCI.

@ogrisel (Member) commented Nov 21, 2018

Loky handles this case correctly, I think, with loky.effective_n_jobs (which will be 2 here), while on other CIs we already had OPENBLAS_NUM_THREADS=1 last time I checked.

This only protects against oversubscription when OpenBLAS calls are nested under joblib.Parallel calls.

What @lesteve has found is a performance bug in OpenBLAS itself. We would need to check whether it's present in OpenBLAS master first. @jeremiedbb has plenty of OpenBLAS builds already compiled on his laptop, so it should be easy for him to try :)

@ogrisel (Member) commented Nov 21, 2018

this oversubscription issue is probably also slowing down our CI in general, right? Should we open an issue for that?

+1 for setting OPENBLAS_NUM_THREADS=2 in all our default CI environments.

@amueller (Member) commented

Circle timed out on 2.7 for conda-forge: conda-forge/scikit-learn-feedstock#81

@lesteve (Member, Author) commented Nov 22, 2018

For completeness, I SSHed into Circle and looked at the segmentation fault for Python 2 with nomkl (this build).

The little info I could extract with gdb:

0x00007ffff60a107c in dlaswp_plus ()
   from /home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/numpy/core/../../../../libopenblas.so.0
(gdb) bt
#0  0x00007ffff60a107c in dlaswp_plus ()
   from /home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/numpy/core/../../../../libopenblas.so.0
#1  0x00007ffff6091ece in inner_advanced_thread ()
   from /home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/numpy/core/../../../../libopenblas.so.0
#2  0x00007ffff5f6a61d in blas_thread_server ()
   from /home/circleci/miniconda/envs/testenv/lib/python2.7/site-packages/numpy/core/../../../../libopenblas.so.0
#3  0x00007ffff7bc7064 in start_thread (arg=0x7fffb422e700) at pthread_create.c:309
#4  0x00007ffff71f462d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

So this does not seem related to the segmentation fault with MKL (the snippet in #12383 (comment) did not crash).

I could get rid of the segmentation fault by setting OPENBLAS_NUM_THREADS <= 4. A bigger number would cause the segmentation fault.
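
As a side note, a generic way to get at least the Python-level context for such crashes, without attaching gdb, is the standard faulthandler module (a sketch of standard usage, not something used in this investigation):

import faulthandler

# Dump the Python traceback of every thread when a fatal signal such as
# SIGSEGV is received; the native OpenBLAS frames still require gdb.
faulthandler.enable()

# The same effect can be obtained without editing the example by starting it
# with `python -X faulthandler ...`.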

@jeremiedbb (Member) commented Nov 22, 2018

What @lesteve has found is a performance bug in OpenBLAS itself. We would need to check whether it's present in OpenBLAS master first. @jeremiedbb has plenty of OpenBLAS builds already compiled on his laptop, so it should be easy for him to try :)

I can reproduce it with numpy.linalg.svd. I get a wall time 4 times larger using 88 cores than using 4 (total CPU time 100 times larger). I can also reproduce it with OpenBLAS master.

@lesteve (Member, Author) commented Nov 22, 2018

Closing since #12637 was merged. If anyone is keen to investigate the problems uncovered, please be my guest!

@lesteve closed this Nov 22, 2018
@amueller (Member) commented

Could conda-forge/scikit-learn-feedstock#81 (comment) be related?

@amueller (Member) commented

I feel like we should report the segfault somewhere though lol

@lesteve deleted the circleci-nomkl branch February 9, 2021