-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
[MRG] Use nomkl on CircleCI. #12636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Use nomkl on CircleCI. #12636
Conversation
For some reason there is a segmentation fault with MKL on CircleCI. [doc build]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming CI passes..
Now we get a segfault on Python 2 still in At least, consistency is reassuring :) |
So please add it also for python 2 and add a comment with the url that explains why we have this workaround:
|
I wonder what would happen with dependencies installed from the conda-forge channel.. |
[doc build]
I just pushed a commit trying on conda-forge. If someone has a suggestion what the problem is behind the scenes, I am interested. It does not seem to make a lot of sense right now! |
Note that this also started to segfault on master with Python 3 on Nov 20. And this corresponds to the MKL update on the default conda channel https://anaconda.org/anaconda/mkl/files. So it might we worth reporting even if we don't have a very reproducible example. |
Interesting although if I looked at a failing build on master, e.g. https://circleci.com/gh/scikit-learn/scikit-learn/38702 (6 hours ago), the MKL version used is
Where would be the best place to report, given the elusiveness of the problem? |
Seems that python2 job passes. |
It does not seem to be related to conda-forge because it also took too long on python 3 with the nomkl commit that was using the anaconda packages for numpy: 49445f2 |
@lesteve I would report at both numpy and anaconda-issues. |
I am not sure this is related to numpy: nobody could reproduce locally. Also we now have timeouts in the python 3 builds on circle ci (both with nomkl and with conda-forge numpy). This was not the case previously:
|
|
I have also tried with nomkl (which downgraded my conda env from python 3.7 to 3.6) and it this example still runs in less than 20s on my box. |
I have also tried with a new conda-forge venv and this example run also in the order of 20s. No problem either. I think there is a problem with circle ci itself. |
Could we be hitting the RAM limit on CircleCI while generating the docs? |
By sshing into CircleCI for the Python3 build, I can reproduce the slowness by running
(I killed it after 2 minutes). By setting I'll try that in the PR to see if that makes the build green. |
4362524
to
8ad9d28
Compare
how about we use mkl and my hotfix? or do you think the slowdown is something worth investigating for the release? |
I commented in your hotfix PR #12637. I have to admit, at this point of time I am just trying to get the examples running on CircleCI ... I am not sure whether the slowdown comes from, but it's likely to be specific to CircleCI. For some reason, OpenBLAS is trying to use all the cores on the machine although CircleCI limits you to 2 CPUs from what I gathered. |
Yeah I figured our goal was to just make the doc build work. I'm relatively confident my PR will do that. Thanks for pushing the doc build :) |
this oversubscription issue is probably also slowing down our CI in general, right? Should we open an issue for that? |
Loky handles this case correctly I think with |
@lesteve interesting analysis, we should report the problem to upstream openblas. I think it's easy to reproduce by running
with different values of Anyways +1 for @amueller's bugfix instead of using nomkl / openblas on circle ci. |
This is only protecting oversubscription when openblas calls are nested under What @lesteve has found is a performance bug in openblas itself. We would need to check if it's present in OPENBLAS master first. @jeremiedbb has plenty of openblas already compiled on his laptop so it should be easy for him to try :) |
+1 for setting |
Look at what happened on Nov. 20: a big ill-conditioning issue :) |
circle timed out on 2.7 for conda-forge conda-forge/scikit-learn-feedstock#81 |
For completeness, I ssh into Circle and looked at the segmentation fault for Python 2 with nomkl (this build). The little info I could extract with gdb:
So this did not seem related to the segmentation fault on MKL (the snippet in #12383 (comment) did not crash) I could get rid the segmentation fault by setting OPENBLAS_NUM_THREADS <= 4. A bigger number would cause the segmentation fault. |
I can reproduce it with |
Closing since #12637 was merged. If anyone is keen to investigate the problems uncovered, please be my guest!
|
conda-forge/scikit-learn-feedstock#81 (comment) could be related? |
I feel like we should report the segfault somewhere though lol |
That should work around the segmentation fault we see in master and 0.20.1 PR.
Fix #12630 and maybe the segfault seen in #12383.