NumPy "dot" hangs when used with multiprocessing (potentially Apple Accelerate related?) · Issue #5752 · numpy/numpy
Closed
josePhoenix opened this issue Apr 6, 2015 · 28 comments

@josePhoenix

I'm having a devil of a time making a minimal test case, but this seems to be my issue: http://stackoverflow.com/questions/23963997/python-child-process-crashes-on-numpy-dot-if-pyside-is-imported

In my case, I have code that farms out a bunch of calculations, including matrix products, to a multiprocessing.Pool with pool.map. The computation hangs partway through, and some hacky print-based debugging shows it hanging on a call to np.dot down in the guts of the program.

Replacing pool.map with the built-in (serial) map makes everything work.

I used not to have this issue, then something changed (lunar eclipse?) and now my computation hangs consistently whenever multiprocessing is used. A minimal test case continues to elude me. (It's not enough to simply generate 10 random NxN arrays and dot them in a multiprocessing-based way.)

>>> numpy.show_config()
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH']
    define_macros = [('NO_ATLAS_INFO', 3)]
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-DAPPLE_ACCELERATE_SGEMV_PATCH', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
atlas_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
@njsmith
Member
njsmith commented Apr 6, 2015

Yes, you basically can't use multiprocessing in fork mode if you're using
apple accelerate. Sorry. It's apple's fault, and they consider it not-a-bug
so... too bad for their customers, I guess? Pretty frustrating.

Workarounds:

  • use another blas, e.g. we fixed this problem in OpenBLAS a year or two
    ago.
  • use multiprocessing in "spawn" mode. This requires Python 3.4 or later,
    though.

My guess is you could get a reliable reproducer by first dot'ing some
matrices in the parent, then starting the pool, and then dot'ing some
matrices in the child. And make sure that all the matrices are large. But
really, who knows, it's up to the whims of accelerate when it uses threads
etc., so when it acts quirky then we have no way to tell why.
Multiprocessing and Accelerate are definitely, officially incompatible,
though.

@matthew-brett
Contributor

I'm guessing that you upgraded numpy recently, using wheels? The Numpy 1.9.0 wheel used ATLAS, and was safe for multiprocessing (but slower in general). 1.9.1 and 1.9.2 wheels use Accelerate again.

@rgommers
Member
rgommers commented Apr 6, 2015

@matthew-brett I don't think I remember the reason for that switch, can you remind me?

@njsmith
Member
njsmith commented Apr 6, 2015

I know that we're still in the shake down period on the wheel builds, but I
do wonder if in the future we might want to apply similar criteria to those
that we do to source changes, e.g. not changing a fairly major and
regression-prone thing like the blas library in a point release?

@matthew-brett
Contributor

Accelerate dot was broken for not-AVX-aligned float32 arrays on OSX 10.9 : #4007 . Sturla worked round that using some conditional compilation and using an alternative BLAS routine : #5223 . After that was merged I went back to using Accelerate for the OSX wheels. It's not that hard to build ATLAS wheels if people need them.

@josePhoenix
Author

Thanks for the insight, all. We'll have to do some thinking about a workaround. (Or wait for our institute to switch to Python 3.X...)

I'm fine recompiling NumPy with a different linear algebra lib for my own use, but the project I'm working on is intended to be a 'pip install'-able package for anyone to use. We might have to detect the platform and fail with an exception when multiprocessing and Apple Accelerate are used together.
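One hedged way to implement that guard (this is a heuristic sketch, not an official API: the `blas_opt_info` attribute existed on distutils-era NumPy builds and may be absent on newer ones, hence the `getattr` fallback):

```python
import platform
import numpy as np

def using_apple_accelerate():
    """Heuristic: does this NumPy appear to be linked against Accelerate?"""
    if platform.system() != "Darwin":
        return False
    # blas_opt_info existed on distutils-era NumPy builds; guard its absence.
    info = getattr(np.__config__, "blas_opt_info", {}) or {}
    link_args = " ".join(info.get("extra_link_args", []))
    return "Accelerate" in link_args or "vecLib" in link_args

# Before spinning up a multiprocessing.Pool, the library could do:
if using_apple_accelerate():
    raise RuntimeError(
        "multiprocessing with an Accelerate-backed NumPy is known to hang "
        "(see numpy/numpy#5752); rebuild against another BLAS or use the "
        "'forkserver' start method on Python 3.4+"
    )
```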

@josePhoenix
Author

@matthew-brett Is speed the main reason not to use ATLAS for the main public wheel builds? I'm not sure how the needs of NumPy users break down, but I'd like to put in one vote for a more conservative / compatible wheel.

@njsmith
Member
njsmith commented Apr 6, 2015

I tend to agree. Just sent an email to the list to see if we can get
consensus.

@matthew-brett
Contributor

@ogrisel - any opinion on this one for sklearn?

@matthew-brett
Contributor

By the way - when I said it wasn't that hard to build ATLAS wheels, I meant, that we (numpy) could build some ATLAS wheels and supply them somewhere as an option.

@ogrisel
Contributor
ogrisel commented Apr 7, 2015

By default, Python multiprocessing does fork without exec, which breaks various libraries that use posix thread pools (or similar) internally (Accelerate, CUDA, libgomp, the OpenMP implementation of gcc). This is probably still the case under some circumstances (e.g. data size) for Apple Accelerate, although it seems to depend on the versions. The first time we observed the issue was on OSX 10.7. We (@cournape and I) reported the bug and Apple replied: wontfix, fork without exec is a POSIX standard violation (which is true BTW).

I issued a patch to OpenBLAS in the past to make its non-OpenMP thread pool fork-safe:

OpenMathLib/OpenBLAS#294

It would be interesting to check whether this fix still works with the latest version of OpenBLAS. As OpenBLAS is quite fast, we could use it for the wheels (if all numpy + scipy tests pass under OSX with OpenBLAS).

ATLAS is robust by default too.

In Python 3.4+, multiprocessing has a forkserver start method to mitigate that issue but it is not used by default.

@mperrin
mperrin commented Apr 22, 2015

As a status update to @josePhoenix's original problem, I've just updated our code to use forkserver on Python 3.4, and now Accelerated np.dot and multiprocessing appear to be playing well together! Fantastic. Thanks all for the insights and in particular @ogrisel for the pointer to forkserver.

(And as a bonus this now gives us a concrete reason to tell the users of our library "hey, you should update to Python 3.4" :-)

@charris
Member
charris commented Apr 22, 2015

Can this be closed now?

@mperrin
mperrin commented Apr 22, 2015

From our (@josePhoenix and my) perspective, I think yes it can be closed. The discussion above made it sound like there were broader interests in having ATLAS wheels as well as ones with Accelerate, but that probably ought to be its own separate issue if it's going to be pursued. That's up to you all! Thanks again.

@charris
Member
charris commented Apr 22, 2015

@mperrin OK, thanks for the feedback.

@pcohen89

Hi all, sorry, but is the only solution here to switch to Python 3? I am mid-project and I really need both np.dot() and multiprocessing. I'm a bit concerned about switching to Python 3 mid-project; any thoughts would be a huge help!

@josePhoenix
Author

From our experience with this bug: adding support for Python 3.4+ (with the 'forkserver' method) was relatively painless, and resolved this specific issue. Our codebase wasn't specifically written to be forward-compatible, but it was more or less best-practices Python 2.7 code which ported rather easily.

If you try that, and it proves to be a morass, you could always do the poor man's multiprocessing: make each individual work unit something you can invoke with subprocess.check_call to launch another script. Any data that need to be shared with subprocesses will have to be serialized (as command line arguments or files on disk or what have you), but it might be worth it to you.
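A sketch of that "poor man's multiprocessing" (the embedded worker script and the JSON hand-off are hypothetical choices of mine, not from the thread): each work unit runs in a fresh interpreter, so the multi-threaded parent is never forked.

```python
import json
import subprocess
import sys
import tempfile

# A tiny hypothetical worker: reads a matrix from a JSON file, dots it with
# itself in a brand-new process, and writes the trace to an output file.
WORKER_SCRIPT = r"""
import json, sys
import numpy as np
with open(sys.argv[1]) as f:
    a = np.array(json.load(f)["matrix"])
with open(sys.argv[2], "w") as f:
    json.dump({"trace": float(np.dot(a, a).trace())}, f)
"""

def run_work_unit(matrix):
    # Serialize the input, exec a fresh interpreter, read the result back.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(WORKER_SCRIPT)
        script = f.name
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({"matrix": matrix}, f)
        inp = f.name
    out = inp + ".out"
    subprocess.check_call([sys.executable, script, inp, out])
    with open(out) as f:
        return json.load(f)["trace"]

# diag(1, 2) squared is diag(1, 4), so the trace is 5.0.
result = run_work_unit([[1.0, 0.0], [0.0, 2.0]])
```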

@pcohen89

So the recommendation is python3. We haven't done anything terribly exotic so hopefully we will have decent forward compatibility. Thanks

@njsmith
Member
njsmith commented Apr 27, 2016

The other option would be to build numpy against a different blas library. E.g., openblas (what we use by default on Windows) doesn't have this problem.

@njsmith
Member
njsmith commented Apr 27, 2016

Err, use by default on Linux I mean. And hopefully Windows soon too.

@Marviel
Marviel commented Jan 5, 2017

Jumping in late -- but if you're comfortable with dummy processes, switching my Pool to a multiprocessing.dummy Pool got around this issue for me.

Found here:
(dedupeio/dedupe#206)

@ogrisel
Contributor
ogrisel commented Jan 6, 2017

multiprocessing.dummy.Pool is just a thread pool. If your code acquires the GIL a lot (that is, if your CPU bottleneck isn't in a GIL-free segment such as Cython code or a NumPy operation), then performance will suffer from lock contention. Using thread pools is therefore not a suitable workaround for everyone affected by this limitation.

BTW, if you want to use a thread pool, I would recommend the more modern concurrent.futures.ThreadPoolExecutor API instead of multiprocessing.dummy.Pool.
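For completeness, a minimal sketch of that suggestion (the workload is illustrative): no fork happens at all, and NumPy releases the GIL inside the BLAS call, so threads can overlap on the dot itself.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def dot_trace(seed):
    # np.dot releases the GIL during the BLAS call, so worker threads are
    # not serialized on the interpreter lock for the heavy part.
    rng = np.random.RandomState(seed)
    a = rng.rand(300, 300)
    return float(np.dot(a, a).trace())

with ThreadPoolExecutor(max_workers=4) as ex:
    traces = list(ex.map(dot_trace, range(4)))
```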

@Marviel
Marviel commented Jan 6, 2017

@ogrisel yeah totally! Definitely don't think it's a solution for all cases, but with mine (specifically just trying to get a "cancellable" python process) it worked well enough.

Thanks for the tip! I'll look at switching over

@zerodrift

is this fixed?

@mperrin
Copy link
mperrin commented Jan 24, 2017

@zerodrift we switched our code to use the forkserver method on Python 3.4+ as was discussed above, and that has been working just fine. I don't think there's been any change to the issue on Py 2.7 though.

@zerodrift

mine outright crashes...is it related?

Process: Python [92939]
Path: /usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/Resources/Python.app/Contents/MacOS/Python
Identifier: Python
Version: 3.5.1 (3.5.1)
Code Type: X86-64 (Native)
Parent Process: Python [92904]
Responsible: Python [92939]
User ID: 504

Date/Time: 2017-01-24 11:25:40.057 -0600
OS Version: Mac OS X 10.12.2 (16C67)
Report Version: 12
Anonymous UUID: 058190CF-EEF5-6666-8EE5-FD2F76536C1F

Sleep/Wake UUID: 12153DD9-1ECD-4124-A226-A26A3DC3F874

Time Awake Since Boot: 410000 seconds
Time Since Wake: 58000 seconds

System Integrity Protection: enabled

Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000110
Exception Note: EXC_CORPSE_NOTIFY

Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [0]

VM Regions Near 0x110:
-->
__TEXT 00000001065c8000-00000001065ca000 [ 8K] r-x/rwx SM=COW /usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/Resources/Python.app/Contents/MacOS/Python

Application Specific Information:
*** multi-threaded process forked ***
crashed on child side of fork pre-exec

@ogrisel
Contributor
ogrisel commented Jan 24, 2017

It can be related. For Python 2.7 users (and Python 3 users as well), we started an alternative module to safely work with sub-processes without fearing those kinds of crashes and hanging:

https://github.com/tomMoral/loky

It's still beta but we would be glad to get your feedback as github issues.

aspatti1257 added a commit to aspatti1257/CellLineAnalyzer that referenced this issue May 1, 2019
@ghost
ghost commented Jun 12, 2020

setting:

os.environ.update(MKL_NUM_THREADS='1')

fixed the issue for me
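The same idea generalizes across BLAS backends: setting the thread-count variables before NumPy is first imported disables the internal thread pools that misbehave after a fork, at the cost of single-threaded linear algebra. (Which variable matters depends on your build; the three below cover MKL, OpenBLAS, and Accelerate, and this is a sketch, not an endorsement of any one of them.)

```python
import os

# These must be set before numpy (and its BLAS) is imported for the first
# time; each backend reads its variable during initialization.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"

import numpy as np  # BLAS is now single-threaded, so fork-safe but serial
```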
