-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
NPY_SIGINT_{ON,OFF} have a race condition that can cause control-C to segfault Python #7545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I spent a little while noodling at this today, and it's extremely hard :-( Best attempt:
And the signal handler logic should be:
And we need to make sure that the "my jump buffer is initialized" variable is I think that would work? ...alternatively I am strongly tempted to rip this stuff out. This is a lot of complexity for something that only even affects |
My solution right now is, that with Bluestein's algorithm, we just stop doing anything for FFTs, I think its better than the current unsafe behaviour, threads are becoming more common. As to actually solving this, right now you have to reclaim the GIL to do that, Python would have to expose |
There's two separate things captured here:
For (2), I'm not sure what to do though. There's a possible issue with these interrupt handling APIs and they are not used in NumPy, SciPy or anywhere else prominent as far as I can tell with code search. I'm not sure we should spend the effort to deprecate them though. How about:
|
I will make a mini PR to just delete all traces of interrupt handling stuff (I must have missed two lines). I honestly would be 👍 for just deleting the header:
But if not now... let's mark it to maybe do when a major release comes around (i.e. any moment). |
We do not use these in NumPy anymore, and at this point the whole `npy_interrupt.h` header only exists in case someone is using it out there. We may wish to just remove it at some point, vendoring the header is simple enough after all (and no known downstream usage exists). See also numpygh-7545, numpygh-12541
As to this issue, closing as a duplicate of the newer (and more focused) gh-12541. |
In the FFTPACK code, we have some "clever" code that uses the pattern
NPY_SIGINT_ON
, then run some inner loop, thenNPY_SIGINT_OFF
. And what this does is:NPY_SIGINT_ON
usessetjmp
to make a jump buffer target, saves the current signal handler for SIGINT (control-C), and then registers a new SIGINT handler that attempts to handle control-C bylongjmp
ing out to thesetjmp
buffer. The idea is that this allows us to cancel long-running calculations.The actual implementations of these macros is:
So this has two race conditions:
Minor problem: If we receive a control-C in between the call to
PyOS_setsig
and the call toNPY_SIGSETJMP
, then we'll longjmp out of the signal handler into a uninitialized buffer, and will not go to space today.Major problem: suppose that the following sequence occurs:
NPY_SIGINT_ON
, stashes Python's default sigint handler in its_npy_sig_save
NPY_SIGINT_ON
, stashes our sigint handler in its_npy_sig_save
NPY_SIGINT_OFF
, restores Python's default sigint handler from its_npy_sig_save
NPY_SIGINT_OFF
, then restores our sigint handler from its_npy_sig_save
Now our signal handler gets left installed indefinitely, and eventually when control-C gets hit it will attempt to jump out to a stack frame that was reused ages ago. This is a bad problem and you will not go to space today.
By adding some
printf
checks toNPY_SIGINT_OFF
, I've confirmed that this is the cause of the segfault that Oscar reported on numpy-discussion. It's very easy to replicate even from Python code:Also, the whole architecture here is somewhat screwed up, because the Python model is that all signals are directed to the main thread (by setting them to masked on all other threads). So it doesn't even make sense to be installing a signal handler to interrupt operations on a thread that can't receive signals.
(If you look at the definition of
_PyArray_SigintHandler
and_PyArray_GetSigintBuf
inmultiarraymodule.c
, then you'll see that there's some basic defenses against this: we keep our jump buffer in TLS, and also a flagsigint_buf_init
, so that idea is that if we start running the signal handler in a thread that hasn't actually allocated a jump buffer, then we throw away the signal. This is also pretty buggy: (1) nothing ever sets the flag back to 0, so if a thread has ever had a valid jump buffer on its stack then we will happily jump to it, even if it's no longer valid. (2) we shouldn't throw away the signal, we should pass it on to Python.)So the simple solution is (1) check whether we are actually the one thread that can receive signals, (2) if so, then register our handler etc.; (3) if not, then make these macros into no-ops.
The one wrinkle here is that in theory it's possible for someone to call
sigprocmask
themselves and enable signal receiving on a non-default thread. We might want to be robust to this. A nice extra constraint is that there's no C API to check whether we are in the main thread; the only way I can think of to check that is actually to callsigprocmask
and check whether SIGINT delivery is enabled for this thread.The text was updated successfully, but these errors were encountered: