NPY_SIGINT_{ON,OFF} have a race condition that can cause control-C to segfault Python · Issue #7545 · numpy/numpy · GitHub
NPY_SIGINT_{ON,OFF} have a race condition that can cause control-C to segfault Python #7545

Closed
njsmith opened this issue Apr 14, 2016 · 5 comments
Labels
57 - Close? Issues which may be closable unless discussion continued

Comments

@njsmith
Member
njsmith commented Apr 14, 2016

In the FFTPACK code, we have some "clever" code that uses the pattern: NPY_SIGINT_ON, then run some inner loop, then NPY_SIGINT_OFF. What this does is:

NPY_SIGINT_ON uses setjmp to make a jump buffer target, saves the current signal handler for SIGINT (control-C), and then registers a new SIGINT handler that attempts to handle control-C by longjmping out to the setjmp buffer. The idea is that this allows us to cancel long-running calculations.

The actual implementation of these macros is:

```c
#define NPY_SIGINT_ON {                                                 \
        PyOS_sighandler_t _npy_sig_save;                                \
        _npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler);    \
        if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()), \
                          1) == 0) {

#define NPY_SIGINT_OFF }                                                \
        PyOS_setsig(SIGINT, _npy_sig_save);                             \
        }
```

So this has two race conditions:

Minor problem: If we receive a control-C in between the call to PyOS_setsig and the call to NPY_SIGSETJMP, then we'll longjmp out of the signal handler into an uninitialized buffer, and will not go to space today.

Major problem: suppose that the following sequence occurs:

  1. thread 1 enters NPY_SIGINT_ON, stashes Python's default sigint handler in its _npy_sig_save
  2. thread 2 enters NPY_SIGINT_ON, stashes our sigint handler in its _npy_sig_save
  3. thread 1 leaves via NPY_SIGINT_OFF, restores Python's default sigint handler from its _npy_sig_save
  4. thread 2 leaves via NPY_SIGINT_OFF, then restores our sigint handler from its _npy_sig_save

Now our signal handler gets left installed indefinitely, and eventually when control-C gets hit it will attempt to jump out to a stack frame that was reused ages ago. This is a bad problem and you will not go to space today.
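The interleaving above can be simulated without real threads or signals; a plain variable stands in for the single process-wide handler slot (all names here are invented for illustration):

```python
# Simulation of the NPY_SIGINT_ON/OFF save/restore race.
# "handler" plays the role of the process-wide SIGINT handler slot;
# each dict plays the role of one thread's _npy_sig_save local.
handler = "python_default"

def sigint_on(saved):
    global handler
    saved["old"] = handler     # stash whatever is installed right now
    handler = "numpy_handler"  # install our handler

def sigint_off(saved):
    global handler
    handler = saved["old"]     # restore whatever we stashed

t1, t2 = {}, {}
sigint_on(t1)    # step 1: thread 1 stashes "python_default"
sigint_on(t2)    # step 2: thread 2 stashes "numpy_handler"
sigint_off(t1)   # step 3: thread 1 restores "python_default"
sigint_off(t2)   # step 4: thread 2 restores "numpy_handler"

print(handler)   # -> numpy_handler: our handler is left installed
```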

By adding some printf checks to NPY_SIGINT_OFF, I've confirmed that this is the cause of the segfault that Oscar reported on numpy-discussion. It's very easy to replicate even from Python code:

```python
import numpy
numpy.test()
import os
import signal
os.kill(os.getpid(), signal.SIGINT)
# -> segfault
```

Also, the whole architecture here is somewhat screwed up, because the Python model is that all signals are directed to the main thread (they are masked on all other threads). So it doesn't even make sense to install a signal handler to interrupt operations on a thread that can't receive signals.
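This is easy to observe from Python: CPython refuses to install signal handlers from anywhere but the main thread, precisely because that is where it delivers signals. A small demonstration (not taken from the issue):

```python
import signal
import threading

results = {}

def try_install():
    # CPython raises ValueError when a signal handler is set
    # from a thread other than the main thread
    try:
        signal.signal(signal.SIGINT, signal.SIG_IGN)
        results["ok"] = True
    except ValueError as exc:
        results["error"] = str(exc)

t = threading.Thread(target=try_install)
t.start()
t.join()
print(results)  # a ValueError is recorded; no handler was installed
```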

(If you look at the definitions of _PyArray_SigintHandler and _PyArray_GetSigintBuf in multiarraymodule.c, you'll see that there are some basic defenses against this: we keep our jump buffer in TLS, along with a flag sigint_buf_init, the idea being that if we start running the signal handler in a thread that hasn't actually allocated a jump buffer, we throw away the signal. This is also pretty buggy: (1) nothing ever sets the flag back to 0, so if a thread has ever had a valid jump buffer on its stack then we will happily jump to it, even if it's no longer valid; (2) we shouldn't throw away the signal, we should pass it on to Python.)

So the simple solution is (1) check whether we are actually the one thread that can receive signals, (2) if so, then register our handler etc.; (3) if not, then make these macros into no-ops.

The one wrinkle here is that in theory it's possible for someone to call sigprocmask themselves and enable signal receiving on a non-default thread. We might want to be robust to this. A nice extra constraint is that there's no C API to check whether we are in the main thread; the only way I can think of to check that is actually to call sigprocmask and check whether SIGINT delivery is enabled for this thread.

@njsmith
Member Author
njsmith commented Apr 17, 2016

I spent a little while noodling at this today, and it's extremely hard :-(

Best attempt:

  • if sigprocmask says that SIGINT is masked for this thread: just run the actual user code without doing anything special
  • else, if sigprocmask says that SIGINT is _un_masked for this thread:
    • use sigprocmask to mask out SIGINT, so that we won't be interrupted while doing the following dance
    • while holding some mutex (either the GIL or a specialized mutex just for this):
      • increment a process-wide global counter of "how many threads are inside SIGINT_ON"
      • if the counter was 0 and is now 1, then:
        • first, save the current SIGINT handler to some process-wide global variable
      • then, in a second operation, register our special signal handler [this two-step process is necessary to get the invariant that whenever our signal handler is called it can assume the old signal handler pointer is in a consistent state]
    • set this thread's "my jump buffer is initialized" flag (in thread local storage)
    • call setjmp to initialize this thread's jump buffer (in thread local storage). [Note that this call can return two ways: either by falling through, or by being longjmped to. So those two control flows merge here]
    • call sigprocmask again to restore the original mask (re-enabling SIGINT, possibly for the second time in case longjmp restored the old mask)
    • check whether we're on control flow where setjmp was just called for the first time, or on the flow where it got longjmped to. If it was just called for the first time, then run the actual user code that we want to be interruptible
    • set this thread's "my jump buffer is initialized" flag to false
    • while holding the same mutex as above:
      • decrement the process-wide global counter of "how many threads are inside SIGINT_ON"
      • if the counter value is now 0, then: restore the original signal handler
    • oh thank goodness we escaped

And the signal handler logic should be:

  • if this thread's "my jump buffer is initialized" flag is false:
    • manually call the original signal handler and return
  • otherwise: call longjmp

And we need to make sure that the "my jump buffer is initialized" variable is sig_atomic_t.

I think that would work?

...alternatively I am strongly tempted to rip this stuff out. This is a lot of complexity for something that only even affects np.fft.

@seberg
Member
seberg commented Jul 17, 2020

My solution right now is that, with Bluestein's algorithm, we just stop doing anything for FFTs; I think it's better than the current unsafe behaviour, and threads are becoming more common. As to actually solving this, right now you have to reclaim the GIL to do it; Python would probably have to expose _PyOS_InterruptOccurred(PyThreadState *tstate)? And I do not think our macros can really be rescued; they are simply not a safe approach.

@seberg seberg added the 57 - Close? Issues which may be closable unless discussion continued label Jul 17, 2020
@rgommers
Member

There are two separate things captured here:

  1. An issue in numpy.fft - this is solved; we are not using NPY_SIGINT_{ON,OFF} anywhere inside NumPy anymore after the update to pocketfft.
  2. An issue inherent to how NPY_SIGINT_{ON,OFF} are implemented. The description by @njsmith is excellent, and may remain useful because these macros are part of our public API: https://numpy.org/devdocs/reference/c-api/config.html#interrupt-handling.

For (2), I'm not sure what to do though. There's a possible issue with these interrupt handling APIs and they are not used in NumPy, SciPy or anywhere else prominent as far as I can tell with code search. I'm not sure we should spend the effort to deprecate them though.

How about:

  • Add a warning to the C API docs that links to this issue
  • Close this issue, saying we don't recommend using these APIs but won't change things unless actual real-world problems show up that necessitate taking action

@seberg
Member
seberg commented Nov 21, 2022

I will make a mini PR to just delete all traces of interrupt handling stuff (I must have missed two lines). I honestly would be 👍 for just deleting the header:

  • This should hit exceedingly few users
  • vendoring the header is a simple enough fix

But if not now... let's mark it to maybe do when a major release comes around (i.e. any moment).

seberg added a commit to seberg/numpy that referenced this issue Nov 21, 2022
We do not use these in NumPy anymore, and at this point the whole
`npy_interrupt.h` header only exists in case someone is using it
out there.
We may wish to just remove it at some point, vendoring the header is
simple enough after all (and no known downstream usage exists).

See also numpygh-7545, numpygh-12541
@seberg
Member
seberg commented Nov 21, 2022

As to this issue, closing as a duplicate of the newer (and more focused) gh-12541.

@seberg seberg closed this as completed Nov 21, 2022