BUG: fixes np.random.zipf() hanging on pathological input by tylerjereddy · Pull Request #10912 · numpy/numpy


Closed
tylerjereddy wants to merge 1 commit into numpy:master from tylerjereddy:issue_9829_zipf_hang

Conversation

@tylerjereddy (Contributor)

A cleaner fix might be possible if I were to, e.g., move the C algorithm into Cython; leaving the while loop in C seems to require a bit of low-level intervention to enable KeyboardInterrupt handling, and using the Python C API as suggested in the referenced issue results in a segfault after catching Ctrl+C.

I'm also not a big fan of the unit test structure, which writes a Python script to a tempfile (I'm not even sure that will work on CI); perhaps there are better suggestions. Probing SIGINT handling on pathological input seems like a tricky thing to test.

@tylerjereddy force-pushed the issue_9829_zipf_hang branch from 5051603 to 8a97ed3 on April 16, 2018 03:51
* Fixes Issue numpy#9829 by allowing a keyboard interruption
to break out of zipf() computations that run for very
long times because of pathological inputs

* Added a unit test that verifies handling of SIGINT
by zipf()
@tylerjereddy force-pushed the issue_9829_zipf_hang branch from 8a97ed3 to 3e98c39 on April 19, 2018 04:36
@tylerjereddy (Contributor, Author)

I adjusted things so that it works (at least in my hands) on Windows & Linux with a keyboard interrupt; we'll see if the CI agrees. Even if it does, I still have some residual concerns about the approach's potential fragility.

 * approach is used here;
 * Related to Issue #9829
 */
signal(SIGINT, kb_interrupt_handler);
Member

The handler is installed on every call to rk_zipf? When do you remove it?

Member

It's in the wrong place anyway. The problem is that the while loop never exits: the acceptance-rejection implementation achieves nearly 100% rejection for parameters near one, on account of an unfortunate scaling.

@njsmith (Member) commented Apr 19, 2018

So if I understand right, you install a C-level signal handler that aborts the process? That seems like an unfortunate solution :-(.

There is an existing approach to this problem in npy_interrupt.h that uses longjmp to return early when a signal is detected. But that approach is terrifying to start with, can't work on Windows, and is buggy (#7545), so I don't really recommend it.

Given the simplicity of this function, it seems like checking for signals occasionally would be a better approach? E.g., calling PyErr_CheckSignals once every 1000 iterations, something like that? I don't know why PyErr_CheckSignals would be segfaulting – maybe you were calling it without the GIL held?

The trickier issue is that right now, there's a strict separation between the Python wrapper and the C distribution code, but to make this work right we would need to add a C API call (PyErr_CheckSignals) to the C distribution code. I think in this case, violating that layering is actually the right idea – layering is a good thing, but if it stops you implementing real features, then it's the wrong layering! However, I'm not sure what the actual best way to do this would be.

Paging @rkern...
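For concreteness, here is a minimal sketch of that periodic-check idea. The loop body is the Devroye-style rejection sampler as it appears in randomkit; the 1000-iteration interval and the -1 sentinel are illustrative assumptions (not code from this PR), it assumes the GIL is held by the caller, and the wrapper would still need to turn the sentinel into an early exit so the exception propagates:

#include <Python.h>
#include <math.h>
#include "randomkit.h"

long rk_zipf_with_check(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);
    long iters = 0;

    for (;;) {
        double U = 1.0 - rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        if (X >= 1.0) {
            double T = pow(1.0 + 1.0 / X, am1);
            if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
                return (long)X;
        }
        /* Every 1000 rejections, let Python run pending signal handlers
         * so Ctrl+C raises KeyboardInterrupt instead of hanging forever. */
        if (++iters % 1000 == 0 && PyErr_CheckSignals() != 0)
            return -1;  /* an exception is now set; caller must bail out */
    }
}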

@charris (Member) commented Apr 19, 2018

We could also fix the function itself if we stop worrying about maintaining the current stream. The problem with zipf is a poor implementation; it isn't inherent.

@charris (Member) commented Apr 19, 2018

Note that the problem is also predictable, so we could also simply raise an error and bail for certain values of the distribution parameter.

@tylerjereddy (Contributor, Author)

Yeah, I did try putting PyErr_CheckSignals in the C while loop. From what I could tell from some searching around, the overhead of checking for signals is so low that you can normally afford to do it on every iteration.

@njsmith, if we assume I've botched the GIL handling, the related C API docs seem to suggest something like this:

PyGILState_STATE gstate;
gstate = PyGILState_Ensure();  /* acquire the GIL for this thread */

PyErr_CheckSignals();

/* Release the GIL state. */
PyGILState_Release(gstate);

Does that look about right or is there a somewhat different convention in NumPy?

@njsmith (Member) commented Apr 19, 2018

That GIL code looks plausible to me, at a first glance anyway. I guess you also need to check the return value from PyErr_CheckSignals and find some way to communicate to the caller that an exception has been set? I don't have any guesses about the segfault beyond that, though.
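Combining those two points, a minimal sketch of what that could look like (check_interrupt is a hypothetical helper name, not an existing NumPy or CPython API):

#include <Python.h>

/* Returns nonzero if a signal handler raised an exception (e.g.
 * KeyboardInterrupt). PyErr_CheckSignals() returns -1 in that case and
 * leaves the Python exception set, so the caller only needs to unwind. */
static int check_interrupt(void)
{
    PyGILState_STATE gstate = PyGILState_Ensure();  /* ensure we hold the GIL */
    int rc = PyErr_CheckSignals();
    PyGILState_Release(gstate);
    return rc != 0;
}

The rejection loop would then do "if (check_interrupt()) return -1;", and the wrapper would treat the -1 sentinel as "an exception is pending" and return early instead of producing a sample.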

@charris (Member) commented Apr 19, 2018

One possibility that occurred to me was to simply count the number of sequential rejections and bail at some point. Just to clarify why there is a problem: the distribution becomes singular as a -> 1 (the harmonic series), and almost all trial samples will be floating-point infinity and get rejected as too big for a long. The cure is to cut off the distribution at the size of a long before generating the trial values, i.e., to scale the mapping from (0, 1] onto the majorizing distribution; see the sketch below. That gives speedups of many thousands for values that aren't even that near to 1.
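A sketch of that scaling, under stated assumptions (randomkit names, LONG_MAX as the cutoff; this is illustrative, not the fix that eventually landed). Since the trial value is X = floor(U**(-1/(a-1))), drawing U from [LONG_MAX**-(a-1), 1) instead of (0, 1] guarantees every trial already fits in a long:

#include <limits.h>
#include <math.h>
#include "randomkit.h"

long rk_zipf_capped(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);
    /* Smallest U whose image X = floor(U**(-1/am1)) still fits in a long. */
    double umin = pow((double)LONG_MAX, -am1);

    for (;;) {
        /* Scale U into [umin, 1): no trial can now overflow a long. */
        double U = umin + (1.0 - umin) * rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
            return (long)X;
    }
}

The truncated majorizer still dominates the target on the restricted support, so the acceptance test is unchanged; the cost is that this consumes the random stream differently from the current implementation, which is exactly the compatibility question raised below.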

@njsmith (Member) commented Apr 19, 2018

I believe @rkern is working on a NEP for breaking distribution compatibility. If we don't want to wait for that though then erroring out on problematic values seems reasonable.

If we do go the error route, then I think it's better to trigger it based on the parameter exceeding some numeric threshold, rather than on the rejection count. Using the rejection count means that the same call with the same parameters might sometimes work and sometimes error out depending on the seed. That seems particularly miserable if you have a script using parameters that are mostly safe but then fail for, say, 1 in 100 seeds...

@eric-wieser (Member)
How about making rk_zipf attempt at most 1000 samples, and return -1 if none is accepted within that limit? The caller can then call it repeatedly, invoking PyErr_CheckSignals() after each failing call; a sketch follows below.
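A sketch of that split (the batch size and names are illustrative assumptions):

/* C side: try at most 1000 draws; -1 is a safe sentinel because valid
 * Zipf samples are always >= 1. */
long rk_zipf_batch(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);

    for (int trial = 0; trial < 1000; trial++) {
        double U = 1.0 - rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        if (X < 1.0)
            continue;
        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
            return (long)X;
    }
    return -1;  /* no acceptance in this batch; caller decides what next */
}

/* Wrapper side (GIL held): retry between batches, checking for signals. */
long sample;
while ((sample = rk_zipf_batch(state, a)) == -1) {
    if (PyErr_CheckSignals() != 0)
        goto fail;  /* KeyboardInterrupt or other pending exception */
}

This keeps all Python C API calls out of the distribution code, preserving the layering discussed above, at the cost of a slightly clunkier C interface.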

@charris (Member) commented Apr 22, 2018

It should not be difficult to compute the cutoff for the parameter, we just need to decide what ratio of acceptance to failure we are willing to live with. In general, I would regard an algorithm of this type as pretty poor if only half the trials were accepted ...
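For a rough sense of where such a cutoff lands (a back-of-envelope estimate, assuming a 64-bit long and counting only rejections caused by overflow): a trial X = floor(U**(-1/(a-1))) exceeds LONG_MAX ≈ 2**63 exactly when U < 2**(-63(a-1)), so the overflow-rejection rate per trial is about 2**(-63(a-1)). Keeping that below 1/2 requires 63(a-1) >= 1, i.e. a >= 1 + 1/63 ≈ 1.016; parameters closer to 1 than that would need either the rescaling above or an error.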

@tylerjereddy (Contributor, Author)

My approach wasn't flying here, and a lot of the older random stuff will probably end up preserved for back-compat anyway, based on the NEP.

If this becomes a priority it can be reopened or restarted, etc.
