BUG: fixes np.random.zipf() hanging on pathological input by tylerjereddy · Pull Request #10912 · numpy/numpy


Closed
tylerjereddy wants to merge 1 commit into numpy:master from tylerjereddy:issue_9829_zipf_hang

Conversation

@tylerjereddy (Contributor)

A cleaner fix might be possible if I were to, e.g., move the C algorithm into Cython; leaving the while loop in C seems to require a bit of low-level intervention to enable KeyboardInterrupt handling, and using the Python C API as suggested in the referenced issue results in a segfault after catching Ctrl+C.

I'm also not a big fan of the unit test structure, which writes a Python script to a tempfile (I'm not even sure that will work on CI); perhaps there are better suggestions. Probing SIGINT handling on pathological input seems like a tricky thing to test.

@tylerjereddy force-pushed the issue_9829_zipf_hang branch from 5051603 to 8a97ed3 on April 16, 2018 03:51
* Fixes Issue numpy#9829 by allowing a keyboard interruption
to break out of zipf() computations that run for very
long times because of pathological inputs

* Added a unit test that verifies handling of SIGINT
by zipf()
@tylerjereddy force-pushed the issue_9829_zipf_hang branch from 8a97ed3 to 3e98c39 on April 19, 2018 04:36
@tylerjereddy (Contributor, Author)

I adjusted things so that it works (at least in my hands) on Windows & Linux with a keyboard interrupt; we'll see if the CI agrees. Even if it does, I still have some residual concerns about the approach's potential fragility.

 * approach is used here;
 * Related to Issue #9829
 */
signal(SIGINT, kb_interrupt_handler);
Member

The handler is installed on every call to rk_zipf? When do you remove it?

Member

It's in the wrong place anyway. The problem is that the while loop never exits: the acceptance-rejection implementation achieves nearly 100% rejection for parameters near one, on account of an unfortunate scaling.

@njsmith (Member) commented Apr 19, 2018

So if I understand right, you install a C-level signal handler that aborts the process? That seems like an unfortunate solution :-(.

There is an existing approach to this problem in npy_interrupt.h that uses longjmp to return early when a signal is detected. But that approach is terrifying to start with, can't work on Windows, and is buggy (#7545), so I don't really recommend it.

Given the simplicity of this function, it seems like checking for signals occasionally would be a better approach? E.g., calling PyErr_CheckSignals once every 1000 iterations, something like that? I don't know why PyErr_CheckSignals would be segfaulting – maybe you were calling it without the GIL held?

The trickier issue is that right now, there's a strict separation between the Python wrapper and the C distribution code, but to make this work right we would need to add a C API call (PyErr_CheckSignals) to the C distribution code. I think in this case, violating that layering is actually the right idea – layering is a good thing, but if it stops you implementing real features, then it's the wrong layering! However, I'm not sure what the actual best way to do this would be.

Paging @rkern...
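For concreteness, here is a minimal sketch of that periodic-check idea. The loop body is the Devroye-style rejection sampler as it appears in randomkit; the 1000-iteration interval and the -1 sentinel are illustrative assumptions (not code from this PR), it assumes the GIL is held by the caller, and the wrapper would still need to turn the sentinel into an early exit so the exception propagates:

#include <Python.h>
#include <math.h>
#include "randomkit.h"

long rk_zipf_with_check(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);
    long iters = 0;

    for (;;) {
        double U = 1.0 - rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        if (X >= 1.0) {
            double T = pow(1.0 + 1.0 / X, am1);
            if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
                return (long)X;
        }
        /* Every 1000 rejections, let Python run pending signal handlers
         * so Ctrl+C raises KeyboardInterrupt instead of hanging forever. */
        if (++iters % 1000 == 0 && PyErr_CheckSignals() != 0)
            return -1;  /* an exception is now set; caller must bail out */
    }
}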

@charris (Member) commented Apr 19, 2018

We could also fix the function itself if we stop worrying about maintaining the current stream. The problem with zipf is a poor implementation; it isn't inherent.

@charris (Member) commented Apr 19, 2018

Note that the problem is also predictable, so we could also simply raise an error and bail for certain values of the distribution parameter.

@tylerjereddy (Contributor, Author)

Yeah, I did try putting PyErr_CheckSignals in the C while loop. From what I could tell from some searching around, the overhead of checking for signals is so low that you can normally afford to do it on every iteration.

@njsmith, if we assume I've botched the GIL handling, the related C API docs seem to suggest something like this:

PyGILState_STATE gstate;
gstate = PyGILState_Ensure();  /* acquire the GIL for this thread */

PyErr_CheckSignals();

/* Release the GIL state. */
PyGILState_Release(gstate);

Does that look about right or is there a somewhat different convention in NumPy?

@njsmith (Member) commented Apr 19, 2018

That GIL code looks plausible to me, at a first glance anyway. I guess you also need to check the return value from PyErr_CheckSignals and find some way to communicate to the caller that an exception has been set? I don't have any guesses about the segfault beyond that, though.
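Combining those two points, a minimal sketch of what that could look like (check_interrupt is a hypothetical helper name, not an existing NumPy or CPython API):

#include <Python.h>

/* Returns nonzero if a signal handler raised an exception (e.g.
 * KeyboardInterrupt). PyErr_CheckSignals() returns -1 in that case and
 * leaves the Python exception set, so the caller only needs to unwind. */
static int check_interrupt(void)
{
    PyGILState_STATE gstate = PyGILState_Ensure();  /* ensure we hold the GIL */
    int rc = PyErr_CheckSignals();
    PyGILState_Release(gstate);
    return rc != 0;
}

The rejection loop would then do "if (check_interrupt()) return -1;", and the wrapper would treat the -1 sentinel as "an exception is pending" and return early instead of producing a sample.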

@charris (Member) commented Apr 19, 2018

One possibility that occurred to me was to simply count the number of sequential rejections and bail at some point. Just to clarify why there is a problem: the distribution becomes singular as a -> 1 (the harmonic series), and almost all trial samples will be floating-point infinity and get rejected as too big for a long. The cure is to cut off the distribution at the size of a long before generating the trial values, i.e., to scale the mapping from (0, 1] onto the majorizing distribution; see the sketch below. That gives speedups of many thousands for values that aren't even that near to 1.
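A sketch of that scaling, under stated assumptions (randomkit names, LONG_MAX as the cutoff; this is illustrative, not the fix that eventually landed). Since the trial value is X = floor(U**(-1/(a-1))), drawing U from [LONG_MAX**-(a-1), 1) instead of (0, 1] guarantees every trial already fits in a long:

#include <limits.h>
#include <math.h>
#include "randomkit.h"

long rk_zipf_capped(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);
    /* Smallest U whose image X = floor(U**(-1/am1)) still fits in a long. */
    double umin = pow((double)LONG_MAX, -am1);

    for (;;) {
        /* Scale U into [umin, 1): no trial can now overflow a long. */
        double U = umin + (1.0 - umin) * rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
            return (long)X;
    }
}

The truncated majorizer still dominates the target on the restricted support, so the acceptance test is unchanged; the cost is that this consumes the random stream differently from the current implementation, which is exactly the compatibility question raised below.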

@njsmith (Member) commented Apr 19, 2018

I believe @rkern is working on a NEP for breaking distribution compatibility. If we don't want to wait for that though then erroring out on problematic values seems reasonable.

If we do go the error route, then I think it's better to trigger it based on the parameter exceeding some numeric threshold, rather than on the rejection count. Using the rejection count means that the same call with the same parameters might sometimes work and sometimes error out depending on the seed. That seems particularly miserable if you have a script using parameters that are mostly safe but then fail for, say, 1 in 100 seeds...

@eric-wieser (Member)
How about making rk_zipf attempt at most 1000 samples, and return -1 if none is accepted within that limit? The caller can then call it repeatedly, invoking PyErr_CheckSignals() after each failing call; a sketch follows below.
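A sketch of that split (the batch size and names are illustrative assumptions):

/* C side: try at most 1000 draws; -1 is a safe sentinel because valid
 * Zipf samples are always >= 1. */
long rk_zipf_batch(rk_state *state, double a)
{
    double am1 = a - 1.0;
    double b = pow(2.0, am1);

    for (int trial = 0; trial < 1000; trial++) {
        double U = 1.0 - rk_double(state);
        double V = rk_double(state);
        double X = floor(pow(U, -1.0 / am1));
        if (X < 1.0)
            continue;
        double T = pow(1.0 + 1.0 / X, am1);
        if (V * X * (T - 1.0) / (b - 1.0) <= T / b)
            return (long)X;
    }
    return -1;  /* no acceptance in this batch; caller decides what next */
}

/* Wrapper side (GIL held): retry between batches, checking for signals. */
long sample;
while ((sample = rk_zipf_batch(state, a)) == -1) {
    if (PyErr_CheckSignals() != 0)
        goto fail;  /* KeyboardInterrupt or other pending exception */
}

This keeps all Python C API calls out of the distribution code, preserving the layering discussed above, at the cost of a slightly clunkier C interface.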

@charris (Member) commented Apr 22, 2018

It should not be difficult to compute the cutoff for the parameter, we just need to decide what ratio of acceptance to failure we are willing to live with. In general, I would regard an algorithm of this type as pretty poor if only half the trials were accepted ...
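For a rough sense of where such a cutoff lands (a back-of-envelope estimate, assuming a 64-bit long and counting only rejections caused by overflow): a trial X = floor(U**(-1/(a-1))) exceeds LONG_MAX ≈ 2**63 exactly when U < 2**(-63(a-1)), so the overflow-rejection rate per trial is about 2**(-63(a-1)). Keeping that below 1/2 requires 63(a-1) >= 1, i.e. a >= 1 + 1/63 ≈ 1.016; parameters closer to 1 than that would need either the rescaling above or an error.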

@tylerjereddy (Contributor, Author)

My approach wasn't flying here, and a lot of the older random stuff will probably end up preserved for back-compat anyway, based on the NEP.

If this becomes a priority it can be reopened or restarted, etc.
