BUG: random: Fix long delays/hangs with zipf(a) when a near 1. #27048
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The problem reported in gh-9829 is that
zipf(a)
hangs whena
is very close to 1.The implementation is based on the rejection method described in the text "Non-Uniform Random Variate Generation" by Luc Devroye. The candidate X is generated from a uniform random variate U in (0, 1] with this code:
X is rejected if it exceeds the largest value of the integer return type. (As noted in the code comments, the
zipf
function models a Zipf distribution truncated toRAND_INT_MAX
.)The problem is that when
a
is close to 1,1 / (a - 1)
is large, and when U is sufficiently small, X will be much larger than RAND_INT_MAX and X can even overflow to infinity. The closera
is to 1, the more likely this is to happen, resulting in the code spending a lot of time rejecting values that are too big.The fix is straightforward: for the given
a
, work backwards fromRAND_INT_MAX
to find the values of U that will cause X to be too big, and eliminate those values from the uniform variate U. It is the smaller values of U that result in bigger values of X, so we need to find Umin, below which X will be too big, and then sample U from the interval (Umin, 1] instead of (0, 1]. By solvingwe find
So the fix in the code is to initialize Umin before entering the loop with:
and to replace the generation of U with
(or something equivalent) so U is sampled from the interval (Umin, 1].
With this change, an example as extreme as
a = np.nextafter(1.0, 2.0)
does not cause a problem:Closes gh-9829.