BUG: random: Fix long delays/hangs with zipf(a) when a near 1. #27048

WarrenWeckesser · 2024-07-26T04:12:53Z

The problem reported in gh-9829 is that zipf(a) hangs when a is very close to 1.

The implementation is based on the rejection method described in the text "Non-Uniform Random Variate Generation" by Luc Devroye. The candidate X is generated from a uniform random variate U in (0, 1] with this code:

am1 = a - 1.0;
[...]
X = floor(pow(U, -1.0 / am1));
if (X > (double)RAND_INT_MAX || X < 1.0) {
  continue;
}

X is rejected if it exceeds the largest value of the integer return type. (As noted in the code comments, the zipf function models a Zipf distribution truncated to RAND_INT_MAX.)

The problem is that when a is close to 1, the exponent -1 / (a - 1) is large in magnitude, so for sufficiently small U, X is much larger than RAND_INT_MAX and can even overflow to infinity. The closer a is to 1, the more likely this is to happen, and the code ends up spending most of its time rejecting candidates that are too big.
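To make the overflow concrete, here is a small Python sketch (not the C implementation) that evaluates the candidate expression for a fixed U = 0.5. The helper name `candidate` and the value RAND_INT_MAX = 2**63 - 1 are assumptions for illustration. Python's `math.pow` raises `OverflowError` where C's `pow` would return infinity, so the sketch maps that case to `inf`.

```python
import math

RAND_INT_MAX = 2**63 - 1  # assumption: 64-bit signed integer return type


def candidate(u, a):
    """Evaluate floor(pow(u, -1/(a-1))) as a float, returning inf on overflow."""
    am1 = a - 1.0
    try:
        x = math.pow(u, -1.0 / am1)
    except OverflowError:
        # C's pow() would return +infinity here instead of raising.
        return math.inf
    return float(math.floor(x))


for a in (2.0, 1.01, 1.0001):
    x = candidate(0.5, a)
    # Third column: would this candidate be rejected as too big?
    print(a, x, x > RAND_INT_MAX)
```

With U = 0.5, a = 2 gives X = 2, a = 1.01 already gives a candidate far above RAND_INT_MAX, and a = 1.0001 overflows.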

The fix is straightforward: for the given a, work backwards from RAND_INT_MAX to find the values of U that will cause X to be too big, and eliminate those values from the uniform variate U. It is the smaller values of U that result in bigger values of X, so we need to find Umin, below which X will be too big, and then sample U from the interval (Umin, 1] instead of (0, 1]. By solving

Umin**(-1/(a-1)) == RAND_INT_MAX

we find

Umin = RAND_INT_MAX**(1 - a)
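As a quick numerical check of this closed form, the following Python sketch computes Umin for a few values of a and confirms that plugging it back into the candidate expression recovers RAND_INT_MAX. The function name `umin` and the value RAND_INT_MAX = 2**63 - 1 are assumptions. Because U is uniform on (0, 1], Umin is also (ignoring the floor) the probability that the original code generates a candidate that is too big.

```python
import math

RAND_INT_MAX = 2**63 - 1  # assumption: 64-bit signed integer return type


def umin(a):
    """Smallest U for which pow(U, -1/(a-1)) does not exceed RAND_INT_MAX."""
    return math.pow(RAND_INT_MAX, 1.0 - a)


for a in (2.0, 1.1, 1.001, 1.000001):
    u = umin(a)
    # Plug Umin back into the candidate expression; this should be
    # close to RAND_INT_MAX, up to floating-point rounding.
    x = math.pow(u, -1.0 / (a - 1.0))
    print(f"a={a}: Umin={u!r}, X(Umin)/RAND_INT_MAX={x / RAND_INT_MAX!r}")
```

For a = 1.001, Umin is already about 0.957, so roughly 96% of the candidates drawn from (0, 1] would be rejected for size alone.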

So the fix in the code is to initialize Umin before entering the loop with:

Umin = pow(RAND_INT_MAX, -am1);

and to replace the generation of U with

U01 = next_double(bitgen_state);
U = U01*Umin + (1 - U01);

(or something equivalent) so U is sampled from the interval (Umin, 1].
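The effect of the change on candidate generation can be sketched in Python as follows. This is an illustration under stated assumptions, not the C code from the PR: the function `candidates`, its `truncate` flag, and RAND_INT_MAX = 2**63 - 1 are hypothetical, Python's `random.Random` stands in for `next_double`, and the acceptance test of the rejection method (elided in the snippet above) is omitted entirely.

```python
import math
import random

RAND_INT_MAX = 2**63 - 1  # assumption: 64-bit signed integer return type


def candidates(a, n, rng, truncate=True):
    """Generate n candidate values X (before the omitted acceptance test)."""
    am1 = a - 1.0
    # With truncate=False, Umin = 0 reproduces sampling U from (0, 1].
    u_min = math.pow(RAND_INT_MAX, -am1) if truncate else 0.0
    out = []
    for _ in range(n):
        u01 = rng.random()              # uniform in [0, 1)
        u = u01 * u_min + (1.0 - u01)   # uniform in (u_min, 1]
        try:
            x = math.floor(math.pow(u, -1.0 / am1))
        except OverflowError:
            x = math.inf                # C's pow() would return +infinity
        out.append(x)
    return out


a = 1.000001
old = candidates(a, 1000, random.Random(0), truncate=False)
new = candidates(a, 1000, random.Random(0), truncate=True)
print("in range without Umin:", sum(x <= RAND_INT_MAX for x in old))
print("in range with Umin:   ", sum(x <= RAND_INT_MAX for x in new))
```

Rounding can still push a candidate marginally past RAND_INT_MAX when U lands very close to Umin, which is one reason to keep the existing range check in the loop.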

With this change, an example as extreme as a = np.nextafter(1.0, 2.0) does not cause a problem:

In [39]: a = np.nextafter(1.0, 2.0)

In [40]: rng.zipf(a, size=12)
Out[40]: 
array([          294267566,       6481674477934,    4311231547115815,
       7794889495726944256,                1096,             1202604,
               16066464720,              729416,              268337,
                  14650719,            65659969,       1446257064291])

Closes gh-9829.

charris · 2024-07-26T13:57:16Z

LGTM. I suspect that in practical use, most folks would only be interested in the smaller integers (rankings, for instance), so it would probably be useful to let the user set the truncation point with a new argument.

WarrenWeckesser · 2024-07-26T14:08:33Z

The second commit restores the legacy implementation of RandomState.zipf (including a reversion of #27046).

charris · 2024-07-26T15:22:08Z

Thanks Warren.

BUG: random: Fix long delays/hangs with zipf(a) when a near 1.

f5a5e04

Closes gh-9829.

WarrenWeckesser added 00 - Bug component: numpy.random labels Jul 26, 2024

MAINT: Restore legacy zipf implementation.

0a576b5

charris approved these changes Jul 26, 2024

charris merged commit 2459d80 into numpy:main Jul 26, 2024
68 checks passed

WarrenWeckesser deleted the zipf-fix-a-near-1 branch July 26, 2024 15:51
