8000 BUG: random: Fix long delays/hangs with zipf(a) when a near 1. by WarrenWeckesser · Pull Request #27048 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: random: Fix long delays/hangs with zipf(a) when a near 1. #27048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 26, 2024

Conversation

WarrenWeckesser
Copy link
Member
@WarrenWeckesser WarrenWeckesser commented Jul 26, 2024

The problem reported in gh-9829 is that zipf(a) hangs when a is very close to 1.

The implementation is based on the rejection method described in the text "Non-Uniform Random Variate Generation" by Luc Devroye. The candidate X is generated from a uniform random variate U in (0, 1] with this code:

am1 = a - 1.0;
[...]
X = floor(pow(U, -1.0 / am1))
if (X > (double)RAND_INT_MAX || X < 1.0) {
  continue;
}

X is rejected if it exceeds the largest value of the integer return type. (As noted in the code comments, the zipf function models a Zipf distribution truncated to RAND_INT_MAX.)

The problem is that when a is close to 1, 1 / (a - 1) is large, and when U is sufficiently small, X will be much larger than RAND_INT_MAX and X can even overflow to infinity. The closer a is to 1, the more likely this is to happen, resulting in the code spending a lot of time rejecting values that are too big.

The fix is straightforward: for the given a, work backwards from RAND_INT_MAX to find the values of U that will cause X to be too big, and eliminate those values from the uniform variate U. It is the smaller values of U that result in bigger values of X, so we need to find Umin, below which X will be too big, and then sample U from the interval (Umin, 1] instead of (0, 1]. By solving

Umin**(-1/(a-1)) == RAND_INT_MAX

we find

Umin = RAND_INT_MAX**(1 - a)

So the fix in the code is to initialize Umin before entering the loop with:

Umin = pow(RAND_INT_MAX, -am1);

and to replace the generation of U with

U01 = next_double(bitgen_state);
U = U01*Umin + (1 - U01);

(or something equivalent) so U is sampled from the interval (Umin, 1].

With this change, an example as extreme as a = np.nextafter(1.0, 2.0) does not cause a problem:

In [39]: a = np.nextafter(1.0, 2.0)

In [40]: rng.zipf(a, size=12)
Out[40]: 
array([          294267566,       6481674477934,    4311231547115815,
       7794889495726944256,                1096,             1202604,
               16066464720,              729416,              268337,
                  14650719,            65659969,       1446257064291])

Closes gh-9829.

@charris
Copy link
Member
charris commented Jul 26, 2024

LGTM. I suspect that in practical use, most folks would only be interested in the smaller integers, ranking, for instance, so it would probably be useful to allow the user to set the truncation point with a new argument.

@WarrenWeckesser
Copy link
Member Author

The second commit restores the legacy implementation of RandomState.zipf (including a reversion of #27046).

@charris charris merged commit 2459d80 into numpy:main Jul 26, 2024
68 checks passed
@charris
Copy link
Member
charris commented Jul 26, 2024

Thanks Warren.

@WarrenWeckesser WarrenWeckesser deleted the zipf-fix-a-near-1 branch July 26, 2024 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: np.random.zipf hangs the interpreter on pathological input
2 participants
0