BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443
WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this issue on Jul 2, 2018:
…uments.

The changes in this pull request restrict the arguments ngood and nbad of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the sense of the old joke "Doctor, it hurts when I do *this*") makes the following problems impossible to encounter:

* Python hangs with these calls (see numpygh-11443):

      np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
      np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

* Also reported in numpygh-11443: the distribution of the variates of np.random.hypergeometric(ngood, nbad, nsample) is not correct when nsample >= 10 and ngood + nbad is sufficiently large. I don't have a rigorous bound, but a histogram of the p-values of 1500 runs of a G-test with size=5000000 in each run starts to look suspicious when ngood + nbad is approximately 2**37. When the sum is greater than 2**38, the distribution is clearly wrong, and, as reported in the pull request numpygh-9834, the following call returns all zeros:

      np.random.hypergeometric(2**55, 2**55, 10, size=20)

* The code does not check for integer overflow of ngood + nbad, so the call np.random.hypergeometric(2**62, 2**62, 1) on a system where the default integer is 64 bit results in `ValueError: ngood + nbad < nsample`.

By restricting the two arguments to be less than 2**30, the values are well within the space of inputs for which I have no evidence of the function generating an incorrect distribution, and the overflow problem is avoided on both 32-bit and 64-bit systems.

I replaced the test hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0 with hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The original test was a regression test for a bug in which the function would enter an infinite loop when the arguments were sufficiently large; the regression test for that fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) on systems with 64-bit integers. This call is now disallowed, so I added a call with the maximum allowed values of ngood and nbad.
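A sketch of the replacement regression test described in the commit message (pytest style assumed; the real test lives in NumPy's test suite and may differ in detail):

```python
import numpy as np

def test_hypergeometric_max_args_no_hang():
    # The largest arguments allowed after the fix; the original bug was
    # an infinite loop for sufficiently large inputs, so returning a
    # positive variate (rather than hanging) is the pass condition.
    assert np.random.hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0
```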
We're not touching the legacy implementation, and the appropriate changes have been included in the …
In the pull request gh-9834, I reported that a call such as `np.random.hypergeometric(2**55, 2**55, 10, size=20)`, where `good` and `bad` are ridiculously large and `nsample` is less than or equal to 10, incorrectly generates an array of all zeros. As explained in that pull request, the problem is due to the limited precision of the floating point calculations in the underlying C code.

There are also problems when `nsample` is larger. For example, these calls will hang the Python interpreter: `np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)` and `np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)`. Moreover, when `nsample` is 10 or more and `good + bad` is sufficiently large, the generated samples are not correctly distributed.
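For reference, a small reproduction script assembled from the calls quoted in this issue and the referenced commit; run it only against an affected NumPy version, and note that the hanging calls are deliberately commented out:

```python
import numpy as np

# On an affected NumPy version, this incorrectly returns all zeros
# (huge ngood/nbad with nsample <= 10; see gh-9834):
print(np.random.hypergeometric(2**55, 2**55, 10, size=20))

# These calls hang the Python interpreter -- do not uncomment casually:
# np.random.hypergeometric(2**62 - 1, 2**62 - 1, 26, size=2)
# np.random.hypergeometric(2**62 - 1, 2**62 - 1, 11, size=12)
```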
I have run repeated tests and checked the distribution using either the chi-square test or the G-test (i.e. likelihood ratio test). When the arguments are crazy big, the distribution is no longer correct.
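As an illustration of the kind of check involved (a minimal sketch, not the actual test harness; it assumes SciPy is available and uses modest arguments so the exact PMF is easy to compute):

```python
import numpy as np
from scipy.stats import hypergeom, chisquare

ngood, nbad, nsample, size = 50, 60, 10, 5_000_000
draws = np.random.hypergeometric(ngood, nbad, nsample, size=size)

# Full support of the hypergeometric distribution for these arguments.
support = np.arange(max(0, nsample - nbad), min(ngood, nsample) + 1)
expected = hypergeom.pmf(support, ngood + nbad, ngood, nsample) * size
observed = np.array([(draws == k).sum() for k in support])

stat, pvalue = chisquare(observed, expected)
print(pvalue)  # under H0 (correct distribution), p-values are uniform
```

Repeating such a run many times and inspecting the histogram of p-values (as described above for the G-test) is what reveals the bias at very large arguments.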
All these problems are connected to floating point calculations in which important information ends up having a magnitude on the scale of the ULP of the variables in the computation.
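To make the precision point concrete, a simple float64 illustration (not taken from the C code): at the magnitudes reported here, the floating point format cannot even represent a change of 1.

```python
import numpy as np

x = np.float64(2**55)
print(x + 1 == x)     # True: 1 is far below the ULP at this magnitude
print(np.spacing(x))  # 8.0: the gap to the next representable float64
```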
A quick fix is to simply disallow such large arguments. I'll create a pull request in which the function will raise a ValueError if `good + bad` exceeds some safe maximum. That maximum is still to be determined, but preliminary experiments show that something on the order of 2**35 works. I don't expect the change to have any impact on existing code--it is hard to imagine anyone actually using the function with such large values. It is still worthwhile to fix the issue, if only to prevent hanging the interpreter when someone accidentally passes huge arguments to the function.

A better fix would be to improve the implementation, but that will almost certainly require changing the stream of variates produced by the function. Such a change will have to wait until the change in numpy.random's reproducibility policy has been implemented.
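A minimal sketch of the kind of guard being proposed (the wrapper name and the exact bound here are illustrative; the fix in the referenced commit ultimately bounds ngood and nbad individually at 2**30 rather than bounding their sum):

```python
import numpy as np

HYPERGEOM_MAX = 2**30  # illustrative safe maximum; see discussion above

def checked_hypergeometric(good, bad, nsample, size=None):
    # Reject arguments large enough to trigger the precision and
    # overflow problems described in this issue.
    if good >= HYPERGEOM_MAX or bad >= HYPERGEOM_MAX:
        raise ValueError("good and bad must each be less than 2**30")
    return np.random.hypergeometric(good, bad, nsample, size=size)
```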