BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443
WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this issue on Jul 2, 2018:
…uments.

The changes in this pull request restrict the arguments ngood and nbad of numpy.random.hypergeometric to be less than 2**30. This "fix" (in the sense of the old joke "Doctor, it hurts when I do *this*") makes the following problems impossible to encounter:

* Python hangs with these calls (see numpygh-11443):

      np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
      np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

* Also reported in numpygh-11443: the distribution of the variates of np.random.hypergeometric(ngood, nbad, nsample) is not correct when nsample >= 10 and ngood + nbad is sufficiently large. I don't have a rigorous bound, but a histogram of the p-values of 1500 runs of a G-test with size=5000000 in each run starts to look suspicious when ngood + nbad is approximately 2**37. When the sum is greater than 2**38, the distribution is clearly wrong, and, as reported in the pull request numpygh-9834, the following call returns all zeros:

      np.random.hypergeometric(2**55, 2**55, 10, size=20)

* The code does not check for integer overflow of ngood + nbad, so the call np.random.hypergeometric(2**62, 2**62, 1) on a system where the default integer is 64 bit results in `ValueError: ngood + nbad < nsample`.

By restricting the two arguments to be less than 2**30, the values are well within the space of inputs for which I have no evidence of the function generating an incorrect distribution, and the overflow problem is avoided on both 32-bit and 64-bit systems.

I replaced the test hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0 with hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0. The original test was a regression test for a bug in which the function would enter an infinite loop when the arguments were sufficiently large; the regression test for that fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) on systems with 64-bit integers. This call is now disallowed, so I added a call with the maximum allowed values of ngood and nbad.
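A sketch of the replacement regression test described in the commit message (pytest style assumed; the real test lives in NumPy's test suite and may differ in detail):

```python
import numpy as np

def test_hypergeometric_max_args_no_hang():
    # The largest arguments allowed after the fix; the original bug was
    # an infinite loop for sufficiently large inputs, so returning a
    # positive variate (rather than hanging) is the pass condition.
    assert np.random.hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0
```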
We're not touching the legacy implementation, and the appropriate changes have been included in the …
In the pull request gh-9834, I reported that a call such as `np.random.hypergeometric(2**55, 2**55, 10, size=20)`, where `good` and `bad` are ridiculously large and `nsample` is less than or equal to 10, incorrectly generates an array of all zeros. As explained in that pull request, the problem is due to the limited precision of the floating point calculations in the underlying C code.

There are also problems when `nsample` is larger. For example, these calls will hang the Python interpreter: `np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)` and `np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)`. Moreover, when `nsample` is 10 or more and `good + bad` is sufficiently large, the generated samples are not correctly distributed.
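For reference, a small reproduction script assembled from the calls quoted in this issue and the referenced commit; run it only against an affected NumPy version, and note that the hanging calls are deliberately commented out:

```python
import numpy as np

# On an affected NumPy version, this incorrectly returns all zeros
# (huge ngood/nbad with nsample <= 10; see gh-9834):
print(np.random.hypergeometric(2**55, 2**55, 10, size=20))

# These calls hang the Python interpreter -- do not uncomment casually:
# np.random.hypergeometric(2**62 - 1, 2**62 - 1, 26, size=2)
# np.random.hypergeometric(2**62 - 1, 2**62 - 1, 11, size=12)
```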
I have run repeated tests and checked the distribution using either the chi-square test or the G-test (i.e. likelihood ratio test). When the arguments are crazy big, the distribution is no longer correct.
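As an illustration of the kind of check involved (a minimal sketch, not the actual test harness; it assumes SciPy is available and uses modest arguments so the exact PMF is easy to compute):

```python
import numpy as np
from scipy.stats import hypergeom, chisquare

ngood, nbad, nsample, size = 50, 60, 10, 5_000_000
draws = np.random.hypergeometric(ngood, nbad, nsample, size=size)

# Full support of the hypergeometric distribution for these arguments.
support = np.arange(max(0, nsample - nbad), min(ngood, nsample) + 1)
expected = hypergeom.pmf(support, ngood + nbad, ngood, nsample) * size
observed = np.array([(draws == k).sum() for k in support])

stat, pvalue = chisquare(observed, expected)
print(pvalue)  # under H0 (correct distribution), p-values are uniform
```

Repeating such a run many times and inspecting the histogram of p-values (as described above for the G-test) is what reveals the bias at very large arguments.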
All these problems are connected to floating point calculations in which important information ends up having a magnitude on the scale of the ULP of the variables in the computation.
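To make the precision point concrete, a simple float64 illustration (not taken from the C code): at the magnitudes reported here, the floating point format cannot even represent a change of 1.

```python
import numpy as np

x = np.float64(2**55)
print(x + 1 == x)     # True: 1 is far below the ULP at this magnitude
print(np.spacing(x))  # 8.0: the gap to the next representable float64
```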
A quick fix is to simply disallow such large arguments. I'll create a pull request in which the function will raise a ValueError if `good + bad` exceeds some safe maximum. That maximum is still to be determined, but preliminary experiments show that something on the order of 2**35 works. I don't expect the change to have any impact on existing code--it is hard to imagine anyone actually using the function with such large values. It is still worthwhile to fix the issue, if only to prevent hanging the interpreter when someone accidentally passes huge arguments to the function.

A better fix would be to improve the implementation, but that will almost certainly require changing the stream of variates produced by the function. Such a change will have to wait until the change in numpy.random's reproducibility policy has been implemented.
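A minimal sketch of the kind of guard being proposed (the wrapper name and the exact bound here are illustrative; the fix in the referenced commit ultimately bounds ngood and nbad individually at 2**30 rather than bounding their sum):

```python
import numpy as np

HYPERGEOM_MAX = 2**30  # illustrative safe maximum; see discussion above

def checked_hypergeometric(good, bad, nsample, size=None):
    # Reject arguments large enough to trigger the precision and
    # overflow problems described in this issue.
    if good >= HYPERGEOM_MAX or bad >= HYPERGEOM_MAX:
        raise ValueError("good and bad must each be less than 2**30")
    return np.random.hypergeometric(good, bad, nsample, size=size)
```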