
BUG: random: Problems with hypergeometric with ridiculously large arguments. #11443


Closed
WarrenWeckesser opened this issue Jun 28, 2018 · 1 comment


Member
WarrenWeckesser commented Jun 28, 2018

In the pull request gh-9834, I reported that a call such as

np.random.hypergeometric(2**55, 2**55, 10, size=20)

where ngood and nbad are ridiculously large and nsample is less than or equal to 10, incorrectly generates an array of all zeros. As explained in that pull request, the problem is caused by the limited precision of the floating-point calculations in the underlying C code.
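For a quick sanity check (an illustrative sketch, not from the original report), the mean of each variate should be nsample * ngood / (ngood + nbad), which is 5 for the call above, so an all-zero result is clearly wrong:

    import numpy as np

    ngood = nbad = 2**55
    nsample = 10

    out = np.random.hypergeometric(ngood, nbad, nsample, size=20)
    print(out)                               # reported to be all zeros here
    print(nsample * ngood / (ngood + nbad))  # expected mean per draw: 5.0
    print(out.mean())                        # should be close to 5.0, not 0.0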

There are also problems when nsample is larger. For example, these calls will hang the Python interpreter:

np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

Moreover, the following call generates samples that are not correctly distributed:

np.random.hypergeometric(2**48, 2**47, 12, size=10000000)

I have run repeated tests and checked the distribution using either the chi-square test or the G-test (i.e. the likelihood ratio test). When the arguments are this big, the distribution is no longer correct.
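A minimal sketch of the kind of goodness-of-fit check described above (illustrative, not the exact script used); it assumes scipy is available and uses the binomial pmf as the reference, which is an excellent approximation when ngood + nbad is this much larger than nsample:

    import numpy as np
    from scipy import stats

    ngood, nbad, nsample, size = 2**48, 2**47, 12, 10_000_000
    sample = np.random.hypergeometric(ngood, nbad, nsample, size=size)
    observed = np.bincount(sample, minlength=nsample + 1)

    # With ngood + nbad this large, the hypergeometric pmf is essentially
    # binomial(nsample, ngood / (ngood + nbad)); use that as the reference.
    k = np.arange(nsample + 1)
    expected = size * stats.binom.pmf(k, nsample, ngood / (ngood + nbad))

    # G-test (likelihood ratio test) via power_divergence; consistently tiny
    # p-values over repeated runs indicate an incorrect distribution.
    g_stat, p_value = stats.power_divergence(observed, expected,
                                             lambda_="log-likelihood")
    print(g_stat, p_value)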

All these problems are connected to floating point calculations in which important information ends up having a magnitude on the scale of the ULP of the variables in the computation.
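To make the ULP point concrete (an illustrative example): near 2**55 the spacing between adjacent doubles is already 8, comparable to nsample itself, so contributions of that size can be lost entirely:

    import numpy as np

    print(np.spacing(2.0**55))        # 8.0  -- ULP of a double near ngood = 2**55
    print(np.spacing(2.0**56))        # 16.0 -- ULP near ngood + nbad = 2**56
    print(2.0**56 + 7.0 == 2.0**56)   # True: a change of 7 is simply lost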

A quick fix is to simply disallow such large arguments. I'll create a pull request in which the function will raise a ValueError if ngood + nbad exceeds some safe maximum. That maximum is still to be determined, but preliminary experiments show that something on the order of 2**35 works. I don't expect the change to have any impact on existing code--it is hard to imagine anyone actually using the function with such large values. It is still worthwhile fixing the issue, if only to prevent hanging the interpreter when someone accidentally passes huge arguments to the function.
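Roughly, the proposed guard would look like the following Python-level sketch; the real check belongs in the Cython/C layer, and the function name and cutoff constant here are placeholders:

    import numpy as np

    HYPERGEOM_MAX = 2**35   # placeholder; the final bound is still to be determined

    def safe_hypergeometric(ngood, nbad, nsample, size=None):
        # Reject argument combinations known to break the C implementation.
        if ngood + nbad > HYPERGEOM_MAX:
            raise ValueError("ngood + nbad is too large for a reliable result")
        return np.random.hypergeometric(ngood, nbad, nsample, size=size)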

A better fix will be to improve the implementation, but that will almost certainly require changing the stream of variates produced by the function. Such a change will have to wait until the change in numpy.random's reproducibility policy has been implemented.

WarrenWeckesser added a commit to WarrenWeckesser/numpy that referenced this issue Jul 2, 2018
…uments.

The changes in this pull request restrict the arguments ngood and nbad
of numpy.random.hypergeometric to be less than 2**30.  This "fix" (in the
sense of the old joke "Doctor, it hurts when I do *this*") makes the
following problems impossible to encounter:

* Python hangs with these calls (see numpy gh-11443):

      np.random.hypergeometric(2**62-1, 2**62-1, 26, size=2)
      np.random.hypergeometric(2**62-1, 2**62-1, 11, size=12)

* Also reported in numpy gh-11443: the distribution of the variates of

      np.random.hypergeometric(ngood, nbad, nsample)

  is not correct when nsample >= 10 and ngood + nbad is sufficiently large.
  I don't have a rigorous bound, but a histogram of the p-values of 1500
  runs of a G-test with size=5000000 in each run starts to look suspicious
  when ngood + nbad is approximately 2**37.  When the sum is greater than
  2**38, the distribution is clearly wrong, and, as reported in the pull
  request numpy gh-9834, the following call returns all zeros:

      np.random.hypergeometric(2**55, 2**55, 10, size=20)

* The code does not check for integer overflow of ngood + nbad, so the
  following call

      np.random.hypergeometric(2**62, 2**62, 1)

  on a system where the default integer is 64 bit results in
  `ValueError: ngood + nbad < nsample`.

By restricting the two arguments to be less than 2**30, the values are
well within the space of inputs for which I have no evidence of the
function generating an incorrect distribution, and the overflow problem
is avoided for both 32-bit and 64-bit systems.

I replaced the test for hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2) > 0
with hypergeometric(2**30 - 1, 2**30 - 1, 2**30 - 1) > 0.  The test was a
regression test for a bug in which the function would enter an infinite
loop when the arguments were sufficiently large.  The regression test for
the fix included the call hypergeometric(2**40 - 2, 2**40 - 2, 2**40 - 2)
on systems with 64 bit integers.  This call is now disallowed, so I add a
call with the maximum allowed values of ngood and nbad.
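The overflow mentioned in the third bullet of the commit message can be reproduced directly (an illustrative example, not part of the commit): adding two int64 values of 2**62 wraps around to a negative number, which is why the ngood + nbad < nsample check fires.

    import numpy as np

    total = np.int64(2**62) + np.int64(2**62)   # wraps past the int64 maximum
    print(total)        # -9223372036854775808 (may also emit a RuntimeWarning)
    print(total < 1)    # True, so the "ngood + nbad < nsample" check rejects it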
@WarrenWeckesser
Member Author

We're not touching the legacy implementation, and the appropriate changes have been included in the Generator class to avoid the problems described here, so closing.
