8000 BUG: Add parameter check to negative_binomial by Androp0v · Pull Request #21005 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: Add parameter check to negative_binomial #21005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 13, 2022

Conversation

Androp0v
Copy link
Contributor
@Androp0v Androp0v commented Feb 6, 2022

Hi!

First contribution to numpy here, hope I did it right. Searched the repo to see if this had been solved, but I couldn't find any other PR regarding this. PR marked as draft since it has no tests. Will add tests if you deem this approach to solving #18997 acceptable.


Add parameter check to negative_binomial to avoid internally calling poisson distribution generator with an invalid lam parameter.

Fixes #18997, noticed that it Numpy 1.22.2, macOS 12.3, Arm64 (Apple Silicon), Python 3.8.5, calling the issue's sample code:

default_rng(1).negative_binomial(2**62, 0.1)

No longer makes a silent overflow, but instead enters an infinite loop. To solve it:

  • Modified the internal disc function to support parameter checks acting on pairs of parameters (the value that caused an overflow / infinite loop was dependant on both the n and p values).
  • Added extra parameter constraint type CONS_NEGATIVE_BINOMIAL_POISSON. This parameter check:
    • Computes the mean value of the gamma distribution used as an intermediate step in building the negative binomial distribution.
    • Computes the standard deviation of that gamma distribution.
    • If a value within 10 standard deviations of the mean of such gamma function would invoke the poisson generator with a value of λ (lam) that would be above the (already present) max safe value of lam (POISSON_LAM_MAX), raise a ValueError.

Sample code

After the fix, the sample code (above), returns the following:

>>> np.random.default_rng(1).negative_binomial(2.0**62, 0.1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_generator.pyx", line 3051, in numpy.random._generator.Generator.negative_binomial
ValueError: n too large or p too small

Notes:

The parameter check is very conservative (maybe too much?). Since the POISSON_LAM_MAX is already conservative (same 10-std deviation margin) and I added another 10-std margin on top of that for the negative_binomial generator (the gamma function would have to get to a sample value 10-stds above the mean, and then call the poisson function with that value and get another 10-std above the (poisson) mean to cause a numerical overflow. Maybe the threshold of the negative binomial check could be brought down a bit (5-stds for the gamma function max, for example, but that would mean that occasionally the poisson generator could be called with a lam value above what is possible by directly calling np.random.default_rng(1).poisson.

Thoughts?

@Androp0v Androp0v changed the title Add parameter check to negative_binomial BUG: Add parameter check to negative_binomial Feb 6, 2022
@@ -3510,10 +3510,11 @@ cdef class RandomState:
... print(i, "wells drilled, probability of one success =", probability)

"""
out = disc(&legacy_negative_binomial, &self._aug_state, size, self.lock, 2, 0,
out = disc(&legacy_negative_binomial, &self._aug_state, size, self.lock, 2, 0, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to mtrand aren't allowed in general if they would change the stream. All that could be done here would be to protect against values that lead to infinite loops.

Copy link
Contributor Author
@Androp0v Androp0v Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what do you mean with changing the stream? Is changing the signature of disc not allowed?

Otherwise, note that I didn't change the output of any of the functions in mtrand at all (not even to protect against the infinite loops, calling np.random.negative_binomial(2**64, 0.1) would still result in an infinite loop, the checks only apply to the calls to default_rng().negative_binomial()), I just updated the function calls here since the new disc signature has an extra parameter, which is set to an ignored value for all functions in mtrand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a value slightly above lam max in your code would eventually return, then you have now changed the stream. I haven't reasoned carefully about this but as a general rule mtrand only changes for trivial things like doc string fixes and to keep to compiling.

See https://numpy.org/neps/nep-0019-rng-policy.html

Copy link
Contributor Author
@Androp0v Androp0v Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood your point about the stream now (thanks!). However, upon reading NEP-19 on your link, it's still unclear to me if this applies to the new PRNGs too (i.e. are changes to np.random.default_rng().negative_binomial's stream allowed?).

Calls to np.random.negative_binomial (the one in mtrand) in this PR do not raise an error even if they would result in a value of lam too large to ever return, so the stream is not affected for any function in mtrand. Only the negative_binomial distribution that is created when calling a Generator method is affected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes and improvements to Generator are definitely allowed. Changes like this that prevent infinite loops, which are really bugs, are definitely in bounds. I think it would help if you removed any changes to mtrand here so it would be easier to think about the changes to the checks independently from the NEP issues.

Copy link
Contributor
@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check shoudl be pulled up into _generator so that there is no need to alter _common. It would only make sense to alter _common if there were more examples of this type. The flow will be to determine if both inputs are scalar, and then use a scalar check. If either is not a scalar, then you should convert to arrays using the C-api using the correct conversion types, and check the elements there.

@@ -408,9 +408,18 @@ cdef int check_array_constraint(np.ndarray val, object name, constraint_type con
raise ValueError("{0} value too large".format(name))
elif not np.all(np.greater_equal(val, 0.0)):
raise ValueError("{0} < 0 or {0} contains NaNs".format(name))
elif cons == CONS_NEGATIVE_BINOMIAL_POISSON:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

< 8000 p dir="auto">Isn't this redundant with CONS_POISSON. Seems like the only difference is possibly in the exception.

cdef int check_constraint_pair(double val0, double val1, object name_0, object name_1, constraint_type cons) except -1:
cdef double max_lam
if cons == CONS_NEGATIVE_BINOMIAL_POISSON:
max_lam = (1 - val1) / val1 * (val0 + 10 * np.sqrt(np.array(val0).astype(np.double)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cast double values to array? This will be quite slow. Just use sqrt from libc on doubles.

@Androp0v
Copy link
Contributor Author

I think the check shoudl be pulled up into _generator so that there is no need to alter _common. It would only make sense to alter _common if there were more examples of this type. The flow will be to determine if both inputs are scalar, and then use a scalar check. If either is not a scalar, then you should convert to arrays using the C-api using the correct conversion types, and check the elements there.

Thanks for taking the time to review this. I put the checks in _common as I thought all checks were being performed there, but upon further research I've found some functions in _generator also perform changes there. I'll change the implementation to match your suggestions.

@Androp0v Androp0v force-pushed the negative_binomial-checks branch 2 times, most recently from 464f302 to b916807 Compare February 17, 2022 15:25
@Androp0v Androp0v force-pushed the negative_binomial-checks branch from b916807 to f3d450d Compare February 17, 2022 15:54
Copy link
Contributor
@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be simplified a bit more.

@Androp0v Androp0v force-pushed the negative_binomial-checks branch from e29bf7a to c1a95ff Compare February 17, 2022 16:44
@Androp0v Androp0v marked this pull request as ready for review February 17, 2022 16:45
if np.any(np.greater(max_lam_arr, POISSON_LAM_MAX)):
raise ValueError("n too large or p too small")
else:
_dp = PyFloat_AsDouble(p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the scalar path, think you can use something like

cdef double *_dp
cdef int64_t *_in

_in = <int64_t*>np.PyArray_DATA(n_arr)[0]
_dp = <double*>np.PyArray_DATA(p_arr)[0]
check_constraint(<double>_in[0], 'n', CONS_POSITIVE_NOT_NAN)
check_constraint(_dp[0], 'p', CONS_BOUNDED_GT_0_1)

since you already have paid the price for the conversion above to arrays, so can just grab the first element form each when scalar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this surfaced some issues with my implementation: while numpy's binomial accepts floating point values for the n argument, they're truncated to integer values. On the other hand, negative_binomial used to accept floating point arguments without truncation, so the n array should actually be of type double.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done now, hopefully!

@Androp0v Androp0v marked this pull request as draft February 18, 2022 10:07
@Androp0v Androp0v force-pushed the negative_binomial-checks branch from 3d5ecd1 to cc2ddfc Compare February 18, 2022 12:08
@Androp0v Androp0v marked this pull request as ready for review February 18, 2022 12:10
# call to the poisson distribution function with a value of lam too large.
max_lam_arr = (1 - p_arr) / p_arr * (n_arr + 10 * np.sqrt(n_arr))
if np.any(np.greater(max_lam_arr, POISSON_LAM_MAX)):
raise ValueError("n too large or p too small")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a bit to the Notes section of the docstring explaining the cross-parameter restriction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally should reference the note in the exception so that users know precisely what has happened

9E88
check_constraint(_dp[0], 'p', CONS_BOUNDED_GT_0_1)
# Check that the choice of negative_binomial parameters won't result in a
# call to the poisson distribution function with a value of lam too large.
_dmax_lam = (1 - _dp[0]) / _dp[0] * (_dn[0] + 10 * np.sqrt(_dn[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cdef double _dmax_lam above

check_constraint(_dp[0], 'p', CONS_BOUNDED_GT_0_1)
# Check that the choice of negative_binomial parameters won't result in a
# call to the poisson distribution function with a value of lam too large.
_dmax_lam = (1 - _dp[0]) / _dp[0] * (_dn[0] + 10 * np.sqrt(_dn[0]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use libc.math.sqrt

from libc.math cimport sqrt

@bashtage
Copy link
Contributor

A few easy fixes that looks ready.

within 10 sigma of the max acceptable value of the ``Generator.poisson``
method. This happens when :math:`p` is too low (a lot of failures happen
for every success) and :math:`n` is too big (a lot of sucesses are
allowed).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add in the mathematical formula for computing the max?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@bashtage
Copy link
Contributor

I think my last was the final change needed before we invite a committer to review.

@Androp0v Androp0v force-pushed the negative_binomial-checks branch from fc084ea to ba31f5a Compare February 18, 2022 17:50
@Androp0v Androp0v force-pushed the negative_binomial-checks branch from ba31f5a to c320949 Compare February 18, 2022 18:01
@bashtage
Copy link
Contributor

@rkern @mattip @charris This looks correct and sensible to me. It would be good if you could take a look.

Copy link
Member
@rkern rkern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @Androp0v!

@Androp0v Androp0v force-pushed the negative_binomial-checks branch from c320949 to 2686f44 Compare February 21, 2022 17:00
@charris charris merged commit 55f31b2 into numpy:main Mar 13, 2022
@charris
Copy link
Member
charris commented Mar 13, 2022

Thanks @Androp0v .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Silent overflow error in numpy.random.default_rng.negative_binomial
4 participants
0