Bug in histogram: incorrectly handles float32 inputs just below lower bin limit #10319
Actually, I think we want a variation on idea 1:

```python
# The first "keep" is only approximately correct when the input data
# are single-precision float32.  To prevent bincount from raising an
# error, select the data again.
keep = indices >= 0
keep &= indices < bins
indices = indices[keep]
if tmp_w is not None:
    tmp_w = tmp_w[keep]
```

For float data and (I think) for integer data, this new …
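Standing alone, the re-selection idea above might be sketched like this (the helper name and signature are mine for illustration, not numpy's actual code):

```python
import numpy as np

# Sketch of the proposed guard: after bin indices have been computed
# from (possibly upcast) data, drop any index that fell out of range so
# that np.bincount never sees a negative value.  Illustrative only.
def select_in_range(indices, n_bins, weights=None):
    keep = (indices >= 0) & (indices < n_bins)
    indices = indices[keep]
    if weights is not None:
        weights = weights[keep]
    return indices, weights
```

For example, `select_in_range(np.array([-1, 0, 2, 3]), 3)` keeps only the in-range indices 0 and 2, discarding the matching weights alongside them.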
This is a duplicate of multiple issues, and has a candidate fix at #9189
I think the problem comes down to the fact that the bounds are … The code in question has moved to here: lines 594 to 634 in 0953c4d.
Thanks Eric, I agree with the assessment about down- vs up-casting values. Glad there is a fix in the works.
Hmm, this is rather alarming (#10322):

```python
>>> np.float32([1.0]) >= 1.00000001
array([ True])
>>> np.float32(1.0) >= 1.00000001
False
```
Scary, right? Numpy or Python apparently compares a … This is what I think @mhvk was referring to by "I think the real reason this goes wrong is that the array comparison with a numpy scalar is not properly done (which is actually a bigger bug...)" in #9189. Sounds like you want to close this issue and discuss in #10322. That works for me.
Let's call this a duplicate of #9189
Fixes numpy#8123, closes numpy#9189, fixes numpy#10319. This is a workaround for numpy#10322, not a fix for it. Adds tests for cases where bounds are more precise than the data, which led to inconsistencies in the optimized path.
I am passing a very benign array of `float32` values to `np.histogram(...)`, which raises a `ValueError: The first argument of bincount must be non-negative`. The values contain no NaN or infinite values. I think this is a bug, due to incorrect handling of single-precision values in the histogram function: specifically, values that are equal to the bin limits when compared as single-precision but fall outside them when compared as doubles.

First, a quick test code that works for me with 64-bit floats but raises the error with 32-bit floats on the last line:
The full IPython output (not reproduced here) shows that `a32[0]` is equal to the lower bin limit or not, depending on whether we compare as float32 or as float.

Looking into `function_base.py`, we can see the problem. Up to line 753, we are discarding data that fall outside the given histogram limits. At lines 748 and 751, the array `tmp_a` is still whatever type the user passed in (here, float32), and the decision whether to keep data is made *based on the original float32 values*. But at line 755, `tmp_a` is replaced with a (double-precision) float array. When the single-precision values are equal to the lowest bin edge in single-precision comparisons but turn out to be just below it in double-precision comparisons, we get this error, erroneously.

I could update and make a PR, but I'm completely new to numpy and don't know what approach you'd like best. Some ideas:
1. Force values in the `indices` array to be in range. This seems to be roughly what lines 760-770 do, so I'm not clear how it fails in the case I've raised. I think what we need is an analogue of line 761 but for the lower edge, perhaps `indices[indices < 0] = 0`. This behavior means that we bin float32 data with float32 comparisons, which might or might not be what we want.
2. Convert the input array to `float` before deciding which values to keep. This could be a performance problem when most values fall outside the histogram limits, because you'd be converting all values. (The current approach only converts the ones that will be within limits, except that "within limits" is not quite correct for float32 values.)
3. Replace the `tmp_a.astype(float)` (line 754) so that we convert not to `float` but to the input type. I think this would solve the problem for all floating-point inputs but would likely create a new problem for integer inputs; it might require handling floats and ints in separate branches.

I'd say that we want to go with the first idea, but I am hoping an expert can weigh in.
I am using numpy 1.13.1 on Mac OS 10.12.6 through the MacPorts system, with Python 2.7.14 and IPython 5.4.0.