Bug in histogram: incorrectly handles float32 inputs just below lower bin limit · Issue #10319 · numpy/numpy

Closed
joefowler opened this issue Jan 3, 2018 · 7 comments · Fixed by #10324

Comments

@joefowler

I am passing a very benign array of float32 values to np.histogram(...), which raises ValueError: The first argument of bincount must be non-negative. The values contain no NaN or infinite entries. I think this is a bug caused by incorrect handling of single-precision values in the histogram function: specifically, of values that compare equal to the bin limits in single precision but fall outside them when compared as doubles.

First, a quick test code that works for me with 64-bit floats but raises the error with 32-bit floats on the last line:

import numpy as np
a64=np.array([25552.234, 26000])

# Make a histogram from these two values. Put the lower bin limit just above the lower value.
bin_limits = (a64[0]+0.0005, 26200)
np.histogram(a64, 10, bin_limits)

# That worked. But repeat the exercise with the data converted to float32...
a32=a64.astype(np.float32)

# ... The following raises a ValueError:
np.histogram(a32, 10, bin_limits)

The full output in iPython looks like this:

In [29]: a64=np.array([25552.234, 26000])

In [30]: bin_limits = (a64[0]+0.0005, 26200)

In [31]: np.histogram(a64, 10, bin_limits)
Out[31]: 
(array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
 array([ 25552.2345 ,  25617.01105,  25681.7876 ,  25746.56415,
         25811.3407 ,  25876.11725,  25940.8938 ,  26005.67035,
         26070.4469 ,  26135.22345,  26200.     ]))

In [32]: a32=a64.astype(np.float32)

In [33]: np.histogram(a32, 10, bin_limits)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-0a3f12cea374> in <module>()
----> 1 np.histogram(a32, 10, bin_limits)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    778             else:
    779                 n += np.bincount(indices, weights=tmp_w,
--> 780                                  minlength=bins).astype(ntype)
    781 
    782         # Rename the bin edges for return.

ValueError: The first argument of bincount must be non-negative

In [34]: np.version.version
Out[34]: '1.13.3'

Notice that a32[0] compares equal to the lower bin limit or not, depending on whether the comparison is done in float32 or in double precision:

In [37]: a32==bin_limits[0]
Out[37]: array([ True, False], dtype=bool)

In [38]: a32[0]==bin_limits[0]
Out[38]: False
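The discrepancy comes from rounding rather than from the comparison operator itself. A small sketch, using the same values as above but with the casts made explicit, shows both behaviors:

```python
import numpy as np

# The float32 nearest to 25552.234 is 25552.234375, which is above
# 25552.234 but still below the lower bin limit of 25552.2345.
a32 = np.float32(25552.234)
lower_limit = 25552.234 + 0.0005  # float64: 25552.2345

# Upcast the float32 value to float64: it falls below the limit...
print(float(a32) < lower_limit)         # True
# ...but downcast the limit to float32 and the two compare equal.
print(a32 == np.float32(lower_limit))   # True
```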

Looking into function_base.py, we can see the problem:

   740          for i in arange(0, len(a), BLOCK):
   741              tmp_a = a[i:i+BLOCK]
   742              if weights is None:
   743                  tmp_w = None
   744              else:
   745                  tmp_w = weights[i:i + BLOCK]
   746  
   747              # Only include values in the right range
   748              keep = (tmp_a >= mn)
   749              keep &= (tmp_a <= mx)
   750              if not np.logical_and.reduce(keep):
   751                  tmp_a = tmp_a[keep]
   752                  if tmp_w is not None:
   753                      tmp_w = tmp_w[keep]
   754              tmp_a_data = tmp_a.astype(float)
   755              tmp_a = tmp_a_data - mn
   756              tmp_a *= norm
   757  
   758              # Compute the bin indices, and for values that lie exactly on mx we
   759              # need to subtract one
   760              indices = tmp_a.astype(np.intp)
   761              indices[indices == bins] -= 1
   762  
   763              # The index computation is not guaranteed to give exactly
   764              # consistent results within ~1 ULP of the bin edges.
   765              decrement = tmp_a_data < bin_edges[indices]
   766              indices[decrement] -= 1
   767              # The last bin includes the right edge. The other bins do not.
   768              increment = ((tmp_a_data >= bin_edges[indices + 1])
   769                           & (indices != bins - 1))
   770              indices[increment] += 1
   771  
   772              # We now compute the histogram using bincount
   773              if ntype.kind == 'c':
   774                  n.real += np.bincount(indices, weights=tmp_w.real,
   775                                        minlength=bins)
   776                  n.imag += np.bincount(indices, weights=tmp_w.imag,
   777                                        minlength=bins)
   778              else:
   779                  n += np.bincount(indices, weights=tmp_w,
   780                                   minlength=bins).astype(ntype)
   781  

Up to line 753, we are discarding data that fall outside the given histogram limits. At lines 748 and 751, the array tmp_a still has whatever dtype the user passed in (here, float32), so the decision about which data to keep is made based on the original float32 values. But at lines 754-755, tmp_a is replaced with a double-precision float array. When a single-precision value compares equal to the lowest bin edge in float32 but turns out to be just below that edge in double precision, we get this error erroneously.
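To make the failure mode concrete, here is a small self-contained sketch (not the numpy source, just an assumed simplification of the steps above) that reproduces the negative bin index:

```python
import numpy as np

a32 = np.float32([25552.234, 26000.0])   # float32 input data
mn, mx = 25552.234 + 0.0005, 26200.0     # float64 bin limits
n_bins = 10

# The selection happens in float32 (like lines 748-749), so both values pass:
keep = (a32 >= np.float32(mn)) & (a32 <= np.float32(mx))
tmp = a32[keep].astype(float)            # upcast to float64 (line 754)

# Index computation (lines 755-766, simplified):
bin_edges = np.linspace(mn, mx, n_bins + 1)
indices = ((tmp - mn) * (n_bins / (mx - mn))).astype(np.intp)
indices[tmp < bin_edges[indices]] -= 1   # ULP correction drives the index to -1

print(indices[0])                        # -1: below the lowest bin
# np.bincount(indices) would now raise:
# ValueError: The first argument of bincount must be non-negative
```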

I could update and make a PR, but I'm completely new to numpy and don't know what approach you'd like best. Some ideas:

  1. You could just force values in the indices array to be in-range. This seems to be roughly what lines 760-770 do, so I'm not clear how it fails in the case I've raised. I think what we need is an analogue to line 761 but for the lower edge; perhaps indices[indices < 0] = 0 ? This behavior means that we bin float32 data with float32 comparisons, which might or might not be what we want.

  2. Convert the input array to float before deciding which values to keep. This could be a performance problem in cases where most values will be outside the histogram limits, because you'd be converting all values. (The current approach only converts the ones that will be within limits, except that "within limits" is not quite correct for float32 values.)

  3. Replace the tmp_a.astype(float) (line 754) so that we convert not to float but to the input type. I think this would solve the problem for all floating-point inputs but would likely create new problems for integer inputs. This might require handling floats and ints in separate branches.

I'd say that we want to go with the first idea but am hoping an expert can weigh in.
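For illustration, idea 1 would amount to something like the following (a hypothetical clamp, mirroring the existing line-761 treatment of the upper edge; the indices array here is made up):

```python
import numpy as np

n_bins = 10
# A hypothetical indices array where float32 rounding produced a -1:
indices = np.array([-1, 0, 6, 10])

indices[indices == n_bins] -= 1   # existing treatment of the right edge
indices[indices < 0] = 0          # proposed analogue for the left edge

print(indices)                    # [0 0 6 9]
counts = np.bincount(indices, minlength=n_bins)   # no longer raises
```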

I am using with numpy 1.13.1 on Mac OS 10.12.6 through the Macports system, with Python 2.7.14 and IPython 5.4.0.

@joefowler
Author

Actually, I think we want a variation on idea 1:

  1. Acknowledge that the first selection (lines 747-753) is only approximate for float32 data. Thus, do a second selection at line 770 that catches (and removes) errors made in the first, approximate selection. Something like:
# The first "keep" is only approximately correct when the input data are single-precision float32.
# To prevent bincount from raising an error, select the data again.
keep = indices >= 0
keep &= indices < bins
indices = indices[keep]
if tmp_w is not None:
    tmp_w = tmp_w[keep]

For float64 data and (I think) for integer data, this new keep will be all-True, but it will catch slight numerical problems in single-precision inputs like the one raised in this issue. For this reason, perhaps it would be best to run this block only for floating-point inputs below full (double) precision?
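As a self-contained illustration of that second selection (hypothetical values, outside the real histogram code):

```python
import numpy as np

bins = 10
indices = np.array([-1, 0, 3, 9])       # -1 from the float32 rounding issue
tmp_w = np.array([1.0, 2.0, 3.0, 4.0])  # matching weights

keep = (indices >= 0) & (indices < bins)  # drop out-of-range indices
indices = indices[keep]
tmp_w = tmp_w[keep]

counts = np.bincount(indices, weights=tmp_w, minlength=bins)
print(counts[:4])                       # [2. 0. 0. 3.]
```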

@eric-wieser
Member

This is a duplicate of multiple issues, and has a candidate fix at #9189

@eric-wieser
Member
eric-wieser commented Jan 3, 2018

I think the problem comes down to the fact that the bounds are float64, but the values are float32. Clearly we're sometimes down-casting the bounds, and sometimes upcasting the values, which is inconsistent.

The code in question has moved to here:

for i in np.arange(0, len(a), BLOCK):
    tmp_a = a[i:i+BLOCK]
    if weights is None:
        tmp_w = None
    else:
        tmp_w = weights[i:i + BLOCK]

    # Only include values in the right range
    keep = (tmp_a >= first_edge)
    keep &= (tmp_a <= last_edge)
    if not np.logical_and.reduce(keep):
        tmp_a = tmp_a[keep]
        if tmp_w is not None:
            tmp_w = tmp_w[keep]
    tmp_a_data = tmp_a.astype(float)
    tmp_a = tmp_a_data - first_edge
    tmp_a *= norm

    # Compute the bin indices, and for values that lie exactly on
    # last_edge we need to subtract one
    indices = tmp_a.astype(np.intp)
    indices[indices == n_equal_bins] -= 1

    # The index computation is not guaranteed to give exactly
    # consistent results within ~1 ULP of the bin edges.
    decrement = tmp_a_data < bin_edges[indices]
    indices[decrement] -= 1
    # The last bin includes the right edge. The other bins do not.
    increment = ((tmp_a_data >= bin_edges[indices + 1])
                 & (indices != n_equal_bins - 1))
    indices[increment] += 1

    # We now compute the histogram using bincount
    if ntype.kind == 'c':
        n.real += np.bincount(indices, weights=tmp_w.real,
                              minlength=n_equal_bins)
        n.imag += np.bincount(indices, weights=tmp_w.imag,
                              minlength=n_equal_bins)
    else:
        n += np.bincount(indices, weights=tmp_w,
                         minlength=n_equal_bins).astype(ntype)

@joefowler
Author

Thanks Eric, I agree with the assessment about down- vs up-casting values. Glad there is a fix in the works.

@eric-wieser
Member
eric-wieser commented Jan 4, 2018

Hmm, this is rather alarming (#10322)

>>> np.float32([1.0]) >= 1.00000001
array([ True])
>>> np.float32(1.0) >= 1.00000001
False

@joefowler
Author

Scary, right? NumPy apparently compares an np.float32 value against a Python float in either single or double precision, depending on whether the former is a bare value or an element of an np.ndarray.

This is what I think @mhvk was referring to by "I think the real reason this goes wrong is that the array comparison with a numpy scalar is not properly done (which is actually a bigger bug...)" in #9189.
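A way to see both behaviors that does not depend on numpy's promotion rules is to make the casts explicit (this only illustrates the rounding, not which cast numpy actually chooses):

```python
import numpy as np

x = np.float32(1.0)
y = 1.00000001   # not representable in float32; rounds to exactly 1.0

print(np.float32(y) >= x)   # True  -- the "array" path: y down-cast to float32
print(float(x) >= y)        # False -- the "scalar" path: x up-cast to float64
```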

Sounds like you want to close this issue and discuss in #10322. That works for me.

@eric-wieser
Member

Let's call this a duplicate of #9189

eric-wieser added a commit to eric-wieser/numpy that referenced this issue Feb 2, 2018
Fixes numpy#8123, closes numpy#9189, fixes numpy#10319

This is a workaround to numpy#10322, not a fix for it.

Adds tests for cases where bounds are more precise than the data, which led to inconsistencies in the optimized path.
hanjohn pushed a commit to hanjohn/numpy that referenced this issue Feb 15, 2018
Fixes numpy#8123, closes numpy#9189, fixes numpy#10319

This is a workaround to numpy#10322, not a fix for it.

Adds tests for cases where bounds are more precise than the data, which led to inconsistencies in the optimized path.