Description
I am passing a very benign array of float32
values to np.histogram(...)
which raises a ValueError: The first argument of bincount must be non-negative
. The values contain no NaN or infinite values. I think this is a bug, and it is due to incorrect handling of single-precision values in the histogram function. Specifically, values that are equal to the bin limits when compared as single-precision but are outside them when compared as doubles.
First, a quick test code that works for me with 64-bit floats but raises the error with 32-bit floats on the last line:
import numpy as np
a64=np.array([25552.234, 26000])
# Make a histogram from these two values. Put the lower bin limit just above the lower value.
bin_limits = (a64[0]+0.0005, 26200)
np.histogram(a64, 10, bin_limits)
# That worked. But repeat the exercise with the data converted to float32...
a32=a64.astype(np.float32)
# ... The following raises a ValueError:
np.histogram(a32, 10, bin_limits)
The full output in iPython looks like this:
In [29]: a64=np.array([25552.234, 26000])
In [30]: bin_limits = (a64[0]+0.0005, 26200)
In [31]: np.histogram(a64, 10, bin_limits)
Out[31]:
(array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
array([ 25552.2345 , 25617.01105, 25681.7876 , 25746.56415,
25811.3407 , 25876.11725, 25940.8938 , 26005.67035,
26070.4469 , 26135.22345, 26200. ]))
In [32]: a32=a64.astype(np.float32)
In [33]: np.histogram(a32, 10, bin_limits)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-0a3f12cea374> in <module>()
----> 1 np.histogram(a32, 10, bin_limits)
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
778 else:
779 n += np.bincount(indices, weights=tmp_w,
--> 780 minlength=bins).astype(ntype)
781
782 # Rename the bin edges for return.
ValueError: The first argument of bincount must be non-negative
In [34]: np.version.version
Out[34]: '1.13.3'
Notice that a32[0]
is equal to the lower bin limit or not, depending on whether we compare as float32 or as float:
In [37]: a32==bin_limits[0]
Out[37]: array([ True, False], dtype=bool)
In [38]: a32[0]==bin_limits[0]
Out[38]: False
Looking into function_base.py
, we can see the problem:
740 for i in arange(0, len(a), BLOCK):
741 tmp_a = a[i:i+BLOCK]
742 if weights is None:
743 tmp_w = None
744 else:
745 tmp_w = weights[i:i + BLOCK]
746
747 # Only include values in the right range
748 keep = (tmp_a >= mn)
749 keep &= (tmp_a <= mx)
750 if not np.lo
75C5
gical_and.reduce(keep):
751 tmp_a = tmp_a[keep]
752 if tmp_w is not None:
753 tmp_w = tmp_w[keep]
754 tmp_a_data = tmp_a.astype(float)
755 tmp_a = tmp_a_data - mn
756 tmp_a *= norm
757
758 # Compute the bin indices, and for values that lie exactly on mx we
759 # need to subtract one
760 indices = tmp_a.astype(np.intp)
761 indices[indices == bins] -= 1
762
763 # The index computation is not guaranteed to give exactly
764 # consistent results within ~1 ULP of the bin edges.
765 decrement = tmp_a_data < bin_edges[indices]
766 indices[decrement] -= 1
767 # The last bin includes the right edge. The other bins do not.
768 increment = ((tmp_a_data >= bin_edges[indices + 1])
769 & (indices != bins - 1))
770 indices[increment] += 1
771
772 # We now compute the histogram using bincount
773 if ntype.kind == 'c':
774 n.real += np.bincount(indices, weights=tmp_w.real,
775 minlength=bins)
776 n.imag += np.bincount(indices, weights=tmp_w.imag,
777 minlength=bins)
778 else:
779 n += np.bincount(indices, weights=tmp_w,
780 minlength=bins).astype(ntype)
781
Up to line 753, we are discarding data that fall outside the given histogram limits. At lines 748 and 751, array tmp_a
is still whatever type the user passes in (here, float32), and the decision is made whether to keep data *based on the original float32 values*. But in line 755, the
tmp_a` is replaced with a (double-precision) float array. When the single-precision values are equal to the lowest bin edge in single-precision comparisons but then turn out to be just below the lowest edge in double-precision comparisons, then we get this error, erroneously.
I could update and make a PR, but I'm completely new to numpy and don't know what approach you'd like best. Some ideas:
-
You could just force values in the
indices
array to be in-range. This seems to be sort of what lines 760-770 do, so I'm not clear how it fails in the case I've raised. I think what we need is an analogue to line 761 but for the lower edge. Perhapsindices[indices < 0] = 0
? This behavior means that we bin float32 data with float32 comparisons, which might or might not be what we want. -
Convert the input array to
float
before deciding which values to keep. This could be a performance problem in cases where most values will be outside the histogram limits, because you'd be converting all values. (The current approach only converts the ones that will be within limits, except that "within limits" is not quite correct for float32 values.) -
Replace the
tmp_a.astype(float)
(line 754) so that we convert not tofloat
but to the input type. I think this would solve the problem for all floating-point inputs but likely create some new problem for integer inputs. This might require that handling floats and ints differently with separate branches.
I'd say that we want to go with the first idea but am hoping an expert can weigh in.
I am using with numpy 1.13.1 on Mac OS 10.12.6 through the Macports system, with Python 2.7.14 and IPython 5.4.0.