8000 Bug in histogram: incorrectly handles float32 inputs just below lower bin limit · Issue #10319 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content
Bug in histogram: incorrectly handles float32 inputs just below lower bin limit #10319
Closed
@joefowler

Description

@joefowler

I am passing a very benign array of float32 values to np.histogram(...) which raises a ValueError: The first argument of bincount must be non-negative. The values contain no NaN or infinite values. I think this is a bug, and it is due to incorrect handling of single-precision values in the histogram function. Specifically, values that are equal to the bin limits when compared as single-precision but are outside them when compared as doubles.

First, a quick test code that works for me with 64-bit floats but raises the error with 32-bit floats on the last line:

import numpy as np
a64=np.array([25552.234, 26000])

# Make a histogram from these two values. Put the lower bin limit just above the lower value.
bin_limits = (a64[0]+0.0005, 26200)
np.histogram(a64, 10, bin_limits)

# That worked. But repeat the exercise with the data converted to float32...
a32=a64.astype(np.float32)

# ... The following raises a ValueError:
np.histogram(a32, 10, bin_limits)

The full output in iPython looks like this:

In [29]: a64=np.array([25552.234, 26000])

In [30]: bin_limits = (a64[0]+0.0005, 26200)

In [31]: np.histogram(a64, 10, bin_limits)
Out[31]: 
(array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0]),
 array([ 25552.2345 ,  25617.01105,  25681.7876 ,  25746.56415,
         25811.3407 ,  25876.11725,  25940.8938 ,  26005.67035,
         26070.4469 ,  26135.22345,  26200.     ]))

In [32]: a32=a64.astype(np.float32)

In [33]: np.histogram(a32, 10, bin_limits)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-0a3f12cea374> in <module>()
----> 1 np.histogram(a32, 10, bin_limits)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    778             else:
    779                 n += np.bincount(indices, weights=tmp_w,
--> 780                                  minlength=bins).astype(ntype)
    781 
    782         # Rename the bin edges for return.

ValueError: The first argument of bincount must be non-negative

In [34]: np.version.version
Out[34]: '1.13.3'

Notice that a32[0] is equal to the lower bin limit or not, depending on whether we compare as float32 or as float:

In [37]: a32==bin_limits[0]
Out[37]: array([ True, False], dtype=bool)

In [38]: a32[0]==bin_limits[0]
Out[38]: False

Looking into function_base.py, we can see the problem:

   740          for i in arange(0, len(a), BLOCK):
   741              tmp_a = a[i:i+BLOCK]
   742              if weights is None:
   743                  tmp_w = None
   744              else:
   745                  tmp_w = weights[i:i + BLOCK]
   746  
   747              # Only include values in the right range
   748              keep = (tmp_a >= mn)
   749              keep &= (tmp_a <= mx)
   750              if not np.lo
75C5
gical_and.reduce(keep):
   751                  tmp_a = tmp_a[keep]
   752                  if tmp_w is not None:
   753                      tmp_w = tmp_w[keep]
   754              tmp_a_data = tmp_a.astype(float)
   755              tmp_a = tmp_a_data - mn
   756              tmp_a *= norm
   757  
   758              # Compute the bin indices, and for values that lie exactly on mx we
   759              # need to subtract one
   760              indices = tmp_a.astype(np.intp)
   761              indices[indices == bins] -= 1
   762  
   763              # The index computation is not guaranteed to give exactly
   764              # consistent results within ~1 ULP of the bin edges.
   765              decrement = tmp_a_data < bin_edges[indices]
   766              indices[decrement] -= 1
   767              # The last bin includes the right edge. The other bins do not.
   768              increment = ((tmp_a_data >= bin_edges[indices + 1])
   769                           & (indices != bins - 1))
   770              indices[increment] += 1
   771  
   772              # We now compute the histogram using bincount
   773              if ntype.kind == 'c':
   774                  n.real += np.bincount(indices, weights=tmp_w.real,
   775                                        minlength=bins)
   776                  n.imag += np.bincount(indices, weights=tmp_w.imag,
   777                                        minlength=bins)
   778              else:
   779                  n += np.bincount(indices, weights=tmp_w,
   780                                   minlength=bins).astype(ntype)
   781  

Up to line 753, we are discarding data that fall outside the given histogram limits. At lines 748 and 751, array tmp_a is still whatever type the user passes in (here, float32), and the decision is made whether to keep data *based on the original float32 values*. But in line 755, the tmp_a` is replaced with a (double-precision) float array. When the single-precision values are equal to the lowest bin edge in single-precision comparisons but then turn out to be just below the lowest edge in double-precision comparisons, then we get this error, erroneously.

I could update and make a PR, but I'm completely new to numpy and don't know what approach you'd like best. Some ideas:

  1. You could just force values in the indices array to be in-range. This seems to be sort of what lines 760-770 do, so I'm not clear how it fails in the case I've raised. I think what we need is an analogue to line 761 but for the lower edge. Perhaps indices[indices < 0] = 0 ? This behavior means that we bin float32 data with float32 comparisons, which might or might not be what we want.

  2. Convert the input array to float before deciding which values to keep. This could be a performance problem in cases where most values will be outside the histogram limits, because you'd be converting all values. (The current approach only converts the ones that will be within limits, except that "within limits" is not quite correct for float32 values.)

  3. Replace the tmp_a.astype(float) (line 754) so that we convert not to float but to the input type. I think this would solve the problem for all floating-point inputs but likely create some new problem for integer inputs. This might require that handling floats and ints differently with separate branches.

I'd say that we want to go with the first idea but am hoping an expert can weigh in.

I am using with numpy 1.13.1 on Mac OS 10.12.6 through the Macports system, with Python 2.7.14 and IPython 5.4.0.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0