BUG: fixes #8123 Histogram precision issue due to higher precision in internal mn, mx by mherkazandjian · Pull Request #9189 · numpy/numpy · GitHub

BUG: fixes #8123 Histogram precision issue due to higher precision in internal mn, mx #9189


Closed

Conversation

mherkazandjian

When the input data has a lower precision than the specified range, an error is raised in bincount.
This was reported in issue #8123.

The problem is caused by the higher-precision internal mn, mx. In this pull request, I cast mn and mx to the same type as the input data.

(I messed up the previous pull request and could not reopen it.)

@eric-wieser
Member

This needs a test case for the thing you're trying to fix. The commit message should start with BUG: and include something like fixes #8123 to make it auto-close the issue.

Also, you'll need to get the other tests to pass too ;). Thanks for solving the rebasing issues.

@mherkazandjian
Author

OK, thanks. I'll make the changes. The tests are failing locally too; I just wanted to push so the fix is on GitHub.

@mherkazandjian mherkazandjian changed the title Histogram precision bugfix due to higher precision in internal mn, mx BUG: fixes #8123 Histogram precision issue due to higher precision in internal mn, mx May 30, 2017
@mherkazandjian
Author

I think I solved the problem. Finding a clean test case was a bit tricky, since the data set where I encountered this problem was about 2 GB and I had to narrow down the offending number. It turns out the problem can be reproduced with just one input value.
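
For reference, a minimal sketch of that kind of reproduction; the variable names and the exact shift are illustrative (borrowed from the scratch test discussed below), not necessarily the final test case:

import numpy as np

# a shift far below float32 resolution (~1.2e-7) but representable in float64
tiny_shift = 1e-10
data = np.array([1.0], dtype=np.float32)
# the lower edge 1.0 + tiny_shift rounds back to 1.0 when cast to float32;
# on affected versions this mismatch tripped up the optimized histogram path
# and surfaced as the bincount error described above
hist, edges = np.histogram(data, bins=1, range=(1.0 + tiny_shift, 2.0))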

@mhvk
Contributor
mhvk commented May 31, 2017

I don't like that with this solution the result will depend on the dtype of the value even for a perfectly well-represented value like 1: for float32 it is counted as inside the bin, but for float64 and int you get 0. I think in your test case, the value should simply be counted as being out of range.

I think the real reason this goes wrong is that the array comparison with a numpy scalar is not properly done (which is actually a bigger bug...):

# not good: float32 array with float64 array scalar (or regular scalar)
np.array([1.], dtype=np.float32) >= np.array(1.00000001, dtype=np.float64)
# array([ True], dtype=bool)

# not good: float32 array with python float
np.array([1.], dtype=np.float32) >= 1.00000001
# array([ True], dtype=bool)

# OK: float32 array with float64 array
np.array([1.], dtype=np.float32) >= np.array([1.00000001], dtype=np.float64)
# array([False], dtype=bool)

# also OK: (array) scalar with (array) scalar
np.array(1., dtype=np.float32) >= np.array(1.00000001, dtype=np.float64)
# False

One solution here is to always cast mn, mx to float64 arrays with shape (1,). But perhaps we should fix the real problem... see gh-9194.
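
A rough sketch of that workaround (the helper name is made up for illustration; mn and mx stand for the internal range bounds discussed here):

import numpy as np

def _bounds_as_float64_arrays(mn, mx):
    # shape-(1,) float64 arrays force the array-with-array comparison path,
    # which promotes the float32 data correctly (third example above)
    return np.array([mn], dtype=np.float64), np.array([mx], dtype=np.float64)

a = np.array([1.0], dtype=np.float32)
mn, mx = _bounds_as_float64_arrays(1.00000001, 2.0)
keep = (a >= mn) & (a <= mx)   # array([False]): the value falls outside the range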

@mhvk
Contributor
mhvk commented May 31, 2017

P.S. I fear solving the real problem will take some discussion about what the desired behaviour for all ufuncs is, so perhaps we should go ahead and address the issue here. That would mean avoiding array scalars, or, maybe more cleanly, doing the inclusion of values in the right range after calculating the indices (i.e., removing indices that are too large or small), as sketched below.
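
A minimal sketch of that index-based approach, with hypothetical names (this is not the actual histogram code):

import numpy as np

def _count_via_indices(a, first_edge, last_edge, n_bins):
    # compute candidate bin indices first, then drop indices outside
    # [0, n_bins) instead of pre-filtering the values themselves
    norm = n_bins / (last_edge - first_edge)
    indices = np.floor((a.astype(np.float64) - first_edge) * norm).astype(np.intp)
    indices = indices[(indices >= 0) & (indices < n_bins)]
    # note: unlike np.histogram, this sketch does not treat the last edge
    # as inclusive; it only illustrates the index-filtering idea
    return np.bincount(indices, minlength=n_bins)

print(_count_via_indices(np.array([1.0], dtype=np.float32), 1.00000001, 2.0, 1))
# [0]: the single value lands below the first edge and is dropped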

@mherkazandjian
Author

Actually, I thought about that too: since 1 < 1 + tiny, the count should be zero.

Maybe cast the input data and the range (mn, mx) to the highest precision of the two?

We should keep in mind that there is also numpy.float128:

  type                   precision (machine epsilon)
  ------                 ---------------------------
  float32                1.1920929e-07
  float64                2.2204460492503131e-16
  float128               1.084202172485504434e-19
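
For reference, these values are just the machine epsilons reported by np.finfo:

import numpy as np

# float128 is the 80-bit extended type where available; np.longdouble is the
# portable spelling and falls back to float64 on platforms without it
for dt in (np.float32, np.float64, np.longdouble):
    print(dt.__name__, np.finfo(dt).eps)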

@mhvk
Contributor
mhvk commented May 31, 2017

Good point about float128 -- my own feeling would be to have as little as possible of these complications in histogram itself. Indeed, I would suggest removing those + 0.0, which seem only necessary to allow an in-place subtraction/addition of 0.5 -- instead, just write those as explicit assignments: mn = mn - 0.5, etc. And then check the indices, rather than the array, before sending them off to bincount.
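
A minimal illustration of the in-place versus explicit-assignment distinction (a 0-d integer array stands in for the bound; the values are illustrative):

import numpy as np

mn = np.array(1, dtype=np.int64)   # stand-in for an integer lower bound
# in-place update would try to cast the float64 result back to int64 and raise:
# mn -= 0.5
# explicit assignment lets normal promotion decide the result dtype instead,
# so no "+ 0.0" pre-conversion is needed:
mn = mn - 0.5                      # float64 array(0.5)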

scratch/test.py Outdated
range=np.array([1.0 + tiny_shift, 2.0]).astype('f8')
)

print('done')
Member
@eric-wieser eric-wieser Oct 19, 2017


This file should be removed

Author


Yes, this was just a temporary test (I probably committed it by mistake).

@eric-wieser eric-wieser force-pushed the histogram-precision branch 2 times, most recently from a20c42e to 89fa4bb on December 27, 2017 07:37
@eric-wieser
Member
eric-wieser commented Dec 27, 2017

I've gone ahead and rebased this - some of the fixes were already applied in #10278
@mherkazandjian, if you want to continue working on this, you should do

git fetch
git checkout histogram-precision
git reset --hard origin/histogram-precision

@@ -237,6 +237,10 @@ def _get_outer_edges(a, range):
first_edge = first_edge - 0.5
last_edge = last_edge + 0.5

if a.dtype.kind not in ['i']:
Member


Perhaps this should be np.issubdtype(a.dtype, np.inexact)?
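
Spelled out, that suggestion looks roughly like this (a here is just a stand-in for the histogram input array; the body of the branch is elided):

import numpy as np

a = np.array([1.0], dtype=np.float32)   # stand-in for the histogram input

# the suggested condition: restrict the special handling to floating-point
# (and complex) input instead of "everything that is not a signed integer"
if np.issubdtype(a.dtype, np.inexact):
    ...   # same body as in the patch above; only the condition changes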

Contributor


Add a comment here explaining why we are doing this: I think we're still ensuring that ufunc casting does not cause problems...

Member


An alternative fix might just be to promote the bounds to 0d arrays

@eric-wieser
Member

@mhvk: Your comments were addressed in #10278, but it seems that the extra patch here (or similar) is needed too.

@mhvk
Contributor
mhvk commented Dec 29, 2017

I think the rebased PR looks good, modulo the small request for adding a comment on why the dtype cast is done.

eric-wieser added a commit to eric-wieser/numpy that referenced this pull request Feb 2, 2018
Fixes numpy#8123, closes numpy#9189, fixes numpy#10319

This is a workaround to numpy#10322, not a fix for it.

Adds tests for cases where bounds are more precise than the data, which led to inconsistencies in the optimized path.
hanjohn pushed a commit to hanjohn/numpy that referenced this pull request Feb 15, 2018