8000 ENH: Added proper handling of nans for numpy.lib.function_base.median by empeeu · Pull Request #4460 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: Added proper handling of nans for numpy.lib.function_base.median #4460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

empeeu
Copy link
Contributor
@empeeu empeeu commented Mar 8, 2014

to close issue #586.

Also added unit test.

This was previously PR #4287 , but I botched the rebase, so I just started over.

Also related was PR #3908 and #4307

@@ -2773,33 +2773,31 @@ def median(a, axis=None, out=None, overwrite_input=False):
if axis is not None and axis >= a.ndim:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normal handling of axis is:

if axis is not None:
    if axis < a.ndim:
        axis += a.ndim
    if axis < 0 or axis >= a.ndim:
        raise ValueError(...)

@empeeu
Copy link
Contributor Author
empeeu commented Mar 8, 2014

@jaimefrio : I get your gist, I have to handle the cases where axis < 0. I think the conditional on the second 'if' should be:

if axis < 0:
    axis += a.ndim

And I should raise a ValueError instead of an IndexError?

@jaimefrio
Copy link
Member

Oops! Yes, < 0, my bad.

And a ValueError is what most numpy functions raise when an axis is out of bounds, see e.g. partition:

>>> np.partition([1, 2, 3], [1], axis=2)
...
ValueError: axis(=2) out of bounds

I know that in @seberg's rewrite of indexing he changed some ValueErrors to IndexErrors, but I don't think that applies to axis.

axis += a.ndim
if axis < 0 or axis >= a.ndim:
raise IndexError(
"axis %d out of bounds (%d)" % (axis, a.ndim))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to mention this earlier, sorry for the noise... If the user passes in a negative axis that is out of bounds, the error message will be confusing, because you will be reporting the value after adding a.ndim, not the value the user entered. You probably want to use another variable to store the original value. There has to be some example of this in the Python code base, but I can't seem to find any. Here is how partition does it in the C code.

@charris
Copy link
Member
charris commented Mar 11, 2014

@jaimefrio Do you think this is ready?

@juliantaylor
Copy link
Contributor

I still think the extended axis pr should be merged first, but as that seems to be going nowhere I guess I'll add the mean cludge back in to not block this any longer

this pr still needs a warning if a nan is encountered
also I wonder if we can change it to output nan if one is found, its a quite significant change from returning almost correct result to nan for data with only a very small fraction of nans in it.

out = np.asanyarray(mean(part[indexer], axis=axis, out=out))
out[ids] = np.nan
return out
else: #if there are no nans
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here would look better on the next line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, moved comment to next line.

@juliantaylor
Copy link
Contributor

needs a rebase now that extended axis support is merged

axis += a.ndim
if axis < 0 or axis >= a.ndim:
raise ValueError(
"axis %d out of bounds (%d)" % (axis_orig, a.ndim))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is now redundant, _ureduce checks the axis input

@juliantaylor
Copy link
Contributor

I'm not sure we should return nan if there is a nan in the array, I can imagine it could break a couple applications.
maybe it is better to just emit a warning and only change to nan return value after one or two releases

@empeeu
Copy link
Contributor Author
empeeu commented Mar 16, 2014

I agree that it will probably break some applications, which is why I think we'd need a nanmedian (#4307) to be released at the same time. That way, when applications break you could either switch to nanmedian, if that's the behaviour you want, or you'll be alerted to the fact that you have nan's coming from somewhere upstream in the code. I also think it SHOULD break a few applications, because it's usually worse to get almost the right answer instead of a completely wrong answer. Very hard to track down bugs where you have inconsistent performance. (Imagine sometime having a 1000 element array with 1 nan, versus 100 nans, but always getting a median that's a real number).

@njsmith
Copy link
Member
njsmith commented Mar 16, 2014

I guess the generic guideline here is that if this is a "functionality
change" then we need a warning cycle, and if it's a "bug fix" then we just
do it. I can see the argument for this being borderline between these,
because previously median did always return a non-nan value (in almost
all cases), so at least it was consistent. But given that it was
consistently returning the wrong value (not even the same as what
nanmedian would return!), I'm inclined to call it a bug fix and get the fix
out there as fast as possible.

On Sun, Mar 16, 2014 at 1:35 PM, Julian Taylor notifications@github.comwrote:

I'm not sure we should return nan if there is a nan in the array, I can
imagine it could break a couple applications.
maybe it is better to just emit a warning and only change to nan return
value after one or two releases


Reply to this email directly or view it on GitHubhttps://github.com//pull/4460#issuecomment-37757201
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@charris
Copy link
Member
charris commented Mar 23, 2014

@empeeu Universal test failure

======================================================================

ERROR: test_basic (test_function_base.TestMedian)

----------------------------------------------------------------------

Traceback (most recent call last):

File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/lib/tests/test_function_base.py", line 1866, in test_basic

assert_equal(np.median(a).ndim, 0)

AttributeError: 'float' object has no attribute 'ndim'

Also, could you squash the commits down to one, maybe two?

@empeeu
Copy link
Contributor Author
empeeu commented Mar 23, 2014

Hi yes, sorry, I haven't had time to work on this. I don't know why it's suddenly failing, hopefully I'll get to it soon.

@juliantaylor
Copy link
Contributor

I added the tests from my cleanup commit to master, applying the rest from juliantaylor@e8f2acb should fix it

@empeeu
Copy link
Contributor Author
empeeu commented Mar 28, 2014

Rolled all changes into one commit. Wow, git is great! Goodbye svn. :)

@charris
Copy link
Member
charris commented Mar 31, 2014

@juliantaylor Is this ready?

@juliantaylor
Copy link
Contributor

I guess, it still needs some cleanup and release notes but these can be added in another PR

if np.issubdtype(a.dtype, np.inexact):
if part.ndim <= 1:
if np.isnan(part[-1]):
return a.dtype.type(np.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array scalars can be used for out and have dimension 0.

@charris
Copy link
Member
charris commented Apr 1, 2014

ISTM that if _ureduce always swapped the axis then both it and the functions it called would be made simpler since no axis keyword would be needed for the call.

@empeeu
Copy link
Contributor Author
empeeu commented Apr 2, 2014

I'm unfortunately a bit low on time at the moment. If you feel like getting this done sooner, feel free to pull in and modify this PR. If you still have major changes in mind, just let me know and I can get to it eventually.

@juliantaylor
Copy link
Contributor

I pushed a cleaned up version to this branch:
https://github.com/juliantaylor/numpy/tree/empeeu-median-nan

though we also need the same for percentile and I don't see myself doing that today so I guess we will have to put it off to beta2 or 1.10 (or never).

@juliantaylor
Copy link
Contributor

unless we want to merge it without "fixing" also percentile?

@empeeu
Copy link
Contributor Author
empeeu commented Oct 3, 2014

Sorry for falling off the face of the earth. @juliantaylor thanks for cleaning this up. I see that nanmedian got merged successfully. Is percentile currently holding this up?

@charris
Copy link
Member
charris commented Oct 10, 2014

@juliantaylor What is the status of this?

@juliantaylor
Copy link
Contributor

if we want it the cleaned up branch should be merged. Though I do wonder if we really need it now that we have nanmedian and nanpercentile.
I think it would a problem to ignore nans in median but not percentile as it would break the equivalency of percentile(50) and median. So percentile should get the same change which is probably quite tedious to do.

@charris
Copy link
Member
charris commented Oct 10, 2014

If the needed functionality is provided by nanmedian and nanpercentile I don't see the need for this, but I'm not clear on what it adds. I can see raising a warning if a nan is found.

@empeeu
Copy link
Contributor Author
empeeu commented Oct 10, 2014

I continue to consider the current behavior a bug. If I take the median of (1, 1, 2, 3, nan) I should not get 2 (current implementation returns 2). The safest behavior is to return nan.

Perhaps it's cleaner if I just close this PR and let @juliantaylor create and merge a pull request? Otherwise I'm happy to merge @juliantaylor's modifications. Either way, just let me know.

Throwing a similar error in percentile I think is a good idea. I suspect median is more widely used than percentile, but the change is warranted. I can see about doing this.

@njsmith
Copy link
Member
njsmith commented Oct 10, 2014

IMO we do want this. This is mostly orthogonal to nanmedian -- nanmedian is
a way to compute the median while ignoring NaN values, this is about the
fact that currently median flat out returns the wrong answer when NaNs are
present.

I do agree that percentile needs a similar fix, but it's a bit weird to
make fixing that bug a precondition to fixing this bug...

On Fri, Oct 10, 2014 at 10:13 PM, Matt Ueckermann notifications@github.com
wrote:

I continue to consider the current behavior a bug. If I take the median of
(1, 1, 2, 3, nan) I should not get 2 (current implementation returns 2).
The safest behavior is to return nan.

Perhaps it's cleaner if I just close this PR and let @juliantaylor
https://github.com/juliantaylor create and merge a pull request?
Otherwise I'm happy to merge @juliantaylor
https://github.com/juliantaylor's modifications. Either way, just let
me know.

Throwing a similar error in percentile I think is a good idea. I suspect
median is more widely used than percentile, but the change is warranted. I
can see about doing this.


Reply to this email directly or view it on GitHub
#4460 (comment).

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@juliantaylor
Copy link
Contributor

I'd argue its no bug, nans have a well defined ordering in numpy. median gives you the 50th percentile of all values. nanmedian gives you the 50th percentiles of all valid values where you consider nans invalid values.
The issue is just that probably many people are not aware of that they should be using nanmedian.
If we really want to help these we could just remove median for floats completely and automatically change it to nanmedian. But that would also be terribly slow.

@juliantaylor
Copy link
Contributor

well it would be an option, instead of letting applications blow up by now receiving nans, compute normally check for nan and the repeat with nanmedian to give the user the result he/she probably wanted.
I like that a bit more than returning nan, but I still really don't like it.

@Nodd
Copy link
Contributor
Nodd commented Oct 10, 2014

If I have a bad value in an array, I don't expect median to hide it silently. In that case I don't care if nans are ordered in numpy. Min and max both return nan even if nans are ordered. Also I expect to get a similar result between mean and median.

@njsmith
Copy link
Member
njsmith commented Oct 10, 2014

NaNs may have a well-defined ordering in numpy (which is dubious enough),
but I can't see how this justifies violating the usual semantics of NaNs or
is something that any user would ever want. IEEE754 defines basically every
operation that involves a NaN to return NaN.

A common case where NaNs are generated is when evaluating an indeterminate
expression like lim_x lim_y x/y, as x and y both go to 0. This could be
literally anything from -Inf to +Inf, so 0/0 returns NaN to represent that
indeterminacy. Right now median treats such values as if they are always
+Inf. This is just not useful.

On Fri, Oct 10, 2014 at 10:35 PM, Julian Taylor notifications@github.com
wrote:

I'd argue its no bug, nans have a well defined ordering in numpy. median
gives you the 50th percentile of all values. nanmedian gives you the 50th
percentiles of all valid values where you consider nans invalid values.
The issue is just that probably many people are not aware of that they
should be using nanmedian.
If we really want to help these we could just remove median for floats
completely and automatically change it to nanmedian. But that would also be
terribly slow.


Reply to this email directly or view it on GitHub
#4460 (comment).

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@seberg
Copy link
Member
seberg commented Oct 11, 2014

returning NaN for the median/percentile would make most sense to me as well, would that be very expensive to check for?

@juliantaylor
Copy link
Contributor

I guess I can live with the all operations involving nan return nan argument. The cleanup commit would need to be picked then (and updated to use 1.10 release notes)

@empeeu
Copy link
Contributor Author
empeeu commented Oct 13, 2014

Ok. I'll pull in your cleaned up version @juliantaylor, and I'll update to use the 1.10 release notes. I'll get to it at some point this week.

Thanks everyone.

@empeeu
Copy link
Contributor Author
empeeu commented Oct 15, 2014

Rebased, merged in @juliantaylor 's changes (manually), and updated the release notes for 1.10.

I'll get working on percentile.

@empeeu
Copy link
Contributor Author
empeeu commented Oct 18, 2014

Can this cleaned up version be merged?

@empeeu
Copy link
Contributor Author
empeeu commented Oct 30, 2014

Added nan-checking to percentile as well https://github.com/empeeu/numpy/tree/percentile-nan . I have the changes to median in that branch as well (and I just rebased it). Would you prefer a separate PR for percentile (probably)? Or all of it together, just let me know.

@empeeu
Copy link
Contributor Author
empeeu commented Nov 15, 2014

Just checking in to make sure that the nan handling of median this is still on the radar. I'm happy to rebase again because I figure this PR is probably out of date again. I should be pretty responsive this weekend for quick change. Just let me know what y'all want and I'll get it done. :)

@shoyer
Copy link
Member
shoyer commented Jan 3, 2015

This look like a useful patch to fix buggy behavior -- are there any remaining issues for merging this for numpy 1.10? @juliantaylor?

empeeu added 2 commits April 7, 2015 19:51
fix output argument, add warnings, fix style and add a note in the release notes.
@empeeu
Copy link
Contributor Author
empeeu commented Apr 7, 2015

I rebased again since this was getting out of date. @jaimefrio can this be merged?

I was wondering if nan-percentile was holding this up, so I made another pull request. Let me know if I should close one of these. See #5753.

rout = a.dtype.type(np.nan)
else:
for i in range(np.count_nonzero(n.ravel())):
warnings.warn("Invalid value encountered in median",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issuing warnings in a loop seems like a bad idea, for two reasons:

  1. It would just produce a lot of noise in logs (if anyone is not filtering all but the first warning)
  2. Issuing warnings is actually very slow, so this could be a serious performance bottleneck. Consider:
In [10]: a = np.zeros((10000, 2))

In [11]: %timeit np.median(a, axis=0)
10000 loops, best of 3: 117 µs per loop

In [12]: %timeit for x in range(10000): warnings.warn("Invalid value encountered in median", RuntimeWarning)
/Users/shoyer/miniconda/envs/numpy-dev/bin/ipython:257: RuntimeWarning: Invalid value encountered in median
100 loops, best of 3: 12.3 ms per loop

@homu
Copy link
Contributor
homu commented Jun 17, 2015

☔ The latest upstream changes (presumably #5977) made this pull request unmergeable. Please resolve the merge conflicts.

@empeeu
Copy link
Contributor Author
empeeu commented Jun 23, 2015

@shoyer I agree. Moved it outside the for loop, see #5753

@homu I rebased again. Should now be able to merge automatically again on #5753.

I'm closing this PR because #5753 handles both the median-nan and percentile-nan functions and I haven't had any response to my question April 7.

@empeeu empeeu closed this Jun 23, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants
0