-
-
Notifications
You must be signed in to change notification settings - Fork 11k
ENH: Added proper handling of nans for numpy.lib.function_base.median #4460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -2773,33 +2773,31 @@ def median(a, axis=None, out=None, overwrite_input=False): | |||
if axis is not None and axis >= a.ndim: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The normal handling of axis
is:
if axis is not None:
if axis < a.ndim:
axis += a.ndim
if axis < 0 or axis >= a.ndim:
raise ValueError(...)
@jaimefrio : I get your gist, I have to handle the cases where axis < 0. I think the conditional on the second 'if' should be:
And I should raise a ValueError instead of an IndexError? |
Oops! Yes, And a
I know that in @seberg's rewrite of indexing he changed some ValueErrors to IndexErrors, but I don't think that applies to axis. |
axis += a.ndim | ||
if axis < 0 or axis >= a.ndim: | ||
raise IndexError( | ||
"axis %d out of bounds (%d)" % (axis, a.ndim)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot to mention this earlier, sorry for the noise... If the user passes in a negative axis
that is out of bounds, the error message will be confusing, because you will be reporting the value after adding a.ndim
, not the value the user entered. You probably want to use another variable to store the original value. There has to be some example of this in the Python code base, but I can't seem to find any. Here is how partition
does it in the C code.
@jaimefrio Do you think this is ready? |
I still think the extended axis pr should be merged first, but as that seems to be going nowhere I guess I'll add the mean cludge back in to not block this any longer this pr still needs a warning if a nan is encountered |
out = np.asanyarray(mean(part[indexer], axis=axis, out=out)) | ||
out[ids] = np.nan | ||
return out | ||
else: #if there are no nans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment here would look better on the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, moved comment to next line.
needs a rebase now that extended axis support is merged |
axis += a.ndim | ||
if axis < 0 or axis >= a.ndim: | ||
raise ValueError( | ||
"axis %d out of bounds (%d)" % (axis_orig, a.ndim)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is now redundant, _ureduce checks the axis input
I'm not sure we should return nan if there is a nan in the array, I can imagine it could break a couple applications. |
I agree that it will probably break some applications, which is why I think we'd need a |
I guess the generic guideline here is that if this is a "functionality On Sun, Mar 16, 2014 at 1:35 PM, Julian Taylor notifications@github.comwrote:
Nathaniel J. Smith |
@empeeu Universal test failure
Also, could you squash the commits down to one, maybe two? |
Hi yes, sorry, I haven't had time to work on this. I don't know why it's suddenly failing, hopefully I'll get to it soon. |
I added the tests from my cleanup commit to master, applying the rest from juliantaylor@e8f2acb should fix it |
Rolled all changes into one commit. Wow, git is great! Goodbye svn. :) |
@juliantaylor Is this ready? |
I guess, it still needs some cleanup and release notes but these can be added in another PR |
if np.issubdtype(a.dtype, np.inexact): | ||
if part.ndim <= 1: | ||
if np.isnan(part[-1]): | ||
return a.dtype.type(np.nan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array scalars can be used for out
and have dimension 0.
ISTM that if |
I'm unfortunately a bit low on time at the moment. If you feel like getting this done sooner, feel free to pull in and modify this PR. If you still have major changes in mind, just let me know and I can get to it eventually. |
I pushed a cleaned up version to this branch: though we also need the same for percentile and I don't see myself doing that today so I guess we will have to put it off to beta2 or 1.10 (or never). |
unless we want to merge it without "fixing" also percentile? |
Sorry for falling off the face of the earth. @juliantaylor thanks for cleaning this up. I see that nanmedian got merged successfully. Is percentile currently holding this up? |
@juliantaylor What is the status of this? |
if we want it the cleaned up branch should be merged. Though I do wonder if we really need it now that we have nanmedian and nanpercentile. |
If the needed functionality is provided by nanmedian and nanpercentile I don't see the need for this, but I'm not clear on what it adds. I can see raising a warning if a nan is found. |
I continue to consider the current behavior a bug. If I take the median of (1, 1, 2, 3, nan) I should not get 2 (current implementation returns 2). The safest behavior is to return nan. Perhaps it's cleaner if I just close this PR and let @juliantaylor create and merge a pull request? Otherwise I'm happy to merge @juliantaylor's modifications. Either way, just let me know. Throwing a similar error in percentile I think is a good idea. I suspect median is more widely used than percentile, but the change is warranted. I can see about doing this. |
IMO we do want this. This is mostly orthogonal to nanmedian -- nanmedian is I do agree that percentile needs a similar fix, but it's a bit weird to On Fri, Oct 10, 2014 at 10:13 PM, Matt Ueckermann notifications@github.com
Nathaniel J. Smith |
I'd argue its no bug, nans have a well defined ordering in numpy. median gives you the 50th percentile of all values. nanmedian gives you the 50th percentiles of all valid values where you consider nans invalid values. |
well it would be an option, instead of letting applications blow up by now receiving nans, compute normally check for nan and the repeat with nanmedian to give the user the result he/she probably wanted. |
If I have a bad value in an array, I don't expect median to hide it silently. In that case I don't care if nans are ordered in numpy. Min and max both return nan even if nans are ordered. Also I expect to get a similar result between mean and median. |
NaNs may have a well-defined ordering in numpy (which is dubious enough), A common case where NaNs are generated is when evaluating an indeterminate On Fri, Oct 10, 2014 at 10:35 PM, Julian Taylor notifications@github.com
Nathaniel J. Smith |
returning NaN for the median/percentile would make most sense to me as well, would that be very expensive to check for? |
I guess I can live with the all operations involving nan return nan argument. The cleanup commit would need to be picked then (and updated to use 1.10 release notes) |
Ok. I'll pull in your cleaned up version @juliantaylor, and I'll update to use the 1.10 release notes. I'll get to it at some point this week. Thanks everyone. |
Rebased, merged in @juliantaylor 's changes (manually), and updated the release notes for 1.10. I'll get working on percentile. |
Can this cleaned up version be merged? |
Added nan-checking to percentile as well https://github.com/empeeu/numpy/tree/percentile-nan . I have the changes to median in that branch as well (and I just rebased it). Would you prefer a separate PR for percentile (probably)? Or all of it together, just let me know. |
Just checking in to make sure that the nan handling of median this is still on the radar. I'm happy to rebase again because I figure this PR is probably out of date again. I should be pretty responsive this weekend for quick change. Just let me know what y'all want and I'll get it done. :) |
This look like a useful patch to fix buggy behavior -- are there any remaining issues for merging this for numpy 1.10? @juliantaylor? |
to close issue numpy#586. Also added unit test.
fix output argument, add warnings, fix style and add a note in the release notes.
I rebased again since this was getting out of date. @jaimefrio can this be merged? I was wondering if nan-percentile was holding this up, so I made another pull request. Let me know if I should close one of these. See #5753. |
rout = a.dtype.type(np.nan) | ||
else: | ||
for i in range(np.count_nonzero(n.ravel())): | ||
warnings.warn("Invalid value encountered in median", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issuing warnings in a loop seems like a bad idea, for two reasons:
- It would just produce a lot of noise in logs (if anyone is not filtering all but the first warning)
- Issuing warnings is actually very slow, so this could be a serious performance bottleneck. Consider:
In [10]: a = np.zeros((10000, 2))
In [11]: %timeit np.median(a, axis=0)
10000 loops, best of 3: 117 µs per loop
In [12]: %timeit for x in range(10000): warnings.warn("Invalid value encountered in median", RuntimeWarning)
/Users/shoyer/miniconda/envs/numpy-dev/bin/ipython:257: RuntimeWarning: Invalid value encountered in median
100 loops, best of 3: 12.3 ms per loop
☔ The latest upstream changes (presumably #5977) made this pull request unmergeable. Please resolve the merge conflicts. |
to close issue #586.
Also added unit test.
This was previously PR #4287 , but I botched the rebase, so I just started over.
Also related was PR #3908 and #4307