ENH: Added proper handling of nans for numpy.lib.function_base.median #4460

empeeu · 2014-03-08T14:56:14Z

to close issue #586.

Also added unit test.

This was previously PR #4287 , but I botched the rebase, so I just started over.

Also related was PR #3908 and #4307

jaimefrio · 2014-03-08T15:35:27Z

numpy/lib/function_base.py

@@ -2773,33 +2773,31 @@ def median(a, axis=None, out=None, overwrite_input=False):
    if axis is not None and axis >= a.ndim:


The normal handling of axis is:

if axis is not None: if axis < a.ndim: axis += a.ndim if axis < 0 or axis >= a.ndim: raise ValueError(...)

empeeu · 2014-03-08T17:38:08Z

@jaimefrio : I get your gist, I have to handle the cases where axis < 0. I think the conditional on the second 'if' should be:

if axis < 0:
    axis += a.ndim

And I should raise a ValueError instead of an IndexError?

jaimefrio · 2014-03-08T18:55:38Z

Oops! Yes, < 0, my bad.

And a ValueError is what most numpy functions raise when an axis is out of bounds, see e.g. partition:

>>> np.partition([1, 2, 3], [1], axis=2)
...
ValueError: axis(=2) out of bounds

I know that in @seberg's rewrite of indexing he changed some ValueErrors to IndexErrors, but I don't think that applies to axis.

jaimefrio · 2014-03-08T19:11:52Z

numpy/lib/function_base.py

+            axis += a.ndim
+        if axis < 0 or axis >= a.ndim:
+            raise IndexError(
+                "axis %d out of bounds (%d)" % (axis, a.ndim))


I forgot to mention this earlier, sorry for the noise... If the user passes in a negative axis that is out of bounds, the error message will be confusing, because you will be reporting the value after adding a.ndim, not the value the user entered. You probably want to use another variable to store the original value. There has to be some example of this in the Python code base, but I can't seem to find any. Here is how partition does it in the C code.

charris · 2014-03-11T23:30:38Z

@jaimefrio Do you think this is ready?

juliantaylor · 2014-03-11T23:34:47Z

I still think the extended axis pr should be merged first, but as that seems to be going nowhere I guess I'll add the mean cludge back in to not block this any longer

this pr still needs a warning if a nan is encountered
also I wonder if we can change it to output nan if one is found, its a quite significant change from returning almost correct result to nan for data with only a very small fraction of nans in it.

charris · 2014-03-11T23:42:11Z

numpy/lib/function_base.py

+            out = np.asanyarray(mean(part[indexer], axis=axis, out=out))
+            out[ids] = np.nan
+            return out
+    else: #if there are no nans


The comment here would look better on the next line.

Agreed, moved comment to next line.

juliantaylor · 2014-03-13T22:42:11Z

needs a rebase now that extended axis support is merged

juliantaylor · 2014-03-16T12:59:09Z

numpy/lib/function_base.py

+            axis += a.ndim
+        if axis < 0 or axis >= a.ndim:
+            raise ValueError(
+                "axis %d out of bounds (%d)" % (axis_orig, a.ndim))


this check is now redundant, _ureduce checks the axis input

juliantaylor · 2014-03-16T13:35:53Z

I'm not sure we should return nan if there is a nan in the array, I can imagine it could break a couple applications.
maybe it is better to just emit a warning and only change to nan return value after one or two releases

empeeu · 2014-03-16T17:32:18Z

I agree that it will probably break some applications, which is why I think we'd need a nanmedian (#4307) to be released at the same time. That way, when applications break you could either switch to nanmedian, if that's the behaviour you want, or you'll be alerted to the fact that you have nan's coming from somewhere upstream in the code. I also think it SHOULD break a few applications, because it's usually worse to get almost the right answer instead of a completely wrong answer. Very hard to track down bugs where you have inconsistent performance. (Imagine sometime having a 1000 element array with 1 nan, versus 100 nans, but always getting a median that's a real number).

njsmith · 2014-03-16T17:35:34Z

I guess the generic guideline here is that if this is a "functionality
change" then we need a warning cycle, and if it's a "bug fix" then we just
do it. I can see the argument for this being borderline between these,
because previously median did always return a non-nan value (in almost
all cases), so at least it was consistent. But given that it was
consistently returning the wrong value (not even the same as what
nanmedian would return!), I'm inclined to call it a bug fix and get the fix
out there as fast as possible.

On Sun, Mar 16, 2014 at 1:35 PM, Julian Taylor notifications@github.comwrote:

I'm not sure we should return nan if there is a nan in the array, I can
imagine it could break a couple applications.
maybe it is better to just emit a warning and only change to nan return
value after one or two releases

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/4460#issuecomment-37757201
.

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

charris · 2014-03-23T01:22:40Z

@empeeu Universal test failure

======================================================================

ERROR: test_basic (test_function_base.TestMedian)

----------------------------------------------------------------------

Traceback (most recent call last):

File "/home/travis/virtualenv/python2.6/lib/python2.6/site-packages/numpy/lib/tests/test_function_base.py", line 1866, in test_basic

assert_equal(np.median(a).ndim, 0)

AttributeError: 'float' object has no attribute 'ndim'

Also, could you squash the commits down to one, maybe two?

empeeu · 2014-03-23T16:36:25Z

Hi yes, sorry, I haven't had time to work on this. I don't know why it's suddenly failing, hopefully I'll get to it soon.

juliantaylor · 2014-03-23T17:29:04Z

I added the tests from my cleanup commit to master, applying the rest from juliantaylor@e8f2acb should fix it

empeeu · 2014-03-28T01:41:23Z

Rolled all changes into one commit. Wow, git is great! Goodbye svn. :)

charris · 2014-03-31T23:39:48Z

@juliantaylor Is this ready?

juliantaylor · 2014-03-31T23:41:13Z

I guess, it still needs some cleanup and release notes but these can be added in another PR

charris · 2014-04-01T20:24:26Z

numpy/lib/function_base.py

+    if np.issubdtype(a.dtype, np.inexact):
+        if part.ndim <= 1:
+            if np.isnan(part[-1]):
+                return a.dtype.type(np.nan)


Array scalars can be used for out and have dimension 0.

charris · 2014-04-01T20:27:52Z

ISTM that if _ureduce always swapped the axis then both it and the functions it called would be made simpler since no axis keyword would be needed for the call.

empeeu · 2014-04-02T23:50:55Z

I'm unfortunately a bit low on time at the moment. If you feel like getting this done sooner, feel free to pull in and modify this PR. If you still have major changes in mind, just let me know and I can get to it eventually.

juliantaylor · 2014-06-08T11:38:26Z

I pushed a cleaned up version to this branch:
https://github.com/juliantaylor/numpy/tree/empeeu-median-nan

though we also need the same for percentile and I don't see myself doing that today so I guess we will have to put it off to beta2 or 1.10 (or never).

juliantaylor · 2014-06-08T11:42:03Z

unless we want to merge it without "fixing" also percentile?

empeeu · 2014-10-03T21:14:34Z

Sorry for falling off the face of the earth. @juliantaylor thanks for cleaning this up. I see that nanmedian got merged successfully. Is percentile currently holding this up?

charris · 2014-10-10T16:22:35Z

@juliantaylor What is the status of this?

juliantaylor · 2014-10-10T17:07:41Z

if we want it the cleaned up branch should be merged. Though I do wonder if we really need it now that we have nanmedian and nanpercentile.
I think it would a problem to ignore nans in median but not percentile as it would break the equivalency of percentile(50) and median. So percentile should get the same change which is probably quite tedious to do.

charris · 2014-10-10T17:21:32Z

If the needed functionality is provided by nanmedian and nanpercentile I don't see the need for this, but I'm not clear on what it adds. I can see raising a warning if a nan is found.

empeeu · 2014-10-10T21:13:37Z

I continue to consider the current behavior a bug. If I take the median of (1, 1, 2, 3, nan) I should not get 2 (current implementation returns 2). The safest behavior is to return nan.

Perhaps it's cleaner if I just close this PR and let @juliantaylor create and merge a pull request? Otherwise I'm happy to merge @juliantaylor's modifications. Either way, just let me know.

Throwing a similar error in percentile I think is a good idea. I suspect median is more widely used than percentile, but the change is warranted. I can see about doing this.

njsmith · 2014-10-10T21:27:05Z

IMO we do want this. This is mostly orthogonal to nanmedian -- nanmedian is
a way to compute the median while ignoring NaN values, this is about the
fact that currently median flat out returns the wrong answer when NaNs are
present.

I do agree that percentile needs a similar fix, but it's a bit weird to
make fixing that bug a precondition to fixing this bug...

On Fri, Oct 10, 2014 at 10:13 PM, Matt Ueckermann notifications@github.com
wrote:

I continue to consider the current behavior a bug. If I take the median of
(1, 1, 2, 3, nan) I should not get 2 (current implementation returns 2).
The safest behavior is to return nan.

Perhaps it's cleaner if I just close this PR and let @juliantaylor
https://github.com/juliantaylor create and merge a pull request?
Otherwise I'm happy to merge @juliantaylor
https://github.com/juliantaylor's modifications. Either way, just let
me know.

Throwing a similar error in percentile I think is a good idea. I suspect
median is more widely used than percentile, but the change is warranted. I
can see about doing this.

—
Reply to this email directly or view it on GitHub
#4460 (comment).

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

juliantaylor · 2014-10-10T21:35:56Z

I'd argue its no bug, nans have a well defined ordering in numpy. median gives you the 50th percentile of all values. nanmedian gives you the 50th percentiles of all valid values where you consider nans invalid values.
The issue is just that probably many people are not aware of that they should be using nanmedian.
If we really want to help these we could just remove median for floats completely and automatically change it to nanmedian. But that would also be terribly slow.

juliantaylor · 2014-10-10T21:41:42Z

well it would be an option, instead of letting applications blow up by now receiving nans, compute normally check for nan and the repeat with nanmedian to give the user the result he/she probably wanted.
I like that a bit more than returning nan, but I still really don't like it.

Nodd · 2014-10-10T22:06:08Z

If I have a bad value in an array, I don't expect median to hide it silently. In that case I don't care if nans are ordered in numpy. Min and max both return nan even if nans are ordered. Also I expect to get a similar result between mean and median.

njsmith · 2014-10-10T22:12:22Z

NaNs may have a well-defined ordering in numpy (which is dubious enough),
but I can't see how this justifies violating the usual semantics of NaNs or
is something that any user would ever want. IEEE754 defines basically every
operation that involves a NaN to return NaN.

A common case where NaNs are generated is when evaluating an indeterminate
expression like lim_x lim_y x/y, as x and y both go to 0. This could be
literally anything from -Inf to +Inf, so 0/0 returns NaN to represent that
indeterminacy. Right now median treats such values as if they are always
+Inf. This is just not useful.

On Fri, Oct 10, 2014 at 10:35 PM, Julian Taylor notifications@github.com
wrote:

I'd argue its no bug, nans have a well defined ordering in numpy. median
gives you the 50th percentile of all values. nanmedian gives you the 50th
percentiles of all valid values where you consider nans invalid values.
The issue is just that probably many people are not aware of that they
should be using nanmedian.
If we really want to help these we could just remove median for floats
completely and automatically change it to nanmedian. But that would also be
terribly slow.

—
Reply to this email directly or view it on GitHub
#4460 (comment).

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

seberg · 2014-10-11T13:05:43Z

returning NaN for the median/percentile would make most sense to me as well, would that be very expensive to check for?

juliantaylor · 2014-10-11T13:17:19Z

I guess I can live with the all operations involving nan return nan argument. The cleanup commit would need to be picked then (and updated to use 1.10 release notes)

empeeu · 2014-10-13T02:18:20Z

Ok. I'll pull in your cleaned up version @juliantaylor, and I'll update to use the 1.10 release notes. I'll get to it at some point this week.

Thanks everyone.

empeeu · 2014-10-15T02:54:55Z

Rebased, merged in @juliantaylor 's changes (manually), and updated the release notes for 1.10.

I'll get working on percentile.

empeeu · 2014-10-18T01:43:25Z

Can this cleaned up version be merged?

empeeu · 2014-10-30T02:32:48Z

Added nan-checking to percentile as well https://github.com/empeeu/numpy/tree/percentile-nan . I have the changes to median in that branch as well (and I just rebased it). Would you prefer a separate PR for percentile (probably)? Or all of it together, just let me know.

empeeu · 2014-11-15T02:09:55Z

Just checking in to make sure that the nan handling of median this is still on the radar. I'm happy to rebase again because I figure this PR is probably out of date again. I should be pretty responsive this weekend for quick change. Just let me know what y'all want and I'll get it done. :)

shoyer · 2015-01-03T16:06:13Z

This look like a useful patch to fix buggy behavior -- are there any remaining issues for merging this for numpy 1.10? @juliantaylor?

to close issue numpy#586. Also added unit test.

fix output argument, add warnings, fix style and add a note in the release notes.

empeeu · 2015-04-07T23:56:36Z

I rebased again since this was getting out of date. @jaimefrio can this be merged?

I was wondering if nan-percentile was holding this up, so I made another pull request. Let me know if I should close one of these. See #5753.

shoyer · 2015-06-02T06:09:18Z

numpy/lib/function_base.py

+                    rout = a.dtype.type(np.nan)
+        else:
+            for i in range(np.count_nonzero(n.ravel())):
+                warnings.warn("Invalid value encountered in median",


Issuing warnings in a loop seems like a bad idea, for two reasons:

It would just produce a lot of noise in logs (if anyone is not filtering all but the first warning)

Issuing warnings is actually very slow, so this could be a serious performance bottleneck. Consider:

In [10]: a = np.zeros((10000, 2)) In [11]: %timeit np.median(a, axis=0) 10000 loops, best of 3: 117 µs per loop In [12]: %timeit for x in range(10000): warnings.warn("Invalid value encountered in median", RuntimeWarning) /Users/shoyer/miniconda/envs/numpy-dev/bin/ipython:257: RuntimeWarning: Invalid value encountered in median 100 loops, best of 3: 12.3 ms per loop

homu · 2015-06-17T21:15:35Z

☔ The latest upstream changes (presumably #5977) made this pull request unmergeable. Please resolve the merge conflicts.

empeeu · 2015-06-23T00:42:31Z

@shoyer I agree. Moved it outside the for loop, see #5753

@homu I rebased again. Should now be able to merge automatically again on #5753.

I'm closing this PR because #5753 handles both the median-nan and percentile-nan functions and I haven't had any response to my question April 7.

jaimefrio reviewed Mar 8, 2014
View reviewed changes

charris reviewed Mar 11, 2014
View reviewed changes

juliantaylor reviewed Mar 16, 2014
View reviewed changes

charris reviewed Apr 1, 2014
View reviewed changes

empeeu force-pushed the median-nan branch from e5afa06 to 06c8c18 Compare October 15, 2014 02:18

rgommers added component: numpy.lib 03 - Maintenance labels Mar 8, 2015

empeeu force-pushed the median-nan branch from 06c8c18 to 30590c2 Compare April 7, 2015 23:49

empeeu added 2 commits April 7, 2015 19:51

ENH: Added proper handling of nans for numpy.lib.function_base.median

5e99e33

to close issue numpy#586. Also added unit test.

MAINT: cleanup median nan check code

a5f41a3

fix output argument, add warnings, fix style and add a note in the release notes.

empeeu force-pushed the median-nan branch from 30590c2 to a5f41a3 Compare April 7, 2015 23:55

shoyer reviewed Jun 2, 2015
View reviewed changes

empeeu closed this Jun 23, 2015

		@@ -2773,33 +2773,31 @@ def median(a, axis=None, out=None, overwrite_input=False):
		if axis is not None and axis >= a.ndim:

Uh oh!

ENH: Added proper handling of nans for numpy.lib.function_base.median #4460

ENH: Added proper handling of nans for numpy.lib.function_base.median #4460

Uh oh!

Conversation

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!