8000 Allow to choose median function for mad_std and median_absolute_deviation by saimn · Pull Request #5835 · astropy/astropy · GitHub
[go: up one dir, main page]

Skip to content

Allow to choose median function for mad_std and median_absolute_deviation #5835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 30, 2017

Conversation

saimn
Copy link
Contributor
@saimn saimn commented Feb 23, 2017

The idea here is to allow the user to choose which function must be used for the median computation for mad_std and median_absolute_deviation (similarly to sigma_clip), which would be useful for arrays containing NaNs. Using np.nanmedian can be much faster than having to use masked arrays (see example below).
Also, np.nanmedian allows to use a multi-dimensional axis, as well as np.median, but np.ma.median cannot do this (though it's possible to bypass this limitation with a reshape).

Example on a data cube:

In [3]: cube.shape
Out[3]: (3681, 337, 336) 

Masked array version:

In [4]: %time std = mad_std(mcube.reshape(mcube.shape[0], -1), axis=1)
CPU times: user 1min 56s, sys: 22.4 s, total: 2min 19s
Wall time: 2min 19s   

With this PR, np.nanmedian and a multi-dimensional axis:

In [5]: %time std2 = mad_std(cube, axis=(1, 2), func=np.nanmedian)        
CPU times: user 14 s, sys: 2.41 s, total: 16.4 s                             
Wall time: 16.4 s
                                                                       
In [6]: np.allclose(std, std2)                                                              
Out[6]: True                 

With bottleneck (does not support multi-dimensional axis):

In [9]: %time std3 = mad_std(cube.reshape(mcube.shape[0], -1), axis=1, func=bn.nanmedian)
CPU times: user 12.2 s, sys: 1.82 s, total: 14.1 s
Wall time: 14.1 s

The PR is not polished yet, it needs tests, docstring, changelog etc., but first I wanted to open the discussion (@larrybradley , anyone else ?)

@saimn
Copy link
Contributor Author
saimn commented Feb 23, 2017

Oh, I could have skipped the CI, if someone wants to kill the builds ...

@larrybradley
Copy link
Member

This looks great to me, please continue! Thanks, @saimn.

@MSeifert04
Copy link
Contributor
MSeifert04 commented Feb 23, 2017

This has some overlap with #5232.

I like the possibility of providing a function (🎉). But it seems weird that any providided function is overridden if the input is a masked array.

@larrybradley
Copy link
Member
larrybradley commented Feb 23, 2017

Ah, yes, thanks for reminding me about #5232.

For a masked array input, we could turn the masked values to NaNs and use np.nanmedian.

EDIT: I'm not sure if we can give the user a choice for masked arrays.

@saimn
Copy link
Contributor Author
saimn commented Feb 23, 2017

Oh indeed it's quite the same as #5232, but with a different approach. I prefer a simple approach like here where the user has the responsibility to choose the right function (nanmedian is not available before Numpy 1.9, but one could use bottleneck instead), instead of a function that tries to handle everything as in #5232 (and as such has to perform more operations, like np.isnan calls, or potentially re-masking the NaNs in a masked array). @keflavich - What's your opinion ?

For the masked array case, currently it overrides the provided function, which is I think the best choice here.
One could fill the array with NaNs and use np.nanmedian, but np.nanmedian is Numpy 1.9+, and we need to give back a masked array to the user so it needs some extra steps.

@saimn
Copy link
Contributor Author
saimn commented Feb 23, 2017

Reading a bit more #5232, in particular #5232 (comment) and the benchmark part (#5232 (comment) and below). For me it strengthens the idea that it is difficult to make good default choices, and the best is to give the user to possibility to choose :)

@keflavich
Copy link
Contributor

@saimn I lean the other way - I don't think it's reasonable to expect users to understand all the intricacies of interactions between different versions of numpy and different array types. Personally, if I ever had to figure out which variant of median to use, I'd go back to the source code in #5232.

@crawfordsm
Copy link
Member

I definitely like the possibility of the user having the option to chose the function, but I see the argument in that the average user who will just want the result doesn't want to go through all the work to figure out which to use.

So I think the best solution would be a hybrid of the two. Something where the user could specify the function they want to use (and then it would be use that function regardless of what else the user did), but the default would not be an numpy.median but an astropy.stats function that would select the correct median to call depending on the version and the type of array. This then moves the decision tree into one function where it can be maintained and also re-used if necessary.

I have to say I don't like how this one changes the function after the user has specified it depending on the array type as that seems non-intuitive from the users perspective to me and something that is done without any warning to the user.

@saimn
Copy link
Contributor Author
saimn commented Mar 7, 2017

I have changed the default function to None which allows to respect the user's choice even for masked arrays.
Having a astropy.stats.median function which provides a good default choice could indeed be a solution, as long as there is the possibility to override this. Maybe a solution based on #5232 can be added later ?

@crawfordsm
Copy link
Member

@keflavich are you happy with this merging and then in another pull request adding an astropy.stats.median that chooses the appropriate median?

8000

@keflavich
Copy link
Contributor

Yes, I think this approach is fine for now and we can hopefully enhance it later.

@saimn
Copy link
Contributor Author
saimn commented Mar 10, 2017

Ok great ! Added changelog and a test for the multi-dimensional axis case, it should be ready now.

@crawfordsm
Copy link
Member

@saimn -- it looks like your test for multi-dimensional axis is failing in older versions of numpy.

@saimn
Copy link
Contributor Author
saimn commented Mar 13, 2017

@crawfordsm - Thanks for the notice, it was indeed introduced in Numpy 1.9 for np.median. I will skip the test for older Numpy versions.

@saimn
Copy link
Contributor Author
saimn commented Mar 14, 2017

Failure is due to a timeout in the documentation build:

Warning: Embedding the documentation hyperlinks requires Internet access.
Please check your network connection.
Unable to continue embedding `numpy` links due to a URL Error:
(TimeoutError(110, 'Connection timed out'),)

@saimn
Copy link
Contributor Author
saimn commented Mar 28, 2017

ping @crawfordsm - I think this one is ready (modulo the doc build failure, which you may want to restart ?)

@MSeifert04
Copy link
Contributor

I restarted it. That was the time the numpy documentation was down, right? 😄


a = np.asanyarray(a)
a_median = func(a, axis=axis)

# broadcast the median array before subtraction
if axis is not None:
a_median = np.expand_dims(a_median, axis=axis)
if isinstance(axis, (tuple, list)):
Copy link
Contributor
@MSeifert04 MSeifert04 Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are lists possible? I thought it was only tuple or int. Additional thought: Could this be replaced with a try / except TypeError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah fun, lists are valid for np.median (docstring speaks of a sequence of int) but not for np.mean. I usually prefer to use try/except for real errors, here using isinstance is more explicit imo.
And thanks for restarting 😉

Copy link
Contributor
@MSeifert04 MSeifert04 Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that sequence of int is the problem. What if I used axis=np.array([1, 2]) as input? arrays are valid axis inputs for median.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sure, it's valid for this case. And after more testing it seems safe to catch TypeError, so if it's your pleasure to use a Numpy array here let's go for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except that np.expand_dims([0], axis=np.array([1,2])) raises a ValueError, so two exceptions to catch.

try:
a_median = np.expand_dims(a_median, axis=axis)
except (ValueError, TypeError):
for ax in sorted(list(axis)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does sorted work correctly even for negative axis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

a_median = np.expand_dims(a_median, axis=axis)
try:
a_median = np.expand_dims(a_median, axis=axis)
except (ValueError, TypeError):
Copy link
Contributor
@MSeifert04 MSeifert04 Mar 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about:

try:
    len(axis)
except TypeError:
    axis = [axis]

for ax in ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with astropy.utils.isiterable which is quite similar.

@saimn
Copy link
Contributor Author
saimn commented Mar 29, 2017

Nice, 2 unrelated failures this time:

@saimn saimn mentioned this pull request Mar 29, 2017
@saimn saimn force-pushed the mad-choose-func branch from 6227d06 to 2a553d7 Compare March 30, 2017 12:22
@saimn
Copy link
Contributor Author
saimn commented Mar 30, 2017

Rebased to get the heaviside fix, now Travis is green (thanks @mhvk !)

@crawfordsm crawfordsm merged commit d9c8238 into astropy:master Mar 30, 2017
@crawfordsm
Copy link
Member

Merging! Thanks @saimn !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0