Due to the use of apply_along_axis, masked median operations get incredibly slow when computed along a small dimension of a large array:
In [3]: d = np.random.uniform(size=(200, 200, 50))
In [4]: %timeit np.median(d,axis=2)
10 loops, best of 3: 40.1 ms per loop
In [5]: %timeit np.nanmedian(d,axis=2)
1 loops, best of 3: 1.32 s per loop
In [6]: %timeit np.ma.median(d,axis=2)
1 loops, best of 3: 4.89 s per loop
A way out is using bottleneck's nanmedian, which is on par with our non-masked performance (and better for the slow axis), but I think we should consider ways to improve this in numpy itself.
I can think of the following improvements for median:
implement apply_along_axis in C; this should be relatively easy but will only give us a ~50% boost at best. It should still be worthwhile, as there are likely many more users of that function out there.
extend partition to broadcast the index; then one could compute the right index from the mask and do everything in one partition call. Probably worthwhile, but it would make the partition interface more complicated and would probably imply rewriting our pretty ugly sorting code.
probably the simplest: add a check on the size of the axis and use sort instead of partition, as currently done in ma.median, but then select the elements to average via fancy indexing instead of apply_along_axis (always select the middle two, and if the non-masked count is odd, duplicate one of them so that mean() is a no-op).
add a MaskedPartition C-Api function
see if we can integrate bottleneck directly somehow
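The sort-plus-fancy-indexing variant from the list above could look roughly like the sketch below: sort once along the axis (with masked values pushed to the end), then pick the middle element(s) of each 1-d slice without any per-slice Python loop. This is only an illustration, not the actual ma.median code: masked_median_lastaxis is a hypothetical helper name, np.take_along_axis is used for brevity, and every slice is assumed to contain at least one unmasked value.

```python
import numpy as np

def masked_median_lastaxis(a, mask):
    # Sketch: median over the last axis, ignoring masked entries,
    # using one vectorized sort plus index selection.
    filled = np.where(mask, np.inf, a)   # masked values sort to the end
    s = np.sort(filled, axis=-1)
    n = (~mask).sum(axis=-1)             # unmasked count per 1-d slice
    lo = (n - 1) // 2                    # lo == hi when the count is odd,
    hi = n // 2                          # so the mean below is a no-op
    lo_v = np.take_along_axis(s, lo[..., None], axis=-1)[..., 0]
    hi_v = np.take_along_axis(s, hi[..., None], axis=-1)[..., 0]
    return 0.5 * (lo_v + hi_v)
```

The key point is that all the index arithmetic happens on whole arrays at once, so the cost is dominated by the single sort rather than by Python-level iteration over slices.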
Any more options I might be overlooking?
The masked median along a small dimension is extremely slow due to the
use of apply_along_axis, which iterates fully in Python. The unmasked
median is about 1000x faster.
Work around this issue by using indexing to select the median element
instead of apply_along_axis.
Further improvements are possible, e.g. using the current np.nanmedian
approach for masked medians along large dimensions (so partition is used
instead of sort), or extending partition to allow broadcasting over
multiple elements.
Closes numpy gh-4683.