-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
add extended axis and keepdims support to percentile and median #3908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
An issue is the new median doesn't call mean anymore, so it will break astropy |
using percentile to implement median makes it twice as slow for small arrays (< 10000 elements) |
What about making percentile faster, can we do that? :-)
|
improved percentile a bit, only 50% slower now for even elements, about same speed for 5k element arrays with odd number of elements. |
""" | ||
call func with a as first argument swapping axis to use extended axis | ||
on functions that don't support it natively | ||
returns result and a.shape with axis dims set to 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs summary line, parameters documentation, etc.. In other words, the usual documentation. Blank line before last """`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also use more consistent line length.
Does astropy overload mean? |
yes see #3851 |
Right, forgot about that. @astrofrog I wonder if there is a solution that doesn't involve numpy contortions. |
@charris - I really hope there is a better solution, as I am aware that what we use right now is fairly hacky I know that |
|
However, I don't see real obstacles for implementing |
@pv - so just so I understand, is there a reason why |
@astrofrog, since, at least normally, |
if (q < 0).any() or (q > 1).any(): | ||
raise ValueError( | ||
"Percentiles must be in the range [0,100]") | ||
# faster than np.any(q < 0) or np.any(q < 1), relevant for small arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, not sure, but large arrays may be plausible, too. I suppose it wouldn't be here if it wasn't a big difference (and yeah, 2-3 or so elements are probably common). But maybe we should make a threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reductions are so freakishly slow :(
the any() code takes 12 us, while a median of 5000 doubles takes 55us in total (including all the weighted partition stuff)
using count_nonzero is better for large q, but still slower than the python loop for len(q) < 10
nonzero wins if len(q) > 10, while median does not get much more expensive, I guess a threshold could make sense
In [3]: q = np.array([0.5])
In [4]: %timeit ((q < 0).any() or (q > 1).any())
100000 loops, best of 3: 12.1 µs per loop
In [5]: %timeit ((q < 0) | (q > 1)).any()
100000 loops, best of 3: 8.79 µs per loop
In [6]: %timeit np.count_nonzero((q < 0) | (q > 1))
100000 loops, best of 3: 4.8 µs per loop
In [7]: def f(q):
...: for i in range(q.size):
...: if q[i] < 0. or q[i] > 100.:
...: pass
...:
In [8]: %timeit f(q)
1000000 loops, best of 3: 841 ns per loop
Didn't try it out live, but the tests look quite extensive anyway. @astrofrog how important is that mean call for you? If it is important, we can maybe just add check whether partition returned a base class array, and if not call the (no-op) |
@seberg - what's important for us is that running |
This actually looks to have passed. Ping Travis to make it official. |
@astrofrog @seberg What to do here? I really don't like working around subclasses, numpy was not designed as a base class, should not be used as such, and we end up like Gulliver tangled up and paralysed in zillions of tiny hassles. I'd really like to see astropy not depend on other routines calling mean and median methods when available. At the very least, astropy should implement it's own median if numpy should be calling it. Forcing median to call mean just doesn't make sense to me. Grrr... |
@charris - I would disagree with the statement regarding sub-classing ndarray - it is a clearly documented feature in the Numpy Basics section (http://docs.scipy.org/doc/numpy/user/basics.subclassing.html). Having said that, I completely agree that relying on |
Calling |
updated according to sebergs comments, I don't see a reasonable way to call mean in this variant, just calling |
any ideas what to do with this? |
ok I reverted back to a separate median, can this now be looked at? its blocking the nan median PRs |
for i, s in enumerate(sorted(keep)): | ||
a = a.swapaxes(i, s) | ||
# merge reduced axis | ||
a = a.reshape(a.shape[:nkeep] + (np.prod(a.shape[nkeep:]),)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use (-1,)
instead of (np.prod(a.shape[nkeep:]),)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
LGTM, modulo nitpicks. @astrofrog Does this work for you? There are two things that it would be nice to have implemented for ndarrays, missing values and units. We tried the first and ended up deadlocked, and I'm not sure how we could deal with the second in a reasonable way... :( That said, we need to draw the line somewhere in dealing with ndarray subclasses, or at least come to some conclusion about general policy. Going far down this road leads to madness. |
@charris: bit pattern NAs and units are both easy if we have parametrized
|
@charris thanks for the heads up! Just checking to see if it works for us. |
@charris - all tests in Astropy pass with this version of the Numpy branch. Thanks for being willing to compromise to not break our subclassing. I wonder whether it would be worth having a BoF at SciPy 2014 to discuss subclassing Numpy arrays and discuss a long-term solution? We use it for unit support and I think some other packages do. |
------- | ||
result : tuple | ||
Result of func(a, **kwargs) and a.shape with axis dims set to 1 | ||
suiteable to use to archive the same result as the ufunc keepdims |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suiteable <- suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
archive <- achieve?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed by reformulating the sentence
Merging median and percentile make would break astropy and quantities as we don't call mean anymore. These packages rely on overriding mean to add their own median behavior.
add extended axis and keepdims support to percentile and median
Finally ;) Thanks Julian. |
No description provided.