8000 add extended axis and keepdims support to percentile and median by juliantaylor · Pull Request #3908 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

add extended axis and keepdims support to percentile and median #3908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 13, 2014

Conversation

juliantaylor
Copy link
Contributor

No description provided.

@juliantaylor
Copy link
Contributor Author

An issue is the new median doesn't call mean anymore, so it will break astropy

8000

@juliantaylor
Copy link
Contributor Author

using percentile to implement median makes it twice as slow for small arrays (< 10000 elements)
maybe we can keep the explicit the median code?

@njsmith
Copy link
Member
njsmith commented Oct 13, 2013

What about making percentile faster, can we do that? :-)
On 13 Oct 2013 17:57, "Julian Taylor" notifications@github.com wrote:

using percentile to implement median makes it twice as slow for small
arrays (< 10000 elements)
maybe we can keep the explicit the median code?


Reply to this email directly or view it on GitHubhttps://github.com//pull/3908#issuecomment-26221596
.

@juliantaylor
Copy link
Contributor Author

improved percentile a bit, only 50% slower now for even elements, about same speed for 5k element arrays with odd number of elements.

"""
call func with a as first argument swapping axis to use extended axis
on functions that don't support it natively
returns result and a.shape with axis dims set to 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs summary line, parameters documentation, etc.. In other words, the usual documentation. Blank line before last """`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also use more consistent line length.

@charris
Copy link
Member
charris commented Oct 16, 2013

An issue is the new median doesn't call mean anymore, so it will break astropy

Does astropy overload mean?

@juliantaylor
Copy link
Contributor Author

yes see #3851

@charris
Copy link
Member
charris commented Oct 16, 2013

Right, forgot about that. @astrofrog I wonder if there is a solution that doesn't involve numpy contortions.

@astrofrog
Copy link
Contributor

@charris - I really hope there is a better solution, as I am aware that what we use right now is fairly hacky I know that __numpy_ufunc__ is going to be in 1.9, but that doesn't solve the problem of how to ensure that mean/median/etc also get processed by the sub-class since those aren't ufuncs. It would be great if these functions were made (in Numpy) to call some kind of finalize/wrap method on sub-classes once they are done (similarly to ufuncs). Have there been any related proposals in the past?

@pv
Copy link
Member
pv commented Oct 17, 2013

mean, var, std, min, max, all, any are all implemented via ufuncs, so __numpy_ufunc__ should capture them.

median uses partition, which however cannot be captured.

@pv
Copy link
Member
pv commented Oct 17, 2013

However, I don't see real obstacles for implementing mean et al. as gufuncs. This might also be a performance increase, as these methods currently use temporary arrays etc.

@astrofrog
Copy link
Contributor

@pv - so just so I understand, is there a reason why partition can't be made to behave like mean et al? (in terms of __numpy_ufunc__)

@seberg
Copy link
Member
seberg commented Oct 17, 2013

@astrofrog, since, at least normally, partition uses a weighted calculation a simple mean just can't do it. Which would make it a bit of a kludge to use a mean.

if (q < 0).any() or (q > 1).any():
raise ValueError(
"Percentiles must be in the range [0,100]")
# faster than np.any(q < 0) or np.any(q < 1), relevant for small arrays
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, not sure, but large arrays may be plausible, too. I suppose it wouldn't be here if it wasn't a big difference (and yeah, 2-3 or so elements are probably common). But maybe we should make a threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reductions are so freakishly slow :(
the any() code takes 12 us, while a median of 5000 doubles takes 55us in total (including all the weighted partition stuff)

using count_nonzero is better for large q, but still slower than the python loop for len(q) < 10
nonzero wins if len(q) > 10, while median does not get much more expensive, I guess a threshold could make sense

In [3]: q = np.array([0.5])
In [4]: %timeit  ((q < 0).any() or (q > 1).any())
100000 loops, best of 3: 12.1 µs per loop

In [5]: %timeit  ((q < 0) | (q > 1)).any()
100000 loops, best of 3: 8.79 µs per loop

In [6]: %timeit  np.count_nonzero((q < 0) | (q > 1))
100000 loops, best of 3: 4.8 µs per loop


In [7]: def f(q):
   ...:     for i in range(q.size):
   ...:        if q[i] < 0. or q[i] > 100.:
   ...:           pass
   ...: 

In [8]: %timeit f(q)
1000000 loops, best of 3: 841 ns per loop

@seberg
Copy link
Member
seberg commented Dec 5, 2013

Didn't try it out live, but the tests look quite extensive anyway. @astrofrog how important is that mean call for you? If it is important, we can maybe just add check whether partition returned a base class array, and if not call the (no-op) .mean()

@astrofrog
Copy link
Contributor

@seberg - what's important for us is that running np.mean and np.median continue to call .mean() for numpy array sub-classes. Since partition is a new function, we don't really support it currently so it doesn't matter as much if it doesn't end up calling a method on the sub-class. Does this make sense?

@charris
Copy link
Member
charris commented Dec 22, 2013

This actually looks to have passed. Ping Travis to make it official.

@charris charris closed this Dec 22, 2013
@charris charris reopened this Dec 22, 2013
@charris
Copy link
Member
charris commented Dec 22, 2013

@astrofrog @seberg What to do here? I really don't like working around subclasses, numpy was not designed as a base class, should not be used as such, and we end up like Gulliver tangled up and paralysed in zillions of tiny hassles. I'd really like to see astropy not depend on other routines calling mean and median methods when available. At the very least, astropy should implement it's own median if numpy should be calling it. Forcing median to call mean just doesn't make sense to me. Grrr...

@astrofrog
Copy link
Contributor

@charris - I would disagree with the statement regarding sub-classing ndarray - it is a clearly documented feature in the Numpy Basics section (http://docs.scipy.org/doc/numpy/user/basics.subclassing.html). Having said that, I completely agree that relying on np.median to call np.mean is a hack that we should transition away from as soon as possible. Which is why np.median and np.partition should ideally support ndarray sub-classes (without calling np.mean). Why can median and partition not be made to call __array_wrap__/__array_finalize__/__numpy_ufunc__ or another equivalent sub-class method?

@seberg
Copy link
Member
seberg commented Dec 24, 2013

Calling __array_wrap__ seems reasonable to me. Frankly, we could probably call it much more often but to do that more general, but the state of all this subclassing isn't great anyway, and I doubt that alone might fix it.

@juliantaylor
Copy link
Contributor Author

updated according to sebergs comments, I don't see a reasonable way to call mean in this variant, just calling r.reshape(-1, 1).mean(axis=-1) on the result won't work because the result might not be a ndarray anymore but a scalar.
Similar with __array_wrap__ it won't work on scalars.

@juliantaylor
Copy link
Contributor Author

any ideas what to do with this?
played a bit more with __array_wrap__ and saw that neither quantities nor astropy would work with it except when we pretend percentile is a add ufunc

@juliantaylor
Copy link
Contributor Author

ok I reverted back to a separate median, can this now be looked at? its blocking the nan median PRs

for i, s in enumerate(sorted(keep)):
a = a.swapaxes(i, s)
# merge reduced axis
a = a.reshape(a.shape[:nkeep] + (np.prod(a.shape[nkeep:]),))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use (-1,) instead of (np.prod(a.shape[nkeep:]),).

Sorry, something went wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@charris
Copy link
Member
charris commented Mar 13, 2014

LGTM, modulo nitpicks. @astrofrog Does this work for you?

There are two things that it would be nice to have implemented for ndarrays, missing values and units. We tried the first and ended up deadlocked, and I'm not sure how we could deal with the second in a reasonable way... :( That said, we need to draw the line somewhere in dealing with ndarray subclasses, or at least come to some conclusion about general policy. Going far down this road leads to madness.

@njsmith
Copy link
Member
njsmith commented Mar 13, 2014

@charris: bit pattern NAs and units are both easy if we have parametrized
dtypes + base class ndarrays.
On 13 Mar 2014 01:05, "Charles Harris" notifications@github.com wrote:

LGTM, modulo nitpicks. @astrofrog https://github.com/astrofrog Does
this work for you?

There are two things that it would be nice to have implemented for
ndarrays, missing values and units. We tried the first and ended up
deadlocked, and I'm not sure how we could deal with the second in a
reasonable way... :( That said, we need to draw the line somewhere in
dealing with ndarray subclasses, or at least come to some conclusion about
general policy. Going far down this road leads to madness.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3908#issuecomment-37489469
.

@astrofrog
Copy link
Contributor

@charris thanks for the heads up! Just checking to see if it works for us.

@astrofrog
Copy link
Contributor

@charris - all tests in Astropy pass with this version of the Numpy branch.

Thanks for being willing to compromise to not break our subclassing. I wonder whether it would be worth having a BoF at SciPy 2014 to discuss subclassing Numpy arrays and discuss a long-term solution? We use it for unit support and I think some other packages do.

-------
result : tuple
Result of func(a, **kwargs) and a.shape with axis dims set to 1
suiteable to use to archive the same result as the ufunc keepdims
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suiteable <- suitable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

archive <- achieve?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed by reformulating the sentence

Merging median and percentile make would break astropy and quantities as
we don't call mean anymore. These packages rely on overriding mean to
add their own median behavior.
charris added a commit that referenced this pull request Mar 13, 2014
add extended axis and keepdims support to percentile and median
@charris charris merged commit 48c77a6 into numpy:master Mar 13, 2014
@charris
Copy link
Member
charris commented Mar 13, 2014
8C7C

Finally ;) Thanks Julian.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0