Ufunc calls on scalars are very slow · Issue #11232 · numpy/numpy · GitHub
Open
mhvk opened this issue Jun 2, 2018 · 8 comments
Comments

@mhvk
Contributor
mhvk commented Jun 2, 2018

It is well known that ufunc calls on scalars are rather slow, but it is probably good to have a summary of why, for which it is useful to walk along the ufunc_generic_call path. I got only partway, but one possible solution might be for scalars to already get intercepted in CheckOverride, i.e., to treat them as if they had their own __array_ufunc__ (with priority even below that of ndarray; an actual __array_ufunc__ calling math is slightly slower than our present state).

  1. PyUFunc_CheckOverride: for non-arrays (thus including scalars), this checks whether the scalar has __array_ufunc__. Easy to avoid if our whole API is available - needs ENH: implement nep 0015: merge multiarray and umath #10915.
  2. PyUFunc_GenericFunction: to be done (will edit).
  3. make_full_arg_tuple: EDIT now fast (with MAINT: ensure we do not create unnecessary tuples for outputs #11231).
  4. _find_array_wrap -> _find_array_method: skips arrays and scalars, so should be reasonably fast (though a subclass check for Generic is done before type checks on Python objects in PyArray_IsAnyScalar (in ndarrayobject.h)).
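The check in step 1 can be sketched at the Python level (a rough model only; the real PyUFunc_CheckOverride is in C, and the helper name here is hypothetical):

```python
import numpy as np

def check_override(ufunc, *inputs):
    """Rough model of PyUFunc_CheckOverride: probe every non-ndarray
    input (scalars included) for an __array_ufunc__ attribute and
    return the first overriding input, or None for the default path."""
    for obj in inputs:
        if isinstance(obj, np.ndarray):
            continue  # plain ndarrays never trigger the override path
        if getattr(type(obj), "__array_ufunc__", None) is not None:
            return obj
    return None

class Overrider:
    # hypothetical example of a type that takes over the ufunc call
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return NotImplemented

assert check_override(np.sin, 1.0) is None        # plain float: no override
assert check_override(np.sin, Overrider()) is not None
```

This attribute probe on every non-array input is part of the fixed per-call cost that scalars pay.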

Simple timings

Single-input ufunc, comparing with math

a = 1.
a64 = np.float64(1.)
as64 = np.array(1., dtype=np.float64)
%timeit math.sin(a)  # and a64, as64
# 76, 76, 87  ns for a, a64, as64
%timeit np.sin(a)
# 600, 930, 450 ns for a, a64, as64
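The %timeit lines above can be reproduced outside IPython with the stdlib timeit module (absolute numbers are machine- and version-dependent):

```python
import timeit

# Same comparison as the %timeit lines, as a stand-alone script.
setup = ("import math, numpy as np; "
         "a = 1.; a64 = np.float64(1.); as64 = np.array(1., dtype=np.float64)")
n = 100_000
results = {}
for expr in ("math.sin(a)", "np.sin(a)", "np.sin(a64)", "np.sin(as64)"):
    results[expr] = timeit.timeit(expr, setup=setup, number=n) / n
    print(f"{expr:14s} {results[expr] * 1e9:8.1f} ns")
```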

Somewhat more random, for addition

%timeit np.add(1., 1)
# 1000000 loops, best of 3: 970 ns per loop
%timeit 1. + 1
# 100000000 loops, best of 3: 8.73 ns per loop
# slightly fairer
%timeit operator.add(1., 1)
# 10000000 loops, best of 3: 80.4 ns per loop
# Oddly, again, numpy scalars are much slower than 0-d arrays
a64 = np.float64(1.)
%timeit np.add(a64, a64)
# 1000000 loops, best of 3: 1.35 µs per loop
as64 = np.array(1., dtype=np.float64)
%timeit np.add(as64, as64)
# 1000000 loops, best of 3: 468 ns per loop
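The gap above is fixed per-call overhead rather than per-element cost, which a quick comparison against a larger array makes visible:

```python
import timeit
import numpy as np

zero_d = np.array(1.0)
big = np.ones(10_000)

n = 1_000
# per ufunc call on a 0-d array vs. per element amortized over 10k elements
t_call = timeit.timeit(lambda: np.add(zero_d, zero_d), number=n) / n
t_elem = timeit.timeit(lambda: np.add(big, big), number=n) / (n * big.size)
print(f"per call on 0-d array   : {t_call * 1e9:6.0f} ns")
print(f"per element on 10k array: {t_elem * 1e9:6.2f} ns")
```
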
@eric-wieser
Member

Does this have a benchmark? If not, it would be good to add one before digging any further.
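Such a benchmark might look roughly like the following asv (airspeed velocity) class, where methods named time_* get timed after setup runs; the class itself is a hypothetical sketch, not part of numpy's actual benchmark suite:

```python
import math
import numpy as np

class ScalarUfunc:
    """Hypothetical asv benchmark for ufunc calls on scalar-like inputs."""
    def setup(self):
        self.py_float = 1.0
        self.np_float64 = np.float64(1.0)
        self.zero_d = np.array(1.0, dtype=np.float64)

    def time_sin_python_float(self):
        np.sin(self.py_float)

    def time_sin_np_float64(self):
        np.sin(self.np_float64)

    def time_sin_0d_array(self):
        np.sin(self.zero_d)

    def time_math_sin_baseline(self):
        math.sin(self.py_float)
```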

@seberg
Member
seberg commented Jun 3, 2018

IIRC one of the slowest things right now may be the type resolution loop of the ufunc: I remember discussions that implementing a hash table for it would be good, and I don't think that has been done. A few years back some optimizations were done in the scalar paths, in a GSoC I think, so many of the lowest-hanging fruits are likely gone (though new ones might have come up, I guess).

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@seberg - I'm mostly looking to see whether my idea of intercepting already at the CheckOverride is worth it - if, say, a single-input ufunc gets just a double, can we speed things up? In some sense, as is, everything without __array_ufunc__ is treated as an array, but one might as well think of scalars as having their own implied __array_ufunc__, which we know can deal with scalars only.
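The implied-__array_ufunc__ idea can be demonstrated with an explicit wrapper (MathScalar is a hypothetical illustration of the mechanism, not a proposed numpy class; as noted above, going through an actual Python-level __array_ufunc__ is slightly slower than the present state):

```python
import math
import numpy as np

class MathScalar(float):
    """Hypothetical float carrying the 'implied' __array_ufunc__ that
    handles the scalar-only case via the math module."""
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        func = getattr(math, ufunc.__name__, None)
        if method != "__call__" or func is None or kwargs:
            return NotImplemented  # let numpy's normal machinery handle it
        if all(isinstance(x, (int, float)) for x in inputs):
            return func(*inputs)
        return NotImplemented

x = MathScalar(1.0)
assert np.sin(x) == math.sin(1.0)   # dispatched through __array_ufunc__
```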

@seberg
Member
seberg commented Jun 3, 2018

Yeah, it sounds like a good idea; a few simple fast paths like that for scalars could go a long way. If we can get such a thing right in a not too complex way, maybe it can even help reduce the code complexity/duplication that is currently really annoying for scalars? (I believe in scalars, but that extra code I do not believe in.)
There are also some really slow things, probably, such as the buffering/casting necessary for calls like np.add(1., 3), which is likely hilariously slow...
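The mixed-type case is easy to measure directly (numbers are machine-dependent, and the size of the gap depends on the numpy version):

```python
import timeit
import numpy as np

n = 100_000
# homogeneous vs. mixed-type scalar addition: the latter must also
# resolve and cast the Python int to float64
t_same = timeit.timeit(lambda: np.add(1.0, 3.0), number=n) / n
t_mixed = timeit.timeit(lambda: np.add(1.0, 3), number=n) / n
print(f"np.add(1., 3.) : {t_same * 1e9:6.0f} ns")
print(f"np.add(1., 3)  : {t_mixed * 1e9:6.0f} ns")
```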

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@eric-wieser - good point about the benchmarks. From a quick look, it seems there are benchmarks for array scalars but not for true scalars (neither numpy nor Python ones).

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@seberg - there is room for improvement... I now added a few timings on top.

@juliantaylor
Contributor

Here is one of the type lookup optimizations:
#4904

@mhvk
Contributor Author
mhvk commented Jan 11, 2024

It would still be good to see if one cannot optimize ufuncs called on scalars a bit. Timings on numpy-dev (nearly 2.0):

a = 1.
a64 = np.float64(1.)
as64 = np.array(1., dtype=np.float64)
%timeit math.sin(a)  # and a64, as64
# 53, 58, 70  ns for a, a64, as64
%timeit np.sin(a)
# 1010, 920, 560 ns for a, a64, as64
