Ufunc calls on scalars are very slow · Issue #11232 · numpy/numpy · GitHub
Open
mhvk opened this issue Jun 2, 2018 · 8 comments
Comments

@mhvk
Contributor
mhvk commented Jun 2, 2018

It is well known that ufunc calls on scalars are rather slow, but it is probably good to have a summary of why, for which it is useful to walk along the ufunc_generic_call path. I got only partway, but one possible solution might be for scalars to already get intercepted in CheckOverride, i.e., to treat them as if they had their own __array_ufunc__ (with priority even below that of ndarray; an actual __array_ufunc__ calling math is slightly slower than our present state).

  1. PyUFunc_CheckOverride: for non-arrays (thus including scalars), this checks whether the scalar has __array_ufunc__. Easy to avoid if our whole API is available - needs ENH: implement nep 0015: merge multiarray and umath #10915.
  2. PyUFunc_GenericFunction: to be done (will edit).
  3. make_full_arg_tuple: EDIT now fast (with MAINT: ensure we do not create unnecessary tuples for outputs #11231).
  4. _find_array_wrap -> _find_array_method: skips arrays and scalars, so should be reasonably fast (though a subclass check for Generic is done before type checks on Python objects in PyArray_IsAnyScalar (in ndarrayobject.h)).
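The check in step 1 can be sketched at the Python level (a rough model only; the real PyUFunc_CheckOverride is in C, and the helper name here is hypothetical):

```python
import numpy as np

def check_override(ufunc, *inputs):
    """Rough model of PyUFunc_CheckOverride: probe every non-ndarray
    input (scalars included) for an __array_ufunc__ attribute and
    return the first overriding input, or None for the default path."""
    for obj in inputs:
        if isinstance(obj, np.ndarray):
            continue  # plain ndarrays never trigger the override path
        if getattr(type(obj), "__array_ufunc__", None) is not None:
            return obj
    return None

class Overrider:
    # hypothetical example of a type that takes over the ufunc call
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return NotImplemented

assert check_override(np.sin, 1.0) is None        # plain float: no override
assert check_override(np.sin, Overrider()) is not None
```

This attribute probe on every non-array input is part of the fixed per-call cost that scalars pay.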

Simple timings

Single-input ufunc, comparing with math

a = 1.
a64 = np.float64(1.)
as64 = np.array(1., dtype=np.float64)
%timeit math.sin(a)  # and a64, as64
# 76, 76, 87  ns for a, a64, as64
%timeit np.sin(a)
# 600, 930, 450 ns for a, a64, as64
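The %timeit lines above can be reproduced outside IPython with the stdlib timeit module (absolute numbers are machine- and version-dependent):

```python
import timeit

# Same comparison as the %timeit lines, as a stand-alone script.
setup = ("import math, numpy as np; "
         "a = 1.; a64 = np.float64(1.); as64 = np.array(1., dtype=np.float64)")
n = 100_000
results = {}
for expr in ("math.sin(a)", "np.sin(a)", "np.sin(a64)", "np.sin(as64)"):
    results[expr] = timeit.timeit(expr, setup=setup, number=n) / n
    print(f"{expr:14s} {results[expr] * 1e9:8.1f} ns")
```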

Somewhat more random, for addition

%timeit np.add(1., 1)
# 1000000 loops, best of 3: 970 ns per loop
%timeit 1. + 1
# 100000000 loops, best of 3: 8.73 ns per loop
# slightly fairer
%timeit operator.add(1., 1)
# 10000000 loops, best of 3: 80.4 ns per loop
# Oddly, again, numpy scalars are much slower than 0-d arrays
a64 = np.float64(1.)
%timeit np.add(a64, a64)
# 1000000 loops, best of 3: 1.35 µs per loop
as64 = np.array(1., dtype=np.float64)
%timeit np.add(as64, as64)
# 1000000 loops, best of 3: 468 ns per loop
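The gap above is fixed per-call overhead rather than per-element cost, which a quick comparison against a larger array makes visible:

```python
import timeit
import numpy as np

zero_d = np.array(1.0)
big = np.ones(10_000)

n = 1_000
# per ufunc call on a 0-d array vs. per element amortized over 10k elements
t_call = timeit.timeit(lambda: np.add(zero_d, zero_d), number=n) / n
t_elem = timeit.timeit(lambda: np.add(big, big), number=n) / (n * big.size)
print(f"per call on 0-d array   : {t_call * 1e9:6.0f} ns")
print(f"per element on 10k array: {t_elem * 1e9:6.2f} ns")
```
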
@eric-wieser
Member

Does this have a benchmark? If not, it would be good to add one before digging any further.
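Such a benchmark might look roughly like the following asv (airspeed velocity) class, where methods named time_* get timed after setup runs; the class itself is a hypothetical sketch, not part of numpy's actual benchmark suite:

```python
import math
import numpy as np

class ScalarUfunc:
    """Hypothetical asv benchmark for ufunc calls on scalar-like inputs."""
    def setup(self):
        self.py_float = 1.0
        self.np_float64 = np.float64(1.0)
        self.zero_d = np.array(1.0, dtype=np.float64)

    def time_sin_python_float(self):
        np.sin(self.py_float)

    def time_sin_np_float64(self):
        np.sin(self.np_float64)

    def time_sin_0d_array(self):
        np.sin(self.zero_d)

    def time_math_sin_baseline(self):
        math.sin(self.py_float)
```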

@seberg
Member
seberg commented Jun 3, 2018

IIRC one of the slowest things right now may be the type resolution loop of the ufunc: I remember discussions that implementing a hash table for it would be good, and I don't think that has been done. A few years back some optimizations were done in the scalar paths, in a GSoC I think, so many of the lowest-hanging fruits are likely gone (though new ones might have come up, I guess).

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@seberg - I'm mostly looking to see whether my idea of intercepting already at the CheckOverride is worth it - if, say, a single-input ufunc gets just a double, can we speed things up? In some sense, as is, everything without __array_ufunc__ is treated as an array, but one might as well think of scalars as having their own implied __array_ufunc__, which we know can deal with scalars only.
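The implied-__array_ufunc__ idea can be demonstrated with an explicit wrapper (MathScalar is a hypothetical illustration of the mechanism, not a proposed numpy class; as noted above, going through an actual Python-level __array_ufunc__ is slightly slower than the present state):

```python
import math
import numpy as np

class MathScalar(float):
    """Hypothetical float carrying the 'implied' __array_ufunc__ that
    handles the scalar-only case via the math module."""
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        func = getattr(math, ufunc.__name__, None)
        if method != "__call__" or func is None or kwargs:
            return NotImplemented  # let numpy's normal machinery handle it
        if all(isinstance(x, (int, float)) for x in inputs):
            return func(*inputs)
        return NotImplemented

x = MathScalar(1.0)
assert np.sin(x) == math.sin(1.0)   # dispatched through __array_ufunc__
```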

@seberg
Member
seberg commented Jun 3, 2018

Yeah, it sounds like a good idea; a few simple fast paths like that for scalars could go a long way. If we can get such a thing right in a not too complex way, maybe it can even help reduce the code complexity/duplication that is currently really annoying for scalars? (I believe in scalars, but that extra code I do not believe in.)
There are also some really slow things, probably, such as the buffering/casting necessary for calls like np.add(1., 3), which is likely hilariously slow...
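The mixed-type case is easy to measure directly (numbers are machine-dependent, and the size of the gap depends on the numpy version):

```python
import timeit
import numpy as np

n = 100_000
# homogeneous vs. mixed-type scalar addition: the latter must also
# resolve and cast the Python int to float64
t_same = timeit.timeit(lambda: np.add(1.0, 3.0), number=n) / n
t_mixed = timeit.timeit(lambda: np.add(1.0, 3), number=n) / n
print(f"np.add(1., 3.) : {t_same * 1e9:6.0f} ns")
print(f"np.add(1., 3)  : {t_mixed * 1e9:6.0f} ns")
```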

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@eric-wieser - good point about the benchmarks. From a quick look, it seems there are benchmarks for array scalars but not for true scalars (neither numpy nor Python ones).

@mhvk
Contributor Author
mhvk commented Jun 3, 2018

@seberg - there is room for improvement... I now added a few timings on top.

@juliantaylor
Contributor

Here is one of the type lookup optimizations:
#4904

@mhvk
Contributor Author
mhvk commented Jan 11, 2024

It would still be good to see if one cannot optimize ufuncs called on scalars a bit. Timings on numpy-dev (nearly 2.0):

a = 1.
a64 = np.float64(1.)
as64 = np.array(1., dtype=np.float64)
%timeit math.sin(a)  # and a64, as64
# 53, 58, 70  ns for a, a64, as64
%timeit np.sin(a)
# 1010, 920, 560 ns for a, a64, as64
