PERF: Reduce overhead of np.linalg.norm for small arrays #21394

Merged
3 commits merged into numpy:main from eendebakpt:optimize_linalg_norm on May 3, 2022

Conversation

eendebakpt
Contributor

The numpy.linalg.norm function has some overhead for small and medium-sized arrays. This PR reduces that overhead by making two small adjustments to the implementation; details are discussed below.

Benchmark:

```python
import numpy as np

x = np.random.rand(100)
%timeit np.linalg.norm(x)
%timeit np.sqrt(x.dot(x))  # for comparison

x = x + 1j * np.random.rand(100)
%timeit np.linalg.norm(x)
```

Results for main

3.63 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.91 µs ± 7.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
5.33 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This PR

2.27 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.92 µs ± 4.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.35 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  • The call dot(x,x) is replaced with x.dot(x) (see the sketch after this list). Since x has been passed through asarray and ravel, we know that x is a 1-D NumPy array. An alternative would be np.inner(x,x); the call with inner is faster than dot(x,x) but slower than x.dot(x).

  • The call to np.sqrt is replaced with a call specialized for float numbers (the common case). Ideally, it would be replaced with a call to a NumPy version of sqrt that handles only the scalar case. I could not find such a method exposed in the NumPy codebase. Would it be possible to create such a method?
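
For context, a minimal sketch of the fast path being changed, simplified from the shape of numpy/linalg/linalg.py rather than copied from the diff:

```python
import numpy as np

def norm_fast_path(x):
    # Sketch of the ord=None branch of np.linalg.norm after the first change.
    x = np.asarray(x).ravel(order='K')  # x is now a plain 1-D ndarray
    if np.iscomplexobj(x):
        # Complex inputs: two real dot products, since |x|^2 = re.re + im.im.
        sqnorm = x.real.dot(x.real) + x.imag.dot(x.imag)
    else:
        sqnorm = x.dot(x)  # previously dot(x, x); the bound method is cheaper
    return np.sqrt(sqnorm)
```

Because x is guaranteed to be a base-class ndarray at this point, the bound method x.dot(x) skips the argument handling that the module-level dot performs.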

Reduce the overhead of np.linalg.norm by replacing dot(x,x) with x.dot(x) and replacing the sqrt calculation with a version specialized for float numbers
@eendebakpt eendebakpt changed the title [RFC] Reduce overhead of np.linalg.norm for small arrays PERF: Reduce overhead of np.linalg.norm for small arrays Apr 27, 2022
@seberg
Member
seberg commented Apr 29, 2022

Thanks! The first change I think we might as well just do, since the input is an exact array (and not even a subclass).
The second I am not a huge fan of, but if it is noticeably faster for small arrays it seems OK. A comment may be nice to note that it is just a micro-optimization.

@rossbar
Contributor
rossbar commented Apr 29, 2022

The second I am not a huge fan of, but if it is noticeably faster for small arrays it seems OK. A comment may be nice to note that it is just a micro-optimization.

Agreed. Are we sure the second change actually has an appreciable impact on the benchmarks? It would be good to know how much each change contributes to the performance improvement.

Also re: benchmarking, we could likely get more comprehensive results with the NumPy benchmarking suite, e.g. python runtests.py --bench-compare main bench_linalg
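
For reference, benchmarks in the suite are ASV classes; a hypothetical minimal one for this case might look as follows (the class and method names are illustrative, not the actual contents of bench_linalg):

```python
import numpy as np

class NormSmall:
    """Hypothetical ASV benchmark for np.linalg.norm on small arrays."""

    def setup(self):
        self.x = np.random.rand(100)
        self.z = self.x + 1j * np.random.rand(100)

    def time_norm_real(self):
        np.linalg.norm(self.x)

    def time_norm_complex(self):
        np.linalg.norm(self.z)
```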

@eendebakpt
Contributor Author

@seberg @rossbar I am not a fan of the second change either, but I was hoping we could somehow use an (internal) NumPy method to perform the sqrt calculation for the various floating-point types (e.g. float16, float32, etc.).

I tried running the benchmarks, but they do not work on my systems. Instead, let me give you some numbers so you can decide whether or not to include the second optimization. Note: these results are on Python 3.10 on Windows (the earlier ones were on Python 3.8 on Linux).

  • np.sqrt is a ufunc implemented in C (as far as I can see). The overhead compared to plain math.sqrt is almost a microsecond.

Benchmark:

```python
import pyperf

setup = "import math; import numpy as np; from numpy import float64; x = np.float64(1.1)"
runner = pyperf.Runner()
runner.timeit("np.sqrt", stmt="y = np.sqrt(x)", setup=setup)
runner.timeit("np.sqrt with cast", stmt="y = np.sqrt(float(x))", setup=setup)
runner.timeit("specialized", stmt="y = np.float64(math.sqrt(x)) if isinstance(x, float) else np.sqrt(x)", setup=setup)
runner.timeit("math.sqrt", stmt="y = math.sqrt(x)", setup=setup)
```

Result:

np.sqrt: Mean +- std dev: 1.00 us +- 0.11 us
np.sqrt with cast: Mean +- std dev: 760 ns +- 47 ns
specialized: Mean +- std dev: 284 ns +- 18 ns
math.sqrt: Mean +- std dev: 113 ns +- 6 ns

(This is on Python 3.10 on Windows; on Python 3.8 on Linux I see similar results.)

  • I compared main to this PR with and without the second optimization (the if statement specializing the sqrt), using a random array of size 200. A sketch of that branch follows the results. The results:
main: Mean +- std dev: 4.04 us +- 0.36 us
PR no if: Mean +- std dev: 3.73 us +- 0.37 us
PR with if: Mean +- std dev: 2.42 us +- 0.20 us
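
For reference, the "if" variant being measured corresponds to a branch of roughly this shape, reconstructed from the timeit statement above rather than taken from the PR diff:

```python
import math

import numpy as np

def sqrt_specialized(sqnorm):
    # np.float64 subclasses Python float, so results of x.dot(x) on float64
    # arrays take the fast branch; math.sqrt skips the ufunc dispatch overhead.
    if isinstance(sqnorm, float):
        return np.float64(math.sqrt(sqnorm))
    # Other scalar types (e.g. float32 from float32 inputs) fall back to np.sqrt.
    return np.sqrt(sqnorm)
```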

@mattip
Member
mattip commented May 1, 2022

I tried running the benchmarks, but they do not work on my systems

Please open an issue with the tracebacks. The benchmark suite is the standard way of reporting performance enhancements reproducibly, and it allows us to track regressions. Maybe this PR improves one measure of performance but hurts others (unlikely, but stranger things have happened).

@eendebakpt
Contributor Author

I removed the specialization for float. With the ideas in #21423 (and some other ideas), the overhead of np.sqrt can be reduced enough that the performance advantage no longer seems worth the special case.

Running the benchmarks still does not work; the failure is due to #21288 (which does not yet give me a solution).

@seberg
Member
seberg commented May 3, 2022

Yeah, I just ran into the benchmark problem as well and opened gh-21428 for it. I am not sure it would be super useful here anyway, because we are a bit short on tests for small arrays/scalars.

So it would be nice to add tests for the norm of very small arrays, but considering how simple the change is now and that the benchmarks don't run easily, I am happy to just glance over it now.

Can you clean up the import math and float64, though?
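
A sketch of the kind of small-array tests suggested above (illustrative; not part of this PR):

```python
import numpy as np
from numpy.testing import assert_allclose

def test_norm_small_arrays():
    # Edge cases around the optimized fast path.
    assert np.linalg.norm(np.array([])) == 0.0
    assert_allclose(np.linalg.norm(np.array([3.0])), 3.0)
    assert_allclose(np.linalg.norm(np.array([3.0, 4.0])), 5.0)
    # Complex inputs go through the real/imag split.
    assert_allclose(np.linalg.norm(np.array([3.0 + 4.0j])), 5.0)
```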

@eendebakpt
Contributor Author

Benchmark:

       before           after         ratio
     [fd646bd6]       [f28a2946]
     <main>           <optimize_linalg_norm>
-     16.1±0.03μs      15.3±0.08μs     0.95  bench_linalg.Linalg.time_op('norm', 'float32')

@seberg
Member
seberg commented May 3, 2022

OK, the benchmark does at least notice a small difference, and I don't think this adds any maintenance burden in its current state. So let's just get it in.

Thanks @eendebakpt.

@seberg seberg merged commit 5c740d0 into numpy:main May 3, 2022
@eendebakpt eendebakpt deleted the optimize_linalg_norm branch May 3, 2022 13:07