PERF: Reduce overhead of np.linalg.norm for small arrays #21394

Merged
3 commits merged into numpy:main from eendebakpt:optimize_linalg_norm on May 3, 2022

Conversation

eendebakpt
Contributor

The numpy.linalg.norm function has some overhead for small and medium-sized arrays. This PR reduces that overhead by making two small adjustments to the implementation; details are discussed below.

Benchmark:

```python
import numpy as np

x = np.random.rand(100)
%timeit np.linalg.norm(x)
%timeit np.sqrt(x.dot(x))  # for comparison

x = x + 1j * np.random.rand(100)
%timeit np.linalg.norm(x)
```

Results for main

3.63 µs ± 16.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.91 µs ± 7.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
5.33 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This PR

2.27 µs ± 10.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.92 µs ± 4.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
3.35 µs ± 10.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  • The call dot(x,x) is replaced with x.dot(x) (see the sketch after this list). Since x has been passed through asarray and ravel, we know that x is a 1-D NumPy array. An alternative would be np.inner(x,x); the call with inner is faster than dot(x,x) but slower than x.dot(x).

  • The call to np.sqrt is replaced with a call specialized for float numbers (the common case). Ideally, it would be replaced with a call to a NumPy version of sqrt that handles only the scalar case. I could not find such a method exposed in the NumPy codebase. Would it be possible to create such a method?
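
For context, a minimal sketch of the fast path being changed, simplified from the shape of numpy/linalg/linalg.py rather than copied from the diff:

```python
import numpy as np

def norm_fast_path(x):
    # Sketch of the ord=None branch of np.linalg.norm after the first change.
    x = np.asarray(x).ravel(order='K')  # x is now a plain 1-D ndarray
    if np.iscomplexobj(x):
        # Complex inputs: two real dot products, since |x|^2 = re.re + im.im.
        sqnorm = x.real.dot(x.real) + x.imag.dot(x.imag)
    else:
        sqnorm = x.dot(x)  # previously dot(x, x); the bound method is cheaper
    return np.sqrt(sqnorm)
```

Because x is guaranteed to be a base-class ndarray at this point, the bound method x.dot(x) skips the argument handling that the module-level dot performs.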

Reduce the overhead of np.linalg.norm by replacing dot(x,x) with x.dot(x) and replacing the sqrt calculation with a version specialized for float numbers
@eendebakpt eendebakpt changed the title [RFC] Reduce overhead of np.linalg.norm for small arrays PERF: Reduce overhead of np.linalg.norm for small arrays Apr 27, 2022
@seberg
Member
seberg commented Apr 29, 2022

Thanks! The first change I think we might as well just do, since the input is an exact array (and not even a subclass).
The second I am not a huge fan of, but if it is noticeably faster for small arrays it seems OK. A comment may be nice to note that it is just a micro-optimization.

@rossbar
Contributor
rossbar commented Apr 29, 2022

The second I am not a huge fan of, but if it is noticeably faster for small arrays it seems OK. A comment may be nice to note that it is just a micro-optimization.

Agreed. Are we sure the second change actually has an appreciable impact on the benchmarks? It would be good to know how much each change contributes to the performance improvement.

Also re: benchmarking, we could likely get more comprehensive results with the NumPy benchmarking suite, e.g. python runtests.py --bench-compare main bench_linalg
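
For reference, benchmarks in the suite are ASV classes; a hypothetical minimal one for this case might look as follows (the class and method names are illustrative, not the actual contents of bench_linalg):

```python
import numpy as np

class NormSmall:
    """Hypothetical ASV benchmark for np.linalg.norm on small arrays."""

    def setup(self):
        self.x = np.random.rand(100)
        self.z = self.x + 1j * np.random.rand(100)

    def time_norm_real(self):
        np.linalg.norm(self.x)

    def time_norm_complex(self):
        np.linalg.norm(self.z)
```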

@eendebakpt
Contributor Author

@seberg @rossbar I am not a fan of the second change either, but I was hoping we could somehow use an (internal) NumPy method to perform the sqrt calculation for the various floating-point types (e.g. float16, float32, etc.).

I tried running the benchmarks, but they do not work on my systems. Instead, let me give you some numbers so you can decide whether or not to include the second optimization. Note: these results are on Python 3.10 on Windows (the earlier ones were on Python 3.8 on Linux).

  • np.sqrt is a ufunc implemented in C (as far as I can see). The overhead compared to plain math.sqrt is almost a microsecond.

Benchmark:

```python
import pyperf

setup = "import math; import numpy as np; from numpy import float64; x = np.float64(1.1)"
runner = pyperf.Runner()
runner.timeit("np.sqrt", stmt="y = np.sqrt(x)", setup=setup)
runner.timeit("np.sqrt with cast", stmt="y = np.sqrt(float(x))", setup=setup)
runner.timeit("specialized", stmt="y = np.float64(math.sqrt(x)) if isinstance(x, float) else np.sqrt(x)", setup=setup)
runner.timeit("math.sqrt", stmt="y = math.sqrt(x)", setup=setup)
```

Result:

np.sqrt: Mean +- std dev: 1.00 us +- 0.11 us
np.sqrt with cast: Mean +- std dev: 760 ns +- 47 ns
specialized: Mean +- std dev: 284 ns +- 18 ns
math.sqrt: Mean +- std dev: 113 ns +- 6 ns

(This is on Python 3.10 on Windows; on Python 3.8 on Linux I see similar results.)

  • I compared main to this PR with and without the second optimization (the if statement specializing the sqrt), using a random array of size 200. A sketch of that branch follows the results. The results:
main: Mean +- std dev: 4.04 us +- 0.36 us
PR no if: Mean +- std dev: 3.73 us +- 0.37 us
PR with if: Mean +- std dev: 2.42 us +- 0.20 us
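
For reference, the "if" variant being measured corresponds to a branch of roughly this shape, reconstructed from the timeit statement above rather than taken from the PR diff:

```python
import math

import numpy as np

def sqrt_specialized(sqnorm):
    # np.float64 subclasses Python float, so results of x.dot(x) on float64
    # arrays take the fast branch; math.sqrt skips the ufunc dispatch overhead.
    if isinstance(sqnorm, float):
        return np.float64(math.sqrt(sqnorm))
    # Other scalar types (e.g. float32 from float32 inputs) fall back to np.sqrt.
    return np.sqrt(sqnorm)
```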

@mattip
Member
mattip commented May 1, 2022

I tried running the benchmarks, but they do not work on my systems

Please open an issue with the tracebacks. The benchmark suite is the standard way of reporting performance enhancements reproducibly, and it allows us to track regressions. Maybe this PR improves one measure of performance but hurts others (unlikely, but stranger things have happened).

@eendebakpt
Contributor Author

I removed the specialization for float. With the ideas in #21423 (and some other ideas), the overhead of np.sqrt can be reduced enough that the performance advantage no longer seems worth the special case.

Running the benchmarks still does not work; the failure is due to #21288 (which does not yet give me a solution).

@seberg
Member
seberg commented May 3, 2022

Yeah, I just ran into the benchmark problem as well and opened gh-21428 for it. I am not sure it would be super useful here anyway, because we are a bit short on tests for small arrays/scalars.

So it would be nice to add tests for the norm of very small arrays, but considering how simple the change is now and that the benchmarks don't run easily, I am happy to just glance over it now.

Can you clean up the import math and float64, though?
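
A sketch of the kind of small-array tests suggested above (illustrative; not part of this PR):

```python
import numpy as np
from numpy.testing import assert_allclose

def test_norm_small_arrays():
    # Edge cases around the optimized fast path.
    assert np.linalg.norm(np.array([])) == 0.0
    assert_allclose(np.linalg.norm(np.array([3.0])), 3.0)
    assert_allclose(np.linalg.norm(np.array([3.0, 4.0])), 5.0)
    # Complex inputs go through the real/imag split.
    assert_allclose(np.linalg.norm(np.array([3.0 + 4.0j])), 5.0)
```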

@eendebakpt
Contributor Author

Benchmark:

       before           after         ratio
     [fd646bd6]       [f28a2946]
     <main>           <optimize_linalg_norm>
-     16.1±0.03μs      15.3±0.08μs     0.95  bench_linalg.Linalg.time_op('norm', 'float32')

@seberg
Member
seberg commented May 3, 2022

OK, the benchmark does at least notice a small difference, and I don't think this adds any maintenance burden in its current state. So let's just get it in.

Thanks @eendebakpt.

@seberg seberg merged commit 5c740d0 into numpy:main May 3, 2022
@eendebakpt eendebakpt deleted the optimize_linalg_norm branch May 3, 2022 13:07