Optimize long_pow for the common case #90178

rhettinger · 2021-12-09T06:32:23Z

BPO	46020
Nosy	@tim-one, @rhettinger, @mdickinson, @sweeneyde, @Fidget-Spinner
PRs	bpo-46020: Optimize long_pow for the common case #30555

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2021-12-09.06:32:23.304>
labels = ['interpreter-core', 'performance']
title = 'Optimize long_pow for the common case'
updated_at = <Date 2022-01-12.18:55:05.887>
user = 'https://github.com/rhettinger'

bugs.python.org fields:

activity = <Date 2022-01-12.18:55:05.887>
actor = 'tim.peters'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2021-12-09.06:32:23.304>
creator = 'rhettinger'
dependencies = []
files = []
hgrepos = []
issue_num = 46020
keywords = ['patch']
message_count = 8.0
messages = ['408076', '408079', '408094', '408099', '408116', '409609', '410381', '410422']
nosy_count = 5.0
nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'Dennis Sweeney', 'kj']
pr_nums = ['30555']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue46020'
versions = []

rhettinger · 2021-12-09T06:32:23Z

The expression 'x * x' is faster than 'x ** 2'.

In Python3.10, the speed difference was enormous. Due to ceval optimizations, the difference in Python3.11 is tighter; however, there is still room for improvement.

The code for long_pow() doesn't currently have a fast path for squaring which is by far the most important case.

$ python3.10 -m timeit -r 11 -s 'x = 10' 'x ** 2'
1000000 loops, best of 11: 201 nsec per loop
$ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x'
10000000 loops, best of 11: 21.9 nsec per loop

$ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2'
10000000 loops, best of 11: 32 nsec per loop
$ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x'
20000000 loops, best of 11: 17.6 nsec per loop

sweeneyde · 2021-12-09T08:01:12Z

I believe https://bugs.python.org/issue44376 added a special case for 2nd and 3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval optimizations.

rhettinger · 2021-12-09T09:37:01Z

Hmm, I had just looked at that code and it wasn't at all obvious that an optimization had been added. I expected something like:

if (exp==2) return PyNumber_Multiply(x, x);

I wonder where the extra clock cycles are going. Looking at the ceval.c dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and PyNumber_Power(), it looks like the power dispatch code should be about the same as or slightly cheaper than the multiply dispatch.

I'm surprised there is still almost a two to one speed difference. ISTM there is still too much overhead for squaring.

rhettinger · 2021-12-09T09:50:28Z

The situation for floats is also disappointing:

$ python3.11 -m timeit -s 'x=1.1' 'x ** 2'
5000000 loops, best of 5: 60.8 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x ** 2.0'
5000000 loops, best of 5: 51.5 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x * x'
20000000 loops, best of 5: 17.7 nsec per loop

Complex numbers are more balanced. Surprisingly, some of the complex cases are faster than their float counterparts:

$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2'
5000000 loops, best of 5: 42.4 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0'
5000000 loops, best of 5: 43.3 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j'
2000000 loops, best of 5: 107 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x'
10000000 loops, best of 5: 30.6 nsec per loop

Decimal still shows a large difference:

$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** 2'
1000000 loops, best of 5: 206 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** Decimal(2)'
1000000 loops, best of 5: 281 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x * x'
5000000 loops, best of 5: 60.9 nsec per loop

Fidget-Spinner · 2021-12-09T13:12:13Z

I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 difference might be due to the PEP-659 machinery optimizing for int * int, and float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in ceval).

Last I recall, these were slightly faster than their unspecialized counterparts as they have less overhead in C function calls. Meanwhile, none of the power operations are optimized via PEP-659 machinery at the moment.

tim-one · 2022-01-03T19:25:10Z

I was suprised that

https://bugs.python.org/issue44376

managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like

PyLongObject *z = NULL;  /* accumulated result */

isn't free - code has to be generated to force zeroes into those variables. The initialization of table[] alone requires code to fill 256 memory bytes with zeroes (down to 128 on current main branch). Nothing is free.

We can't sanely move the table initialization expense into the "giant k-ary window exponentiation" block either, because every bigint operation can fail ("out of memory"), and the macros for doing the common ones (MULT and REDUCE) can do "goto Error;", and that common exit code has no way to know what is or isn't initialized. We can't let it see uninitialized stack trash.

The exit code in turn has a string of things like

    Py_DECREF(a);
    Py_DECREF(b);
    Py_XDECREF(c);

and those cost cycles too, including tests and branches.

So the real "outrage" to me is why xx took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled xx or x**2, we're overwhelming timing overheads. pow() has many because it's a kind of Swiss army knife doing all sorts of things; what's x*x's excuse? ;-)

tim-one · 2022-01-12T02:47:49Z

GH_30555 helps a bit by leaving the giant-exponent table of small odd powers as uninitialized stack trash unless it's actually used.

tim-one · 2022-01-12T18:55:06Z

New changeset fc05e6b by Tim Peters in branch 'main':
bpo-46020: Optimize long_pow for the common case (GH-30555)
fc05e6b

rhettinger added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 9, 2021

ezio-melotti transferred this issue from another repository Apr 10, 2022

rhettinger closed this as completed Apr 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Optimize long_pow for the common case #90178

Optimize long_pow for the common case #90178

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Optimize long_pow for the common case #90178

Optimize long_pow for the common case #90178

Comments

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!