-
-
Notifications
You must be signed in to change notification settings - Fork 32.1k
Optimize long_pow for the common case #90178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The expression 'x * x' is faster than 'x ** 2'. In Python3.10, the speed difference was enormous. Due to ceval optimizations, the difference in Python3.11 is tighter; however, there is still room for improvement. The code for long_pow() doesn't currently have a fast path for squaring which is by far the most important case. $ python3.10 -m timeit -r 11 -s 'x = 10' 'x ** 2'
1000000 loops, best of 11: 201 nsec per loop
$ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x'
10000000 loops, best of 11: 21.9 nsec per loop
$ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2'
10000000 loops, best of 11: 32 nsec per loop
$ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x'
20000000 loops, best of 11: 17.6 nsec per loop |
I believe https://bugs.python.org/issue44376 added a special case for 2nd and 3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval optimizations. |
Hmm, I had just looked at that code and it wasn't at all obvious that an optimization had been added. I expected something like: if (exp==2) return PyNumber_Multiply(x, x); I wonder where the extra clock cycles are going. Looking at the ceval.c dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and PyNumber_Power(), it looks like the power dispatch code should be about the same as or slightly cheaper than the multiply dispatch. I'm surprised there is still almost a two to one speed difference. ISTM there is still too much overhead for squaring. |
The situation for floats is also disappointing: $ python3.11 -m timeit -s 'x=1.1' 'x ** 2'
5000000 loops, best of 5: 60.8 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x ** 2.0'
5000000 loops, best of 5: 51.5 nsec per loop
$ python3.11 -m timeit -s 'x=1.1' 'x * x'
20000000 loops, best of 5: 17.7 nsec per loop Complex numbers are more balanced. Surprisingly, some of the complex cases are faster than their float counterparts: $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2'
5000000 loops, best of 5: 42.4 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0'
5000000 loops, best of 5: 43.3 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j'
2000000 loops, best of 5: 107 nsec per loop
$ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x'
10000000 loops, best of 5: 30.6 nsec per loop Decimal still shows a large difference: $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** 2'
1000000 loops, best of 5: 206 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** Decimal(2)'
1000000 loops, best of 5: 281 nsec per loop
$ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x * x'
5000000 loops, best of 5: 60.9 nsec per loop |
I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 difference might be due to the PEP-659 machinery optimizing for int * int, and float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in ceval). Last I recall, these were slightly faster than their unspecialized counterparts as they have less overhead in C function calls. Meanwhile, none of the power operations are optimized via PEP-659 machinery at the moment. |
I was suprised that https://bugs.python.org/issue44376 managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like
isn't free - code has to be generated to force zeroes into those variables. The initialization of We can't sanely move the The exit code in turn has a string of things like Py_DECREF(a);
Py_DECREF(b);
Py_XDECREF(c); and those cost cycles too, including tests and branches. So the real "outrage" to me is why xx took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled xx or x**2, we're overwhelming timing overheads. |
GH_30555 helps a bit by leaving the giant-exponent table of small odd powers as uninitialized stack trash unless it's actually used. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: