8000 Optimize long_pow for the common case · Issue #90178 · python/cpython · GitHub

Closed
rhettinger opened this issue Dec 9, 2021 · 8 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@rhettinger
Contributor
BPO 46020
Nosy @tim-one, @rhettinger, @mdickinson, @sweeneyde, @Fidget-Spinner
PRs
  • bpo-46020: Optimize long_pow for the common case #30555
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2021-12-09.06:32:23.304>
    labels = ['interpreter-core', 'performance']
    title = 'Optimize long_pow for the common case'
    updated_at = <Date 2022-01-12.18:55:05.887>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2022-01-12.18:55:05.887>
    actor = 'tim.peters'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2021-12-09.06:32:23.304>
    creator = 'rhettinger'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46020
    keywords = ['patch']
    message_count = 8.0
    messages = ['408076', '408079', '408094', '408099', '408116', '409609', '410381', '410422']
    nosy_count = 5.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'Dennis Sweeney', 'kj']
    pr_nums = ['30555']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46020'
    versions = []

    @rhettinger
    Contributor Author

    The expression 'x * x' is faster than 'x ** 2'.

    In Python 3.10, the speed difference was enormous. Due to ceval optimizations, the gap in Python 3.11 is tighter; however, there is still room for improvement.

    The code for long_pow() doesn't currently have a fast path for squaring, which is by far the most important case.

    $ python3.10 -m timeit -r 11 -s 'x = 10' 'x ** 2'
    1000000 loops, best of 11: 201 nsec per loop
    $ python3.10 -m timeit -r 11 -s 'x = 10' 'x * x'
    10000000 loops, best of 11: 21.9 nsec per loop
    
    $ python3.11 -m timeit -r 11 -s 'x = 10' 'x ** 2'
    10000000 loops, best of 11: 32 nsec per loop
    $ python3.11 -m timeit -r 11 -s 'x = 10' 'x * x'
    20000000 loops, best of 11: 17.6 nsec per loop
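    The CLI runs above can be reproduced programmatically; a minimal sketch using the `timeit` module (loop counts reduced here, and absolute numbers depend on machine and interpreter version — the pow/multiply ratio is the interesting figure):

    ```python
    import timeit

    # Reproduce the benchmark from the report via the timeit module
    # rather than the CLI.  Absolute timings vary by machine and Python
    # version; the x**2 / x*x ratio is what the issue is about.
    setup = "x = 10"
    t_pow = min(timeit.repeat("x ** 2", setup=setup, number=100_000, repeat=5))
    t_mul = min(timeit.repeat("x * x", setup=setup, number=100_000, repeat=5))
    print(f"x ** 2: {t_pow:.4f}s  x * x: {t_mul:.4f}s  ratio: {t_pow / t_mul:.2f}")
    ```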

    @rhettinger rhettinger added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 9, 2021
    @sweeneyde
    Member

    I believe https://bugs.python.org/issue44376 added a special case for 2nd and 3rd powers, and that's the 3.10/3.11 difference in the speed of x**2, not ceval optimizations.

    @rhettinger
    Contributor Author

    Hmm, I had just looked at that code and it wasn't at all obvious that an optimization had been added. I expected something like:

    if (exp==2) return PyNumber_Multiply(x, x);

    I wonder where the extra clock cycles are going. Looking at the ceval.c dispatch and the bodies of PyNumber_Multiply(), _PyNumber_PowerNoMod(), and PyNumber_Power(), it looks like the power dispatch code should be about the same as or slightly cheaper than the multiply dispatch.

    I'm surprised there is still almost a two-to-one speed difference. ISTM there is still too much overhead for squaring.
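    In pure Python, the fast path described above would look something like the following (a hypothetical model of the proposed special case, not the actual C code in Objects/longobject.c):

    ```python
    def pow_with_fast_path(x, exp):
        """Model of the proposed special case: dispatch squaring straight
        to multiplication before entering the general pow machinery.
        Hypothetical sketch, not CPython's implementation."""
        if exp == 2:          # the common case: squaring
            return x * x
        return x ** exp       # fall back to the general algorithm
    ```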

    @rhettinger
    Contributor Author

    The situation for floats is also disappointing:

    $ python3.11 -m timeit -s 'x=1.1' 'x ** 2'
    5000000 loops, best of 5: 60.8 nsec per loop
    $ python3.11 -m timeit -s 'x=1.1' 'x ** 2.0'
    5000000 loops, best of 5: 51.5 nsec per loop
    $ python3.11 -m timeit -s 'x=1.1' 'x * x'
    20000000 loops, best of 5: 17.7 nsec per loop

    Complex numbers are more balanced. Surprisingly, some of the complex cases are faster than their float counterparts:

    $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2'
    5000000 loops, best of 5: 42.4 nsec per loop
    $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0'
    5000000 loops, best of 5: 43.3 nsec per loop
    $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x ** 2.0j'
    2000000 loops, best of 5: 107 nsec per loop
    $ python3.11 -m timeit -s 'x=1.1+2.2j' 'x * x'
    10000000 loops, best of 5: 30.6 nsec per loop

    Decimal still shows a large difference:

    $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** 2'
    1000000 loops, best of 5: 206 nsec per loop
    $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x ** Decimal(2)'
    1000000 loops, best of 5: 281 nsec per loop
    $ python3.11 -m timeit -s 'from decimal import Decimal' -s 'x=Decimal("1.1")' 'x * x'
    5000000 loops, best of 5: 60.9 nsec per loop
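    For all of these operand types the two spellings agree in value (for floats this relies on the platform pow() being correctly rounded), so the measured gaps are pure dispatch and implementation overhead. A quick check:

    ```python
    from decimal import Decimal

    # Compare the two spellings across the types benchmarked above;
    # they compute the same result, only at different speeds.
    for x in (10, 1.1, 1.1 + 2.2j, Decimal("1.1")):
        print(f"{x!r}: x*x = {x * x!r}  x**2 = {x ** 2!r}")
    ```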

    @Fidget-Spinner
    Member

    I'm not sure about the original 10:1 difference in 3.10, but in 3.11, the 2:1 difference might be due to the PEP-659 machinery optimizing for int * int, and float * float cases (see BINARY_OP_MULTIPLY_INT and BINARY_OP_MULTIPLY_FLOAT in ceval).

    Last I recall, these were slightly faster than their unspecialized counterparts as they have less overhead in C function calls. Meanwhile, none of the power operations are optimized via PEP-659 machinery at the moment.
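    The specialization gap can be seen at the bytecode level; a sketch using `dis` (exact opcode names depend on the Python version — on 3.11 both spellings compile to the generic `BINARY_OP`, which the adaptive interpreter then specializes for `*` but, at the time of this issue, not for `**`):

    ```python
    import dis

    # Both spellings start out as the same generic binary-operator
    # instruction; only the multiply has a PEP 659 specialization
    # (BINARY_OP_MULTIPLY_INT) to quicken into.
    for src in ("x * x", "x ** 2"):
        ops = [i.opname for i in dis.get_instructions(compile(src, "<demo>", "eval"))]
        print(src, "->", ops)
    ```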

    @tim-one
    Member
    tim-one commented Jan 3, 2022

    I was surprised that

    https://bugs.python.org/issue44376

    managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like

    PyLongObject *z = NULL;  /* accumulated result */
    

    isn't free - code has to be generated to force zeroes into those variables. The initialization of table[] alone requires code to fill 256 memory bytes with zeroes (down to 128 on current main branch). Nothing is free.

    We can't sanely move the table initialization expense into the "giant k-ary window exponentiation" block either, because every bigint operation can fail ("out of memory"), and the macros for doing the common ones (MULT and REDUCE) can do "goto Error;", and that common exit code has no way to know what is or isn't initialized. We can't let it see uninitialized stack trash.

    The exit code in turn has a string of things like

        Py_DECREF(a);
        Py_DECREF(b);
        Py_XDECREF(c);

    and those cost cycles too, including tests and branches.

    So the real "outrage" to me is why x*x took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled x*x or x**2, we're overwhelmingly timing overheads. pow() has many because it's a kind of Swiss army knife doing all sorts of things; what's x*x's excuse? ;-)

    @tim-one
    Member
    tim-one commented Jan 12, 2022

    GH-30555 helps a bit by leaving the giant-exponent table of small odd powers as uninitialized stack trash unless it's actually used.
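    For reference, a pure-Python model of fixed-window exponentiation with the table built lazily, only when the windowed path is actually taken (a loose sketch of the algorithm in Objects/longobject.c, not the C code itself; the window width and small-exponent cutoff here are illustrative):

    ```python
    def kary_pow(x, e, w=5):
        """Fixed-window k-ary exponentiation, modeled loosely on
        CPython's long_pow().  The table of small powers is only
        built when the windowed path is taken -- the point of the
        GH-30555 change."""
        assert e >= 0
        if e < 2 ** w:
            # Small exponents: plain square-and-multiply, no table.
            result = 1
            for bit in bin(e)[2:]:
                result *= result
                if bit == "1":
                    result *= x
            return result
        # Lazy table init: x^0 .. x^(2^w - 1), built only on this path.
        table = [1] * (2 ** w)
        for i in range(1, 2 ** w):
            table[i] = table[i - 1] * x
        bits = bin(e)[2:]
        bits = "0" * ((-len(bits)) % w) + bits   # pad to a multiple of w
        result = 1
        for i in range(0, len(bits), w):
            for _ in range(w):
                result *= result                 # w squarings == **(2**w)
            result *= table[int(bits[i:i + w], 2)]
        return result
    ```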

    @tim-one
    Member
    tim-one commented Jan 12, 2022

    New changeset fc05e6b by Tim Peters in branch 'main':
    bpo-46020: Optimize long_pow for the common case (GH-30555)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022