gh-119109: improve `functools.partial` vectorcall with keywords by dg-pb · Pull Request #124584 · python/cpython · GitHub

gh-119109: improve functools.partial vectorcall with keywords #124584


Merged
merged 37 commits into from
Jul 9, 2025

Conversation

dg-pb
Contributor
@dg-pb dg-pb commented Sep 26, 2024

(Potentially closes #128050)

This is, IMO, the best approach to resolve the fallback "issue". It:
a) eliminates the need for the fallback, or any need to switch implementations after initial construction
b) delivers performance benefits for vectorcall when the partial has keywords

Benchmark:

# BENCH 2 ARGS
# ------------
S="
from functools import partial
f=lambda a, b: a - b
p1 = partial(f)
p2 = partial(f, b=2)
l = lambda a: f(a, b=2)
"

$PYCMD -c "${S}; print(p1(1, 2))"   # -1     | -1     |
$PYCMD -c "${S}; print(p2(1))"      # -1     | -1     |
                                    # BEFORE | AFTER  | %CHN | LAMBDA LB
$PYCMD -m timeit -s $S 'p1(1, 2)'   #  87 ns |  85 ns |      |
$PYCMD -m timeit -s $S 'p1(1, b=2)' # 100 ns |  96 ns |      |
$PYCMD -m timeit -s $S 'p2(1)'      # 240 ns | 135 ns | -45% |  94 ns
$PYCMD -m timeit -s $S 'p2(a=1)'    # 350 ns | 160 ns | -55% | 110 ns


# BENCH 10 ARGS
# -------------
S="
from functools import partial
func = lambda a, b, c, d, e, f, g, h, i, j: (a + b + c + d + e + f + g + h + i + j)
p = partial(func, f=5, g=6, h=7, i=8, j=9)
l = lambda a, b, c, d, e, f=5, g=6, h=7, i=8, j=9: func(a, b, c, d, e, f=f, g=g, h=h, i=i, j=j)
"

C0="${S}; print(p(0, 1, 2, 3, 4))"
C1='p(0, 1, 2, 3, 4)'
C2='p(a=0, b=1, c=2, d=3, e=4)'                             # disjoint kw and pto_kw
C3='p(a=0, b=1, c=2, d=3, e=4, f=5, g=6)'                   # kw partially overlaps pto_kw
C4='p(a=0, b=1, c=2, d=3, e=4, f=5, g=6, h=7, i=8, j=9)'    # kw overrides pto_kw


$PYCMD -c $C0               #  45     | 45     |
                            #  BEFORE | AFTER  | %CHN | LAMBDA LB
$PYCMD -m timeit -s $S $C1  #  440 ns | 320 ns | -28% | 240 ns
$PYCMD -m timeit -s $S $C2  #  890 ns | 440 ns | -50% | 260 ns
$PYCMD -m timeit -s $S $C3  # 1000 ns | 600 ns | -40% | 270 ns
$PYCMD -m timeit -s $S $C4  # 1250 ns | 700 ns | -44% | 300 ns

# FUNCTION CALL - 210 ms
$PYCMD -m timeit -s $S 'f(a=0, b=1, c=2, d=3, e=4, f=5, g=6, h=7, i=8, j=9)'

No penalty for calls without pto_kwds.
Non-negligible speed improvement for calls with pto_kwds: 27-55%.
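For reference, the merge semantics the fast path has to preserve are the ordinary `functools.partial` keyword semantics: stored keywords are applied first, and call-site keywords override them. A minimal pure-Python illustration using the same `f` as the first benchmark:

```python
from functools import partial

def f(a, b):
    return a - b

p2 = partial(f, b=2)

assert p2(1) == -1        # stored b=2 is used
assert p2(1, b=5) == -4   # call-site b overrides the stored keyword
```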

@rhettinger rhettinger requested review from vstinner and removed request for rhettinger September 26, 2024 17:33
@rhettinger
Contributor

Perhaps @vstinner has the time and interest in looking at this.

@dg-pb dg-pb marked this pull request as draft September 27, 2024 06:31
@dg-pb dg-pb marked this pull request as ready for review September 27, 2024 14:54
@dg-pb
Contributor Author
dg-pb commented Sep 29, 2024

I think it is a good compromise between simplicity and performance now.

One micro-optimization that I couldn't figure out how to do simply is pre-storing kwnames tuple so it doesn't need to be created on every call. It would drop another ~50 ns.

Not sure how much sense it makes yet, but I posted faster-cpython/ideas#699 in relation to this.

Ready for review now.

@dg-pb
Contributor Author
dg-pb commented Oct 17, 2024

Was wondering if it might be worth factoring out macros for private use.

Member
@serhiy-storchaka serhiy-storchaka left a comment

LGTM. 👍

@kumaraditya303
Contributor

No refleaks:

./python.exe -m test -R 3:3 test_functools
Using random seed: 458260461
0:00:00 load avg: 1.72 Run 1 test sequentially in a single process
0:00:00 load avg: 1.72 [1/1] test_functools
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. ...
0:00:02 load avg: 1.66 [1/1] test_functools passed

== Tests result: SUCCESS ==

1 test OK.

Total duration: 2.1 sec
Total tests: run=321 skipped=1
Total test files: run=1/1
Result: SUCCESS

@dg-pb
Contributor Author
dg-pb commented Jul 7, 2025

No refleaks:

I get the same on main.

And also the same if I remove all tests except test_functools:TestImportTime.

Maybe not related to this PR?

EDIT: I assumed this output shows that there are refleaks? Or am I misinterpreting something? I have never used -R, so a more explicit comment would be much appreciated.

@kumaraditya303
Contributor
kumaraditya303 commented Jul 8, 2025

I assumed this shows that there are refleaks? Or am I failing to interpret something? Never used -R - a bit more explicit comment would be much appreciated.

The output I posted means that there are no leaks. -R is used to check for reference leaks, and 3:3 indicates the number of warmup and counted runs used to check for leaks.

See https://devguide.python.org/testing/run-write-tests/

@kumaraditya303
Contributor

I think the resizing logic is unnecessary and adds extra complexity. Would you simplify it to what Serhiy suggested, or alternatively remove it altogether, @dg-pb?

@dg-pb
Contributor Author
dg-pb commented Jul 8, 2025

simplify it to what Serhiy suggested

I don't think it was a simplification. He did propose an alternative, and I incorporated it into the logic.

The reason I kept the old condition noveralloc > 6 is that without it:

# only: (noveralloc > init_stack_size / 2)
p = partial(f, a=1)
p(a=2)
# initial stack size: 3
# used stack size: 1

Thus, would be kicking in too often.
And I took 10 instead of 2, as 2 wouldn't capture cases such as:
initial stack size = 1000
used stack size = 499

10% is, I think, a good number, given that we are talking about a pool of 0.01% of cases.

I feel that what is currently there provides necessary protection at minimal cost and is unlikely to cause any issues.
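As a hedged sketch (the function name and exact form are hypothetical restatements of the conditions discussed above, not the C implementation), the heuristic can be read as: shrink only when the overallocation exceeds both a small constant and a tenth of the initial stack size.

```python
def should_shrink(init_stack_size, used_stack_size):
    # Hypothetical restatement of the discussed condition:
    # waste must exceed 6 slots AND 10% of the allocation.
    noveralloc = init_stack_size - used_stack_size
    return noveralloc > 6 and noveralloc > init_stack_size / 10

assert not should_shrink(3, 1)       # tiny waste: keep the stack
assert not should_shrink(1000, 950)  # 50 slots wasted, but under 10%
assert should_shrink(1000, 499)      # large waste, over 10%: shrink
```

The first clause keeps the check from firing on the small stacks that dominate real usage; the second keeps a fixed threshold from ignoring proportionally large waste on big stacks.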

But if you have better rationale in mind, I am open to further amendments.

alternatively remove it altogether

Things like recursion of partial objects, or partial calls with a very high number of keyword arguments, are not common and not the best idea for production code; but given the description of the partial object, they are perfectly valid use cases that can be useful when prototyping or testing something quickly.

"Anything that can go wrong will go wrong."
0.1% doesn't seem much, out of 10M it is 10K. Out of those 10K who decide to use more than 7 keywords, say 0.1% will have a noticeable memory consumption by partial object, that is 10 people.

Thus, given enough time it is very likely that more than 1 instance of issues with this will happen.

Given it is pretty much for free, and the block can be easily removed if there is a need for it in the future, I am in favour of being "better safe than sorry" here.

@kumaraditya303
Contributor

But if you have better rationale in mind, I am open to further amendments.

I don't have a better resizing strategy in mind atm, so I'll not block this PR on it. I'll merge it now, thanks.

@kumaraditya303 kumaraditya303 merged commit f9932f5 into python:main Jul 9, 2025
40 checks passed
@dg-pb dg-pb deleted the gh-119109-partial_vectorcall_kw branch July 9, 2025 13:36
AndPuQing pushed a commit to AndPuQing/cpython that referenced this pull request Jul 11, 2025
…python#124584)

Co-authored-by: Kumar Aditya <kumaraditya@python.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Labels
performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race between partial_vectorcall_fallback and _PyVectorcall_FunctionInline under free-threading
8 participants