gh-112354: Add executor for less-taken branch by gvanrossum · Pull Request #112902 · python/cpython · GitHub

gh-112354: Add executor for less-taken branch #112902


Closed
wants to merge 41 commits
Changes from 1 commit
Commits
41 commits
36feeb1
Skip ENTER_EXECUTOR as deopt target (use vm_data)
gvanrossum Nov 24, 2023
d12533b
Add an array of 'extras' to UOpExecutor
gvanrossum Nov 23, 2023
f21f2d8
Count side exits per uop loc and print if >= 10
gvanrossum Nov 24, 2023
8463965
Add _PyOptimizer_Anywhere (not yet used)
gvanrossum Nov 25, 2023
6403752
Only jump in ENTER_EXECUTOR if overwriting JUMP_BACKWARD
gvanrossum Nov 25, 2023
329dead
Assert base opcode in _Py_Specialize_ForIter
gvanrossum Nov 25, 2023
f1998c0
Disable curses tests in --fast-ci mode (make test)
gvanrossum Nov 25, 2023
b0944e6
Improve (?) check for executor recursion
gvanrossum Nov 25, 2023
26b5f89
Only generate extra executors for branches
gvanrossum Nov 25, 2023
649581c
Fix Uop -> UOp
gvanrossum Dec 1, 2023
835bf13
WIP
gvanrossum Dec 1, 2023
256b156
Fix where next_instr points upon E_E avoidance
gvanrossum Dec 2, 2023
75c7c32
Allow executors with oparg >= 256
gvanrossum Dec 2, 2023
a94c7f1
Don't try to optimize with default optimizer
gvanrossum Dec 2, 2023
747a3f0
Use separate 'counters' and 'executors' arrays
gvanrossum Dec 11, 2023
682cf5a
Jump directly to side-exit executors
gvanrossum Dec 12, 2023
359c6fc
Remove progress check; clean up the rest a bit
gvanrossum Dec 13, 2023
ca6ed3a
Ensure array of executor pointers is 64-bit aligned
gvanrossum Dec 13, 2023
e2a26b5
Check at least two uops; further cleanup
gvanrossum Dec 13, 2023
38c7aab
Move exit_trace up, since it is smaller
gvanrossum Dec 13, 2023
0f64231
Use configured threshold and exp. backoff for counter
gvanrossum Dec 13, 2023
83297df
Add API to access sub-interpreters
gvanrossum Dec 13, 2023
d065a94
Move optimizer/executor tests to new file test_capi/test_opt.py
gvanrossum Dec 13, 2023
0f71a03
Merge branch 'test-opt' into uops-extras
gvanrossum Dec 13, 2023
075ab91
In _PyOptimizer_Unanchored, assert not ENTER_EXECUTOR, accept JUMP_BA…
gvanrossum Dec 13, 2023
c54daef
Call DISPATCH() directly from exit_trace
gvanrossum Dec 13, 2023
934a115
Correct comment on deoptimize
gvanrossum Dec 13, 2023
52c49eb
Merge branch 'main' into uops-extras
gvanrossum Dec 13, 2023
8f5e623
Remove enter_tier_one label
gvanrossum Dec 13, 2023
10b98f1
Add test
gvanrossum Dec 13, 2023
1450ca6
Fix memory leak
gvanrossum Dec 13, 2023
dcde4d3
Clear sub-executors array upon dealloc
gvanrossum Dec 13, 2023
15df63f
Add blurb
gvanrossum Dec 13, 2023
c786418
Avoid redundant stack frame saves/restores
gvanrossum Dec 13, 2023
ee0734b
Revert "Disable curses tests in --fast-ci mode (make test)"
gvanrossum Dec 14, 2023
655a841
Merge branch 'main' into uops-extras
gvanrossum Dec 14, 2023
32e36fa
Merge branch 'main' into uops-extras
gvanrossum Dec 14, 2023
f5b317a
Fix compiler warning about int/Py_ssize_t
gvanrossum Dec 14, 2023
4804a3c
Be less casual about incref/decref current executor
gvanrossum Dec 16, 2023
46c7d26
Slightly nicer way to handle refcounts
gvanrossum Dec 16, 2023
b991279
Silence compiler warning
gvanrossum Dec 17, 2023
Check at least two uops; further cleanup
gvanrossum committed Dec 13, 2023
commit e2a26b59c7060d48e73036771b41f4684be9e924
42 changes: 26 additions & 16 deletions Python/ceval.c
@@ -752,7 +752,7 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int
goto exit_unwind;
}

- // Jump here from ENTER_EXECUTOR, and code under the deoptimize label
+ // Jump here from ENTER_EXECUTOR and exit_trace.
enter_tier_one:
next_instr = frame->instr_ptr;

@@ -1083,34 +1083,34 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int
int pc = next_uop - 1 - current_executor->trace;
_PyExecutorObject **pexecutor = current_executor->executors + pc;
if (*pexecutor != NULL) {
#ifdef Py_DEBUG
PyCodeObject *code = _PyFrame_GetCode(frame);
DPRINTF(2, "Jumping to new executor for %s (%s:%d) at byte offset %d\n",
PyUnicode_AsUTF8(code->co_qualname),
PyUnicode_AsUTF8(code->co_filename),
code->co_firstlineno,
2 * (int)(frame->instr_ptr - _PyCode_CODE(_PyFrame_GetCode(frame))));
#endif
Py_DECREF(current_executor);
current_executor = (_PyUOpExecutorObject *)*pexecutor;
Py_INCREF(current_executor);
goto enter_tier_two;
}

// Increment and check side exit counter.
next_instr = frame->instr_ptr;
uint16_t *pcounter = current_executor->counters + pc;
*pcounter += 1;
if (*pcounter != 32 || // TODO: use resume_threshold
tstate->interp->optimizer == &_PyOptimizer_Default)
{
- goto enter_tier_one;
+ goto resume_frame;
}

// Decode instruction to look past EXTENDED_ARG.
- _Py_CODEUNIT *src, *dest;
- src = dest = frame->instr_ptr;
- opcode = src->op.code;
+ opcode = next_instr[0].op.code;
  if (opcode == EXTENDED_ARG) {
-     src++;
-     opcode = src->op.code;
+     opcode = next_instr[1].op.code;
}

// For selected opcodes build a new executor and enter it now.
Member:

Why "selected opcodes", why not everywhere?

Member Author (gvanrossum):

In an earlier version that somehow didn't work. Right now, the check for whether the new trace is going to immediately deopt again relies on these opcodes. I figured that once we have the side-exit machinery working, we could gradually widen the scope to other deoptimizations. Also, not all deoptimizations are worth the effort (e.g. the PEP 523 test).
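
(A minimal sketch of the kind of opcode filter described here; this is not the PR's code. The Tier 1 bytecode names are real, but the surrounding dispatch is illustrative.)

    switch (opcode) {
        // Only conditional branches are "selected"; for these the
        // repeat-uop check further down can be made exact.
        case POP_JUMP_IF_TRUE:
        case POP_JUMP_IF_FALSE:
        case POP_JUMP_IF_NONE:
        case POP_JUMP_IF_NOT_NONE:
            break;              // eligible: try to build a side-exit executor
        default:
            goto resume_frame;  // other deopts: stay in Tier 1 for now
    }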

Member:

No special cases, please, it just makes the code more complicated and slower.
If we want to treat some exits differently, let's do it properly (see faster-cpython/ideas#638), not here.

Member Author (gvanrossum):

There are several reasons. First, as I explain below, for bytecodes other than branches, I can't promise an exact check for whether the newly created sub-executor doesn't just repeat the same deoptimizing uop that triggered its creation (in which case the sub-executor would always deopt immediately if it is entered at all).

Second, for most bytecodes other than branches, deoptimization paths are relatively rare (IIRC this is apparent from the pystats data -- with the exception of some LOAD_ATTR specializations).

For branches, we expect many cases where the "common" path is not much more common than the "uncommon" path (e.g. 60/40 or 70/30). Now, it might make sense to have a different special case here, where if e.g. _GUARD_IS_TRUE_POP has a hot side exit, we know that the branch goes the other way, so we could simply create a sub-executor starting at the less-common branch target. The current approach doesn't do this (mostly because I'd have to thread the special case all the way through the various optimizer functions) but just creates a new executor starting at the original Tier 1 branch bytecode -- in the expectation that, if the counters are tuned just right, we will have executed the less-common branch in Tier 1 while taking the common branch in Tier 2, so that Tier 1's shift register has changed state and now indicates that the less-common branch is actually taken more frequently (see the sketch after this comment). The checks at L1161 and ff. below are a safeguard in case that hasn't happened yet (there are all kinds of interesting scenarios, e.g. loops that don't iterate much -- remember that the first iteration each time we enter a loop is done in Tier 1, where we stay until we hit the JUMP_BACKWARD bytecode at the end of the loop).

I propose this PR as a starting point for further iterations, not as the ultimate design for side exits. Let's discuss this Monday.
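
(A hedged model of the "shift register" mentioned above; the struct and function names here are invented, and in CPython the history actually lives in the branch instruction's inline cache. Each Tier 1 execution shifts the branch outcome into a 16-bit history, and trace projection can count the set bits to pick the likelier direction.)

    #include <stdint.h>

    typedef struct {
        uint16_t history;   // last 16 branch outcomes, newest in bit 0
    } branch_cache;         // hypothetical stand-in for the inline cache

    static void record_branch(branch_cache *c, int taken) {
        // Shift in the newest outcome; the oldest of the 16 falls off.
        c->history = (uint16_t)((c->history << 1) | (taken != 0));
    }

    static int mostly_taken(const branch_cache *c) {
        // Popcount over the last 16 outcomes.
        int bits = 0;
        for (uint16_t h = c->history; h != 0; h &= (uint16_t)(h - 1)) {
            bits++;
        }
        return bits > 8;    // taken more often than not
    }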

@@ -1122,39 +1122,49 @@ _PyEval_EvalFrameDefault(PyThreadState *tstate, _PyInterpreterFrame *frame, int
DPRINTF(2, "--> %s @ %d in %p has %d side exits\n",
_PyUOpName(uopcode), pc, current_executor, (int)(*pcounter));
DPRINTF(2, " T1: %s\n", _PyOpcode_OpName[opcode]);
- // The counter will cycle around once the 16 bits overflow
- int optimized = _PyOptimizer_Unanchored(frame, dest, pexecutor, stack_pointer);
+ int optimized = _PyOptimizer_Unanchored(frame, next_instr, pexecutor, stack_pointer);
if (optimized < 0) {
goto error_tier_two;
}

if (!optimized) {
DPRINTF(2, "--> Failed to optimize %s @ %d in %p\n",
_PyUOpName(uopcode), pc, current_executor);
}
else {
#ifdef Py_DEBUG
DPRINTF(1, "--> Optimized %s @ %d in %p\n",
_PyUOpName(uopcode), pc, current_executor);
- DPRINTF(1, "    T1: %s\n", _PyOpcode_OpName[src->op.code]);
+ PyCodeObject *code = _PyFrame_GetCode(frame);
+ DPRINTF(2, "Jumping to fresh executor for %s (%s:%d) at byte offset %d\n",
+         PyUnicode_AsUTF8(code->co_qualname),
+         PyUnicode_AsUTF8(code->co_filename),
+         code->co_firstlineno,
+         2 * (int)(frame->instr_ptr - _PyCode_CODE(_PyFrame_GetCode(frame))));
#endif
Py_DECREF(current_executor);
current_executor = (_PyUOpExecutorObject *)*pexecutor;
- // TODO: Check at least two uops: _IS_NONE, _POP_JUMP_IF_TRUE/FALSE.
- if (current_executor->trace[0].opcode != uopcode) {
+ // Reject trace if it repeats the uop that just deoptimized.
Member:

Why?

Member Author (gvanrossum):

This test may be a bit imprecise(*), but it tries to discard the case where, even though the counter in the executor indicated that this side exit is "hot", the Tier 1 bytecode hasn't been re-specialized yet. In that case the trace projection will just repeat the uop that just took a deopt side exit, causing it to immediately deopt again. This seems a waste of time and executors -- eventually the sub-executor's deopt counter will also indicate it is hot, and then we'll try again, but it seems better (if we can catch it) to avoid creating the sub-executor in the first place, relying on exponential backoff for the side-exit counter instead (implemented below at L1180 and ff.).
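
(The exponential backoff relied on here, sketched with invented names; the actual implementation is in lines below this hunk. The idea: each time a side exit triggers an optimization attempt, double the number of exits required before the next attempt.)

    #include <stdint.h>

    typedef struct {
        uint16_t counter;    // incremented on every side exit
        uint16_t threshold;  // attempt optimization once this is reached
    } side_exit_state;       // hypothetical names

    static int should_try_optimize(side_exit_state *s) {
        if (++s->counter < s->threshold) {
            return 0;                   // not hot enough yet
        }
        s->counter = 0;
        if (s->threshold < 0x8000) {
            s->threshold <<= 1;         // back off: next attempt needs 2x exits
        }
        return 1;
    }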

For various reasons, the side-exit counters and the Tier 1 deopt counters don't run in sync, so it's possible that the side-exit counter triggers before the Tier 1 counter has re-specialized. This check gives that another chance.

The test that I would like to use here would be to check whether the Tier 1 opcode is still unchanged (i.e., not re-specialized), but the executor doesn't record that information (and recording it would take a lot of space: at least an extra byte for each uop that can deoptimize).


(*) The test I wrote is exact for the conditional branches I special-cased above (that's why there's a further special case here for _IS_NONE). For other opcodes it may miss a few cases, e.g. when a single T1 bytecode translates to multiple guards and the failing guard is not the first uop in the translation (this would always happen for calls, whose translation always starts with _PEP_523, which never deopts in cases we care about). In those cases we can produce a sub-executor that immediately deoptimizes. (And we never try to re-create executors, no matter how often one deoptimizes -- that's a general flaw in the current executor architecture that we should probably file separately.)
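
(To make the footnote concrete: the None-testing branches lower to a pair of uops, so the deoptimizing branch uop sits at trace[1] rather than trace[0], which is exactly what the _IS_NONE special case below accounts for. The uop names are the real ones; the layout shown is schematic.)

    // Tier 1: POP_JUMP_IF_FALSE      lowers to
    //     trace[0] = _POP_JUMP_IF_FALSE    (the uop that can deopt)
    //
    // Tier 1: POP_JUMP_IF_NONE       lowers to
    //     trace[0] = _IS_NONE
    //     trace[1] = _POP_JUMP_IF_TRUE     (the uop that can deopt)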

+ int jump_opcode = current_executor->trace[0].opcode;
+ if (jump_opcode == _IS_NONE) {
+     jump_opcode = current_executor->trace[1].opcode;
+ }
+ if (jump_opcode != uopcode) {
Py_INCREF(current_executor);
- goto enter_tier_two; // Yes!
+ goto enter_tier_two; // All systems go!
}
- // This is guaranteed to deopt again; forget about it
- DPRINTF(2, "Alas, it's the same uop again -- discarding trace\n");
+ // The trace is guaranteed to deopt again; forget about it.
Member:

Is it? Why?

Member Author (gvanrossum):

See explanation above.

+ DPRINTF(2, "Alas, it's the same uop again (%s) -- discarding trace\n",
+         _PyUOpName(jump_opcode));
*pexecutor = NULL;
// It will be decref'ed below.
}
}
Py_DECREF(current_executor);
- goto enter_tier_one;
+ goto resume_frame;

// Jump here from _EXIT_TRACE
exit_trace: