gh-119786: Add jit.md. Move adaptive.md to a section of interpreter.md. #127175
Merged
Commits (13):

- `2fb97e3` gh-119786: add intro to adaptive.md. Update index headings (iritkatriel)
- `2414871` add content to tier2 (iritkatriel)
- `e040267` review comments (iritkatriel)
- `e5802d7` add jit (iritkatriel)
- `0c96eb2` merge adaptive.md into interpreter.md (iritkatriel)
- `43aa0df` cache layouts (iritkatriel)
- `e7c3363` tier 2 --> jit (iritkatriel)
- `4731d94` remove adaptive.md (iritkatriel)
- `c4b288b` mike's comments (iritkatriel)
- `aced06a` Apply suggestions from code review (iritkatriel)
- `bc8a50b` interpreter jit is for debugging (iritkatriel)
- `3bbf6ee` Update InternalDocs/jit.md (iritkatriel)
- `acaef71` Merge branch 'main' into adaptive (iritkatriel)
merge adaptive.md into interpreter.md
commit 0c96eb2511339355687f3bb85a6466461118d753
@@ -1,8 +1,4 @@
# The bytecode interpreter

This document describes the workings and implementation of the bytecode
interpreter, the part of Python that executes compiled Python code. Its
@@ -47,8 +43,7 @@ simply calls [`_PyEval_EvalFrameDefault()`] to execute the frame. However, as pe
`_PyEval_EvalFrameDefault()`.

## Instruction decoding

The first task of the interpreter is to decode the bytecode instructions.
Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
@@ -110,8 +105,7 @@ snippet decode a complete instruction:
For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
is rare) the actual code is different.
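For illustration, the decoding scheme (including `EXTENDED_ARG` folding) can be sketched in Python. This is a simplified model built on the `dis` module's opcode tables, not the interpreter's actual C code:

```python
import dis

def decode(code_bytes):
    """Decode raw bytecode (pairs of one-byte opcode/oparg code units)
    into (opname, oparg) tuples, folding EXTENDED_ARG prefixes."""
    result = []
    extended = 0
    for i in range(0, len(code_bytes), 2):
        opcode, oparg = code_bytes[i], code_bytes[i + 1]
        if opcode == dis.opmap["EXTENDED_ARG"]:
            # Each EXTENDED_ARG contributes 8 more high bits to the next oparg.
            extended = (extended | oparg) << 8
        else:
            result.append((dis.opname[opcode], extended | oparg))
            extended = 0
    return result
```

For example, an `EXTENDED_ARG` with arg 1 followed by a `LOAD_FAST` with arg 2 decodes to a single `LOAD_FAST` with oparg `(1 << 8) | 2 = 258`.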

## Jumps

Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
already points to the next instruction.
@@ -120,15 +114,14 @@ Thus, jump instructions can be implemented by manipulating `next_instr`:
- A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
- A jump backward sets `next_instr -= oparg`.
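As a sketch, the jump target can be computed as follows, assuming for simplicity an instruction that occupies a single code unit with no inline cache entries (`next_instr` has already advanced past it):

```python
def jump_target(instr_offset, oparg, forward=True):
    """Compute a jump target in code units. `next_instr` already points
    one unit past the jump instruction when the jump executes."""
    next_instr = instr_offset + 1
    return next_instr + oparg if forward else next_instr - oparg
```

For example, a forward jump with `oparg` 5 at code-unit offset 10 lands at offset 16.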

## Inline cache entries

Some (specialized or specializable) instructions have an associated "inline cache".
The inline cache consists of one or more two-byte entries included in the bytecode
array as additional words following the `opcode`/`oparg` pair.
The size of the inline cache for a particular instruction is fixed by its `opcode`.
Moreover, the inline cache size for all instructions in a
[family of specialized/specializable instructions](#specialization)
(for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must be
the same. Cache entries are reserved by the compiler and initialized with zeros.
Although they are represented by code units, cache entries do not conform to the
@@ -153,8 +146,7 @@ Serializing non-zero cache entries would present a problem because the serializa
More information about the use of inline caches can be found in
[PEP 659](https://peps.python.org/pep-0659/#ancillary-data).
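To make the layout concrete, here is a small Python model of walking a stream of instructions where each opcode is followed by a fixed number of cache units. The sizes here are invented for illustration and are not CPython's actual values:

```python
# Hypothetical cache sizes, in code units, per opcode name (illustrative only).
CACHE_UNITS = {"LOAD_ATTR": 9, "LOAD_GLOBAL": 4, "CALL": 3}

def instruction_offsets(opnames):
    """Yield the code-unit offset of each real instruction,
    skipping over the inline cache entries that follow it."""
    offset = 0
    for name in opnames:
        yield offset
        offset += 1 + CACHE_UNITS.get(name, 0)

offsets = list(instruction_offsets(["LOAD_GLOBAL", "LOAD_ATTR", "CALL"]))
```

With the made-up sizes above, the three instructions sit at code units 0, 5, and 15.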

## The evaluation stack

Most instructions read or write some data in the form of object references (`PyObject *`).
The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
@@ -193,16 +185,14 @@ For example, the following sequence is illegal, because it keeps pushing items o
> Do not confuse the evaluation stack with the call stack, which is used to implement calling
> and returning from functions.
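The stack-machine model is easy to observe with the `dis` module: compiling `a + b` produces two loads that push the operands, a binary operation that pops both and pushes the result, and a return that pops it (exact opcode names vary across CPython versions):

```python
import dis

def add(a, b):
    return a + b

# Collect the opcode names for the compiled function body.
ops = [ins.opname for ins in dis.get_instructions(add)]
print(ops)
# On CPython 3.12: ['RESUME', 'LOAD_FAST', 'LOAD_FAST', 'BINARY_OP', 'RETURN_VALUE']
```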

## Error handling

When the implementation of an opcode raises an exception, it jumps to the
`exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
The exception is then handled as described in the
[`exception handling documentation`](exception_handling.md#handling-exceptions).

## Python-to-Python calls

The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
the interpreter calls some C function that calls back into the interpreter.
@@ -227,8 +217,7 @@ returns from `_PyEval_EvalFrameDefault()` altogether, to a C caller.

A similar check is performed when an unhandled exception occurs.

## The call stack

Up through 3.10, the call stack was implemented as a singly-linked list of
[frame objects](frames.md). This was expensive because each call would require a
@@ -262,8 +251,7 @@ See also the [generators](generators.md) section.

<!--

## All sorts of variables

The bytecode compiler determines the scope in which each variable name is defined,
and generates instructions accordingly. For example, loading a local variable
@@ -297,8 +285,7 @@ Other topics

-->

## Introducing a new bytecode instruction

It is occasionally necessary to add a new opcode in order to implement
a new feature or change the way that existing features are compiled.
@@ -355,6 +342,169 @@ new bytecode properly. Run `make regen-importlib` for updating the
bytecode of frozen importlib files. You have to run `make` again after this
to recompile the generated C files.

## Specialization

Bytecode specialization, which was introduced in
[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
rewriting instructions based on runtime information. This is done by replacing
a generic instruction with a faster version that works for the case that this
program encounters. Each specializable instruction is responsible for rewriting
itself, using its [inline caches](#inline-cache-entries) for
bookkeeping.

When an adaptive instruction executes, it may attempt to specialize itself,
depending on the argument and the contents of its cache. This is done
by calling one of the `_Py_Specialize_XXX` functions in
[`Python/specialize.c`](../Python/specialize.c).

The specialized instructions are responsible for checking that the special-case
assumptions still apply, and de-optimizing back to the generic version if not.
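The lifecycle described above can be modeled schematically in Python. All names and the re-arm threshold below are invented for illustration; the real logic lives in C, in the generated interpreter and `Python/specialize.c`:

```python
class DeoptimizeError(Exception):
    """Raised (in this model) when a specialized form's guards fail."""

class AdaptiveInstruction:
    """Toy model of an adaptive instruction with a countdown counter."""

    def __init__(self, generic_impl, try_specialize, threshold=8):
        self.generic_impl = generic_impl
        self.try_specialize = try_specialize  # returns a fast impl, or None
        self.counter = threshold
        self.specialized_impl = None

    def execute(self, arg):
        if self.specialized_impl is not None:
            try:
                return self.specialized_impl(arg)
            except DeoptimizeError:
                self.specialized_impl = None  # fall back to the generic path
        self.counter -= 1
        if self.counter <= 0:
            # Attempt to specialize; on failure this stays None and we retry later.
            self.specialized_impl = self.try_specialize(arg)
            self.counter = 8  # the real interpreter uses exponential backoff
        return self.generic_impl(arg)
```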

## Families of instructions

A *family* of instructions consists of an adaptive instruction along with the
specialized instructions that it can be replaced by.

It has the following fundamental properties:

* It corresponds to a single instruction in the code
  generated by the bytecode compiler.
* It has a single adaptive instruction that records an execution count and,
  at regular intervals, attempts to specialize itself. If not specializing,
  it executes the base implementation.
* It has at least one specialized form of the instruction that is tailored
  for a particular value or set of values at runtime.
* All members of the family must have the same number of inline cache entries,
  to ensure correct execution.
  Individual family members do not need to use all of the entries,
  but must skip over any unused entries when executing.

The current implementation also requires the following,
although these are not fundamental and may change:

* All families use one or more inline cache entries;
  the first entry is always the counter.
* All instruction names should start with the name of the adaptive
  instruction.
* Specialized forms should have names describing their specialization.

## Example family

The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
already has an adaptive family that serves as a relatively simple example.

The `LOAD_GLOBAL` instruction performs adaptive specialization,
calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.

There are two specialized instructions in the family: `LOAD_GLOBAL_MODULE`,
which is specialized for global variables in the module, and
`LOAD_GLOBAL_BUILTIN`, which is specialized for builtin variables.

## Performance analysis

The benefit of a specialization can be assessed with the following formula:
`Tbase/Tadaptive`,

where `Tbase` is the mean time to execute the base instruction,
and `Tadaptive` is the mean time to execute the specialized and adaptive forms:

`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`

`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
the number of times that instruction is executed.
`Tmiss` is the time to process a miss, including de-optimization
and the time to execute the base instruction.

The ideal situation is where misses are rare and the specialized
forms are much faster than the base instruction.
`LOAD_GLOBAL` is near ideal: `Nmiss/sum(Ni) ≈ 0`,
in which case `Tadaptive ≈ sum(Ti*Ni)/sum(Ni)`.
Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
we would expect the specialization of `LOAD_GLOBAL` to be profitable.
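A worked example with invented numbers makes the formula concrete. Suppose a family executes 1000 times in total with a single miss, the two specialized forms take 10 ns and 12 ns, a miss costs 50 ns, and the base instruction averages 30 ns (all figures hypothetical):

```python
Ti = {"SPECIALIZED_A": 10.0, "SPECIALIZED_B": 12.0}  # ns per execution (assumed)
Ni = {"SPECIALIZED_A": 900, "SPECIALIZED_B": 99}     # execution counts (assumed)
T_miss, N_miss = 50.0, 1
T_base = 30.0

# Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss) / (sum(Ni) + Nmiss)
T_adaptive = (sum(Ti[k] * Ni[k] for k in Ti) + T_miss * N_miss) / (sum(Ni.values()) + N_miss)
print(round(T_adaptive, 2), round(T_base / T_adaptive, 2))  # → 10.24 2.93
```

So with these made-up numbers the specialization would be roughly a 2.9x win.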

## Design considerations

While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
low for all specialized instructions and `Nmiss` as low as possible.

Keeping `Nmiss` low means that there should be specializations for almost
all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
keeping `Ti` low, which means minimizing branches and dependent memory
accesses (pointer chasing). These two objectives may be in conflict,
requiring judgement and experimentation to design the family of instructions.

The size of the inline cache should be as small as possible,
without impairing performance, to reduce the number of
`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.

### Gathering data

Before choosing how to specialize an instruction, it is important to gather
some data. What are the patterns of usage of the base instruction?
Data can best be gathered by instrumenting the interpreter. Since a
specialization function and adaptive instruction are going to be required,
instrumentation can most easily be added in the specialization function.

### Choice of specializations

The performance of the specializing adaptive interpreter relies on the
quality of specialization and keeping the overhead of specialization low.

Specialized instructions must be fast. In order to be fast,
specialized instructions should be tailored for a particular
set of values that allows them to:

1. Verify that the incoming value is part of that set with low overhead.
2. Perform the operation quickly.

This requires that the set of values is chosen such that membership can be
tested quickly and that membership is sufficient to allow the operation to
be performed quickly.

For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
dictionaries that have keys with the expected version.

This can be tested quickly:

* `globals->keys->dk_version == expected_version`

and the operation can be performed quickly:

* `value = entries[cache->index].me_value;`

Because it is impossible to measure the performance of an instruction without
also measuring unrelated factors, the assessment of the quality of a
specialization will require some judgement.

As a general rule, specialized instructions should be much faster than the
base instruction.

### Implementation of specialized instructions

In general, specialized instructions should be implemented in two parts:

1. A sequence of guards, each of the form
   `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
2. The operation, which should ideally have no branches and
   a minimum number of dependent memory accesses.

In practice, the parts may overlap, as data required for guards
can be re-used in the operation.

If there are branches in the operation, then consider further specialization
to eliminate the branches.
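The two-part shape can be sketched in Python, modeled loosely on the `LOAD_GLOBAL_MODULE` example above. The `DEOPT_IF` helper, the version field, and the cache layout here are simplifications of the C originals:

```python
class Deoptimize(Exception):
    """Signals (in this model) that a guard failed and the generic path must run."""

def DEOPT_IF(condition):
    """Model of the C macro: bail out to the generic instruction on guard failure."""
    if condition:
        raise Deoptimize

def load_global_module(dk_version, expected_version, entries, cached_index):
    # Part 1: guards -- cheap checks that the specialization still applies.
    DEOPT_IF(dk_version != expected_version)
    # Part 2: the operation -- a single dependent memory access.
    return entries[cached_index]
```

A caller would catch `Deoptimize` and fall back to the generic instruction, as the adaptive machinery does in C.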

### Maintaining stats

Finally, take care that stats are gathered correctly.

After the last `DEOPT_IF` has passed, a hit should be recorded with
`STAT_INC(BASE_INSTRUCTION, hit)`.
After an optimization has been deferred in the adaptive instruction,
that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.

## Additional resources