merge adaptive.md into interpreter.md

python · iritkatriel · Dec 6, 2024 · Nov 22, 2024 · Nov 22, 2024 · Nov 23, 2024
commit 0c96eb2511339355687f3bb85a6466461118d753
diff --git a/InternalDocs/README.md b/InternalDocs/README.md
@@ -34,9 +34,7 @@ Runtime Objects
 Program Execution
 ---
 
-- [The Basic Interpreter](interpreter.md)
-
-- [The Specializing Interpreter](adaptive.md)
+- [The Bytecode Interpreter](interpreter.md)
 
 - [The Tier 2 Interpreter and JIT](tier2.md)
 

diff --git a/InternalDocs/code_objects.md b/InternalDocs/code_objects.md
@@ -18,6 +18,11 @@ Code objects are typically produced by the bytecode [compiler](compiler.md),
 although they are often written to disk by one process and read back in by another.
 The disk version of a code object is serialized using the
 [marshal](https://docs.python.org/dev/library/marshal.html) protocol.
+When a [`CodeObject`](code_objects.md) is created, the function
+`_PyCode_Quicken()` from [`Python/specialize.c`](../Python/specialize.c) is
+called to initialize the caches of all adaptive instructions. This is
+required because the on-disk format is a sequence of bytes, and
+some of the caches need to be initialized with 16-bit values.
 
 Code objects are nominally immutable.
 Some fields (including `co_code_adaptive` and fields for runtime

diff --git a/InternalDocs/compiler.md b/InternalDocs/compiler.md
@@ -595,16 +595,6 @@ Objects
 * [Exception Handling](exception_handling.md): Describes the exception table
 
 
-Specializing Adaptive Interpreter
-=================================
-
-Adding a specializing, adaptive interpreter to CPython will bring significant
-performance improvements. These documents provide more information:
-
-* [PEP 659: Specializing Adaptive Interpreter](https://peps.python.org/pep-0659/).
-* [Adding or extending a family of adaptive instructions](adaptive.md)
-
-
 References
 ==========
 

diff --git a/InternalDocs/interpreter.md b/InternalDocs/interpreter.md
@@ -1,8 +1,4 @@
-The bytecode interpreter
-========================
-
-Overview
---------
+# The bytecode interpreter
 
 This document describes the workings and implementation of the bytecode
 interpreter, the part of python that executes compiled Python code. Its
@@ -47,8 +43,7 @@ simply calls [`_PyEval_EvalFrameDefault()`] to execute the frame. However, as pe
 `_PyEval_EvalFrameDefault()`.
 
 
-Instruction decoding
---------------------
+## Instruction decoding
 
 The first task of the interpreter is to decode the bytecode instructions.
 Bytecode is stored as an array of 16-bit code units (`_Py_CODEUNIT`).
@@ -110,8 +105,7 @@ snippet decode a complete instruction:
 For various reasons we'll get to later (mostly efficiency, given that `EXTENDED_ARG`
 is rare) the actual code is different.
 
-Jumps
-=====
+## Jumps
 
 Note that when the `switch` statement is reached, `next_instr` (the "instruction offset")
 already points to the next instruction.
@@ -120,15 +114,14 @@ Thus, jump instructions can be implemented by manipulating `next_instr`:
 - A jump forward (`JUMP_FORWARD`) sets `next_instr += oparg`.
 - A jump backward sets `next_instr -= oparg`.
 
-Inline cache entries
-====================
+## Inline cache entries
 
 Some (specialized or specializable) instructions have an associated "inline cache".
 The inline cache consists of one or more two-byte entries included in the bytecode
 array as additional words following the `opcode`/`oparg` pair.
 The size of the inline cache for a particular instruction is fixed by its `opcode`.
 Moreover, the inline cache size for all instructions in a
-[family of specialized/specializable instructions](adaptive.md)
+[family of specialized/specializable instructions](#Specialization)
 (for example, `LOAD_ATTR`, `LOAD_ATTR_SLOT`, `LOAD_ATTR_MODULE`) must all be
 the same.  Cache entries are reserved by the compiler and initialized with zeros.
 Although they are represented by code units, cache entries do not conform to the
@@ -153,8 +146,7 @@ Serializing non-zero cache entries would present a problem because the serializa
 More information about the use of inline caches can be found in
 [PEP 659](https://peps.python.org/pep-0659/#ancillary-data).
 
-The evaluation stack
---------------------
+## The evaluation stack
 
 Most instructions read or write some data in the form of object references (`PyObject *`).
 The CPython bytecode interpreter is a stack machine, meaning that its instructions operate
@@ -193,16 +185,14 @@ For example, the following sequence is illegal, because it keeps pushing items o
 > Do not confuse the evaluation stack with the call stack, which is used to implement calling
 > and returning from functions.
 
-Error handling
---------------
+## Error handling
 
 When the implementation of an opcode raises an exception, it jumps to the
 `exception_unwind` label in [Python/ceval.c](../Python/ceval.c).
 The exception is then handled as described in the
 [`exception handling documentation`](exception_handling.md#handling-exceptions).
 
-Python-to-Python calls
-----------------------
+## Python-to-Python calls
 
 The `_PyEval_EvalFrameDefault()` function is recursive, because sometimes
 the interpreter calls some C function that calls back into the interpreter.
@@ -227,8 +217,7 @@ returns from `_PyEval_EvalFrameDefault()` altogether, to a C caller.
 
 A similar check is performed when an unhandled exception occurs.
 
-The call stack
---------------
+## The call stack
 
 Up through 3.10, the call stack was implemented as a singly-linked list of
 [frame objects](frames.md). This was expensive because each call would require a
@@ -262,8 +251,7 @@ See also the [generators](generators.md) section.
 
 <!--
 
-All sorts of variables
-----------------------
+## All sorts of variables
 
 The bytecode compiler determines the scope in which each variable name is defined,
 and generates instructions accordingly.  For example, loading a local variable
@@ -297,8 +285,7 @@ Other topics
 
 -->
 
-Introducing a new bytecode instruction
---------------------------------------
+## Introducing a new bytecode instruction
 
 It is occasionally necessary to add a new opcode in order to implement
 a new feature or change the way that existing features are compiled.
@@ -355,6 +342,169 @@ new bytecode properly.  Run `make regen-importlib` for updating the
 bytecode of frozen importlib files.  You have to run `make` again after this
 to recompile the generated C files.
 
+## Specialization
+
+Bytecode specialization, which was introduced in
+[PEP 659](https://peps.python.org/pep-0659/), speeds up program execution by
+rewriting instructions based on runtime information. This is done by replacing
+a generic instruction with a faster version that works for the case that this
+program encounters. Each specializable instruction is responsible for rewriting
+itself, using its [inline caches](#inline-cache-entries) for
+bookkeeping.
+
+When an adaptive instruction executes, it may attempt to specialize itself,
+depending on the argument and the contents of its cache. This is done
+by calling one of the `_Py_Specialize_XXX` functions in
+[`Python/specialize.c`](../Python/specialize.c).
+
+
+The specialized instructions are responsible for checking that the special-case
+assumptions still apply, and de-optimizing back to the generic version if not.
+
+## Families of instructions
+
+A *family* of instructions consists of an adaptive instruction along with the
+specialized instruction that it can be replaced by.
+It has the following fundamental properties:
+
+* It corresponds to a single instruction in the code
+  generated by the bytecode compiler.
+* It has a single adaptive instruction that records an execution count and,
+  at regular intervals, attempts to specialize itself. If not specializing,
+  it executes the base implementation.
+* It has at least one specialized form of the instruction that is tailored
+  for a particular value or set of values at runtime.
+* All members of the family must have the same number of inline cache entries,
+  to ensure correct execution.
+  Individual family members do not need to use all of the entries,
+  but must skip over any unused entries when executing.
+
+The current implementation also requires the following,
+although these are not fundamental and may change:
+
+* All families use one or more inline cache entries,
+  the first entry is always the counter.
+* All instruction names should start with the name of the adaptive
+  instruction.
+* Specialized forms should have names describing their specialization.
+
+## Example family
+
+The `LOAD_GLOBAL` instruction (in [Python/bytecodes.c](../Python/bytecodes.c))
+already has an adaptive family that serves as a relatively simple example.
+
+The `LOAD_GLOBAL` instruction performs adaptive specialization,
+calling `_Py_Specialize_LoadGlobal()` when the counter reaches zero.
+
+There are two specialized instructions in the family, `LOAD_GLOBAL_MODULE`
+which is specialized for global variables in the module, and
+`LOAD_GLOBAL_BUILTIN` which is specialized for builtin variables.
+
+## Performance analysis
+
+The benefit of a specialization can be assessed with the following formula:
+`Tbase/Tadaptive`.
+
+Where `Tbase` is the mean time to execute the base instruction,
+and `Tadaptive` is the mean time to execute the specialized and adaptive forms.
+
+`Tadaptive = (sum(Ti*Ni) + Tmiss*Nmiss)/(sum(Ni)+Nmiss)`
+
+`Ti` is the time to execute the `i`th instruction in the family and `Ni` is
+the number of times that instruction is executed.
+`Tmiss` is the time to process a miss, including de-optimzation
+and the time to execute the base instruction.
+
+The ideal situation is where misses are rare and the specialized
+forms are much faster than the base instruction.
+`LOAD_GLOBAL` is near ideal, `Nmiss/sum(Ni) ≈ 0`.
+In which case we have `Tadaptive ≈ sum(Ti*Ni)`.
+Since we can expect the specialized forms `LOAD_GLOBAL_MODULE` and
+`LOAD_GLOBAL_BUILTIN` to be much faster than the adaptive base instruction,
+we would expect the specialization of `LOAD_GLOBAL` to be profitable.
+
+## Design considerations
+
+While `LOAD_GLOBAL` may be ideal, instructions like `LOAD_ATTR` and
+`CALL_FUNCTION` are not. For maximum performance we want to keep `Ti`
+low for all specialized instructions and `Nmiss` as low as possible.
+
+Keeping `Nmiss` low means that there should be specializations for almost
+all values seen by the base instruction. Keeping `sum(Ti*Ni)` low means
+keeping `Ti` low which means minimizing branches and dependent memory
+accesses (pointer chasing). These two objectives may be in conflict,
+requiring judgement and experimentation to design the family of instructions.
+
+The size of the inline cache should as small as possible,
+without impairing performance, to reduce the number of
+`EXTENDED_ARG` jumps, and to reduce pressure on the CPU's data cache.
+
+### Gathering data
+
+Before choosing how to specialize an instruction, it is important to gather
+some data. What are the patterns of usage of the base instruction?
+Data can best be gathered by instrumenting the interpreter. Since a
+specialization function and adaptive instruction are going to be required,
+instrumentation can most easily be added in the specialization function.
+
+### Choice of specializations
+
+The performance of the specializing adaptive interpreter relies on the
+quality of specialization and keeping the overhead of specialization low.
+
+Specialized instructions must be fast. In order to be fast,
+specialized instructions should be tailored for a particular
+set of values that allows them to:
+
+1. Verify that incoming value is part of that set with low overhead.
+2. Perform the operation quickly.
+
+This requires that the set of values is chosen such that membership can be
+tested quickly and that membership is sufficient to allow the operation to
+performed quickly.
+
+For example, `LOAD_GLOBAL_MODULE` is specialized for `globals()`
+dictionaries that have a keys with the expected version.
+
+This can be tested quickly:
+
+* `globals->keys->dk_version == expected_version`
+
+and the operation can be performed quickly:
+
+* `value = entries[cache->index].me_value;`.
+
+Because it is impossible to measure the performance of an instruction without
+also measuring unrelated factors, the assessment of the quality of a
+specialization will require some judgement.
+
+As a general rule, specialized instructions should be much faster than the
+base instruction.
+
+### Implementation of specialized instructions
+
+In general, specialized instructions should be implemented in two parts:
+
+1. A sequence of guards, each of the form
+  `DEOPT_IF(guard-condition-is-false, BASE_NAME)`.
+2. The operation, which should ideally have no branches and
+  a minimum number of dependent memory accesses.
+
+In practice, the parts may overlap, as data required for guards
+can be re-used in the operation.
+
+If there are branches in the operation, then consider further specialization
+to eliminate the branches.
+
+### Maintaining stats
+
+Finally, take care that stats are gathered correctly.
+After the last `DEOPT_IF` has passed, a hit should be recorded with
+`STAT_INC(BASE_INSTRUCTION, hit)`.
+After an optimization has been deferred in the adaptive instruction,
+that should be recorded with `STAT_INC(BASE_INSTRUCTION, deferred)`.
+
+
 Additional resources
 --------------------
 

diff --git a/InternalDocs/tier2.md b/InternalDocs/tier2.md
@@ -3,10 +3,10 @@
 The [basic interpreter](interpreter.md), also referred to as the `tier 1`
 interpreter, consists of a main loop that executes the bytecode instructions
 generated by the [bytecode compiler](compiler.md) and their
-[specializations](adaptive.md). Runtime optimization in tier 1 can only be
-done for one instruction at a time. The `tier 2` interpreter is based on a
-mechanism to replace an entire sequence of bytecode instructions, and this
-enables optimizations that span multiple instructions.
+[specializations](interpreter.md#Specialization). Runtime optimization in tier 1
+can only be done for one instruction at a time. The `tier 2` interpreter is
+based on a mechanism to replace an entire sequence of bytecode instructions,
+and this enables optimizations that span multiple instructions.
 
 ## The Optimizer and Executors