gh-120501: Fix reference leak in JIT build #120649

Eclips4 · 2024-06-17T15:39:21Z

Issue: A lot of leaks in the test suite on the JIT build #120501

Eclips4 · 2024-06-17T15:41:56Z

@JeffersGlass can you apply this patch locally and verify that this solves the refleak problem?
FYI, there's another problem (😢) that was introduced after f6fab21, but in this PR I would first like to solve the initial problem 🙂

Eclips4 · 2024-06-17T16:17:52Z

The second problem is bisected to 72867c9

JeffersGlass · 2024-06-17T20:42:56Z

On Linux/X86_64, this seems to have fixed the refleak issue! 🎉

Building on Windows_x64, though, I get an error when starting the interpreter. Assertion failed: Py_SIZE(value), file D:\cpython\Python\executor_cases.c.h, line 405. Same error whether dropping into the REPL, or running with -c, or -m test, etc.

vstinner

I confirm that the change fixes the reference leak for test_colorsys (I picked one test to check the PR).

Python/optimizer.c

Eclips4 · 2024-06-24T19:33:09Z

So, here's the second problem:
Reproducer:

def foo():
    a = [1, 2, 3]
    exhit = iter(a)
    for _ in exhit:
        pass
    a.append("this should'be in exhit")
    print(f"got {list(exhit)}, should be []")

foo()
foo()
foo()
foo()
foo()
foo()

Output:

got [], should be []
got [], should be []
got [], should be []
got [], should be []
got [], should be []
got ["this should'be in exhit"], should be []

Obviously, the last line is incorrect.
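
The reproducer above can also be written as a self-checking sketch. The helper names and the repeat count of 100 are assumptions chosen for illustration; the thread only shows that the sixth call misbehaved on the affected build. On a correct interpreter an exhausted list iterator stays exhausted, so the check should find nothing.

```python
def drain_then_append():
    """Exhaust a list iterator, grow the list, then read the iterator again."""
    items = [1, 2, 3]
    it = iter(items)
    for _ in it:
        pass
    items.append("late")
    # An exhausted list iterator must not see items appended afterwards.
    return list(it)


def first_bad_call(repeats=100):
    """Return (call_index, leftover) for the first wrong result, else None."""
    for n in range(repeats):
        leftover = drain_then_append()
        if leftover:
            return n, leftover
    return None


print(first_bad_call())
```

Calling the function many times matters because the wrong result only appeared once the loop had become hot enough to be traced.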

Output with a PYTHON_LLTRACE=2 env:

got [], should be []
got [], should be []
got [], should be []
got [], should be []
got [], should be []
Optimizing foo (/home/eclips4/programming-languages/cpython/example.py:1) at byte offset 42
   1 ADD_TO_TRACE: _START_EXECUTOR (0, target=21, operand=0x7f4646e59832)
21: JUMP_BACKWARD(5)
   2 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=21, operand=0x7f4646e59832)
   3 ADD_TO_TRACE: _TIER2_RESUME_CHECK (0, target=21, operand=0)
18: FOR_ITER_LIST(3)
   4 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=18, operand=0x7f4646e5982c)
   5 ADD_TO_TRACE: _ITER_CHECK_LIST (3, target=18, operand=0)
   6 ADD_TO_TRACE: _GUARD_NOT_EXHAUSTED_LIST (3, target=18, operand=0)
   7 ADD_TO_TRACE: _ITER_NEXT_LIST (3, target=18, operand=0)
20: STORE_FAST(2)
   8 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=20, operand=0x7f4646e59830)
   9 ADD_TO_TRACE: _STORE_FAST (2, target=20, operand=0)
21: JUMP_BACKWARD(5)
  10 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=21, operand=0x7f4646e59832)
  11 ADD_TO_TRACE: _JUMP_TO_TOP (0, target=0, operand=0)
Created a proto-trace for foo (/home/eclips4/programming-languages/cpython/example.py:1) at byte offset 36 -- length 11
Optimized trace (length 10):
   0 OPTIMIZED: _START_EXECUTOR (0, jump_target=7, operand=0x7f4646e59e80)
   1 OPTIMIZED: _TIER2_RESUME_CHECK (0, jump_target=7, operand=0)
   2 OPTIMIZED: _ITER_CHECK_LIST (3, jump_target=8, operand=0)
   3 OPTIMIZED: _GUARD_NOT_EXHAUSTED_LIST (3, jump_target=9, operand=0)
   4 OPTIMIZED: _ITER_NEXT_LIST (3, target=18, operand=0)
   5 OPTIMIZED: _STORE_FAST_2 (2, target=20, operand=0)
   6 OPTIMIZED: _JUMP_TO_TOP (0, target=0, operand=0)
   7 OPTIMIZED: _DEOPT (0, target=21, operand=0)
   8 OPTIMIZED: _EXIT_TRACE (0, exit_index=0, operand=0)
   9 OPTIMIZED: _EXIT_TRACE (0, exit_index=1, operand=0)
got ["this should'be in exhit"], should be []

It's definitely related to this part of code:

cpython/Python/optimizer.c

Lines 1027 to 1033 in e4a97a7

    
    if (is_for_iter_test[opcode]) {
        /* Target the POP_TOP immediately after the END_FOR,
         * leaving only the iterator on the stack. */
        int extended_arg = inst->oparg > 255;
        int32_t next_inst = target + 1 + INLINE_CACHE_ENTRIES_FOR_ITER + extended_arg;
        jump_target = next_inst + inst->oparg + 1;
    }

I guess the culprit is there. If we remove the _GUARD_NOT_EXHAUSTED_LIST from is_for_iter_test, the problem will go away (although it can still be reproduced in other ways using other (range, tuple) iterators).
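
To make the arithmetic in that snippet concrete, here is a small Python sketch that mirrors it for the FOR_ITER_LIST instruction from the trace (offset 18, oparg 3). The function name is hypothetical, and the value 1 for INLINE_CACHE_ENTRIES_FOR_ITER is an assumption that matches the offsets shown in the trace, where the instruction after FOR_ITER_LIST sits at 20.

```python
# Assumption: FOR_ITER carries one inline cache entry, which matches the
# trace above (FOR_ITER_LIST at offset 18, next instruction at offset 20).
INLINE_CACHE_ENTRIES_FOR_ITER = 1


def for_iter_exit_target(target, oparg):
    """Mirror the jump_target computation from the C snippet."""
    extended_arg = int(oparg > 255)
    next_inst = target + 1 + INLINE_CACHE_ENTRIES_FOR_ITER + extended_arg
    # next_inst + oparg is the END_FOR; one more skips to the POP_TOP.
    return next_inst + oparg + 1


print(for_iter_exit_target(18, 3))  # → 24
```

So the side exit for the exhausted case resumes at the POP_TOP, one instruction past the END_FOR that FOR_ITER itself would jump to.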

markshannon · 2024-06-25T17:15:47Z

So, here's the second problem: ...

This is unrelated to this PR, but it is a real bug.
Can you make a separate issue for it (and assign it to me)? Thanks.

Eclips4 · 2024-06-25T19:25:59Z

So, here's the second problem: ...

This is unrelated to this PR, but it is a real bug. Can you make a separate issue for it (and assign it to me)? Thanks.

Done, see #121012.

Eclips4 · 2024-09-06T19:10:33Z

FYI: I came back to this and found out that there is a new leak 😢 (maybe there is more than one leak... who knows..) which I bisected to 9621a7d.

brandtbucher · 2024-09-06T19:38:51Z

Do you have a reproducer for the leak?

And just to clarify, it's a "real" leak (objects that aren't cleaned up at shutdown)? It's normal for the JIT to allocate some new objects during hot loops, but their reference counts and everything should still be correct.

Eclips4 · 2024-09-07T06:58:46Z

Do you have a reproducer for the leak?

And just to clarify, it's a "real" leak (objects that aren't cleaned up at shutdown)? It's normal for the JIT to allocate some new objects during hot loops, but their reference counts and everything should still be correct.

It looks like a real leak, since ./python -X showrefcount reports (IIRC) the state of the reference counter at interpreter shutdown.
MRE:

cases = [
    tuple(i for i in range(3)) for _ in range(39)
]


for i, j, k in cases:
    for lo in range(4):
        for hi in range(3,8):
            pass

./python -X showrefcount example.py
[1 refs, 1 blocks]
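
As a complementary check, the same MRE can be wrapped in a function and measured in-process. This is only a sketch: the helper names and the warm-up count are assumptions, sys.gettotalrefcount() exists only on debug builds, and in-process growth is a different measurement from the shutdown report printed by -X showrefcount.

```python
import sys


def run_mre():
    """The nested-loop MRE from above, wrapped in a function."""
    cases = [tuple(i for i in range(3)) for _ in range(39)]
    for i, j, k in cases:
        for lo in range(4):
            for hi in range(3, 8):
                pass


def total_refcount_growth(func, warmups=5):
    """Return the total-refcount change across one call, or None if unavailable."""
    get_total = getattr(sys, "gettotalrefcount", None)  # debug builds only
    if get_total is None:
        return None
    for _ in range(warmups):
        func()  # let caches and any traces settle before measuring
    before = get_total()
    func()
    return get_total() - before


print(total_refcount_growth(run_mre))
```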

brandtbucher · 2024-09-07T20:28:24Z

More minimal reproducer. This creates a trace, and warms up a single side exit:

for _ in range(82):
    for _ in range(0):
        pass

The issue is that we aren't clearing "child" executors from side exits when deallocating an executor.

This quick fix seems to solve the issue for me locally. Haven't dug too deep into whether it's totally correct, but it seems right.

diff --git a/Python/optimizer.c b/Python/optimizer.c
index 9198e410627..8ba2782c2c3 100644
--- a/Python/optimizer.c
+++ b/Python/optimizer.c
@@ -257,6 +257,10 @@ uop_dealloc(_PyExecutorObject *self) {
     _PyObject_GC_UNTRACK(self);
     assert(self->vm_data.code == NULL);
     unlink_executor(self);
+    for (uint32_t i = 0; i < self->exit_count; i++) {
+        self->exits[i].temperature = initial_unreachable_backoff_counter();
+        Py_CLEAR(self->exits[i].executor);
+    }
 #ifdef _Py_JIT
     _PyJIT_Free(self);
 #endif
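
The ownership problem behind this patch can be modelled with a toy example. The sketch below is not CPython code: ToyExecutor, its manual refcnt field, and the clear_exits switch are all hypothetical, and exist only to show why a dealloc that skips its exit slots strands the child executors.

```python
class ToyExecutor:
    """Toy stand-in for an executor with manually counted references."""

    live = 0  # toy executors that have not been deallocated yet

    def __init__(self, exit_count=0):
        self.refcnt = 1
        self.exits = [None] * exit_count  # each slot may own one child
        ToyExecutor.live += 1

    def decref(self, clear_exits):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.dealloc(clear_exits)

    def dealloc(self, clear_exits):
        if clear_exits:
            # Release the reference each side exit holds on its child.
            for i, child in enumerate(self.exits):
                self.exits[i] = None
                if child is not None:
                    child.decref(clear_exits)
        ToyExecutor.live -= 1


def leaked_after_teardown(clear_exits):
    """Build a parent with one warmed-up side exit, drop it, count survivors."""
    ToyExecutor.live = 0
    parent = ToyExecutor(exit_count=1)
    parent.exits[0] = ToyExecutor()  # the exit slot owns the child's only reference
    parent.decref(clear_exits)
    return ToyExecutor.live


print(leaked_after_teardown(clear_exits=False))  # → 1
print(leaked_after_teardown(clear_exits=True))   # → 0
```

With the loop in place the child is released together with its parent; without it, one toy executor survives teardown.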

brandtbucher · 2024-09-07T20:28:49Z

Thanks for digging this up @Eclips4!

Eclips4 · 2024-09-08T08:54:45Z

More minimal reproducer. This creates a trace, and warms up a single side exit:
for _ in range(82):
    for _ in range(0):
        pass
The issue is that we aren't clearing "child" executors from side exits when deallocating an executor.

This quick fix seems to solve the issue for me locally. Haven't dug too deep into whether it's totally correct, but it seems right.
diff --git a/Python/optimizer.c b/Python/optimizer.c
index 9198e410627..8ba2782c2c3 100644
--- a/Python/optimizer.c
+++ b/Python/optimizer.c
@@ -257,6 +257,10 @@ uop_dealloc(_PyExecutorObject *self) {
     _PyObject_GC_UNTRACK(self);
     assert(self->vm_data.code == NULL);
     unlink_executor(self);
+    for (uint32_t i = 0; i < self->exit_count; i++) {
+        self->exits[i].temperature = initial_unreachable_backoff_counter();
+        Py_CLEAR(self->exits[i].executor);
+    }
 #ifdef _Py_JIT
     _PyJIT_Free(self);
 #endif

Thanks, Brandt!
Do you mind if I add this code to this PR?

Eclips4 · 2024-09-08T09:48:45Z

Another problem, which seems to have been introduced in f6fab21:

./python -m test -R 3:3 test_descr
Using random seed: 315532800
0:00:00 load avg: 38.54 Run 1 test sequentially
0:00:00 load avg: 38.54 [1/1] test_descr
beginning 6 repetitions
123456
.test test_descr failed -- Traceback (most recent call last):
  File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_descr.py", line 1294, in test_slots
    self.assertEqual(orig_objects, new_objects)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 16891 != 16892

test_descr failed (1 failure)

== Tests result: FAILURE ==

1 test failed:
    test_descr

Total duration: 1.5 sec
Total tests: run=156 failures=1 skipped=1
Total test files: run=1/1 failed=1
Result: FAILURE

Bisected to 7b21403

brandtbucher · 2024-09-17T19:56:42Z

Do you mind if I add this code to this PR?

Nope, go ahead (it should probably just be a Py_XDECREF, not a Py_CLEAR, and I don't think we need to touch the temperature).

brandtbucher · 2024-09-17T20:19:21Z

Another problem, which seems to have been introduced in f6fab21:

This is expected: the JIT allocates a new executor (which is a GC-tracked object) in the loop immediately above this assertEqual call. I'm not sure that the test is correct anyways, since a GC could have happened in the loop, changing the number of GC objects in the process.

Basically, it seems like a buggy test. But in general the JIT needs to be able to perform allocations in places where maybe they didn't happen before... so our leak checks should be resilient to that.
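
The fragility described here can be sketched in a few lines. The helper below is hypothetical and is not the code of the test; it only shows the kind of before/after comparison of len(gc.get_objects()) that any new GC-tracked allocation, or a collection in between, can shift.

```python
import gc


def tracked_object_delta(allocate):
    """Count GC-tracked objects before and after calling allocate()."""
    gc.collect()
    before = len(gc.get_objects())
    keep = allocate()  # anything GC-tracked that stays alive shifts the count
    after = len(gc.get_objects())
    del keep
    return after - before


# A list is GC-tracked, so keeping one alive changes what gets counted.
print(tracked_object_delta(lambda: [[]]))
```

An exact-equality assertion on such counts only holds if nothing else in the process allocates or collects tracked objects between the two measurements.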

Co-authored-by: Brandt Bucher <brandt@python.org>

Eclips4 · 2024-10-25T18:14:23Z

Okay.. let's continue digging into this.. 😄
so, there's another failure (you can get it by running ./python -m test -R 3:3 test_capi)

======================================================================
FAIL: test_guard_type_version_executor_invalidated (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_executor_invalidated)
Verify that the executor is invalided on a type change.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1458, in test_guard_type_version_executor_invalidated
    self.assertEqual(list(iter_opnames(ex)).count("_GUARD_TYPE_VERSION"), 1)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1

======================================================================
FAIL: test_guard_type_version_removed (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_removed)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1353, in test_guard_type_version_removed
    self.assertEqual(guard_type_version_count, 1)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1

======================================================================
FAIL: test_guard_type_version_removed_inlined (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_removed_inlined)
Verify that the guard type version if we have an inlined function
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1379, in test_guard_type_version_removed_inlined
    self.assertEqual(guard_type_version_count, 1)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1

======================================================================
UNEXPECTED SUCCESS: test_guard_type_version_not_removed_escaping (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_not_removed_escaping)
Verify that the guard type version is not removed if have an escaping function
----------------------------------------------------------------------
Ran 924 tests in 11.506s

FAILED (failures=3, skipped=4, unexpected successes=1)
test test_capi failed
test_capi failed (3 failures)

== Tests result: FAILURE ==

1 test failed:
    test_capi

Total duration: 23.5 sec
Total tests: run=924 failures=3 skipped=4
Total test files: run=1/1 failed=1
Result: FAILURE

Eclips4 · 2024-10-25T18:39:47Z

UPD: There are many more failures:

17 tests failed:
    test.test_asyncio.test_taskgroups test.test_inspect.test_inspect
    test.test_multiprocessing_forkserver.test_processes
    test.test_multiprocessing_spawn.test_processes test_argparse
    test_ast test_capi test_clinic test_collections test_ctypes
    test_descr test_difflib test_enum test_itertools test_monitoring
    test_tarfile test_zoneinfo

brandtbucher · 2024-10-26T20:59:46Z

Okay.. let's continue digging into this.. 😄
so, there's another failure (you can get it by running ./python -m test -R 3:3 test_capi)

Let's keep this PR in scope. This new issue you found isn't a refleak. It's just a byproduct of the fact that when we run out of version numbers, we literally can't perform certain optimizations anymore.

Eclips4 · 2024-11-01T16:38:52Z

Okay.. let's continue digging into this.. 😄
so, there's another failure (you can get it by running ./python -m test -R 3:3 test_capi)

Let's keep this PR in scope. This new issue you found isn't a refleak. It's just a byproduct of the fact that when we run out of version numbers, we literally can't perform certain optimizations anymore.

Should I create a separate issue for that?

Eclips4 · 2025-01-20T22:22:25Z

So far, the latest observable leaks are memory block leaks:

eclips4@nixos ~/p/p/cpython (issue-120501)> ./python -m test -R 3:3 test_random
Using random seed: 315532800
0:00:00 load avg: 2.11 Run 1 test sequentially in a single process
0:00:00 load avg: 2.11 [1/1] test_random
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. 121
test_random leaked [1, 2, 1] memory blocks, sum=4
test_random failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_random

Total duration: 13.5 sec
Total tests: run=103
Total test files: run=1/1 failed=1
Result: FAILURE

Though, for a larger number of repetitions it doesn't fail:

eclips4@nixos ~/p/p/cpython (issue-120501) [2]> ./python -m test -R 6:6 test_random
Using random seed: 315532800
0:00:00 load avg: 2.35 Run 1 test sequentially in a single process
0:00:00 load avg: 2.35 [1/1] test_random
beginning 12 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123456:789012
XX.121 1.12..
test_random leaked [1, 0, 1, 2, 0, 0] memory blocks, sum=4 (this is fine)

== Tests result: SUCCESS ==

1 test OK.

Total duration: 27.5 sec
Total tests: run=103
Total test files: run=1/1
Result: SUCCESS
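
One rule that is consistent with both runs above is that a series of deltas counts as a leak only when every measured repetition grows by at least one. That rule is an assumption in this sketch, inferred from the two outputs rather than quoted from the test runner, and the function name is hypothetical.

```python
def looks_like_leak(deltas):
    """Assumed rule: flag a leak only if every repetition grew by at least one."""
    return all(delta >= 1 for delta in deltas)


print(looks_like_leak([1, 2, 1]))           # → True
print(looks_like_leak([1, 0, 1, 2, 0, 0]))  # → False
```

Under that rule the -R 3:3 run is flagged because all three deltas are positive, while the -R 6:6 run passes because some repetitions show no growth, even though both sum to 4.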

So.. these don't look like real leaks, because ./python -Xshowrefcount always reports that there are no leaks at all:

./python -Xshowrefcount -m test -R 3:3 test_random
Using random seed: 315532800
0:00:00 load avg: 1.99 Run 1 test sequentially in a single process
0:00:00 load avg: 1.99 [1/1] test_random
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. 121
test_random leaked [1, 2, 1] memory blocks, sum=4
test_random failed (reference leak)

== Tests result: FAILURE ==

1 test failed:
    test_random

Total duration: 13.6 sec
Total tests: run=103
Total test files: run=1/1 failed=1
Result: FAILURE
[0 refs, 0 blocks]

I understand that with Tier 2, there could be allocations that didn't happen before. But I have no idea where they are. Any hints?
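
One way to narrow down where such allocations come from is to measure block deltas around a much smaller piece of code than a whole test file. The sketch below is a minimal version of that idea, assuming a hypothetical helper name and repetition counts; it relies only on sys.getallocatedblocks() and gc.collect(), and is not how the test runner itself is implemented.

```python
import gc
import sys


def block_deltas(func, warmups=3, runs=3):
    """Return the change in allocated memory blocks for each measured run."""
    for _ in range(warmups):
        func()  # warm-up runs, analogous to the first number in -R 3:3
    deltas = [0] * runs  # preallocated so storing results adds little noise
    for i in range(runs):
        gc.collect()
        before = sys.getallocatedblocks()
        func()
        gc.collect()
        deltas[i] = sys.getallocatedblocks() - before
    return deltas


def hot_loop():
    for _ in range(5000):
        pass


print(block_deltas(hot_loop))
```

Running it on progressively smaller functions taken from the failing test can help isolate which loop causes the extra blocks.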

Eclips4 added 2 commits June 17, 2024 14:59

First try to fix..

871dc4c

Forgot to commit codeobject.c

08c26d3

Eclips4 requested a review from markshannon as a code owner June 17, 2024 15:39

bedevere-app bot mentioned this pull request Jun 17, 2024

A lot of leaks in the test suite on the JIT build #120501

Open

bedevere-app bot added the awaiting review label Jun 17, 2024

Eclips4 marked this pull request as draft June 17, 2024 15:39

bedevere-app bot removed the awaiting review label Jun 17, 2024

Eclips4 added the skip news label Jun 17, 2024

Fidget-Spinner requested a review from brandtbucher June 18, 2024 09:07

vstinner changed the title ~~gh-120501: Fix reference leak~~ gh-120501: Fix reference leak in JIT build Jun 18, 2024

vstinner reviewed Jun 18, 2024

Eclips4 added 3 commits June 23, 2024 10:06

Merge branch 'main' into issue-120501

e9301f3

Add a newline

08cb747

Add a newline (2)

9926e1c

Resolve merge conflict

1874cf1

Eclips4 force-pushed the issue-120501 branch from c623d20 to 492807c Compare October 25, 2024 18:01

Clear child executors from side exits

326401b

Co-authored-by: Brandt Bucher <brandt@python.org>

Eclips4 force-pushed the issue-120501 branch from 492807c to 326401b Compare October 25, 2024 18:03

Merge branch 'main' into issue-120501

f18739b

Eclips4 added 7 commits January 9, 2025 13:34

Merge branch 'main' into issue-120501

45e0c3a

Remove unnecessary decref

a7957fa

Fix incorrect conversion specifier

7d90844

Steal reference instead of increfing

f2cb39d

Revert changes because counter optimizer has been removed

09f7314

Merge branch 'main' into issue-120501

7d291ae

Revert unnecessary changes

b10a0c9

Eclips4 added 2 commits April 13, 2025 15:24

Merge branch 'main' into issue-120501

cc17195

Resolve merge conflict

7ef5009
