-
-
Notifications
You must be signed in to change notification settings - Fork 32k
gh-120501: Fix reference leak in JIT build #120649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@JeffersGlass can you apply this patch locally and verify that this solves the refleak problem? |
The second problem is bisected to 72867c9 |
On Linux/X86_64, this seems to have fixed the refleak issue! 🎉 Building on Windows_x64, though, I get an error when starting the interpreter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm that the change fix the reference leak for test_colorsys
(I picked on test to check the PR).
So, here's the second problem: def foo():
a = [1, 2, 3]
exhit = iter(a)
for _ in exhit:
pass
a.append("this should'be in exhit")
print(f"got {list(exhit)}, should be []")
foo()
foo()
foo()
foo()
foo()
foo() Output:
Obviously, the last line is incorrect. Output with a got [], should be []
got [], should be []
got [], should be []
got [], should be []
got [], should be []
Optimizing foo (/home/eclips4/programming-languages/cpython/example.py:1) at byte offset 42
1 ADD_TO_TRACE: _START_EXECUTOR (0, target=21, operand=0x7f4646e59
8000
832)
21: JUMP_BACKWARD(5)
2 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=21, operand=0x7f4646e59832)
3 ADD_TO_TRACE: _TIER2_RESUME_CHECK (0, target=21, operand=0)
18: FOR_ITER_LIST(3)
4 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=18, operand=0x7f4646e5982c)
5 ADD_TO_TRACE: _ITER_CHECK_LIST (3, target=18, operand=0)
6 ADD_TO_TRACE: _GUARD_NOT_EXHAUSTED_LIST (3, target=18, operand=0)
7 ADD_TO_TRACE: _ITER_NEXT_LIST (3, target=18, operand=0)
20: STORE_FAST(2)
8 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=20, operand=0x7f4646e59830)
9 ADD_TO_TRACE: _STORE_FAST (2, target=20, operand=0)
21: JUMP_BACKWARD(5)
10 ADD_TO_TRACE: _CHECK_VALIDITY_AND_SET_IP (0, target=21, operand=0x7f4646e59832)
11 ADD_TO_TRACE: _JUMP_TO_TOP (0, target=0, operand=0)
Created a proto-trace for foo (/home/eclips4/programming-languages/cpython/example.py:1) at byte offset 36 -- length 11
Optimized trace (length 10):
0 OPTIMIZED: _START_EXECUTOR (0, jump_target=7, operand=0x7f4646e59e80)
1 OPTIMIZED: _TIER2_RESUME_CHECK (0, jump_target=7, operand=0)
2 OPTIMIZED: _ITER_CHECK_LIST (3, jump_target=8, operand=0)
3 OPTIMIZED: _GUARD_NOT_EXHAUSTED_LIST (3, jump_target=9, operand=0)
4 OPTIMIZED: _ITER_NEXT_LIST (3, target=18, operand=0)
5 OPTIMIZED: _STORE_FAST_2 (2, target=20, operand=0)
6 OPTIMIZED: _JUMP_TO_TOP (0, target=0, operand=0)
7 OPTIMIZED: _DEOPT (0, target=21, operand=0)
8 OPTIMIZED: _EXIT_TRACE (0, exit_index=0, operand=0)
9 OPTIMIZED: _EXIT_TRACE (0, exit_index=1, operand=0)
got ["this should'be in exhit"], should be [] It's definitely related to this part of code: Lines 1027 to 1033 in e4a97a7
I guess the culprit is there. If we remove the |
This is unrelated to this PR, but it is a real bug. |
Done, see #121012. |
FYI: I came back to this and found out that there is a new leak 😢 (maybe there are more than one leak... who knows..) which I bisected to 9621a7d. |
Do you have a reproducer for the leak? And just to clarify, it's a "real" leak (objects that aren't cleaned up at shutdown)? It's normal for the JIT to allocate some new objects during hot loops, but their reference counts and everything should still be correct. |
It's look like a real leak since cases = [
tuple(i for i in range(3)) for _ in range(39)
]
for i, j, k in cases:
for lo in range(4):
for hi in range(3,8):
pass ./python -X showrefcount example.py
[1 refs, 1 blocks] |
More minimal reproducer. This creates a trace, and warms up a single side exit: for _ in range(82):
for _ in range(0):
pass The issue is that we aren't clearing "child" executors from side exits when deallocating an executor. This quick fix seems to solve the issue for me locally. Haven't dug too deep into whether it's totally correct, but it seems right. diff --git a/Python/optimizer.c b/Python/optimizer.c
index 9198e410627..8ba2782c2c3 100644
--- a/Python/optimizer.c
+++ b/Python/optimizer.c
@@ -257,6 +257,10 @@ uop_dealloc(_PyExecutorObject *self) {
_PyObject_GC_UNTRACK(self);
assert(self->vm_data.code == NULL);
unlink_executor(self);
+ for (uint32_t i = 0; i < self->exit_count; i++) {
+ self->exits[i].temperature = initial_unreachable_backoff_counter();
+ Py_CLEAR(self->exits[i].executor);
+ }
#ifdef _Py_JIT
_PyJIT_Free(self);
#endif |
Thanks for digging this up @Eclips4! |
Thanks, Brandt! |
Another one problem, which seems to be introduced in f6fab21: ./python -m test -R 3:3 test_descr
Using random seed: 315532800
0:00:00 load avg: 38.54 Run 1 test sequentially
0:00:00 load avg: 38.54 [1/1] test_descr
beginning 6 repetitions
123456
.test test_descr failed -- Traceback (most recent call last):
File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_descr.py", line 1294, in test_slots
self.assertEqual(orig_objects, new_objects)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 16891 != 16892
test_descr failed (1 failure)
== Tests result: FAILURE ==
1 test failed:
test_descr
Total duration: 1.5 sec
Total tests: run=156 failures=1 skipped=1
Total test files: run=1/1 failed=1
Result: FAILURE Bisected to 7b21403 |
Nope, go ahead (it should probably just be a |
This is expected: the JIT allocates a new executor (which is a GC-tracked object) in the loop immediately above this Basically, it seems like a buggy test. But in general the JIT needs to be able to perform allocations in places where maybe they didn't happen before... so our leak checks should be resilient to that. |
Co-authored-by: Brandt Bucher <brandt@python.org>
Okay.. let's continue digging into this.. 😄 ======================================================================
FAIL: test_guard_type_version_executor_invalidated (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_executor_invalidated)
Verify that the executor is invalided on a type change.
----------------------------------------------------------------------
<
9E88
span class="pl-en">Traceback (most recent call last):
File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1458, in test_guard_type_version_executor_invalidated
self.assertEqual(list(iter_opnames(ex)).count("_GUARD_TYPE_VERSION"), 1)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1
======================================================================
FAIL: test_guard_type_version_removed (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_removed)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1353, in test_guard_type_version_removed
self.assertEqual(guard_type_version_count, 1)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1
======================================================================
FAIL: test_guard_type_version_removed_inlined (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_removed_inlined)
Verify that the guard type version if we have an inlined function
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/eclips4/programming/programming-languages/cpython/Lib/test/test_capi/test_opt.py", line 1379, in test_guard_type_version_removed_inlined
self.assertEqual(guard_type_version_count, 1)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 2 != 1
======================================================================
UNEXPECTED SUCCESS: test_guard_type_version_not_removed_escaping (test.test_capi.test_opt.TestUopsOptimization.test_guard_type_version_not_removed_escaping)
Verify that the guard type version is not removed if have an escaping function
----------------------------------------------------------------------
Ran 924 tests in 11.506s
FAILED (failures=3, skipped=4, unexpected successes=1)
test test_capi failed
test_capi failed (3 failures)
== Tests result: FAILURE ==
1 test failed:
test_capi
Total duration: 23.5 sec
Total tests: run=924 failures=3 skipped=4
Total test files: run=1/1 failed=1
Result: FAILURE |
UPD: There are many more failures: 17 tests failed:
test.test_asyncio.test_taskgroups test.test_inspect.test_inspect
test.test_multiprocessing_forkserver.test_processes
test.test_multiprocessing_spawn.test_processes test_argparse
test_ast test_capi test_clinic test_collections test_ctypes
test_descr test_difflib test_enum test_itertools test_monitoring
test_tarfile test_zoneinfo |
Let's keep this PR in scope. This new issue you found isn't a refleak. It's just a byproduct of the fact that when we run out of version numbers, we literally can't perform certain optimizations anymore. |
Should I create an separate issue for that? |
So far, the latest observable leaks are memory block leaks: eclips4@nixos ~/p/p/cpython (issue-120501)> ./python -m test -R 3:3 test_random
Using random seed: 315532800
0:00:00 load avg: 2.11 Run 1 test sequentially in a single process
0:00:00 load avg: 2.11 [1/1] test_random
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. 121
test_random leaked [1, 2, 1] memory blocks, sum=4
test_random failed (reference leak)
== Tests result: FAILURE ==
1 test failed:
test_random
Total duration: 13.5 sec
Total tests: run=103
Total test files: run=1/1 failed=1
Result: FAILURE Though, for a larger number of repetitions it doesn't fail: eclips4@nixos ~/p/p/cpython (issue-120501) [2]> ./python -m test -R 6:6 test_random
Using random seed: 315532800
0:00:00 load avg: 2.35 Run 1 test sequentially in a single process
0:00:00 load avg: 2.35 [1/1] test_random
beginning 12 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123456:789012
XX.121 1.12..
test_random leaked [1, 0, 1, 2, 0, 0] memory blocks, sum=4 (this is fine)
== Tests result: SUCCESS ==
1 test OK.
Total duration: 27.5 sec
Total tests: run=103
Total test files: run=1/1
Result: SUCCESS So.. it doesn't look like a real leaks, because ./python -Xshowrefcount -m test -R 3:3 test_random
Using random seed: 315532800
0:00:00 load avg: 1.99 Run 1 test sequentially in a single process
0:00:00 load avg: 1.99 [1/1] test_random
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. 121
test_random leaked [1, 2, 1] memory blocks, sum=4
test_random failed (reference leak)
== Tests result: FAILURE ==
1 test failed:
test_random
Total duration: 13.6 sec
Total tests: run=103
Total test files: run=1/1 failed=1
Result: FAILURE
[0 refs, 0 blocks] I understand that with Tier 2, there could be allocations that didn't happen before. But I have no idea where they are. Any hints? |
JIT
build #120501