10000 gh-135552: Make the GC clear weakrefs later. by nascheme · Pull Request #136189 · python/cpython · GitHub
[go: up one dir, main page]

Skip to content

gh-135552: Make the GC clear weakrefs later. #136189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

nascheme
Copy link
Member
@nascheme nascheme commented Jul 1, 2025

Fix two bugs related to the interaction of weakrefs and the garbage
collector. The weakrefs in the tp_subclasses dictionary are needed in
order to correctly invalidate type caches (for example, by calling
PyType_Modified()). Clearing weakrefs before calling finalizers causes
the caches to not be correctly invalidated. That can cause crashes since the
caches can refer to invalid objects. This is fixed by deferring the
clearing of weakrefs to classes and without callbacks until after finalizers
are executed.

The second bug is caused by weakrefs created while running finalizers. Those
weakrefs can be outside the set of unreachable garbage and therefore survive
the delete_garbage() phase (where tp_clear() is called on objects).
Those weakrefs can expose to Python-level code objects that have had
tp_clear() called on them. See GH-91636 as an example of this kind of
bug. This is fixed be clearing all weakrefs to unreachable objects after
finalizers have been executed.

Clear the weakrefs to unreachable objects after finalizers are called.
@neonene
Copy link
Contributor
neonene commented Jul 1, 2025

I can confirm this PR fixes the gh-132413 issue as well.

@nascheme
Copy link
Member Author
nascheme commented Jul 1, 2025

I think this fixes (or mostly fixes) gh-91636 as well.

@nascheme nascheme requested a review from pablogsal July 1, 2025 23:33
@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 12f0b5c 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 1, 2025
@nascheme
Copy link
Member Author
nascheme commented Jul 2, 2025

This introduces refleaks, it seems. One of the leaking tests:
test.test_concurrent_futures.test_shutdown.ProcessPoolSpawnProcessPoolShutdownTest.test_shutdown_gh_132969_case_1
My unconfirmed suspicion is that a finalizer is now resurrecting an object via a weakref. Previously that weakref would be cleared before the finalizer is run. The multiprocessing finalizer logic seems very complicated. :-/

@nascheme
Copy link
Member Author
nascheme commented Jul 2, 2025

This is the smallest leaking example I could make so far. Something with ProcessPoolExecutor leaking maybe.

class LeakTest(unittest.TestCase):
    @classmethod
    def _fail_task(cls, n):
        raise ValueError("failing task")

    def test_leak_case(self):
        # this leaks references
        executor = futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                )
        f2 = executor.submit(LeakTest._fail_task, 0)
        try:
            f2.result()
        except ValueError:
            pass

        # Ensure that the executor cleans up after called
        # shutdown with wait=False
        executor_manager_thread = executor._executor_manager_thread
        executor.shutdown(wait=False)
        time.sleep(0.2)
        executor_manager_thread.join()

    def test_leak_case2(self):
        # this does not leak
        with futures.ProcessPoolExecutor(
                max_workers=1,
                max_tasks_per_child=1,
                ) as executor:
            f2 = executor.submit(LeakTest._fail_task, 0)
            try:
                f2.result()
            except ValueError:
                pass

@neonene
Copy link
Contributor
neonene commented Jul 2, 2025

Other leaking examples (on Windows):

1. test_logging:
import logging
import logging.config
import logging.handlers
from multiprocessing import Queue, Manager

class ConfigDictTest(unittest.TestCase):
    def test_multiprocessing_queues_XXX(self):
        config = {
            'version': 1,
            'handlers' : {
                'spam' : {
                    'class': 'logging.handlers.QueueHandler',
                    'queue': Manager().Queue()  ,         # Leak
                    # 'queue': Manager().JoinableQueue()  # Leak
                    # 'queue': Queue(),                   # No leak

                },
            },
            'root': {'handlers': ['spam']}
        }
        logger = logging.getLogger()
        logging.config.dictConfig(config)
        while logger.handlers:
            h = logger.handlers[0]
            logger.removeHandler(h)
            h.close()
2. test_interpreters.test_api:
import contextlib
import threading
import types
from concurrent import interpreters

def func():
    raise Exception('spam!')

@contextlib.contextmanager
def captured_thread_exception():
    ctx = types.SimpleNamespace(caught=None)
    def excepthook(args):
        ctx.caught = args
    orig_excepthook = threading.excepthook
    threading.excepthook = excepthook
    try:
        yield ctx
    finally:
        threading.excepthook = orig_excepthook

class TestInterpreterCall(unittest.TestCase):
    def test_call_in_thread_XXX(self):
        interp = interpreters.create()
        call = (interp._call, interp.call)[1]   # 0: No leak, 1: Leak
        with captured_thread_exception() as _:
            t = threading.Thread(target=call, args=(interp, func, (), {}))
            t.start()
            t.join()

UPDATE:

import weakref, _interpreters
wd = weakref.WeakValueDictionary()

class TestInterpreterCall(unittest.TestCase):
    def test_call_in_thread_XXX(self):
        id = _interpreters.create()
        wd[id] = type("", (), {})
        _interpreters.destroy(id)

@nascheme
Copy link
Member Author
nascheme commented Jul 2, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

This avoids breaking tests while fixing the issue with tp_subclasses. In
the long term, it would be better to defer the clear of all weakrefs,
not just the ones referring to classes.  However, that is a more
distruptive change and would seem to have a higher chance of breaking
user code.  So, it would not be something to do in a bugfix release.
@nascheme
Copy link
Member Author
nascheme commented Jul 3, 2025

The majority (maybe all) of these leaks are caused by the WeakValueDictionary used as multiprocessing.util._afterfork_registry. That took some digging to find. I'm not yet sure of a good fix for this. Explicitly cleaning the dead weak references from the .data dict works but it not too elegant.

Nope, that doesn't fix all the leaks. And having to explicitly clean the weakrefs from the WeakValueDictionary really shouldn't be needed, I think. The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

For the purposes of having a fix that we can backport (should probably be backported to all maintained Python versions), a less disruptive fix would be better. To that end, I've changed this PR to only defer clearing weakrefs to class objects. That fixes the tp_subclasses bug but should be less likely to break currently working code.

@nascheme nascheme added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
@bedevere-bot
Copy link

🤖 New build scheduled with the buildbot fleet by @nascheme for commit 2f3daba 🤖

Results will be shown at:

https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F136189%2Fmerge

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jul 3, 2025
We need to clear those before executing the callback.  Since this
ensures they can't run a second time, we don't need
_PyGC_SET_FINALIZED().  Revise comments.
@nascheme
Copy link
Member Author
nascheme commented Jul 3, 2025

The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

Ah, the KeyedRef callback requires that the weakref is cleared when it is called, otherwise it was not deleting the item from the weak dictionary. So we need to clear the weakrefs with callbacks before executing them. That fixes the refleaks, I believe.

I revised this PR to be something that is potentially suitable for backporting. To minimize the behaviour change, I'm only deferring the clear of weakrefs that refer to types (in order to allow tp_subclasses to work) or with callbacks. I still have an extra clearing pass that gets done after the finalizers are called. That avoids bugs like #91636.

If this gets merged, I think we should consider deferring clearing all weakrefs (without callbacks) until after finalizers are executed. I think that's more correct since it gives the finalizers a better opportunity to do their job.

* *not* be cleared so that caches based on the type version are correctly
* invalidated (see GH-135552 as a bug caused by this).
*/
clear_weakrefs(&final_unreachable);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ough! Nice trick with final_unreachable!

@pablogsal pablogsal self-assigned this Jul 3, 2025
This is a bit trickly because the wrlist can be changing as we
iterate over it.
@nascheme
Copy link
Member Author
nascheme commented Jul 4, 2025

I think this PR, as is, will no longer fix GH-132413. Only weakrefs with callbacks or weakrefs to classes are deferred from clearing. GH-132413 is caused by a weakref to a module object being cleared early. Depending on how serious we think that bug is, we could change this PR back to deferring the clear of all weakrefs without callbacks. Doing that doesn't break many unit tests (just a couple in test_finalization that's explicitly checking this behavior).

@efimov-mikhail
Copy link
Contributor

IMO, it's better to defer clearing of all weakrefs without callbacks. But it's up to release managers to decide which fixes we want to backport.

FYI, @Yhg1s, @hugovk

@neonene
Copy link
Contributor
neonene commented Jul 4, 2025

Only weakrefs with callbacks or weakrefs to classes are deferred from clearing

Weakref can be deferred from clearing if a dummy callback is given, right? If so, the following result on the PR is unexpected to me:

import itertools as mod
callback = lambda wr: None

def gen():
    import weakref
    wr = weakref.ref(mod, callback)
    print(1, wr())
    try:
        yield
    finally:
        print(2, wr())

it = gen()
next(it)
print('//')
1 <module 'itertools' (built-in)>
//
2 None

I guess gh-132413 could have another backporable workaround like this.

@nascheme
Copy link
Member Author
nascheme commented Jul 4, 2025

Only weakrefs with callbacks or weakrefs to classes are deferred from clearing

Weakref can be deferred from clearing if a dummy callback is given, right?

Oops, I wrote that wrong. If there is a callback, the weakref is cleared early. There are callbacks that expect that and don't do the right think if the weakref is not cleared first. At least, the KeyedRef weakref subclass expects this and I would assume there are other callbacks that might expect it too.

@neonene
Copy link
Contributor
neonene commented Jul 4, 2025

Then, would it be possible to internally introduce a special weakref that is guaranteed to be cleared later?

@sergey-miryanov
Copy link
Contributor

Then, would it be possible to internally introduce a special weakref that is guaranteed to be cleared later?

I tried to use sentinel callback for this here - #136147. Idea was in using noop-callback for this and not changing weakref code. But this solution has some flaws with lifetime of callback.

But in general case you can't distinguish what weakref should be cleared later (IIUC).

@neonene
Copy link
Contributor
neonene commented Jul 4, 2025

it's better to defer clearing of all weakrefs without callbacks. But it's up to release managers to decide which fixes we want to backport.

+1 for all weakrefs, at least for ones to modules when the gc runs with _Py_GC_REASON_SHUTDOWN. (Usually, modules are re-importable under other gc reasons.)

@sergey-miryanov
Copy link
Contributor

@nascheme I have added tests for #135552. I believe we should add tests for other linked issues. I plan to look into it over the weekend.

@sergey-miryanov
Copy link
Contributor

The KeyedRef class uses a callback and so they should be cleaned from the dict when the referred value dies. So, I'm not exactly sure what's going on there.

IIUC, there maybe cases when selfref or self_weakref already cleared by WeakValueDictionary or _WeakValueDictionary still have values should should be cleared to. But they will not be cleared because of condition if selfref() is not None and if self_weakref() is not None.

IIUC, because PEP-442 landed we can remove weakrefs here

def remove(wr, selfref=ref(self), _atomic_removal=_remove_dead_weakref):

and there

self_weakref = _weakref.ref(self)

and use a strong reference that will outlive the KeyedRefs in the dictionaries. I replaced the weakref with a strong reference in the _bootstrap.py and checked with refleak buildbot (#136345). No leaks were found.

@nascheme WDYT?

@neonene
Copy link
Contributor
neonene commented Jul 7, 2025

Weakref with a callback needs to be cleared early (as-is), accounting for the 3rd-party applications that have similar logic to WeakValueDictionary.

@sergey-miryanov
Copy link
Contributor

Weakref with a callback needs to be cleared early (as-is)

I'm not against this. But IIUC (feel free to correct me) we can face a situation in which weakref to the WeakValueDictionary is cleared in one gc collect call and KeyedRefs are cleared in another. And strong ref should handle this situation. We can teach users about this (I suppose).

@neonene
Copy link
Contributor
neonene commented Jul 7, 2025

we can face a situation

The experiment #136345 seems currently invalid, which is based on main and not effective on this PR where I omitted calling _PyWeakref_ClearRef in handle_weakref_callbacks.

@sergey-miryanov
Copy link
Contributor

The experiment #136345 seems currently invalid, which is based on main and not effective on this PR where I omitted calling _PyWeakref_ClearRef in handle_weakref_callbacks.

If we use a strong reference, then it doesn't matter if we omit _PyWeakref_ClearRef or not. Am I wrong?

@neonene
Copy link
Contributor
neonene commented Jul 7, 2025

Before discussing, have you checked on this PR? I can see the leaks.

@sergey-miryanov
Copy link
Contributor
sergey-miryanov commented Jul 7, 2025
Results
# clean test-weak-value-dict
➜ .\python.bat -m test test_logging -R :
Running Debug|x64 interpreter...
Using random seed: 3090750191
0:00:00 Run 1 test sequentially in a single process
0:00:00 [1/1] test_logging
beginning 9 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
12345:6789
XX... ....
0:06:26 [1/1] test_logging passed in 6 min 26 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6 min 26 sec
Total tests: run=272 skipped=13
Total test files: run=1/1
Result: SUCCESS

# clean gh-135552-wr-clear-later
➜ .\python.bat -m test test_logging -R :             
Running Debug|x64 interpreter...
Using random seed: 2520806690
0:00:00 Run 1 test sequentially in a single process
0:00:00 [1/1] test_logging
beginning 9 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
12345:6789
XX... ...2
test_logging leaked [0, 0, 0, 2] memory blocks, sum=2 (this is fine)
0:06:19 [1/1] test_logging passed in 6 min 19 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6 min 19 sec
Total tests: run=272 skipped=13
Total test files: run=1/1
Result: SUCCESS

# gh-135552-wr-clear-later + weakref removed from WeakValueDictionary
➜ .\python.bat -m test test_logging -R :
Running Debug|x64 interpreter...
Using random seed: 1096797289
0:00:00 Run 1 test sequentially in a single process
0:00:00 [1/1] test_logging
beginning 9 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
12345:6789
XX.2. ....
0:06:24 [1/1] test_logging passed in 6 min 24 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6 min 24 sec
Total tests: run=272 skipped=13
Total test files: run=1/1
Result: SUCCESS

# gh-135552-wr-clear-later + weakref removed from WeakValueDictionary + call _PyWeakref_ClearRef only for wr with callbacks
➜ .\python.bat -m test test_logging -R :
Running Debug|x64 interpreter...
Using random seed: 2599319069
0:00:00 Run 1 test sequentially in a single process
0:00:00 [1/1] test_logging
beginning 9 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
12345:6789
XX.2. ....
0:06:32 [1/1] test_logging passed in 6 min 32 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6 min 32 sec
Total tests: run=272 skipped=13
Total test files: run=1/1
Result: SUCCESS

# gh-135552-wr-clear-later + call _PyWeakref_ClearRef only for wr with callbacks
➜ .\python.bat -m test test_logging -R :
Running Debug|x64 interpreter...
Using random seed: 1493424824
0:00:00 Run 1 test sequentially in a single process
0:00:00 [1/1] test_logging
beginning 9 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
12345:6789
XX2.. ....
0:06:29 [1/1] test_logging passed in 6 min 29 sec

== Tests result: SUCCESS ==

1 test OK.

Total duration: 6 min 29 sec
Total tests: run=272 skipped=13
Total test files: run=1/1
Result: SUCCESS

@neonene
Copy link
Contributor
neonene commented Jul 7, 2025

Let's move on to #136345. (The result above looks long here.)

@nascheme
Copy link
Member Author
nascheme commented Jul 8, 2025

GH-136401 was merged. I think it should be backported as well since it should be pretty low risk.

The main branch has been merged into this PR. I've updated the PR to defer clearing all weakrefs without callbacks (not just weakrefs to classes). This seems more correct to me (rather than special casing weakrefs to classes) and seems the best way to fix GH-132413.

I think GH-132413 is actually more likely to be encountered in real code. For example, if you have some logging function in a __del__ method that uses the datetime module. So, maybe this PR should be backported to 3.13 and 3.14 as well.

// callback pointer intact. Obscure: it also changes *wrlist.
_PyObject_ASSERT((PyObject *)wr, wr->wr_object == op);
_PyWeakref_ClearRef(wr);
_PyObject_ASSERT((PyObject *)wr, wr->wr_object == Py_None);
}

if (wr->wr_callback == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth to move this check above and remove if (wr->wr_callback != NULL)?

@sergey-miryanov
Copy link
Contributor

Looks good to me. Thanks for detailed comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants
0