8000 Deadlock in connection to DB · Issue #1299 · newrelic/newrelic-python-agent · GitHub
[go: up one dir, main page]

Skip to content
Deadlock in connection to DB #1299
@sbabashahi

Description

@sbabashahi

Description
Hi team,

We are using newrelic for a while and in the past month we faced a floating bug. The issue is, we have open connections to our Postgres DB which is in active state wait for ClientRead or in idle in transaction state . What we found is the python process started a DB transaction and then got stuck which left the DB connection open. This was happening a couple of times in a day while we have millions of DB hit per day.

We believe this is what the issue is:
NR agent in its newrelic.core.trace_cache.TraceCache:save_trace() calls sys._current_frames() (docs), that in cpython<=3.11 under specific circumstances can deadlock:

  1. sys._current_frames() acquires runtime lock (HEAD_LOCK(runtime)) and then creates frame objects (_PyFrame_GetFrameObject())
  2. GC hook gc_alloc() runs on every object allocation and may decide to launch collection (gc_collect_generations()).
  3. During garbage collection (delete_garbage()) some objects might get deleted (local_dealloc()).
  4. Object deletion tries to acquired runtime lock again (HEAD_LOCK(runtime)) and we get a deadlock.
    Issue was reported (issue), but not fixed in 3.11. In 3.12 this class of issues was fixed all together: GC doesn't run on object allocations anymore.

We do have plan to migrate to python 3.12 soon but in the mean time we reduced the issue with turning off the asyncpg hooks in new relic import-hook:asyncpg.protocol and import-hook:asyncpg.connect_utils. After disabling these hooks we faced only 1 case in 3 days.

Expected Behavior
We believe the issue root is in python not newrelic agent, but we also believe some changes on the library happened, since we hadn't this issue before and it started to show itself recently.

Troubleshooting or NR Diag results

last lines of pyspy dump of an stuck process before disabling the hooks

    run (threading.py:975)
        Arguments:
            self: <Thread at 0x7fe93e56d450>
    _run_old_run_func (sentry_sdk/integrations/threading.py:94)
        Locals:
            self: <Thread at 0x7fe93e56d450>
    run (sentry_sdk/integrations/threading.py:101)
        Locals:
            a: <cell at 0x7fe93d7fa3e0>
            kw: <cell at 0x7fe93d7fa410>
            _run_old_run_func: <function at 0x7fe93d8368e0>
    _bootstrap_inner (threading.py:1038)
        Arguments:
            self: <Thread at 0x7fe93e56d450>
    _bootstrap (threading.py:995)
        Arguments:
            self: <Thread at 0x7fe93e56d450>
Thread 146826 (idle): "NR-Harvest-Thread"
    wait (threading.py:324)
        Arguments:
            self: <Condition at 0x7fe93d8660d0>
            timeout: 4.999745607376099
        Locals:
            waiter: <_thread.lock at 0x7fe93d864dc0>
            saved_state: None
            gotit: False
    wait (threading.py:622)
        Arguments:
            self: <Event at 0x7fe93d6f5c50>
            timeout: 4.999745607376099
        Locals:
            signaled: False
    run (sched.py:149)
        Arguments:
            self: <scheduler at 0x7fe93cda3c50>
            blocking: True
        Locals:
            lock: <_thread.RLock at 0x7fe93e535400>
            q: [(1738308015.5937417, 1, 12351, <method at 0x7fe93d7cc480>, (), {}), (1738308066.2719824, 2, 12350, ...)]
            delay...
Collapse

last lines of pyspy dump of a stuck process after disabling the hooks

(newrelic/hooks/logger_structlog.py:127)
    _proxy_to_logger (structlog/_base.py:222)
    meth (structlog/_native.py:144)
    _proxy_to_logger (structlog/_base.py:223)
    _proxy_to_logger (structlog/stdlib.py:265)
    info (structlog/stdlib.py:201)
    <lambda> (structlog/stdlib.py:627)
    <lambda> (structlog/stdlib.py:627)
    run (concurrent/futures/thread.py:58)
    _worker (concurrent/futures/thread.py:83)
    run (threading.py:975)
    _run_old_run_func (sentry_sdk/integrations/threading.py:94)
    run (sentry_sdk/integrations/threading.py:101)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)
Thread 244055 (idle): "asyncio_31"
    _worker (concurrent/futures/thread.py:81)
    run (threading.py:975)
    _run_old_run_func (sentry_sdk/integrations/threading.py:94)
    run (sentry_sdk/integrations/threading.py:101)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

Steps to Reproduce
Unfortunately we can not reproduce it.

Your Environment
python 3.11
newrelic agent 10.4.0
sqlalchemy 2.0.37
psycopg2-binary 2.9.10

Additional context
[TIP]: # ( Add any other context about the problem here. For example, relevant community posts or support tickets. )

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0