Global deadlock detector · Issue #1085 · python-trio/trio
@njsmith

Description

We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially wayyy simpler. In particular, I believe a Trio program has deadlocked if:

  • There are no runnable tasks
  • There are no registered timeouts
  • There are no tasks waiting on the IOManager
  • No-one is blocked in wait_all_tasks_blocked

(Did I miss anything?)
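
For concreteness, here's roughly the simplest program that deadlocks in this sense: no task will ever become runnable again, there are no timeouts, and no user task is waiting on I/O (the entry-queue wrinkle is discussed below).

import trio

async def main():
    evt = trio.Event()
    # Nothing ever calls evt.set(), no timeout is set, and no I/O is
    # registered, so once we block here the whole program is deadlocked.
    await evt.wait()

trio.run(main)  # hangs forever; this is what the check below would catch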

However, there is one practical problem: the EntryQueue task is always blocked in the IOManager, waiting for someone to call run_sync_soon.

Practical example of why this is important: from the Trio scheduler's point of view, run_sync_in_worker_thread puts a task to sleep, and then later a call to reschedule(...) magically appears through run_sync_soon. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting a run_sync_soon, and the program actually isn't deadlocked. But, of course, 99% of the time, there is absolutely and definitely no run_sync_soon call coming. There's just no way for Trio to know that.
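
To illustrate the pattern, here's a simplified sketch (not the actual run_sync_in_worker_thread implementation; cancellation, error handling, and the thread limiter are all omitted): the task parks itself with wait_task_rescheduled, and the only thing that will ever wake it is a run_sync_soon call made from the worker thread.

import threading

import outcome
import trio

async def run_in_thread(fn):
    task = trio.lowlevel.current_task()
    token = trio.lowlevel.current_trio_token()

    def worker():
        result = outcome.capture(fn)
        # The "magic" wake-up: until this run_sync_soon call arrives via the
        # entry queue, the parked task is indistinguishable from a deadlocked one.
        token.run_sync_soon(trio.lowlevel.reschedule, task, result)

    threading.Thread(target=worker).start()

    def abort(_raise_cancel):
        return trio.lowlevel.Abort.FAILED  # can't abandon the running thread

    return await trio.lowlevel.wait_task_rescheduled(abort)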

So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a run_sync_soon call coming. I think that means we need to refactor TrioToken to use an acquire/release pattern: you acquire the token only if you plan to call run_sync_soon, and when you're done with it you explicitly close it.

This will break the other use of TrioToken, which is that you can compare two tokens with "is" to check whether two calls to trio.run are actually the same run. Maybe that's not even that useful? If it is, though, then we should split it off into a separate class, so that the only reason to acquire the run_sync_soon object is that you're going to call run_sync_soon.
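
A rough sketch of what the refactor could look like (all of these names, like RunSyncSoonHandle and tokens_outstanding, are made up, and a real design would need to think about threads racing with close()):

class RunSyncSoonHandle:
    # Hypothetical: acquired only by code that actually intends to call
    # run_sync_soon.  While any handle is open, the run loop knows a wake-up
    # might still arrive, so it can't declare the program deadlocked.
    def __init__(self, runner):
        self._runner = runner
        self._closed = False
        runner.tokens_outstanding += 1

    def run_sync_soon(self, fn, *args):
        if self._closed:
            raise RuntimeError("handle is closed")
        # Delegate to today's internal entry-queue mechanism.
        self._runner.entry_queue.run_sync_soon(fn, *args)

    def close(self):
        if not self._closed:
            self._closed = True
            self._runner.tokens_outstanding -= 1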

Given that, I think we could implement this by extending the code at the top of the event loop along these lines:

 if runner.runq:
     timeout = 0
 elif runner.deadlines:
     deadline, _ = runner.deadlines.keys()[0]
     timeout = runner.clock.deadline_to_sleep_time(deadline)
 else:
-    timeout = _MAX_TIMEOUT
+    if (not runner.io_manager.has_waits()
+            and not runner.tokens_outstanding
+            and not runner.waiting_for_idle):
+        # Deadlock detected! Dump a stack tree and crash, maybe...?
+        ...
+    else:
+        timeout = _MAX_TIMEOUT

This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks and no pending deadlines. No runnable tasks means either we're about to go to sleep for a while, so taking some extra time here is "free", or we're about to block waiting for I/O, and if there's outstanding I/O then you should probably have a deadline set anyway...
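
As for what "dump a stack tree" might mean: roughly, walk every live task and print the coroutine frames it's suspended in. A rough sketch, leaning on internals (runner.tasks, task.coro) that aren't public API:

import types

def dump_blocked_tasks(runner):
    # For each task, walk the chain of awaited coroutines to show where it's
    # parked.  Every attribute used here is an implementation detail.
    for task in runner.tasks:
        print(f"--- task {task.name!r} is blocked at:")
        coro = task.coro
        while isinstance(coro, types.CoroutineType) and coro.cr_frame is not None:
            frame = coro.cr_frame
            print(f"    {frame.f_code.co_name} "
                  f"({frame.f_code.co_filename}:{frame.f_lineno})")
            coro = coro.cr_await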
