Global deadlock detector · Issue #1085 · python-trio/trio
@njsmith

Description

We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially wayyy simpler. In particular, I believe a Trio program has deadlocked if:

  • There are no runnable tasks
  • There are no registered timeouts
  • There are no tasks waiting on the IOManager
  • No-one is blocked in wait_all_tasks_blocked

(Did I miss anything?)
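
For concreteness, here's roughly the simplest program that deadlocks in this sense: no task will ever become runnable again, there are no timeouts, and no user task is waiting on I/O (the entry-queue wrinkle is discussed below).

import trio

async def main():
    evt = trio.Event()
    # Nothing ever calls evt.set(), no timeout is set, and no I/O is
    # registered, so once we block here the whole program is deadlocked.
    await evt.wait()

trio.run(main)  # hangs forever; this is what the check below would catch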

However, there is one practical problem: the EntryQueue task is always blocked in the IOManager, waiting for someone to call run_sync_soon.

Practical example of why this is important: from the Trio scheduler's point of view, run_sync_in_worker_thread puts a task to sleep, and then later a call to reschedule(...) magically appears through run_sync_soon. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting a run_sync_soon, and the program actually isn't deadlocked. But, of course, 99% of the time, there is absolutely and definitely no run_sync_soon call coming. There's just no way for Trio to know that.
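
To illustrate the pattern, here's a simplified sketch (not the actual run_sync_in_worker_thread implementation; cancellation, error handling, and the thread limiter are all omitted): the task parks itself with wait_task_rescheduled, and the only thing that will ever wake it is a run_sync_soon call made from the worker thread.

import threading

import outcome
import trio

async def run_in_thread(fn):
    task = trio.lowlevel.current_task()
    token = trio.lowlevel.current_trio_token()

    def worker():
        result = outcome.capture(fn)
        # The "magic" wake-up: until this run_sync_soon call arrives via the
        # entry queue, the parked task is indistinguishable from a deadlocked one.
        token.run_sync_soon(trio.lowlevel.reschedule, task, result)

    threading.Thread(target=worker).start()

    def abort(_raise_cancel):
        return trio.lowlevel.Abort.FAILED  # can't abandon the running thread

    return await trio.lowlevel.wait_task_rescheduled(abort)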

So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a run_sync_soon call coming. I think that means we need to refactor TrioToken to use an acquire/release pattern: you acquire the token only if you plan to call run_sync_soon, and when you're done with it you explicitly close it.

This will break the other use of TrioToken, which is that you can compare two tokens with "is" to check whether two calls to trio.run are actually the same run. Maybe that's not even that useful? If it is, though, then we should split it off into a separate class, so that the only reason to acquire the run_sync_soon object is that you're going to call run_sync_soon.
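
A rough sketch of what the refactor could look like (all of these names, like RunSyncSoonHandle and tokens_outstanding, are made up, and a real design would need to think about threads racing with close()):

class RunSyncSoonHandle:
    # Hypothetical: acquired only by code that actually intends to call
    # run_sync_soon.  While any handle is open, the run loop knows a wake-up
    # might still arrive, so it can't declare the program deadlocked.
    def __init__(self, runner):
        self._runner = runner
        self._closed = False
        runner.tokens_outstanding += 1

    def run_sync_soon(self, fn, *args):
        if self._closed:
            raise RuntimeError("handle is closed")
        # Delegate to today's internal entry-queue mechanism.
        self._runner.entry_queue.run_sync_soon(fn, *args)

    def close(self):
        if not self._closed:
            self._closed = True
            self._runner.tokens_outstanding -= 1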

Given that, I think we could implement this by extending the code at the top of the event loop along these lines:

 if runner.runq:
     timeout = 0
 elif runner.deadlines:
     deadline, _ = runner.deadlines.keys()[0]
     timeout = runner.clock.deadline_to_sleep_time(deadline)
 else:
-    timeout = _MAX_TIMEOUT
+    if (not runner.io_manager.has_waits()
+            and not runner.tokens_outstanding
+            and not runner.waiting_for_idle):
+        # Deadlock detected! Dump a stack tree and crash, maybe...?
+        ...
+    else:
+        timeout = _MAX_TIMEOUT

This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks and no pending deadlines. No runnable tasks means either we're about to go to sleep for a while, so taking some extra time here is "free", or we're about to block waiting for I/O, and if there's outstanding I/O then you should probably have a deadline set anyway...
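
As for what "dump a stack tree" might mean: roughly, walk every live task and print the coroutine frames it's suspended in. A rough sketch, leaning on internals (runner.tasks, task.coro) that aren't public API:

import types

def dump_blocked_tasks(runner):
    # For each task, walk the chain of awaited coroutines to show where it's
    # parked.  Every attribute used here is an implementation detail.
    for task in runner.tasks:
        print(f"--- task {task.name!r} is blocked at:")
        coro = task.coro
        while isinstance(coro, types.CoroutineType) and coro.cr_frame is not None:
            frame = coro.cr_frame
            print(f"    {frame.f_code.co_name} "
                  f"({frame.f_code.co_filename}:{frame.f_lineno})")
            coro = coro.cr_await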
