-
-
Notifications
You must be signed in to change notification settings - Fork 357
Description
symptom
at a regular period (e.g. 10s or 20s), there are significant stalls (e.g. 30 ms) where it appears the main thread is pre-empted
In terms of trio.abc.Instrument
, the lost time appears between before_task_step
and after_task_step
, though tasks are doing nothing with threads or time.sleep(). In a single-threaded app with many heterogeneous tasks, the stall events appear to be distributed randomly among the tasks.
I've come up with a minimal program to reproduce the effect. Note that it is CPU-intensive (a loop of very small trio.sleep()
), as CPU use > 70% seems necessary to induce the problem.
import numba
import trio
async def _run():
T = .000015
t_start = trio.current_time()
while True:
t0 = trio.current_time()
await trio.sleep(T)
delta = trio.current_time() - t0
if delta > .005:
elapsed = trio.current_time() - t_start
print(f'{(elapsed) // 60:02.0f}:{elapsed % 60:02.0f} {delta * 1000:5.1f}')
trio.run(_run)
Note that the repro imports numba
, and that is the spooky part. With the program as written, observed output (timestamp, stall in ms):
$ python repro.py
00:00 29.1
00:11 35.5
00:21 33.4
00:32 33.8
However, removing the numba
import, the stall durations are significantly reduced (6 ms) and at a different period (3 ms):
$ ./.venv/bin/python repro.py
00:00 5.2
00:03 6.5
00:05 6.7
00:08 6.4
00:10 6.1
And here is importing numpy
instead of numba
(10 ms stalls, 4ms period):
$ ./.venv/bin/python repro.py
00:00 8.9
00:04 11.8
00:08 10.5
00:11 10.1
00:15 11.2
conditions
seen in Trio 0.17.0
it's observed by multiple people running the repro, across multiple architectures (at least Linux/amd64 and and OS X)
So far it's specific to Trio. I've tried equivalent loops using asyncio
and time.sleep()
, which do not exhibit the problem.
original bug filed to numba: numba/numba#6347
theories
- T1. kernel scheduling artifact? - is this just an expected effect as a process approaches 100% CPU, as the OS/kernel need cycles to do internal work? But why doesn't the problem show up with asyncio or time.sleep() implementations? Why does the stall period depend on unused package imports?
- T2. weird GIL interaction among main thread and one of Trio's background threads? - The GIL is essentially a scheduler, I've read.
- ...?
next steps
- try earlier versions of Trio - result: no change (tested back to 0.11.0)
- ...