bpo-34037: test_asyncio uses shutdown_default_executor() #16284

vstinner · 2019-09-19T14:12:37Z

https://bugs.python.org/issue34037

vstinner · 2019-09-19T14:13:01Z

@aeros167: Would you mind to review this change?

vstinner · 2019-09-19T14:25:58Z

I simplified my change (I reverted some useless changes).

vstinner · 2019-09-19T14:45:25Z

@asvetlov: Thanks for the review. I merged my PR.

aeros

@vstinner Apologies for the late response, I've been taking it easier today since I think I've been doing a bit too much recently, trying to avoid the infamous open source burnout you've talked about a few times. (:

Overall, the tests look correct to me. My only concern is a potential timeout issue from the part in close_loop() (lines 511-516 in test_asyncio/utils.py):

     def close_loop(loop):
         if loop._default_executor is not None:
             if not loop.is_closed():
                 loop.run_until_complete(loop.shutdown_default_executor())
             else:
                 loop._default_executor.shutdown(wait=True)

Since shutdown_default_executor() already attempts to call _default_executor.shutdown(wait=True) within the thread it creates, if there are any issues with joining the th 8000 reads, presumably calling _default_executor.shutdown(wait=True) again would not resolve the issue and could end up resulting in a complete timeout for test_asyncio if any of the threads in the ThreadPoolExecutor were unable to be joined.

Therefore, I would recommend changing the else block to the following:

            else:
                loop._default_executor.shutdown(wait=False)

This will at least guarantee that the executor will be shutdown completely, regardless of the status of the threads. IMO, a timeout is the worst case scenario, since it gives us far less useful debugging information and makes the issue harder to resolve.

An alternative of this could be:

     def close_loop(loop):
         if loop._default_executor is not None:
             try:
                 loop.run_until_complete(loop.shutdown_default_executor())
             finally:
                 loop._default_executor.shutdown(wait=False)

This would provide further assurance that the executor will be shut down, since loop._default_executor.shutdown(wait=False) would always be called last. But this may come at the cost of slightly reducing performance, since it would be called redundantly in normal scenarios. For that reason, I prefer the first option.

What are your thoughts @asvetlov and @1st1?

asvetlov · 2019-09-20T09:33:29Z

I think in tests we need to wait for executor worker threads.
If they hang -- we can debug the reason and fix broken test.

aeros · 2019-09-20T20:05:00Z

@asvetlov:

I think in tests we need to wait for executor worker threads.
If they hang -- we can debug the reason and fix broken test.

Since there isn't a timeout specified, wouldn't the worker threads indefinitely hang until the buildbot itself times out? This doesn't seem like it would give us very useful debugging information (which was my initial concern). Perhaps this should raise an exception if shutdown_default_executor() was unable to close the threads?

asvetlov · 2019-09-20T20:47:40Z

Thanks for the nice question!
This PR provides a very good improvement, I don't regret on merging :)
The next step could be:

Support timeout parameter by shutdown_default_executor(timeout=None); pass it to th.join(timeout).
Add timeout keyword-only argument to asyncio.run()

Does it make sense?
@aeros167 would you master a PR for the proposed change?

vstinner · 2019-09-20T21:07:17Z

Honestly, if a test ends with "loop._default_executor is not None and loop.is_closed()": I would call this a bug in the test, but I'm not interested to review all tests which end up like this.

I prefer to always shutdown the executor and wait until it completes.

I'm not sure of what you mean by shutting down the executor twice. In which case it would occur?

Since there isn't a timeout specified, wouldn't the worker threads indefinitely hang until the buildbot itself times out?

Buildbots implement multiple layers of timeouts:

the test runner calls faulthandler.dump_traceback_later(timeout, exit=True): usually between 20 and 30 minutes
tests are working in subprocesses: the parent waits the wotker process with a timeout: based on the first timeout x 1.5
buildbot has an hard limit of 1200 seconds with no output

aeros · 2019-09-20T21:45:21Z

@vstinner:

I'm not sure of what you mean by shutting down the executor twice. In which case it would occur?

Apologies if I didn't word this clearly, I meant that in a scenario where loop.run_until_complete(loop.shutdown_default_executor()) would fail (as a result of threads not closing), loop._default_executor.shutdown(wait=True) would also fail. So it's not clear to me that the else clause would accomplish anything different.

An "emergency shutdown" of sorts that would always close the executor would be loop._default_executor.shutdown(wait=False), so that like it would be more effective in those scenarios. I like @asvetlov's idea of adding a timeout parameter better though, that sounds like a far more robust solution.

Honestly, if a test ends with "loop._default_executor is not None and loop.is_closed()": I would call this a bug in the test, but I'm not interested to review all tests which end up like this.

Ah, I think that clears things up for me then. I was thinking more along the lines of treating this like a bug and attempting to resolve that scenario. Calling loop._default_executor.shutdown(wait=False) would likely result in dangling threads, and show us there's an issue with that section. But that may result in an excessive number of build failures, as you mentioned.

Buildbots implement multiple layers of timeouts

Good to know! I wasn't aware of there being multiple layers of timeouts for the buildbots. I figured there was some form of hard limit when there's no output for x seconds, but I didn't know about faulthandler.dump_traceback_later(timeout, exit=True).

tests are working in subprocesses: the parent waits the wotker process with a timeout: based on the first timeout x 1.5

Is there a particular reason why the timeout duration multiplier is 1.5? I'm not familiar with how the tests utilize subprocesses or the implementation details. But I am always eager to learn. (:

vstinner · 2019-09-20T21:53:27Z

Is there a particular reason why the timeout duration multiplier is 1.5? I'm not familiar with how the tests utilize subprocesses or the implementation details. But I am always eager to learn. (:

faulthandler is supposed to work. The second timeout in the parent process is just a watchdog, "just in case". I prefer to wait for faulthandler since it dumps the Python traceback of all threads before exiting, whereas when the parent hits the worker timeout, it just kills the worker process: no traceback.

aeros · 2019-09-20T21:53:40Z

@asvetlov:

Thanks for the nice question!
This PR provides a very good improvement, I don't regret on merging :)

No problem, I always try to ask questions when I don't understand something. Worst case scenario, the question goes unanswered. Best case scenario, I learn something new! I certainly understand though if you guys don't have the time to answer all of them.

I certainly don't have an issue with the PR, I just didn't understand what the else clause was accomplishing differently. But that was likely a result of my own misunderstanding, rather than something being flawed with the test. I think this PR provides an improvement either way. (:

Support timeout parameter by shutdown_default_executor(timeout=None); pass it to th.join(timeout).

Add timeout keyword-only argument to asyncio.run()

Does it make sense?
would you master a PR for the proposed change?

Yes, that makes sense. I would be glad to help with working on this!

vstinner · 2019-09-20T21:54:18Z

I meant that in a scenario where loop.run_until_complete(loop.shutdown_default_executor()) would fail (as a result of threads not closing), loop._default_executor.shutdown(wait=True) would also fail.

IMHO this discussion is a little bit too far from the initial issue and sounds like a theorical issue. If you can see a practical issue, please open a new issue. You are commenting a closed PR of a closed issue.

aeros · 2019-09-20T21:59:06Z

@vstinner:

IMHO this discussion is a little bit too far from the initial issue and sounds like a theorical issue. If you can see a practical issue, please open a new issue.

It's not so much a practical problem, it's more of a question to clear up my understanding (so it wouldn't justify opening an issue for it).

You are commenting a closed PR of a closed issue.

Apologies, I think I may have been doing that a bit too much recently, is this a bad thing to do? I mostly have been doing it when something doesn't make sense to me, to ask questions. Perhaps there's a better location to do this that I'm not aware of. Either way though, thank you for the in-depth explanation, I certainly appreciate it. (:

1st1 · 2019-09-20T22:28:48Z

@asvetlov

Add timeout keyword-only argument to asyncio.run()

I don't like this idea. We can have a global asyncio "reasonable" timeout for shutting down threads -- say 5 minutes -- if we ever reach it we complain loudly that asyncio cannot shutdown the threadpool cleanly.

But we don't want to have multiple knobs to tweak asyncio.run().

aeros · 2019-09-20T22:36:08Z

@1st1:

We can have a global asyncio "reasonable" timeout for shutting down threads -- say 5 minutes --

Do you think this would be best implemented as a private property within BaseEventLoop, for example _thread_timeout?

But we don't want to have multiple knobs to tweak asyncio.run().

I can understand why that might not be the best long term strategy. A single additional knob doesn't make a big difference, but it can significantly add up over time.

1st1 · 2019-09-20T22:37:28Z

Do you think this would be best implemented as a private property within BaseEventLoop, for example _thread_timeout?

I think it can just be a constant defined somewhere. Private properties sometimes end up becoming some "unofficial API" and we don't want those.

I can understand why that might not be the best long term strategy. A single additional knob doesn't make a big difference, but it can significantly add up over time.

Exactly.

bpo-34037: test_asyncio uses shutdown_default_executor()

2f832a0

vstinner added the skip news label Sep 19, 2019

vstinner requested a review from asvetlov September 19, 2019 14:12

vstinner requested a review from 1st1 as a code owner September 19, 2019 14:12

the-knights-who-say-ni added the CLA signed label Sep 19, 2019

bedevere-bot added the awaiting core review label Sep 19, 2019

asvetlov approved these changes Sep 19, 2019

View reviewed changes

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 19, 2019

Simplify change

4cf7cd3

vstinner merged commit 079931d into python:master Sep 19, 2019

vstinner deleted the asyncio_tests branch September 19, 2019 14:45

bedevere-bot removed the awaiting merge label Sep 19, 2019

aeros reviewed Sep 19, 2019

View reviewed changes

aeros mentioned this pull request Oct 2, 2019

bpo-38267: Add thread timeout parameter to loop.shutdown_default_executor() #16360

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

bpo-34037: test_asyncio uses shutdown_default_executor() #16284

bpo-34037: test_asyncio uses shutdown_default_executor() #16284

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bpo-34037: test_asyncio uses shutdown_default_executor() #16284

bpo-34037: test_asyncio uses shutdown_default_executor() #16284

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!