-
Notifications
You must be signed in to change notification settings - Fork 272
Description
Hello there.
This issue probably does not belong here and I'm sorry for that but it's interesting and I'm running out of my debugging skills and you might be able to help me with that - and - I think that you experts love challenges.
I was describing my progress in https://bodhi.fedoraproject.org/updates/FEDORA-2021-e152ce5f31 but I'll try to summarize it here.
Short description: Combination of FreeIPA, httpd, mod_wsgi, and Python causes a deadlock in the threading module when systemd tries to stop httpd service - the process of stopping the httpd processes takes more than 90 seconds and all the processes are then killed.
The problem has been found by OpenQA tests in Fedora 34 when we tried to update Python from 3.9.7 to 3.9.8. I did a bisection of commits between these releases of Python and found out that the problem is caused/uncovered by this change: python/cpython@94d19f6 I've discussed it with @vstinner and we think that the code in Python is correct - still, without this commit, the problem is not there but the code itself does look correct. There has to be something else.
Components:
- Fedora 34
- httpd 2.4.51 (latest version) installed from RPM package on Fedora 34
- FreeIPA installed from RPM package the same way as it is in the tests, description here: https://openqa.fedoraproject.org/tests/1059611/modules/role_deploy_domain_controller/steps/1/src
- mod_wsgi - installed from RPM - for testing updated to the latest version so it's newer that the version currently available in Fedora 34. I believed that some changes in the mod_wsgi might fix the issue but they don't.
- Python - something between 3.9.7 and 3.9.8 - the reason is that I've found the commit "causing" the problem so I switch the versions.
How to reproduce it:
- Install FreeIPA with Python 3.9.8
- apachectl start
- Refresh web UI FreeIPA page a few times
- time apachectl stop
Result is:
# time apachectl stop
real 1m30.366s
user 0m0.023s
sys 0m0.031s
The log is available here: https://lbalhar.fedorapeople.org/error_log and it contains all the messages produced by httpd from start to end as described in the reproducer + some debug prints containing the file name where they come from. Note that there is no jump in the timestamps because httpd does not provide any more messages before it's killed. But the jump is in the systemd logs:
Nov 16 22:22:04 test.openqa.fedoraproject.org httpd[5648]: Server configured, listening on: port 443, port 80
Nov 16 22:22:18 test.openqa.fedoraproject.org systemd[1]: Stopping The Apache HTTP Server...
░░ Subject: A stop job for unit httpd.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A stop job for unit httpd.service has begun execution.
░░
░░ The job identifier is 4975.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: State 'stop-sigterm' timed out. Killing.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5648 (httpd) with signal SIGKILL.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5651 (httpd) with signal SIGKILL.
From the error_log, you can see that there are multiple similar processes (5659, 5660, …) but only one of them (5660) served the requests from the web UI and the same process caused the deadlock. Only this one process reached the new branch (line 1457 https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1457) and never reached the end of the _shutdown function.
I've also added faulthandler.dump_traceback_later(20, file=tempfile.mkstemp(dir="/var/log/httpd/")[0], repeat=True)
to the _shutdown
function and it produces only one file with some content and the content is:
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown
Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>
Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown
Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>
Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown
Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>
Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown
Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>
Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
So the one process deadlocks on lock.acquire()
, line: https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1470
From what I can see in the code of the threading module and the _shutdown
function it seems that the process serving the requests from web UI is the only one where the _shutdown
function is not called by the main thread. We have tried a blind fix - import threading as soon as possible (in site.py) and something similar is implemented here but it did not help.
I've tried to overload _thread.start_new_thread
in threading.py
but it produces no results so I think that threads are created in a different way here.
I've tried to use gdb to see what is happening in the locked process and there were two threads but one of them was caused by the faulthandler - when removed, there is a single thread with this stack:
#0 0x00007ff4f0f26a8a in __futex_abstimed_wait_common64 () from target:/lib64/libpthread.so.0
#1 0x00007ff4f0f22c98 in __new_sem_wait_slow64.constprop.0 () from target:/lib64/libpthread.so.0
#2 0x00007ff4efd2242c in PyThread_acquire_lock_timed () from target:/lib64/libpython3.9.so.1.0
#3 0x00007ff4efdae50c in acquire_timed () from target:/lib64/libpython3.9.so.1.0
#4 0x00007ff4efeabdab in lock_PyThread_acquire_lock () from target:/lib64/libpython3.9.so.1.0
#5 0x00007ff4efdb7a54 in method_vectorcall_VARARGS_KEYWORDS () from target:/lib64/libpython3.9.so.1.0
#6 0x00007ff4efce0231 in _PyEval_EvalFrameDefault () from target:/lib64/libpython3.9.so.1.0
#7 0x00007ff4efcd8e93 in function_code_fastcall () from target:/lib64/libpython3.9.so.1.0
#8 0x00007ff4f00472bb in ShutdownInterpreter_call () from target:/etc/httpd/modules/mod_wsgi_python3.so
#9 0x00007ff4efd2d336 in _PyObject_MakeTpCall () from target:/lib64/libpython3.9.so.1.0
#10 0x00007ff4efe03f76 in PyObject_VectorcallMethod () from target:/lib64/libpython3.9.so.1.0
#11 0x00007ff4efe4dd62 in wait_for_thread_shutdown () from target:/lib64/libpython3.9.so.1.0
#12 0x00007ff4efe7fb5f in Py_EndInterpreter () from target:/lib64/libpython3.9.so.1.0
#13 0x00007ff4f0045f2d in Interpreter_dealloc () from target:/etc/httpd/modules/mod_wsgi_python3.so
#14 0x00007ff4efcf0167 in free_keys_object () from target:/lib64/libpython3.9.so.1.0
#15 0x00007ff4f004e88c in wsgi_python_child_cleanup () from target:/etc/httpd/modules/mod_wsgi_python3.so
#16 0x00007ff4f0f5bcae in apr_pool_destroy () from target:/lib64/libapr-1.so.0
#17 0x00007ff4f052a52f in clean_child_exit () from target:/etc/httpd/modules/mod_mpm_event.so
#18 0x00007ff4f052c96a in child_main () from target:/etc/httpd/modules/mod_mpm_event.so
#19 0x00007ff4f052d29f in make_child () from target:/etc/httpd/modules/mod_mpm_event.so
#20 0x00007ff4f052d341 in startup_children () from target:/etc/httpd/modules/mod_mpm_event.so
#21 0x00007ff4f052e14f in event_run () from target:/etc/httpd/modules/mod_mpm_event.so
#22 0x000056514d371428 in ap_run_mpm ()
#23 0x000056514d35f484 in main ()