8000 FreeIPA + Python 3.9.8 causes deadlock · Issue #730 · GrahamDumpleton/mod_wsgi · GitHub
[go: up one dir, main page]

Skip to content
FreeIPA + Python 3.9.8 causes deadlock #730
@frenzymadness

Description

@frenzymadness

Hello there.

This issue probably does not belong here and I'm sorry for that but it's interesting and I'm running out of my debugging skills and you might be able to help me with that - and - I think that you experts love challenges.

I was describing my progress in https://bodhi.fedoraproject.org/updates/FEDORA-2021-e152ce5f31 but I'll try to summarize it here.

Short description: Combination of FreeIPA, httpd, mod_wsgi, and Python causes a deadlock in the threading module when systemd tries to stop httpd service - the process of stopping the httpd processes takes more than 90 seconds and all the processes are then killed.

The problem has been found by OpenQA tests in Fedora 34 when we tried to update Python from 3.9.7 to 3.9.8. I did a bisection of commits between these releases of Python and found out that the problem is caused/uncovered by this change: python/cpython@94d19f6 I've discussed it with @vstinner and we think that the code in Python is correct - still, without this commit, the problem is not there but the code itself does look correct. There has to be something else.

Components:

  • Fedora 34
  • httpd 2.4.51 (latest version) installed from RPM package on Fedora 34
  • FreeIPA installed from RPM package the same way as it is in the tests, description here: https://openqa.fedoraproject.org/tests/1059611/modules/role_deploy_domain_controller/steps/1/src
  • mod_wsgi - installed from RPM - for testing updated to the latest version so it's newer that the version currently available in Fedora 34. I believed that some changes in the mod_wsgi might fix the issue but they don't.
  • Python - something between 3.9.7 and 3.9.8 - the reason is that I've found the commit "causing" the problem so I switch the versions.

How to reproduce it:

  1. Install FreeIPA with Python 3.9.8
  2. apachectl start
  3. Refresh web UI FreeIPA page a few times
  4. time apachectl stop

Result is:

# time apachectl stop

real	1m30.366s
user	0m0.023s
sys	0m0.031s

The log is available here: https://lbalhar.fedorapeople.org/error_log and it contains all the messages produced by httpd from start to end as described in the reproducer + some debug prints containing the file name where they come from. Note that there is no jump in the timestamps because httpd does not provide any more messages before it's killed. But the jump is in the systemd logs:

Nov 16 22:22:04 test.openqa.fedoraproject.org httpd[5648]: Server configured, listening on: port 443, port 80
Nov 16 22:22:18 test.openqa.fedoraproject.org systemd[1]: Stopping The Apache HTTP Server...
░░ Subject: A stop job for unit httpd.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A stop job for unit httpd.service has begun execution.
░░ 
░░ The job identifier is 4975.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: State 'stop-sigterm' timed out. Killing.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5648 (httpd) with signal SIGKILL.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5651 (httpd) with signal SIGKILL.

From the error_log, you can see that there are multiple similar processes (5659, 5660, …) but only one of them (5660) served the requests from the web UI and the same process caused the deadlock. Only this one process reached the new branch (line 1457 https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1457) and never reached the end of the _shutdown function.

I've also added faulthandler.dump_traceback_later(20, file=tempfile.mkstemp(dir="/var/log/httpd/")[0], repeat=True) to the _shutdown function and it produces only one file with some content and the content is:

Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>

So the one process deadlocks on lock.acquire(), line: https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1470

From what I can see in the code of the threading module and the _shutdown function it seems that the process serving the requests from web UI is the only one where the _shutdown function is not called by the main thread. We have tried a blind fix - import threading as soon as possible (in site.py) and something similar is implemented here but it did not help.

I've tried to overload _thread.start_new_thread in threading.py but it produces no results so I think that threads are created in a different way here.

I've tried to use gdb to see what is happening in the locked process and there were two threads but one of them was caused by the faulthandler - when removed, there is a single thread with this stack:

#0  0x00007ff4f0f26a8a in __futex_abstimed_wait_common64 () from target:/lib64/libpthread.so.0
#1  0x00007ff4f0f22c98 in __new_sem_wait_slow64.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00007ff4efd2242c in PyThread_acquire_lock_timed () from target:/lib64/libpython3.9.so.1.0
#3  0x00007ff4efdae50c in acquire_timed () from target:/lib64/libpython3.9.so.1.0
#4  0x00007ff4efeabdab in lock_PyThread_acquire_lock () from target:/lib64/libpython3.9.so.1.0
#5  0x00007ff4efdb7a54 in method_vectorcall_VARARGS_KEYWORDS () from target:/lib64/libpython3.9.so.1.0
#6  0x00007ff4efce0231 in _PyEval_EvalFrameDefault () from target:/lib64/libpython3.9.so.1.0
#7  0x00007ff4efcd8e93 in function_code_fastcall () from target:/lib64/libpython3.9.so.1.0
#8  0x00007ff4f00472bb in ShutdownInterpreter_call () from target:/etc/httpd/modules/mod_wsgi_python3.so
#9  0x00007ff4efd2d336 in _PyObject_MakeTpCall () from target:/lib64/libpython3.9.so.1.0
#10 0x00007ff4efe03f76 in PyObject_VectorcallMethod () from target:/lib64/libpython3.9.so.1.0
#11 0x00007ff4efe4dd62 in wait_for_thread_shutdown () from target:/lib64/libpython3.9.so.1.0
#12 0x00007ff4efe7fb5f in Py_EndInterpreter () from target:/lib64/libpython3.9.so.1.0
#13 0x00007ff4f0045f2d in Interpreter_dealloc () from target:/etc/httpd/modules/mod_wsgi_python3.so
#14 0x00007ff4efcf0167 in free_keys_object () from target:/lib64/libpython3.9.so.1.0
#15 0x00007ff4f004e88c in wsgi_python_child_cleanup () from target:/etc/httpd/modules/mod_wsgi_python3.so
#16 0x00007ff4f0f5bcae in apr_pool_destroy () from target:/lib64/libapr-1.so.0
#17 0x00007ff4f052a52f in clean_child_exit () from target:/etc/httpd/modules/mod_mpm_event.so
#18 0x00007ff4f052c96a in child_main () from target:/etc/httpd/modules/mod_mpm_event.so
#19 0x00007ff4f052d29f in make_child () from target:/etc/httpd/modules/mod_mpm_event.so
#20 0x00007ff4f052d341 in startup_children () from target:/etc/httpd/modules/mod_mpm_event.so
#21 0x00007ff4f052e14f in event_run () from target:/etc/httpd/modules/mod_mpm_event.so
#22 0x000056514d371428 in ap_run_mpm ()
#23 0x000056514d35f484 in main ()

Any help is much appreciated. Cc @tiran @encukou

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0