GCCollector / multiprocess mode deadlock #322
For the time being, for folks who need to disable the GC collector, run something like the following as early as possible:

```python
import gc

import prometheus_client


def burninate_gc_collector():
    # Detach the GCCollector's callbacks from the interpreter's gc hooks.
    for callback in gc.callbacks[:]:
        if callback.__qualname__.startswith('GCCollector.'):
            gc.callbacks.remove(callback)

    # Unregister the python_gc_* collectors from the default registry.
    for name, collector in list(prometheus_client.REGISTRY._names_to_collectors.items()):
        if name.startswith('python_gc_'):
            try:
                prometheus_client.REGISTRY.unregister(collector)
            except KeyError:  # probably gone already
                pass
```
Very sorry about this, I have no idea what multiproc mode is. Any references?
@tcolgate See here: https://github.com/prometheus/client_python#multiprocess-mode-gunicorn

The crux of the issue is that in multiprocess mode there's an additional lock that prevents concurrent/reentrant writes of values (since they're backed by a file, not bare Python memory). Since the GC callback may occur at any time, and Python runs the callback code in the main thread, there's a nontrivial chance that the GC callback is called while something else is updating a metric value, causing a deadlock (since the same thread is attempting to acquire the Lock twice).

One possible way I see to fix this is to do as little work as possible in the GC callback, such as storing the data in a queue, and then deal with the queued data at collection time (i.e. refactoring this to be a collector exposing a `collect()` method). Another way would be to simply disable the GC collector for now when in multiproc mode.
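A minimal sketch of that first option, assuming the standard `gc.callbacks` phase/info interface. The class name, metric name, and buffer size are illustrative only; this is not the actual client_python implementation, nor the fix that eventually landed.

```python
import gc
from collections import deque

from prometheus_client import REGISTRY
from prometheus_client.core import CounterMetricFamily


class QueuedGCCollector:
    """Hypothetical collector: the gc callback only buffers raw stats."""

    def __init__(self, registry=REGISTRY):
        # Bounded buffer so a long gap between scrapes cannot grow memory forever.
        self._events = deque(maxlen=10000)
        self._totals = {}  # generation -> cumulative objects collected
        gc.callbacks.append(self._gc_callback)
        registry.register(self)

    def _gc_callback(self, phase, info):
        # Runs inside the garbage collector, possibly while this very thread
        # already holds a value-writing lock: record a tuple and return,
        # never touch metric values (and thus never touch their locks) here.
        if phase == 'stop':
            self._events.append((info['generation'], info['collected']))

    def collect(self):
        # Scrape time: drain the buffer into cumulative totals and expose them.
        while self._events:
            generation, collected = self._events.popleft()
            self._totals[generation] = self._totals.get(generation, 0) + collected
        family = CounterMetricFamily(
            'python_gc_collected_objects', 'Objects collected during gc',
            labels=['generation'])
        for generation, total in sorted(self._totals.items()):
            family.add_metric([str(generation)], total)
        yield family
```

All of the metric work happens in `collect()`, which runs during a scrape rather than inside the garbage collector, so the shared multiprocess lock is never re-entered from a callback.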
Why is that a deadlock rather than just a lock contention?
Can multiproc be detected? I'd suggest skipping the GC collector if so.
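For reference, a sketch of that detection idea: multiprocess mode is driven by the `prometheus_multiproc_dir` environment variable (see the issue description at the end of this thread), so registration could simply be made conditional. `register_gc_collector()` is a hypothetical helper, not client_python API, and `GCCollector` is assumed importable from the package root as in recent releases.

```python
import os

import prometheus_client


def register_gc_collector(registry=prometheus_client.REGISTRY):
    """Register a GC collector only when not running in multiprocess mode."""
    if 'prometheus_multiproc_dir' in os.environ:
        # Multiprocess value writes hold a shared lock; a gc callback firing
        # mid-write in the same thread would try to take it again, so skip.
        return None
    return prometheus_client.GCCollector(registry=registry)
```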
Because the GC callbacks are run as "interrupts" (https://github.com/python/cpython/blob/master/Modules/gcmodule.c#L1153-L1190) in the same thread that happens to make an allocation that causes garbage collection, so you may get this happening in a single thread:

1. metric value is updated
2. acquire lock for metric writing
3. allocation occurs for something within metric writing, causes GC
4. GC callback is called, updates a metric
5. deadlock attempting to reacquire the same lock we already hold
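A standalone demonstration of that point, independent of prometheus_client: gc callbacks run synchronously in whichever thread triggered the collection, so a non-reentrant lock already held by that thread is exactly the lock the callback would block on.

```python
import gc
import threading


def on_gc(phase, info):
    if phase == 'start':
        # Prints the name of whichever thread triggered the collection.
        print('gc callback running in:', threading.current_thread().name)


gc.callbacks.append(on_gc)

try:
    def worker():
        gc.collect()  # stand-in for an allocation that happens to trigger gc

    t = threading.Thread(target=worker, name='worker-thread')
    t.start()
    t.join()               # prints "worker-thread"
    gc.collect()           # prints "MainThread"
finally:
    gc.callbacks.remove(on_gc)
```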
Right, so possibly sticking the GC updates into a queue and then processing those in another thread might be better. I'm probably not going to have time to look at this for a week. If no one else gets to it, I'll have a go.
Adding more threads into the mix sounds unnecessary. I think queue processing could be done on-demand, at `collect()` time for the GCCollector.
Possibly, though you wouldn't want an unbounded queue, and if a queue fills (because prom has been down for a while), the resulting scrapes will be given incorrect info. Personally I don't think another thread is a major issue, but I don't really understand python threading (or python), or this multiproc stuff well enough.
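One way to sidestep both the unbounded-queue and the extra-thread concerns (a sketch of an alternative, not what the library ended up doing): have the callback accumulate plain Python integers keyed by generation. That involves no locks and nothing that can fill up, and a `collect()`-time collector would read these totals; names are illustrative.

```python
import gc
from collections import defaultdict

# Cumulative per-generation totals, updated directly by the gc callback.
_collected_by_generation = defaultdict(int)
_uncollectable_by_generation = defaultdict(int)


def _gc_callback(phase, info):
    # Only bump plain ints here: no metric objects, no shared locks,
    # nothing unbounded to grow and nothing to drop between scrapes.
    if phase == 'stop':
        generation = info['generation']
        _collected_by_generation[generation] += info['collected']
        _uncollectable_by_generation[generation] += info['uncollectable']


gc.callbacks.append(_gc_callback)
```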
* Add Pytest cache and Coverage HTML report dirs to gitignore
* Disable the GC collector in multiprocess mode

Works around #322

Signed-off-by: Aarni Koskela <akx@iki.fi>
@akx I'm coming across this issue in 0.4.2, without `prometheus_multiproc_dir` set. Relevant failure:

Any suggestions on what I might be doing wrong?
Ugh, that sounds bad. An educated guess is that a GC collection can occur even within the same thread the GC collection callback is running in, so the callback gets called again, and... well, you can guess the recursive rest.

The quickest fix (if the above guess is correct) is to add a re-entrancy lock to the GC callback function, so it won't do anything if it's already running. I'm not sure if it's safe to do so within the GC callback, but adding a pair of

On the bright side: You're not doing anything wrong there, @charan28.
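A sketch of the re-entrancy guard being described here: a flag that turns the callback into a no-op if it is already on the stack. This only illustrates the idea and is not the patch from the PR referenced in the surrounding comments; `log_gc` is a hypothetical stand-in for the real callback.

```python
import gc


class ReentrancySafeCallback:
    """Wrap a gc callback so it cannot re-enter itself."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._running = False

    def __call__(self, phase, info):
        if self._running:
            return  # a collection triggered by our own work: do nothing
        self._running = True
        try:
            self._wrapped(phase, info)
        finally:
            self._running = False


def log_gc(phase, info):
    if phase == 'stop':
        print('gc generation', info['generation'], 'collected', info['collected'])


gc.callbacks.append(ReentrancySafeCallback(log_gc))
```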
@charan28 Assuming my hypothesis is right, the above ☝️ PR should fix this situation.
We have encountered this deadlock in a production environment; the Python stack trace is:

You can see the C stack trace here. Our code is here. We are using the latest code from pip3. I know about this issue from the comment in the code. Still looking into what happened.
Fixed by #371
Ref #313, #321.

There's a chance for a deadlock with the GCCollector and multiprocess mode (when `prometheus_multiproc_dir` is set).

For instance, when running our test suite with `py.test`, the process seems to hang, and ctrl+c yields this (on 0.4.0):

This looks like a problem with GC collector callbacks firing while another metric's value object (a `_MultiProcessValue`) is being modified (which holds the `Lock()` shared by all values created here).

One possible option might be to use an `RLock()` instead of a `Lock()`, but I'm not sure what that might cause.
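To make the `Lock()`/`RLock()` distinction concrete, a minimal snippet independent of prometheus_client; the blocking re-acquire that would actually deadlock is shown non-blocking so the example terminates.

```python
import threading

plain = threading.Lock()
plain.acquire()
# False: a blocking acquire here would deadlock the thread that holds the lock.
print('re-acquire plain Lock:', plain.acquire(blocking=False))
plain.release()

reentrant = threading.RLock()
reentrant.acquire()
# True: the owning thread may re-enter; it must release once per acquire.
print('re-acquire RLock:', reentrant.acquire(blocking=False))
reentrant.release()
reentrant.release()
```

The caution in the issue text above still applies: with an `RLock` the GC callback would run while the same thread is mid-way through another value update, and whether that is safe depends on the writer's internals, which may be why the discussion leans toward doing less work in the callback instead.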