GCCollector / multiprocess mode deadlock #322

Closed

akx opened this issue Oct 9, 2018 · 14 comments
@akx
Contributor
akx commented Oct 9, 2018

Ref #313, #321.

There's a chance of a deadlock with the GCCollector in multiprocess mode (when prometheus_multiproc_dir is set).

For instance, when running our test suite with py.test, the process seems to hang, and Ctrl+C yields this (on 0.4.0):

Traceback (most recent call last):
  File "prometheus_client/gc_collector.py", line 49, in _cb
    latency.labels(gen).observe(delta)
  File "prometheus_client/core.py", line 747, in labels
    self._metrics[labelvalues] = self._wrappedClass(self._name, self._labelnames, labelvalues, **self._kwargs)
  File "prometheus_client/core.py", line 1088, in __init__
    self._sum = _ValueClass(self._type, name, name + '_sum', labelnames, labelvalues)
  File "prometheus_client/core.py", line 636, in __init__
    with lock:

This looks like a problem with GC collector callbacks firing while another metric's value object (a _MultiProcessValue) is being modified; that modification holds the Lock() shared by all values created here.

One possible option might be to use an RLock() instead of a Lock(), but I'm not sure what side effects that might have.

@akx
Contributor Author
akx commented Oct 9, 2018

For the time being, for folks who need to disable the GC collector, run something like

import gc
import prometheus_client

def burninate_gc_collector():
    for callback in gc.callbacks[:]:
        if callback.__qualname__.startswith('GCCollector.'):
            gc.callbacks.remove(callback)

    for name, collector in list(prometheus_client.REGISTRY._names_to_collectors.items()):
        if name.startswith('python_gc_'):
            try:
                prometheus_client.REGISTRY.unregister(collector)
            except KeyError:  # probably gone already
                pass

as early as possible.
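
A placement note: the snippet just needs to run before the first metric update. In a pytest suite, one hypothetical spot (assuming a top-level conftest.py and that the function above lives in a myapp.monitoring module) would be:

# conftest.py -- hypothetical placement; any module imported before the
# first metric update works
from myapp.monitoring import burninate_gc_collector  # hypothetical module path

burninate_gc_collector()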

@tcolgate
Contributor
tcolgate commented Oct 9, 2018

Very sorry about this, I have no idea what multiproc mode is. Any references?

@akx
Contributor Author
akx commented Oct 9, 2018

@tcolgate See here: https://github.com/prometheus/client_python#multiprocess-mode-gunicorn

The crux of the issue is that in multiprocess mode there's an additional lock that prevents concurrent/reentrant writes of values (since they're backed by a file, not bare Python memory).

Since the GC callback may fire at any time, and Python runs the callback code synchronously in the thread that triggered the collection, there's a nontrivial chance that the GC callback runs while that same thread is updating a metric value, causing a deadlock (the thread attempts to acquire the Lock twice).

One possible way I see to fix this is to do as little work as possible in the GC callback, such as storing the data in a queue and then dealing with the queued data at collection time (i.e. refactoring this into a collector exposing a collect() method, like ProcessCollector).

Another way would be to simply disable the GC collector for now when in multiproc mode.
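
A minimal sketch of the failure mode, using a plain threading.Lock as a stand-in for the lock guarding multiprocess values (names are illustrative, not the client's internals); running it hangs, because the GC callback tries to take a lock the very same thread already holds:

import gc
import threading

lock = threading.Lock()      # non-reentrant, standing in for the multiprocess value lock

def gc_callback(phase, info):
    if phase == 'start':
        with lock:           # the same thread already holds `lock` -> blocks forever
            pass

gc.callbacks.append(gc_callback)

with lock:                   # "a metric value is being updated"
    gc.collect()             # an allocation that triggers GC here has the same effect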

@tcolgate
Contributor
tcolgate commented Oct 9, 2018 via email

@akx
Contributor Author
akx commented Oct 11, 2018

Why is that a deadlock rather than just a lock contention?

Because the GC callbacks run as "interrupts" in whatever thread happens to make the allocation that triggers garbage collection, you can get this sequence in a single thread (a small sketch of step 4 follows the list):

  1. metric value is updated
  2. acquire lock for metric writing
  3. allocation occurs for something within metric writing, causes GC
  4. GC callback is called, updates a metric
  5. deadlock attempting to reacquire same lock we already hold
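
A small standard-library sketch of step 4, showing that gc.callbacks run synchronously inside whichever thread triggered the collection:

import gc
import threading

def report(phase, info):
    if phase == 'start':
        # runs synchronously in the thread that triggered the collection
        print('GC callback running in', threading.current_thread().name)

gc.callbacks.append(report)

t = threading.Thread(target=gc.collect, name='worker-1')
t.start()
t.join()                     # prints: GC callback running in worker-1
gc.callbacks.remove(report)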

@tcolgate
Contributor
tcolgate commented Oct 11, 2018 via email

@akx
Contributor Author
akx commented Oct 11, 2018

Adding more threads into the mix sounds unnecessary. I think queue processing could be done on-demand, at collect() time for the GCCollector.
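
A minimal sketch of that idea (hypothetical names, not the library's actual GCCollector): the GC callback only appends raw timings to a deque, and a custom collector drains the queue and builds the metric family when the registry calls collect():

import gc
import time
from collections import deque

from prometheus_client import REGISTRY
from prometheus_client.core import HistogramMetricFamily


class DeferredGCCollector(object):
    # Hypothetical sketch: the callback only enqueues numbers; all metric
    # work (labels, locks, file-backed values) happens at scrape time.

    def __init__(self, registry=REGISTRY):
        self._events = deque()   # (generation, duration) pairs from the callback
        self._starts = {}
        self._totals = {}        # generation -> (count, total_seconds)

        def _cb(phase, info):
            gen = info['generation']
            if phase == 'start':
                self._starts[gen] = time.time()
            elif phase == 'stop' and gen in self._starts:
                # as little work as possible here: no label lookups, no locks
                self._events.append((gen, time.time() - self._starts.pop(gen)))

        gc.callbacks.append(_cb)
        registry.register(self)

    def collect(self):
        while self._events:      # drain the queue at collection time
            gen, duration = self._events.popleft()
            count, total = self._totals.get(gen, (0, 0.0))
            self._totals[gen] = (count + 1, total + duration)
        latency = HistogramMetricFamily(
            'python_gc_duration_seconds', 'Time spent in GC', labels=['generation'])
        for gen, (count, total) in self._totals.items():
            latency.add_metric([str(gen)], buckets=[('+Inf', count)], sum_value=total)
        yield latency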

akx added a commit to valohai/prometheus-client-python that referenced this issue Oct 11, 2018
Works around prometheus#322

Signed-off-by: Aarni Koskela <akx@iki.fi>
@tcolgate
Contributor
tcolgate commented Oct 11, 2018 via email

brian-brazil pushed a commit that referenced this issue Oct 11, 2018
* Add Pytest cache and Coverage HTML report dirs to gitignore
* Disable the GC collector in multiprocess mode

Works around #322

Signed-off-by: Aarni Koskela <akx@iki.fi>
xavfernandez pushed a commit to Polyconseil/client_python that referenced this issue Oct 15, 2018
* Add Pytest cache and Coverage HTML report dirs to gitignore
* Disable the GC collector in multiprocess mode

Works around prometheus#322

Signed-off-by: Aarni Koskela <akx@iki.fi>
@charan28

@akx I'm coming across this issue in 0.4.2, without prometheus_multiproc_dir set. Reading through this bug and the associated fix, it seems this behavior was only noticed when that environment variable was set, so I'm unsure what's causing this.

Relevant failure:

Exception ignored in: <function GCCollector.__init__.<locals>._cb at 0x7fe8dd6b1488>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/prometheus_client/gc_collector.py", line 58, in _cb
    latency.labels(gen).observe(delta)
  File "/usr/lib/python3.6/site-packages/prometheus_client/core.py", line 777, in labels
    labelvalues = tuple(unicode(l) for l in labelvalues)
  File "/usr/lib/python3.6/site-packages/prometheus_client/core.py", line 777, in <genexpr>
    labelvalues = tuple(unicode(l) for l in labelvalues)
RecursionError: maximum recursion depth exceeded while getting the str of an object
Fatal Python error: Cannot recover from stack overflow.

Any suggestions on what I might be doing wrong?

@akx
Contributor Author
akx commented Nov 13, 2018

Ugh, that sounds bad.

An educated guess is that a GC collection can occur even within the same thread the GC callback is already running in, so the callback gets called again, and... well, you can guess the recursive rest.

The quickest fix (if the above guess is correct) is to add a re-entrancy guard to the GC callback function, so it does nothing if it's already running. I'm not sure whether that's safe to do from within a GC callback, but adding a pair of gc.disable() and gc.enable() calls should also work.

On the bright side: You're not doing anything wrong there, @charan28.
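
A sketch of the re-entrancy guard mentioned above (illustrative only, not the actual patch): the callback sets a flag while it runs and bails out immediately if it is re-entered.

def make_guarded_callback(inner_cb):
    # Wrap a gc callback so a re-entrant invocation becomes a no-op.
    state = {'running': False}

    def _cb(phase, info):
        if state['running']:
            return               # already inside the callback: do nothing
        state['running'] = True
        try:
            inner_cb(phase, info)
        finally:
            state['running'] = False

    return _cb

# usage: gc.callbacks.append(make_guarded_callback(original_cb))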

@akx
Contributor Author
akx commented Nov 13, 2018

@charan28 Assuming my hypothesis is right, the above ☝️ PR should fix this situation.

@xudifsd
xudifsd commented Jan 31, 2019

We have encountered this deadlock in a production environment; the Python stack trace is:

Current thread 0x00007f1a14b31700 (most recent call first):
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 148 in labels
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/gc_collector.py", line 72 in _cb
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 179 in _multi_samples
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 68 in collect
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/registry.py", line 75 in collect
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/exposition.py", line 89 in generate_latest
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/exposition.py", line 45 in prometheus_app
  File "/usr/local/lib/python3.7/wsgiref/handlers.py", line 137 in run
  File "/usr/local/lib/python3.7/wsgiref/simple_server.py", line 133 in handle
  File "/usr/local/lib/python3.7/socketserver.py", line 717 in __init__
  File "/usr/local/lib/python3.7/socketserver.py", line 357 in finish_request
  File "/usr/local/lib/python3.7/socketserver.py", line 344 in process_request
  File "/usr/local/lib/python3.7/socketserver.py", line 313 in _handle_request_noblock
  File "/usr/local/lib/python3.7/socketserver.py", line 234 in serve_forever
  File "/job_exporter/main.py", line 129 in main
  File "/job_exporter/main.py", line 162 in <module>

You can see the C stack trace here. Our code is here. We are using the latest code from pip3. I learned about this issue from the comment in the code.

Still looking into what happened.

@xudifsd
xudifsd commented Jan 31, 2019

It seems this and this acquire the same lock: GC can interrupt a call to metrics._multi_samples, and during the GC callback it will call metrics.labels. It should be an RLock.
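
For context, the behavioral difference (standard library only): a plain threading.Lock cannot be re-acquired by the thread that already holds it, while a threading.RLock can.

import threading

rlock = threading.RLock()
with rlock:
    with rlock:                          # fine: RLock is reentrant for the owning thread
        print('re-acquired RLock in the same thread')

lock = threading.Lock()
lock.acquire()
print(lock.acquire(blocking=False))      # False: a plain Lock cannot be re-acquired;
lock.release()                           # a blocking acquire here would deadlock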

@brian-brazil
Contributor

Fixed by #371
