GCCollector / multiprocess mode deadlock #322

Closed

akx opened this issue Oct 9, 2018 · 14 comments
@akx
Contributor
akx commented Oct 9, 2018

Ref #313, #321.

There's a chance of a deadlock with the GCCollector in multiprocess mode (when prometheus_multiproc_dir is set).

For instance, when running our test suite with py.test, the process seems to hang, and Ctrl+C yields this (on 0.4.0):

Traceback (most recent call last):
  File "prometheus_client/gc_collector.py", line 49, in _cb
    latency.labels(gen).observe(delta)
  File "prometheus_client/core.py", line 747, in labels
    self._metrics[labelvalues] = self._wrappedClass(self._name, self._labelnames, labelvalues, **self._kwargs)
  File "prometheus_client/core.py", line 1088, in __init__
    self._sum = _ValueClass(self._type, name, name + '_sum', labelnames, labelvalues)
  File "prometheus_client/core.py", line 636, in __init__
    with lock:

This looks like a problem with GC collector callbacks firing while another metric's value object (a _MultiProcessValue) is being modified; that modification holds the Lock() shared by all values created here.

One possible option might be to use an RLock() instead of a Lock(), but I'm not sure what side effects that might have.

@akx
Contributor Author
akx commented Oct 9, 2018

For the time being, for folks who need to disable the GC collector, run something like

import gc
import prometheus_client

def burninate_gc_collector():
    for callback in gc.callbacks[:]:
        if callback.__qualname__.startswith('GCCollector.'):
            gc.callbacks.remove(callback)

    for name, collector in list(prometheus_client.REGISTRY._names_to_collectors.items()):
        if name.startswith('python_gc_'):
            try:
                prometheus_client.REGISTRY.unregister(collector)
            except KeyError:  # probably gone already
                pass

as early as possible.
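
A placement note: the snippet just needs to run before the first metric update. In a pytest suite, one hypothetical spot (assuming a top-level conftest.py and that the function above lives in a myapp.monitoring module) would be:

# conftest.py -- hypothetical placement; any module imported before the
# first metric update works
from myapp.monitoring import burninate_gc_collector  # hypothetical module path

burninate_gc_collector()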

@tcolgate
Contributor
tcolgate commented Oct 9, 2018

Very sorry about this, I have no idea what multiproc mode is. Any references?

@akx
Contributor Author
akx commented Oct 9, 2018

@tcolgate See here: https://github.com/prometheus/client_python#multiprocess-mode-gunicorn

The crux of the issue is that in multiprocess mode there's an additional lock that prevents concurrent/reentrant writes of values (since they're backed by a file, not bare Python memory).

Since the GC callback may fire at any time, and Python runs the callback code synchronously in the thread that triggered the collection, there's a nontrivial chance that the GC callback runs while that same thread is updating a metric value, causing a deadlock (the thread attempts to acquire the Lock twice).

One possible way I see to fix this is to do as little work as possible in the GC callback, such as storing the data in a queue and then dealing with the queued data at collection time (i.e. refactoring this into a collector exposing a collect() method, like ProcessCollector).

Another way would be to simply disable the GC collector for now when in multiproc mode.
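
A minimal sketch of the failure mode, using a plain threading.Lock as a stand-in for the lock guarding multiprocess values (names are illustrative, not the client's internals); running it hangs, because the GC callback tries to take a lock the very same thread already holds:

import gc
import threading

lock = threading.Lock()      # non-reentrant, standing in for the multiprocess value lock

def gc_callback(phase, info):
    if phase == 'start':
        with lock:           # the same thread already holds `lock` -> blocks forever
            pass

gc.callbacks.append(gc_callback)

with lock:                   # "a metric value is being updated"
    gc.collect()             # an allocation that triggers GC here has the same effect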

@tcolgate
Contributor
tcolgate commented Oct 9, 2018 via email

@akx
Contributor Author
akx commented Oct 11, 2018

Why is that a deadlock rather than just a lock contention?

Because the GC callbacks run as "interrupts" in whatever thread happens to make the allocation that triggers garbage collection, you can get this sequence in a single thread (a small sketch of step 4 follows the list):

  1. metric value is updated
  2. acquire lock for metric writing
  3. allocation occurs for something within metric writing, causes GC
  4. GC callback is called, updates a metric
  5. deadlock attempting to reacquire same lock we already hold
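
A small standard-library sketch of step 4, showing that gc.callbacks run synchronously inside whichever thread triggered the collection:

import gc
import threading

def report(phase, info):
    if phase == 'start':
        # runs synchronously in the thread that triggered the collection
        print('GC callback running in', threading.current_thread().name)

gc.callbacks.append(report)

t = threading.Thread(target=gc.collect, name='worker-1')
t.start()
t.join()                     # prints: GC callback running in worker-1
gc.callbacks.remove(report)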

@tcolgate
Contributor
tcolgate commented Oct 11, 2018 via email

@akx
Contributor Author
akx commented Oct 11, 2018

Adding more threads into the mix sounds unnecessary. I think queue processing could be done on-demand, at collect() time for the GCCollector.
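
A minimal sketch of that idea (hypothetical names, not the library's actual GCCollector): the GC callback only appends raw timings to a deque, and a custom collector drains the queue and builds the metric family when the registry calls collect():

import gc
import time
from collections import deque

from prometheus_client import REGISTRY
from prometheus_client.core import HistogramMetricFamily


class DeferredGCCollector(object):
    # Hypothetical sketch: the callback only enqueues numbers; all metric
    # work (labels, locks, file-backed values) happens at scrape time.

    def __init__(self, registry=REGISTRY):
        self._events = deque()   # (generation, duration) pairs from the callback
        self._starts = {}
        self._totals = {}        # generation -> (count, total_seconds)

        def _cb(phase, info):
            gen = info['generation']
            if phase == 'start':
                self._starts[gen] = time.time()
            elif phase == 'stop' and gen in self._starts:
                # as little work as possible here: no label lookups, no locks
                self._events.append((gen, time.time() - self._starts.pop(gen)))

        gc.callbacks.append(_cb)
        registry.register(self)

    def collect(self):
        while self._events:      # drain the queue at collection time
            gen, duration = self._events.popleft()
            count, total = self._totals.get(gen, (0, 0.0))
            self._totals[gen] = (count + 1, total + duration)
        latency = HistogramMetricFamily(
            'python_gc_duration_seconds', 'Time spent in GC', labels=['generation'])
        for gen, (count, total) in self._totals.items():
            latency.add_metric([str(gen)], buckets=[('+Inf', count)], sum_value=total)
        yield latency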

akx added a commit to valohai/prometheus-client-python that referenced this issue Oct 11, 2018
Works around prometheus#322

Signed-off-by: Aarni Koskela <akx@iki.fi>
@tcolgate
Contributor
tcolgate commented Oct 11, 2018 via email

brian-brazil pushed a commit that referenced this issue Oct 11, 2018
* Add Pytest cache and Coverage HTML report dirs to gitignore
* Disable the GC collector in multiprocess mode

Works around #322

Signed-off-by: Aarni Koskela <akx@iki.fi>
xavfernandez pushed a commit to Polyconseil/client_python that referenced this issue Oct 15, 2018
* Add Pytest cache and Coverage HTML report dirs to gitignore
* Disable the GC collector in multiprocess mode

Works around prometheus#322

Signed-off-by: Aarni Koskela <akx@iki.fi>
@charan28

@akx I'm coming across this issue in 0.4.2, without prometheus_multiproc_dir set. Reading through this bug and the associated fix, it seems this behavior was only noticed when that environment variable was set, so I'm unsure what's causing this.

Relevant failure:

Exception ignored in: <function GCCollector.__init__.<locals>._cb at 0x7fe8dd6b1488>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/prometheus_client/gc_collector.py", line 58, in _cb
    latency.labels(gen).observe(delta)
  File "/usr/lib/python3.6/site-packages/prometheus_client/core.py", line 777, in labels
    labelvalues = tuple(unicode(l) for l in labelvalues)
  File "/usr/lib/python3.6/site-packages/prometheus_client/core.py", line 777, in <genexpr>
    labelvalues = tuple(unicode(l) for l in labelvalues)
RecursionError: maximum recursion depth exceeded while getting the str of an object
Fatal Python error: Cannot recover from stack overflow.

Any suggestions on what I might be doing wrong?

@akx
Contributor Author
akx commented Nov 13, 2018

Ugh, that sounds bad.

An educated guess is that a GC collection can occur even within the same thread the GC callback is already running in, so the callback gets called again, and... well, you can guess the recursive rest.

The quickest fix (if the above guess is correct) is to add a re-entrancy guard to the GC callback function, so it does nothing if it's already running. I'm not sure whether that's safe to do from within a GC callback, but adding a pair of gc.disable() and gc.enable() calls should also work.

On the bright side: You're not doing anything wrong there, @charan28.
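
A sketch of the re-entrancy guard mentioned above (illustrative only, not the actual patch): the callback sets a flag while it runs and bails out immediately if it is re-entered.

def make_guarded_callback(inner_cb):
    # Wrap a gc callback so a re-entrant invocation becomes a no-op.
    state = {'running': False}

    def _cb(phase, info):
        if state['running']:
            return               # already inside the callback: do nothing
        state['running'] = True
        try:
            inner_cb(phase, info)
        finally:
            state['running'] = False

    return _cb

# usage: gc.callbacks.append(make_guarded_callback(original_cb))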

@akx
Contributor Author
akx commented Nov 13, 2018

@charan28 Assuming my hypothesis is right, the above ☝️ PR should fix this situation.

@xudifsd
xudifsd commented Jan 31, 2019

We have encountered this deadlock in a production environment; the Python stack trace is:

Current thread 0x00007f1a14b31700 (most recent call first):
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 148 in labels
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/gc_collector.py", line 72 in _cb
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 179 in _multi_samples
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/metrics.py", line 68 in collect
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/registry.py", line 75 in collect
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/exposition.py", line 89 in generate_latest
  File "/usr/local/lib/python3.7/site-packages/prometheus_client/exposition.py", line 45 in prometheus_app
  File "/usr/local/lib/python3.7/wsgiref/handlers.py", line 137 in run
  File "/usr/local/lib/python3.7/wsgiref/simple_server.py", line 133 in handle
  File "/usr/local/lib/python3.7/socketserver.py", line 717 in __init__
  File "/usr/local/lib/python3.7/socketserver.py", line 357 in finish_request
  File "/usr/local/lib/python3.7/socketserver.py", line 344 in process_request
  File "/usr/local/lib/python3.7/socketserver.py", line 313 in _handle_request_noblock
  File "/usr/local/lib/python3.7/socketserver.py", line 234 in serve_forever
  File "/job_exporter/main.py", line 129 in main
  File "/job_exporter/main.py", line 162 in <module>

You can see the C stack trace here. Our code is here. We are using the latest code from pip3. I learned about this issue from the comment in the code.

Still looking into what happened.

@xudifsd
xudifsd commented Jan 31, 2019

It seems this and this acquire the same lock: GC can interrupt a call to metrics._multi_samples, and during the GC callback it will call metrics.labels. It should be an RLock.
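
For context, the behavioral difference (standard library only): a plain threading.Lock cannot be re-acquired by the thread that already holds it, while a threading.RLock can.

import threading

rlock = threading.RLock()
with rlock:
    with rlock:                          # fine: RLock is reentrant for the owning thread
        print('re-acquired RLock in the same thread')

lock = threading.Lock()
lock.acquire()
print(lock.acquire(blocking=False))      # False: a plain Lock cannot be re-acquired;
lock.release()                           # a blocking acquire here would deadlock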

@brian-brazil
Contributor

Fixed by #371
