Reset multiproc values when the pid changes. by brian-brazil · Pull Request #169 · prometheus/client_python


Merged: 1 commit merged into master on Jul 19, 2017

Conversation

brian-brazil
Contributor

This puts us at ~2000ns for an operation, which is ~300ns slower than previously,
but still slightly better than we were before we switched to a single lock.

This is twice as slow as without multiproc mode.

@jkemp101 commented Jun 30, 2017

I did some testing and gauge files would sometimes get left behind. There is a race condition where the MultiProcessCollector can create files after they have been deleted. This is how I worked around it: multi-lock...closeio:multi-lock

So, gauge livesums could remain. Counters are getting a new file for each pid, which I guess will stay around forever? Can we use a single file for those?

@brian-brazil
Contributor Author

> I did some testing and gauge files would sometimes get left behind. There is a race condition where the MultiProcessCollector can create files after they have been deleted.

Can you expand on this?

> Counters are getting a new file for each pid which I guess will stay around forever?

It's intended that you clean them out when the process restarts.

> Can we use a single file for those?

No, that'd require coordination between pids, which I'm trying to avoid for performance reasons.
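
For context, the "clean them out" step amounts to wiping the multiprocess directory before the master starts any workers. A minimal sketch, assuming the directory is configured via the prometheus_multiproc_dir environment variable (the helper name is made up):

```python
# Minimal sketch of the "clean them out when the process restarts" step.
# Assumes the multiprocess directory is set via the prometheus_multiproc_dir
# environment variable; names here are illustrative.
import glob
import os

def clear_multiproc_dir():
    """Delete leftover metric files before the master forks any workers."""
    multiproc_dir = os.environ['prometheus_multiproc_dir']
    for path in glob.glob(os.path.join(multiproc_dir, '*.db')):
        os.remove(path)

if __name__ == '__main__':
    clear_multiproc_dir()
    # ... then start gunicorn / the worker processes ...
```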

@jkemp101

Test setup:

  1. Gunicorn: worker=gevent, workers=4, and max_requests=5
  2. Ran while [ 1 ]; do curl -s http://127.0.0.1:5020/metrics/ ; done; for about 30 seconds
    1. That endpoint returns generate_latest(registry)
    2. In post_request I update a counter and gauge metric
    3. This setup causes workers to have a short lifetime and stresses metric creation/reporting
  3. When I stopped the curl command, gauge livesum files would be left for pids that are no longer running.

The race condition is caused by MultiProcessCollector.collect getting a list of files and then iterating over it to tally all of the values. But the worker for a given file might have exited between getting the list of filenames and getting the values out of that file. This causes a new file to be created after the "real" one has been deleted by the worker's child_exit. MultiProcessCollector should never create files if they don't exist.
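
This isn't the actual collector code, but a hypothetical sketch of the sequence; the collect() helper, file pattern, and error handling are invented for clarity:

```python
# Hypothetical illustration of the race described above; not the real
# MultiProcessCollector implementation.
import glob
import os

def collect(multiproc_dir):
    paths = glob.glob(os.path.join(multiproc_dir, '*.db'))
    # <-- a worker can exit right here: its child_exit hook removes its
    #     gauge_live*.db files, but `paths` still lists them
    totals = {}
    for path in paths:
        try:
            # Open read-only: a create-if-missing open at this point would
            # silently resurrect an empty file for a pid that is gone.
            with open(path, 'rb') as f:
                data = f.read()
        except FileNotFoundError:
            continue  # the worker already cleaned up; skip, don't recreate
        # ... parse `data` and merge its samples into `totals` ...
    return totals
```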

The counter files are worrisome. On active servers these could pile up. Also, I prefer not to create too many local files when I'm running inside containers; this could easily get into the 100s or 1000s of files. Could we do something like this (sketched in code below):

  1. On worker exit the counter file gets renamed from counter_123.db to counter_123.db.dead
  2. The first step of MultiProcessCollector is to take all *.dead files and summarize them into a summed.db file. Then it deletes the *.dead files. (This has race conditions all over it)
  3. The remaining MultiProcessCollector code sums up the individual live counter files and the single summed file.

Need to think about this more, just thinking out loud here.
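
Purely as an illustration of that idea, not a proposal for the real file format: read_values()/write_values() below stand in for the mmap'd binary format using JSON, and the locking problems mentioned above are deliberately ignored.

```python
# Rough sketch of the ".dead" compaction idea above; all names and the
# JSON stand-in format are illustrative.
import glob
import json
import os
from collections import Counter

def read_values(path):
    """Stand-in for parsing a metric file into {sample_key: value}."""
    with open(path) as f:
        return json.load(f)

def write_values(path, values):
    with open(path, 'w') as f:
        json.dump(dict(values), f)

def compact_dead_counters(multiproc_dir):
    summed_path = os.path.join(multiproc_dir, 'counter_summed.db')
    totals = Counter(read_values(summed_path)) if os.path.exists(summed_path) else Counter()
    for path in glob.glob(os.path.join(multiproc_dir, 'counter_*.db.dead')):
        for key, value in read_values(path).items():
            totals[key] += value        # fold the dead worker's counts in
        os.remove(path)                 # the *.dead file is no longer needed
    write_values(summed_path, totals)
    # collect() would then sum the live counter_<pid>.db files plus counter_summed.db
```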

@brian-brazil
Contributor Author

First off, the good news is that it sounds like the code is working as expected.

I haven't given too much thought to high churn situations. Do you have a notion of the performance impact?

@jkemp101 commented Jul 3, 2017

The fixes in this PR did seem to keep the files aligned with the worker process pids, so that is good.

I didn't really look at performance. I don't think the fixes I made would have any impact on the performance of this PR. There is just the one if statement to check whether the memory-mapped file should be opened read-only.
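
As an illustration only, not the exact change in the closeio branch, such a guard might look roughly like this; the read_mode flag and the initialisation details here are assumptions:

```python
# Hypothetical sketch of a read-only guard in the file wrapper; not the
# actual patch.
import mmap
import os

class MmapedDict(object):
    def __init__(self, filename, read_mode=False):
        if read_mode:
            # The collector only reads, so it must never create a file that
            # a dead worker's child_exit hook has already deleted.
            self._f = open(filename, 'rb')
            self._m = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        else:
            # Workers own their files and may create them on first write.
            self._f = open(filename, 'a+b')
            if os.fstat(self._f.fileno()).st_size == 0:
                self._f.truncate(mmap.PAGESIZE)
            self._m = mmap.mmap(self._f.fileno(), 0)
```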

Also, the race condition I was experiencing was caused by the high churn rate in my test. But the likelihood of this failure would also increase as more and more metric files are left to be processed. That would increase the time between getting the list of files and processing the last file, which would increase the chance that the collector creates a file it shouldn't.

@brian-brazil
Contributor Author

The question for now is whether the right metrics are being exposed; this is the first attempt at supporting all the various modes of multi-processing.

brian-brazil merged commit 50d6e0b into master on Jul 19, 2017
brian-brazil deleted the multi-lock branch on July 19, 2017 08:09
@jkemp101

I'm pretty sure that if the race condition is triggered, a gauge with multiprocess_mode='min' is going to return 0 when it potentially shouldn't.

@brian-brazil
Contributor Author

I don't think so, as the metric will be missing from the file rather than having the value 0.

@jkemp101

You are right. The potential issue would actually be with liveall, not min, because min doesn't delete the metric files. And as you said, the file is basically empty, so nothing gets counted/reported.
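
For readers following along, the modes under discussion are chosen per gauge. A minimal sketch using the client's Gauge API (metric names made up):

```python
from prometheus_client import Gauge

# livesum/liveall files are removed by mark_process_dead() when a worker
# exits, which is what the race above can interfere with.
in_flight = Gauge('inprogress_requests', 'Requests currently being handled',
                  multiprocess_mode='livesum')

# min/max/all keep every pid's file around, so nothing is deleted on exit
# and a deleted-then-recreated file isn't a concern for them.
queue_low_water = Gauge('queue_low_water', 'Smallest queue depth seen by any worker',
                        multiprocess_mode='min')
```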
