Redesign shutdown logic to support rolling restarts using so_reuseport #902
Conversation
src/main.c
Outdated
@@ -481,8 +481,14 @@ static void handle_sigint(evutil_socket_t sock, short flags, void *arg)
		die("takeover was in progress, going down immediately");
	if (cf_pause_mode == P_SUSPEND)
		die("suspend was in progress, going down immediately");
	cf_pause_mode = P_PAUSE;
	cf_shutdown = 1;
	if (cf_so_reuseport) {
Point of discussion: Do we want to piggy-back on SIGINT? Or should we use a new signal for this behaviour.
Prefer new signal, see above.
In #902 I want to introduce a 4th shutdown value, so I started using an enum for this to improve readability. I think this refactor from using integer values to enum values stands on its own. So to make reviewing easier, and to not have it be blocked on adding tests and some discussions for #902, this PR includes only that refactor.
We were using the full 64 bits of the BackendKeyData message as random bytes. This turned out to be arguably incorrect, because the first 32 bits are used by PostgreSQL as a Process ID (and this is part of the [protocol documentation too][1]). Actual PIDs are always positive, but we also put a random bit in the sign bit, so we were setting it to negative numbers half of the time. For most cases this does not matter, but it turned out that [`pg_basebackup` relies on the PID part to actually be positive][2]. While it seems semi-likely that `pg_basebackup` will be fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it. Since this change requires changing the bytes of the cancel key in which we encode the `peer_id`, this change breaks cancellations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see #902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal. [1]: https://www.postgresql.org/docs/current/protocol-message-formats.html [2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com
This all looks like a very good direction to me. We need to work out how to handle the signals.
If you have more than two processes, repeat the steps 2, 3 and 4 for each of
them.
This procedure somehow emphasizes two processes, and then says that you can have more than two at the end. But the second process isn't used during the procedure. So I think this could be written like this:

1. Have two or more PgBouncer processes running using `so_reuseport` and peering. Pick one process to upgrade, let's call it A.
2. Send `SIGINT` to process A.
3. Cause all clients to reconnect. [...] Once all clients have reconnected, process A will exit automatically, because no clients are connected to it anymore.
4. Start process A again.
5. Repeat all steps for the remaining PgBouncer processes, one at a time.

Also, is peering required for this, or just recommended? (Using `so_reuseport` is required, as the code is currently written.) It might be good to be precise about that.
I improved the description of the steps.
: Safe shutdown. If `so_reuseport` is set to 0, then this is the same as
  issuing **PAUSE** and **SHUTDOWN** on the console. If `so_reuseport` is set
  to 1 then this is the same as issuing **SHUTDOWN WAIT_FOR_CLIENTS** on
  the console.
Hmm, we should use a separate signal for this. Otherwise, it's too much magic.
I think the best way would be:
- Keep SIGINT as is.
- Move the current SIGTERM behavior to SIGQUIT.
- This new behavior goes to SIGTERM.
It's a compatibility break, but it has some advantages. It restores the intended progression that SIGTERM is "softer" than SIGINT. And it creates some consistency with the postgres/pg_ctl shutdown modes.
We could even have a setting like `signal_compat` that keeps the old signal assignments. Not sure if it's needed, but it would be possible pretty easily.
While I understand the desire to be in line with how Postgres uses signals, I don't think breaking compatibility in such a big way is worth it. Changing SIGTERM would make a lot of Docker and systemd stop commands suddenly change their behaviour. I don't think that's something we should be doing.

The reason I suggested changing SIGINT is that when `so_reuseport` is used, the current SIGINT behaviour is actively harmful, because there are multiple processes listening: the PgBouncer process is still accepting new connections only to then send an error on them, while one of the other listening processes might still be accepting connections.

Although I guess it can still make sense if you send SIGINT to all of the processes at the same time.
I just checked:
- https://hub.docker.com/r/bitnami/pgbouncer/ uses SIGTERM (afaict)
- pgdg apt systemd service file uses SIGINT
- pgdg rpm systemd service file uses SIGINT
So changing behaviour of any of them seems pretty likely to cause problems in production for some users. I think assigning the new behaviour to SIGQUIT is the least disruptive.
It would be possible to add some config compat flag to change the meaning of the signals to match Postgres's intent. I think that's only likely to cause more confusion, because now you have to know what the config is to know how to stop pgbouncer in the way that you want.
> I just checked:
> - https://hub.docker.com/r/bitnami/pgbouncer/ uses SIGTERM (afaict)

I didn't find the source code for this, but if it's true, then I think it's possible that it's a mistake, possibly exactly because pgbouncer does SIGTERM and SIGINT backwards.

> - pgdg apt systemd service file uses SIGINT
> - pgdg rpm systemd service file uses SIGINT
>
> So changing behaviour of any of them seems pretty likely to cause problems in production for some users.

My proposal would leave SIGINT unchanged, and almost everybody should be using SIGINT normally.

I'm not saying I'm 100% comfortable, but it doesn't seem too bad.
Okay, I implemented this proposal now, and if we put a warning in the release notes I think we should be fine.

The only issue I have is that Windows does not have SIGQUIT, so now you cannot do an immediate shutdown on Windows anymore, except by using the SHUTDOWN command. Which might be fine, but since sending SIGTERM will now not allow you to connect anymore (due to it closing sockets), there's no way to "upgrade" the shutdown behaviour to a SIGQUIT signal.

One option would be to leave the signal behaviour for Windows unchanged.
Okay, I think I found (and implemented) a rather elegant solution to the Windows problem: if a shutdown attempt is already in progress and a new SIGTERM signal is received, then we trigger an immediate shutdown.
Hmm, turns out Windows doesn't actually send SIGTERM signals. So this was a non-issue. Still, it seems quite nice to have this "immediate" shutdown happen on repeated signals. I think I'd actually go so far as also wanting to implement this for SIGINT. Having a double Ctrl+C do an immediate shutdown seems really nice for interactive usage.
I implemented "double SIGINT" == "immediate shutdown"
 *
 * This allows for a rolling restart in combination with so_reuseport.
 */
SHUTDOWN_WAIT_FOR_CLIENTS,
I suppose "wait for clients" kind of includes "wait for servers"? Not sure if that is worth writing down, just thinking out loud.
I suppose "wait for clients" kind of includes "wait for servers"?
This wasn't the case before, but with my most recent changes it does. I updated the comment to reflect this.
src/admin.c
Outdated
/* avoid surprise later if cf_shutdown stays set */
if (cf_shutdown) {
	/*
	 * Avoid surprises later if cf_shutdown stays set to
Maybe if we're rewriting this comment anyway, we can use a more precise language than "avoid surprises"? I for one don't know what surprise we are trying to avoid.
It was meant to avoid a future PAUSE causing the server to be shut down after sending RESUME during a safe shutdown. I removed this code (and thus also the comment), because resuming from a shutdown is impossible now that we're closing the listening sockets at the start of the shutdown.
We were using the full 64 bits of the BackendKeyData message as random bytes. This turned out to be arguably incorrect, because the first 32 bits are used by PostgreSQL as a Process ID (and this is part of the [protocol documentation too][1]). Actual PIDs are always positive, but we also put a random bit in the sign bit, so we were setting it to negative numbers half of the time. For most cases this does not matter, but it turned out that [`pg_basebackup` relied on the PID part to actually be positive][2]. While `pg_basebackup` is now fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it. Since this change requires changing the bytes of the cancel key in which we encode the `peer_id`, this change breaks cancellations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see #902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal. [1]: https://www.postgresql.org/docs/current/protocol-message-formats.html [2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com
@JelteF any idea which version this feature will be available in?

@murugiaha In the next one. There are no release plans yet.

Thanks for the quick response. Much appreciated.

My expectation for the next release would be somewhere this summer, but no promises.
In 1.20 we announced deprecation of online restarts using the `-R` flag and recommended for people to use `so_reuseport` instead. But the safe shutdown logic that we executed when receiving a `SIGINT` signal didn't actually allow for zero-downtime rolling restarts.

This PR changes the behaviour of our shutdown signals to allow for rolling restarts. Among the changes: clients waiting for a server now wait only until `query_wait_timeout` triggers or PgBouncer shuts down. This way they can reconnect quicker to another PgBouncer, because they were never going to get a new server anyway.

Since this changes the existing behaviour of SIGTERM and SIGINT, this can be considered a breaking change. But the amount of breakage should be minimal, since most of the time people would not want an immediate shutdown, and the new SIGINT behaviour is closer to the behaviour most people would expect.
Related to #894
Related to #900
Since this changes SIGTERM to not do a fast shutdown anymore, this also indirectly addresses #806 and #770.
Fixes #806
Fixes #770