Redesign shutdown logic to support rolling restarts using so_reuseport #902

JelteF · 2023-07-24T09:18:13Z

In 1.20 we announced deprecation of online restarts using the -R flag and recommended for people to use so_reuseport instead. But the safe shutdown logic that we executed when receiving a SIGINT signal didn't actually allow for zero-downtime rolling restarts.

This PR changes the behaviour of our shutdown signals to allow for rolling restarts. The changes that this PR makes are:

SIGTERM does not do "immediate shutdown" anymore, but instead triggers a newly introduced "super safe shutdown". This "super safe shutdown" waits for all clients to disconnect before shutting down the server. This can be used for rolling restarts as described in the newly added documentation. A second SIGTERM will cause an "immediate shutdown".
SIGINT mostly keeps doing what it was doing previously: waiting until all servers are released before shutting down PgBouncer. But it incorporates a few improvements from the new SIGTERM logic:
1. It also stops listening for new connections while PgBouncer is waiting for servers to be released. This does have the downside that SIGUSR2 cannot be used to cancel a safe shutdown anymore, but that was quite a weird/niche feature anyway.
2. Clients that have released their server and then do another query will be disconnected, instead of putting them in pause mode until either query_wait_timeout triggers or PgBouncer shuts down. This way they can reconnect quicker to another PgBouncer, because they were never going to get a new server anyway.
3. A that a second SIGINT will trigger an "immediate shutdown"
SIGQUIT handling is introduced and now causes an immediate shutdown.

Since this changes the existing behaviour of SIGTERM and SIGINT, this can be considered a breaking change. But the amount of breakage should be minimal since most of the time people would not want an immediate shutdown, and the new SIGINT behaviour is closer to the behaviour most people would expect.

Related to #894
Related to #900

Since this changes SIGTERM to not do a fast shutdown anymore this also indirectly addresses #806 and #770.

Fixes #806
Fixes #770

JelteF · 2023-07-24T09:26:10Z

src/main.c

@@ -481,8 +481,14 @@ static void handle_sigint(evutil_socket_t sock, short flags, void *arg)
 		die("takeover was in progress, going down immediately");
 	if (cf_pause_mode == P_SUSPEND)
 		die("suspend was in progress, going down immediately");
-	cf_pause_mode = P_PAUSE;
-	cf_shutdown = 1;
+	if (cf_so_reuseport) {


Point of discussion: Do we want to piggy-back on SIGINT? Or should we use a new signal for this behaviour.

Prefer new signal, see above.

In pgbouncer#902 I want to introduce a 4th shutdown value. So I started using an enum for this to improve readibility. I think this refactor from using integer values to enum values stands on its own. So to make reviewing easier, and not have it be blocked on adding tests and some discussions for pgbouncer#902 this PR includes only that refactor.

In #902 I want to introduce a 4th shutdown value. So I started using an enum for this to improve readibility. I think this refactor from using integer values to enum values stands on its own. So to make reviewing easier, and not have it be blocked on adding tests and some discussions for #902 this PR includes only that refactor.

We were using the full 64 bits of the BackendKeyData message as random bytes. This turned out to be arguably incorrect, because the first 32 bits are used by PostgreSQL as a Process ID (and this is part of the [protocol documentation too][1]). Actual PIDs are always positive, but we also put a random bit in the sign bit so were setting it to negative numbers half of the time. For most cases this does not matter, but it turned out that [`pg_basebackup` relies on the PID part to actually be positive][2]. While it seems semi-likely that `pg_basebackup` will be fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it. Since this change requires changing the bytes of the cancel key in which we encode the `peer_id`, this change breaks cancelations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see pgbouncer#902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal. [1]: https://www.postgresql.org/docs/current/protocol-message-formats.html [2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com

petere

This all looks like a very good direction to me. We need work out how to handle the signals.

petere · 2023-10-04T06:51:11Z

doc/usage.md

+
+If you have more than two processes, repeat the steps 2, 3 and 4 for each of
+them.
+


This procedure somehow emphasizes two processes, and then says that you can have more than two at the end. But the second process isn't used during the procedure. So I think this could be written like this:

Have two or more PgBouncer processes running using so_reuseport and peering. Pick one process to upgrade, let's call it A.

Send SIGINT to process A.

Cause all clients reconnect. [...] Once all clients have reconnected. Process A will exit automatically, because no clients are connected to it anymore.

Start process A again

Repeat all steps for the remaining PgBouncer processes, one at a time.

Also, is peering required for this, or just recommended? (Using so_reuseport is required, as the code is currently written.) It might be good to be precise about that.

I improved the description of the steps.

petere · 2023-10-04T07:14:10Z

doc/usage.md

+:   Safe shutdown. If `so_reuseport` is set to 0, then this is the same as
+    issuing **PAUSE** and **SHUTDOWN** on the console. If `so_reuseport` is set
+    to 1 then this is the same as issuing **SHUTDOWN WAIT_FOR_CLIENTS** on
+    the console.



Hmm, we should use a separate signal for this. Otherwise, it's too much magic.

I think the best way would be:

Keep SIGINT as is.

Move the current SIGTERM behavior to SIGQUIT.

This new behavior goes to SIGTERM.

It's a compatibility break, but it has some advantages. It restores the intended progression that SIGTERM is "softer" than SIGINT. And it creates some consistency with the postgres/pg_ctl shutdown modes.

We could even have a setting like signal_compat that keeps the old signal assignments. Not sure if it's needed, but it would be possible pretty easily.

While I understand the desire to be in line with how Postgres uses signals, I don't think breaking compatibility in such a big way is worth it. Changing SIGTERM would start making a lot of Docker and SystemD stop commands suddenly change their behaviour. I don't think that's something we should be doing.

The reason I suggested changing SIGINT, is because if so_reuseport because there are multiple processes listening is that the current SIGINT behaviour is actively harmful. Because the PgBouncer process is still accepting new connections only to then send an error on them, while one of the other listening procesess might still be accepting connections.
Although I guess it can still make sense if you send SIGINT to all of the processes at the same time.

I just checked:

https://hub.docker.com/r/bitnami/pgbouncer/ uses SIGTERM (afaict)

pgdg apt systemd service file uses SIGINT

pgdg rpm systemd service file uses SIGINT

So changing behaviour of any of them seems pretty likely to cause problems in production for some users. I think assigning the new behaviour to SIGQUIT is the least disruptive.

It would be possible to add some config compat flag, to change the meaning of the signals to match postgres its intent. I think that's only likely to cause more confusion, because now you have to know what the config is to know how to stop pgbouncer in the way that you want.

I just checked:

https://hub.docker.com/r/bitnami/pgbouncer/ uses SIGTERM (afaict)

I didn't find the source code for this, but if it's true, then I think it's possible that it's a mistake, possibly exactly because pgbouncer does SIGTERM and SIGINT backwards.

pgdg apt systemd service file uses SIGINT

pgdg rpm systemd service file uses SIGINT

So changing behaviour of any of them seems pretty likely to cause problems in production for some users.

My proposal would leave SIGINT unchanged, and almost everybody should be using SIGINT normally.

I'm not saying I'm 100% comfortable, but it doesn't seem too bad.

Okay I implemented this proposal now, and if we put a warning in the release notes I think we should be fine.

The only issue I have is that Windows does not have SIGQUIT, so now you cannot do an immediate shutdown in windows anymore except for using the SHUTDOWN command. Which might be fine, but since sending SIGTERM will now not allow you to connect anymore (due to it closing sockets), there's no way to "upgrade" the shutdown behaviour to a SIGQUIT signal.

One option would be to leave the signal behaviour for Windows unchanged.

Okay, I think I found (and implemented) a rather elegant solution to the windows problem: If a shutdown attempt is already in progress and then a new SIGTERM signal is received, then we will trigger an immediate shutdown.

Hmm, turns out windows doesn't actually send SIGTERM requests. So this was a non-issue. Still, it seems quite nice to have this "immediate" shutdown happen on repeated signals. I think I'd actually go so far as also wanting to implement this for SIGINT. Having double Ctrl+C do an immediate shutdown seems really nice for interactive usages.

I implemented "double SIGINT" == "immediate shutdown"

petere · 2023-10-04T07:15:48Z

include/bouncer.h

+	 *
+	 * This allows for a rolling restart in combination with so_reuseport.
+	 */
+	SHUTDOWN_WAIT_FOR_CLIENTS,


I suppose "wait for clients" kind of includes "wait for servers"? Not sure if that is worth writing down, just thinking out loud.

I suppose "wait for clients" kind of includes "wait for servers"?

This wasn't the case before, but with my most recent changes it does. I updated the comment to reflect this.

src/pooler.c

petere · 2023-10-04T07:27:40Z

src/main.c

@@ -481,8 +481,14 @@ static void handle_sigint(evutil_socket_t sock, short flags, void *arg)
 		die("takeover was in progress, going down immediately");
 	if (cf_pause_mode == P_SUSPEND)
 		die("suspend was in progress, going down immediately");
-	cf_pause_mode = P_PAUSE;
-	cf_shutdown = 1;
+	if (cf_so_reuseport) {


Prefer new signal, see above.

petere · 2023-10-04T07:30:57Z

src/admin.c

-	/* avoid surprise later if cf_shutdown stays set */
-	if (cf_shutdown) {
+	/*
+	 * Avoid surprises later if cf_shutdown stays set to


Maybe if we're rewriting this comment anyway, we can use a more precise language than "avoid surprises"? I for one don't know what surprise we are trying to avoid.

It was meant to avoid a future pause to cause the server to be shutdown when sending RESUME during a safe shutdown. I removed this code (and thus also the comment), because resuming from a shutdown is impossible now that we're closing the listening sockets at the start of the shutdown.

src/admin.c

We were using the full 64 bits of the BackendKeyData message as random bytes. This turned out to be arguably incorrect, because the first 32 bits are used by PostgreSQL as a Process ID (and this is part of the [protocol documentation too][1]). Actual PIDs are always positive, but we also put a random bit in the sign bit so were setting it to negative numbers half of the time. For most cases this does not matter, but it turned out that [`pg_basebackup` relies on the PID part to actually be positive][2]. While it seems semi-likely that `pg_basebackup` will be fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it. Since this change requires changing the bytes of the cancel key in which we encode the `peer_id`, this change breaks cancelations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see pgbouncer#902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal. [1]: https://www.postgresql.org/docs/current/protocol-message-formats.html [2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com

We were using the full 64 bits of the BackendKeyData message as random bytes. This turned out to be arguably incorrect, because the first 32 bits are used by PostgreSQL as a Process ID (and this is part of the [protocol documentation too][1]). Actual PIDs are always positive, but we also put a random bit in the sign bit so were setting it to negative numbers half of the time. For most cases this does not matter, but it turned out that [`pg_basebackup` relied on the PID part to actually be positive][2]. While `pg_basebackup` is now fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it. Since this change requires changing the bytes of the cancel key in which we encode the `peer_id`, this change breaks cancelations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see #902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal. [1]: https://www.postgresql.org/docs/current/protocol-message-formats.html [2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com

All feedback is now addressed

murugiaha · 2024-05-28T18:17:14Z

@JelteF any idea which version will this feature be available in?

eulerto · 2024-05-28T18:25:28Z

@murugiaha In the next one. There is no release plans yet.

murugiaha · 2024-05-28T18:31:32Z

@murugiaha In the next one. There is no release plans yet.

Thanks for the quick response. Much appreciated.

JelteF · 2024-05-28T22:05:25Z

My expectation for the next release would be somewhere this summer, but no promises.

JelteF commented Jul 24, 2023

View reviewed changes

JelteF force-pushed the rolling-restart branch 2 times, most recently from 767b4b9 to d53fd21 Compare August 16, 2023 14:34

JelteF mentioned this pull request Aug 16, 2023

refactor: Use an enum for shutdown values #926

Merged

JelteF force-pushed the rolling-restart branch 2 times, most recently from beb2693 to 9a5a806 Compare August 18, 2023 08:27

JelteF requested review from petere and eulerto August 18, 2023 08:27

JelteF force-pushed the rolling-restart branch from 9a5a806 to 5f13f15 Compare August 18, 2023 08:30

JelteF removed request for petere and eulerto August 18, 2023 09:06

JelteF force-pushed the rolling-restart branch from 5f13f15 to cfcd708 Compare August 18, 2023 09:16

JelteF requested review from petere and eulerto August 18, 2023 09:16

JelteF mentioned this pull request Sep 6, 2023

Only use positive numbers in the PID part of the cancel key #945

Merged

petere previously requested changes Oct 4, 2023

View reviewed changes

JelteF mentioned this pull request Feb 2, 2024

online reboot via systemd #593

Closed

JelteF force-pushed the rolling-restart branch from cfcd708 to 1daae3b Compare February 7, 2024 14:07

JelteF mentioned this pull request Feb 12, 2024

Prestop hook for pgbouncer deployment. #1022

Open

JelteF force-pushed the rolling-restart branch 3 times, most recently from 85f15d3 to df78847 Compare February 12, 2024 16:32

JelteF force-pushed the rolling-restart branch 2 times, most recently from efe4a6a to 70d6414 Compare February 20, 2024 21:50

JelteF force-pushed the rolling-restart branch from 70d6414 to 8ad56fb Compare May 6, 2024 11:03

JelteF changed the title ~~Better support for rolling restarts using so_reuseport~~ Support for rolling restarts using so_reuseport May 6, 2024

JelteF force-pushed the rolling-restart branch 3 times, most recently from a7c34dd to f866b57 Compare May 6, 2024 12:51

JelteF changed the title ~~Support for rolling restarts using so_reuseport~~ Redesign shutdown logic to support rolling restarts using so_reuseport May 6, 2024

JelteF force-pushed the rolling-restart branch 6 times, most recently from d762825 to c2ec342 Compare May 6, 2024 14:21

Support rolling restarts

b5337e7

JelteF force-pushed the rolling-restart branch from c2ec342 to b5337e7 Compare May 6, 2024 14:25

JelteF enabled auto-merge (squash) May 6, 2024 14:31

JelteF merged commit de33cd0 into pgbouncer:master May 6, 2024
7 checks passed

durdina mentioned this pull request May 27, 2024

Improve documentation on connecting to AWS RDS and SCRAM #1083

Merged

JelteF mentioned this pull request Jun 27, 2024

v1.23.0 #1095

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redesign shutdown logic to support rolling restarts using so_reuseport #902

Redesign shutdown logic to support rolling restarts using so_reuseport #902


		If you have more than two processes, repeat the steps 2, 3 and 4 for each of
		them.

Redesign shutdown logic to support rolling restarts using so_reuseport #902

Redesign shutdown logic to support rolling restarts using so_reuseport #902

Conversation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment