[go: up one dir, main page]

Page MenuHomePhabricator

Control mw-on-k8s periodic maintenance jobs with an etcd value
Open, Needs TriagePublic

Description

Presently, maintenance jobs are systemd timers. They run in both DCs, but inside a wrapper script that first checks the MW config state in etcd to make sure it's running in the current primary_dc and that DC is not read-only; otherwise, it quits without invoking the maintenance script.

Today, when we do a DC switchover, we kill the jobs that are currently running but we don't actually prevent new jobs from starting. That means in the time between stopping maintenance, and setting read-only, new maintenance jobs can start, and they'll run into trouble when the RW DC disappears out from under them. This race condition adds some more coordination complexity, and a little rush, to the early stages of the switchover, which we could eliminate.

When we move periodic maintenance jobs to Kubernetes CronJobs, we'll add an additional config value to etcd specifically for controlling them. There are a few different ways we might structure that, each with pros and cons:

  1. A new string value maintenance_dc that could be eqiad, codfw, or none. (In an eqiad-codfw switch, we'd change it from eqiad to none, then to codfw.) The downside is this is partly redundant with the primary_dc value; primary_dc: eqiad; maintenance_dc: codfw is a nonsense state that we shouldn't be able to encode.
  2. A new map value maintenance_enabled like {eqiad: true, codfw: false}. (In an eqiad-codfw switch, we'd change it to {eqiad: false, codfw: false}, then to {eqiad: false, codfw: true}.) This has the same downside as #1, plus it allows the additional nonsense state {eqiad: true, codfw: true}. (On the other hand, read_only has a similar structure already.)
  3. A new boolean value maintenance_enabled which would be true most of the time and false during a switchover. (In an eqiad-codfw switch, we'd change it from true to false, then switch the primary_dc, then change it back to true.) We'd get the DC state from primary_dc, and check maintenance_enabled along with (or instead of) read_only. The only nonsense state here is if maintenance is enabled while the primary DC is RO, and we can mitigate that by checking both.

Clearly I'm leaning toward #3 for simplicity, but open to other possibilities. (I prefer "enabled" to "disabled," because maintenance_disabled: false is an unnecessarily confusing double negative. Unfortunately that means the sense is reversed from read_only but I think that's okay.)

Event Timeline

#3 sounds like the most sane option to me as well, and enables us to halt maintenance for other reasons than a DC switchover or a read-only window. Maintenance should check read_only state anyways as part of the pre-flight.

Scott_French subscribed.

+1 to option #3 as the most sensible / obvious one: adding something more complex than a single global boolean invites odd nonsense states in combination with read-only (currently a per-DC toggle) and primary DC.

I'd like to make this happen soon, and also wire it into the preflight checks in mw-cli-wrapper.py so that we can us it in the upcoming switchover.

Reminder to self: once live, wire this into the 01-stop-maintenance.py and 08-start-maintenance.py cookbooks.

See also T359130 for the cookbook work. We aren't as far as I expected we'd be, so we can revisit which of those steps for cronjobs-on-k8s need to be accomplished before the switchover, but I agree it's a good idea to add this to mw-cli-wrapper.py in the meantime and start using it.

Thanks, Reuven!

Agreed, yeah: Some subset of those items will need done before the switchover, but exactly which subset depends on how far we expect things to be by then. I'll follow up on the task shortly.

I took some time earlier today to sketch out what putting maintenance-enabled state in etcd might look like, and have some thoughts ...

A hypothetical "path of least resistance" approach could look something like: introduce a new mwconfig-type object [0] and associated schema, wire that into the mediawiki.yaml confd template [1] instantiated by P:conftool::state and already consumed by mw-cli-wrapper.py, and update the latter to now check the new field.

Now, one significant issue with that is that there's no good reason for this to be a mwconfig object, as it's not actually consumed by MediaWiki (and there isn't a clear future use case for doing so).

Also, it would be nice to take a step back and think a bit about how the k8s case would consume this, along with mwconfig state, and gate as needed.

I tend to be of the opinion that adding direct clients of etcd is something to be avoided, so sticking with a config file written by confd seems to be the right approach.

How about something like the following:

  • Rather than using the mwconfig type, introduce a new type for maintenance-related dynamic state, which includes enabledness.
  • Introduce a new confd template (referencing objects from both the new type and mwconfig), instantiated by a new puppet profile, that renders a combined configuration file for mw-cli-wrapper.py.
  • Add the profile to maintenance hosts and migrate the wrapper to the new configuration (i.e., get it out of the business of following mediawiki.yaml directly).
  • In the k8s case, add the new profile to the deployment host, along with a simple script to sync the configuration to a ConfigMap object in the mw-script namespace, with the latter mounted into the job's container.

This would do the trick, without muddying mwconfig and providing a fairly(?) straightforward path to making the same wrapper work on k8s (T341555).

Note: mw-cli-wrapper.py will need to lose the check_confd_template call if we go this route. Fun fact, though: that check does nothing, and probably has not for some time, given that the confd::file resource in P:conftool::state doesn't specify a check command (so no error state files will ever be produced).

Thoughts?

[0] https://gerrit.wikimedia.org/g/operations/puppet/+/8fae5805c21c77137273d3c86eeef719eb267c4a/modules/profile/files/conftool/schema.yaml#19

[1] https://gerrit.wikimedia.org/g/operations/puppet/+/8fae5805c21c77137273d3c86eeef719eb267c4a/modules/profile/templates/conftool/state-mediawiki.tmpl.erb

Unassigning this, as it's not something I'm planning to work on in the near future (i.e., it was not needed to the switchover).

Also, there's ongoing work for T341555 that may supersede this in some form.