Race condition on puppetdb in sre.hosts.rename cookbook
Closed, Resolved (Public)

Description

While running the rename campaign for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, we've hit a couple of race conditions concerning debmonitor and puppetdb.

The sre.hosts.rename cookbook does:

self.debmonitor.host_delete(self.old_fqdn)
self.puppet_master.delete(self.old_fqdn)
self.puppet_server.delete(self.old_fqdn)

through spicerack. However, there is a small race window if a puppet run is in progress or starts during these steps. When that happens, the hosts are re-added to puppetdb and need to be manually cleaned up from both puppetdb and debmonitor.

Since we are supposed to reimage the host immediately with --new, I think we can safely begin the sre.hosts.rename cookbook by disabling puppet on the node, reducing the risk of hitting that window.
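The proposed ordering can be sketched as follows. This is only an illustration under stated assumptions: `Recorder` and `rename_cleanup` are hypothetical names invented for this example and are not the spicerack API, and the hostname is a placeholder. The only point is that the disable step runs before the two delete steps.

```python
class Recorder:
    """Hypothetical stand-in that records calls in order (not a spicerack class)."""

    def __init__(self):
        self.calls = []

    def record(self, name):
        # Return a callable that logs its name and arguments when invoked.
        def _call(*args):
            self.calls.append((name, args))
        return _call


def rename_cleanup(old_fqdn, disable_puppet, debmonitor_delete, puppetdb_delete):
    """Disable puppet first, then remove the old FQDN from each backend."""
    disable_puppet(old_fqdn, "host rename in progress")
    debmonitor_delete(old_fqdn)
    puppetdb_delete(old_fqdn)


rec = Recorder()
rename_cleanup(
    "old-host.example.org",  # placeholder hostname
    rec.record("disable_puppet"),
    rec.record("debmonitor_delete"),
    rec.record("puppetdb_delete"),
)
print([name for name, _ in rec.calls])
```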

CR coming shortly

Event Timeline

Change #1071588 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.hosts.rename: Disable puppet to avoid race-condition

https://gerrit.wikimedia.org/r/1071588

Tested via test-cookbook on mw2428 and mw2429 and they seem to have been correctly removed from both puppetdb and debmonitor.

Correction, it worked for puppetdb, but they got added back to debmonitor. Will investigate further.

The debmonitor ingestion isn't driven by Puppet, but by a systemd timer. I think simply running

systemctl mask debmonitor-client.timer

should fix it.

While the above is totally true, the probability that a rename+reimage happens exactly when the timer fires, which is once a day, is fairly low.

Your problem is not a Puppet run, and disabling puppet doesn't help at all with this: the apt-get update that re-populates Debmonitor comes from puppet-agent-timer.timer, which runs every 30m and for which the probability of a race condition is 48 times higher than with the debmonitor timer ;)

Feel free to stop/mask both of them.
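The "48 times" figure follows from the two timer frequencies quoted above. A quick check, assuming the race window is short compared to both intervals so that the chance of hitting it scales with the number of timer firings per day:

```python
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes):
    """Number of timer firings per day for a given interval in minutes."""
    return Fraction(MINUTES_PER_DAY, interval_minutes)


puppet_runs = runs_per_day(30)                   # puppet-agent-timer.timer, every 30m
debmonitor_runs = runs_per_day(MINUTES_PER_DAY)  # debmonitor timer, once a day

ratio = puppet_runs / debmonitor_runs
print(ratio)  # 48
```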

I don't think this is needed: the puppet timer runs /usr/local/sbin/puppet-run, and that script checks whether puppet is disabled. If it is, the apt-get update run that triggers the debmonitor-client submit should never get executed in the first place?

I don't think it does anymore unfortunately...

In https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/puppet/bin/puppet-run.sh#7 we check for /run/puppet/disabled, but if I disable puppet on a host that file doesn't exist.
The one that exists is /var/lib/puppet/state/agent_disabled.lock, and its path can be obtained with puppet agent --configprint agent_disabled_lockfile.

I've just tested the behavior on sretest1001, and with puppet disabled apt-get ran happily:

Sep 10 10:24:08 sretest1001 puppet-agent-cronjob[1195759]: INFO:debmonitor:Found 532 installed binary packages
Sep 10 10:24:08 sretest1001 puppet-agent-cronjob[1195759]: INFO:debmonitor:Found 3 upgradable binary packages (including new dependencies)
Sep 10 10:24:09 sretest1001 puppet-agent-cronjob[1195759]: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Sep 10 10:24:10 sretest1001 puppet-agent[1196366]: Skipping run of Puppet configuration client; administratively disabled (Reason: 'test - volans');
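The mismatch described above can be illustrated with a small sketch: a check that only looks for one marker file reports "not disabled" when only the other file exists. The file names below are temporary stand-ins for /run/puppet/disabled and the agent_disabled_lockfile, and `disabled_markers` is a hypothetical helper, not code from puppet-run.sh.

```python
import os
import tempfile


def disabled_markers(paths):
    """Return the subset of marker paths that exist on disk."""
    return [p for p in paths if os.path.exists(p)]


with tempfile.TemporaryDirectory() as root:
    # Stand-ins for /run/puppet/disabled and agent_disabled.lock.
    run_marker = os.path.join(root, "run_puppet_disabled")
    agent_lock = os.path.join(root, "agent_disabled.lock")

    # Simulate a host where puppet was disabled: only the lock file exists.
    open(agent_lock, "w").close()

    # A check limited to the first marker misses the disabled state.
    script_sees_disabled = bool(disabled_markers([run_marker]))
    actually_disabled = bool(disabled_markers([run_marker, agent_lock]))

print(script_sees_disabled, actually_disabled)  # False True
```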

Change #1071588 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.rename: Disable puppet and debmonitor

https://gerrit.wikimedia.org/r/1071588

Ah, right. I misremembered the purpose of /run/puppet/disabled. In fact, John and I added it specifically as a reliable means to prevent Puppet runs during P5->P7 migrations (even if some other component enables Puppet).

So, yes, let's also disable the puppet timer in the cookbook.
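As a rough sketch of what stopping and masking both timers amounts to, the snippet below only builds the command strings and does not execute anything. The unit names come from this discussion; `timer_commands` is a hypothetical helper and does not reflect how the merged patches implement it.

```python
# Unit names as mentioned in this task.
TIMERS = ("debmonitor-client.timer", "puppet-agent-timer.timer")


def timer_commands(timers=TIMERS):
    """Build the stop and mask commands for each timer, in order."""
    commands = []
    for unit in timers:
        commands.append(f"systemctl stop {unit}")
        commands.append(f"systemctl mask {unit}")
    return commands


for command in timer_commands():
    print(command)
```

Stopping before masking keeps an already-active timer from firing once more while the rename is in progress.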

Change #1071887 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/cookbooks@master] sre.hosts.rename: Mask puppet-agent-timer

https://gerrit.wikimedia.org/r/1071887

Sorry, I didn't see the updates to the discussion before merging the previous iteration. Patch is up to disable puppet-agent-timer.timer.

Change #1071887 merged by jenkins-bot:

[operations/cookbooks@master] sre.hosts.rename: Mask puppet-agent-timer

https://gerrit.wikimedia.org/r/1071887

I don't think this has reoccurred during the rest of the rename campaign, so I'm resolving this.