MoritzMuehlenhoff (Moritz Mühlenhoff)
User

User Details

User Since
Apr 1 2015, 4:33 PM (492 w, 5 d)
Availability
Available
LDAP User
Moritz Mühlenhoff
MediaWiki User
MMuhlenhoff (WMF) [ Global Accounts ]

Recent Activity

Today

MoritzMuehlenhoff closed T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd as Resolved.

The override is now fixed on the Debian archive side and bullseye installations should work again. Please reopen if you still see reimages failing.

Tue, Sep 10, 5:18 AM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff closed T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd , a subtask of T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, as Resolved.
Tue, Sep 10, 5:17 AM · serviceops, SRE

Yesterday

MoritzMuehlenhoff added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

Correction, it worked for puppetdb, but they got added back to debmonitor. Will investigate further.

Mon, Sep 9, 3:02 PM · Patch-For-Review, SRE-tools, Infrastructure-Foundations, serviceops-radar
MoritzMuehlenhoff triaged T374351: Race condition on puppetdb in sre.hosts.rename cookbook as Medium priority.
Mon, Sep 9, 2:37 PM · Patch-For-Review, SRE-tools, Infrastructure-Foundations, serviceops-radar
MoritzMuehlenhoff updated the task description for T373783: Integrate Bookworm 12.7 point update.
Mon, Sep 9, 1:11 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.

mw2379 is also still in puppetboard: https://puppetboard.wikimedia.org/catalog/mw2379.codfw.wmnet

Mon, Sep 9, 11:04 AM · serviceops, SRE
MoritzMuehlenhoff renamed T332015: Migrate poolcounter hosts to bookworm from Migrate poolcounter hosts to bullseye to Migrate poolcounter hosts to bookworm.
Mon, Sep 9, 9:44 AM · serviceops
MoritzMuehlenhoff added a comment to T372817: reimage gerrit1004.wikimedia.org as phab1005.eqiad.wmnet.

@Dzahn gerrit1004 is still in puppetdb: https://puppetboard.wikimedia.org/catalog/gerrit1004.wikimedia.org

Mon, Sep 9, 9:18 AM · SRE, DC-Ops, ops-eqiad, collaboration-services
MoritzMuehlenhoff added a comment to T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.

Something went wrong with the 2430 rename, it's still showing up in Puppetboard: https://puppetboard.wikimedia.org/node/mw2430.codfw.wmnet

Mon, Sep 9, 8:30 AM · serviceops, SRE

Fri, Sep 6

MoritzMuehlenhoff added a comment to T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client.

I think

Fri, Sep 6, 3:47 PM · Infrastructure-Foundations
MoritzMuehlenhoff created T374250: Check home/HDFS leftovers of manuel-wmde.
Fri, Sep 6, 2:55 PM · Data-Platform-SRE
MoritzMuehlenhoff closed T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client as Resolved.

I've uploaded a fixed bullseye build to apt.wikimedia.org and upgraded build2001 (the rest of the Bullseye hosts are WIP), which unbreaks the next docker-report run.

Fri, Sep 6, 11:16 AM · Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T372472: docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client.

Just for fun I tried backporting the patch, and it does apply cleanly, so perhaps running a custom glibc is an option? @MoritzMuehlenhoff we would love to hear your opinion as well.

Fri, Sep 6, 10:00 AM · Infrastructure-Foundations
MoritzMuehlenhoff closed T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 as Resolved.

All done!

Fri, Sep 6, 8:43 AM · Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff closed T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7, a subtask of T365798: Shutdown of Puppet 5 servers, as Resolved.
Fri, Sep 6, 8:42 AM · Patch-For-Review, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Fri, Sep 6, 8:42 AM · Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations

Thu, Sep 5

MoritzMuehlenhoff closed T373888: puppetmaster1003: broken disk as Resolved.

I've kicked off the RAID rebuild; it should complete in half an hour. I've also re-added puppetmaster1003 back to active duty.

Thu, Sep 5, 3:12 PM · SRE, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed..

Thank you to all involved so far with the reboots -- much appreciated!

I can confirm the hosts are now reachable from alert2002, except lists1004 and lists2001. On these hosts, unlike the others, there is both nft and iptables, similarly there's /etc/ferm and /etc/nftables which I'm assuming is the cause of the problem (i.e. iptables and nftables together). Does that ring a bell @eoghan ?

Thu, Sep 5, 2:18 PM · collaboration-services, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q1), Observability-Alerting
MoritzMuehlenhoff updated the task description for T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed..
Thu, Sep 5, 11:51 AM · collaboration-services, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q1), Observability-Alerting
MoritzMuehlenhoff added a comment to T373888: puppetmaster1003: broken disk.

@VRiley-WMF puppetmaster1003 has been taken out of active duty and I've set downtime, you can proceed with the drive swap any time.

Thu, Sep 5, 10:13 AM · SRE, ops-eqiad, DC-Ops
MoritzMuehlenhoff triaged T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed. as High priority.
Thu, Sep 5, 9:34 AM · collaboration-services, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q1), Observability-Alerting
MoritzMuehlenhoff updated the task description for T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed..
Thu, Sep 5, 9:24 AM · collaboration-services, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q1), Observability-Alerting
MoritzMuehlenhoff added a comment to T373980: Hosts using nftables are not reachable via ssh from alert[12]002. Reboot needed..

Any host which switches from iptables/ferm to nftables strictly needs a reboot after the provider has been changed. Some of the kernel modules used by iptables cannot be unloaded at runtime without a reboot (I tried various -f hacks, but to no avail). If the old iptables kernel modules are still loaded the constants formerly defined by ferm still persist (and this is what we are seeing here: the hosts don't know about alert2002 being in the new global list of monitoring hosts).
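As an illustration of the check implied above, here is a minimal sketch that scans the text of /proc/modules for legacy iptables kernel modules that are still loaded. The helper names and the list of module-name prefixes are assumptions made for this example, not something taken from the task, and a match is only a heuristic hint that the host has not been rebooted since the provider change.

```python
# Hypothetical helper; the prefix list is an assumption, not taken from the task.
LEGACY_PREFIXES = ("ip_tables", "ip6_tables", "iptable_", "ip6table_")


def legacy_iptables_modules(proc_modules_text):
    """Return the sorted names of loaded modules that look like legacy iptables modules."""
    found = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first column of /proc/modules is the module name.
        name = fields[0]
        if name.startswith(LEGACY_PREFIXES):
            found.append(name)
    return sorted(found)


def needs_reboot(proc_modules_text):
    """Heuristic: legacy modules still loaded suggests the host was not rebooted."""
    return bool(legacy_iptables_modules(proc_modules_text))
```

On a host, the input would be the content of /proc/modules, for example `needs_reboot(open("/proc/modules").read())`.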

Thu, Sep 5, 8:28 AM · collaboration-services, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q1), Observability-Alerting
MoritzMuehlenhoff claimed T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Thu, Sep 5, 7:02 AM · Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations

Tue, Sep 3

MoritzMuehlenhoff added a comment to T373888: puppetmaster1003: broken disk.

Nice! I suppose the disk swap needs downtime? Then I'll take the server out of rotation Thursday morning (I'm off tomorrow)

Tue, Sep 3, 5:13 PM · SRE, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

Tracking bug is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1080418

Tue, Sep 3, 3:08 PM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff created T373888: puppetmaster1003: broken disk.
Tue, Sep 3, 2:47 PM · SRE, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

I'll reproduce this in an nspawn container and report upstream.

Tue, Sep 3, 2:10 PM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

Sure, but if it is not d-i that installs it, then it is puppet (via systemd::timesyncd and the related profile included in profile::base) that does it right? If there is a race condition we can fix it in puppet, this is my point, but maybe I am not getting what is your idea/plan for the fix.

Tue, Sep 3, 1:26 PM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

I think this might be a bug in the latest systemd update for LTS:

Tue, Sep 3, 1:12 PM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

deb11u5 is from the point release, deb11u6 is from https://lists.debian.org/debian-lts-announce/2024/09/msg00001.html (released yesterday)

Tue, Sep 3, 12:50 PM · serviceops, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T373783: Integrate Bookworm 12.7 point update.
Tue, Sep 3, 11:08 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T373795: Integrate Bullseye 11.11 point update.
Tue, Sep 3, 11:08 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T368288: Integrate Bullseye 11.10 point update.
Tue, Sep 3, 10:23 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T373795: Integrate Bullseye 11.11 point update.
Tue, Sep 3, 10:23 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T373783: Integrate Bookworm 12.7 point update.
Tue, Sep 3, 10:22 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T299894: Updated java.security policy in OpenJDK 11.0.4 as Declined.

This got superseded by https://phabricator.wikimedia.org/T328331

Tue, Sep 3, 7:36 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T306429: check_user: manager information not present anymore as Resolved.

The tool has been fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/761029, the manager information is now correctly displayed.

Tue, Sep 3, 7:26 AM · User-jbond, Infrastructure-Foundations
MoritzMuehlenhoff closed T317406: Evaluate xbzrle and/or auto-converge in qemu as Declined.

This was intended as a workaround for VMs running on Ganeti servers with 1G memory and Java-based workloads which have a lot of memory activity. We started to buy 10G NICs for all server refreshes and at this point (and when the next refresh is done), the old systems should be mostly gone. As such, this is no longer needed.

Tue, Sep 3, 7:22 AM · Ganeti, Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T373795: Integrate Bullseye 11.11 point update.
Tue, Sep 3, 7:08 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T373783: Integrate Bookworm 12.7 point update.
Tue, Sep 3, 7:07 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T373783: Integrate Bookworm 12.7 point update.

I updated the bullseye and bookworm netinst images this morning

Tue, Sep 3, 7:07 AM · Infrastructure-Foundations, SRE

Mon, Sep 2

MoritzMuehlenhoff claimed T373637: services using libnet-dns-perl can't use nftables as firewall provider.

I'll look into a fix

Mon, Sep 2, 1:41 PM · Patch-For-Review, Infrastructure-Foundations, collaboration-services
MoritzMuehlenhoff merged task T373432: Some of the packages present in the Docker registry are not visible in Debmonitor into T348876: Container image reports in debmonitor are broken.
Mon, Sep 2, 1:39 PM · User-Elukey, Infrastructure-Foundations
MoritzMuehlenhoff merged T373432: Some of the packages present in the Docker registry are not visible in Debmonitor into T348876: Container image reports in debmonitor are broken.
Mon, Sep 2, 1:39 PM · Release-Engineering-Team (Radar), GitLab (Integrations), User-Elukey, collaboration-services, serviceops, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T373432: Some of the packages present in the Docker registry are not visible in Debmonitor.

That is already tracked as T348876, I'll merge that in.

Mon, Sep 2, 1:39 PM · User-Elukey, Infrastructure-Foundations
MoritzMuehlenhoff triaged T373783: Integrate Bookworm 12.7 point update as Medium priority.
Mon, Sep 2, 1:36 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Mon, Sep 2, 1:30 PM · Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T372507: Prepare WMF PHP 8.1 packages for Bullseye.

In any case, while this may(?) be something we can simplify with some package surgery, I'm not sure that's warranted here.

This is something we'll need to think about for any use cases that require co-installable php versions. That clearly does not include the production image use case, but may include, e.g., maintenance hosts (if the migration to mw-script on k8s has not yet completed).

Mon, Sep 2, 1:11 PM · MediaWiki-Platform-Team (Radar), serviceops
MoritzMuehlenhoff triaged T373795: Integrate Bullseye 11.11 point update as Medium priority.
Mon, Sep 2, 11:03 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T373795: Integrate Bullseye 11.11 point update.
Mon, Sep 2, 11:02 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T373783: Integrate Bookworm 12.7 point update.
Mon, Sep 2, 8:03 AM · Infrastructure-Foundations, SRE

Jun 28 2024

MoritzMuehlenhoff added a comment to T364416: Q4:rack/setup/install deploy1003.

Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point.

Jun 28 2024, 10:34 AM · SRE, serviceops, ops-eqiad, DC-Ops

Jun 27 2024

MoritzMuehlenhoff updated the task description for T368288: Integrate Bullseye 11.10 point update.
Jun 27 2024, 9:29 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

And FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations within Wikimedia (originally written for Bitu, but other Python tools also use it). Should be helpful for writing the dump script.

Jun 27 2024, 7:25 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy.

This issue is biting us again, the time between a puppet run that fails to start ferm.service and the subsequent run is enough to cause issues with eventgate, leading to flurries of Can't enqueue job errors in mediawiki. It is in particular triggered when we add new kubernetes nodes, because that causes all other kubernetes nodes in the cluster to update their ferm rules.

Would adding Restart=on-failure to ferm.service for kubernetes worker be a possible short-term solution @MoritzMuehlenhoff ?
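For reference, the suggestion above could take the form of a systemd drop-in. The following is a minimal sketch that only renders the drop-in text; the function name, the RestartSec value and the file name mentioned afterwards are assumptions for illustration, and whether this setting suits ferm.service is not established in the thread.

```python
def render_restart_dropin(restart_sec=5):
    """Build the text of a systemd drop-in that restarts a service on failure.

    restart_sec is a hypothetical delay; the thread does not specify one.
    """
    lines = [
        "[Service]",
        "Restart=on-failure",
        f"RestartSec={restart_sec}",
    ]
    return "\n".join(lines) + "\n"
```

The rendered text would typically be placed in a drop-in directory such as /etc/systemd/system/ferm.service.d/ (for example as a hypothetical restart.conf), followed by a `systemctl daemon-reload`.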

Jun 27 2024, 5:11 PM · Infrastructure-Foundations, serviceops, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't restrict that, but we've been gradually shifting r/o access only to the replicas and I'd like to come to a state where the only r/w changes to our LDAP are coming from Horizon (for cloud VPS access management), ldap-maint and Bitu and then all other hosts in production get access denied via firewall rules.

Jun 27 2024, 3:35 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

One thing that we could do is to

Jun 27 2024, 3:32 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T352245: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI.

Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.

The TLS proxy will go away with the v3 migration, since its primary use case will be absorbed into etcd itself (role-based access control). Thus, I think the main question is whether the effort is worth it.

Jun 27 2024, 9:22 AM · Patch-For-Review, serviceops
MoritzMuehlenhoff claimed T355663: Allocate more available UNIX UIDs for human users.

I'll take care of this when I'm back from sabbatical

Jun 27 2024, 9:08 AM · User-MoritzMuehlenhoff, Bitu, Infrastructure-Foundations, cloud-services-team, LDAP
MoritzMuehlenhoff added a project to T345070: Attach opencontainers image metadata to docker images: User-MoritzMuehlenhoff.
Jun 27 2024, 9:07 AM · User-MoritzMuehlenhoff, User-Elukey, Release-Engineering-Team, serviceops, docker-pkg
MoritzMuehlenhoff assigned T368597: Decommission ganeti1019 to Jclark-ctr.
Jun 27 2024, 9:04 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff updated the task description for T368597: Decommission ganeti1019.
Jun 27 2024, 9:04 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff triaged T368597: Decommission ganeti1019 as Medium priority.
Jun 27 2024, 8:56 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff created T368597: Decommission ganeti1019.
Jun 27 2024, 8:46 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff closed T331702: Migrate mw_rc_irc servers to Bullseye as Resolved.

The old nodes have been decommissioned, all done.

Jun 27 2024, 8:39 AM · Patch-For-Review, Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
MoritzMuehlenhoff closed T331702: Migrate mw_rc_irc servers to Bullseye, a subtask of T291916: Tracking task for Bullseye migrations in production, as Resolved.
Jun 27 2024, 8:38 AM · User-Elukey, Epic, Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T368503: Update CAS to 6.6.15.2 as Resolved.
Jun 27 2024, 7:44 AM · Infrastructure-Foundations, CAS-SSO
MoritzMuehlenhoff triaged T368503: Update CAS to 6.6.15.2 as High priority.
Jun 27 2024, 7:39 AM · Infrastructure-Foundations, CAS-SSO
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Jun 27 2024, 7:32 AM · Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T332016: Migrate docker registry hosts to bookworm.

Why bullseye, this should be bookworm? docker-registry is packaged in Debian, so we can simply use bookworm and use the package from it. In fact, we are already using the bookworm package on the existing registry hosts (2.8.2+ds1-1)

Jun 27 2024, 7:24 AM · serviceops

Jun 26 2024

MoritzMuehlenhoff updated the task description for T368288: Integrate Bullseye 11.10 point update.
Jun 26 2024, 2:57 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff renamed T367981: Update Proton to include Chromium 128.0.6613.119-1 from Update Proton to include Chromium 126.0.6478.114 to Update Proton to include Chromium 126.0.6478.126.
Jun 26 2024, 11:21 AM · Content-Transform-Team-WIP, Essential-Work, Proton
MoritzMuehlenhoff added a comment to T367981: Update Proton to include Chromium 128.0.6613.119-1.

New release:
https://lists.debian.org/debian-security-announce/2024/msg00131.html

Jun 26 2024, 11:20 AM · Content-Transform-Team-WIP, Essential-Work, Proton

Jun 25 2024

MoritzMuehlenhoff added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Indeed, CGO_ENABLED=0 rings a bell.

Jun 25 2024, 3:49 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T365165: Q4:rack/setup/install krb1002.
Jun 25 2024, 2:39 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T365165: Q4:rack/setup/install krb1002.

@MoritzMuehlenhoff would you be able to update site.pp file for this server?

Jun 25 2024, 2:39 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Jun 25 2024, 1:23 PM · Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

The dependency is added because some feature in the compiled Go code uses syscalls which were only wired up in 2.34 (maybe openat() et al). We ran into this problem before and there was a Go build flag to force it to use a fallback. I can't find a reference currently, but maybe Filippo remembers when he's back.
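To see which glibc version a binary actually requires, one option is to inspect its versioned dynamic symbols. The sketch below assumes the input is text in the style of `objdump -T` output, where version names such as GLIBC_2.34 appear next to each symbol; the function name is hypothetical and the parsing is deliberately loose.

```python
import re

# Matches versioned symbol tags such as GLIBC_2.34 or GLIBC_2.2.5.
_GLIBC_RE = re.compile(r"GLIBC_(\d+(?:\.\d+)+)")


def max_glibc_version(symbol_dump):
    """Return the highest GLIBC_x.y version found in the text as a tuple, or None."""
    versions = [
        tuple(int(part) for part in match.group(1).split("."))
        for match in _GLIBC_RE.finditer(symbol_dump)
    ]
    # Tuples compare element by element, so (2, 34) ranks above (2, 2, 5).
    return max(versions) if versions else None
```

If the result is newer than the libc6 shipped in the target distribution, the package cannot be installed there without a rebuild or a fallback build flag.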

Jun 25 2024, 10:56 AM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Jun 25 2024, 10:48 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T367554: Cloud VPS "sso" project Buster deprecation as Resolved.

The Buster instances have been removed.

Jun 25 2024, 8:43 AM · Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff updated the task description for T367554: Cloud VPS "sso" project Buster deprecation.
Jun 25 2024, 8:41 AM · Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff added a comment to T367554: Cloud VPS "sso" project Buster deprecation.

Hi all, I wanted to say that the sso project is used so that users have an SSO testing infrastructure to use in cloud services. Originally this was also used to provide SSO to production-like services in cloud services, however this latter functionality has been moved.

If there is still a desire to keep a development environment then we will still need all these machines:

  • puppetprimary.sso.eqiad1.wikimedia.cloud: The project uses its own puppet master as we have secrets
  • sso-db.sso.eqiad1.wikimedia.cloud: This is a mysql db used to store e.g. mfa keys
  • sso-pdb.sso.eqiad1.wikimedia.cloud: A puppet db instance, not sure why this is used but guessing the idp classes somehow need some puppetdb functionality.
Jun 25 2024, 8:26 AM · Cloud-VPS (Debian Buster Deprecation)

Jun 24 2024

MoritzMuehlenhoff triaged T368288: Integrate Bullseye 11.10 point update as Medium priority.
Jun 24 2024, 7:11 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T368288: Integrate Bullseye 11.10 point update.
Jun 24 2024, 3:39 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T367757: Request to add mnz to analytics-research-admins.

Then the next steps will be creating an SSH key and signing the access agreement. Here are the details:

Jun 24 2024, 3:36 PM · Patch-For-Review, SRE, SRE-Access-Requests
MoritzMuehlenhoff added a comment to T367757: Request to add mnz to analytics-research-admins.

@KFrancis can you please make sure @MunizaA's NDA is signed? Thank you!

Jun 24 2024, 3:02 PM · Patch-For-Review, SRE, SRE-Access-Requests
MoritzMuehlenhoff added a project to T368088: upgrade prometheus-ipmi-exporter to 1.8.0: SRE Observability.
Jun 24 2024, 2:30 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff triaged T368088: upgrade prometheus-ipmi-exporter to 1.8.0 as Medium priority.

Given that this is a Go static ELF we can also simply build on bookworm and copy over the deb to bullseye-wikimedia, we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it, there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.

Jun 24 2024, 2:30 PM · SRE Observability (FY2024/2025-Q1), Patch-For-Review, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff triaged T368023: Move the private Puppet repository to puppetserver1001 as High priority.
Jun 24 2024, 2:11 PM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff triaged T367861: Migrate ldap-ro and ldap-ro-ssl to IPIP encapsulation as Medium priority.
Jun 24 2024, 1:03 PM · Infrastructure-Foundations, Traffic
MoritzMuehlenhoff triaged T367487: Update CAS to 7.0 as Medium priority.
Jun 24 2024, 1:03 PM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T350567: Migrate Cassandra to Java 11.

Very nice!

Jun 24 2024, 11:53 AM · Cassandra, Data-Persistence, SRE
MoritzMuehlenhoff updated the task description for T273950: Modernise memcached systemd unit / sync, and make it presentable.
Jun 24 2024, 11:25 AM · Cloud-Services, serviceops, User-jijiki, SRE
MoritzMuehlenhoff added a comment to T273950: Modernise memcached systemd unit / sync, and make it presentable.

CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be needed anymore for the idp servers, I'll tick them off.

Jun 24 2024, 11:25 AM · Cloud-Services, serviceops, User-jijiki, SRE

Jun 20 2024

MoritzMuehlenhoff added a project to T310087: Advance declaration of query parameters: User-MoritzMuehlenhoff.
Jun 20 2024, 2:05 PM · User-MoritzMuehlenhoff, SRE, Traffic, MediaWiki-General
MoritzMuehlenhoff added a comment to T367399: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one.

Did one of these changes possibly break PCC here?
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/3739/console

Jun 20 2024, 12:50 PM · Patch-For-Review, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T368023: Move the private Puppet repository to puppetserver1001.

And prior to the migration, puppetserver1001 needs to be allowed in profile::tcpircbot

Jun 20 2024, 11:54 AM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Jun 20 2024, 11:04 AM · Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T352647: Move Cassandra clusters to PKI.

The task can be closed, or is there anything still open?

Jun 20 2024, 10:04 AM · Patch-For-Review, Data-Persistence, Cassandra