
docker-registry.wikimedia.org/dcl-puppet-pki fails to install debmonitor-client
Closed, Resolved (Public)

Description

On build2001 we run the docker-report-base unit to publish the latest Docker image version packages to debmonitor.

It is currently failing due to this error for docker-registry.wikimedia.org/dcl-puppet-pki:bullseye:

Aug 14 04:18:53 build2001 docker-report-base[1827590]: 2024-08-14 04:18:53,280 INFO[docker-report] Building debmonitor report for docker-registry.wikimedia.org/dcl-puppet-pki:bullseye
Aug 14 04:18:53 build2001 docker-report-base[1827590]: 2024-08-14 04:18:53,780 INFO[docker-report] Running: report generation
Aug 14 04:19:01 build2001 docker-report-base[1827590]: 2024-08-14 04:19:01,056 ERROR[docker-report] Report generation exited with exit code 100. Output:
Aug 14 04:19:01 build2001 docker-report-base[1827590]: 2024-08-14 04:19:01,057 ERROR[docker-report] Hit:1 http://deb.debian.org/debian-debug bullseye-debug InRelease
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:2 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Hit:3 http://mirrors.wikimedia.org/debian bullseye InRelease
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Hit:4 http://apt.wikimedia.org/wikimedia bullseye-wikimedia InRelease
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:5 http://mirrors.wikimedia.org/debian bullseye-updates InRelease [44.1 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:6 http://mirrors.wikimedia.org/debian bullseye-backports InRelease [49.0 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:7 http://security.debian.org/debian-security bullseye-security/main Sources [186 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:8 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [280 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Fetched 607 kB in 2s (278 kB/s)
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Reading package lists...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Reading package lists...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Building dependency tree...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Reading state information...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: The following packages were automatically installed and are no longer required:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   libboost-filesystem1.74.0 libboost-locale1.74.0 libboost-log1.74.0
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   libboost-nowide1.74.0 libboost-program-options1.74.0 libboost-thread1.74.0
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   libcpp-hocon0.3.0 libfacter3.14.12 libleatherman1.12.1 libyaml-cpp0.6
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Use 'apt autoremove' to remove them.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: The following additional packages will be installed:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   python-apt-common python3-apt
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Suggested packages:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   python3-apt-dbg python-apt-doc
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Recommended packages:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   iso-codes
Aug 14 04:19:01 build2001 docker-report-base[1827590]: The following NEW packages will be installed:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:   debmonitor-client python-apt-common python3-apt
Aug 14 04:19:01 build2001 docker-report-base[1827590]: 0 upgraded, 3 newly installed, 0 to remove and 1 not upgraded.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Need to get 301 kB of archives.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: After this operation, 1364 kB of additional disk space will be used.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:1 http://mirrors.wikimedia.org/debian bullseye/main amd64 python-apt-common all 2.2.1 [96.5 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:2 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 debmonitor-client all 0.4.0-1+deb11u1 [14.2 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Get:3 http://mirrors.wikimedia.org/debian bullseye/main amd64 python3-apt amd64 2.2.1 [190 kB]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: debconf: delaying package configuration, since apt-utils is not installed
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Fetched 301 kB in 0s (3559 kB/s)
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Selecting previously unselected package python-apt-common.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: [614B blob data]
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Preparing to unpack .../python-apt-common_2.2.1_all.deb ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Unpacking python-apt-common (2.2.1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Selecting previously unselected package python3-apt.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Preparing to unpack .../python3-apt_2.2.1_amd64.deb ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Unpacking python3-apt (2.2.1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Selecting previously unselected package debmonitor-client.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Preparing to unpack .../debmonitor-client_0.4.0-1+deb11u1_all.deb ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Unpacking debmonitor-client (0.4.0-1+deb11u1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Setting up python-apt-common (2.2.1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Setting up python3-apt (2.2.1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Setting up debmonitor-client (0.4.0-1+deb11u1) ...
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Creating group debmonitor with gid 499.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Creating user debmonitor (DebMonitor system user) with uid 499 and gid 499.
Aug 14 04:19:01 build2001 docker-report-base[1827590]: /var/lib/dpkg/info/debmonitor-client.postinst: line 5:   569 Segmentation fault      (core dumped) systemd-sysusers
Aug 14 04:19:01 build2001 docker-report-base[1827590]: dpkg: error processing package debmonitor-client (--configure):
Aug 14 04:19:01 build2001 docker-report-base[1827590]:  installed debmonitor-client package post-installation script subprocess returned error exit status 139
Aug 14 04:19:01 build2001 docker-report-base[1827590]: Errors were encountered while processing:
Aug 14 04:19:01 build2001 docker-report-base[1827590]:  debmonitor-client
Aug 14 04:19:01 build2001 docker-report-base[1827590]: E: Sub-process /usr/bin/dpkg returned an error code (1)
Aug 14 04:19:01 build2001 docker-report-base[1827590]: 2024-08-14 04:19:01,057 ERROR[docker-report] Debmonitor report for image docker-registry.wikimedia.org/dcl-puppet-pki:bullseye failed

Highlight:

Aug 14 04:19:01 build2001 docker-report-base[1827590]: /var/lib/dpkg/info/debmonitor-client.postinst: line 5:   569 Segmentation fault      (core dumped) systemd-sysusers
Aug 14 04:19:01 build2001 docker-report-base[1827590]: dpkg: error processing package debmonitor-client (--configure):
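
The "exit status 139" that dpkg reports for the postinst script matches the "Segmentation fault" line: by shell convention, a status above 128 means the process was killed by signal (status - 128), and 139 - 128 = 11 is SIGSEGV on Linux. A minimal sketch of that decoding follows; `describe_exit_status` is a hypothetical helper written for illustration and is not part of docker-report or debmonitor-client.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status.

    Shells report "killed by signal N" as status 128 + N, so anything
    above 128 is decoded back into a signal name where possible.
    """
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with code {status}"


# Status reported by dpkg for the debmonitor-client postinst script.
print(describe_exit_status(139))
# Status reported by docker-report for the report generation step.
print(describe_exit_status(100))
```

The 100 seen earlier in the log is an ordinary non-zero exit code from the report generation step, not a signal.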

@jhathaway is this anything you have already seen, by any chance?

Event Timeline

I have not seen that before, but after testing, I am pretty sure we are hitting this bug, https://github.com/systemd/systemd/issues/6512, which is caused by a glibc bug, https://sourceware.org/bugzilla/show_bug.cgi?id=20338. If I shorten the group members of adm in /etc/gshadow then systemd-sysusers no longer segfaults. It also only segfaults when it is making a change.
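
To find which groups are close to the problem, one could scan a gshadow-format file for unusually long entries. The sketch below is only an illustration: the thread does not state the exact length at which glibc misbehaves, so `ASSUMED_LIMIT` is a placeholder value and `long_group_entries` is a hypothetical helper, not an existing tool.

```python
# Placeholder threshold: NOT taken from the task or the linked glibc bug.
ASSUMED_LIMIT = 1024


def long_group_entries(text: str, limit: int = ASSUMED_LIMIT):
    """Yield (group, line_length, member_count) for long gshadow entries.

    Expects the gshadow layout "name:password:admins:members", where
    members is a comma-separated list of login names.
    """
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        members = [m for m in fields[3].split(",") if m]
        if len(line) > limit:
            yield fields[0], len(line), len(members)


# Synthetic sample: one short group and one with 200 made-up members.
sample = "\n".join([
    "staff:*::alice,bob",
    "adm:*::" + ",".join(f"user{i}" for i in range(200)),
])

for group, length, count in long_group_entries(sample):
    print(f"{group}: {length} characters, {count} members")
```

Running it against the real file would mean passing the contents of /etc/gshadow (readable by root only) instead of `sample`, and choosing `limit` from the actual glibc behaviour rather than the placeholder.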

I'm a little surprised we haven't seen this on any bullseye hosts in production, but perhaps we haven't provisioned a new host or added a user since we exceeded the length limit?

I'm not sure of the best course of action:

  1. Ask everyone to remove letters from their login names 😜
  2. Try to get debian to backport a glibc patch to bullseye
  3. Patch glibc ourselves

None of those seem too attractive.

I can confirm I ran into this exact same issue in production with https://gerrit.wikimedia.org/r/q/b27f071184aeec789f06ef3ef75d3a33c8d63b2e (thank you @CDanis for mentioning this task)

that is unfortunate, however it does remove my todo to test this on an sre test host, so thanks! 😉

jhathaway triaged this task as Medium priority. Aug 19 2024, 2:56 PM

I'm a little surprised we haven't seen this on any bullseye hosts in production, but perhaps we haven't provisioned a new host or added a user since we exceeded the length limit?

I think I hit the limit when adding the dcops folks to adm for T360356. That would make sense as a trigger that hadn't surfaced before.

I'm not sure of the best course of action:

  1. Ask everyone to remove letters from their login names 😜
  2. Try to get debian to backport a glibc patch to bullseye
  3. Patch glibc ourselves

None of those seem too attractive.

One additional possibility would be to remove all SREs from the adm group (annoying, I know, but not the end of the world) and wait for Bookworm to be more widely used before re-adding them. Bullseye has been in LTS mode as of a few days ago, so I think 2) is not really a possibility :(

Thoughts?

Change #1067986 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker::reporter: exclude dcl-puppet-pki from base rules

https://gerrit.wikimedia.org/r/1067986

Change #1067986 merged by Elukey:

[operations/puppet@production] profile::docker::reporter: exclude dcl-puppet-pki from base rules

https://gerrit.wikimedia.org/r/1067986

Just for fun I tried backporting the patch, and it does apply cleanly, so perhaps running a custom glibc is an option? @MoritzMuehlenhoff we would love to hear your opinion as well.

Change #1071154 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/debmonitor-client@debian] debmonitor: Also use adduser on Bullseye to create the system user

https://gerrit.wikimedia.org/r/1071154

Just for fun I tried backporting the patch, and it does apply cleanly, so perhaps running a custom glibc is an option? @MoritzMuehlenhoff we would love to hear your opinion as well.

We ran into this before (https://phabricator.wikimedia.org/T256098), and at the time there were some efforts to land the fix in Debian, which didn't work out due to some complexities around version symbols: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969926. Now that Bullseye is in LTS, this is even less likely to get merged.

And we should avoid patching glibc ourselves: whenever there is a security issue in glibc, we would need to rebase our build.

Instead I made https://gerrit.wikimedia.org/r/c/operations/software/debmonitor-client/+/1071154, which fixes it for good, and we can drop the patch when we're fully on >= bookworm.

Change #1071154 merged by Muehlenhoff:

[operations/software/debmonitor-client@debian] debmonitor: Also use adduser on Bullseye to create the system user

https://gerrit.wikimedia.org/r/1071154

Mentioned in SAL (#wikimedia-operations) [2024-09-06T10:39:09Z] <moritzm> uploaded debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia (didn't rebuild the other suites since the fix is specific to Bullseye) T372472

Mentioned in SAL (#wikimedia-operations) [2024-09-06T11:02:56Z] <moritzm> rolling out debmonitor-client 0.4.0-2+deb11u1 on bullseye-wikimedia on bullseye hosts T372472

MoritzMuehlenhoff claimed this task.

I've uploaded a fixed bullseye build to apt.wikimedia.org and upgraded build2001 (the rest of the Bullseye hosts are WIP), which should unbreak the next docker-report run.

Just for fun I tried backporting the patch, and it does apply cleanly, so perhaps running a custom glibc is an option? @MoritzMuehlenhoff we would love to hear your opinion as well.

We ran into this before (https://phabricator.wikimedia.org/T256098), and at the time there were some efforts to land the fix in Debian, which didn't work out due to some complexities around version symbols: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969926. Now that Bullseye is in LTS, this is even less likely to get merged.

And we should avoid patching glibc ourselves: whenever there is a security issue in glibc, we would need to rebase our build.

Instead I made https://gerrit.wikimedia.org/r/c/operations/software/debmonitor-client/+/1071154, which fixes it for good, and we can drop the patch when we're fully on >= bookworm.

Do you know why Florian Weimer's patch was not merged? https://salsa.debian.org/glibc-team/glibc/-/merge_requests/2. @fgiunchedi also ran into this bug, so I fear this will not be our last issue, since we may have bullseye hosts for at least another year.

Do you know why Florian Weimer's patch was not merged? https://salsa.debian.org/glibc-team/glibc/-/merge_requests/2. @fgiunchedi also ran into this bug, so I fear this will not be our last issue, since we may have bullseye hosts for at least another year.

I think the Debian glibc maintainers mostly only merge what ends up in the upstream stable branches and the fix was never merged into the 2.31 stable branch.

Debian packages only started using systemd-sysusers with Bookworm, so at least the impact is mostly limited to what we deploy via Puppet.

I think the Debian glibc maintainers mostly only merge what ends up in the upstream stable branches and the fix was never merged into the 2.31 stable branch.

makes sense, it's a complex code base

Debian packages only started using systemd-sysusers with Bookworm, so at least the impact is mostly limited to what we deploy via Puppet.

good point