As it says on the tin, regenerate UcfirstOverrides.php [0] (maintaining the static override for 'ß') using generateUcfirstOverrides.php and friends [1].
Alright, so there is at least one tricky bit to this: How do we run generateUpperCharTable.php on 8.1 without also installing 8.1 on maintenance hosts?
Although it's entirely possible that we'll need to support 8.1 on maintenance hosts for some period, that's not something I want to front-load in this process (e.g., turning up new bullseye-based maintenance hosts with both 7.4 and 8.1 installed).
I think the simplest option will be to wait for multi-base-image-flavor mediawiki image builds to work in scap, such that (as-yet unused) 8.1-based mediawiki images are available. Once that's ready, we can use mwscript-on-k8s to run generateUpperCharTable.php with --output 'php://stdout' using said image and pull the result from the container logs.
In fact, we can test that right now with the existing 7.4 images.
This is simpler than extending mwscript-on-k8s with an optional (writable) volume for output files, which would be deleted when the "synthesized" release is garbage collected.
Edit: This trick might not work, as I didn't account for the fact that the table is ~32 MiB of text (it certainly works locally, but streaming that much from container logs might be a challenge).
Although a bit of a process, the following will definitely work:
- Use mwscript-k8s to launch shell.php in --attach mode - when doing this "for real" we'll select the 8.1 image flavor via --mediawiki_image.
- Run what is effectively the main body of generateUpperCharTable.php in --titlecase mode, saving the result to /tmp in the overlay FS:
> $toUpperTable = []; for ( $i = 0; $i <= 0x10ffff; $i++ ) { if ( $i >= 0xd800 && $i <= 0xdfff ) continue; $char = UtfNormal\Utils::codepointToUtf8( $i ); $toUpperTable[$char] = mb_convert_case( $char, MB_CASE_TITLE ); }
> file_put_contents( '/tmp/uctable.json', json_encode( $toUpperTable ) );
- Use kubectl cp to copy out /tmp/uctable.json.
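For readability, the table-building step above can be sketched as a standalone function. This is only an illustration under stated assumptions: the function name buildTitleCaseTable and its range parameters are hypothetical, and mb_chr() is swapped in for UtfNormal\Utils::codepointToUtf8() so that the snippet runs outside MediaWiki (it needs the mbstring extension).

```php
<?php
/**
 * Hypothetical helper: map every code point in [$start, $end] to its
 * title-cased form, skipping the UTF-16 surrogate range, which has no
 * valid UTF-8 encoding.
 *
 * @return array<string,string>
 */
function buildTitleCaseTable( int $start, int $end ): array {
	$table = [];
	for ( $i = $start; $i <= $end; $i++ ) {
		if ( $i >= 0xd800 && $i <= 0xdfff ) {
			continue;
		}
		// Assumption: mb_chr() stands in for UtfNormal\Utils::codepointToUtf8()
		$char = mb_chr( $i, 'UTF-8' );
		$table[$char] = mb_convert_case( $char, MB_CASE_TITLE, 'UTF-8' );
	}
	return $table;
}

// Small demonstration range (a-z) rather than the full 0..0x10ffff sweep
$table = buildTitleCaseTable( 0x61, 0x7a );
echo $table['a'], "\n"; // A
```

Running it over the full range and passing the result through json_encode() would correspond to the file_put_contents() step.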
This avoids solving the "durable output files" problem for mwscript-on-k8s. It can also be done well before an mw-debug deployment running 8.1 is turned up (i.e., one that we could exec or mw-debug-repl into).
The changes in T370934 are now live, and we have 8.1-based MediaWiki images built during scap deployments. I'll give the procedure in T372603#10104750 a try shortly.
Alright, step 1, using the 7.4 and 8.1 flavors of the same image:
swfrench@deploy2002:~$ mwscript-k8s --comment='Generating uctable on 7.4 - T372603' --mediawiki_image=restricted/mediawiki-multiversion:2024-11-05-213458-publish --attach -- shell.php --wiki=testwiki
... snip ...
> echo phpversion();
7.4.33⏎
> $toUpperTable = []; for ( $i = 0; $i <= 0x10ffff; $i++ ) { if ( $i >= 0xd800 && $i <= 0xdfff ) continue; $char = UtfNormal\Utils::codepointToUtf8( $i ); $toUpperTable[$char] = mb_convert_case( $char, MB_CASE_TITLE ); }
> file_put_contents( '/tmp/uctable.json', json_encode( $toUpperTable ) );
= 32599320
> ^D
INFO Ctrl+D.
swfrench@deploy2002:~$ mwscript-k8s --comment='Generating uctable on 8.1 - T372603' --mediawiki_image=restricted/mediawiki-multiversion:2024-11-05-213458-publish-81 --attach -- shell.php --wiki=testwiki
... snip ...
> echo phpversion();
8.1.30⏎
> $toUpperTable = []; for ( $i = 0; $i <= 0x10ffff; $i++ ) { if ( $i >= 0xd800 && $i <= 0xdfff ) continue; $char = UtfNormal\Utils::codepointToUtf8( $i ); $toUpperTable[$char] = mb_convert_case( $char, MB_CASE_TITLE ); }
> file_put_contents( '/tmp/uctable.json', json_encode( $toUpperTable ) );
= 32599320
> ^D
INFO Ctrl+D.
And now, using mwscript on a maintenance host (to save the complexity of copy/pasting code and transferring files):
swfrench@mwmaint2002:~$ mwscript maintenance/language/generateUcfirstOverrides.php --wiki=testwiki --override uctable_8.1.json --with uctable_7.4.json --outfile overrides.php
That is, reading the command as "override the title-case mapping on 8.1 to look like 7.4", we obtain: https://phabricator.wikimedia.org/P70952
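The comparison performed here is, in essence, a per-character diff of the two tables. A minimal sketch follows, assuming both inputs are arrays decoded from the JSON files written in the shell.php sessions (character => title-cased form). The function name computeUcfirstOverrides is hypothetical, and the real generateUcfirstOverrides.php may differ in its details.

```php
<?php
/**
 * Hypothetical helper: for each character whose title-cased form differs
 * between the two tables, record the form from the reference table.
 *
 * @param array<string,string> $override Table to be overridden (e.g. from 8.1)
 * @param array<string,string> $with Reference table (e.g. from 7.4)
 * @return array<string,string> Characters that need an explicit override
 */
function computeUcfirstOverrides( array $override, array $with ): array {
	$result = [];
	foreach ( $override as $char => $titleCased ) {
		// PHP turns numeric-string keys such as "0" into integers; cast back
		$char = (string)$char;
		$reference = $with[$char] ?? null;
		if ( $reference !== null && (string)$reference !== (string)$titleCased ) {
			$result[$char] = $reference;
		}
	}
	return $result;
}

// Toy data only: 'x' maps differently in the two tables, 'a' does not
$overrides = computeUcfirstOverrides(
	[ 'a' => 'A', 'x' => 'Y' ],
	[ 'a' => 'A', 'x' => 'X' ]
);
echo json_encode( $overrides ), "\n"; // {"x":"X"}
```

With the real data, the two arguments would be the results of json_decode( ..., true ) on uctable_8.1.json and uctable_7.4.json respectively.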
I'll put together a patch to update the overrides table (together with the static override for 'ß').
Change #1087604 had a related patch set uploaded (by Scott French; author: Scott French):
[operations/mediawiki-config@master] Add title-case mapping to support migration to PHP 8.1
I generated overrides.php locally and confirmed that it has the same MD5 hash as Scott's paste. The character tables are in the PHP source tree, so any local installation should be sufficient to generate these tables.
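A hash comparison of this kind can be sketched as below. The function name filesHaveSameMd5 is hypothetical, and the sketch assumes the paste has been saved to a local file first.

```php
<?php
/**
 * Hypothetical helper: report whether two files have the same MD5 digest.
 */
function filesHaveSameMd5( string $pathA, string $pathB ): bool {
	$a = md5_file( $pathA );
	$b = md5_file( $pathB );
	if ( $a === false || $b === false ) {
		throw new RuntimeException( 'Could not read one of the files' );
	}
	return hash_equals( $a, $b );
}
```

For example, filesHaveSameMd5( 'overrides.php', 'paste.php' ) would compare a locally generated file against a saved copy of the paste (both file names are placeholders).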
I reviewed the list, and it all looks benign from the point of view of linguistics and title conflicts. Various new combined diacritics, a "reversed half H" used in ancient Roman Gaul, and a new block 10570–105BF for the Vithkuqi alphabet, a 19th century Albanian invention meant as a "religiously neutral" alternative to Arabic, Latin and Greek scripts, never broadly adopted.
Thank you very much for the analysis, @tstarling. Great, so if I understand correctly, it should be safe to proceed with these overrides for 7.4 / 8.1 consistency, and then when it comes time to remove them, the impact of doing so would be fairly low.
Separately, I've added some functional validation in the comments on https://gerrit.wikimedia.org/r/1087604 to confirm the overrides would produce the expected effect in 8.1. Although we don't have a way to patch the source tree in the container, we can upload arbitrary text files (out of tree). This allows us to upload UcfirstOverrides.php from the patch and replace $wgOverrideUcfirstCharacters, which produces the expected behavior of Language::ucfirst (7.4-like).
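To illustrate how an override table changes first-letter capitalization, here is a minimal sketch. It is not the actual Language::ucfirst implementation: the function name ucfirstWithOverrides is hypothetical, the override table is passed as a plain argument rather than read from $wgOverrideUcfirstCharacters, and mb_strtoupper() is assumed as the fallback (requires mbstring).

```php
<?php
/**
 * Hypothetical helper: upper-case the first character of $str, preferring
 * an explicit entry in $overrides over the PHP version's own case mapping.
 *
 * @param string $str UTF-8 input
 * @param array<string,string> $overrides Character => forced replacement
 */
function ucfirstWithOverrides( string $str, array $overrides ): string {
	if ( $str === '' ) {
		return '';
	}
	$first = mb_substr( $str, 0, 1, 'UTF-8' );
	$rest = mb_substr( $str, 1, null, 'UTF-8' );
	// An override wins, so the result no longer depends on the PHP version
	$upper = $overrides[$first] ?? mb_strtoupper( $first, 'UTF-8' );
	return $upper . $rest;
}

// A static override that leaves 'ß' unchanged
echo ucfirstWithOverrides( 'ßtest', [ 'ß' => 'ß' ] ), "\n"; // ßtest
echo ucfirstWithOverrides( 'abc', [] ), "\n"; // Abc
```

Because the override is consulted first, the same table produces the same result on both PHP versions for every character it lists.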
Change #1089805 had a related patch set uploaded (by Krinkle; author: Krinkle):
[mediawiki/core@master] maintenance: Update generateUcfirstOverrides.php description
Change #1089805 merged by jenkins-bot:
[mediawiki/core@master] maintenance: Update generateUcfirstOverrides.php description
Change #1087604 merged by jenkins-bot:
[operations/mediawiki-config@master] Add title-case mapping to support migration to PHP 8.1
Mentioned in SAL (#wikimedia-operations) [2024-11-12T18:12:51Z] <swfrench@deploy2002> Started scap sync-world: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]]
Mentioned in SAL (#wikimedia-operations) [2024-11-12T18:19:09Z] <swfrench@deploy2002> swfrench: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
Mentioned in SAL (#wikimedia-operations) [2024-11-12T18:24:38Z] <swfrench-wmf> verified consistent 7.4-like title-case behavior in 7.4- and 8.1-based images, verified expected treatment of eszett in mwdebug - T372603
Mentioned in SAL (#wikimedia-operations) [2024-11-12T18:31:40Z] <swfrench@deploy2002> Finished scap sync-world: Backport for [[gerrit:1087604|Add title-case mapping to support migration to PHP 8.1 (T372603)]] (duration: 18m 48s)
For the record, the "verified consistent 7.4-like title-case behavior" part of T372603#10313980 used the same procedure as described in this [0] comment thread, just without the manual override of wgOverrideUcfirstCharacters.
[0] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1087604/comments/93d2df23_76b89772
Alright, that should be everything explicitly tracked here for now. The next step is to switch mwdebug-next over to the 8.1-based images (T372604).
I've opened T379675 to request some manner of maintenance-script output file support in mwscript-k8s (to avoid the tricks for keeping the pod alive in T372603#10104750).