User Details
- User Since
- Jan 18 2024, 5:33 PM (33 w, 4 d)
- Availability
- Available
- LDAP User
- Scott French
- MediaWiki User
- SFrench-WMF
Yesterday
Fri, Sep 6
I chatted with @Clement_Goubert a bit earlier, and it sounds like targeting 60% utilization at p95 is probably not necessary - 70% or possibly even 75% should be a fine starting point (the PHPFPMTooBusy alert threshold is the latter).
Thu, Sep 5
Wed, Sep 4
Tue, Sep 3
To recap, this was the result of the sre.hosts.decommission run on 2024-09-02 for mw[2261-2262,2268-2270].codfw.wmnet (5 hosts in this rack).
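As a quick sanity check on the host count, the bracketed range above can be expanded by hand. The helper below is a minimal sketch written for illustration only: `expand_hosts` is a hypothetical name, it handles a single bracket group, and it is not the parser the cookbook tooling actually uses.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed range expression, e.g. 'mw[1-2,5].example'.

    Hypothetical helper for illustration; supports a single [...] group only.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket expression: treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # Each part is either a single number ("2261") or a range ("2261-2262").
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # preserve zero padding, if any
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


hosts = expand_hosts("mw[2261-2262,2268-2270].codfw.wmnet")
print(len(hosts), hosts)
```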
At a high level, we can split this into two phases: TLS proxy (nginx) and etcd.
Sat, Aug 31
As we've reached the end of August and the v3 migration is still pending due to higher priority work, I think it's time to reassess this.
Fri, Aug 30
Thu, Aug 29
Although a bit of a process, the following will definitely work:
Thanks for chatting earlier today @dduvall.
Spent a bit of time thinking about this today.
Wed, Aug 28
The 8.1-based production images are ready to go and seem to work per some basic local smoke tests.
Tue, Aug 27
Alright, so there is at least one tricky bit to this: How do we run generateUpperCharTable.php on 8.1 without also installing 8.1 on maintenance hosts?
Mon, Aug 26
One last test and point-of-note:
Fri, Aug 23
Following up on the status of the php-geoip extension (h/t to @Krinkle for all the discussion out of band):
Thu, Aug 22
Thank you very much for the explanation in T373037#10085662, Amir - that makes sense. On my quick read of the key-structure description, it did not occur to me that both "tiers" are in the same store.
Shared objects are now present in all three packages, as well as a previously missing .ini file from wikidiff2. Verified that a local build of docker-registry.wikimedia.org/php8.1-fpm-multiversion-base no longer produces warnings about missing extensions.
Alright, I think I see a way out of this: I'd overlooked that the debian/rules files for these packages set an explicit INSTALL_ROOT for make install in override_dh_auto_install, which did not include the version number (i.e., did not match the "manually made coinstallable" package name). Fixing that means the result ends up where dh-php expects it to be.
Interesting! @Ladsgroup - Could you expand on the first point? ("Make sure the sharding ...") The reference to a 50% flush on section removal sounds like going from a naive mod N (= number of sections) to a static number of logical shards (so, approaching consistent hashing, which aligns with your later points), but I'm not sure I understand the relationship with the cache key structure in the first sentence.
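To make the distinction I'm asking about concrete, here is a rough sketch comparing the two schemes when one section is removed. Everything in it is an assumption for illustration: the section names, the count of 256 logical shards, and the helper names are hypothetical and do not reflect the actual implementation under discussion.

```python
import hashlib

LOGICAL_SHARDS = 256  # hypothetical fixed number of logical shards


def stable_hash(key: str) -> int:
    """Deterministic hash of a cache key (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def naive_section(key: str, sections: list[str]) -> str:
    """Naive scheme: hash mod N, where N is the current number of sections."""
    return sections[stable_hash(key) % len(sections)]


def build_shard_map(sections: list[str]) -> dict[int, str]:
    """Assign each logical shard to a section, round-robin."""
    return {s: sections[s % len(sections)] for s in range(LOGICAL_SHARDS)}


def remove_section(shard_map: dict[int, str], removed: str,
                   remaining: list[str]) -> dict[int, str]:
    """Reassign only the logical shards that lived on the removed section."""
    new_map = dict(shard_map)
    orphans = [s for s, sec in shard_map.items() if sec == removed]
    for i, shard in enumerate(orphans):
        new_map[shard] = remaining[i % len(remaining)]
    return new_map


def logical_section(key: str, shard_map: dict[int, str]) -> str:
    """Logical-shard scheme: hash mod a fixed shard count, then look up."""
    return shard_map[stable_hash(key) % LOGICAL_SHARDS]


def moved_fraction(keys, before, after) -> float:
    """Fraction of keys whose section differs between two placements."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


keys = [f"key:{i}" for i in range(10_000)]
old_sections = ["s1", "s2", "s3", "s4"]  # hypothetical section names
new_sections = ["s1", "s2", "s3"]        # "s4" removed

naive_moved = moved_fraction(
    keys,
    lambda k: naive_section(k, old_sections),
    lambda k: naive_section(k, new_sections),
)

old_map = build_shard_map(old_sections)
new_map = remove_section(old_map, "s4", new_sections)
logical_moved = moved_fraction(
    keys,
    lambda k: logical_section(k, old_map),
    lambda k: logical_section(k, new_map),
)

print(f"naive mod N moved:    {naive_moved:.1%}")
print(f"logical shards moved: {logical_moved:.1%}")
```

With the naive scheme, most keys change section when N changes; with a fixed set of logical shards, only the keys on the removed section move.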
Wed, Aug 21
While working through the production image definitions for T372602, I discovered that the three extension packages maintained by WMF (php-luasandbox, php-wmerrors, wikidiff2) build successfully, but with incomplete contents.
Verified that in a fresh docker-registry.wikimedia.org/bullseye:latest image, I can successfully:
Tue, Aug 20
Fri, Aug 16
Thanks for writing this up, Reuven.
Agreed, yeah: Some subset of those items will need to be done before the switchover, but exactly which subset depends on how far we expect things to be by then. I'll follow up on the task shortly.
Reminder to self: once live, wire this into the 01-stop-maintenance.py and 08-start-maintenance.py cookbooks.
+1 to option #3 as the most sensible / obvious one: adding something more complex than a single global boolean invites odd nonsense states in combination with read-only (currently a per-DC toggle) and primary DC.
Thu, Aug 15
Wed, Aug 14
Quick update from another occurrence starting at ~ 20:45 UTC today:
Tue, Aug 13
Additional period(s) of badness later today starting around 21:00 UTC.
Aug 8 2024
Though mainly focused on supporting the php 8.1 migration, there's ongoing work to support multiple base-image “flavors” and a helm-release-to-flavor mapping in scap (T370934), which may be useful here.
Aug 7 2024
Edit: Whoops, I completely missed T371885#10048618 onward before posting this. In any case, question still stands re: the annotation :)
Aug 6 2024
Ah, these are good questions.
Agreed with @Joe's assessment above: for each image type (e.g., mediawiki), scap would need to support a configurable set of base images from which the image will be built ("flavors").
Aug 1 2024
With the cache_warmup class relocated, I think the near-term work is done. There are two TODOs related to fully removing the script etc. from the maintenance hosts, but IMO we can just wait for the latter to go away as planned.
Alright, this should now be done:
- the script's clone subcommand enumerates pod IP:port pairs via the -tls-service Endpoints object(s)
- the script and URL files are now installed on the deployment hosts, where the necessary k8s configs and credentials are present
- the switchdc warmup caches cookbook now invokes the script on the primary deployment host via cumin (with updated arguments)
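For anyone curious what the enumeration step amounts to, the following is a minimal sketch, not the actual script. It assumes the Endpoints object has already been fetched and decoded into a dict; the function name, the sample object name, and the sample IPs and port are all made up for illustration.

```python
def endpoint_targets(endpoints: dict) -> list[str]:
    """Return 'IP:port' strings for every ready address in an Endpoints object.

    Only `addresses` is read; entries under `notReadyAddresses` are skipped.
    """
    targets = []
    for subset in endpoints.get("subsets") or []:
        ports = [p["port"] for p in subset.get("ports") or []]
        for address in subset.get("addresses") or []:
            for port in ports:
                targets.append(f"{address['ip']}:{port}")
    return targets


# Hypothetical sample shaped like a v1 Endpoints object (values are made up).
sample = {
    "kind": "Endpoints",
    "metadata": {"name": "example-tls-service"},
    "subsets": [
        {
            "addresses": [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}],
            "notReadyAddresses": [{"ip": "10.0.0.3"}],
            "ports": [{"name": "https", "port": 4444, "protocol": "TCP"}],
        }
    ],
}

targets = endpoint_targets(sample)
print(targets)
```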
Jul 30 2024
Jul 26 2024
Jul 25 2024
Jul 24 2024
Jul 23 2024
Many thanks, all who helped get this out the door.
Silenced ProbeDown for api-https:443 and appservers-https:443 for 24h:
- f6f67d8d-6381-43b3-9262-9a8cf58f2b19
- ed0d352b-fb83-4bd4-a586-142b100ca6e5
Jul 22 2024
Following up here after various chats on IRC:
Jul 19 2024
Jul 18 2024
In both cases, workers start failing with SIGILL at the start of badness, e.g. (from mw-api-ext.eqiad.main-7686884f77-ql69d):
appservers-ro.discovery.wmnet and api-ro.discovery.wmnet now resolve to failoid, by way of manually updating their DYNA records in the wmnet zone template to point to geoip!disc-failoid:
Jul 16 2024
Current status:
- appservers-rw and api-rw are depooled everywhere, and resolve to failoid as of 17:45 UTC
- api-ro is serving only from eqiad as of 17:40 UTC
- appservers-ro is depooled everywhere as of 19:25 UTC
Jul 12 2024
Jul 11 2024
For the record: v1.0.3 is live in staging only (production is untouched), after it became apparent that additional changes are needed. If a 1.0.4 is available with an updated swagger spec, let me know and I'm happy to assist.
Ah, great - thanks for confirming those older docs will go away, @mforns.
+1 to using a more descriptive name for the resource operated on.
Jul 10 2024
Cool, it sounds like the conversation has evolved to using a dedicated schema, and we're on the same page that a multi-value set should work (to accommodate reason).
Alright, good(er) news: the service is now live at /api/rest_v1/metrics/commons-analytics.
Great, thank you very much @dcausse for cleaning up the old config and @Clement_Goubert for confirming.
Jul 9 2024
Ah, thanks for surfacing that, @mforns.
In short, and I realize this doesn't help much, my understanding is that what makes sense as an object name vs. an object tag is really up to you (e.g., ergonomics of tag selectors for common operations).
Found my way here via the highly informative "wdqs-streaming-updater-test-T361935/WMF" user-agent you've used. Thanks for that!
Ah, interesting - I wasn't aware of the prior art with dnsbox. Indeed, reusing node for a fundamentally "host shaped thing" where (1) you anticipate eventually using as-yet unused fields and (2) do not anticipate ever needing to enrich node with new fields, seems less concerning.
@SGupta-WMF - thanks for documenting the API at [0]. One thing I noticed while updating wikitech: it looks like the examples assume the service is reachable at /api/rest_v1/metrics/commons-impact-analytics rather than /api/rest_v1/metrics/commons-impact.
Alright, good news: /api/rest_v1/metrics/commons-impact should now be publicly available.
Thanks for the excellent / detailed write-up, @ssingh!
@mforns - The v1.0.2 image is now live in staging. Please take a look when you get a chance, and let me know if / when you'd like me to proceed with the remaining steps.
Jul 8 2024
@mforns - The v1.0.1 image is now live in staging. As before, it can be reached internally at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443/ from any production host.
Jul 2 2024
@mforns sure, that's no problem at all! Just let me know when the image is ready.
Thanks for taking a look, @xcollazo. I'll defer to @mforns and @SGupta-WMF here, as my quick check was only based on comparison with [0] (which uses timestamp in the public API).
Thanks for the sample data, @xcollazo.
Jul 1 2024
The service is up and running in staging, and can be reached at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443 internally.