
Create a class to represent the identity of wikis on the same wiki farm
Open, Needs Triage, Public

Description

The concept of a "wiki ID" is spread throughout the MediaWiki code base, but it is not formally defined anywhere, and there is quite a bit of confusion about edge cases and its relationship to other things, like database names (see for instance T184529: Define a way to get a database connection based on a logical wiki ID.). Currently, the identity of other wikis in the same wiki farm is represented as a string, more or less identical to their database name. Interwiki prefixes are used to refer to other known sites (but also to the wikis on the same wiki farm).

To resolve this, a WikiID class should be introduced. Or rather a hierarchy of three classes:

  • SiteID identifies any site of any kind the wiki can refer to or interact with
  • WikiID identifies a MediaWiki site the wiki can refer to or interact with
  • LocalWikiID identifies a MediaWiki site on the local cluster (aka wiki farm) to which direct database access is possible.
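
As a rough illustration of the hierarchy above, here is a minimal, language-neutral sketch in Python (MediaWiki itself is PHP). The field names `uri` and `db_name` and the example URLs are assumptions made for the example; the task does not specify the internals of these classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteID:
    """Any site the wiki can refer to or interact with."""
    uri: str  # assumed canonical string form, e.g. the site's base URL

    def __str__(self) -> str:
        return self.uri


@dataclass(frozen=True)
class WikiID(SiteID):
    """A site that is known to be a MediaWiki site."""


@dataclass(frozen=True)
class LocalWikiID(WikiID):
    """A MediaWiki site on the local farm, reachable by direct DB access."""
    db_name: str = ""  # hypothetical field, not part of the proposal


# Hypothetical example values.
photo_site = SiteID("https://photos.example/")
dewiki = LocalWikiID("https://de.wikipedia.example/wiki/", db_name="dewiki")
```

Code that needs any site can accept a SiteID, while code that needs database access can require a LocalWikiID, so the most specific type carries the strongest guarantee.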

WikiIDs (or rather, all SiteIDs) have a string representation that can be used for persistence in the database and representation in the web API. That string representation should be treated as a URI, and should be fairly stable, though not impossible to change. The string representation has to be unique and consistent at least across the local cluster (aka wiki farm), but should ideally be universally unique. Using the wiki's base URL (domain plus article path) would be the obvious choice.

WikiIDs (or rather, all SiteIDs) are constructed by a SiteIDFactory service (or SiteInfoLookup, compare T113034: RFC: Overhaul Interwiki map, unify with Sites and WikiMap) that provides the following methods:

  • getSiteId( string|false ) will instantiate a SiteID from the given string, applying any normalization, aliasing, and lookups. At the very least, a round trip with a SiteID's string representation must be supported, but completely different things like database names may be supported for some sites, especially wikis on the local cluster. The SiteID returned should be of the most specific type possible. If false is passed in, a LocalWikiID representing the current wiki (home wiki of the factory service) is returned.
  • isHomeWiki( SiteID|null ): returns true if the given site ID is the home wiki of the factory service. Will always return true if null is passed in.
  • getHomeWikiID: returns the LocalWikiID representing the current wiki (home wiki of the factory service).
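
The three methods above could look roughly like the following Python sketch. The constructor arguments, the alias table, and the rule "anything listed in the alias table is farm-local" are assumptions made for the example, not part of the proposal.

```python
from typing import Dict, Optional, Union


class SiteID:
    def __init__(self, uri: str) -> None:
        self.uri = uri

    def __eq__(self, other: object) -> bool:
        return isinstance(other, SiteID) and other.uri == self.uri

    def __hash__(self) -> int:
        return hash(self.uri)


class LocalWikiID(SiteID):
    pass


class SiteIDFactory:
    def __init__(self, home_uri: str, local_aliases: Dict[str, str]) -> None:
        # local_aliases maps alternative names of farm-local wikis (for
        # example database names) to their canonical URI. Hypothetical shape.
        self._home = LocalWikiID(home_uri)
        self._aliases = dict(local_aliases)
        self._local_uris = set(local_aliases.values()) | {home_uri}

    def get_site_id(self, name: Union[str, bool]) -> SiteID:
        if name is False:
            return self._home  # legacy marker for "the current wiki"
        if not isinstance(name, str):
            raise TypeError("expected a string or False")
        uri = self._aliases.get(name, name)  # resolve aliases, else keep as-is
        # Return the most specific type we can justify.
        return LocalWikiID(uri) if uri in self._local_uris else SiteID(uri)

    def is_home_wiki(self, site_id: Optional[SiteID]) -> bool:
        return site_id is None or site_id == self._home

    def get_home_wiki_id(self) -> LocalWikiID:
        return self._home


# Hypothetical farm with two local wikis; "enwiki" is the home wiki.
factory = SiteIDFactory(
    "https://en.wikipedia.example/wiki/",
    {
        "enwiki": "https://en.wikipedia.example/wiki/",
        "dewiki": "https://de.wikipedia.example/wiki/",
    },
)
```

Note that "home wiki" here is a property of the factory instance, which matches the point that the notion is only meaningful relative to a given SiteIDFactory.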

In general, false instead of a string ID, or null instead of a SiteID object, are considered to represent the current wiki (home wiki). However, to allow transparent cross-wiki functionality, this notion is meaningful only relative to a given SiteIDFactory service; so it's not necessarily identical to the wiki the request was made to.

Some requirements for a wiki ID class representing farm-local wikis (LocalWikiId):

  • should have an equals() method, such that if two LocalWikiIds are equal, they refer to the same wiki.
  • should have a __toString method, but the string representation must not be considered stable (it could perhaps change from a database name to a base URI)
  • Can be based on the database name (plus table prefix, plus schema name, plus host name) for now, but using the wiki's base URL (domain and base path) would probably be better in the long run.
  • null should be used to represent the "current" wiki (just like false is currently used instead of a string based ID in this case)
  • WikiIDs must only be constructed by a factory service (WikiIdResolver), that has a getWikiId() method.
  • getWikiId() will resolve any aliases (such as interwiki prefixes or domain names) and construct a canonical LocalWikiId()
  • WikiIdResolver should also have a getDefaultWikiId() method that returns the canonical ID object for the wiki that is otherwise represented by null ("this" wiki, the "current" wiki).
  • LocalWikiId may extend a WikiId class that can also be used to represent "foreign" wikis.
  • LocalWikiId may extend a SiteId class that can also be used to represent "foreign" non-wiki sites.
  • a mapping should exist between LocalWikiId and DatabaseDomain
  • a mapping should exist between LocalWikiId and Interwiki
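
A minimal Python sketch of the first few requirements, assuming the ID is based on the database name, schema, and table prefix "for now". `DbDomain` is a stand-in defined for the example, not the real DatabaseDomain class, and the dash-joined string form is an assumption.

```python
from typing import NamedTuple, Optional


class DbDomain(NamedTuple):
    """Stand-in for a database/schema/prefix triple."""
    database: str
    schema: Optional[str]
    prefix: str


class LocalWikiId:
    def __init__(self, domain: DbDomain) -> None:
        self._domain = domain

    def equals(self, other: "LocalWikiId") -> bool:
        # Two IDs are equal exactly when they point at the same wiki.
        return self._domain == other._domain

    def __str__(self) -> str:
        # Deliberately not a stable format: callers must not parse it.
        parts = [self._domain.database, self._domain.schema, self._domain.prefix]
        return "-".join(p for p in parts if p)

    def to_db_domain(self) -> DbDomain:
        # The required mapping from a LocalWikiId to its database domain.
        return self._domain


a = LocalWikiId(DbDomain("enwiki", None, ""))
b = LocalWikiId(DbDomain("enwiki", None, ""))
c = LocalWikiId(DbDomain("wikifarm", None, "en_"))
```

Because callers compare with equals() and never inspect the string, the internals could later switch from the database triple to a base URL without touching call sites.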


Event Timeline

using the wiki's base URL (domain and base path) would probably be better in the long run.

For one thing, having globally unique WikiIDs would be valuable for dealing with provenance data in a network of wikis where each wiki only knows about its immediate neighbors (think about a wiki having a ForeignAPIRepo for images and that foreign wiki having a ForeignAPIRepo itself, or a similar situation with Wikibase).

How would this relate to the existing wikifarm related concepts that are already in core? It seems there is already:

  • SiteConfiguration / wgConf
  • WikiMap that builds on top of SiteConfiguration and combines (duplicates?) it with $wgLocalDatabases
  • SiteStore / SiteLookup, i.e. the sites table.

Personally I'm a bit worried that trying to create a new unified replacement would merely extend this list with a fourth option. It already doesn't seem clear (to me at least) how the existing implementations are distributed between use cases: for instance, the sites table only seems to be used by InterwikiLookupAdapter (which is not enabled by default), WikiMap is extensively used to allow a wiki to push jobs into the job queue of a different wiki (a feature that only Wikibase seems to be making use of), etc.

Thanks, I will move this comment there!

This seems very overcomplicated. The proposal in this task seems to be https://xkcd.com/927/ for the concept of "identifiers", and would likely turn out much like the last panel there.

What do we actually need?

  1. A short, human readable string (yes, string) to appear in configuration files and such to refer to local wikis for various purposes. Mostly for getting database connections, also for other configuration.
  2. An identifier for known "sites", that probably doesn't need to be so short because it probably won't be used much in configuration files.
  3. A bidirectional mapping between #1 and #2.
  4. A bidirectional mapping between #2 (which may not be wikis) and interwiki prefixes used for linking to those sites.
  5. A mapping between #2 and their public endpoints (primarily the interwiki target URL, but possibly also stuff like api.php for "sites" known to be running a compatible version of MediaWiki).
  6. ???
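
For illustration, the bidirectional mapping in #3 can be as small as two dictionaries kept in sync. This is a Python sketch only; the class name `BiMap` and the example identifiers are made up for the example.

```python
from typing import Dict


class BiMap:
    """One-to-one mapping that can be queried in both directions."""

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}
        self._backward: Dict[str, str] = {}

    def add(self, left: str, right: str) -> None:
        # Refuse duplicates on either side so the mapping stays one-to-one.
        if left in self._forward or right in self._backward:
            raise ValueError(f"duplicate entry: {left!r} <-> {right!r}")
        self._forward[left] = right
        self._backward[right] = left

    def to_right(self, left: str) -> str:
        return self._forward[left]

    def to_left(self, right: str) -> str:
        return self._backward[right]


# Hypothetical local identifier <-> global identifier pairs.
local_to_global = BiMap()
local_to_global.add("enwiki", "https://en.wikipedia.example/")
local_to_global.add("dewiki", "https://de.wikipedia.example/")
```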

SiteConfiguration is intended as a mapping from local wiki identifiers to configuration. Unfortunately it's not terribly useful as it is because you need to know a lot of metadata about the target wiki in order for SiteConfiguration to actually determine its configuration.

Wikimedia\Rdbms\DatabaseDomain seems to have resulted from overcomplicating #1. Originally we used MySQL's "database" name as the local wiki identifier, one database per wiki, and things were fine. When we added table-prefixes-in-one-database, and used PostgreSQL's "database" as the equivalent of MySQL's "database" (PG's "schema" would have been the better equivalence), we should have kept the identifiers as simple references to database configuration that just happened to match the MySQL database names rather than trying to encode the database name, table prefix, and PG schema into the identifier. Facility for swapping an existing server connection (Database object) to a different database/schema and prefix should have been a private implementation detail.

The interwiki map is a bit of #4 and #5, being a mostly one-way mapping from interwiki prefix to endpoints (primarily a URL, maybe also an Action API endpoint).

Sites seems to be a somewhat overcomplicated mapping from "global" identifiers (#2-ish) to a URL and some other metadata, plus a way to query by some of that metadata. It seems intended to be able to handle most/all of #5, but on WMF wikis at least it looks like it's only populated for local sites.

WikiMap is, as far as I can tell, a one-way mapping from either a local identifier or a Sites "global" identifier to a URL.


What we probably actually need are two different identifiers: a local identifier string like "enwiki" (#1 above), and a global identifier string (that might be a URI or a UUID or even a Wikidata Q-number) (#2 above). Then in MediaWiki we need a few changes:

  • Add a bidirectional mapping from the local identifier to the global identifier. (#3 above)
  • Add a bidirectional mapping from global identifier to interwiki prefixes. (#4 above)
    • Possibly qualified between "global" prefixes (expected to exist on any MW site) and "local" prefixes (e.g. "de" on enwiki maps to dewiki, while on enwiktionary it maps to dewiktionary and on some other sites it might not map anywhere at all).
    • We might need an indication of which endpoint at the site a prefix is targeting, so e.g. "Gerrit"/"Git" or "FlickrUser"/"FlickrPhoto" at https://meta.wikimedia.org/wiki/Interwiki_map don't have to point to two different "sites" for different endpoints at one site.
  • Add a mapping from global identifier to public endpoints and any useful bits of Sites metadata. Perhaps at least partially bidirectional. (#5 above)
  • Either rework SiteConfiguration to not take $suffix, $params, and $wikiTags as parameters to ->get() and similar methods (instead rely on the ->siteParamsCallback provide those from the local identifier), or make sure those three things are part of the metadata in the third bullet's mapping for local sites so callers can reliably use SiteConfiguration.
    • Although I note SiteConfiguration still won't work right for things set by code in CommonSettings.php, Setup.php, or the like rather than directly from InitialiseSettings.php. And possibly won't work right for things defaulting to DefaultSettings.php on the target wiki but not on the calling wiki. At some point we just have to take what we can easily get.
  • Drop DatabaseDomain in favor of using the local identifier to get the database, schema, and table prefix from the configuration passed to LBFactory. Or at least stop pretending it's actually any sort of an identifier by dropping the conversion to and from a string.
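
To make the last bullet concrete: the idea is that the local identifier stays an opaque key, and the database, schema, and table prefix are looked up from configuration rather than encoded in the identifier. A Python sketch, where `DB_CONFIG`, `DbLocation`, and the entries are hypothetical and do not reflect how LBFactory is actually configured:

```python
from typing import Dict, NamedTuple, Optional


class DbLocation(NamedTuple):
    database: str
    schema: Optional[str]
    table_prefix: str


# Hypothetical farm configuration keyed by the short local identifier.
DB_CONFIG: Dict[str, DbLocation] = {
    "enwiki": DbLocation("enwiki", None, ""),
    "testwiki": DbLocation("shared", "mediawiki", "test_"),
}


def db_location(local_id: str) -> DbLocation:
    """Resolve a local wiki identifier to where its tables live."""
    try:
        return DB_CONFIG[local_id]
    except KeyError:
        raise LookupError(f"unknown local wiki: {local_id!r}") from None
```

With this shape, "testwiki" can move to a different database or prefix by editing configuration, and the identifier used everywhere else does not change.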

Interwiki mapping, WikiMap's functionality, and Sites functionality would then use the three added maps to convert from what they have to what they need. In terms of things in this task, the "LocalWikiID" is the local identifier string, and both the "WikiID" and "SiteID" are the global identifier string. I can't see much point in differentiating the latter two at the level of the identifiers, rather than just having code that needs a MediaWiki site raise an error if passed a non-MediaWiki global identifier.
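
A rough Python sketch of how interwiki resolution could sit on top of such maps, including the "global" versus "local" prefix distinction and the named endpoints mentioned in the list. All table contents, endpoint names, and example.* URLs are invented for the example.

```python
from typing import Dict, Optional, Tuple

# Global identifier -> named public endpoints (URL templates, $1 = target).
ENDPOINTS: Dict[str, Dict[str, str]] = {
    "https://de.wikipedia.example/": {"page": "https://de.wikipedia.example/wiki/$1"},
    "https://de.wiktionary.example/": {"page": "https://de.wiktionary.example/wiki/$1"},
    "https://photos.example/": {
        "user": "https://photos.example/people/$1",
        "photo": "https://photos.example/photo/$1",
    },
}

# Prefixes expected to mean the same thing on every wiki:
# prefix -> (global identifier, endpoint name).
GLOBAL_PREFIXES: Dict[str, Tuple[str, str]] = {
    "photouser": ("https://photos.example/", "user"),
    "photo": ("https://photos.example/", "photo"),
}

# Prefixes whose meaning depends on the calling wiki.
LOCAL_PREFIXES: Dict[str, Dict[str, Tuple[str, str]]] = {
    "enwiki": {"de": ("https://de.wikipedia.example/", "page")},
    "enwiktionary": {"de": ("https://de.wiktionary.example/", "page")},
}


def resolve_prefix(calling_wiki: str, prefix: str, title: str) -> Optional[str]:
    """Return the target URL for prefix:title, or None if the prefix is unknown."""
    target = LOCAL_PREFIXES.get(calling_wiki, {}).get(prefix)
    if target is None:
        target = GLOBAL_PREFIXES.get(prefix)
    if target is None:
        return None
    site, endpoint = target
    return ENDPOINTS[site][endpoint].replace("$1", title)
```

Here two prefixes point at one site with different endpoints, so the site itself is only registered once.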

In general, false instead of a string ID, or null instead of a SiteID object, are considered to represent the current wiki (home wiki).

Why should false be used to represent the current wiki instead of null? null is meant to be used for cases like this, and in particular it can be used with type hinting (nullable types) while false cannot.

In general, types like "foo|bool" where the bool can only be false strike me as an antipattern. They should be "foo|null". (Of course, PHP itself does this, like in strpos(), but that certainly doesn't prove it's not an antipattern!)

(I recognize that this use of false is the status quo and is not introduced by this proposal. I'm just putting this out there so it can be considered for change if we're changing things anyway.)

In general, false instead of a string ID, or null instead of a SiteID object, are considered to represent the current wiki (home wiki).

Why should false be used to represent the current wiki instead of null?

Just because that's how it has been for the past 15 years, and it's ingrained in the codebase in a thousand places. My intention here is to migrate from string|bool to SiteID|null. Actually, anything new accepting string|bool should probably accept string|bool|null. But we'll want false instead of a string to continue working for the foreseeable future. Per this proposal, though, strings should only be used to represent sites in config and external communication; internally, it should always be a SiteID (or null).
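
One way to picture the transition is a single normalization step at the boundary that accepts both the legacy false and the preferred null. A simplified Python sketch (the function name is hypothetical, and it returns a plain string where the proposal would return a SiteID):

```python
from typing import Optional, Union


def normalize_wiki_arg(wiki: Union[str, bool, None]) -> Optional[str]:
    """Map the legacy False and the preferred None to None ("current wiki")."""
    if wiki is None or wiki is False:
        return None
    if not isinstance(wiki, str):
        # True, numbers, etc. were never valid wiki identifiers.
        raise TypeError(f"expected str, False or None, got {wiki!r}")
    return wiki
```

New code can then be written against the nullable form only, while old callers passing false keep working.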

What do we actually need?

  2. An identifier for known "sites", that probably doesn't need to be so short because it probably won't be used much in configuration files.

Also for unknown sites (which are more than one hop away in the wiki graph), so we can e.g. track provenance.

Wikimedia\Rdbms\DatabaseDomain seems to have resulted from overcomplicating #1.

DatabaseDomain is just an object holding all the DB connection config, which is IMO a sensible thing to have. What's less sensible is using it as a wiki identifier, although I don't think that happens that much.

What we probably actually need are two different identifiers: a local identifier string like "enwiki" (#1 above), and a global identifier string (that might be a URI or a UUID or even a Wikidata Q-number) (#2 above).

A URI is practical in that it ensures you can actually find the site (also, fake IDs are less of a problem). It also doesn't require the target site to cooperate with you (e.g. to have a recent enough MediaWiki version). It is probably more human-readable too.
OTOH site URLs can change, which would be a major headache.

  • Although I note SiteConfiguration still won't work right for things set by code in CommonSettings.php, Setup.php, or the like rather than directly from InitialiseSettings.php.

Doesn't the SiteConfiguration callback on Wikimedia sites shell out to a maintenance script on the target wiki and fetch the accurate configuration?

In any case, accessing cross-wiki config is a can of worms for another time.

In terms of things in this task, the "LocalWikiID" is the local identifier string, and both the "WikiID" and "SiteID" are the global identifier string. I can't see much point for differentiating the latter two at the level of the identifiers, rather than just having code that needs a MediaWiki site raise an error if passed a non-MediaWiki global identifier.

IMO making your IDs classes is usually a good practice, even if they are basically just strings. It's self-documenting (as the class is a natural node for documentation, while a developer might have a harder time finding more info on @param string $id Site ID), more future-proof, and prevents mistakes where you pass the wrong type of ID. Having separate classes for remote wiki and remote site is practical for the same reason: mistakes can be caught during static analysis. (The two classes might well use the same identifier schema internally.)

DatabaseDomain is just an object holding all the DB connection config, which is IMO a sensible thing to have.

It holds the "database" (schema) name, schema name, and table prefix, but not the host, port, and so on.

What's less sensible is using it as a wiki identifier, although I don't think that happens that much.

wfWikiID() uses the "database" name + table prefix as the ID, so everything does in a sense. DatabaseDomain adds $wgDBmwschema to the mix. It's not clear from 847b91bf1 why that was added in.

Doesn't the SiteConfiguration callback on Wikimedia sites shell out to a maintenance script on the target wiki and fetch the accurate configuration?

SiteConfiguration::getConfig() does. It's documented as being expensive and needing special setup.

The more commonly used methods such as ::get() or ::getSetting() don't.

IMO making your IDs classes is usually a good practice, even if they are basically just strings. It's self-documenting (as the class is a natural node for documentation, while a developer might have a harder time finding more info on @param string $id Site ID), more future-proof, and prevents mistakes where you pass the wrong type of ID.

And as I noted, on the downside it's harder to use a class object in configuration files or database fields, which tends to be exactly what these are being used for in practice. PHP doesn't even allow objects as array keys.

Having separate classes for remote wiki and remote site is practical for the same reason: mistakes can be caught during static analysis. (The two classes might well use the same identifier schema internally.)

As I see it, all it means is moving the check for what type of remote site it is from the point of use to the point of creation. In some cases that's a good thing (easier to find where invalid data is coming from), in some cases not (performing checks that are seldom actually required, possibly a need for type conversion when a thing that needs "wiki" calls a thing that needs "any site"). In this case my instinct leans towards "not".
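
To illustrate the point-of-use variant: a single global identifier type, with code that needs a MediaWiki site checking for that when it is actually called. This is a Python sketch; `GlobalId`, the `is_mediawiki` flag (assumed to come from site metadata), and the api.php path are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalId:
    value: str
    is_mediawiki: bool  # assumed to be known from site metadata


def require_wiki(site: GlobalId) -> GlobalId:
    """Point-of-use check: fail only where a MediaWiki site is really needed."""
    if not site.is_mediawiki:
        raise ValueError(f"{site.value} is not a MediaWiki site")
    return site


def api_url(site: GlobalId) -> str:
    require_wiki(site)
    # The path below is an assumption, real sites may be configured differently.
    return site.value.rstrip("/") + "/w/api.php"


wiki = GlobalId("https://wiki.example/", is_mediawiki=True)
shop = GlobalId("https://shop.example/", is_mediawiki=False)
```

With separate WikiID and SiteID classes the same error would instead surface when the identifier is created, before any caller asks for the API URL.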