
Explore usage tracking for chart pages and tabular data pages
Closed, Resolved · Public · 3 Estimated Story Points · Spike

Description

We plan to put chart definitions in the Data: namespace on Commons, and to use the existing Data:*.tab pages on Commons as data sources for charts.

We will need to track which pages on other wikis use each Data:*.chart and Data:*.tab page, so that we can do the following:

  • On each chart page and tabular data page, display a list of pages that use it
  • When a chart page or tabular data page is edited, purge/rerender the pages that use it

To explore how to do this, we can look at:

  • The GlobalUsage extension, which has a global database table for usage tracking and enqueues purge jobs when a file is changed. This appears to do everything we need, but unfortunately its database schema is specific to images. We could explore generalizing this to non-image resources.
  • The JsonConfig extension, which provides the content model for chart and tabular data pages, and supports loading them cross-wiki. It doesn't do any usage tracking, but this could be a logical place to either put the usage tracking itself, or to put code that instructs the GlobalUsage extension to track usages.
  • How Wikibase does its own usage tracking
  • This stalled RFC from 2020: T253026: Introduce a centralized Dependency Tracking Service

Whatever we do here, we should discuss it with the MediaWiki-Platform-Team, even if we do the work ourselves.

Current proposal

Add a new DB table that belongs to Commons but lives on x1 instead of s4 (meaning it couldn't be joined with tables from the commonswiki DB). Use this to track usage of JsonConfig pages similarly to how the GlobalUsage extension currently tracks image usage, with similar invalidation handling through the job queue to ensure that when a chart is edited, pages that use it are invalidated.

Proposed DB schema:

-- This has the same schema as the linktarget table, but since we intend for globaljsonlinks to be in x1, we can't join against linktarget
CREATE TABLE globaljsonlinks_target (
    gjlt_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gjlt_namespace INT NOT NULL,
    gjlt_title VARBINARY(255) NOT NULL,
    UNIQUE INDEX gjlt_namespace_title (gjlt_namespace, gjlt_title),
    PRIMARY KEY (gjlt_id)
);

CREATE TABLE globaljsonlinks_source_ns (
    gjlsn_id INT UNSIGNED AUTO_INCREMENT NOT NULL,
    gjlsn_wiki VARBINARY(32) NOT NULL,
    gjlsn_namespace VARBINARY(255) NOT NULL,
    UNIQUE INDEX gjlsn_wiki_namespace (gjlsn_wiki, gjlsn_namespace),
    PRIMARY KEY (gjlsn_id)
);

CREATE TABLE globaljsonlinks (
    gjl_source_wiki_ns INT UNSIGNED NOT NULL, /* refers to globaljsonlinks_source_ns.gjlsn_id */
    gjl_source_title VARBINARY(255) NOT NULL,
    gjl_target BIGINT UNSIGNED NOT NULL, /* refers to globaljsonlinks_target.gjlt_id */
    INDEX gjl_target_source (gjl_target, gjl_source_wiki_ns, gjl_source_title),
    PRIMARY KEY (gjl_source_wiki_ns, gjl_source_title, gjl_target)
);
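
To make the intended access patterns concrete, here is a sketch of the primary read query against these tables (the namespace ID 486 and the title are illustrative placeholders, not confirmed values):

-- Hypothetical lookup: list all pages, across wikis, that use a given Data page.
-- The gjl_target_source index is intended to serve exactly this query.
SELECT gjlsn_wiki, gjlsn_namespace, gjl_source_title
FROM globaljsonlinks
JOIN globaljsonlinks_target ON gjl_target = gjlt_id
JOIN globaljsonlinks_source_ns ON gjl_source_wiki_ns = gjlsn_id
WHERE gjlt_namespace = 486 /* assumed Data: namespace ID */
  AND gjlt_title = 'Example.tab';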

Open questions

  • What should the scope of this table be? Cross-wiki usage of JsonConfig pages hosted on Commons
    • Should it also include image links in the future (replacing globalimagelinks), or should those live in a separate table with a similar schema? No, those should live in a separate table
    • Should it be limited to Charts usage only? Should it be slightly broader but limited to JsonConfig usage? It should be limited to JsonConfig usage, but not necessarily Charts usage
  • Depending on the scope, should the namespace of the target page (gjlt_namespace) exist as a field, or should it be implicit (Data for the Charts/JsonConfig table, File for the image table)? The JsonConfig extension currently allows its pages to live in different namespaces, so in theory we still need the namespace field. In practice we probably don't, so we could still change this by limiting the scope to JsonConfig pages in the Data namespace specifically.
  • Depending on the scope, what should the name of the table be? (e.g. globallinks if broad, globalchartlinks or globaljsonlinks if narrow) globaljsonlinks
  • Which extension should define this table and maintain/use its contents? The JsonConfig extension


Related Objects

Event Timeline

Catrope renamed this task from Explore usage tracking for chart pages and tabular data pages to Spike: Explore usage tracking for chart pages and tabular data pages. · Aug 12 2024, 6:32 PM
Catrope triaged this task as Medium priority.
Catrope edited projects, added Charts (Sprint 4); removed Charts.
Catrope set the point value for this task to 3.
LGoto renamed this task from Spike: Explore usage tracking for chart pages and tabular data pages to Explore usage tracking for chart pages and tabular data pages. · Aug 12 2024, 6:32 PM
LGoto raised the priority of this task from Medium to Needs Triage.
LGoto added a project: Spike.
LGoto removed the point value for this task.
Restricted Application changed the subtype of this task from "Task" to "Spike". · Aug 12 2024, 6:32 PM
LGoto triaged this task as Medium priority. · Aug 12 2024, 6:33 PM
Catrope raised the priority of this task from Medium to Needs Triage. · Aug 12 2024, 6:33 PM
Catrope set the point value for this task to 3.
LGoto triaged this task as Medium priority. · Aug 12 2024, 6:33 PM

I've been meaning to migrate the global usage table to a shared table in x1, to reduce the confusion of other wikis reading and writing to the s4 database, and also to reduce the size of the overly large s4. Even though the new table won't take much space, I suggest using virtual domains to put it in x1/wikishared; the only downside is that you wouldn't be able to join with core tables, and it won't be replicated to the cloud (we will fix the latter eventually, but not right now). That would make the future migration of the globalimagelinks table easier too.

Core has multiple tracking tables because you might want to record different relationships for the same pair of pages (an article can include an image, transclude the image description page, link to the image description page...). Should the generalized global usage tracking table have a "type" or "purpose" field? There is no overlap between image usage and tabular data usage, but in the future we might want to track things that aren't separated by distinct namespaces.

This is a good point, I think there should be a type/purpose field. MediaWiki core recognizes the three types of relationships between pages that you listed (A links to B; A transcludes B; A embeds B as an image) and one more (A is in category B), but it tracks each of these separately using almost identical DB tables and ParserOutput APIs (the pagelinks and templatelinks schemas are identical; imagelinks differs only in that the namespace is implicit, and categorylinks has some extra data). Merging these tables into one is probably too ambitious for now, but I think it makes sense to create one global table for this purpose rather than three, and use a field to distinguish the different relationship types.

(However, I'm not proposing we track interwiki page links here, that would require more work, would probably add a lot of data to this table, and we wouldn't need it for dependency tracking. I'm also excluding categorylinks from consideration here, because I don't think cross-wiki category membership relationships make sense.)

Some wikis load Commons tabular data pages into Lua and generate a wiki table or something like that (I think the cross-wiki aspects of doing that are enabled by JsonConfig). Is that in scope?

For the overall system, yes. I don't know that we will necessarily implement tracking for Lua usage as part of our work on this task (that depends on how easy it is, and on how much the work needed for it overlaps with the work we already need to do for chart usage), but the tracking table for chart usage should also support tracking Lua usage (and usage of things in general, see below).

Is the intent to create a generic tracking service that will eventually become the canonical one (e.g. Wikibase might use it some day)?

Ideally yes, for tracking cross-wiki dependencies between pages. Wikibase is a bit of a special case in that the relationships it needs to track can be more granular than page->page relationships (it often cares about which properties of an entity are being used, rather than just which entity). But for more straightforward page->page relationships across wikis, I think this should become the canonical system (including for globalimagelinks, which we'd migrate to this).

Outside the scope of the task, but the way Wikibase handles change notifications (inserting recent changes entries so edits to Wikidata items which affect the generated wikitext show up in watchlists etc) is worth looking at, especially if you also want to allow use of chart data via Lua.

Yes that's outside the scope of this task, and I'm also not as sure about it. We do change notifications via RC entries for Wikidata usage, but we don't do the same thing for templates or Lua modules for example: if someone edits Template:US presidents (the box with the list of all presidents that appears on the bottom of every president's article), that doesn't insert RC entries for Joe Biden, Donald Trump, etc, so if you watch one of those pages you wouldn't be notified about that change. The same is true for image reuploads. So I don't think there's a consistent precedent here.

I've been meaning to migrate the global usage table to a shared table in x1, to reduce the confusion of other wikis reading and writing to the s4 database, and also to reduce the size of the overly large s4. Even though the new table won't take much space, I suggest using virtual domains to put it in x1/wikishared; the only downside is that you wouldn't be able to join with core tables, and it won't be replicated to the cloud (we will fix the latter eventually, but not right now). That would make the future migration of the globalimagelinks table easier too.

I didn't realize the current globalimagelinks table lived in s4 as a local commonswiki table. In my view, the new table should be an actually-global table (like the CentralAuth table) and should live in wikishared, which is where the other global tables live. With that in mind, I already wasn't expecting to be able to join it with other tables, so I think that's fine.

I like your idea of making the migration easier this way: we'd just keep globalimagelinks in s4, create the new table in wikishared, and then migrate image tracking to the new system. Once that's done, the globalimagelinks table in s4 will be unused and can be deleted.

@Ladsgroup while I'm already picking your brain here... do you think it would make sense to normalize link targets in this new table, similar to how this works in MW core?

For local dependency tracking, we have this schema in core, where the "from" field refers to a page ID, and the "to" field refers to a linktarget:

CREATE TABLE linktarget (
  lt_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
  lt_namespace INT NOT NULL,
  lt_title VARBINARY(255) NOT NULL,
  UNIQUE INDEX lt_namespace_title (lt_namespace, lt_title),
  PRIMARY KEY(lt_id)
);

CREATE TABLE templatelinks (
  tl_from INT UNSIGNED DEFAULT 0 NOT NULL, /* page ID of the page that transcludes another page; refers to page.page_id */
  tl_target_id BIGINT UNSIGNED NOT NULL, /* target ID of the page being transcluded; refers to linktarget.lt_id */
  tl_from_namespace INT DEFAULT 0 NOT NULL, /* denormalized page.page_namespace value of the row tl_from points to */
  INDEX tl_target_id (tl_target_id, tl_from),
  INDEX tl_backlinks_namespace_target_id (tl_from_namespace, tl_target_id, tl_from),
  PRIMARY KEY(tl_from, tl_target_id)
);

/* pagelinks has the same schema but with pl_ prefixes instead of tl_ */
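
For reference, a backlink lookup against this normalized core schema looks roughly like the following sketch (10 is the standard Template namespace ID; the title is a placeholder):

-- Pages transcluding Template:Foo, resolved through linktarget
SELECT page_namespace, page_title
FROM templatelinks
JOIN linktarget ON tl_target_id = lt_id
JOIN page ON tl_from = page_id
WHERE lt_namespace = 10
  AND lt_title = 'Foo';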

The current globalimagelinks table in GlobalUsage uses this schema:

CREATE TABLE globalimagelinks (
  gil_wiki VARCHAR(32) NOT NULL, /* wiki ID of the "from" page */
  gil_page INT UNSIGNED NOT NULL, /* page ID of the "from" page, refers to page.page_id in gil_wiki's DB */
  gil_to VARCHAR(255) NOT NULL, /* name of the image, in the File: namespace on commonswiki */
  gil_page_namespace_id INT NOT NULL, /* namespace ID of the "from" page (denormalized-ish; could be looked up by querying page.page_namespace in gil_wiki's DB) */
  gil_page_namespace VARCHAR(255) DEFAULT NULL, /* namespace name of the "from" page (could be looked up in gil_wiki's config) */
  gil_page_title VARCHAR(255) DEFAULT NULL, /* name of the "from" page (denormalized-ish; could be looked up by querying page.page_title in gil_wiki's DB) */
  INDEX globalimagelinks_wiki (gil_wiki, gil_page),
  INDEX globalimagelinks_wiki_nsid_title (gil_wiki, gil_page_namespace_id, gil_page_title),
  PRIMARY KEY(gil_to, gil_wiki, gil_page)
);
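
For comparison, the primary read on this table (roughly what Special:GlobalUsage does; a sketch with a placeholder file name) needs no joins at all, at the cost of carrying the denormalized columns:

-- Usage list for one file; the PRIMARY KEY (gil_to, gil_wiki, gil_page) serves this
SELECT gil_wiki, gil_page_namespace, gil_page_title
FROM globalimagelinks
WHERE gil_to = 'Example.jpg'
ORDER BY gil_wiki, gil_page
LIMIT 50;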

In this schema, the "to" wiki name and namespace are implicit ("commonswiki" and "File"), we'd have to add fields for those. I think that presents an opportunity to normalize both the "from" and "to" field clusters using two linktarget-esque tables (with slightly different schemas, because the "from" page always exists but the "to" page doesn't always exist). Maybe the schema could look something like this:

CREATE TABLE globallinks_source (
    gls_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gls_wiki VARBINARY(32) NOT NULL,
    gls_page_id INT UNSIGNED NOT NULL,
    gls_page_namespace INT NOT NULL, /* renamed from page_namespace_id, for consistency with the page table */
    gls_page_namespace_text VARCHAR(255) NOT NULL, /* renamed from page_namespace, to avoid confusion */
    gls_page_title VARBINARY(255) NOT NULL,
    UNIQUE INDEX gls_wiki_page_id (gls_wiki, gls_page_id),
    PRIMARY KEY (gls_id)
);

CREATE TABLE globallinks_target (
    glt_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    glt_wiki VARBINARY(32) NOT NULL,
    glt_page_namespace_text VARBINARY(255) NOT NULL,
    glt_page_title VARBINARY(255) NOT NULL,
    UNIQUE INDEX glt_wiki_ns_title (glt_wiki, glt_page_namespace_text, glt_page_title),
    PRIMARY KEY (glt_id)
);

CREATE TABLE globallinks (
    gl_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gl_source BIGINT UNSIGNED NOT NULL, /* refers to globallinks_source.gls_id */
    gl_target BIGINT UNSIGNED NOT NULL, /* refers to globallinks_target.glt_id */
    gl_type INT NOT NULL, /* meaning defined in code: 0 for page link, 1 for transclusion, 2 for image, etc. */
    UNIQUE KEY gl_source_type_target (gl_source, gl_type, gl_target),
    INDEX gl_target_type (gl_target, gl_type),
    PRIMARY KEY (gl_id)
);
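
Under this sketch, a typed backlink query would look something like the following (illustrative values only; gl_type = 2 follows the image example in the comment above):

-- All pages, on any wiki, embedding File:Example.jpg as an image
SELECT gls_wiki, gls_page_namespace_text, gls_page_title
FROM globallinks
JOIN globallinks_target ON gl_target = glt_id
JOIN globallinks_source ON gl_source = gls_id
WHERE glt_wiki = 'commonswiki'
  AND glt_page_namespace_text = 'File'
  AND glt_page_title = 'Example.jpg'
  AND gl_type = 2;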

Does that make sense to you, or do you think it should be done differently / do you have any other advice?

See also T241053#5757372 for a further proposal to normalize wiki and namespace in the (here proposed) globallinks_source and globallinks_target tables.

  • For normalizing wiki: since T224020 has not been resolved, we previously created a dedicated name store for wiki IDs in T368151#10007712. Normalizing wiki may be unnecessary if namespace is normalized.
  • For namespace, also note that a local namespace may be renamed, so we should use only its ID to find a namespace.
    • If a namespace gets renamed, we only need to change its "text" if we normalize namespaces into a dedicated table (and we can do that as a normal update of this table).
    • If we do not normalize namespaces, there may be millions of globallinks_source rows to update when a namespace gets renamed. Until that is fully done, users will see inconsistent data in GlobalUsage.

globallinks_source will contain: gls_id, gls_wiki_namespace, gls_page_title
globallinks_target will contain: glt_id, glt_wiki_namespace, glt_page_title
globallinks_wiki_namespace table: glwn_id, glwn_wiki (varchar or id to globallinks_wiki below), glwn_namespace, glwn_namespace_text
(maybe unnecessary) globallinks_wiki: glw_id, glw_wiki

Note that the above proposed design assumes the target wiki's namespace is normalized before storage; otherwise we do not know the namespace ID in the target wiki (and we may require a different new table, without a namespace ID, to normalize the target namespace). This is not a problem if we only want to track global image or chart usage, since the target page always resides in the same namespace. It may be an issue if we support T6547: Support crosswiki template inclusion (transclusion => interwiki templates, etc.) in the future and use this table to track page dependencies, or use this table for other types of dependency tracking that involve a variable target namespace.
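
To illustrate the rename point above: with the normalized globallinks_wiki_namespace table in this sketch, renaming a namespace on one wiki becomes a single-row update rather than a scan over millions of link rows (table, column, and value names follow the sketch and are purely illustrative):

UPDATE globallinks_wiki_namespace
SET glwn_namespace_text = 'NewName'
WHERE glwn_wiki = 'examplewiki'
  AND glwn_namespace = 100;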

Would wikifunctions tracking be a potential use case here too?

The current globalimagelinks table in GlobalUsage uses this schema:

Regarding the GlobalUsage extension, and its globalimagelinks table - This is quite a large schema currently. Do we actually need each of those columns and indexes today?

  • gil_wiki local wiki ID, naturally.
  • gil_page local page ID, naturally.
  • gil_to target title, naturally.
  • gil_page_namespace_id local page namespace. Why?
  • gil_page_namespace local page namespace. Why?
  • gil_page_title local page title. Why?
  • PRIMARY KEY(gil_to, gil_wiki, gil_page). To look up by target title. Primary read usage.
  • globalimagelinks_wiki (gil_wiki, gil_page). To insert from local wiki. Primary write usage.
  • globalimagelinks_wiki_nsid_title (gil_wiki, gil_page_namespace_id, gil_page_title). Why?

From a quick look through its jobs and its UI, I could not find a use of the namespaces. Unlike core's WhatLinksHere, Special:GlobalUsage does not have a namespace filter. Afaik, for very large result sets the community generally moves to the local UI, where more advanced filters exist; e.g. after a GlobalUsage lookup one can go to enwiki's WhatLinksHere for a further breakdown.

@Catrope wrote:

[…] our preference would be to generalize the GlobalUsage extension beyond images by adding the namespace of the "to" title to the database table.

@Tgr wrote:

Core has multiple tracking tables because you might want to record different relationships for the same pair of pages […]. Should the generalized global usage tracking table have a "type" or "purpose" field?
Is the intent to create a generic tracking service that will eventually become the canonical one (e.g. Wikibase might use it some day)?

@Ladsgroup wrote:

I suggest using virtual domains to put it in x1/wikishared, […]. That would make the future migration of the globalimagelinks table easier too.

To me, having a single giant table is not obviously better. I would default to a single server but with separate tables on that server, unless we can quantify the benefits of a combined approach, as well as address potential concerns with that approach. It's a lot harder to split things out again in the future once they're in one table!

  • The size and ergonomics of extra "type" data, which you don't need otherwise. Table names are cheap for any N rows, whereas type strings add up. Type strings also likely motivate the creation of a type "registry" to reduce their expense, which further adds complexity and non-determinism between local and prod. This isn't a blocker by any means, but it's certainly not simpler or beneficial by itself.
  • Combined write contention. For any edit relating to global links (wikidata, charts, jsonconfig, commons images) all wikis combined would be writing to a single table. Will write contention not be a concern in the foreseeable future? If Wikidata is involved, I think it would be, but maybe not?
  • Combined size and read traffic. Especially Wikidata usage. For Commons, commonswiki.globalimagelinks is at 740M rows today, according to information_schema. Co-locating this as separate tables on 1 server means we can move this out in the future if data size becomes an issue. Or, rather, size may be fine, but query performance may start to suffer due to high reads, or reduced cache efficiency if we can't fit as much of the index into the limited memory of the x1 server. We could then move one of the tables to a separate host.
  • Reduced index locality. I vaguely recall that with this kind of schema, we've often ended up in a situation where for certain scenarios (e.g. pages with lots of links of one type, or wikis with lots of links of one type) the index can only bring you to the start of a very large range, where lots of irrelevant rows have to be scanned past. Does this ring a bell? This is avoided if you have separate tables right? (That simplicity would of course enforce making separate writes and reads, but that seems fine/preferred?)
  • Reduced disaster/recovery risk. A truncate or delete gone wrong. Etc.

The core linktables have much shared tooling and more or less act as one already. There's relatively little insertion code per-table. The GUI/API of course varies for each core link type, and that's expected either way.

In addition to the size of the link table, we'd get a similarly large global_link_target table. And this would need a repetition of the wiki id string (or yet another index table for that, adding more complexity, and cutting off more developer experience and debuggability). There's also the matter of linktarget leaving behind rows for deleted and non-existent titles. This appears to be manageable on local wikis, but combining this over 1000x wikis into a single table makes me think twice. Having the link tables without a linktarget may be worth considering. I know it reduced size, notably improved cache/index use, and reduced latency for local wikis; it's not obvious to me that it will have the same effect globally. Or rather, there may be limits to those benefits that, combined with the other potential concerns above, might tip the balance.

Even if not a single global table, a standardized data model, PHP API, and Hook to subscribe to link changes would be quite nice!

after GlobalUsage example one can go to enwiki WhatLinksHere for a further breakdown.

Each usage of a Commons file results in one globalimagelinks table row (on Commons) and one imagelinks table row (on the local wiki), so users can also check usage of a Commons file in WhatLinksHere. This is not the case for charts, since we currently have no means to track cross-wiki transclusion in the local wiki. Some potential solutions:

Sorry for the very late response; I was sick, and then came T370304 (ongoing outages).

Core has multiple tracking tables because you might want to record different relationships for the same pair of pages (an article can include an image, transclude the image description page, link to the image description page...). Should the generalized global usage tracking table have a "type" or "purpose" field? There is no overlap between image usage and tabular data usage, but in the future we might want to track things that aren't separated by distinct namespaces.

This is a good point, I think there should be a type/purpose field. MediaWiki core recognizes the three types of relationships between pages that you listed (A links to B; A transcludes B; A embeds B as an image) and one more (A is in category B), but it tracks each of these separately using almost identical DB tables and ParserOutput APIs (the pagelinks and templatelinks schemas are identical; imagelinks differs only in that the namespace is implicit, and categorylinks has some extra data). Merging these tables into one is probably too ambitious for now, but I think it makes sense to create one global table for this purpose rather than three, and use a field to distinguish the different relationship types.

Actually, it's much easier to maintain three large tables than one gigantic one. From a DBA perspective it's better to keep them separate.

Is the intent to create a generic tracking service that will eventually become the canonical one (e.g. Wikibase might use it some day)?

Ideally yes, for tracking cross-wiki dependencies between pages. Wikibase is a bit of a special case in that the relationships it needs to track can be more granular than page->page relationships (it often cares about which properties of an entity are being used, rather than just which entity). But for more straightforward page->page relationships across wikis, I think this should become the canonical system (including for globalimagelinks, which we'd migrate to this).

Outside the scope of the task, but the way Wikibase handles change notifications (inserting recent changes entries so edits to Wikidata items which affect the generated wikitext show up in watchlists etc) is worth looking at, especially if you also want to allow use of chart data via Lua.

Yes that's outside the scope of this task, and I'm also not as sure about it. We do change notifications via RC entries for Wikidata usage, but we don't do the same thing for templates or Lua modules for example: if someone edits Template:US presidents (the box with the list of all presidents that appears on the bottom of every president's article), that doesn't insert RC entries for Joe Biden, Donald Trump, etc, so if you watch one of those pages you wouldn't be notified about that change. The same is true for image reuploads. So I don't think there's a consistent precedent here.

FWIW, I worked on this several times during my tenure at WMDE. Wikibase has to have this very special system; otherwise RC entries and reparses would break things (this has brought us down multiple times). That's why it has "sub-entity"-level dependencies rather than page-to-page ones. I highly discourage you from changing how Wikidata tracks dependencies.

@Ladsgroup while I'm already picking your brain here... do you think it would make sense to normalize link targets in this new table, similar to how this works in MW core?

[snip]

Does that make sense to you, or do you think it should be done differently / do you have any other advice?

I have a couple of comments:

  • I don't think normalizing the source would be useful. There isn't much repetition there (a page can't link to thousands of images, while the other way around is quite commons, pun intended). I know it's better for data hygiene, but every new table comes with overhead in querying and so on. The target definitely should be normalized though (as you did). We could also normalize on the pair of wiki and namespace text (and namespace ID) on the source. That way, namespace renames will be less painful too.
  • Drop wiki in the target; I don't know of any potential use case outside of Commons. Do we really need all the text fields? Why not an FK to page on Commons?

The current globalimagelinks table in GlobalUsage uses this schema:

Regarding the GlobalUsage extension, and its globalimagelinks table - This is quite a large schema currently. Do we actually need each of those columns and indexes today?

  • gil_wiki local wiki ID, naturally.
  • gil_page local page ID, naturally.
  • gil_to target title, naturally.
  • gil_page_namespace_id local page namespace. Why?

Agree, let's get rid of it?

  • gil_page_namespace local page namespace. Why?

I assume that when rendering pages like https://commons.wikimedia.org/wiki/Special:GlobalUsage/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg it makes it easier to just append the namespace rather than look it up in the local wiki? That can get expensive in large batches.

  • gil_page_title local page title. Why?

Ditto. It probably means we can get rid of the local page_id instead?

  • PRIMARY KEY(gil_to, gil_wiki, gil_page). To look up by target title. Primary read usage.
  • globalimagelinks_wiki (gil_wiki, gil_page). To insert from local wiki. Primary write usage.
  • globalimagelinks_wiki_nsid_title (gil_wiki, gil_page_namespace_id, gil_page_title). Why?

I can look up in production whether it's used. Agree we can clean up this table a lot. This is doubly important since it's sitting in the largest section and largest database (by a wide margin) and causing issues.

From a quick look through its jobs and its UI, I could not find a use of the namespaces. Unlike core's WhatLinksHere, Special:GlobalUsage does not have a namespace filter. Afaik, for very large result sets the community generally moves to the local UI, where more advanced filters exist; e.g. after a GlobalUsage lookup one can go to enwiki's WhatLinksHere for a further breakdown.

@Catrope wrote:

[…] our preference would be to generalize the GlobalUsage extension beyond images by adding the namespace of the "to" title to the database table.

@Tgr wrote:

Core has multiple tracking tables because you might want to record different relationships for the same pair of pages […]. Should the generalized global usage tracking table have a "type" or "purpose" field?
Is the intent to create a generic tracking service that will eventually become the canonical one (e.g. Wikibase might use it some day)?

@Ladsgroup wrote:

I suggest using virtual domains to put it in x1/wikishared, […]. That would make the future migration of the globalimagelinks table easier too.

To me, having a single giant table is not obviously better. I would default to a single server but with separate tables on that server, unless we can quantify the benefits of a combined approach, as well as address potential concerns with that approach. It's a lot harder to split things out again in the future once they're in one table!

FWIW, x1 is smaller than the smallest section (and even then most of it is filled with Echo junk notifications for wikidata, T308084: Reduce DB space used by Echo notifications; otherwise it would be basically empty). From a DBA perspective, having them on a separate server allows for horizontal scaling; currently Commons is hitting limits in basically everything.

  • The size and ergonomics of extra "type" data, which you don't need otherwise. Table names are cheap for any N rows, whereas type strings add up. Type strings also likely motivate the creation of a type "registry" to reduce their expense, which further adds complexity and non-determinism between local and prod. This isn't a blocker by any means, but it's certainly not simpler or beneficial by itself.
  • Combined write contention. For any edit relating to global links (wikidata, charts, jsonconfig, commons images) all wikis combined would be writing to a single table. Will write contention not be a concern in the foreseeable future? If Wikidata is involved, I think it would be, but maybe not?

I have said that wikidata should not be used here; it's a different beast with different needs and uses. But FWIW, write contention happens mostly on a row basis: a large table doesn't automatically mean write contention. If writes compete to update the same rows, sure, but I don't see that happening here.

  • Combined size and read traffic. Especially Wikidata usage. For Commons, commonswiki.globalimagelinks is at 740M rows today, according to information_schema. Co-locating this as separate tables on 1 server means we can move this out in the future if data size becomes an issue. Or, rather, size may be fine, but query performance may start to suffer due to high reads, or reduced cache efficiency if we can't fit as much of the index into the limited memory of the x1 server. We could then move one of the tables to a separate host.

Currently it's sitting in s4, which has the same memory capacity but 1TB more data, so it's already under quite some read pressure; moving it to x1 frees up a lot and would make it much faster.

  • Reduced index locality. I vaguely recall that with this kind of schema, we've often ended up in a situation where for certain scenarios (e.g. pages with lots of links of one type, or wikis with lots of links of one type) the index can only bring you to the start of a very large range, where lots of irrelevant rows have to be scanned past. Does this ring a bell? This is avoided if you have separate tables right? (That simplicity would of course enforce making separate writes and reads, but that seems fine/preferred?)

We had this issue with externallinks but I'm not sure it would be an issue here.

  • Reduced disaster/recovery risk. A truncate or delete gone wrong. Etc.

We have backups, but also the benefit of removing this data from core databases that are extremely under pressure (such as s4) outweighs the costs.

To me, having a single giant table is not obviously better. I would default to a single server but with separate tables on that server, unless we can quantify the benefits of a combined approach, as well as address potential concerns with that approach. It's a lot harder to split things out again in the future once they're in one table!
[…]

  • Combined size and read traffic. […]
  • Reduced disaster/recovery risk. A truncate or delete gone wrong. Etc. […]

[…]

FWIW, x1 is smaller than the smallest section […]; from a DBA perspective, having them on a separate server allows for horizontal scaling, […].

Currently it's sitting in s4, which has the same memory capacity but 1TB more data, so it's already under quite some read pressure; moving it to x1 frees up a lot and would make it much faster. […]

We have backups, but also the benefit of removing this data from core databases that are extremely under pressure (such as s4) outweighs the costs.

I'm slightly confused. Are you saying both ideas are fine as long as it's on x1? Or do you see some evidence for some of the hypothetical concerns I raised? I'm a bit confused, as it sounds like you're comparing s4 to x1. I was suggesting we consider separate tables on x1 (globalimagelinks, globalchartlinks, ...). I raised questions with potential concerns about the proposal to have a single large table on x1 (globallinks-anything?).

By "default to a single server" I meant that there's no need to pre-emptively spread things across multiple external servers. The key thing to do early is to have separate tables (if we see a need to separate them in the future).

  • gil_page_namespace_id local page namespace. Why?

Agree, let's get rid of it?

See also: T179505: GlobalUsage should allow for filtering by namespace

To me, having a single giant table is not obviously better. I would default to a single server but with separate tables on that server, unless we can quantify the benefits of a combined approach, as well as address potential concerns with that approach. It's a lot harder to split things out again in the future once they're in one table!
[…]

  • Combined size and read traffic. […]
  • Reduced disaster/recovery risk. A truncate or delete gone wrong. Etc. […]

[…]

FWIW, x1 is smaller than the smallest section […]; from a DBA perspective, having them on a separate server allows for horizontal scaling, […].

Currently it's sitting in s4, which has the same memory capacity but 1TB more data, so it's already under quite some read pressure; moving it to x1 frees up a lot and would make it much faster. […]

We have backups, but also the benefit of removing this data from core databases that are extremely under pressure (such as s4) outweighs the costs.

I'm slightly confused. Are you saying both ideas are fine as long as it's on x1? Or do you see some evidence for some of the hypothetical concerns I raised? I'm a bit confused, as it sounds like you're comparing s4 to x1. I was suggesting we consider separate tables on x1 (globalimagelinks, globalchartlinks, ...). I raised questions with potential concerns about the proposal to have a single large table on x1 (globallinks-anything?).

You said "To me, having a single giant table is not obviously better." and I'm saying it depends on where. One giant table in x1 is obviously better than one giant table in s4 (where it currently resides) as globalimagelinks is already quite big.

By "default to a single server" I meant that there's no need to pre-emptively spread things across multiple external servers. The key thing to do early is to have separate tables (if we see a need to separate them in the future).

For wikidata and global images it would make sense (and as I said, we shouldn't touch wikidata here). But to my understanding charts will be quite a small use case, e.g. with vega only several thousand pages used it. Correct?

Overall, I don't mind either way between splitting the tables or keeping them in one place, and I'll leave it to the discretion of the developer as long as: 1) you're not including wikidata, 2) it sits in x1 and not s4.

Actually, it's much easier to maintain three large tables than one gigantic one. From a DBA perspective it's better to keep them separate.

Understood. Instead I'll just propose one global links table for transclusion-style usage then (revised schema below). We can then use a similar schema for images, but that wouldn't be in scope for this task.

I have a couple of comments:

  • I don't think normalizing the source would be useful. There isn't much repetition there (a page can't link to thousands of images, while the other way around is quite commons, pun intended). I know it's better for data hygiene, but every new table comes with overhead in querying and so on. The target definitely should be normalized though (as you did).

Makes sense, I'll normalize the target but not the source.

We could also normalize on the pair of wiki and namespace text (and namespace ID) on the source. That way, namespace renames will be less painful too.

I have done this in my proposed schema. Hopefully it won't make querying too painful, but if we run into problems we could always denormalize this during development.

  • Drop wiki in the target; I don't know of any potential use case outside of Commons. Do we really need all the text fields? Why not an FK to page on Commons?

I guess you're right, we don't need to think of this as a global table. We can instead think of this as a commons table (but one that happens to live on x1).

  • gil_page_namespace_id local page namespace. Why?

Agree, let's get rid of it?

This field is currently only used for namespace filtering in ApiGlobalUsage, and for nothing else (there's no namespace filtering in the UI). I'll drop it from my proposed schema.

  • gil_page_namespace local page namespace. Why?

I assume that when rendering pages like https://commons.wikimedia.org/wiki/Special:GlobalUsage/002_The_lion_king_Snyggve_in_the_Serengeti_National_Park_Photo_by_Giles_Laurent.jpg it makes it easier to just append the namespace rather than look it up in the local wiki? That can get expensive in large batches.

That's right, this is used for display purposes on Special:GlobalUsage. Since that special page runs on Commons, it can't easily look up the ns+title for the IDs of the pages it's listing, since those are on all sorts of different wikis.

  • gil_page_title local page title. Why?

Ditto. It probably means we can get rid of the local page_id instead?

This is used for the same display purposes as above. But I think you're right that we can get rid of gil_page instead. GlobalUsage does use this for various purposes, but I think all of those could be replaced by ns+title.

New proposed schema:

-- This has the same schema as the linktarget table, but since we intend for globaltemplatelinks to be in x1, we can't join against linktarget
CREATE TABLE globallinks_target (
    glt_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    glt_namespace INT NOT NULL,
    glt_title VARBINARY(255) NOT NULL,
    UNIQUE INDEX glt_namespace_title (glt_namespace, glt_title),
    PRIMARY KEY (glt_id)
);

CREATE TABLE globallinks_source_wiki_namespace (
    gswn_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gswn_wiki VARBINARY(32) NOT NULL,
    gswn_namespace VARBINARY(255) NOT NULL,
    UNIQUE INDEX gswn_wiki_namespace (gswn_wiki, gswn_namespace),
    PRIMARY KEY (gswn_id)
);

CREATE TABLE globaltemplatelinks (
    gtl_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gtl_source_wiki_namespace BIGINT UNSIGNED NOT NULL, /* refers to globallinks_source_wiki_namespace.gswn_id */
    gtl_source_title VARBINARY(255) NOT NULL,
    gtl_target BIGINT UNSIGNED NOT NULL, /* refers to globallinks_target.glt_id */
    UNIQUE INDEX gtl_source_target (gtl_source_wiki_namespace, gtl_source_title, gtl_target),
    INDEX gtl_target_source (gtl_target, gtl_source_wiki_namespace, gtl_source_title),
    PRIMARY KEY (gtl_id)
);

-- In the future, we could migrate globalimagelinks to something that has the same schema as globaltemplatelinks
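
For illustration, recording one usage under this schema would be a three-step write along these lines (a sketch with placeholder values; in practice this would go through MediaWiki's Rdbms layer rather than raw SQL):

-- 1) Ensure the target row exists (486 is an assumed Data: namespace ID)
INSERT IGNORE INTO globallinks_target (glt_namespace, glt_title)
VALUES (486, 'Example.tab');

-- 2) Ensure the (wiki, namespace) row exists; the main namespace has an empty canonical name
INSERT IGNORE INTO globallinks_source_wiki_namespace (gswn_wiki, gswn_namespace)
VALUES ('enwiki', '');

-- 3) Insert the link row using the two normalized keys
INSERT INTO globaltemplatelinks (gtl_source_wiki_namespace, gtl_source_title, gtl_target)
SELECT gswn_id, 'Example_article', glt_id
FROM globallinks_source_wiki_namespace, globallinks_target
WHERE gswn_wiki = 'enwiki' AND gswn_namespace = ''
  AND glt_namespace = 486 AND glt_title = 'Example.tab';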

You said "To me, having a single giant table is not obviously better." and I'm saying it depends on where. One giant table in x1 is obviously better than one giant table in s4 (where it currently resides) as globalimagelinks is already quite big.

By "default to a single server" I meant that there's no need to pre-emptively spread things across multiple external servers. The key thing to do early is to have separate tables (if we see a need to separate them in the future).

[…] But to my understanding charts will be quite a small use case, e.g. with vega only several thousand pages used it. Correct?

Overall, I don't mind either way between splitting the tables or keeping them in one place, and I'll leave it to the discretion of the developer as long as: 1) you're not including wikidata, 2) it sits in x1 and not s4.

Even if the Charts extension adds only 1 row, or even 0, it would mean the schema has to accommodate at least a "linktype" column of sorts. This makes a global everything-links table significantly wider and larger than globalimagelinks today, before we look at any rows specific to Charts.

Disaster/recovery risk increases: any accidents around Charts would affect globalimagelinks. Potential contention concerns, if they exist, would (depending on the kind of contention) affect Charts even if its own reads/writes are rare.

Anyway, I feel I'm adding more strength to these points than intended. I raised them as potential concerns and reminders of something to think about, not as concrete actual concerns that I've validated or feel strongly about myself. I'm making sure the question is understood, I don't disagree with your answers.

@Catrope wrote:

[…] I'll just propose [a] global links table for transclusion-style usage […]. We can then use a similar schema for images, but that wouldn't be in scope for this task.

@Ladsgroup wrote:
  • Drop wiki in the target; I don't know of any potential use case outside of Commons. Do we really need all the text fields? Why not an FK to page on Commons?

I guess you're right, we don't need to think of this as a global table. We can instead think of this as a commons table (but one that happens to live on x1).

[…]

CREATE TABLE globaltemplatelinks (
    gtl_id BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
    gtl_source_wiki_namespace BIGINT UNSIGNED NOT NULL, /* refers to globallinks_source_wiki_namespace.gswn_id */
    gtl_source_title VARBINARY(255) NOT NULL,
    gtl_target BIGINT UNSIGNED NOT NULL, /* refers to globallinks_target.glt_id */
    UNIQUE INDEX gtl_source_target (gtl_source_wiki_namespace, gtl_source_title, gtl_target),
    INDEX gtl_target_source (gtl_target, gtl_source_wiki_namespace, gtl_source_title),
    PRIMARY KEY (gtl_id)
);

-- In the future, we could migrate globalimagelinks to something that has the same schema as globaltemplatelinks

The reduced scope LGTM. But the name and description imply it would be re-used for other transclusion-like semantics as well. That seems fine, except that it assumes unrelated transclusion-type features would be hosted only on Wikimedia Commons. Do we want to commit to that?

I suggest one of these two options:

  • Make it specific to charts (global charts links), thus the only business requirement we encode is that a given feature (e.g. Charts) designates one central wiki from which it transcludes content.
    • Cost: one-time cost of DBA table creation, once after a major new feature is deployed.
    • Benefit: No abstraction. Presumably the extension would own the schema, and it would work transparently in local development (e.g. chartslinks), while in production we place it on x1 named globalchartslinks, mapped via virtual domain virtual-chartslinks = x1.globalchartslinks.
  • Or: make it specific to Commons (globaltemplateslinks_commonswiki), which we can hide from MediaWiki code via Rdbms virtual domains. I.e. Charts would register virtual-chartslinks as a virtual domain, which in production wmf-config we map to x1.globaltemplateslinks_commonswiki.
    • Cost: Cognitive overhead that two different in-code table names, in production, map to a shared table that neither extension owns.
    • Cost: one-time cost of DBA table creation, if of a different type, or not on Commons.
    • Benefit: no DBA table creation task when adding a second extension that is also of type "transclusion" and also uses Commons.
    • Question: Which extension owns the schema? Would it be in core as an unused schema file for wiki farms to use if/when needed? How would the installer provision this for local dev when enabling the extension?

But the name and description imply it would be re-used for other transclusion-like semantics as well. That seems fine, except that it assumes unrelated transclusion-type features would be hosted only on Wikimedia Commons. Do we want to commit to that?

Note that transclusion of Wikidata items already happens (#statements), though it's mostly done through Lua in the real world, and that transclusion of Wikifunctions calls is in development. Of course, if you don't want to track those with this, that's fine, but I worry we'll just have to duplicate this for all the async content fragments work (for Wikifunctions and other users) moderately soon.

Random further notes on the schema, though these are purely details; the general idea looks good. I agree with Krinkle, and either direction he suggested sounds good to me (I slightly prefer the first one, since the overhead of extra tables is large when they are per-wiki, while central tables, especially ones in x1, don't have any extra overhead):

  • gswn_id (wiki + namespace) doesn't need to be BIGINT. We can't reasonably ever reach the 4B limit of an unsigned INT, but a BIGINT takes four more bytes in each row of globaltemplatelinks
  • gtl_id: I'm on the fence on this one. If you have the other three columns and can turn them into the PK, we don't need an extra 8 bytes per row. There are some benefits from it, notably doing checksums and backing up, but since this doesn't hold canonical data, I'd say get rid of it and make that UNIQUE INDEX the PK. This is what the local *links tables do.
  • I agree the naming can be better (globallinks_source_wiki_namespace is a bit long :D) but I also acknowledge that naming is hard so feel free to ignore.

Random further notes on the schema, though these are purely details; the general idea looks good. I agree with Krinkle, and either direction he suggested sounds good to me (I slightly prefer the first one, since the overhead of extra tables is large when they are per-wiki, while central tables, especially ones in x1, don't have any extra overhead):

To clarify, are you saying we should narrow the scope of this to Charts, and exclude other future use cases like Wikifunctions? I'm OK with that, but I wanted to check if that's what you meant. If we do narrow the scope of this, I would suggest at least making it a little more general and leave the door open for tracking other kinds of cross-wiki JsonConfig usage in the future. Either way, if we narrow the scope of this, we should change the table names from globallinks to globalchartlinks or globaljsonlinks or whatever.

  • gswn_id (wiki + namespace) doesn't need to be BIGINT. We can't reasonably ever reach the 4B limit of an unsigned INT, but a BIGINT takes four more bytes in each row of globaltemplatelinks

Understood. A lot of new primary keys were BIGINTs for reasons I couldn't discern, so I thought maybe that was the modern way of doing things, but now I see what the logic is. Changed back to an unsigned INT.

  • gtl_id: I'm on the fence on this one. If you have the other three columns and can turn them into the PK, we don't need an extra 8 bytes per row. There are some benefits from it, notably doing checksums and backing up, but since this doesn't hold canonical data, I'd say get rid of it and make that UNIQUE INDEX the PK. This is what the local *links tables do.

OK, will remove. Here too, I thought adding explicit PKs was the more modern way of doing things and the *links tables were old school, but I didn't appreciate the primary data vs derived data distinction.

  • I agree the naming can be better (globallinks_source_wiki_namespace is a bit long :D) but I also acknowledge that naming is hard so feel free to ignore.

Yeah it's not great. I'll rename it to globallinks_source_ns but feel free to suggest a different name if you don't like that one either.

I'll put the updated schema in the task description.

The reduced scope LGTM. But the name and description imply it would be re-used for other transclusion-like semantics as well. That seems fine, except that it assumes unrelated transclusion-type features would be hosted only on Wikimedia Commons. Do we want to commit to that?

Probably not. @Jdforrester-WMF mentions Wikidata and Wikifunctions, which both involve transcludable content hosted on wikis other than Commons.

I suggest one of these two options:

  • Make it specific to charts (global charts links), thus the only business requirement we encode is that a given feature (e.g. Charts) designates one central wiki from which it transcludes content.
    • Cost: one-time cost of DBA table creation, once after a major new feature is deployed.
    • Benefit: No abstraction. Presumably the extension would own the schema, and it would work transparently in local development (e.g. chartslinks), while in production we place it on x1 named globalchartslinks, mapped via virtual domain virtual-chartslinks = x1.globalchartslinks.

This option makes sense to me, although I think we should consider making it slightly broader and scoping it to the JsonConfig extension. I believe Lua modules can already use tabular data from Commons this way, and we don't have usage tracking for that either.

  • Or: make it specific to Commons (globaltemplateslinks_commonswiki), which we can hide from MediaWiki code via Rdbms virtual domains. I.e. Charts would register virtual-chartslinks as a virtual domain, which in production wmf-config we map to x1.globaltemplateslinks_commonswiki.
    • Cost: Cognitive overhead that two different in-code table names, in production, map to a shared table that neither extension owns.
    • Cost: one-time cost of DBA table creation, if of a different type, or not on Commons.
    • Benefit: no DBA table creation task when adding a second extension that is also of type "transclusion" and also uses Commons.

I intuitively have a slight preference for this, but it would get messier if there are multiple host wikis that host transclusion-type content, which there will be (Commons, Wikidata, Wikifunctions).

  • Question: Which extension owns the schema? Would it be in core as unused schema file for wiki farms to use if/when needed? How would the installer provision this for local dev when enabling the extension?

I figured that, in this alternative, the GlobalUsage extension would own it, because that extension already owns the other "global content hosted on Commons and used on other wikis" tracking table for tracking image usage (which is a slightly different type than transclusion, but has basically the same data shape, functionality and requirements).

Another option would be to elide the distinction between image usage and transclusion, and track them both in one table. That table would then be managed by the GlobalUsage extension, which would track image usage itself, and expose an API that allows other extensions that manage Commons-hosted content used on other wikis to register usages of it. Charts and/or JsonConfig would then call this API.

Random further notes on the schema, though these are purely details; the general idea looks good. I agree with Krinkle, and either direction he suggested sounds good to me (I slightly prefer the first one, since the overhead of extra tables is large when they are per-wiki, while central tables, especially ones in x1, don't have any extra overhead):

To clarify, are you saying we should narrow the scope of this to Charts, and exclude other future use cases like Wikifunctions? I'm OK with that, but I wanted to check if that's what you meant. If we do narrow the scope of this, I would suggest at least making it a little more general and leave the door open for tracking other kinds of cross-wiki JsonConfig usage in the future. Either way, if we narrow the scope of this, we should change the table names from globallinks to globalchartlinks or globaljsonlinks or whatever.

Yeah, globaljsonlinks or some other way to track tabular data sounds good to me. Wikidata is a different beast altogether, and I think we will look into how wikifunctions needs tracking once we get there; I think it'll have different requirements.

To clarify, are you saying we should narrow the scope of this to Charts, and exclude other future use cases like Wikifunctions? I'm OK with that, but I wanted to check if that's what you meant. If we do narrow the scope of this, I would suggest at least making it a little more general and leave the door open for tracking other kinds of cross-wiki JsonConfig usage in the future.

A way to do this would be to work on a common data model that could work for most of these use cases. Even if they aren't in the same table (for scalability reasons), having the same data model would make joining and working with this data outside of MediaWiki much easier. Different kinds of link information could be analyzed using similar code and queries, just by swapping out table names.

A way to do this would be to work on a common data model that could work for most of these use cases.

I just discussed this a bit with Amir, and I have a better understanding of how MW is modeling this right now. It seems like there is generally a common data model (and PHP interface), especially for link tables that use linktarget. For external usages then, hopefully it isn't too hard to use this data from MariaDB analytics replicas.

Even better would be a MW core managed hook that exposed a nice data model for page link changes :D. Hopefully this is more for FY2024-2025 WE 5.2 'hooks' KR..

Some notes:

  • there is general agreement on using a standalone set of tables, to simplify use from both the Commons and local sides
    • desire to keep the tables small, and keep them on x1 in production rather than on s4
    • try to avoid contention; shouldn't be too bad, though edits can result in deletions/insertions
  • rather than generalizing over different types of links, make it specific to JsonConfig consumption; call it globaljsonlinks_*
    • we expect some defs will be reused many thousands of times, but few pages will have large numbers of charts
    • to allow chart defs and tabular data sets to be reused efficiently, use a target table so the linking table can use small integer keys
  • proposal to have a (wikiid, namespace) pair in the source table, then individual page names in the linking table
    • this grouping feels a little weird; we might or might not adjust the namespace positioning
  • warning: could leak no-longer-used string keys. We can either aggressively prune rows once they're no longer used, or just let them leak (a possible pruning pass is sketched below).
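
A possible pruning pass for those leaked rows, as a sketch only (a production job would batch this rather than run one large DELETE):

-- Remove target rows no longer referenced by any link row
DELETE FROM globaljsonlinks_target
WHERE NOT EXISTS (
    SELECT 1 FROM globaljsonlinks WHERE gjl_target = gjlt_id
);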

Table layouts as listed in the current proposal are in the description at top.

Also need to consider the mechanism for pushing updates:

  • Changes to a chart or tabular data page on Commons need to look up which wikis have affected pages
  • For each affected wiki, look up the API for its wiki id, and contact an API endpoint
    • via a special action, inform the remote wiki to queue a job for purging affected pages
    • we can either have Commons manage this whole process and make an API call for each page, or make one call per wiki and let the other wiki queue its own meta-job

We need to decide on the mechanism part and make sure that we send the pings from one wiki to the other. Brooke left the above notes proposing an API method, with an API action for purging all pages that use a single chart; when Commons updates those data pages, it will look up the impacted pages by wiki ID and communicate those changes to each wiki.
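
As a sketch of the one-call-per-wiki variant, the Commons-side lookup that decides which wikis to contact could be as simple as this (placeholder namespace ID and title again):

SELECT DISTINCT gjlsn_wiki
FROM globaljsonlinks
JOIN globaljsonlinks_target ON gjl_target = gjlt_id
JOIN globaljsonlinks_source_ns ON gjl_source_wiki_ns = gjlsn_id
WHERE gjlt_namespace = 486
  AND gjlt_title = 'Example.tab';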

Next step will be to submit the ADR for review. We'd like to get input on that from @Ladsgroup and @Krinkle. Once that's approved we'll set up follow up tasks to implement this:

  • Setting up the schema so that a schema definition is installable by standard wiki tools.
  • Deploying the beta and production schema updates.
  • An API endpoint to invalidate pages using a chart.

Change #1072622 had a related patch set uploaded (by Jdlrobson; author: Bvibber):

[mediawiki/extensions/Chart@master] Provisional ADR for usage tracking

https://gerrit.wikimedia.org/r/1072622

make it specific to JsonConfig consumption, call it globaljsonlinks_*
possible to have a schema definition installable by standard wiki tools

Note:

  • Chart is not the only .tab consumer. See also T153966: Track Commons Dataset usage across wikis (what links here).
  • JsonConfig previously did not need to create any database tables, and this task may bring a schema change to the extension. Since JsonConfig is deployed on every Wikimedia wiki (where GlobalUsage is only on Commons), we may want to somehow prevent this table from being created on other wikis.
    • Also, for a single standalone wiki, is this feature still useful, or are the existing link tables enough? If we do not need a dedicated tracking table, then JsonConfig may not be the best place for the feature.
    • Alternatively, we may create a dedicated extension for global JSON usage tracking, or reuse GlobalUsage (though with different tables) for it, i.e. introduce a "GlobalUsage 2.0" backend that supports storing global usage data in another cluster and different variants of usage. (For example, we have two kinds of usage, file and JSON; each requires three tables: source namespace, target, and main. The two source namespace tables may or may not be shared, so five or six tables in total for the two kinds of usage.)

Chart is not the only .tab consumer. See also T153966: Track Commons Dataset usage across wikis (what links here).

Yep, we'll want to file tasks to (sooner or later) save entries for the other existing usages, allowing us to purge those on change where previously we didn't. (E.g., use of the Lua library for fetching tabular data _ideally_ should track usage so we can invalidate, though this is strictly speaking out of scope for the Charts work.)

JsonConfig previously did not need to create any database tables, and this task may bring a schema change to the extension. Since JsonConfig is deployed on every Wikimedia wiki (where GlobalUsage is only on Commons), we may want to somehow prevent this table from being created on other wikis.

Hm, lemme double-check what GlobalUsage does. Presumably it needs to run code on the client sites so that it can record usages, so it has to be used on all wikis, not just on Commons, even though its shared table lives in one place. I know we have a system for marking individual tables as shared/living on another server as long as they're not used for joins.

Also, for a single standalone wiki, is this feature still useful, or are the existing link tables enough? If we do not need a dedicated tracking table, then JsonConfig may not be the best place for the feature.

I'm thinking: the same table would be used on a single standalone wiki to track JsonConfig page usage and perform local cache invalidations. It would look only for records matching its own wiki ID, find them, and perform the purges locally. The configuration of the JSON types/namespaces would control whether local changes perform local purges only, or perform local purges and also push to other sites.
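
Concretely, the standalone case could be the same lookup restricted to the local wiki ID (a sketch with the same illustrative names as above; 'mywiki' stands in for the local wiki's own ID):

```sql
-- Sketch: a standalone wiki consults only its own rows and purges locally.
SELECT sn.gjlsn_namespace, l.gjl_source_title
FROM globaljsonlinks l
JOIN globaljsonlinks_source_ns sn ON sn.gjlsn_id = l.gjl_source_wiki_ns
JOIN globaljsonlinks_target t ON t.gjlt_id = l.gjl_target
WHERE t.gjlt_namespace = 486
  AND t.gjlt_title = 'Example.chart'
  AND sn.gjlsn_wiki = 'mywiki';
```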

Alternatively, we may create a dedicated extension for global JSON usage tracking, or reuse GlobalUsage (though with different tables) for it, i.e. introduce a "GlobalUsage 2.0" backend that supports storing global usage data in another cluster and different variants of usage. (For example, we have two kinds of usage, file and JSON; each requires three tables: source namespace, target, and main. The two source namespace tables may or may not be shared, so five or six tables in total for the two kinds of usage.)

If we move it anywhere other than JsonConfig, it probably needs to live in core so it can be used consistently. From the discussion above, though, it sounds like we don't want to do anything like that, and want to keep each link type distinct and separate?

Ok, confirmed GlobalUsage is enabled on all non-closed, non-private wikis:

'wmgUseGlobalUsage' => [
	'default' => true, # Enabled on all PUBLIC wikis
	'closed' => false,
	'private' => false,
	'wikitech' => false,
],

and the configuration is the same for all wikis, specifying the table existing on a particular wiki's database:

if ( $wmgUseGlobalUsage ) {
	wfLoadExtension( 'GlobalUsage' );
	$wgGlobalUsageDatabase = 'commonswiki';
	$wgGlobalUsageSharedRepoWiki = 'commonswiki';
	$wgGlobalUsagePurgeBacklinks = true;
}

while GlobalUsage's getGlobalDB function fetches a connection specifically from that DB backend:

	/**
	 * @param int $index DB_PRIMARY/DB_REPLICA
	 * @param array $groups
	 * @return IDatabase
	 */
	public static function getGlobalDB( $index, $groups = [] ) {
		global $wgGlobalUsageDatabase;

		$lbFactory = MediaWikiServices::getInstance()->getDBLoadBalancerFactory();
		$lb = $lbFactory->getMainLB( $wgGlobalUsageDatabase );

		return $lb->getConnection( $index, [], $wgGlobalUsageDatabase );
	}

Note this is different from just sharing the imagelinks table onto a shared DB, in that a standalone instance using JsonConfig + globaljsonlinks would use the same table on its local database, rather than a separate one in addition.

So the question I guess is:

  • should DBAs create the tables on non-commons, non-closed, non-private wikis and leave them empty/unused, OR
  • should DBAs not create the tables on non-commons, non-closed, non-private wikis and have to remember which are which for database management, OR
  • should devs create a _local_ jsonlinks table for usage by standalone sites, and a _global_ globaljsonlinks_* table family for use in the shared case *only*?

I'm leaning towards option number 3. This would have a jsonlinks table like:

CREATE TABLE jsonlinks (
    jl_source BIGINT UNSIGNED DEFAULT 0 NOT NULL, /* refers to linktarget.lt_id */
    jl_target BIGINT UNSIGNED DEFAULT 0 NOT NULL, /* refers to linktarget.lt_id */
    INDEX jl_target_source (jl_target, jl_source),
    PRIMARY KEY (jl_source, jl_target)
);
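
With both ends keyed to linktarget, listing the local users of a given Data page would be a straightforward double join, something like this (a sketch; the namespace number and title are illustrative):

```sql
-- Sketch: local pages using Data:Example.chart, via jsonlinks,
-- resolving both source and target through linktarget.
SELECT src.lt_namespace, src.lt_title
FROM jsonlinks
JOIN linktarget tgt ON tgt.lt_id = jl_target
JOIN linktarget src ON src.lt_id = jl_source
WHERE tgt.lt_namespace = 486
  AND tgt.lt_title = 'Example.chart';
```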

Further decision point:

Should this store only local usages, disjoint from remote usages? (E.g., usages of charts *on pages within Commons*.)

Or should it store all usages, with the shared table holding a copy that's usable on the shared wiki for listing remote pages and pushing purge actions? This would duplicate some data, but it makes it easy to distinguish the shared data from the remote data, and doesn't leave confusing instructions about what you have to install.

Ok, confirmed GlobalUsage is enabled on all non-closed, non-private wikis:

'wmgUseGlobalUsage' => [
	'default' => true, # Enabled on all PUBLIC wikis
	'closed' => false,
	'private' => false,
	'wikitech' => false,
],

wikitech will become a normal wiki in the first week of October. I don't think a closed wiki would need to use a new feature. That leaves us with private wikis only.

So the question I guess is:

  • should DBAs create the tables on non-commons, non-closed, non-private wikis and leave them empty/unused, OR
  • should DBAs not create the tables on non-commons, non-closed, non-private wikis and have to remember which are which for database management, OR
  • should devs create a _local_ jsonlinks table for usage by standalone sites, and a _global_ globaljsonlinks_* table family for use in the shared case *only*?

We won't create a new empty table on wikis that don't use it. Adding an empty table adds two files per wiki that MariaDB needs to open, and this is already causing issues in s3, where there are around 800 wikis, each with around 200 tables.

To ease maintenance, you could create the exact same tables in x1 for "normal" wikis and in the local DB for private wikis, then use virtual domain configs to control where to read and write.

Another option is that maybe we can simply not track or update private wiki usages. They will have minimal usage, pages will be purged within 30 days anyway when their ParserCache entries expire, and in the worst case people can use action=purge. We can implement the feature if and when it gets requested by users of private wikis.

I'm not sure I follow the discussion about a local copy of the tracking data. Are you talking about non-private wikis or private wikis? For non-private wikis, it'll add a lot of data that we don't have capacity for.

Another option is that maybe we can simply not track or update private wiki usages.

If we only track JSON usage centrally and not locally, we should also make sure that third-party (non-WMF, standalone) wiki users do not get this (unused) globaljsonlinks table, or any other new table, created after they install JsonConfig with default settings and run update.php.

If we only track JSON usage centrally and not locally, we should also make sure that third-party (non-WMF, standalone) wiki users do not get this (unused) globaljsonlinks table, or any other new table, created after they install JsonConfig with default settings and run update.php.

*nod* for now let's go with the status quo -- "local" usage is recorded via templatelinks as though it were a transclusion, requiring no extra tables in any base usage of JsonConfig, while cross-wiki data usages will be recorded in globaljsonlinks & friends in the shared x1.

Assuming we also don't deploy JsonConfig and Chart into private wikis, this avoids any unexpected data leakage or surprises when configurations change.
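
Under that status quo, a local "what links here" for a Data page falls out of the ordinary templatelinks/linktarget join in core, along these lines (a sketch; the namespace number and title are illustrative):

```sql
-- Sketch: local usages of Data:Example.chart recorded as transclusions
-- (tl_target_id references linktarget.lt_id in the core schema).
SELECT p.page_namespace, p.page_title
FROM templatelinks
JOIN linktarget lt ON lt.lt_id = tl_target_id
JOIN page p ON p.page_id = tl_from
WHERE lt.lt_namespace = 486
  AND lt.lt_title = 'Example.chart';
```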

Change #1072622 merged by jenkins-bot:

[mediawiki/extensions/Chart@master] ADR for usage tracking table

https://gerrit.wikimedia.org/r/1072622