Search not working for entity schemas
Closed, Resolved · Public · 3 Estimated Story Points · BUG REPORT

Description

Searching for E10 (https://www.wikidata.org/w/index.php?search=E10&ns640=1) does not find any results. It should find https://www.wikidata.org/wiki/EntitySchema:E10

Searching for human (https://www.wikidata.org/w/index.php?search=human&ns640=1) does not find any results. It should find https://www.wikidata.org/wiki/EntitySchema:E10

Searching for intitle:/E/ (https://www.wikidata.org/w/index.php?search=intitle%3A%2FE%2F&ns640=1) only finds two results, E123 and E431. It should find all entity schemas (about 400 results), because all entity schemas have "E" in the title.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Cannot reproduce. Which language do you use?

I generally test in private tabs, so English, because it always uses English for people who aren't logged in. Which language are you using?

The first search now includes E10 for me, but not other things like https://www.wikidata.org/wiki/EntitySchema:E257 (which has "E10" on the first line).

The second search is similar: it has E10 but not E257 (which has "human" on the first line). Also, the top result is "Clinical Interpretations of Variants in Cancer (E70)", when E10 should be the top result.

The last one is currently returning 109 results, which is only a quarter of the results I should get.

Gehel triaged this task as Medium priority. · Jun 24 2024, 3:24 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 3. · Jun 24 2024, 3:28 PM
Gehel subscribed.

Discovery-Search will investigate, but it is likely that improvements to entity schema search will require major work.

There are currently 354 pages indexed in the entity schema namespace, while the all pages API suggests that there are 397 schemas.

Mentioned in SAL (#wikimedia-operations) [2024-06-25T14:24:29Z] <dcausse> re-indexing all wikidata entity schemas (T368010)

The above reindex did not work as I expected. The attached patch should remedy this by allowing non-indexed pages to be re-indexed properly when manually re-indexing a whole namespace.
The root cause of why these schemas were not indexed in the first place is yet to be investigated.

Surprisingly, E378, which is one of the schemas that is not indexed, appears to be indexed in the "content" index of wikidata, but AFAICT 640 is not a content namespace.
But it might have been considered a content namespace a few weeks ago.
I wonder if T363153 and esp. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1040113/ might be the reason for this change. When a namespace with existing documents has its search characteristics changed (wgContentNamespace and/or wgNamespacesToBeSearchedDefault), the indexed docs are not moved automatically from one index to another; instead, the saneitizer slowly fixes the inconsistencies. This is what might have happened here, and it would explain why the schemas suddenly disappeared and were slowly re-indexed over time.
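
To make the mechanism described above easier to follow, here is a toy model of it in Python. It is only a sketch under stated assumptions: the index names "content" and "general", the function names, and the three sample schemas are illustrative and are not the real CirrusSearch code or data. It shows documents becoming unfindable as soon as the namespace stops being routed to the index that holds them, a batch cleanup moving them gradually, and the same problem appearing in reverse if the config is changed back before the cleanup finishes.

```python
# Toy model: each index maps page title -> namespace id.
# The index names "content" and "general" and all function names are
# assumptions for illustration, not the real CirrusSearch API.

def index_for(ns, content_namespaces):
    """Pick the index a namespace is routed to under the current config."""
    return "content" if ns in content_namespaces else "general"


def search(indices, ns, content_namespaces):
    """Search only the index that the current config says holds `ns`."""
    target = indices[index_for(ns, content_namespaces)]
    return sorted(title for title, doc_ns in target.items() if doc_ns == ns)


def sanitize_batch(indices, content_namespaces, batch_size):
    """Move at most `batch_size` misplaced documents to their correct index."""
    moved = 0
    for name in list(indices):
        for title, ns in list(indices[name].items()):
            if moved >= batch_size:
                return moved
            correct = index_for(ns, content_namespaces)
            if correct != name:
                del indices[name][title]
                indices[correct][title] = ns
                moved += 1
    return moved


# All schemas were indexed while 640 was still a content namespace.
indices = {"content": {"E10": 640, "E257": 640, "E378": 640}, "general": {}}

# 640 silently stops being a content namespace: nothing is found at first.
without_640 = {0, 120, 146}
before_cleanup = search(indices, 640, without_640)

# The cleanup moves documents in small batches, so results return slowly.
sanitize_batch(indices, without_640, batch_size=2)
after_partial_cleanup = search(indices, 640, without_640)

# Making 640 a content namespace again hides the documents already moved.
with_640 = without_640 | {640}
after_revert = search(indices, 640, with_640)
```

In this model the search finds nothing before the cleanup, two schemas after one cleanup batch, and only the one schema that was never moved once 640 is a content namespace again.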

@ArthurTaylor could you confirm that the change in the EntitySchema characteristics (stopped being a content namespace) was expected?

Well, the patch you found looks like it’s supposed to still register EntitySchema as a content namespace… but I think I vaguely remember a similar issue from before, and it’s that SomeMediaWikiComponent™ has already finished reading $wgContentNamespaces by the time our hook handler runs and adds 640 to it, and so the assignment is a no-op?

So this is fun. I tried to check how Lexeme solves the issue of declaring its dynamically registered namespace as content, and it just doesn’t. We add 120 (Property) and 146 (Lexeme) to $wgContentNamespaces in the production config, which is why they’re content namespaces there; other / third-party wikis apparently get to pound sand. (On my local wiki, the Lexeme namespace is not considered a content namespace.) IMHO we should fix this, but also in the meantime, let’s just add 640 to that production config block to make it content again.

> SomeMediaWikiComponent™ has already finished reading $wgContentNamespaces by the time our hook handler runs and adds 640 to it, and so the assignment is a no-op?

This is broadly correct, by the way – the NamespaceInfo service gets created with a copy of $wgContentNamespaces, and then later that service calls the CanonicalNamespaces hook, so any assignment to $wgContentNamespaces in a CanonicalNamespaces hook handler is never going to be seen by the already-existing NamespaceInfo service.
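
The ordering problem described above can be illustrated with a small Python stand-in. This is a minimal sketch, not MediaWiki code: the class only borrows the NamespaceInfo name, and the `config` dict, method names, and hook function are hypothetical. The point it shows is that a service which copies a setting at construction time never sees changes that a later hook handler makes to the global setting.

```python
class NamespaceInfo:
    """Toy stand-in for a service that snapshots config when constructed."""

    def __init__(self, content_namespaces, hooks):
        # The service keeps its own copy of the setting.
        self._content = list(content_namespaces)
        self._hooks = hooks
        self._canonical = None

    def canonical_namespaces(self):
        # The hook runs lazily, after the service already exists.
        if self._canonical is None:
            names = {0: ""}
            for handler in self._hooks:
                handler(names)
            self._canonical = names
        return self._canonical

    def is_content(self, ns):
        return ns in self._content


# Hypothetical global config, standing in for $wgContentNamespaces.
config = {"content_namespaces": [0, 120, 146]}


def on_canonical_namespaces(names):
    """Hypothetical hook handler registering namespace 640."""
    names[640] = "EntitySchema"
    # Too late: the service copied the list before this handler ran.
    config["content_namespaces"].append(640)


service = NamespaceInfo(config["content_namespaces"], [on_canonical_namespaces])
service.canonical_namespaces()

registered = 640 in service.canonical_namespaces()
in_global_config = 640 in config["content_namespaces"]
seen_as_content = service.is_content(640)
```

After this runs, the namespace is registered and the global list contains 640, but the service still reports 640 as non-content, which matches the no-op behaviour discussed in the comment above.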

Change #1049924 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/mediawiki-config@master] wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces

https://gerrit.wikimedia.org/r/1049924

Change #1049924 merged by jenkins-bot:

[operations/mediawiki-config@master] wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces

https://gerrit.wikimedia.org/r/1049924

Mentioned in SAL (#wikimedia-operations) [2024-06-26T14:11:50Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T14:14:20Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-06-26T14:19:48Z] <logmsgbot> lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1049924|wikidatawiki: Add namespace 640 (EntitySchema) to $wgContentNamespaces (T368010)]] (duration: 07m 57s)

Alright, EntitySchema is a content namespace again. @dcausse, I guess we’ll have to reindex some recently touched EntitySchemas?

Hm, though the search links in the task description still don’t yield the expected results :/

> Hm, though the search links in the task description still don’t yield the expected results :/

Yes, this is sadly kind of expected (I should have told you about this on the config patch, sorry). The cleanup process had already started moving pages around while the entity schema namespace was considered non-content, so those pages are no longer findable now that the namespace has been made a content namespace again. I need to reindex these pages to make search work again, but sadly our tooling is not working as expected and I need to deploy https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/143 first to be able to fix the index. If this is causing major disruption I can mess with the index by hand, but I'd rather not do that unless strictly required. Sorry for the inconvenience!

In the meantime, an ugly workaround is to search both the EntitySchema and EntitySchema talk namespaces but filter on the content model using the keyword contentmodel:EntitySchema: https://www.wikidata.org/w/index.php?search=contentmodel%3AEntitySchema+intitle%3A%2FE%2F&title=Special:Search&profile=advanced&fulltext=1&ns640=1&ns641=1
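
For anyone who wants to script that workaround, the following Python sketch builds an equivalent search URL. The helper name `schema_search_url` is made up for illustration; the query parameters are the ones visible in the link above. Note that `urlencode` also percent-encodes the colon in `Special:Search`, so the string differs slightly from the hand-written link while decoding to the same parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def schema_search_url(query, namespaces=(640, 641)):
    """Build a Special:Search URL restricted to the EntitySchema content model.

    Hypothetical helper: it searches the given namespaces (EntitySchema and
    EntitySchema talk by default) and prepends contentmodel:EntitySchema.
    """
    params = {
        "search": f"contentmodel:EntitySchema {query}",
        "title": "Special:Search",
        "profile": "advanced",
        "fulltext": "1",
    }
    for ns in namespaces:
        params[f"ns{ns}"] = "1"
    return "https://www.wikidata.org/w/index.php?" + urlencode(params)


url = schema_search_url("intitle:/E/")
decoded = parse_qs(urlsplit(url).query)
```

Calling `schema_search_url("human")` would give the same kind of link for the second search in the task description.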

> If this is causing major disruption I can mess with the index by hand, but I'd rather not do that unless strictly required. Sorry for the inconvenience!

I think it’s okay for this to wait a bit, thanks for working on it!

Change #1050599 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/EntitySchema@master] Improve namespace registration

https://gerrit.wikimedia.org/r/1050599

Change #1050599 merged by jenkins-bot:

[mediawiki/extensions/EntitySchema@master] Improve namespace registration

https://gerrit.wikimedia.org/r/1050599

Change #1054582 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: bump image version

https://gerrit.wikimedia.org/r/1054582

Change #1054582 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: bump image version

https://gerrit.wikimedia.org/r/1054582

I think that all 412 schemas are now properly indexed.