
[M] Estimate how many unillustrated articles on Cebuano and Arabic wikis would have matches in MediaSearch
Closed, Resolved (Public)

Description

As an SDAW product manager, I want to understand how many articles on Cebuano and Arabic wikis could potentially be illustrated with our bot writing partners, so that I can determine whether we should consider other features, partners, and improvements to illustrate more articles and meet our SDAW grant requirements.

We know that (as of January 2020) there are 1,939,115 unillustrated articles in total across Cebuano and Arabic wikis, but only 121,390 of them have a candidate from the image matching algorithm. How close to illustrating all 1,939,115 could we get by adding matches from MediaSearch?

Acceptance Criteria:

  • Using the full list of unillustrated articles from Cebuano (ceb) and Arabic (ar) wikis generated from @Miriam's image matching algorithm, write a script to determine an estimate of what percentage of those articles have matches with elastic search scores over (score threshold TBD).

Note that:

  • we will be tweaking elastic scores based on classified data and/or ltr data, so current results will change based on that work (T271799)
  • we have not yet confirmed whether there’s a strong enough correlation between elastic scores & relevance (T272710)

Event Timeline

CBogen renamed this task from "Estimate how many unillustrated articles on Cebuano and Arabic wikis would have matches in MediaSearch" to "[M] Estimate how many unillustrated articles on Cebuano and Arabic wikis would have matches in MediaSearch". Feb 9 2021, 5:18 PM

Hey @Miriam! I'm going to be working on this ticket—how can I get the list of unillustrated Cebuano and Arabic articles from you?


See T273062#6823308

I've got a simple tool up on Toolforge (media-search-measure-hits, code here) to compile this data. Thanks to @Cparle for his help!

Running the script to compile data for both lists of articles will take days. In the meantime, I'm working on getting some random samples to estimate the final number, with the caveat that these random samples won't necessarily be representative of the entire list. More on that tomorrow...

Actually, more on that now! Please take these estimates with a Maldon-sized grain of salt as they are based on a relatively small sample size:

Arabic
On average, 8.16% of articles have mediasearch matches, which would translate to an estimated 47,467 articles in total if the trend holds across the full list

Cebuano
On average, 22.41% of articles have mediasearch matches, which would translate to an estimated 304,194 articles in total if the trend holds across the full list

@AnneT -- interesting work! What criteria did you use to determine whether an article has a match?

Hey @MMiller_WMF! For each article title, I'm doing a search:

  • on Commons
  • using the media search profile
  • in the NS_FILE namespace
  • with a filter for filetype:bitmap

This is basically what you'd get on the Images tab on Special:MediaSearch (I probably should have also included filetype:drawing, but there aren't a lot of files with that type, so it shouldn't make much of a difference). I'm recording the totalhits for each title, and if a title has at least one result, it's considered a match.
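For anyone wanting to reproduce the check, here is a minimal sketch of the per-title query and the "at least one result" rule, using the standard MediaWiki search API on Commons. It is not the code from the Toolforge tool: the helper names are hypothetical, and the sketch leaves out the media search profile because the thread doesn't say how it was selected.

```python
COMMONS_API = "https://commons.wikimedia.org/w/api.php"
NS_FILE = 6  # numeric ID of the File namespace


def build_search_params(title):
    """Build query parameters for one article title (hypothetical helper)."""
    return {
        "action": "query",
        "list": "search",
        # Search keyword filter, as described in the comment above.
        "srsearch": f"{title} filetype:bitmap",
        "srnamespace": NS_FILE,
        "srinfo": "totalhits",
        "srlimit": 1,  # only the hit count is needed, not the results
        "format": "json",
    }


def total_hits(response):
    """Read totalhits from a decoded API response, defaulting to 0."""
    return response.get("query", {}).get("searchinfo", {}).get("totalhits", 0)


def is_match(response):
    """A title counts as a match if the search returned at least one hit."""
    return total_hits(response) >= 1
```

The parameter dict can be passed to any HTTP client (for example as the `params` argument of `requests.get` against `COMMONS_API`), and the decoded JSON handed to `is_match`.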

We realized that the lists of articles included Wikipedia disambiguation pages. These aren't good candidates for the image recommendation tool, so they need to be removed to avoid skewing results. Removing them significantly reduced the number of titles with media search matches for Cebuano Wikipedia.
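The thread doesn't say how the disambiguation pages were identified. One possible approach, sketched here purely as an assumption, is to ask the wiki's API for the `disambiguation` page property and drop every title that has it. The helper names are hypothetical, and the sketch assumes the titles are already in the normalized form the API returns.

```python
def build_pageprops_params(titles):
    """Query parameters asking which of the given titles are disambiguation pages."""
    return {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": 2,  # makes query.pages a list rather than a dict
    }


def drop_disambiguation(titles, response):
    """Return the titles that are not flagged as disambiguation pages."""
    pages = response.get("query", {}).get("pages", [])
    flagged = {
        page["title"]
        for page in pages
        if "disambiguation" in page.get("pageprops", {})
    }
    return [title for title in titles if title not in flagged]
```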

However, adding the labels and aliases of Wikidata items related to the search term to the search query could have a significant positive effect on Arabic and other languages that have decent coverage in Wikidata. In the end, we're pretty close to the initial total estimate. Here are the numbers (a random sample of 20,000 titles was tested from each list):

Arabic

  • Before disambiguation pages were removed: 8.16% of articles had media search matches (estimated 47,467 total articles)
  • After pages were removed: 8.56% with matches (estimated 47,238 total)
  • With synonyms: 37% with matches (estimated 204,184 total)

Cebuano

  • Before pages were removed: 22.41% with matches (estimated 304,194 total)
  • After pages were removed: 10.13% with matches (estimated 106,789 total)
  • With synonyms: 11.31% with matches (estimated 119,288 total)
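The estimated totals above are a sample match rate scaled up to the size of the full list. A minimal sketch of that arithmetic follows, with a rough 95% interval (normal approximation) added to show how much a 20,000-title sample can vary. The interval is an addition for illustration and was not part of the original analysis. The list size used in the example is an assumption, back-computed from the Arabic figures above (47,238 / 0.0856), and the hit count of 1,712 is simply 8.56% of 20,000.

```python
import math


def extrapolate(sample_hits, sample_size, list_size):
    """Scale a sample match rate to the full list.

    Returns (estimate, low, high), where low/high are the bounds of an
    approximate 95% interval based on the normal approximation.
    """
    rate = sample_hits / sample_size
    std_err = math.sqrt(rate * (1 - rate) / sample_size)
    low = max(0.0, rate - 1.96 * std_err)
    high = min(1.0, rate + 1.96 * std_err)
    return (
        round(rate * list_size),
        round(low * list_size),
        round(high * list_size),
    )


# Arabic, after disambiguation pages were removed.
# 551_846 is an assumed list size implied by the figures in this thread.
estimate, low, high = extrapolate(1_712, 20_000, 551_846)
print(estimate, low, high)
```

The interval only covers sampling error. It doesn't account for the caveats raised earlier in the thread, such as whether a hit is actually relevant to the article.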