Goal
- Surface metric numbers from scraper data
- Think about how we support self-serve for accessing scraper data
Steps
- See spreadsheet prepared in T363327: Prepare spreadsheet for new scraper results.
- Retrieve the data from the results
- Document instructions on how the data can be found/extracted
Metrics:
Should be retrievable from current scraper results
- # of duplicate (identical) refs in a given wiki
- identical_refs_count in column E gives the absolute number of identical refs.
- # of articles with at least one identical ref
- pages_with_identical_refs_count in column J
- proportion_of_pages_with_identical_refs in column AF for this number as a proportion of total pages.
- # of articles with more than 25 refs that have at least one identical reference
- proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles with <25 refs, split by wiki.
- Assumption: longer reference lists have more duplicates because duplicates are hard to find and manage there
- # of articles without references
- pages_with_refs_count in column O for the number of pages with at least one ref.
- proportion_of_pages_with_refs in column AI for this number as a proportion of total pages.
- Requested metric can be found with page_count - pages_with_refs_count
- ratio of references to paragraphs per wiki (TBD: Can we even do that without a code change to the scraper and a re-run?)
- wikitext_length_average in column C is a good proxy for paragraph count.
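As a rough sketch of the "articles without references" derivation above (page_count - pages_with_refs_count), the snippet below reads a CSV export of the spreadsheet and computes the count and its share of all pages per wiki. The CSV export itself, the `wiki` column name, and the sample numbers are assumptions for illustration only; `page_count` and `pages_with_refs_count` are the column names mentioned above.

```python
import csv
import io


def pages_without_refs(row):
    """Return (count, proportion) of pages without any ref for one wiki row."""
    page_count = int(row["page_count"])
    with_refs = int(row["pages_with_refs_count"])
    without = page_count - with_refs
    # Guard against an empty wiki to avoid dividing by zero.
    proportion = without / page_count if page_count else 0.0
    return without, proportion


def load_rows(text):
    """Parse CSV text (e.g. a spreadsheet export) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample data; the "wiki" column name is hypothetical.
SAMPLE_CSV = """wiki,page_count,pages_with_refs_count
examplewiki,1000,750
"""

for row in load_rows(SAMPLE_CSV):
    count, share = pages_without_refs(row)
    print(f"{row['wiki']}: {count} pages without refs ({share:.1%})")
```

With a real export, replacing SAMPLE_CSV with the file contents should be enough, provided the header row uses the same column names as the spreadsheet.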