Aggregate some numbers from scraper results
Closed, Resolved (Public)

Description

Goal

  • Surface metric numbers from scraper data
  • Think about how we support self-serve for accessing scraper data

Steps

Metrics:
Should be retrievable from current scraper results

  • # of duplicate (identical) refs in a given wiki
    • identical_refs_count in column E gives the absolute number of identical refs.
  • # of articles with at least one identical ref
    • pages_with_identical_refs_count in column J
    • proportion_of_pages_with_identical_refs in column AF for this number as a proportion of total pages.
  • # of articles with more than 25 refs that have at least one identical ref
  • proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles with <25 refs, split by wiki.
    • Assumption: longer reference lists have more duplicates because duplicates are harder to find and manage there
  • # of articles without references
    • pages_with_refs_count in column O for the number of pages with at least one ref.
    • proportion_of_pages_with_refs in column AI for this number as a proportion of total pages.
    • Requested metric can be found with page_count - pages_with_refs_count
  • ratio of references to paragraphs per wiki (TBD: can we even do that without a code change to the scraper and a re-run?)
    • wikitext_length_average in column C is a good proxy for paragraph count.
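As a rough illustration of the derived metrics above, here is a minimal Python sketch that reads the aggregated per-wiki results as CSV and computes the "articles without references" number from page_count - pages_with_refs_count. The column names page_count, pages_with_refs_count and pages_with_identical_refs_count come from the task description; the `wiki` column name, the sample rows and the function names are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample rows; the real aggregate has many more columns,
# and the "wiki" column name is an assumption.
SAMPLE = """wiki,page_count,pages_with_refs_count,pages_with_identical_refs_count
examplewiki,1000,640,120
otherwiki,200,50,5
"""


def derive_metrics(row):
    """Compute the requested per-wiki numbers from one aggregate row."""
    page_count = int(row["page_count"])
    with_refs = int(row["pages_with_refs_count"])
    with_identical = int(row["pages_with_identical_refs_count"])
    without_refs = page_count - with_refs
    return {
        "wiki": row["wiki"],
        "pages_without_refs_count": without_refs,
        # Guard against empty wikis to avoid dividing by zero.
        "proportion_of_pages_without_refs": (
            without_refs / page_count if page_count else None
        ),
        "proportion_of_pages_with_identical_refs": (
            with_identical / page_count if page_count else None
        ),
    }


def load_metrics(text):
    """Parse CSV text and return one metrics dict per wiki."""
    return [derive_metrics(row) for row in csv.DictReader(io.StringIO(text))]


if __name__ == "__main__":
    for metrics in load_metrics(SAMPLE):
        print(metrics)
```

The same function could be pointed at an exported copy of the spreadsheet by reading the file contents and passing them to `load_metrics`.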

Code to review

Event Timeline

WMDE-Fisch updated the task description.
WMDE-Fisch subscribed.
WMDE-Fisch renamed this task from Scraper metrics to Aggregate some numbers from Scraper results. (Apr 24 2024, 10:33 AM)
awight renamed this task from Aggregate some numbers from Scraper results to Aggregate some numbers from scraper results. (Apr 24 2024, 11:43 AM)
awight updated the task description.

TODO: a bit of coding to reprocess the existing page summaries to produce the "25 refs or more" statistic.

proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles <25 refs, split by wiki.

FWIW, the naive calculation for this will be confounded by the increased chance of having an accidental duplicate as there are more refs.
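To make the confound concrete, here is a toy model, not a claim about how refs are actually distributed: assume each ref on a page is drawn independently and uniformly from a pool of `pool_size` possible sources. Under that assumption the chance of at least one coincidental duplicate is the classic birthday-problem probability, which grows quickly with the number of refs even if editors on long pages are no worse at managing them. The function name and the pool size are hypothetical.

```python
def prob_at_least_one_duplicate(ref_count, pool_size):
    """Probability that ref_count uniform draws from pool_size distinct
    sources contain at least one repeat (birthday-problem model)."""
    if ref_count > pool_size:
        return 1.0  # pigeonhole: a repeat is guaranteed
    p_all_distinct = 1.0
    for i in range(ref_count):
        p_all_distinct *= (pool_size - i) / pool_size
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Arbitrary pool size, chosen only to show the trend.
    for refs in (5, 10, 25, 50, 100):
        print(refs, round(prob_at_least_one_duplicate(refs, 1000), 3))
```

A comparison between long and short articles would therefore need some baseline of this kind, or a per-ref rather than per-page measure, before the difference can be attributed to reference lists being harder to manage.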

Is there anything left to do, or should we close this ticket?

awight claimed this task.