Aggregate some numbers from scraper results
Closed, Resolved (Public)

Assigned To

Authored By

	Lina_Farid_WMDE
	Apr 24 2024, 7:01 AM

Description

Goal

Steps

See spreadsheet prepared in T363327: Prepare spreadsheet for new scraper results.
Retrieve the data from the results
Document instructions on how the data can be found/extracted

Metrics:
Should be retrievable from current scraper results

# of duplicate (identical) refs in a given wiki
- identical_refs_count in column E gives the absolute number of identical refs.
# of articles with at least one identical ref
- pages_with_identical_refs_count in column J
- proportion_of_pages_with_identical_refs in column AF for this number as a proportion of total pages.
# of articles with more than 25 refs that have at least one identical reference,
proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles with <25 refs, split by wiki.
- Assumption: longer reference lists have more duplicates because duplicates are harder to find and manage in a long list
# of articles without references
- pages_with_refs_count in column O for the number of pages with at least one ref.
- proportion_of_pages_with_refs in column AI for this number as a proportion of total pages.
- Requested metric can be found with page_count - pages_with_refs_count
ratio of references to paragraphs per wiki (TBD: Can we even do that without a code change to the scraper and a re-run?)
- wikitext_length_average in column C is a good proxy for paragraph count.
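
The derived "articles without references" number can be sketched as below. This is only an illustration: it assumes the spreadsheet is exported as CSV with the column names listed above as headers, the `wiki` column name is a guess, and the sample numbers are made up.

```python
import csv
import io

# Hypothetical export of the spreadsheet. The "wiki" header is an assumption
# and all numbers are invented for illustration only.
SAMPLE_CSV = """wiki,page_count,pages_with_refs_count,pages_with_identical_refs_count
examplewiki,1000,800,120
testwiki,200,50,5
"""


def derive_metrics(row):
    """Derive the requested per-wiki numbers from one spreadsheet row."""
    page_count = int(row["page_count"])
    pages_with_refs = int(row["pages_with_refs_count"])
    pages_with_identical = int(row["pages_with_identical_refs_count"])

    # Requested metric: page_count - pages_with_refs_count
    pages_without_refs = page_count - pages_with_refs

    def proportion(part):
        # Guard against an empty wiki to avoid dividing by zero.
        return part / page_count if page_count else 0.0

    return {
        "wiki": row["wiki"],
        "pages_without_refs_count": pages_without_refs,
        "proportion_of_pages_without_refs": proportion(pages_without_refs),
        "proportion_of_pages_with_identical_refs": proportion(pages_with_identical),
    }


def derive_all(csv_text):
    """Apply derive_metrics to every row of a CSV export."""
    return [derive_metrics(row) for row in csv.DictReader(io.StringIO(csv_text))]


results = derive_all(SAMPLE_CSV)
```

The proportion of pages with identical refs is recomputed here only as a cross-check against the value already present in column AF.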

WMDE-Fisch updated the task description.

WMDE-Fisch subscribed.

WMDE-Fisch renamed this task from Scraper metrics to Aggregate some numbers from Scraper results. Apr 24 2024, 10:33 AM

awight renamed this task from Aggregate some numbers from Scraper results to Aggregate some numbers from scraper results. Apr 24 2024, 11:43 AM

awight updated the task description.

TODO: a bit of coding to reprocess existing page summaries to produce the "25 refs or more" statistic.
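
A rough sketch of what that reprocessing could look like. The actual format of the page summaries is not described in this task, so the field names `wiki`, `ref_count` and `identical_ref_count` are hypothetical placeholders, and the sample pages are invented.

```python
from collections import defaultdict


def split_by_ref_count(page_summaries, threshold=25):
    """Aggregate duplicate-ref numbers per wiki, split into long and short articles.

    Each page summary is assumed to be a dict with the hypothetical keys
    "wiki", "ref_count" and "identical_ref_count".
    """
    totals = defaultdict(
        lambda: {"pages": 0, "pages_with_identical_refs": 0, "refs": 0, "identical_refs": 0}
    )
    for page in page_summaries:
        # "More than 25 refs" as worded in the description; use >= instead
        # for the "25 refs or more" reading.
        bucket = "long" if page["ref_count"] > threshold else "short"
        entry = totals[(page["wiki"], bucket)]
        entry["pages"] += 1
        entry["refs"] += page["ref_count"]
        entry["identical_refs"] += page["identical_ref_count"]
        if page["identical_ref_count"] > 0:
            entry["pages_with_identical_refs"] += 1

    result = {}
    for key, entry in totals.items():
        # Proportion of duplicate refs among all refs in the bucket.
        proportion = entry["identical_refs"] / entry["refs"] if entry["refs"] else 0.0
        result[key] = dict(entry, proportion_identical=proportion)
    return result


# Invented sample pages, for illustration only.
sample_pages = [
    {"wiki": "examplewiki", "ref_count": 40, "identical_ref_count": 4},
    {"wiki": "examplewiki", "ref_count": 60, "identical_ref_count": 0},
    {"wiki": "examplewiki", "ref_count": 10, "identical_ref_count": 1},
    {"wiki": "examplewiki", "ref_count": 0, "identical_ref_count": 0},
]
stats = split_by_ref_count(sample_pages)
```

Articles with exactly 25 refs land in the "short" bucket here; the description leaves that boundary case open.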

proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles <25 refs, split by wiki.

FWIW, the naive calculation for this will be confounded by the increased chance of having an accidental duplicate as there are more refs.
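
To illustrate the confounding, here is a toy model in the style of the birthday problem. It assumes every ref independently cites one of `pool_size` equally likely sources, which is a deliberate oversimplification and not something measured by the scraper. Both the function name and the parameters are made up for this sketch.

```python
def chance_of_accidental_duplicate(ref_count, pool_size):
    """Probability that at least two of ref_count refs cite the same source.

    Toy model: each ref picks uniformly at random from pool_size sources.
    """
    if ref_count > pool_size:
        # More refs than sources: a duplicate is unavoidable.
        return 1.0
    p_all_distinct = 1.0
    for already_used in range(ref_count):
        # The next ref must avoid every source that is already cited.
        p_all_distinct *= 1 - already_used / pool_size
    return 1 - p_all_distinct


# Compare a short and a long reference list under the same assumed pool.
short_list = chance_of_accidental_duplicate(10, 1000)
long_list = chance_of_accidental_duplicate(50, 1000)
```

Under this model the chance of at least one duplicate grows with the length of the reference list even when editors behave identically, so a difference between the two buckets does not by itself show that long lists are harder to manage.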

Is there anything left to do, or should we close this ticket?

awight closed this task as Resolved. May 30 2024, 12:21 PM

awight claimed this task.