Run HTML dump scraper (June 2024)
Closed, Resolved (Public)

Description

We need to do the full run again, due to a mistake in how we were detecting template-generated content.

This task will be linked to other subtasks which should be completed before the run.

Event Timeline

Change #1037437 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] Temporary monitoring for scraper

https://gerrit.wikimedia.org/r/1037437

Change #1037437 merged by Filippo Giunchedi:

[operations/puppet@production] Temporary monitoring for scraper

https://gerrit.wikimedia.org/r/1037437

Verification is slightly harder than expected, because one of the code changes makes it possible to time out on individual pages. As a result, the page count decreased by 224. The dropped pages presumably timed out because of their length and complexity.

Generally, columns we expected to stay the same are nearly the same, and the ones we expected to change are very different. For example, ref_by_transclusion_count increased by 47%, and refs_with_solely_transclusion_count increased 177-fold. proportion_of_refs_from_transclusion increased by 48%. The lists of templates producing refs have also grown, and now include many templates that we were previously missing due to their structure.
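For anyone repeating the comparison, here is a minimal sketch of how the "percent" and "fold" figures above relate to each other. The function names are hypothetical, and the input values in the usage note are illustrative placeholders rather than the actual column totals from either run.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100.0


def fold_change(old: float, new: float) -> float:
    """Ratio of new to old, e.g. 177.0 means the value is 177 times larger."""
    if old == 0:
        raise ValueError("fold change is undefined when the old value is 0")
    return new / old
```

With placeholder inputs, `percent_change(100, 147)` corresponds to a 47% increase, and `fold_change(2, 354)` corresponds to a 177-fold increase.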

Full details are in the worksheet "v0.3.1 verification".

I'll go ahead and start the full run.

The timeout issue is interesting. Our new run reported 569 timed-out pages, but only 224 fewer pages were processed than in the previous run using the older software version. Maybe the remaining 345 pages (569 - 224) also failed in the older version, but silently, and the errors just weren't logged as nicely?

We could test this theory locally by running a few of the timed-out pages through the older version of the software.
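A rough sketch of how the candidates for that local test could be picked, assuming we can get the set of page titles that timed out in the new run and the set of page titles present in the old run's output. Both inputs and both function names are hypothetical. Nothing here reflects the scraper's actual code or output format.

```python
from __future__ import annotations


def unexplained_timeouts(timed_out: int, page_count_drop: int) -> int:
    """Timed-out pages that are not accounted for by the drop in page count."""
    return timed_out - page_count_drop


def split_timed_out_pages(
    timed_out_new: set[str], processed_old: set[str]
) -> tuple[set[str], set[str]]:
    """Split the new run's timed-out pages into two groups.

    newly_lost: pages the old run did output, so the timeout is a real change.
    already_missing: pages absent from the old output too, which are the
    candidates for a silent failure in the older version.
    """
    newly_lost = timed_out_new & processed_old
    already_missing = timed_out_new - processed_old
    return newly_lost, already_missing
```

`unexplained_timeouts(569, 224)` gives the 345 mentioned above. If the theory holds, `already_missing` should contain roughly that many pages, and running a few of them through the older version should show them failing without a logged error.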

Change #1040075 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] [DNM] Revert "Temporary monitoring for scraper"

https://gerrit.wikimedia.org/r/1040075

awight updated the task description.

Change #1040075 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "Temporary monitoring for scraper"

https://gerrit.wikimedia.org/r/1040075

awight claimed this task.