The unit for mediawiki-history-drop-snapshot.service is in an error state. The journalctl logs have been cleared, but I found the error in syslog:
```
razzi@an-launcher1002:/var/log$ zgrep 'HDFS directories to check' *
...
syslog.7.gz:Mar 30 06:26:25 an-launcher1002 kerberos-run-command[20455]: 2022-03-30T06:26:25 ERROR Selected partitions extracted from table specs ({'snapshot=2022-01-24', 'snapshot=2022-01-31'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []
```
Running the command with --verbose and --dry-run showed which table was failing:
```
2022-04-06T19:52:52 DEBUG Processing table wikidata_entity keeping 6 snapshots
2022-04-06T19:52:52 DEBUG Getting partitions to drop...
2022-04-06T19:52:52 DEBUG Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T19:53:05 DEBUG Getting directories to remove...
2022-04-06T19:53:05 DEBUG Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T19:53:17 DEBUG Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T19:53:19 ERROR Selected partitions extracted from table specs ({'snapshot=2022-01-31', 'snapshot=2022-02-07', 'snapshot=2022-01-24'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []
```
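As an aside on the selection step: here is a minimal sketch of how the "keeping 6 snapshots" logic presumably works, assuming the script just sorts the partition specs and drops everything but the newest N (the function name and details are my guess, not refinery's actual code):

```
# Minimal sketch of snapshot retention (assumed, not refinery's actual code).
# Lexicographic sorting works because snapshot values are ISO dates.
def select_partitions_to_drop(partition_specs, keep=6):
    ordered = sorted(partition_specs)
    return ordered[:-keep]  # the oldest len(ordered) - keep specs; [] if <= keep
```

With nine weekly specs in, the three oldest come out, which matches the "Dropping 3 partitions" we see in the dry run below.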
@BTullis and I had thought we could fix this by manually adding some _SUCCESS files, but the files are there:
```
razzi@an-launcher1002:~$ hdfs dfs -ls /wmf/data/wmf/wikidata/entity/snapshot=*/_SUCCESS
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-24/_SUCCESS
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-01-31/_SUCCESS
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-07/_SUCCESS
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/_SUCCESS
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/_SUCCESS
-rw-r----- 3 analytics analytics-privatedata-users 0 2022-03-16 17:56 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-28/_SUCCESS
```
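Since the files exist, one way the "data paths" side could still come out as set() is if the script's extraction of partition specs from the listed paths matches nothing, e.g. because the location it globbed (taken from DESCRIBE FORMATTED) doesn't point at these directories. A hypothetical sketch of that extraction step (the regex and names are mine, not refinery's):

```
import re

# Hypothetical extraction of partition specs from HDFS paths; if the pattern
# (or the base path the script globbed) doesn't line up with the real layout,
# the result is an empty set, matching the error above.
PARTITION_RE = re.compile(r'snapshot=\d{4}-\d{2}-\d{2}')

def partitions_from_paths(paths):
    specs = set()
    for path in paths:
        match = PARTITION_RE.search(path)
        if match:
            specs.add(match.group(0))
    return specs

print(partitions_from_paths(
    ['/wmf/data/wmf/wikidata/entity/snapshot=2022-01-24/_SUCCESS']
))  # {'snapshot=2022-01-24'}
```

The paths above would parse fine, so the empty set() in the error suggests the ls driven by the table's registered location returned no paths at all.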
I wonder if the partitions and directories got out of sync, and now the script will not fix the situation because the check for partitions versus directories happens before anything is removed:
```
if not non_strict:
    check_partitions_vs_directories(partitions, directories)
drop_partitions(hive, table, partitions, dry_run)
remove_directories(hive, table, directories, dry_run)
```
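My reading of that guard, sketched out (the real check_partitions_vs_directories lives in refinery's python module; this just mirrors the error message from the logs, reusing the hypothetical partitions_from_paths() above):

```
# Sketch of the check's apparent behavior (assumed, not refinery's source):
# compare the specs from Hive metadata against specs derived from data paths,
# and abort on any difference.
def check_partitions_vs_directories(partitions, directories):
    from_specs = set(partitions)
    from_paths = partitions_from_paths(directories)
    if from_specs != from_paths:
        raise RuntimeError(
            'Selected partitions extracted from table specs ({}) does not '
            'match selected partitions extracted from data paths ({}). '
            'HDFS directories to check: {}'.format(
                from_specs, from_paths, directories))
```

With --non-strict the comparison is skipped entirely, and drop_partitions and remove_directories each act on whatever their own side found.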
So I ran the script with --dry-run and --non-strict:
```
PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python /usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-mediawiki-snapshots --verbose --dry-run --non-strict
...
2022-04-06T20:38:34 DEBUG Processing table wikidata_entity keeping 6 snapshots
2022-04-06T20:38:34 DEBUG Getting partitions to drop...
2022-04-06T20:38:34 DEBUG Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; SHOW PARTITIONS wikidata_entity;
2022-04-06T20:38:47 DEBUG Getting directories to remove...
2022-04-06T20:38:47 DEBUG Running: hive --service cli --database wmf -e SET hive.cli.print.header=false; DESCRIBE FORMATTED wikidata_entity;
2022-04-06T20:38:59 DEBUG Running: hdfs dfs -ls -d hdfs://analytics-hadoop/wmf/data/wmf/wikidata/entity/*/_SUCCESS
2022-04-06T20:39:02 INFO Dropping 3 partitions from wmf.wikidata_entity
2022-04-06T20:39:02 DEBUG snapshot='2022-02-07'
2022-04-06T20:39:02 DEBUG snapshot='2022-01-31'
2022-04-06T20:39:02 DEBUG snapshot='2022-01-24'
```
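If this reading is right, the actual cleanup would presumably be the same invocation minus --dry-run (not run yet, pending review):

```
PYTHONPATH=${PYTHONPATH}:/srv/deployment/analytics/refinery/python /usr/local/bin/kerberos-run-command analytics /srv/deployment/analytics/refinery/bin/refinery-drop-mediawiki-snapshots --verbose --non-strict
```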
Here's the full output, since running it takes a while: https://phabricator.wikimedia.org/P24175
Two Hive tables are affected: wmf.wikidata_item_page_link and wmf.wikidata_entity.
So it looks like running it with --non-strict would work. Before I do, I'm hoping somebody on the team can weigh in on my understanding: @JAllemandou and/or @mforns perhaps?