
20240220 database backup dump appears stuck
Closed, ResolvedPublicBUG REPORT

Description

What happens?:
https://dumps.wikimedia.org/backup-index.html

It has been more than four days since the commonswiki dump process started. Database dumps for the other wikis will not start until it finishes.

What should have happened instead?:
In previous runs, a database dump process would typically complete within three days.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

TTO renamed this task from Database backup dump is stuck or slow to 20240220 database backup dump appears stuck.Feb 26 2024, 11:29 AM
TTO updated the task description.
TTO subscribed.

Noticed this too. Only wikidatawiki and enwiki have managed to generate a full dump.

xcollazo changed the task status from Open to In Progress.Feb 26 2024, 3:30 PM
xcollazo claimed this task.
xcollazo subscribed.

Thanks for the report. Will investigate.

Per https://wikitech.wikimedia.org/wiki/Dumps/Troubleshooting, we should kill the offending commonswiki dump job; systemd should then restart it automatically.

  1. Figure out who is running the job:
xcollazo@snapshot1010:/mnt/dumpsdata/xmldatadumps/private/commonswiki$ cat lock_20240220 
snapshot1010.eqiad.wmnet 62214
  2. Figure out details of that job:
sudo -u dumpsgen bash

$ ps -Af | grep "dumpsgen 62214"
dumpsgen 58920 25859  0 20:04 pts/0    00:00:00 grep dumpsgen 62214
dumpsgen 62214 35658  9 Feb21 ?        12:55:41 python3 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --log --job articlesmultistreamdump,articlesmultistreamdumprecombine --skipdone --exclusive --prereqs --date 20240220
  3. Kill it:
$ kill -9 62214

Will now wait to see if the job auto recovers.
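The three steps above can be sketched as one script. This is a hedged sketch, not tooling from the dumps repo: the lock-file path and the `dumpsgen` user are taken from this ticket, and the lock file is assumed to hold a single `<hostname> <pid>` line as shown above.

```shell
#!/bin/bash
# Sketch of the recovery steps from this ticket; adjust the path
# for the wiki and run date in question.
lockfile=/mnt/dumpsdata/xmldatadumps/private/commonswiki/lock_20240220

# The lock file holds "<hostname> <pid>" for the worker owning the run.
parse_lock() {
  local host pid
  read -r host pid <<<"$1"
  printf '%s %s\n' "$host" "$pid"
}

if [ -r "$lockfile" ]; then
  read -r host pid <<<"$(cat "$lockfile")"
  echo "run locked by $host, pid $pid"
  # On $host, as dumpsgen: inspect the job before killing it.
  ps -fp "$pid"
  # Kill it; systemd is expected to restart the worker automatically.
  kill -9 "$pid"
fi
```

Note that the kill must be run on the host named in the lock file, which is not necessarily the host you are logged into.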

Another node has picked up the job:

dumpsgen@snapshot1010:/mnt/dumpsdata/xmldatadumps/private/commonswiki$ cat lock_20240220 
snapshot1011.eqiad.wmnet 4038

There was a query from @brennen in #wikimedia-sre about the logs being generated by mediawiki on snapshot1010, shown here: https://logstash.wikimedia.org/goto/6e47b4196b9f54592fdcdc5c30c9d98d

I investigated briefly and discovered this ticket, so I'm cross-linking for visibility.

Mentioned in SAL (#wikimedia-operations) [2024-02-26T22:42:48Z] <TimStarling> on snapshot1010 killed PHP processes left over from kill -9 of python parents T358458
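`kill -9` terminates only the Python parent; its child PHP processes are reparented to PID 1 and keep running. A hedged way to spot such leftovers (assuming, as in this ticket, that they run as `dumpsgen`):

```shell
#!/bin/bash
# List dumpsgen-owned processes whose parent is now init (PPID 1),
# i.e. orphans left behind after kill -9 of their parent process.
find_orphans() {
  # Reads "pid ppid args" lines on stdin; prints PIDs whose ppid is 1.
  awk '$2 == 1 {print $1}'
}

# "pid=" etc. suppress the header line so the output is parseable.
ps -u dumpsgen -o pid=,ppid=,args= 2>/dev/null | find_orphans
# Review the list before killing anything, e.g.: kill <pid>
```

On systems where user services run under a per-user systemd instance, legitimate processes can also have PPID 1, so the list needs a manual sanity check before killing.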

Most dumps now marked as "Dump complete".

Recombine job for commonswiki failed. Waiting for it to retry automatically.

@xcollazo thanks for sorting this out quickly! Much appreciated.

All dumps marked as complete now.

commonswiki attempted the "Recombine multiple bz2 streams" job 4 times but ultimately failed. All other dumps from commonswiki are available at https://dumps.wikimedia.org/commonswiki/20240220/.

I'll keep an eye on the next run to see whether this is a persistent issue, but it looks like a one-off.

Thanks @Amayus and @TTO for the report.
