CX was disabled because it caused an outage on one of the database servers during the datacenter switch, affecting CX and other products. It is now back online.
Summary so far:
- This issue looks similar to the previous incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation (after which we fixed bugs in the auto-save, added a ping-limiter, and Aaron improved the queries and locking)
- There were hundreds of blocked queries, which eventually brought the database down by exceeding the connection limit
- The load on the database was not very high; most of the queries were in the wait state
- Language team changed the front-end to be much more conservative in the number of retries, the delay between them, and saving in general, to mitigate the symptoms in the future: https://gerrit.wikimedia.org/r/349214
- CX has been re-enabled.
- Likely root cause has been found: a bug in the frontend code that, for certain articles, made the save-draft request extra large by including unrelated content, combined with suboptimal autosave retry logic. Both have been fixed.
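
For illustration only, here is a rough sketch of what "more conservative" retry logic can look like: a bounded number of retries with a capped, exponentially growing delay between them. This is not the code from the Gerrit change above; the function names, the 5 second base delay, the 300 second cap and the retry limit are all assumptions made up for the example.

```python
import random


def retry_delay(attempt, base=5.0, cap=300.0, jitter=0.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles on each attempt and never exceeds `cap`
    (before jitter). All defaults are hypothetical values.
    """
    delay = min(cap, base * (2 ** attempt))
    # Optional jitter spreads clients out so they do not retry in lockstep.
    return delay * (1.0 + jitter * rng())


def save_with_retries(save, max_retries=3, sleep=None, **delay_opts):
    """Call `save()` until it returns True or the retries run out.

    `save` is a hypothetical callable standing in for the draft-save
    request. Returns (succeeded, delays_waited). Pass `sleep=time.sleep`
    to actually wait; by default the delays are only recorded.
    """
    waited = []
    for attempt in range(max_retries + 1):
        if save():
            return True, waited
        if attempt == max_retries:
            break  # give up instead of piling more requests on the server
        delay = retry_delay(attempt, **delay_opts)
        waited.append(delay)
        if sleep is not None:
            sleep(delay)
    return False, waited
```

The point of the cap on retries is that a client whose saves keep failing stops adding requests, rather than joining hundreds of others waiting on the same locked rows.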