Page MenuHomePhabricator

Do a root-cause analysis on CX outage during dc switch and get it back online
Closed, ResolvedPublic

Description

CX was disabled, but is now back online , because it caused an outage during the datacenter switch on one of the database servers affecting CX and other products.

Summary so far:

  • This issue looks similar to the previous incident https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation (after which we fixed bugs in the auto-save, added a ping-limiter, and Aaron improved the queries and locking)
  • There were hundreds of blocked queries, that eventually brought the database down by exceeding the connection limit
  • There wasn't very high load on the database, most of the queries were in the wait state
  • Language team changed the front-end to be much more conservative in the amount and delay between retries and saving in general to mitigate the symptoms in the future: https://gerrit.wikimedia.org/r/349214
  • CX has been re-enabled.
  • Likely root cause has been found: a bug in the frontend code that in certain articles caused the save draft request size to be extra large due to inclusion of unrelated content combined with unoptimal autosave-retry-logic. Both have been fixed.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 349205 had a related patch set uploaded (by Santhosh):
[mediawiki/extensions/ContentTranslation@master] Fix the retry mechanism for save failure

https://gerrit.wikimedia.org/r/349205

Change 349214 had a related patch set uploaded (by Nikerabbit):
[mediawiki/extensions/ContentTranslation@master] Improve CX draft saving logic

https://gerrit.wikimedia.org/r/349214

Unfortunately I did not capture any of the queries that were long-running during the outage. However, the FOR UPDATE queries are very simple and should complete fast:

EXPLAIN SELECT * FROM `bw_cx_corpora` WHERE cxc_translation_id = '194' AND cxc_section_id = 'mwCA' AND cxc_origin = 'user' ORDER BY cxc_timestamp DESC LIMIT 1 FOR UPDATE;
idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEbw_cx_corporarefbw_cx_corpora_uniquebw_cx_corpora_unique188const,const,const1Using where

I don't think the problem is in this query. Obviously it will lock itself to prevent multiple updates for the same user, but it should not block updates for other users. The update queries look very fast too:

EXPLAIN UPDATE  `bw_cx_corpora` SET cxc_sequence_id = '57',cxc_timestamp = '20170420134736',cxc_content = '<p data-seqid=\"57\" id=\"cxmwCA\" class=\"\" data-source=\"mwCA\" data-cx-weight=\"338\" data-cx-draft=\"true\" contenteditable=\"true\"><span class=\"cx-segment\" data-segmentid=\"58\">Maine coon on lempeä, rauhallinen ja leikkisä kissa. </span><span class=\"cx-segment\" data-segmentid=\"59\">Sillä on tuuhea ja pitkä turkki, jota on harjattava aina tarvittaessa. </span><span class=\"cx-segment\" data-segmentid=\"60\">Maine coonin korvanpäitä koristavat ilvesmäiset tupsut. </span><span class=\"cx-segment\" data-segmentid=\"61\">Sen kuono on neliömäinen ja vahva. </span><span class=\"cx-segment\" data-segmentid=\"62\">Maine cooneja on monen väriasiä, kuten suomalaisia kotikissoja. No minkäs teet kun elämä on mitä on. Niin no syö pois vaan.</span></p>' WHERE cxc_translation_id = '194' AND cxc_section_id = 'mwCA' AND cxc_origin = 'user' AND cxc_timestamp = '20170420130920';
idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1SIMPLEbw_cx_corporarangebw_cx_corpora_uniquebw_cx_corpora_unique233NULL1Using where

Possible issue here is though that there can be multiple of these select-for-update->update in one save request. I wonder, is it possible that if there are simultaneous save requests, they would somehow get deadlocked with updates?

Change 349205 abandoned by Santhosh:
Fix the retry mechanism for save failure

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/349214

https://gerrit.wikimedia.org/r/349205

Change 349214 merged by jenkins-bot:
[mediawiki/extensions/ContentTranslation@master] Improve CX draft saving logic

https://gerrit.wikimedia.org/r/349214

I captured a few (3) draft save requests on my wiki and replayed them in an endless loop. I did not see any deadlocks and queries were completing relatively quickly even under load.

Looking at logstash, I see first query error for CX at 2017-04-19T14:32:07Z and then silent until growing rapidly starting from 2017-04-19T14:39:38Z.

Graphite does not support the theory that there were an unusual amount of save requests coming. In fact there were larger spikes in the previous days. I am not sure how much of the data is sampled, but it looks quite reliable.

pasted_file (737×1 px, 199 KB)

Save events also look normal:

[log]> select hour(timestamp) h, count(*) from ContentTranslation_11628043 where timestamp > 20170419000000 AND timestamp < 20170419230000 and event_action = 'save' group by h;
+------+----------+
| h    | count(*) |
+------+----------+
|    0 |       17 |
|    1 |        3 |
|    2 |        7 |
|    3 |        6 |
|    4 |       10 |
|    5 |       17 |
|    6 |       21 |
|    7 |       21 |
|    8 |       23 |
|    9 |       39 |
|   10 |       30 |
|   11 |       33 |
|   12 |       27 |
|   13 |       38 |
|   14 |       17 |
+------+----------+
15 rows in set (0.02 sec)
mysql:research_prod@db1047 [log]> select hour(timestamp) h, count(*) from ContentTranslationError_11767097 where timestamp > 20170419000000 AND timestamp < 20170419230000 group by h;+------+----------+
| h    | count(*) |
+------+----------+
|    0 |        1 |
|    2 |        1 |
|    4 |        2 |
|    5 |        2 |
|    7 |        1 |
|    9 |        3 |
|   10 |        4 |
|   12 |        1 |
|   13 |        2 |
|   14 |      389 |
|   15 |      643 |
|   16 |      149 |
|   17 |       57 |
|   18 |       71 |
|   19 |        1 |
|   21 |        5 |
+------+----------+
16 rows in set (0.01 sec)
[log]> select event_trace, count(*) from ContentTranslationError_11767097 where timestamp > 20170419130000 AND timestamp < 20170419160000 and event_trace not like '%read-only%' and event_trace not like '%Unrecognized value%' group by event_trace;

The above shows:

  • 36 Wikimedia error
  • 186 timeout
  • ~200 database errors
[log]> select substr(timestamp,9,4) as hourminute, count(*) from ContentTranslationError_11767097 where timestamp > 20170419130000 AND timestamp < 20170419160000 and event_trace not like '%read-only%' and event_trace not like '%Unrecognized value%' group by hourminute; 
+------------+----------+
| hourminute | count(*) |
+------------+----------+
| 1327       |        1 |
| 1339       |        1 |
| 1409       |        1 |
| 1423       |        9 |
| 1424       |       12 |
| 1425       |       14 |
| 1426       |        1 |
| 1432       |        1 |
| 1439       |        3 |
| 1440       |        5 |
| 1441       |        4 |
| 1442       |        2 |
| 1443       |        2 |
| 1444       |        4 |
| 1445       |        2 |
| 1446       |        6 |
| 1447       |        6 |
| 1448       |        5 |
| 1449       |        3 |
| 1450       |        3 |
| 1451       |        4 |
| 1452       |        4 |
| 1453       |        3 |
| 1454       |        6 |
| 1455       |        4 |
| 1456       |        4 |
| 1457       |        2 |
| 1458       |        2 |
| 1459       |        3 |
| 1500       |        1 |
| 1501       |        1 |
| 1502       |        3 |
| 1503       |        2 |
| 1505       |        1 |
| 1506       |        3 |
| 1507       |        3 |
| 1508       |        3 |
| 1509       |        2 |
| 1510       |        2 |
| 1511       |        2 |
| 1512       |        3 |
| 1513       |        4 |
| 1514       |        3 |
| 1515       |        5 |
| 1516       |        5 |
| 1517       |        5 |
| 1518       |        6 |
| 1519       |        5 |
| 1520       |        5 |
| 1521       |        3 |
| 1522       |        6 |
| 1523       |        5 |
| 1524       |        6 |
| 1525       |        7 |
| 1526       |        6 |
| 1527       |        3 |
| 1528       |        5 |
| 1529       |        4 |
| 1530       |       10 |
| 1531       |        8 |
| 1532       |        3 |
| 1533       |        9 |
| 1534       |       17 |
| 1535       |       18 |
| 1536       |       35 |
| 1537       |       34 |
| 1538       |       32 |
| 1539       |       39 |
| 1540       |       16 |
+------------+----------+
69 rows in set (0.01 sec)
mysql:research_prod@db1047 [log]> select event_token, count(*) from ContentTranslationError_11767097 where timestamp > 20170419130000 AND timestamp < 20170419160000 and event_trace not like '%read-only%' and event_trace not like '%Unrecognized value%' group by event_token;
++----------+
|| count(*) |
++----------+
||        1 |
||        5 |
||        1 |
||        1 |
||       16 |
||        2 |
||        7 |
||       26 |
||       20 |
||        6 |
||       13 |
||       19 |
||        4 |
||        8 |
||       18 |
||       12 |
||        1 |
||        1 |
||       26 |
||       10 |
||        1 |
||        1 |
||       10 |
||        4 |
||       23 |
||        7 |
||       59 |
||        4 |
||       13 |
||        5 |
||        4 |
||       18 |
||        1 |
||        3 |
||       34 |
||       49 |
||       14 |
++----------+
37 rows in set (0.02 sec)

(How many errors each unique user got)

Unfortunately I did not capture any of the queries that were long-running during the outage

I can give you that.

On here P5295 (NDA-only, as queries contain private data) you will see the queries longer than 10 seconds sent on the 24h band of the switchover, ordered from more impacting to less impacting, and the number of hits, the max execution time, the average time, the total time, hoists, user and schema https://phabricator.wikimedia.org/P5295. Filter by host=db2033 to get the ones relevant here.

Note that once connections reached 10000 concurrent connections service (including monitoring) went unresponsive.

@jcrespo How can I add myself to WMF-NDA group (AFAIK, I've signed it).

@KartikMistry Phabricator permissions are handled by its admins (the ones with the Administrator role), although I am not sure they personally handle that group. I would start by asking @Aklapper or someone on Security.

This is too much work, I will manually filter and paste here the list of queries related to translation.

P5316 Only visible by subscribers.

The trouble with that paste is that innocent queries are apparently caught up in general slowness on the host. It's hard to imagine how this could be a root cause:

SELECT /* ContentTranslation\CorporaLookup::getByTranslationId */ cxc_translation_id, cxc_origin, cxc_section_id, cxc_timestamp, cxc_sequence_id, cxc_content FROM `cx_corpora` WHERE cxc_translation_id = '307102' /* 82727d73457caba7015d5b694f7e3906 db2033 wikishared 303s */

It uses an index and only scans 10 rows, I checked on db2033. The only really suspicious thing is the locking select, but it is no slower than the non-locking select quoted above.

Maybe if there were a lot of concurrent locking selects that didn't appear in the log, that would explain the very long query times (>2000s) for the write queries on the same table. But you have to consider it in the context of presumably very badly overloaded disk I/O. A transaction design which is fine under ordinary circumstances could easily fail when the disk is overloaded.

I think the most likely situation is that a disk I/O overload caused otherwise innocuous high-traffic queries to overflow the connection limit. The cause of the disk I/O overload is not apparent from @jcrespo's paste. As such, I would recommend re-enabling CX, since it's not clear that it was the root cause.

I reviewed the locking select in the TranslationStorageManager source code, and it looks fine. So to repeat, I think it is unclear whether CX was at fault, and so it should be re-enabled.

Cold DB caches are always going to be a problem with periodic DC failover, as opposed to active/active. The correct resolution is to send query traffic to both DCs.

Change 349869 had a related patch set uploaded (by KartikMistry):
[operations/mediawiki-config@master] Re-enable ContentTranslation

https://gerrit.wikimedia.org/r/349869

I disagree, there are insert queries that were clearly blocked forever- the selects for updates are blocking new inserts- that is a bug, and it has nothing to do with IO load, but with the logic of the application during the failover. Most likely the lack of proper indexes or a very large gap lock were creating huge contention. Dismissing this as a mere overload is a recipe for it to happen again- the database had 10000 running queries, and those where blocked for execution by locks, not by disk performance- you can see at https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db2033&var-network=eth0&from=1492596763858&to=1492666378218 that load is technically low for having 10000 running queries at the same time- which indicates locking contention, not io or cpu contention. Compare that with the IO overload of db2062: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=1492596763858&to=1492666378218&var-server=db2062&var-network=eth0

Dismissing this as a mere overload is a recipe for it to happen again- the database had 10000 running queries, and those where blocked for execution by locks, not by disk performance- you can see at https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&var-server=db2033&var-network=eth0&from=1492596763858&to=1492666378218 that load is technically low for having 10000 running queries at the same time- which indicates locking contention, not io or cpu contention. Compare that with the IO overload of db2062: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=1492596763858&to=1492666378218&var-server=db2062&var-network=eth0

Yes, fair enough. Grafana even has states from SHOW PROCESSLIST showing definitively that most threads were waiting for locks: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=37&fullscreen&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2033&from=1492609361891&to=1492619776325

there are insert queries that were clearly blocked forever- the selects for updates are blocking new inserts- that is a bug, and it has nothing to do with IO load, but with the logic of the application during the failover.

The "translation auto save" feature in CX is responsible for the queries we are looking at the above pastes. It also has a retry mechanism in case of save failures. In https://gerrit.wikimedia.org/r/#/c/349214/ we did some corrections and improvments on the logic of application when a failure at database side happens. Basically it puts a limit on retry attempts and increases the time between retries. This should avoid building up of the queue.

While we relook into the queries, to better understand the issue, independent of a situation like dc switch, in general, is there a problem we see with queries that demand disabling CX?

I disagree, there are insert queries that were clearly blocked forever- the selects for updates are blocking new inserts- that is a bug

So the code is:

if ( new ) {
  INSERT ...;
} else {
  SELECT ... FOR UPDATE;
  UPDATE ...;
}

Are you suggesting that if two requests compete, in such a way that one manages to create entry in cx_translations but other query manages to lock it with SELECT ... FOR UPDATE; before the first requests manages to INSERT, then they block each other and likely other inserts as well?

What puzzles is me, how can they block forever? Shouldn't we get some kind of deadlock or timeout errors? The only errors I find in logstash are lost connection/can't connect/too many connections.

is there a problem we see with queries that demand disabling CX?

cxtranslation should check if there is read_only on the database or on mediawiki (or both) and not ever lock or try to modify the rows on those cases. It should also not retry saving (at most once), and then return an error to the user without losing his/her form. Basically, it seems to me that because cxtranslation is not using the regular edit workflow for those half-translated bits, it has to reimplement it again to not make the same mistakes probably edits once had. This is clearly repeatable by setting a test database in read only- failure should happen immediately, and not take minutes.

Also, can you identify the gap locking consequences of the queries you are running?

Also, can you identify the gap locking consequences of the queries you are running?

As far as I know that is explained by Aaron in https://github.com/wikimedia/mediawiki-extensions-ContentTranslation/blob/master/includes/TranslationStorageManager.php#L84-L118

is there a problem we see with queries that demand disabling CX?

cxtranslation should check if there is read_only on the database or on mediawiki (or both) and not ever lock or try to modify the rows on those cases. It should also not retry saving (at most once), and then return an error to the user without losing his/her form. Basically, it seems to me that because cxtranslation is not using the regular edit workflow for those half-translated bits, it has to reimplement it again to not make the same mistakes probably edits once had. This is clearly repeatable by setting a test database in read only- failure should happen immediately, and not take minutes.

If either $wgReadOnly or main db is in read only, the api will return read-only error immediately before we execute any queries. If cx tables are in a separate host, which is marked read only while the main db is not, then we get exceptions during update/insert (but afaict we saw no such errors during the outage).

Can you clarify "clearly repeatable by setting a test database in read only", because I could not produce any kind of issue while testing the various combinations (and as far as I know, mysql only supports setting all databases to read only, not a specific database).

Change 350200 had a related patch set uploaded (by KartikMistry; owner: Nikerabbit):
[mediawiki/extensions/ContentTranslation@wmf/1.29.0-wmf.20] Improve CX draft saving logic

https://gerrit.wikimedia.org/r/350200

Change 350200 merged by Dereckson:
[mediawiki/extensions/ContentTranslation@wmf/1.29.0-wmf.20] Improve CX draft saving logic

https://gerrit.wikimedia.org/r/350200

Change 349869 merged by jenkins-bot:
[operations/mediawiki-config@master] Re-enable ContentTranslation

https://gerrit.wikimedia.org/r/349869

Mentioned in SAL (#wikimedia-operations) [2017-04-25T13:49:24Z] <hashar@naos> Synchronized wmf-config/InitialiseSettings.php: Re-enable ContentTranslation - T163344 (duration: 00m 44s)

Nikerabbit lowered the priority of this task from Unbreak Now! to High.Apr 25 2017, 2:57 PM
Nikerabbit removed a project: Patch-For-Review.

Lowering priority as CX is currently back (we are monitoring and ready to disable though). Finding the root cause is important.

On https://fr.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en, I've this :

Permission error

You do not have permission to edit this page, for the following reason:

The action you have requested is limited to users in one of the groups: $1.

Reported on frwiki.

Edit: https://en.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en works for me !

On https://fr.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en, I've this :

Permission error

You do not have permission to edit this page, for the following reason:

The action you have requested is limited to users in one of the groups: $1.

Reported on frwiki.

Edit: https://en.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en works for me !

Most likely, the user is not logged in, or hasn't enabled the beta feature.

The message about permissions is confusing, but not related to this issue. It's probably T102729.

Most likely, the user is not logged in, or hasn't enabled the beta feature.

The message about permissions is confusing, but not related to this issue. It's probably T102729.

I've was able to use the tool before the bug, and I've change nothing in beta prefs since last use :(

On https://fr.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en, I've this :

Permission error

You do not have permission to edit this page, for the following reason:

The action you have requested is limited to users in one of the groups: $1.

Reported on frwiki.

Edit: https://en.wikipedia.org/w/index.php?title=Special:ContentTranslation&campaign=contributionsmenu&to=fr&uselang=en works for me !

Most likely, the user is not logged in, or hasn't enabled the beta feature.

The message about permissions is confusing, but not related to this issue. It's probably T102729.

I can repro: I've indeed this message when the beta feature is disabled on en., and it works when beta feature is enabled.

@Framawiki It's on en. you need it enabled, as you call the special page from en.

@Framawiki It's on en. you need it enabled, as you call the special page from en.

No, I copied the URL that I've used on frwiki and replace fr by en. As you can see in my last message I used for both campaign=contributionsmenu

I'm not able to reproduce this with my bot account. It is probably a mistake, sorry.

So the code is:

if ( new ) {
  INSERT ...;
} else {
  SELECT ... FOR UPDATE;
  UPDATE ...;
}

It looks to me like this whole thing could be replaced by a call to Database::upsert(), which in MySQL is implemented as INSERT ... ON DUPLICATE KEY UPDATE. That should reduce the lock time by a couple of orders of magnitude. There is a unique index on cx_corpora which is identical to the WHERE conditions in the existing UPDATE.

We discourage the usage of INSERT ... ON DUPLICATE KEY UPDATE- that doesn't mean they shouln't be used, but they are mostly unsafe for binlog statements (they can generate different results on master and slave due to locking). If an upsert is needed, REPLACE is suggested instead when possible. If it is increment of a field, maybe reading on slave + update/insert on master + exception handling retrying once.

We should also reduce the innodb_lock_wait_timeout to a few seconds, but that is a global change and could have negative impact on other places.

There seems to be an complication, perhaps unnecessary, that the timestamp is part of the unique index. I believe this was made so that we could in future store the history as well. But for now we are not keeping history, and instead doing extra effort for finding an existing row (if any) and replacing its contents or inserting a new row.

We could just start inserting new entries unconditionally: this would increase space usage possibly by an order of magnitude. We would also need to audit the code that we always load the newest one (this kind of a bug is already possible, but the proposed change would it make much more apparent). Another option would be to drop the timestamp from the index. In both cases we would need to take care of locking to serialize updates to the same unit and/or handle errors about duplicates.

I also noticed that the ::update() code conditions could be simplified by using cxc_id PRIMARY key instead of repeating the conditions for the UNIQUE INDEX. I'm not sure if that makes any difference performance wise, or can it introduce new locking bugs?

Yet another thing that puzzles me is that once we get a save request for a draft, it can carry multiple translation units. Per my understanding the whole update thing is inside a transaction, but during it we loop over each translation unit acquiring locks incrementally. If another save request sees that the draft exists, and starts acquiring the locks incrementally in the same way, can't that lead into a deadlock? Should we perhaps batch ::find() so that it acquires locks for all units simultaneously? Does it also make sense to batch the INSERTs/UPDATEs too?

Santhosh found and fixed T163105: CX template editor's own HTML may end up in the published as an HTML blob by investigating the slow queries in P5316. This fits well with the observation of increased database traffic during the incident.

Status:

  • Follow-up task T164050 added to the sprint beginning today.
  • T163105 is waiting for deployment.
  • For today's dc switch, we are going to carefully watch what happens with ContentTranslation.

T164050 was declined after it was found out it isn't anywhere near simple to implement. Other fixes have been deployed and incident report has been completed.

  NODES
HOME 1
Note 2
OOP 4
os 42
server 12
text 1
Users 4