Page MenuHomePhabricator

httpb fails upon deployment of 1.44.0-wmf.5
Open, MediumPublicPRODUCTION ERROR

Description

When promoting group 0 wikis to 1.44.0-wmf.5, the httpbb smoke tests encountered a failure. I erroneously cancelled the train and had to run the scap train command again losing time. It turns out the check can be ignored, but then surely it should NOT fail in the first place. Since that has happened again as I promoted wikis to group 1, I am filing this task.

For references, some previous tasks: T364880, T360867

The issue was on November 27th, 2024:

09:25:53 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:104)
  mwdebug1001.eqiad.wmnet
    Status code: expected 301, got 503.
    Location header: expected 'https://www.wikidata.org/wiki/Wikidata:Main_Page', was missing.
===
FAIL: 131 requests sent to each of 4 hosts. 1 request with failed assertions.

Event Timeline

hashar triaged this task as Medium priority.Nov 27 2024, 9:34 AM
hashar created this task.

Some logs from php-fpm being restarted

Timetypelevelmessageprogram
Nov 27, 2024 @ 09:24:59.702syslogNOTICETerminating ...php7.4-fpm
Nov 27, 2024 @ 09:24:59.709syslogNOTICEexiting, bye-bye!php7.4-fpm
Nov 27, 2024 @ 09:24:59.810syslogNOTICEready to handle connectionsphp7.4-fpm
Nov 27, 2024 @ 09:24:59.840syslogNOTICEfpm is running, pid 23867php7.4-fpm
Nov 27, 2024 @ 09:24:59.843syslogNOTICEsystemd monitor interval set to 10000msphp7.4-fpm

And in scap:

Nov 27, 2024 @ 09:25:02.030Finished php-fpm-restarts (duration: 00m 03s)
Nov 27, 2024 @ 09:25:02.030Finished sync-testservers (duration: 00m 19s)
Nov 27, 2024 @ 09:25:02.032Started check-testservers
Nov 27, 2024 @ 09:25:02.032Executing check 'check_testservers_baremetal-1_of_1'
Nov 27, 2024 @ 09:25:53.719Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...

So the check is started more than two seconds after fpm is running. Not sure whether there is an additional warm up needed.

I could not find any details about the 503 / Server side error, then I don't have access to the raw log on mwdebug1001 :/

This is probably what you are looking for

[2024-11-27T09:25:11.887Z] "GET /wiki/Wikidata:Main_Page HTTP/1.1" 503 UC 0 1953 0 - "-" "httpbb (https://wikitech.wikimedia.org/wiki/Httpbb) hashar@deploy2002" "59a38e7d-b14f-933f-b1f8-d950308c74ab" "wikidata.org" "[2620:0:861:103:10:64:32:123]:80"

This is from envoy (which fronts apache, which fronts php-fpm) and it's the only one is close enough to what the pasted timestamps above are.

UC for envoy means UpstreamConnectionTermination. The next 3 fields are bytes received, bytes sent and response time.

This looks like a long running HTTP connection that got terminated while in flight. It's possibly random in nature, but could also be result of some race condition. How many times has it been reproduced up to now?

FWIW, I had two similar errors on Monday, no idea if the logs are still around. IRC:

2024-11-25 14:46:13 	<Lucas_WMDE> 	one of the test server checks failed
2024-11-25 14:46:19 	<Lucas_WMDE> 	expected 200 got 503
2024-11-25 14:46:28 	<Lucas_WMDE> 	retrying
2024-11-25 14:47:17 	<Lucas_WMDE> 	failed again but this time a different check
2024-11-25 14:47:29 	<Lucas_WMDE> 	first one was the internal.w.o check in k8s-2_of_2
2024-11-25 14:47:39 	<Lucas_WMDE> 	second one was the techconduct.w.o check in baremetal-1_of_1
2024-11-25 14:47:42 	<Lucas_WMDE> 	o_O
2024-11-25 14:47:47 	<Lucas_WMDE> 	retrying again…
2024-11-25 14:48:17 	<logmsgbot> 	!log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1097381|Pass context to 'revreview-pending-basic' on history page (T380519)]], [[gerrit:1097382|Use Contexts for Message objects in review dialog (tooltip) (T380519)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
2024-11-25 14:48:25 	<Lucas_WMDE> 	now it went through

How many times has it been reproduced up to now?

I had one yesterday when promoting group 0 which came from boardgovcomwiki (group 1) and died on the k8s mwdebug:

Nov 26, 2024 @ 11:21:17.591
Check 'check_testservers_k8s-2_of_2' failed: Sending to mwdebug-next.discovery.wmnet...
https://boardgovcom.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:43)
    Status code: expected 200, got 503.
    Body: expected to contain 'Board Governance Committee', got '<!DOCTYPE html>\n<html lang="en">\n<meta charset="ut'... (1952 characters total).
===
FAIL: 131 requests sent to mwdebug-next.discovery.wmnet. 1 request with failed assertions.
 -

They can be found via the Scap ECS OpenSearch dashboard and searching for httpbb.

I am not sure what Envoy message Upstream connection termination in addition to 503 response code means. That sounds like Apache having terminated the connection with a 503? (so maybe php-fpm timing out)

I have removed this from the list of train subtasks, it only got there cause I have used Create Subtask to create it, but it is NOT a blocker since we can retry the checks.

Adding some more information, no smoking gun yet.

Per https://logstash.wikimedia.org/goto/548ff670fa572570dee5f9265f988778 we had some httpbb related error, during scap, 33 times in the last 2 months. Of those

  • 11 times for mwdebug1001
  • 11 times for mwdebug1002
  • 11 times for mwdebug2001
  • 8 times for mwdebug2002
  • 2 times for mwdebug-next
  • 8 times for mwdebug

In case you wonder why these don't add up, it's because in some cases we 've had the error for multiple ones, e.g.

Check 'check_testservers_baremetal' failed: Sending to 4 hosts...
https://it.wikiquote.org/wiki/Pagina_principale (/srv/deployment/httpbb-tests/appserver/test_main.yaml:169)
  mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
    Status code: expected 200, got 500.
    Body: expected to contain 'Wikiquote', got '<!DOCTYPE html>\n<html lang="en"

Of the 33 times above, only 18 are about a 503. The other 15 are about a 500. Since the 500s are also the ones that are commonly shared across more than 1 mwdebug group and they are also a valid MediaWiki response, I 'll not be looking into those more. But here's an apache log excerpt for one of these

2024-11-19T09:12:58	365564	2620:0:861:103:10:64:32:123	proxy:unix:/run/php/fpm-www-7.4.sock|fcgi://www-7.4/500	109	GET	http://en.wikipedia.org/favicon.ico	-	text/html	-	-	httpbb (https://wikitech.wikimedia.org/wiki/Httpbb) aklapper@deploy2002	-	-	-	-	2620:0:861:103:10:64:32:123	d0c7e876-6942-998b-b20f-bd56b9203d41

The above was from php-fpm.

Looking into the 18 503s

  • 2 for mwdebug
  • 2 for mwdebug-next
  • 2 for mwdebug2002
  • 4 for mwdebug2001
  • 5 for mwdebug1002
  • 3 for mwdebug1001

Those do add up to 18

We don't have logs from more than 2 weeks on mwdebug hosts, so I switched to 2 weeks logs and there are 6 503 errors. All of those between Nov 25 and Nov 27.

All of the 6 are of the form pasted above.

As far as what Upstream connection termination in addition to 503 response code goes, it means that envoy returns a 503 because the upstream cluster (in this case indeed Apache) has terminated the connection. Most probably as envoy starts sending the request.

This grafana explore panel aligns pretty well with the errors in the last 3 days.

image.png (660×1 px, 68 KB)

upstream_cx_destroy_remote_with_active_rq means that the remote side (in this case Apache) destroyed the connections while at least 1+ request was active.

So, Apache is for some reason killing those requests.

How many times has it been reproduced up to now?

I also experienced this issue last week (1.44.0-wmf.4) once and for group0 only. The Retry succeeded so I didn't bother to report it (thanks hashar for doing it!).

How many times has it been reproduced up to now?

I also experienced this issue last week (1.44.0-wmf.4) once and for group0 only. The Retry succeeded so I didn't bother to report it (thanks hashar for doing it!).

Yes, yours was the example 500 I pasted in T380958#10361297 (it has your username in it as well). PHP fumbled on that one and a 500 error was returned. That's a MediaWiki issue, it probably should be tracked in a different task.

I 've merged https://gerrit.wikimedia.org/r/1099159 which should introduce 1 retry when envoy talks to Apache. Given the future of mwdebug hosts and that this should solve the majority of 503 related issues, I think it's a solution which fits the current problem. I 'll monitor for next week and unless there are more reports, I 'll resolve.

Regarding HTTP 500 errors when httpbb runs, those are real MediaWiki/PHP errors and the usual processes for resolving those should be followed.

Now erroring with:

11:13:16 Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
https://boardgovcom.wikimedia.org/wiki/Main_Page (/srv/deployment/httpbb-tests/appserver/test_remnant.yaml:43)
  mwdebug2002.codfw.wmnet
    Status code: expected 200, got 503.
    Body: expected to contain 'Board Governance Committee', got '<!DOCTYPE html>\n<html lang="en">\n<meta charset="ut'... (1953 characters total).
===
FAIL: 131 requests sent to each of 4 hosts. 1 request with failed assertions.

I'm confused about this. I see the 503 in envoy log:

[2024-12-02T11:12:47.052Z] "GET /wiki/Main_Page HTTP/1.1" 503 UC 0 1953 5 - "-" "httpbb (https://wikitech.wikimedia.org/wiki/Httpbb) ladsgroup@deploy2002" "2432a294-955e-92e4-bb7d-1778db5d43c5" "boardgovcom.wikimedia.org" "10.192.16.66:80"

It's not the first request in the test batch, and I see a successful requests in the same few seconds in the apache log, so it doesn't look like a race condition between the tests and php-fpm restart (timestamped at 11:12:32).

I would also have expected a URX response flag if it had retried and failed.

As far as I can tell, setting profile::tlsproxy::envoy::retries removes

retry_policy:
  num_retries: 0

falling back to the default of 1 retry, but I wonder if it actually does retry

  NODES
Idea 1
idea 1
INTERN 1
Note 1
Project 1