Parsoid migration to php 7.4
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Joe
	Jul 8 2022, 11:23 AM

Description

Given parsoid doesn't receive traffic from the public, and the fact we've noticed huge performance gains when using php 7.4 for parsing, we want to transition parsoid ASAP.

This would go as follows:

First we'll transition scandium, the parsoid test server, to use php7.4. @ssastry please let us know when that would be acceptable for your team. Ideally we'd like you to run a full batch of tests before we move any traffic too.
Then we'll start installing the new parse1* servers with php7.4 only, and progressively add them to the rotation, removing the old servers in the process. This will allow us to move traffic by 4% chunks (1/24)

Details

Subject	Repo	Branch	Lines +/-
Update wgLinterSubmitterWhitelist	operations/mediawiki-config	master	+24 -24
parsoid: Add parse1* servers to conftool	operations/puppet	production	+24 -0
parsoid: Install parse1* servers with only php 7.4	operations/puppet	production	+15 -1
parsoid::testing: switch to php 7.4 by default	operations/puppet	production	+2 -1
parsoid::testing: install php 7.4	operations/puppet	production	+12 -7

Customize query in gerrit

Related Objects
Search...

Status	Subtype	Assigned	Task
Resolved		None	T248925 Make MediaWiki release tarball compatible with PHP 8.0
Resolved		Jdforrester-WMF	T300463 Make PHP 8.0 voting on MW master
Resolved		Jdforrester-WMF	T313563 Bump lcobucci/jwt & league/uri-components for php8
Resolved		Jdforrester-WMF	T313564 Bump onoi/message-reporter in vendor.git to 1.4.2 for php 8 support
Resolved		Jdforrester-WMF	T247658 Make Wikimedia Production MediaWiki compatible with PHP 7.4
Resolved		• toan	T243590 Fix WikibaseDataModel CI for php 7.4
Resolved		Lucas_Werkmeister_WMDE	T316923 Restore skipped test in ReferenceListTest.php
Resolved		Joe	T318918 Undeploy patch to use old PHP serialization in PHP 7.4
Resolved		Jdforrester-WMF	T264168 Drop PHP 7.2 support from Wikibase master branch, once Wikimedia production is on 7.4
Resolved		Ladsgroup	T270740 Drop hacky support for doctrine/dbal class renaming
Invalid		None	T303505 [S] Remove Deprecated methods "serialize" and "unserialize" after php production upgrade to PHP 7.4
Resolved		Reedy	T251043 Cleanup css-sanitizer when it only requires PHP >= 7.4
Open		None	T166010 The Great Namespaceization Effort
Resolved		• tstarling	T277618 var_dump() on various objects writes gigabytes of data and takes minutes to run
Resolved		Jdforrester-WMF	T261872 Drop PHP 7.2 & 7.3 support from MediaWiki master branch, once Wikimedia production is on 7.4
Resolved	PRODUCTION ERROR	Legoktm	T293568 PHP Notice: Undefined offset in wikimedia/remex-html when rendering rest.php error page
Resolved		• tstarling	T297667 mysqli/mysqlnd memory leak
Resolved		Joe	T271736 Migrate WMF production from PHP 7.2 to PHP 7.4
Resolved		Clement_Goubert	T312638 Parsoid migration to php 7.4
Resolved		Clement_Goubert	T307219 Put parse parse10[01-24] in production
Resolved		Clement_Goubert	T318946 Parsoid migration: Cleanup

Event Timeline

Joe created this task.Jul 8 2022, 11:23 AM

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptJul 8 2022, 11:23 AM

Joe triaged this task as High priority.Jul 26 2022, 3:34 PM

You can switch scandium to php 7.4 this week, and we can redo baseline test runs.

But, please give me a heads up before you do that switch so we can make sure not to kick off test runs in that period OR ask you to wait for an additional few hours for any ongoing test run to finish.

ssastry moved this task from Needs Triage to Performance on the Parsoid board.Jul 26 2022, 10:42 PM

Change 817699 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] parsoid::testing: install php 7.4

https://gerrit.wikimedia.org/r/817699

Change 817701 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] parsoid::testing: switch to php 7.4 by default

https://gerrit.wikimedia.org/r/817701

Once we've moved scandium, my plan for parsoid would be to install the new hardware we got with php 7.4 only and progressively swap out old hardware for the new one, thus moving traffic towards php 7.4 more and more.

Change 817699 merged by Giuseppe Lavagetto:

[operations/puppet@production] parsoid::testing: install php 7.4

https://gerrit.wikimedia.org/r/817699

@subbu would it be ok if we switched to php 7.4 tomorrow, July 28th, at 9:00 UTC?

I have all the patches lined up to do that.

Sure, that works for us.

Let us know when it's done and we can kick off an rt-test run to kick the tires.

Change 817701 merged by Giuseppe Lavagetto:

[operations/puppet@production] parsoid::testing: switch to php 7.4 by default

https://gerrit.wikimedia.org/r/817701

@cscott @ssastry it's done now, all requests without an explicit PHP_ENGINE cookie will be routed to php 7.4 on scandium.

Quick summary:

After an initial hiccup during which roundtrip testing was broken for about a week, we got a test run in y'day. Based on rough estimations (eyeball comparison of area under the scandium load curve during testing), performance seems to have improved about 8-10%. This is the total time across about 180K pages (of varying sizes) and includes wt -> html, html -> wt (without selective serialization), and html -> wt (with selective serialization).

There were some warnings in logstash on about 4 pages in the test run that seem to comes from Math rendering. So, unless that is somehow related to PHP 7.4, I think the test themselves are clean. No unexpected regressions or errors or fatals.

(You will have to zoom into the respective test window period for logstash and grafana).

Overall, everything seems good so far. There is another test run that will probably complete in about 2-3 hours and once that is done, I'll reconfirm.

The new test run completed and the perf is roughly similar. As for the logstash warning, looking at a 3-month window, I see identical warnings from May and June test runs.

So, I think we are good to go with bumping PHP versions on the production cluster.

@Arlolra @cscott. FYI.

Change 827498 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] parsoid: Install parse1* servers with only php 7.4

https://gerrit.wikimedia.org/r/827498

Change 827498 merged by Clément Goubert:

[operations/puppet@production] parsoid: Install parse1* servers with only php 7.4

https://gerrit.wikimedia.org/r/827498

Change 827513 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] parsoid: Add parse1* servers to dsh

https://gerrit.wikimedia.org/r/827513

Change 827513 merged by Clément Goubert:

[operations/puppet@production] parsoid: Add parse1* servers to conftool

https://gerrit.wikimedia.org/r/827513

Mentioned in SAL (#wikimedia-operations) [2022-08-29T16:08:09Z] <claime> pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Mentioned in SAL (#wikimedia-operations) [2022-08-29T16:12:49Z] <claime> depooled wtp1034.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638

parse1001.eqiad.wmnet pooled in place of wtp1034.eqiad.wmnet
Now serving 4% of parsoid traffic from php7.4 only.

Reverted almost immediately due to https://phabricator.wikimedia.org/T316601

Icinga downtime and Alertmanager silence (ID=9b43b97d-2497-4b6c-848a-987b749a898e) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 24 host(s) and their services with reason: Downtiming php7.4 parsoid servers until they are ready to pool

parse[1001-1024].eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host parse1002.eqiad.wmnet with OS buster

• Jersione subscribed.Aug 31 2022, 9:24 AM

• Jersione claimed this task.Aug 31 2022, 9:26 AM

Ladsgroup removed • Jersione as the assignee of this task.Aug 31 2022, 9:27 AM

Clement_Goubert claimed this task.Aug 31 2022, 9:28 AM

Clement_Goubert moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.Aug 31 2022, 9:37 AM

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host parse1002.eqiad.wmnet with OS buster completed:

parse1002 (WARN)
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh buster OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208310917_cgoubert_297807_parse1002.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Failed to run httpbb tests
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

@Clement_Goubert Just wanted to share something, it's not a problem at all but Icinga monitoring does alert on hosts "not being in dsh groups". This may sound strange at first without context. Example link that currently alerts for parse1002 is:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=parse1002&service=mediawiki-installation+DSH+group

It's called that because back in the days we used https://wikitech.wikimedia.org/wiki/Dsh for scap deployments. Then the static files with lists of servers were replaced with https://wikitech.wikimedia.org/wiki/Conftool.

So what it means by "in dsh groups" is actually "it's not pooled in conftool" (does not appear at https://config-master.wikimedia.org/pybal/eqiad/parsoid-php)
but it's also not completely removed yet.

P.S. the cookbook was supposed to set the dowtime for that but that failed for some reason, so nothing you were expected to do differently. I ACKed it manually in Icinga.

In T312638#8204469, @Dzahn wrote:

P.S. the cookbook was supposed to set the dowtime for that but that failed for some reason, so nothing you were expected to do differently. I ACKed it manually in Icinga.

That's not correct. The cookbooks worked as expected and the downtime was properly set:

Downtimed the new host on Icinga/Alertmanager

And was not removed at the end of the cookbook because Icinga was not green:

Icinga status is not optimal, downtime not removed

It then just expired.

The last item is highlighted in Italic in the report and not bold because it's a warning as some hosts after reimage are not green by design until manual intervention.
And the actual failing checks are reported in the console:

2022-08-31 10:05:14,264 cgoubert 297807 [WARNING] [14/15, retrying in 42.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.
<locals>.check' raised: Not all services are recovered: parse1002:mediawiki-installation DSH group
[...SNIP...]
2022-08-31 10:05:57,228 cgoubert 297807 [WARNING] //Icinga status is not optimal, downtime not removed//

When that happen the operator should check the failing checks and act accordingly.

Then the downtime expired 2h after it was set, at 11:33:

Aug 31 11:33:07 alert1001 icinga: HOST DOWNTIME ALERT: parse1002;STOPPED; Host has exited from a period of scheduled downtime

Fopr the record, that failure was expected and Clement was well aware of

In T312638#8204449, @Dzahn wrote:

@Clement_Goubert Just wanted to share something, it's not a problem at all but Icinga monitoring does alert on hosts "not being in dsh groups". This may sound strange at first without context. Example link that currently alerts for parse1002 is:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=parse1002&service=mediawiki-installation+DSH+group

It's called that because back in the days we used https://wikitech.wikimedia.org/wiki/Dsh for scap deployments. Then the static files with lists of servers were replaced with https://wikitech.wikimedia.org/wiki/Conftool.

So what it means by "in dsh groups" is actually "it's not pooled in conftool" (does not appear at https://config-master.wikimedia.org/pybal/eqiad/parsoid-php)
but it's also not completely removed yet.

Clement is aware of all of the above, we just had a downtime expire on us because of T316601. They are working with me on this.

Icinga downtime and Alertmanager silence (ID=e0342c4d-b0d1-490b-b740-0d2962a32ac0) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Readding downtime removed by reimage

parse1002.eqiad.wmnet

What happened is, I downtimed all parse1* for a week while we worked on them with @Joe
Reimage of parse1002 reset that to 2h for that host, which led to Icinga alerting. As mentioned above, I re-added the downtime, and we'll work on adding these hosts to conftool soon.

Change 828786 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/mediawiki-config@master] Update wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/828786

Clement_Goubert changed the task status from Open to In Progress.Sep 1 2022, 9:27 AM

Clement_Goubert added a subtask: T307219: Put parse parse10[01-24] in production.

Clement_Goubert changed the status of subtask T307219: Put parse parse10[01-24] in production from Open to In Progress.

Change 828786 merged by jenkins-bot:

[operations/mediawiki-config@master] Update wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/828786

Mentioned in SAL (#wikimedia-operations) [2022-09-01T09:58:30Z] <cgoubert@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:828786|Update wgLinterSubmitterWhitelist (T312638)]] (duration: 03m 37s)

Mentioned in SAL (#wikimedia-operations) [2022-09-01T10:43:05Z] <claime> pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-01T10:43:38Z] <claime> depooled wtp1034.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638

Sgs unsubscribed.Sep 1 2022, 10:44 AM

Mentioned in SAL (#wikimedia-operations) [2022-09-01T10:58:38Z] <claime> pooled parse1002.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-01T11:04:20Z] <claime> depooled wtp1035.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638

I will be moving production operations comments to https://phabricator.wikimedia.org/T307219 as not to clutter this task further

8% of parse traffic now served in php7.4 only

In T312638#8205255, @Volans wrote:

That's not correct. The cookbooks worked as expected and the downtime was properly set:

Sorry, @Volans. The timing made it look like it on IRC to me. I saw the cookbook run and the alert afterwards. This is even better if there is no issue.

In T312638#8205257, @Joe wrote

Clement is aware of all of the above, we just had a downtime expire on us because of T316601. They are working with me on this.

ACK. My intention was just to share some info why it's called dsh and has an alert. I did definitely not mean to make it a big deal. The reason I chose the ticket over IRC was to stay async and avoid realtime pings or disrupt work. I'm sorry if that sounded different or like I thought the alert itself was important. Please disregard and carry on.

Currently none of the parse1* hosts is a canary server, is that intended to be changed at some point?

In T312638#8209540, @Zabe wrote:

Currently none of the parse1* hosts is a canary server, is that intended to be changed at some point?

I will be working iteratively through the list in conftool-data/node/eqiad.yaml, looping around when I reach the end. When I replace wtp102[5-6].eqiad.wmnet (current canaries), I'll put parse100[1-2] as canaries.

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:16:19Z] <claime> pooled parse1003.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:24:55Z] <claime> depooled wtp1036.eqiad.wmnet from parsoid cluster https://phabricator.wikimedia.org/T312638

Icinga downtime and Alertmanager silence (ID=5982b372-9469-405c-a18d-48d12b854a91) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1034-1036].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:53:52Z] <claime> pooled parse1004.eqiad.wmnet (php 7.4 only) in parsoid cluster T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-05T12:14:10Z] <claime> depooled wtp1037.eqiad.wmnet from parsoid cluster T312638

Icinga downtime and Alertmanager silence (ID=83cf85ce-8731-463b-9d53-4500611c52ac) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 18 host(s) and their services with reason: Downtime pending inclusion in production

parse[1007-1024].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=3d4522be-4ec5-4d1a-8bba-1d5621e4d400) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replace wtp servers

wtp[1036-1038].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=adaec891-8daa-470f-9d2f-6c2b62e7f043) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1039-1040].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=bdf6f9e0-2c64-471e-8b85-05f874724182) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1041-1043].eqiad.wmnet

100% of parse traffic served in php 7.4

Clement_Goubert closed subtask T307219: Put parse parse10[01-24] in production as Resolved.Sep 9 2022, 9:59 AM

Old wtp servers now in the hands of DCops for decom.

https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html?orgId=1&refresh=30s&from=now-90d&to=now&viewPanel=43 shows a dip in the time per output KB over the last few days which is a good sign.

NOTE: (1) This is 90-day view (2) This reports time per output KB which is impacted a bit less by request rate volumes as long as change in request rates don't significantly skew the page and page size distribution. (3) This is log10 scale on the Y-axis so the change is more prominent than it appears.

Saved from https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html for future reference:

Screenshot 2022-09-09 at 20.36.24.png (1×2 px, 476 KB)

Screenshot 2022-09-09 at 20.37.11.png (1×2 px, 492 KB)

Krinkle added a project: Wikimedia-Performance-publish.Sep 9 2022, 6:40 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

Krinkle moved this task from Untriaged to Misc bookmarks on the Wikimedia-Performance-publish board.

html2wt time per output KB also saw a dip although it is not as prominent as the wt2html one.

But, the p75 html2wt components graph makes it really obvious. This change is for all the components, even the small lines at the bottom (which shows up when you suppress the domdiff and serialize graphs).

Looks like maybe an approximately 30% speedup based on eyeballing the various plots in the full panel.

In T312638#8225508, @ssastry wrote:

html2wt time per output KB also saw a dip although it is not as prominent as the wt2html one.

But, the p75 html2wt components graph makes it really obvious. This change is for all the components, even the small lines at the bottom (which shows up when you suppress the domdiff and serialize graphs).

Looks like maybe an approximately 30% speedup based on eyeballing the various plots in the full panel.

From the aggregated averages we also collect from apache I would say the performance gain is between 30 and 35%. It is due to both the new php version and the use of newer hardware.

I am mostly interested in looking at the effect on the number of parsoid timeouts.

https://logstash.wikimedia.org/goto/ad3709f398e3d85115d1d6088fa9e888

Screenshot 2022-09-13 at 12.55.40.png (682×2 px, 83 KB)

Last 30 days of parsoid timeouts, with migration steps added. It's still a small sample size, but it seems like there's a marked reduction of timeouts after completing the migration.

Clement_Goubert closed this task as Resolved.Sep 27 2022, 9:47 AM

Clement_Goubert closed subtask T318946: Parsoid migration: Cleanup as Resolved.Oct 4 2022, 1:53 PM

	F35519031: Screenshot 2022-09-13 at 12.55.40.png
	Sep 13 2022, 11:03 AM

	F35515194: Screenshot 2022-09-09 at 20.37.11.png
	Sep 9 2022, 6:39 PM

	F35515191: Screenshot 2022-09-09 at 20.36.24.png
	Sep 9 2022, 6:39 PM

Parsoid migration to php 7.4Closed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Parsoid migration to php 7.4
Closed, ResolvedPublic
Actions

Related Objects
Search...