
Investigate async cache-eval load-time regression (2016-09-08)
Closed, Resolved · Public

Authored by Krinkle · Sep 23 2016, 11:12 PM

Description

  • Affected metrics
  • RUM Median story
  • RUM p95 story
  • WebPageTest story
  • ResourceLoader traffic affected

Affected metrics

NavigationTiming metrics around September 8th:

(Grafana links cover the range 2016-08-01 to 2016-09-18)

| Metric | Difference | Graph |
|---|---|---|
| responseStart | unaffected | - |
| domInteractive (from navigationStart until the document is received, parsed, and blocking resources are ready) | Median: -50ms (from 870ms to 820ms) | - |
| | p95: -6% (from 4.5s to 4.2s) | - |
| firstPaint | Median: unaffected (~1000ms) | Screen Shot 2016-09-21 at 20.23.05.png |
| | p95: +441% (from 4.9s to 21.6s) | Screen Shot 2016-09-21 at 20.23.00.png |
| domComplete | Median: -20% (from 1900ms to 1500ms) | Screen Shot 2016-09-21 at 20.28.39.png |
| | p95: -8% (from 9.4s to 8.7s) | - |
| loadEventEnd | Median: -25% (from 2000ms to 1600ms) | Screen Shot 2016-09-21 at 20.38.55.png |
| | p95: -10% (from 10s to 9s) | - |
| mediaWikiLoadComplete (time between startup.js execution and the NavTiming module being loaded) | Median: +400ms (from 900ms to 1300ms) | Screen Shot 2016-09-21 at 20.02.48.png |
| | p95: +100% (from 5s to 10s) | Screen Shot 2016-09-21 at 20.02.57.png |

RUM Median story

Before:

Navigation Timing: navigationStart (0s) → domInteractive (870ms) → firstPaint (~1000ms) → domComplete (1900ms) → loadEvent (2000ms)
MediaWiki: mediaWikiLoadStart → mediaWikiLoadEnd (+900ms)

After:

Navigation Timing: navigationStart (0s) → domInteractive (820ms) → firstPaint (~1000ms) → domComplete (1500ms) → loadEvent (1600ms)
MediaWiki: mediaWikiLoadStart → mediaWikiLoadEnd (+1300ms)

For the median use case, the events were mostly re-ordered, in a way that ultimately makes the page finish loading quicker. Assuming this is caused by "resourceloader: Make cache-eval asynchronous", it works by letting the browser prioritise getting the page ready, on-screen, and interactive, spreading out module execution instead.

RUM p95 story

Before:

Navigation Timing: navigationStart (0s) → domInteractive (4.5s) → firstPaint (4.9s) → domComplete (9.4s) → loadEvent (10s)
MediaWiki: mediaWikiLoadStart → mediaWikiLoadEnd (+5s)

After:

Navigation Timing: navigationStart (0s) → domInteractive (4.2s) → domComplete (8.7s) → loadEvent (9s) → firstPaint (21.6s)
MediaWiki: mediaWikiLoadStart → mediaWikiLoadEnd (+10s)

Here, too, the overall page load is faster, but JS execution is spread out even more. This could have several causes; the most likely explanation is a combination of the following three factors.

  • requestIdleCallback in Chrome causing delays of up to 30s (fixed in master).
  • setTimeout being deprioritised in background tabs (probably acceptable and desirable; not sure whether a significant number of page views initialise in a background tab and/or pre-render).
  • Cache-eval executes in chunks of 50ms (see the sketch after this list). On slower devices (e.g. mobile) this requires many chunks, compared to the previous approach of executing everything in one go. The downside of the old approach, however, was that it rendered the user's browser unresponsive for a long time, which is arguably a much worse user experience. To be confirmed.
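For illustration, here is a minimal sketch of the kind of 50ms time-slicing described above, assuming a plain queue of functions (the helper name is hypothetical; the actual implementation lives in mediawiki.js):

```js
// Hypothetical sketch, not the actual mediawiki.js code.
// Each task is a function that evaluates one cached module.
function runInChunks( tasks ) {
	function runChunk() {
		var start = Date.now();
		// Run tasks until this chunk's 50ms budget is spent.
		while ( tasks.length && Date.now() - start < 50 ) {
			tasks.shift()();
		}
		if ( tasks.length ) {
			// Yield so the browser can paint and handle input before
			// the next chunk runs. In hidden/background tabs this
			// timeout may be heavily throttled by the browser.
			setTimeout( runChunk, 0 );
		}
	}
	runChunk();
}
```

Because every yield goes through setTimeout, any timer throttling in hidden tabs stretches the total execution time, which would match the mwLoadEnd regression seen at p95.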

WebPageTest story

https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown
(user: Anonymous, project: enwiki, page: Facebook)

Desktop first view:

  • SpeedIndex: Unaffected.
  • Start render:
    • us-east1/Firefox: Consistently hitting the low-end more frequently (range unchanged).
    • us-east1/Chrome: Range is consistently 100ms lower.
    • us-east1/IE: Unchanged.
  • Fully loaded:
    • us-east1/Firefox: Unchanged.
    • us-east1/Chrome: Consistently 100ms lower.
    • us-east1/IE: Unchanged.
  • mwLoadStart/mwLoadEnd
    • Firefox: Unchanged.
    • Chrome: Consistently ends 100ms earlier.
    • IE: No clear change.

Desktop repeat view (Chrome only):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?var-view=repeatView

  • SpeedIndex: 1.2s -> 800ms.
  • Start render: 790ms -> 570ms.
  • Fully loaded: 2.6s -> 2.3s.

Real Mobile (Motorola G, 3G-Fast):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=22&fullscreen

  • SpeedIndex: About 200 lower on average
  • fullyLoaded: No clear change.
  • mwLoadStart/End: No clear change.
  • render: No clear change.

Second page view (Chrome, emulateMobile):
https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=26&fullscreen

  • SpeedIndex: -100 (2014 -> 1930)
  • render: -100 (1990 -> 1891)
  • fullyLoaded: +300ms (2.1s -> 2.4s)
  • mwLoadStart: +/-0 (210ms)
  • mwLoadEnd: +200ms (2.4s -> 2.6s)

ResourceLoader traffic affected

In addition, it appears ResourceLoader traffic was also affected in interesting ways.

https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1472080149720&to=1474672149720

Screen Shot 2016-09-23 at 23.59.12.png

Screen Shot 2016-09-24 at 00.00.35.png

Detail:

https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1473339406287&to=1473486088050

Screen Shot 2016-09-24 at 00.06.17.png

Screen Shot 2016-09-24 at 00.06.38.png

On September 8 between 18:45 and 19:15 there was a very sudden increase in total traffic to load.php. Most of these extra requests had an If-None-Match header, and were responded to with HTTP 200 and a short cache-expiry.
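For illustration, the extra requests had the shape of a conditional revalidation like the following (a sketch with a placeholder ETag, run from a page on the same wiki; this is not the method used to diagnose the incident):

```js
// Illustration only: a conditional revalidation of a load.php URL,
// as a returning client with a cached copy would send it.
fetch( '/w/load.php?modules=startup&only=scripts', {
	headers: { 'If-None-Match': 'W/"placeholder-etag"' }
} ).then( function ( res ) {
	// A matching ETag should normally yield 304 Not Modified.
	// During this incident the servers answered 200 with a short
	// Cache-Control max-age, forcing clients to re-download.
	console.log( res.status, res.headers.get( 'Cache-Control' ) );
} );
```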

Log entries confirm that the timing matches the deployment that included the async cache-eval change:

19:02 logmsgbot: demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18

Event Timeline

ori triaged this task as Medium priority. Sep 26 2016, 2:15 PM
ori moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.

Mentioned in SAL (#wikimedia-operations) [2016-11-01T14:50:57Z] <ori> Local-hacking some JavaScript changes on mw1099 to debug T146510

Just adding for the record: it affects multiple Chrome versions, and you can see it in Opera and MSIE/Edge as well.

Screen Shot 2016-11-02 at 8.36.54 AM.png

Screen Shot 2016-11-02 at 8.42.01 AM.png

Since it's the 95-99 percentile, I'm thinking nothing is cached in the browsers and perhaps the connections are slow? It might be reproducible using WPT on slow connections while capturing the devtools information (the instance we run on http://wpt.wmftest.org/ has higher connectivity).

Also from @ori:
19:23 so I compared firstPaint distribution for just HTTP2 and just HTTP1.1 clients
19:23 the regression happened for both
19:24 no big change in the geographical distribution of users from whom we collected metrics, either

No, WPT cannot help: https://www.webpagetest.org/result/161102_M6_aac25caa75b5e2487c3a8052c0bf967c/

I used the slowest connection with Chrome, but the first paint always happens just a second or two after the main HTML is downloaded.

Change 319409 had a related patch set uploaded (by Krinkle):
Add visibilitychange.js measurements

https://gerrit.wikimedia.org/r/319409

Change 319409 merged by jenkins-bot:
Track visibilitychange events to statsd

https://gerrit.wikimedia.org/r/319409

Change 320555 had a related patch set uploaded (by Krinkle):
Track visibilitychange events to statsd

https://gerrit.wikimedia.org/r/320555

Change 320641 had a related patch set uploaded (by Krinkle):
Track visibilitychange events to statsd

https://gerrit.wikimedia.org/r/320641

Change 320555 abandoned by Krinkle:
Track visibilitychange events to statsd

Reason:
Wrong branch.

https://gerrit.wikimedia.org/r/320555

Change 320641 abandoned by Krinkle:
Track visibilitychange events to statsd

https://gerrit.wikimedia.org/r/320641

Data from https://grafana-admin.wikimedia.org/dashboard/db/resourceloader-feature-test#visibilitychange for the past 5 days since it went out to all wikis last Thursday:

Screen Shot 2016-11-14 at 15.09.05.png

This shows Hidden at 8.41% (122.6K) of all sampled page views where the Page Visibility API was supported (1.458M total).

As pointed out by @Gilles in code review, the actual number of page views where the page was hidden for at least part of the page load process is most likely higher, since we are not tracking whether the page is hidden between navigationStart (of the HTML request) and responseEnd of the second JavaScript request (in which our tracking code runs). If the page was hidden before that and visible after that, it is recorded as a page view that was fully visible.
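As a rough sketch, the measurement boils down to something like this (simplified; the counter name is hypothetical, and the deployed code is in the Gerrit changes above):

```js
// Simplified sketch of the visibilitychange tracking (hypothetical
// counter name; see the Gerrit changes above for the real code).
( function () {
	if ( document.hidden === undefined ) {
		// Page Visibility API unsupported; such views are not counted.
		return;
	}
	// Blind spot described above: anything before this script executes
	// (e.g. during the HTML request) cannot be observed here.
	var wasHidden = document.hidden;
	document.addEventListener( 'visibilitychange', function () {
		wasHidden = wasHidden || document.hidden;
	} );
	window.addEventListener( 'load', function () {
		mw.track( 'counter.visibility.' + ( wasHidden ? 'hidden' : 'visible' ), 1 );
	} );
}() );
```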

Change 319902 had a related patch set uploaded (by Krinkle):
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/319902

The above confirms that there is indeed a correlation between ~10% hidden page loads and the regression in the upper 10% of firstPaint.

Next step:

If, after deploying https://gerrit.wikimedia.org/r/319902 , the firstPaint median and p95 go back (at least) toward where they were, then we can probably assume that hidden page loads were the cause of the regression. Most likely they will improve beyond their previous level, because we have always included hidden page loads and the effect only became more significant.

We can do a definitive confirmation by undoing and subsequently re-applying the cache-eval patch, but I'm not sure it's worth the effort.

Wow super interesting @Krinkle and great work! It will be exciting to see what happens to the number when https://gerrit.wikimedia.org/r/319902 goes live.

Change 319902 merged by jenkins-bot:
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/319902

Change 322095 had a related patch set uploaded (by Ori.livneh):
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/322095

Change 322096 had a related patch set uploaded (by Ori.livneh):
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/322096

Change 322095 merged by jenkins-bot:
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/322095

Change 322096 merged by jenkins-bot:
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/322096

> We can do a definitive confirmation by undoing and subsequently re-applying the cache-eval patch, but I'm not sure it's worth the effort.

IMO it would be worth the effort. You've put in a lot of effort. It'd be good to be fully confident in the answer.

Krinkle closed this task as Resolved (edited). Nov 18 2016, 2:58 AM

Change 319902 merged by jenkins-bot:
Don't report stats when page visibility changes during page load

https://gerrit.wikimedia.org/r/319902

This patch makes it so that we ignore Navigation Timing metrics if the browser window/tab was not visible during the page load process.

As we'd hoped, deployment of this patch brought the p95/p99 of firstPaint back to where they were before last month's regression. This apparent regression was caused by the use of setTimeout for part of the async JavaScript execution. In invisible windows (most likely background tabs), setTimeout can sometimes be artificially delayed by the browser for long periods of time (10-30 seconds even).
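In sketch form, the guard added by this patch works roughly as follows (simplified, with a hypothetical function name; the real logic lives in ext.navigationTiming.js):

```js
// Simplified sketch: drop the Navigation Timing sample if the page
// was hidden at any observable point during the load.
var wasEverHidden = document.hidden === true;

document.addEventListener( 'visibilitychange', function () {
	if ( document.hidden ) {
		wasEverHidden = true;
	}
} );

function maybeEmitNavTiming() {
	if ( wasEverHidden ) {
		// Background loads have throttled timers and delayed paints;
		// including them skews the upper percentiles.
		return;
	}
	// ... collect window.performance.timing and send the beacon ...
}
```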

Original regression:
https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=firstPaint&from=1471564800000&to=1479513600000

Screen Shot 2016-11-17 at 15.39.04.png

Roll-out of fix that ignores hidden page loads:

2016-11-17 15:40 ori@tin: Synchronized extensions/NavigationTiming/modules/ext.navigationTiming.js: I8e8ec96f: Don't report stats when page visibility changes during page load

Screen Shot 2016-11-17 at 15.40.35.png

The JavaScript code in question is not part of the critical path; as such, we expected it to affect only our custom metric that measures the time between the first and the last of the main JavaScript modules being initialised (mwLoadEnd). Instead, it also caused a notable regression in firstPaint for ~10% of page views. (See the opening post of this task for more details.)

Now that we're excluding metrics from background loads, we've also done a definitive confirmation for the root cause by temporarily reverting the patch that introduced the use of setTimeout.

Temporary undo of setTimeout use

2016-11-17 23:22 krinkle@tin: Synchronized resources/src/mediawiki/mediawiki.js: Ie21f5c: temp revert cache-eval for metric observation
2016-11-17 23:42 krinkle@tin: Synchronized resources/src/mediawiki/mediawiki.js: Ie21f5c: undo temp revert

Medians:

  • firstPaint unchanged (as before)
  • mwLoadEnd earlier (as before, since execution is not deferred)
  • loadEventEnd later (as before, since execution now happens during page load)

https://grafana.wikimedia.org/dashboard/db/performance-metrics?from=1479422700000&to=1479427200000

Screen Shot 2016-11-17 at 18.50.35.png

p75/p95/p99:

  • firstPaint: p99 previously regressed with async cache-eval and improved without it. This time, with hidden tabs ignored, it actually regressed when we temporarily reverted async cache-eval. This means async cache-eval is a universal improvement for all users.
  • mwLoadEnd earlier (as before)
  • loadEventEnd later (as before)

https://grafana.wikimedia.org/dashboard/db/navigation-timing?from=1479422700000&to=1479427200000&var-metric=firstPaint

Screen Shot 2016-11-17 at 18.55.36.png
Screen Shot 2016-11-17 at 18.55.52.png
Screen Shot 2016-11-17 at 18.56.08.png