Since I4e71671c8, WikibaseClient's OutputPageParserOutput & ParserAfterParse hook handlers call WikibaseClient::getDefaultInstance()->getLangLinkSiteGroup(). Since $wgWBClientSettings['languageLinkSiteGroup'] is unset, it defaults to WikibaseClient::getSite, which calls SiteSQLStore::getSites, which requires retrieving a huge object from memcached, with a predictable impact on the cluster: see http://noc.wikimedia.org/~ori/SiteSQLStore.html -- you can guess when I4e71671c8 was deployed.
Details
- Reference: bz56602
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Addshore | T76705 SiteStore / SiteList performance and caching (tracking) |
| Resolved | | aude | T58602 avoid fetching SiteList object from memcached |
Event Timeline
Change 93648 had a related patch set uploaded by Ori.livneh:
Set enwiki's languageLinkSiteGroup to 'wikipedia'
Didn't improve things a whole lot, since the call to getSite in WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is executed much more frequently.
(In reply to comment #3)
> Didn't improve things a whole lot, since the call to getSite in WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is executed much more frequently.
That's Ie17f2af09, to be specific.
Change 93661 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons
Change 93767 had a related patch set uploaded by Aude:
Re-introduce siteGroup setting for performance reasons
Change 93767 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons
Change 93769 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup
Change 93772 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup
Change 93769 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup
Change 93772 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup
Change 93773 had a related patch set uploaded by Aude:
Add siteGroup setting for Wikibase
This is once again an issue; it is loading on every request.
Impact: http://i.imgur.com/v9ebld6.png
This causes:
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1005.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1011.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1014.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
Top keys are:
enwiki:sites/SiteList#2014-03-17+Site:2013-01-23 (54MB/s)
wikidatawiki:sites/SiteList#2014-03-17+Site:2013-01-23 (20MB/s)
commonswiki:sites/SiteList#2014-03-17+Site:2013-01-23 (18MB/s)
This was pointed out as early as September 2013, again in March 2014, and again in September 2014, and it is still happening. Having e.g. 80% of mc1005's total network bandwidth consumed by a single wikidata key, or SiteList keys consistently topping memcached bandwidth output by multiple factors compared to the rest, is frankly indicative of a serious design failure and unacceptable. I don't understand why this bug was closed either.
Can we just have a current version hash stored in memcached and use it to validate server-local CDB caches (made on the fly, with a special key holding the hash of the other key/values)? This would reduce the memcached I/O to a minuscule amount.
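To make the suggestion concrete, here is a minimal sketch of a hash-validated local cache. It is an illustration only: `FakeSharedCache` stands in for memcached, an in-process attribute stands in for the server-local CDB file, and the key names `sites:hash` and `sites:blob` are hypothetical, not the keys Wikibase uses.

```python
import hashlib
import json


class FakeSharedCache:
    """Stand-in for memcached (assumption) that counts the bytes it serves."""

    def __init__(self):
        self.data = {}
        self.bytes_out = 0

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        value = self.data.get(key)
        if value is not None:
            self.bytes_out += len(value)
        return value


def publish_site_list(shared, sites):
    """Store the serialized list and its hash under two hypothetical keys."""
    blob = json.dumps(sites, sort_keys=True).encode()
    shared.set("sites:blob", blob)
    shared.set("sites:hash", hashlib.sha1(blob).hexdigest().encode())


class HashValidatedLocalCache:
    """Keeps a local copy and re-fetches the blob only when the hash changes."""

    def __init__(self, shared):
        self.shared = shared
        self.local_hash = None
        self.local_blob = None

    def get_site_list(self):
        current_hash = self.shared.get("sites:hash")
        if current_hash is not None and current_hash == self.local_hash:
            # Cheap path: only the 40-byte hash crossed the network.
            return json.loads(self.local_blob)
        blob = self.shared.get("sites:blob")
        if blob is None:
            raise KeyError("site list has not been published")
        self.local_blob = blob
        self.local_hash = hashlib.sha1(blob).hexdigest().encode()
        return json.loads(blob)
```

With this shape, a steady-state request costs one small hash read instead of the full SiteList blob; the blob is transferred again only after it has been republished.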
Change 174113 had a related patch set uploaded by Aude:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers
My patch (https://gerrit.wikimedia.org/r/#/c/174113/) ensures the memcached lookup of SiteList is confined to users with the other projects beta feature enabled. This should help quite a lot to reduce memcached access for the SiteList.
The SiteList is used for functionality similar to the interwiki data: adding links to related sister projects in the sidebar.
To roll out the feature more widely, we should have local caching (JSON, like i18n?) of the site list data, and may want to have memcached store the hash (as is done for i18n), per Aaron's suggestion.
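The lazy-initialization idea behind that patch can be sketched roughly as follows. This is not the actual Wikibase code: `SiteListLoader`, `SidebarHookHandler`, and `on_page_render` are hypothetical names, and the loader simply counts calls in place of a real memcached lookup.

```python
class SiteListLoader:
    """Hypothetical stand-in for the expensive SiteList lookup."""

    def __init__(self):
        self.loads = 0

    def load(self):
        self.loads += 1
        return ["enwiki", "commonswiki", "wikidatawiki"]


class SidebarHookHandler:
    """Builds the sidebar generator only when a request actually needs it."""

    def __init__(self, loader):
        self._loader = loader
        self._generator = None

    def _get_generator(self):
        if self._generator is None:
            # The expensive lookup happens here, at most once per handler.
            sites = self._loader.load()
            self._generator = lambda page: [f"{s}:{page}" for s in sites]
        return self._generator

    def on_page_render(self, page, beta_feature_enabled):
        if not beta_feature_enabled:
            # Users without the feature never trigger the lookup.
            return []
        return self._get_generator()(page)
```

The point of the pattern is that constructing the handler is free; the cost is paid only on the code path that uses the result.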
Change 174113 merged by jenkins-bot:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers
See T47532, which addresses this issue more generally: avoid memcached entirely for the SiteList and have a file-based cache for it.
The specific issue of languageLinkSiteGroup was resolved some time ago, and https://gerrit.wikimedia.org/r/174113 (merged now) addressed a related but different issue.
r174113, part of wmf10, was deployed across all Wikipedias today and had no effect whatsoever.
I poked at this some more and have been able to actually reduce the traffic. See https://lists.wikimedia.org/pipermail/wikidata-tech/2014-December/000682.html
{$wgDBname}:SiteList:sites/SiteList#2014-03-17+Site:2013-01-23 items are at or near the top of memcached keys sorted by bandwidth utilization on the production memcached cluster. This really needs to be fixed and stay fixed.
Fetching SiteList from memcached does not seem to happen on the page view code path. It does happen once on the edit code path. So this is not a regression to the old behavior, it's just that Wikidata is now used so much, even doing this on edit is an issue.
So it seems that we have to tackle T76706: Design caching infrastructure for SiteStore, probably going for T47532: Add file-based cached implementation of SiteStore.
This is going to take a while. I can't think of a good quick fix for this.
We could cache a separate SiteList for each wiki, with the members of the wiki's family plus sister wikis (that is, all the Site entries relevant for sidebars on that wiki). That would give us one cache entry per wiki, and just as many requests to memcached, but the SiteLists returned would be smaller (say, 300 entries instead of 800). Not sure if that's worth the trouble.
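A rough sketch of what such a per-wiki subset could look like, under one reading of the proposal: "family" is taken to mean sites in the same group, and "sister wikis" to mean sites in the same language. The `Site` fields and `relevant_sites` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    global_id: str
    group: str
    language: str


def relevant_sites(all_sites, wiki_id):
    """Return the sites in the wiki's own group plus same-language sisters."""
    by_id = {site.global_id: site for site in all_sites}
    local = by_id[wiki_id]
    return [
        site
        for site in all_sites
        if site.group == local.group or site.language == local.language
    ]
```

Each wiki would then cache only its own result of `relevant_sites`, which keeps the number of memcached requests the same but shrinks each response.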
Yes, the graph "Memcached eqiad aggregated bytes_out" didn't visibly increase over what it was after T58602#809009. If we go for T47532, which would remove the use of memcached for this, then splitting the memcached use by wiki as an in-between step is not necessary.
Every so often I crunch some statistics about memcached usage. I can't tell you how demoralizing it is to find sites/SiteList#2014-03-17+Site:2013-01-23 near the top again and again. If it's still there the next time I check I'm going to start disabling extensions. Changing priority to UBN! for that reason.
Please fix this by making the list a static array in a PHP file in operations/mediawiki-config and then include it in CommonSettings.php. This way HHVM will compile it to byte-code and the OS will keep it in memory.
Daniel suggests that CACHE_ACCEL (APC) could be used here. I don't know how big the blob is, but probably not too big for that.
Change 225719 had a related patch set uploaded (by Hoo man):
Use CACHE_ACCEL for SiteLists if on HHVM
Change 225726 had a related patch set uploaded (by Ori.livneh):
Use CACHE_ACCEL for SiteLists if on HHVM
Solved by moving the sites cache into CACHE_ACCEL. As you can see, the traffic of some memcached servers dropped considerably at around 18:18 (UTC).