avoid fetching SiteList object from memcached
Closed, Resolved, Public

Assigned To

Authored By

	ori
	Nov 5 2013, 6:22 AM

Description

Since I4e71671c8, WikibaseClient's OutputPageParserOutput & ParserAfterParse hook handlers call WikibaseClient::getDefaultInstance()->getLangLinkSiteGroup(). Since $wgWBClientSettings['languageLinkSiteGroup'] is unset, it defaults to WikibaseClient::getSite, which calls SiteSQLStore::getSites, which requires retrieving a huge object from memcached, with a predictable impact on the cluster: see http://noc.wikimedia.org/~ori/SiteSQLStore.html -- you can guess when I4e71671c8 was deployed.

Details

Reference: bz56602

	Subject	Repo	Branch	Lines +/-
	Use CACHE_ACCEL for SiteLists if on HHVM	mediawiki/core	master	+1 -1
	Use CACHE_ACCEL for SiteLists if on HHVM	mediawiki/core	wmf/1.26wmf14	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Addshore	T76705 SiteStore / SiteList performance and caching (tracking)
		Resolved		aude	T58602 avoid fetching SiteList object from memcached

Event Timeline

• bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:39 AM

• bzimport added projects: MediaWiki-extensions-WikibaseClient, Performance Issue.

• bzimport set Reference to bz56602.

• bzimport added a subscriber: Unknown Object (MLST).

ori created this task. Nov 5 2013, 6:22 AM

Change 93648 had a related patch set uploaded by Ori.livneh:
Set enwiki's languageLinkSiteGroup to 'wikipedia'

https://gerrit.wikimedia.org/r/93648

Change 93648 merged by Ori.livneh:
Set enwiki's languageLinkSiteGroup to 'wikipedia'

https://gerrit.wikimedia.org/r/93648

Didn't improve things a whole lot, since the call to getSite in WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is executed much more frequently.

(In reply to comment #3)

Didn't improve things a whole lot, since the call to getSite in
WikibaseClient.hooks.php's onSkinTemplateOutputPageBeforeExec hook handler is
executed much more frequently.

That's Ie17f2af09, to be specific.

Change 93661 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93661

Change 93767 had a related patch set uploaded by Aude:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93767

Change 93767 merged by jenkins-bot:
Re-introduce siteGroup setting for performance reasons

https://gerrit.wikimedia.org/r/93767

Change 93769 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93769

Change 93772 had a related patch set uploaded by Aude:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93772

Change 93769 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93769

Change 93772 merged by jenkins-bot:
Update Wikibase, use siteGroup setting instead of doing lookup

https://gerrit.wikimedia.org/r/93772

Change 93773 had a related patch set uploaded by Aude:
Add siteGroup setting for Wikibase

https://gerrit.wikimedia.org/r/93773

Change 93773 merged by Ori.livneh:
Add siteGroup setting for Wikibase

https://gerrit.wikimedia.org/r/93773

This is once again an issue; it is loading on every request.

Impact: http://i.imgur.com/v9ebld6.png

This wasn't fixed, so I'm not sure why the bug was closed.

This causes:
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1005.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1011.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad
https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=mc1014.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=network_report&c=Memcached+eqiad

Top keys are:
enwiki:sites/SiteList#2014-03-17+Site:2013-01-23 (54MB/s)
wikidatawiki:sites/SiteList#2014-03-17+Site:2013-01-23 (20MB/s)
commonswiki:sites/SiteList#2014-03-17+Site:2013-01-23 (18MB/s)

This was pointed out as early as September 2013, again in March 2014, and again in September 2014, and it is still happening. Having e.g. 80% of mc1005's total network bandwidth being a single wikidata key, or SiteList keys being consistently at the top of memcached bandwidth output by multiple factors compared to the rest, is frankly indicative of a serious design failure and is unacceptable. I don't understand why this bug was closed either.

Can we just have a current version hash stored in memcached and used to validate server-local CDB caches (made on the fly, with a special key holding the hash of the other key/values)? This would reduce the memcached I/O to a minuscule amount.
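The hash-validation idea in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `RemoteStore` is a hypothetical in-memory stand-in for memcached, a plain Python object stands in for the server-local CDB file, and the `:hash` key suffix is invented for the example. None of these names come from MediaWiki or Wikibase.

```python
import hashlib
import json


class RemoteStore:
    """Hypothetical stand-in for memcached that counts the bytes it sends out."""

    def __init__(self):
        self.data = {}
        self.bytes_out = 0

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        value = self.data.get(key)
        if value is not None:
            self.bytes_out += len(value)
        return value


def publish(remote, key, value):
    """Store the full blob plus a small companion key holding its hash."""
    blob = json.dumps(value).encode()
    remote.set(key, blob)
    remote.set(key + ":hash", hashlib.sha1(blob).hexdigest().encode())


class HashValidatedCache:
    """Server-local copy of a large value, validated against a small remote hash."""

    def __init__(self, remote, key):
        self.remote = remote
        self.key = key
        self.local_hash = None
        self.local_value = None

    def get(self):
        remote_hash = self.remote.get(self.key + ":hash")
        if remote_hash is not None and remote_hash == self.local_hash:
            # Local copy is current; only the hash crossed the network.
            return self.local_value
        # Local copy is missing or stale: fetch the full blob once.
        blob = self.remote.get(self.key)
        if blob is None:
            return None
        # Hash the blob actually received, so the pair stays consistent.
        self.local_hash = hashlib.sha1(blob).hexdigest().encode()
        self.local_value = json.loads(blob)
        return self.local_value
```

With this shape, the first `get()` on a server pays for the full blob, and every later `get()` transfers only the 40-byte hex digest until the published value changes.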

Change 174113 had a related patch set uploaded by Aude:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers

https://gerrit.wikimedia.org/r/174113

my patch (https://gerrit.wikimedia.org/r/#/c/174113/) ensures the memcached lookup of SiteList is confined to users with the other projects beta feature enabled. This should help quite a lot to reduce memcached access for the SiteList.
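As a rough illustration of the lazy-initialization pattern described here (not the actual Wikibase code), the expensive site lookup can be deferred until a request really needs the sidebar generator. All class and parameter names below are hypothetical.

```python
class SidebarGenerator:
    """Hypothetical generator that needs the full site list to build links."""

    def __init__(self, sites):
        self.sites = sites

    def links(self):
        return [site["id"] for site in self.sites]


class HookHandlers:
    """Builds the generator only on first use instead of at construction time."""

    def __init__(self, load_sites):
        # load_sites is a callable standing in for the costly cache lookup.
        self._load_sites = load_sites
        self._generator = None

    def _get_generator(self):
        if self._generator is None:
            self._generator = SidebarGenerator(self._load_sites())
        return self._generator

    def on_sidebar(self, user_has_beta_feature):
        # Requests without the feature never trigger the site list lookup.
        if not user_has_beta_feature:
            return []
        return self._get_generator().links()
```

Requests from users without the feature return early, so the loader is never called for them; requests with the feature trigger it once per handler instance.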

The SiteList serves a similar purpose to the interwiki data: it is used to add links to related sister projects in the sidebar.

To roll out the feature more widely, we should have local caching (JSON, like i18n?) of the site list data, and may want to have memcached store the hash (as is done for i18n), per Aaron's suggestion.

faidon merged a task: T1339: Set languageLinkSiteGroup in $wgWBClientSettings to avoid fetching SiteList object from memcached. Nov 24 2014, 2:14 PM

faidon added a subscriber: • csteipp.

faidon triaged this task as High priority. Nov 24 2014, 2:16 PM

faidon updated the task description.

faidon added projects: Wikidata, MediaWiki-Core-Team, Scrum-of-Scrums.

faidon set Security to None.

Change 174113 merged by jenkins-bot:
Lazy initialize OtherProjectsSidebarGenerator in hook handlers

https://gerrit.wikimedia.org/r/174113

hoo closed this task as Resolved. Nov 25 2014, 1:18 PM

hoo claimed this task.

See T47532, which addresses this issue more generally: it avoids memcached entirely for the SiteList and uses a file-based cache instead.

The specific issue of languageLinkSiteGroup was resolved some time ago, and https://gerrit.wikimedia.org/r/174113 (now merged) addressed a related but different issue.

• bd808 moved this task from Scheduled to Done on the Scrum-of-Scrums board. Nov 25 2014, 5:49 PM

bytes out over all memcache servers: https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&c=Memcached+eqiad&h=&tab=m&vn=&hide-hf=false&m=bytes_out&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name

• bd808 moved this task from Backlog to Done on the MediaWiki-Core-Team board. Nov 25 2014, 11:27 PM

Change 174874 merged by jenkins-bot:
Implement SiteListFileCache and rebuild script

https://gerrit.wikimedia.org/r/174874

• bd808 moved this task from Done to Archive on the MediaWiki-Core-Team board. Dec 1 2014, 10:13 PM

r174113, part of wmf10, was deployed across all Wikipedias today and had no effect whatsoever.

JanZerebecki renamed this task from "Set languageLinkSiteGroup in $wgWBClientSettings to avoid fetching SiteList object from memcached" to "avoid fetching SiteList object from memcached". Dec 3 2014, 9:39 PM

JanZerebecki reopened this task as Open.

JanZerebecki reassigned this task from hoo to • Wikidata-bugs.

JanZerebecki removed a project: Patch-For-Review.

In T58602#808530, @faidon wrote:

r174113, part of wmf10, was deployed across all Wikipedias today and had no effect whatsoever.

I poked at this some more and have been able to actually reduce the traffic. See https://lists.wikimedia.org/pipermail/wikidata-tech/2014-December/000682.html

aaron closed this task as Resolved. Dec 8 2014, 10:16 PM

RandomDSdevel awarded a token. Dec 12 2014, 12:35 AM

{$wgDBname}:SiteList:sites/SiteList#2014-03-17+Site:2013-01-23 items are at or near the top of memcached keys sorted by bandwidth utilization on the production memcached cluster. This really needs to be fixed and stay fixed.

JanZerebecki removed • Wikidata-bugs as the assignee of this task. Jun 2 2015, 7:58 PM

JanZerebecki moved this task from incoming to needs discussion or investigation on the Wikidata board.

Fetching SiteList from memcached does not seem to happen on the page view code path. It does happen once on the edit code path. So this is not a regression to the old behavior; it's just that Wikidata is now used so much that even doing this on edit is an issue.

So it seems that we have to tackle T76706: Design caching infrastructure for SiteStore, probably going for T47532: Add file-based cached implementation of SiteStore.

This is going to take a while. I can't think of a good quick fix for this.

We could cache a separate SiteList for each wiki, with the members of the wiki's family plus sister wikis (that is, all the Site entries relevant for sidebars on that wiki). That would give us one cache entry per wiki, and just as many requests to memcached, but the SiteLists returned would be smaller (say, 300 entries instead of 800). Not sure if that's worth the trouble.
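To make the per-wiki idea concrete, here is a minimal sketch of selecting the subset that would be cached for one wiki. It assumes, purely for illustration, that "family" means sites in the same group and that "sister wikis" are sites in other groups sharing the wiki's language; the `Site` fields and the function name are hypothetical and do not reflect the real Site class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    global_id: str
    group: str      # e.g. "wikipedia", "wiktionary"
    language: str   # e.g. "en", "de"


def sites_for_wiki(all_sites, wiki_id):
    """Return only the sites relevant for sidebars on the given wiki."""
    by_id = {site.global_id: site for site in all_sites}
    wiki = by_id[wiki_id]
    return [
        site
        for site in all_sites
        # Same family (group), or a sister project in the same language.
        if site.group == wiki.group or site.language == wiki.language
    ]
```

Each wiki would then cache the result of `sites_for_wiki` under its own key, so the number of cache requests stays the same but each response carries fewer entries.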

Yes, the graph "Memcached eqiad aggregated bytes_out" didn't visibly increase over what it was after T58602#809009. If we go for T47532, which would remove the use of memcached for this, then splitting the memcached use by wiki as an in-between step is not necessary.

Every so often I crunch some statistics about memcached usage. I can't tell you how demoralizing it is to find sites/SiteList#2014-03-17+Site:2013-01-23 near the top again and again. If it's still there the next time I check I'm going to start disabling extensions. Changing priority to UBN! for that reason.

Please fix this by making the list a static array in a PHP file in operations/mediawiki-config and then include it in CommonSettings.php. This way HHVM will compile it to byte-code and the OS will keep it in memory.

Restricted Application added a subscriber: Aklapper. Jul 19 2015, 4:25 PM

@aude is going to poke at this when she's back from Wikimania.

In T58602#1463722, @hoo wrote:

@aude is going to poke at this when she's back from Wikimania.

Cool, thanks.

Daniel suggests that CACHE_ACCEL (APC) could be used here. I don't know how big the blob is, but it's probably not too big for that.

Change 225719 had a related patch set uploaded (by Hoo man):
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225719

gerritbot added a project: Patch-For-Review. Jul 19 2015, 5:31 PM

Change 225726 had a related patch set uploaded (by Ori.livneh):
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225726

Change 225726 merged by Ori.livneh:
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225726

ori mentioned this in rMW015758761ddd: Use CACHE_ACCEL for SiteLists if on HHVM. Jul 19 2015, 6:18 PM

Change 225719 merged by jenkins-bot:
Use CACHE_ACCEL for SiteLists if on HHVM

https://gerrit.wikimedia.org/r/225719

ori mentioned this in rMW2d20b88c7d79: Use CACHE_ACCEL for SiteLists if on HHVM. Jul 19 2015, 6:27 PM

Solved by moving the sites cache into CACHE_ACCEL. As you can see, the traffic of some memcached servers dropped considerably at around 18:18 (UTC).

hoo removed projects: Patch-For-Review, MediaWiki-extensions-WikibaseClient. Jul 19 2015, 6:42 PM

Thanks for the quick response.

• Forrestbot added projects: WMF-deploy-2015-07-21_(1.26wmf15), WMF-deploy-2015-07-14_(1.26wmf14), MW-1.26-release. Jul 19 2015, 7:00 PM

Krinkle mentioned this in T74024: Audit Memcache load (Spring 2017). Mar 28 2017, 1:43 AM

	F199192: Memcached sites.png
	Jul 19 2015, 6:41 PM
