[Epic] Store media information for files on Wikimedia Commons as structured data
Closed, ResolvedPublic

Assigned To
None
Authored By
Jdforrester-WMF
Jun 4 2014, 1:33 AM

Description

Adding structured data based on Wikibase for all media files on Wikimedia Commons is something the Multimedia team are planning to work on with the Wikidata team.

Details

Reference
bz66108

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.
Status     Subtype  Assigned          Task
Resolved            None
Open       Feature  None
Resolved            None
Resolved            None
Duplicate           None
Duplicate           None
Invalid             None
Invalid             None
Duplicate           None
Resolved            Lydia_Pintscher
Invalid             None
Invalid             None
Invalid             None
Invalid             None
Invalid             None
Open                None
Resolved            None
Open                None
Resolved            Lydia_Pintscher
Resolved            daniel
Resolved            Tobi_WMDE_SW
Duplicate           None
Open                None
Open                None
Resolved            Ramsey-WMF
Resolved            Jdforrester-WMF
Resolved            ArielGlenn

Event Timeline

T159884 may be a use case for this at some point (mentioning for bookkeeping).

Mentioned in SAL (#wikimedia-operations) [2019-01-09T21:04:30Z] <James_F> Creating Wikibase repo tables on Commons for T68108

I've been told wikibase tables have been created on s4. We would have liked to be notified of this. We are not sure wikibase for commons should live on s4: if the growth of structured data is as large as it was for wikidata, we should create a separate cluster dedicated to it (s4, like s1 and s8, is quite bloated). Changing it before or at the beginning is easy; doing it later is more complicated. Please talk to the DBAs to understand hw needs. I prefer to request more hw than we need and later not buy it, rather than needing it and not having the budget available. CC @Marostegui.

I had asked for people to talk to us months before that deployment happened.

@WMDE-leszek - I spoke with @jcrespo and some others about potential issues with the SDC deployments, and we basically decided that it would be best to move the Wikibase tables to a separate cluster out of an abundance of caution. They wanted a WMDE perspective as to whether making that move would be possible and what effects might be seen based on the various refactoring going on, in particular with the wb_terms table. Given you've been our point of contact on Wikibase work recently, I hoped you could chime in or bother the correct person to give feedback on this matter.

It should be noted here, also, that the Wikibase tables on the Commons database are currently all-but-empty, so it's not a huge threat to the cluster. However, we're investigating the impact on the revisions table and will be exploring what is needed to avoid further issues.

@MarkTraceur, @jcrespo et al: We've discussed this briefly at WMDE, and we believe the suggested idea should not be problematic with regard to the wb_terms table.
Neither the existing wb_terms table nor its refactored replacement is expected to be used in joins with "standard" MW tables, hence moving the table to a separate cluster shouldn't create an issue. The move would likely require some changes to the Wikibase code, so from the WMDE side it would be preferred that the separation happen after our current storage work is done (1-2 months from now), so we can limit the number of variables we deal with.

It should be noted that wb_terms (or what is about to replace it) is not the only DB table created and used by the Wikibase extension. These other tables (e.g. wbc_changes) couldn't, in our understanding, be easily moved to a separate cluster, and we'd recommend against such a move unless there are important reasons for it.

It is not clear from the comments above whether the separation you have been discussing only concerns the wb_terms table, or all Wikibase-specific tables.

The move would likely require some changes to the Wikibase code

Could you clarify why? All other hosts seem to be ok with the wikibase server for wikidata being on a separate database. Does MCR use wikibase differently, or is it something else? Maybe there would be 2 wikibase services to use? Note again that we gave notice of this need months in advance, during the planning phase, as new features usually require extra resources.

Wikibase-specific tables

There is wbc_changes, and maybe other wikibase client tables; we don't have a problem with those, as they exist locally on all (wikidata-enabled) wikis. The ones we are worried about are the wikibase server ones (aka the equivalent of s8 on s4). I am not sure we should wait for the refactoring, but as long as the tables on s4 are empty or almost empty (with, I think, a single row in wb_id_counters), we are flexible. What we don't want is lots of data there that is later more complex to migrate away (we are assuming there will be a large amount of updates there when SDC is at full steam).

A few hosts were budgeted for the split for FY2019-2020. We are not in a rush, but it would be nice to have some planning in place for next fiscal year so there are no unexpected delays, especially given that the persistence team is at 50% capacity ATM.

The move would likely require some changes to the Wikibase code

Could you clarify why? All other hosts seem to be ok with the wikibase server for wikidata being on a separate database. Does MCR use wikibase differently, or is it something else? Maybe there would be 2 wikibase services to use? Note again that we gave notice of this need months in advance, during the planning phase, as new features usually require extra resources.

What I had in mind is that wb_terms (and its successor, currently being introduced) is also used in queries that join with MediaWiki's "standard" tables, e.g. page or revision. Such code/queries will not continue to work if the wb_terms table of commons gets moved to a different server than the one the mw tables are on.

I was just corrected by a WMDE colleague: the above point about joins is not correct; there are no such joins.

I believe the only code change that Wikibase might be facing would then be the need to have one DB connection for the wb_terms table server and another DB connection for the regular mw tables server. This is not going to be rocket science, but also not something that would just work out of the box.
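
For illustration only, a minimal Python sketch of what "two DB connections" means in practice. Nothing here is Wikibase or MediaWiki code: the two in-memory SQLite databases merely stand in for two separate servers, and the `terms` table, its columns, the sample rows, and the `labels_for_pages` helper are all hypothetical simplifications. The point is that each server is queried over its own connection and the results are combined in application code, since a single SQL join cannot span both.

```python
import sqlite3

# Two independent in-memory databases stand in for two separate servers.
# Schemas and data are hypothetical, simplified placeholders.
core_db = sqlite3.connect(":memory:")   # "regular mw tables" server
terms_db = sqlite3.connect(":memory:")  # "wb_terms" server

core_db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
core_db.executemany("INSERT INTO page VALUES (?, ?)", [(1, "M1"), (2, "M2")])

terms_db.execute("CREATE TABLE terms (entity_id TEXT, language TEXT, text TEXT)")
terms_db.executemany(
    "INSERT INTO terms VALUES (?, ?, ?)",
    [("M1", "en", "A cat"), ("M2", "en", "A dog")],
)


def labels_for_pages(page_ids, language):
    """Return {page_id: label or None}, using one query per connection."""
    if not page_ids:
        return {}
    # Step 1: resolve page ids to titles on the core server.
    marks = ",".join("?" * len(page_ids))
    titles = dict(
        core_db.execute(
            f"SELECT page_id, page_title FROM page WHERE page_id IN ({marks})",
            list(page_ids),
        )
    )
    if not titles:
        return {}
    # Step 2: fetch labels for those titles on the terms server.
    marks = ",".join("?" * len(titles))
    by_entity = dict(
        terms_db.execute(
            f"SELECT entity_id, text FROM terms "
            f"WHERE language = ? AND entity_id IN ({marks})",
            [language, *titles.values()],
        )
    )
    # Step 3: combine in application code instead of in a SQL join.
    return {pid: by_entity.get(title) for pid, title in titles.items()}
```

Calling `labels_for_pages([1, 2], "en")` issues one query per connection and merges the rows in Python; a page with no term in the requested language maps to `None`.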

So, just to be clear: based on growth projections I recently got from the SDC team for wikibase on commons, the separation is not only convenient or needed for performance; s4 would literally not be able to fit anything beyond the initial deployment of data, or could do so only for a very short term. Disk usage is close to 2 TB right now, with an additional 2 TB of structured data (metadata only). IOPS would be close to those of wikidata, which requires a dedicated cluster. This is based on their statistics and projected growth, on top of the current growth and usage. We can scale those out with already budgeted hw, but we need software support. Extra content data would not be a concern, as we already planned ES expansion for next fiscal year.

I think we should do some planning of the different high-level tasks related to the database, to schedule them properly: both the ones that directly affect SDC (implementation, wikibase dedicated db, wb_terms migration, MCR) and the ones that affect it indirectly through an increase or release of used resources (actor, comment, *links migrations).

I'm going to boldly claim this is Resolved, though of course there's loads more to do. All credit to the current and former members of the Structured Data on Commons team, from the Foundation, WMDE, and the community.
