Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters
Open, Needs Triage, Public, 1 Estimated Story Points

Description

New description:

@xcollazo do we need this anymore now that we've enabled canary events for all MW state event streams? You should be able to depend on both datacenter partitions being marked as ready, even if there are no real events in one of the DCs.

Nice!

In that case, what we want is to rewrite all instances in Airflow where we do pre_partitions=['datacenter=eqiad'] to read like pre_partitions=[["datacenter=eqiad", "datacenter=codfw"]].

So I will update the description above, and we can reuse the same ticket for context.
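To make the intended rewrite concrete, here is a minimal sketch of the transformation on the pre_partitions value itself. The helper name widen_datacenter_pre_partitions and the DATACENTERS tuple are hypothetical and only for illustration; how the Airflow sensors interpret the nested list is not shown here and should be checked against the airflow-dags code.

```python
from typing import List, Sequence, Union

# Assumption: the two datacenters named in this task.
DATACENTERS = ("eqiad", "codfw")

PrePartitions = List[Union[str, List[str]]]


def widen_datacenter_pre_partitions(
    pre_partitions: PrePartitions,
    datacenters: Sequence[str] = DATACENTERS,
) -> PrePartitions:
    """Hypothetical helper: replace each plain 'datacenter=<dc>' entry with
    a nested list that names every datacenter. Other entries are kept as-is."""
    widened: PrePartitions = []
    for entry in pre_partitions:
        if isinstance(entry, str) and entry.startswith("datacenter="):
            widened.append([f"datacenter={dc}" for dc in datacenters])
        else:
            widened.append(entry)
    return widened


before = ["datacenter=eqiad"]
after = widen_datacenter_pre_partitions(before)
print(after)
```

Entries that are already nested pass through unchanged, so running the helper twice is harmless.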


Old description:
For some Data Engineering workflows, we depend on knowing which datacenter is active and producing event data so that our pipelines can ingest it. Right now, we modify the pipelines manually, and we invariably forget till the SLA alarms remind us.

It would be nice to have an API to know what datacenter is active. Nothing fancy, just an HTTP GET that would tell me whether it's eqiad or the like.

While reviewing the new Datacenter Switchover Policy, I suggested such an API and @akosiaris quickly pointed me to an existing endpoint at https://config-master.wikimedia.org/mediawiki.yaml that spits out:

# the master datacenter for mediawiki
primary_dc: eqiad
# read-only settings
read_only:
  codfw: false
  eqiad: false

What we want is the primary_dc.

In this task we should:

  • Investigate the stability of this API.
  • If deemed stable, then modify our codebase so that we don't have to manually do these changes.
  • Make sure we can override if necessary.
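The steps above could look roughly like the sketch below: read primary_dc from the config-master endpoint, with a manual override taking precedence. The function names, the ACTIVE_DATACENTER_OVERRIDE environment variable, and the regex-based parsing (used instead of a YAML library to stay within the standard library) are all assumptions for illustration, not existing code.

```python
import os
import re
import urllib.request
from typing import Callable, Optional

CONFIG_URL = "https://config-master.wikimedia.org/mediawiki.yaml"
# Hypothetical environment variable for the manual override.
OVERRIDE_ENV_VAR = "ACTIVE_DATACENTER_OVERRIDE"

# Matches a top-level line such as "primary_dc: eqiad".
_PRIMARY_DC_RE = re.compile(r"^primary_dc:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def parse_primary_dc(text: str) -> str:
    """Extract the primary_dc value from the endpoint's YAML text."""
    match = _PRIMARY_DC_RE.search(text)
    if match is None:
        raise ValueError("primary_dc key not found in config text")
    return match.group(1)


def fetch_config(url: str = CONFIG_URL, timeout: float = 10.0) -> str:
    """Plain HTTP GET of the config file, decoded as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def active_datacenter(
    override: Optional[str] = None,
    fetch: Callable[[], str] = fetch_config,
) -> str:
    """Return the override if given (argument first, then environment),
    otherwise the primary_dc reported by the endpoint."""
    chosen = override or os.environ.get(OVERRIDE_ENV_VAR)
    if chosen:
        return chosen
    return parse_primary_dc(fetch())
```

Passing fetch as a parameter keeps the parsing testable without network access, and the override path never touches the endpoint at all.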

Event Timeline

From @akosiaris via Datacenter Switchover Policy Comment:

Can you point me to the repo that holds that file? Do other folks depend on it (i.e. is it stable)?

Sure. That file gets generated via the following template:

https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+log/refs/heads/production/modules/profile/templates/conftool/state-mediawiki.tmpl.erb

It fetches 1 key from our etcd datastore and interpolates it into that template, creating that file (confd is the software that does that).

It hasn't changed 1 bit in 4 years and 9 months, so you can safely assume it is as stable as it can ever be. Multiple other components (e.g. pontoon, puppet compiler, mw-cli-wrapper.py) rely on it. Data Persistence intends to use it as well soon. So if we ever end up wanting to change it (can't see why, but I am not good at predicting the future ;-)), we will be informing people and offering ways to migrate to a replacement.

Adding @Ladsgroup for his information. Data Persistence is already interested in a machine readable way to figure out the active datacenter.

I can update my tools if/when the yaml output in config-master is stable and final.

It is stable. Nothing is ever final though, not even the heat death of black holes 😛

That being said, if we are to break that "API", we will be informing people beforehand and offering alternatives.

Well played Alex. Well played. I'll change them tomorrow. Thanks!

For the sake of completeness, this has been done for a while now: https://gitlab.wikimedia.org/toolforge-repos/switchmaster/-/commit/2d3935080699871c9f59c683f0a14c30789adf7b

@xcollazo do we need this anymore now that we've enabled canary events for all MW state event streams? You should be able to depend on both datacenter partitions being marked as ready, even if there are no real events in one of the DCs.

Nice!

In that case, what we want is to rewrite all instances in Airflow where we do pre_partitions=['datacenter=eqiad'] to read like pre_partitions=[["datacenter=eqiad", "datacenter=codfw"]].

So I will update the description above, and we can reuse the same ticket for context.

xcollazo renamed this task from "Use config-master.wikimedia.org/mediawiki.yaml to automatically switch code that depends on active datacenter" to "Rewrite all Airflow sensors that use datacenter prepartitions to depend on both datacenters". Dec 21 2023, 5:22 PM
xcollazo updated the task description.

Reopening this one.

We had decided to depend on all datacenter pre-partitions so that we do not need to manually intervene every time a datacenter switchover happens. However, we have had multiple instances in which canary events fail to be generated and this has made the sensors for multiple pipelines brittle. Thus, reverting these changes until we have a more robust canary system.

xcollazo merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/614

Revert to consuming a single datacenter until the canary system is more robust.
