Page MenuHomePhabricator

[Event Platform] Instrument EventBus with prometheus MW Statslib
Open, Needs TriagePublic13 Estimated Story Points

Description

There are discussions about occasions where EventBus fails producing events to EventGate.

etc.

We have some metrics from envoy mesh local proxies and from eventgate when things fail. (See Errors section of the EventGate grafana dashboard).

However, we do not have these metrics from EventBus itself. We do have failure logs in logstash.

Especially on 5xx errors, the client will know best when they happen.

Now that MediaWiki has prometheus support (T350592: EPIC: migrate in use metrics and dashboards to statslib), we should instrument EventBus and add metrics around event production and whatever else might be nice to have.

https://www.mediawiki.org/wiki/Manual:Stats has instructions for how to use the MW Stats library to do this.

Doing this will help us quantify when we fail to produce events, which will help us with defining SLOs and documentation for T120242: Eventually Consistent MediaWiki State Change Events.

Done is
  • EventBus emits metrics about event produce and failure counters, with informative labels. Labels should probably include
  • stream name
  • event service name (eventgate name)
  • HTTP status code (?)
  • timing metrics: send() function timing & http request timing - will do this in different task
  • maybe $schema if it isn't hard to get? (not really that useful)
  • etc.
  • Any other easy/useful/relavent EventBus metrics are emitted.
  • EventBus metrics are shown in a Grafana dashboard, either in the existent EventGate one, or a new one for EventBus.
  • Temporary feature flag logic removed from EventBus

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@gmodena I'm considering a change in EventBus for which I'd need to know stream name in the EventBus send() method. I think you said you'd need this too?

We have $events in send(). meta.stream is a pretty strongly enforced convention; I don't think EventBus has ever not supported it. What if we just extracted it from the $events inside of send()?

...Or we could consider changing the method signature to include $streamName ?

@gmodena I'm considering a change in EventBus for which I'd need to know stream name in the EventBus send() method. I think you said you'd need this too?

You beat me to this message :).

We have $events in send(). meta.stream is a pretty strongly enforced convention; I don't think EventBus has ever not supported it. What if we just extracted it from the $events inside of send()?

This was my initial plan of attack.
I've been looking into reporting metrics (number of records) per stream by parsing the messages payload and extracting stream name from the records.
This is fine for the happy path where we pass send() an array and all messages are delivered. We'd record something along the lines of

mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_page_change_v1"} 3
mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_revision_create"} 3
mediawiki_eventbus_outgoing_events_total{stream_name="resource_change"} 1

If all events in the send() call have the same stream name, we can break down response status codes by stream without having to re-parse the response body. E.g.

mediawiki_eventbus_responses_total{status_code="201", stream_name: "mediawiki_page_change_v1"} 3
mediawiki_eventbus_responses_total{status_code="207", stream_name: "malformed_payload"} 5 // malformed test input

...Or we could consider changing the method signature to include $streamName ?

Are we guaranteed that all events in the array passed to send() will have the same stream name?

If we can rely on this guarantee, than book-keeping stream name should be easy. Passing it as method argument would be nice to have, but possibly redundant.
I would not break APIs for a "nice to have".

However, I would like to keep the SerDe at a minimum in this code path. If we pass an array of events, we can access meta.stream with $event['meta']['stream']; (xprofile tells me it's cheap).
If we'd have to parse and convert a string of events (or parse the response body - see above) from json, that could be a different story.

I made some progress on this (see also comment above). Here's an idea of how I would like to name and label metrics. I would like to start small and interate in Beta. Some code paths are difficult to test out locally.

For these metrics, the component would be eventbus. We'd be adding to the default mediawiki prefix.

1. number of calls to send()

Metric name: function_calls
Labels function_name=send. Example:

mediawiki_eventbus_function_calls{function_name="send"} 7

Global counter of how often this code path is invoked.

Don't know yet if actually useful other than for debug purposes. TBH it's mostly a curiosity, because I don't have
a good feeling fro how much eventbus is used outside use cases I'm familiar with.
This also potentially duplicates info recorded in deferred_updates (see below).

2. number of outgoing records

These would be the number of records send() is posting to the intake gateway.

Metrics name: outgoing_events_total
Labels: stream_name
Example:

mediawiki_eventbus_outgoing_events_total{stream_name="mediawiki_page_change_v1"}
2. response status codes by stream

Metric name: responses_total
Labels: status_code, stream_name
Example:

mediawiki_eventbus_responses_total{status_code="201", stream_name: "mediawiki_page_change_v1"} 3

Other considerations

event service name (eventgate name)

We can prase this from $this->url. Would this label be useful? Does it add much to stream_name? Afaik there is a 1:1 mapping between stream names and gateways.

maybe $schema if it isn't hard to get?

We can extract this by parsing the event payload. I need to validate whether all events passed to a send() call (events array arguments) will have the same $schema.

Any other easy/useful/relavent EventBus metrics are emitted.

We get a bunch of defaults in the deferred_updates component:

mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_EventBusHooks_sendResourceChangedEvent"} 1
mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_EventBusHooks_sendRevisionCreateEvent"} 3
mediawiki_deferred_updates_total{http_method="post",type="MediaWiki_Deferred_MWCallableUpdate_MediaWiki_Extension_EventBus_HookHandlers_MediaWiki_PageChangeHooks_sendEvents"} 3

event service name (eventgate name)

We can prase this from $this->url

It won't hurt to add the event service name itself as a EventBus instance property. The url itself might be nice to have too. I've found that it's okay (and sometimes good) if labels are redundant, as it makes it easier to do templating in Grafana. E.g. sometimes we might want to filter by the event_service_name, sometimes we might want to display the endpoint URLs in the graph.

Afaik there is a 1:1 mapping between stream names and gateways.

Yes, but it can change, as it is in stream config. It usually does not, but sometimes it does. E.g. we may want to produce mediawiki.page_content_change.v1 to a multi-DC Kafka cluster one day.


We should brain bounce on some possible EventBus changes too. For T346046 I'm considering getting rid of the EventBus $url param, and instead just injecting a $streamConfigs directly. Then EventBus itself can figure out where to POST based on $streamConfigs, rather than doing based on the EventBus instance created by EventBusFactory. I'm not sure about this, and perhaps it will be too much, but I keep running into awkwardness because EventBus doesn't have an instance of StreamConfigs to use.

event service name (eventgate name)

We can prase this from $this->url

It won't hurt to add the event service name itself as a EventBus instance property.

Sure. We def don't want to parse the same thing for every message.

The url itself might be nice to have too.

This we have already as an instance property, but I don't think we need to label with the full endpoint.
If we ever version bump the API, we could capture that at a later stage.

I've found that it's okay (and sometimes good) if labels are redundant, as it makes it easier to do templating in Grafana. E.g. sometimes we might want to filter by the event_service_name, sometimes we might want to display the endpoint URLs in the graph.

Sounds good! Labels won't impact cardinality.

Afaik there is a 1:1 mapping between stream names and gateways.

Yes, but it can change, as it is in stream config. It usually does not, but sometimes it does. E.g. we may want to produce mediawiki.page_content_change.v1 to a multi-DC Kafka cluster one day.

We could always reverse look up ESC, but point taken.

We should brain bounce on some possible EventBus changes too. For T346046 I'm considering getting rid of the EventBus $url param, and instead just injecting a $streamConfigs directly. Then EventBus itself can figure out where to POST based on $streamConfigs, rather than doing based on the EventBus instance created by EventBusFactory.

I got confused by this. But I am not sure if EB is awkward, or ESC is :)

Change #1049831 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] eventbus: add instrumentation to send() method.

https://gerrit.wikimedia.org/r/1049831

Change #1051709 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] beta: eventbus: enable instrumentation

https://gerrit.wikimedia.org/r/1051709

Change #1049831 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] eventbus: add instrumentation to send() method.

https://gerrit.wikimedia.org/r/1049831

Change #1051709 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: eventbus: enable instrumentation

https://gerrit.wikimedia.org/r/1051709

Change #1053534 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] EventBus: label meterics with event type name.

https://gerrit.wikimedia.org/r/1053534

Instrumentation has been enabled in beta. You can test it by modifying a page on https://simple.wikipedia.beta.wmflabs.org.

Metrics are not scraped by prometheus (I think), but can be accessed from the statsd exporter endpoint on an appserver localhost.
For example:

$ ssh deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud
$ curl localhost:9112/metrics | grep -i eventbus

will return a bunch of metrics

Change #1053534 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventBus: label metrics with event type name.

https://gerrit.wikimedia.org/r/1053534

Change #1054357 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] eventbus: enable instrumentation on group 0

https://gerrit.wikimedia.org/r/1054357

Change #1054854 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/EventBus@master] Move instrumentation feature flag to extension.json

https://gerrit.wikimedia.org/r/1054854

Change #1054357 merged by jenkins-bot:

[operations/mediawiki-config@master] eventbus: enable instrumentation on group 0

https://gerrit.wikimedia.org/r/1054357

Mentioned in SAL (#wikimedia-operations) [2024-07-17T13:02:58Z] <urbanecm@deploy1002> Started scap sync-world: Backport for [[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]]

Mentioned in SAL (#wikimedia-operations) [2024-07-17T13:05:28Z] <urbanecm@deploy1002> nmw03, gmodena, urbanecm: Backport for [[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-07-17T13:13:04Z] <urbanecm@deploy1002> Finished scap: Backport for [[gerrit:1054084|Add Portal namespace for Ingush Wikipedia (T326089)]], [[gerrit:1054357|eventbus: enable instrumentation on group 0 (T363587)]] (duration: 10m 06s)

A dashboard for EventBus is available in Grafana

Change #1054854 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Move instrumentation feature flag to extension.json

https://gerrit.wikimedia.org/r/1054854

Change #1059900 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/mediawiki-config@master] eventbus: enable instrumentation on all wikis

https://gerrit.wikimedia.org/r/1059900

Change #1059900 merged by jenkins-bot:

[operations/mediawiki-config@master] eventbus: enable instrumentation on all wikis

https://gerrit.wikimedia.org/r/1059900

Mentioned in SAL (#wikimedia-operations) [2024-08-05T19:07:46Z] <otto@deploy1003> Started scap sync-world: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-05T19:09:53Z] <otto@deploy1003> otto: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-05T19:14:55Z] <otto@deploy1003> Finished scap: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]] (duration: 07m 08s)

I enabled on all wikis! Woohoo!

I wonder if wiki_id would be a useful label. I'm augmenting your dashboard and think that might be a nice way to slice.

send() function call and also http POST request timing metrics would be really nice too.

@gmodena I've rearranged and slightly improved your grafana dashboard:

https://grafana.wikimedia.org/goto/0Iqm6ErIg?orgId=1

I've tried to make it match the '4 golden signals' approach that EventGate and other service dashboards use. "Saturation" won't be a useful signal here, but Latency will: we should get some timing metrics!

Change #1060156 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] EventBus send - Fix json_decode bug in error metric handling

https://gerrit.wikimedia.org/r/1060156

Change #1060187 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] EventBus send - DRY code to avoid differing behavior for string $events

https://gerrit.wikimedia.org/r/1060187

Change #1060191 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] Eventbus send - rename some metrics and standardize labels

https://gerrit.wikimedia.org/r/1060191

@gmodena for when you are back:

I've been messing with the EventBus send code to try and improve a few things:

Let me know what you think.

I'd also like to do 2 more things before we resolve this ticket:

  • Remove the temporary feature flag
  • Add timing metrics for the whole send() call for the HTTP POST.

Change #1060156 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventBus send - Fix json_decode bug in error metric handling

https://gerrit.wikimedia.org/r/1060156

Change #1062399 had a related patch set uploaded (by Gmodena; author: Gmodena):

[mediawiki/extensions/EventBus@master] eventbus: remove instrumentation feature flag.

https://gerrit.wikimedia.org/r/1062399

Change #1062430 had a related patch set uploaded (by Gmodena; author: Gmodena):

[operations/mediawiki-config@master] config: remove eventbus instrumentation setting

https://gerrit.wikimedia.org/r/1062430

I'd also like to do 2 more things before we resolve this ticket:

Remove the temporary feature flag

There's already patches in flight for this.

Add timing metrics for the whole send() call for the HTTP POST.

We discussed this over call, and decided to not measure latencies for now. Can we have dedicated phab for this?

Yes, lets do new ticket. Will edit this one.

Change #1060191 abandoned by Ottomata:

[mediawiki/extensions/EventBus@master] Eventbus send - rename some metrics and standardize labels

Reason:

I squashed into Ia3faec55f2031809fcec75331bb2eddfddac9050

https://gerrit.wikimedia.org/r/1060191

Change #1062399 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] eventbus: remove instrumentation feature flag.

https://gerrit.wikimedia.org/r/1062399

Change #1060187 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventBus - DRY code and rename some metrics and standardize labels

https://gerrit.wikimedia.org/r/1060187

Just tested recent merged change in beta, looks good to me.

Change #1065043 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] EventBus metric - remove failure_kind label from events_accepted_total

https://gerrit.wikimedia.org/r/1065043

Change #1065043 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] EventBus send - fix buggy events_accepted_total metric

https://gerrit.wikimedia.org/r/1065043

Change #1070252 had a related patch set uploaded (by Ottomata; author: Ottomata):

[mediawiki/extensions/EventBus@master] Fix buggy events_accepted_total metric

https://gerrit.wikimedia.org/r/1070252

Change #1070252 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Fix buggy events_accepted_total metric

https://gerrit.wikimedia.org/r/1070252

Change #1070252 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Fix buggy events_accepted_total metric

https://gerrit.wikimedia.org/r/1070252

Metrics look much better after last week's deployment: https://grafana.wikimedia.org/goto/9pIwwRRNR?orgId=1

I'll keep an eye on this for a few more days, and if all is good we can resolve this task.

Change #1062430 merged by jenkins-bot:

[operations/mediawiki-config@master] config: remove eventbus instrumentation setting

https://gerrit.wikimedia.org/r/1062430

hashar subscribed.

Change #1062430 merged by jenkins-bot:

[operations/mediawiki-config@master] config: remove eventbus instrumentation setting

https://gerrit.wikimedia.org/r/1062430

I have reverted that patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1097995) because the change got merged while I was doing the MediaWiki train deployment. I did not want to deploy the config change given I don't know what it is about.

I have reverted that patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1097995) because the change got merged while I was doing the MediaWiki train deployment. I did not want to deploy the config change given I don't know what it is about.

Hey hashar, apologies for the confusion. It's a CR we had open for a while that removes a feature flag, for a feature that is now deployed on all wikis.
I'll coordinate on irc for a roll out during a backport window.

  NODES
Idea 1
idea 1
Note 13
Project 18