
Timeout errors when making requests to Firebase for push notifications
Closed, Resolved, Public

Description

After updating our push-notifications service to use the latest version of Firebase, we're seeing ETIMEDOUT errors, apparently because the service is attempting to send requests to an IPv6 address:

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.11.12?id=9w80IZMB1cXMpx2m0Glg

Event Timeline

I am reverting the production environments only and leaving staging as is, in case SREs need to debug the actual running pod.
FYI, we tried to reproduce it locally but didn't get any failures.

Given the shape of the error message and stack trace, I suspect that there's some new code path in FCM or its dependencies for which only setting the httpAgent param isn't sufficient, and we also need to set an environment variable.

I hand-kubectl edited the deployment on the eqiad staging cluster to also set the http_proxy env var, but staging is apparently broken for other reasons.

But I did produce P71024 which should allow for local testing of this hypothesis by devs.

I also want to note that catching this kind of thing is what the canary deployment in production is for. SRE doesn't yet make it really simple for service owners to make good use of it (MediaWiki deploys being the one exception), but we could put it to good use here.

Update from debugging:

I ran a local environment with:

  • a squid proxy
  • a local DNS forwarder
  • external traffic from the app container blocked

With that setup I managed to reproduce the timeouts even with IPv4 resolution. After adding the Firebase compatibility flag that disables HTTP/2, outgoing requests went through.

The next step to fix the problem is either to implement an HTTP/2-aware httpAgent to pass to the Firebase app initialization, or to use the compatibility flag.
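
For illustration, the compatibility-flag option could look roughly like the sketch below. It assumes the flag in question is the Firebase Admin SDK's `enableLegacyHttpTransport()` method on the messaging client, which this thread does not name explicitly. `LegacyTransportCapable` and `applyTransportMode` are hypothetical names used only here, and the interface is a structural stand-in so the snippet does not depend on the SDK being installed.

```typescript
// Structural stand-in for the one method of the Firebase Admin SDK's
// messaging client that this sketch relies on (assumed, see above).
interface LegacyTransportCapable {
  enableLegacyHttpTransport(): void;
}

// Hypothetical helper: opt out of HTTP/2 only when the deployment asks for it,
// so the flag can be dropped again once an HTTP/2-aware agent is available.
function applyTransportMode(
  messaging: LegacyTransportCapable,
  useLegacyHttp: boolean
): "http1" | "http2" {
  if (useLegacyHttp) {
    messaging.enableLegacyHttpTransport();
    return "http1";
  }
  return "http2";
}
```

In the service this would presumably be called once, right after the Firebase app is initialized and before any messages are sent.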

Change #1090848 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/push-notifications@master] firebase: Use legacy http transport

https://gerrit.wikimedia.org/r/1090848

Change #1090848 merged by jenkins-bot:

[mediawiki/services/push-notifications@master] firebase: Use legacy http transport

https://gerrit.wikimedia.org/r/1090848

Change #1093919 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[operations/deployment-charts@master] push-notifications: Bump image to latest version

https://gerrit.wikimedia.org/r/1093919

Change #1093919 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: Bump image to latest version

https://gerrit.wikimedia.org/r/1093919

We deployed the above change and it worked in staging, but after rolling it out to production we're now seeing an influx of logstash messages like this:

Failed to send message: app/invalid-credential - Credential implementation provided to initializeApp() via the "credential" property failed to fetch a valid Google OAuth2 access token with the following error: "Error fetching access token: Error while making request: connect ETIMEDOUT 2607:f8b0:4023:1009::54:443."

This seems to be a similar ETIMEDOUT error to the original issue.
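
To tell at a glance whether such failures are direct connections to an IPv6 address (rather than to the proxy), the target can be pulled out of the error text. This is a minimal sketch that assumes the `connect ETIMEDOUT <address>:<port>` shape seen in the message above; `parseTimedOutTarget` and `TimedOutTarget` are hypothetical names, and the IPv6 check is a simple "contains a colon" heuristic rather than full address validation.

```typescript
interface TimedOutTarget {
  address: string;
  port: number;
  family: "IPv4" | "IPv6";
}

// Extracts the address and port from a "connect ETIMEDOUT <address>:<port>"
// fragment. Returns null when the message does not contain that fragment.
function parseTimedOutTarget(message: string): TimedOutTarget | null {
  // The greedy \S+ backtracks to the last ":<digits>", so the colons inside
  // an IPv6 address stay part of the address rather than the port.
  const match = /connect ETIMEDOUT (\S+):(\d+)/.exec(message);
  if (match === null) {
    return null;
  }
  const address = match[1];
  const port = Number(match[2]);
  // Heuristic: any remaining colon means the address is IPv6.
  const family = address.indexOf(":") !== -1 ? "IPv6" : "IPv4";
  return { address, port, family };
}
```

Applied to the logstash message above, this yields the address `2607:f8b0:4023:1009::54`, port 443, and family IPv6, which suggests the request bypassed the proxy.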

Have you tried also setting the http_proxy and https_proxy env variables in your deployment? Some of the Firebase documentation implies this is also necessary.

Change #1100513 had a related patch set uploaded (by Dbrant; author: Dbrant):

[operations/deployment-charts@master] push-notifications: Add proxy env vars.

https://gerrit.wikimedia.org/r/1100513

Change #1100513 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: New release & proxy env vars.

https://gerrit.wikimedia.org/r/1100513

Change #1100535 had a related patch set uploaded (by Dbrant; author: Dbrant):

[operations/deployment-charts@master] push-notifications: Add no_proxy: localhost, for making API calls.

https://gerrit.wikimedia.org/r/1100535

Change #1100535 merged by jenkins-bot:

[operations/deployment-charts@master] push-notifications: Add no_proxy: localhost, for making API calls.

https://gerrit.wikimedia.org/r/1100535
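
As a rough illustration of how the three variables from the two changes above interact, the sketch below decides whether a request should go through the proxy or connect directly. It is a simplified model, not the behavior of any particular HTTP client: real clients differ in how they interpret `no_proxy` (ports, wildcards, CIDR ranges). `proxyFor` and `ProxyEnv` are hypothetical names, and the environment is passed in as a plain object so the function stays pure.

```typescript
type ProxyEnv = Record<string, string | undefined>;

// Returns the proxy URL to use for the given host, or null when the request
// should connect directly (host listed in no_proxy, or no proxy configured).
function proxyFor(
  scheme: "http" | "https",
  hostname: string,
  env: ProxyEnv
): string | null {
  const host = hostname.toLowerCase();
  const exclusions = (env["no_proxy"] ?? "")
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);

  const bypass = exclusions.some((entry) => {
    // Treat ".example.org" and "example.org" the same way.
    const bare = entry.charAt(0) === "." ? entry.slice(1) : entry;
    if (host === bare) {
      return true;
    }
    // Also match subdomains of the excluded name.
    const suffix = "." + bare;
    return host.length > suffix.length && host.slice(-suffix.length) === suffix;
  });
  if (bypass) {
    return null;
  }

  const proxy = scheme === "https" ? env["https_proxy"] : env["http_proxy"];
  return proxy !== undefined && proxy !== "" ? proxy : null;
}
```

With `no_proxy` set to `localhost`, calls to the local API skip the proxy while requests to external hosts such as the Firebase endpoints are still routed through it.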

Can confirm that this is now resolved.
Many thanks to @CDanis for all the help!

CDanis claimed this task.