
dcaro (David Caro)
SRE & amateur yak shaver

User Details

User Since
Nov 2 2020, 11:59 AM (217 w, 3 d)
Availability
Available
IRC Nick
dcaro
LDAP User
David Caro
MediaWiki User
DCaro (WMF) [ Global Accounts ]

Recent Activity

Thu, Dec 12

dcaro added a comment to T381899: Add support for Python 3.13.

Related T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble

Thu, Dec 12, 8:50 AM · cloud-services-team, Toolforge
dcaro added a comment to T381923: Toolforge Build Service does not support .python-version.

Yep, the documentation shifted and we have not (yet) updated the buildpack underneath. This will be solved with the next buildpack upgrade. Related tasks:

Thu, Dec 12, 8:49 AM · cloud-services-team, Toolforge
dcaro triaged T374056: Upgrade python buildpack to v0.17.0 or newer for Poetry support as Medium priority.

Related T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble

Thu, Dec 12, 8:49 AM · cloud-services-team, Toolforge
Restricted Application added a project to T363854: Upgrade golang buildpack to 1.22: cloud-services-team.

Related T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble

Thu, Dec 12, 8:49 AM · cloud-services-team, Toolforge
Restricted Application added a project to T353762: Python buildpack does not detect requirements from pyproject.toml: cloud-services-team.

Related T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble

Thu, Dec 12, 8:49 AM · cloud-services-team, Toolforge, Upstream

Wed, Dec 11

dcaro added a comment to T381911: cloud VPS project codesearch or devtools - quota increase request for storage volumes.

Is this considered a lot or not at all compared to what you have available for all of cloud VPS?

Wed, Dec 11, 7:01 AM · collaboration-services, Cloud-VPS (Quota-requests)

Nov 29 2024

dcaro moved T380832: [jobs-api] crashing from Next Up to Done on the Toolforge (Toolforge iteration 16) board.
Nov 29 2024, 10:16 AM · Toolforge (Toolforge iteration 16), User-aborrero, cloud-services-team

Nov 28 2024

dcaro closed T366579: [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire as Resolved.
Nov 28 2024, 5:06 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro moved T379893: [toolforge-weld] read setting from envvars too from In Review to In Progress on the Toolforge (Toolforge iteration 16) board.
Nov 28 2024, 5:06 PM · Toolforge (Toolforge iteration 16)
dcaro closed T366579: [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire, a subtask of T309782: toolforge: Refresh certs that are not controlled by kubeadm (mid 2024 edition), as Resolved.
Nov 28 2024, 5:06 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro added a comment to T366579: [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire.

Up and running!

Nov 28 2024, 5:06 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro added a comment to T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble.

I think we can add a flag like '--new-builder' or similar for people to try (the tekton pipeline already allows customizing it; it might have to be added to the API too). Then, when we are happy, we make it the default and move the flag to '--old-builder' for people who have not migrated yet, and eventually deprecate it. We can probably be a bit more relaxed than with the API changes and give tool maintainers a bit more time to migrate.
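
A minimal sketch of that rollout, assuming a Python CLI built on argparse. Only the two flag names come from the comment; the program name, the `new_is_default` switch and the `use_new_builder` attribute are hypothetical and not taken from the Toolforge code.

```python
import argparse


def build_parser(new_is_default: bool) -> argparse.ArgumentParser:
    """Expose the opt-in or the opt-out flag depending on the rollout phase."""
    parser = argparse.ArgumentParser(prog="build-start-sketch")  # hypothetical name
    if new_is_default:
        # Later phase: the new builder is the default and --old-builder is
        # the temporary escape hatch for tools that have not migrated yet.
        parser.add_argument(
            "--old-builder",
            dest="use_new_builder",
            action="store_false",
            default=True,
            help="keep using the previous builder (to be deprecated)",
        )
    else:
        # First phase: the old builder stays the default and --new-builder
        # lets people try the new one.
        parser.add_argument(
            "--new-builder",
            dest="use_new_builder",
            action="store_true",
            default=False,
            help="try the new builder",
        )
    return parser
```

Both phases write to the same attribute, so the code that picks the builder image does not need to change when the default flips.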

Nov 28 2024, 4:30 PM · cloud-services-team, Toolforge
dcaro added a comment to T380127: [builds-builder] Add support for Heroku's "24" builder stack based on Ubuntu 2024.04 noble.

I think we can add a flag like '--new-builder' or similar for people to try (the tekton pipeline already allows customizing it; it might have to be added to the API too). Then, when we are happy, we make it the default and move the flag to '--old-builder' for people who have not migrated yet, and eventually deprecate it. We can probably be a bit more relaxed than with the API changes and give tool maintainers a bit more time to migrate.

Nov 28 2024, 4:04 PM · cloud-services-team, Toolforge
dcaro added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

I tried with:

resolver = Resolv::DNS.new(
  :nameserver => '127.0.0.1',
  :raise_timeout_errors => true,
)
Nov 28 2024, 11:41 AM · Puppet, Infrastructure-Foundations, cloud-services-team, Cloud-VPS
dcaro added a comment to T380960: kernel error detector: have a way to ignore certain messages.

That would be nice to have. But I would not know how to implement this off the top of my head. How would you implement it?

Nov 28 2024, 9:31 AM · cloud-services-team, User-aborrero
dcaro added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

From Gerrit, @dcaro writes:

Did a quick test: there are three functions we use to resolve names, and only one of them actually fails if it can't resolve:

wmflib::hosts2ips -> does not fail
dnsquery::lookup (used by hosts2ips) -> does not fail
ipresolve -> fails

Maybe hosts2ips should use ipresolve? Or have a flag to fail if the result is empty? It's only used here and in the firewall (where I'm not sure if we want to fail or not).

It's weird though, as it seems from the code that dnsquery::lookup should raise if there's an error:
https://gerrit.wikimedia.org/g/operations/puppet/+/cfc612449876b7ff631492b5f2c32b3c2e762ac3/vendor_modules/dnsquery/lib/puppet/functions/dnsquery/lookup.rb#30
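
The "flag to fail if it's empty" idea could look roughly like the sketch below. It is written in Python for illustration only: the real functions are Puppet/Ruby, and `hosts_to_ips`, `fail_on_empty`, `ResolutionError` and the injected `lookup` callable are all hypothetical names, not the wmflib API.

```python
from typing import Callable, Iterable, List


class ResolutionError(Exception):
    """Raised when a host resolves to no addresses and strict mode is on."""


def hosts_to_ips(
    hosts: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
    fail_on_empty: bool = False,
) -> List[str]:
    """Resolve every host with `lookup` and return the sorted, unique IPs.

    With fail_on_empty=False an unresolvable host is silently skipped,
    which mirrors the "does not fail" behaviour described above. With
    fail_on_empty=True the first empty result raises instead.
    """
    ips = set()
    for host in hosts:
        found = list(lookup(host))
        if not found and fail_on_empty:
            raise ResolutionError(f"no addresses found for {host!r}")
        ips.update(found)
    return sorted(ips)
```

Keeping the default at `False` would leave callers such as the firewall unchanged, while a caller that writes resolver configuration could opt in to the strict behaviour.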

Nov 28 2024, 9:21 AM · Puppet, Infrastructure-Foundations, cloud-services-team, Cloud-VPS

Nov 27 2024

dcaro added a comment to T371501: Configure DSCP marking for cloudceph* hosts.

A quick search did not find any reference for the mon option on the upstream ceph, but found a commit on a clone:

Nov 27 2024, 7:00 PM · Ceph, Patch-For-Review, netops, Infrastructure-Foundations, SRE
dcaro added a comment to T380960: kernel error detector: have a way to ignore certain messages.

Another possibility (maybe on top of that) would be to be able to acknowledge the errors, for example by reading a timestamp from a file, before which errors would be ignored (e.g. if an issue might happen again, but the current event is no longer relevant).
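
A rough sketch of that acknowledgement idea, assuming the file holds a single ISO 8601 timestamp with a UTC offset and that kernel events are available as (timestamp, message) pairs. The function names and the file format are hypothetical, not part of the existing detector.

```python
from datetime import datetime
from pathlib import Path
from typing import Iterable, List, Optional, Tuple

Event = Tuple[datetime, str]


def load_ack_time(path: str) -> Optional[datetime]:
    """Read the acknowledgement timestamp, or None if the file is missing or empty."""
    try:
        text = Path(path).read_text().strip()
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(text) if text else None


def unacknowledged(events: Iterable[Event], ack_time: Optional[datetime]) -> List[Event]:
    """Keep only the events that happened after the acknowledgement time."""
    if ack_time is None:
        return list(events)
    return [event for event in events if event[0] > ack_time]
```

An event that recurs after the acknowledgement time is reported again, which matches the "might happen again" case in the comment.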

Nov 27 2024, 6:35 PM · cloud-services-team, User-aborrero
dcaro triaged T381011: [functional-tests,deploy,cookbook] Run only selected tests when deploying a component as Medium priority.
Nov 27 2024, 4:54 PM · cloud-services-team, Toolforge
dcaro created T381011: [functional-tests,deploy,cookbook] Run only selected tests when deploying a component.
Nov 27 2024, 4:54 PM · cloud-services-team, Toolforge
dcaro added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

Noting that this is not specific to this project: on tools we can see failures spread throughout the day from apiserver pods on k8s:

root@tools-k8s-control-9:~# kubectl logs --timestamps -n kube-system pod/kube-apiserver-tools-k8s-control-7 | grep timeout
2024-11-26T19:26:14.837618565Z W1126 19:26:14.837267       1 logging.go:59] [core] [Channel #18526 SubChannel #18527] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:54305->172.20.255.1:53: i/o timeout"
...
2024-11-27T15:19:17.460516016Z W1127 15:19:17.460205       1 logging.go:59] [core] [Channel #59647 SubChannel #59648] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:57729->172.20.255.1:53: i/o timeout"
2024-11-27T15:29:19.548211117Z W1127 15:29:19.547805       1 logging.go:59] [core] [Channel #59987 SubChannel #59988] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:56126->172.20.255.1:53: i/o timeout"
2024-11-27T15:35:20.845977552Z W1127 15:35:20.845689       1 logging.go:59] [core] [Channel #60194 SubChannel #60195] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:39192->172.20.255.1:53: i/o timeout"
Nov 27 2024, 4:45 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
dcaro added a comment to T380833: [harbor] some artifacts and projects seems to have gone missing.

Full list of tools without artifact or with no project in harbor (total of 8, half of them are ours):

dcaro@urcuchillay$ grep '###' out
################ unable to find project tool-mbh
################ unable to find project tool-milhistbot
################ unable to find project tool-nodejs-flask-buildpack-sample
################ unable to find project tool-teddybot
################ unable to find project tool-containers
################ unable to find project tool-lebot
################ unable to find project tool-sample-complex-app
################ unable to find project tool-containers
Nov 27 2024, 4:40 PM · User-aborrero, User-Raymond_Ndibe, Toolforge, cloud-services-team
dcaro added a comment to T380833: [harbor] some artifacts and projects seems to have gone missing.

A quick scan of the harbor images running on the cluster reveals 3 tools with a missing harbor project:

dcaro@urcuchillay$ grep '###' out
################ unable to find project tool-nodejs-flask-buildpack-sample
################ unable to find project tool-teddybot
################ unable to find project tool-lebot
Nov 27 2024, 2:59 PM · User-aborrero, User-Raymond_Ndibe, Toolforge, cloud-services-team
dcaro updated the task description for T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 2:30 PM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro renamed T380985: [infra,k8s,o11y] Introduce worker checks from [infra,k8s] Introduce worker checks to [infra,k8s,o11y] Introduce worker checks.
Nov 27 2024, 2:26 PM · Toolforge, cloud-services-team
dcaro renamed T380892: [infra,k8s,o11y] introduce additional observability for calico and general networking from toolforge: introduce additional observability for calico and general networking to [infra,k8s,o11y] introduce additional observability for calico and general networking.
Nov 27 2024, 2:24 PM · Toolforge, Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro renamed T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components from toolforge: create docs on how to operate the cluster and core components to [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 2:22 PM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro triaged T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components as High priority.
Nov 27 2024, 2:21 PM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro edited projects for T380832: [jobs-api] crashing, added: Toolforge (Toolforge iteration 16); removed Toolforge.
Nov 27 2024, 2:20 PM · Toolforge (Toolforge iteration 16), User-aborrero, cloud-services-team
dcaro renamed T380832: [jobs-api] crashing from jobs-api crashing to [jobs-api] crashing.
Nov 27 2024, 2:20 PM · Toolforge (Toolforge iteration 16), User-aborrero, cloud-services-team
dcaro created T380985: [infra,k8s,o11y] Introduce worker checks.
Nov 27 2024, 2:19 PM · Toolforge, cloud-services-team
dcaro updated the task description for T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 11:29 AM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro updated the task description for T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 10:36 AM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro added a parent task for T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components: T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 10:34 AM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro added a subtask for T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components: T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components.
Nov 27 2024, 10:34 AM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro added a parent task for T325166: tbs: user-story 10: I want to know how to manage the service: T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 10:33 AM · Toolforge (Toolforge iteration 03), Toolforge Build Service, cloud-services-team, Epic, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro
dcaro added a parent task for T325172: [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service: T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components.
Nov 27 2024, 10:33 AM · Toolforge, cloud-services-team, Epic, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro
dcaro added subtasks for T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components: T325172: [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service, T325166: tbs: user-story 10: I want to know how to manage the service.
Nov 27 2024, 10:33 AM · Toolforge (Toolforge iteration 16), Sustainability (Incident Followup), User-aborrero, cloud-services-team
dcaro closed T325174: [builds-builder,harbor,bulid-service,docs] user-story 11: Add section to admin docs on how to debug the service, how to pin-point the failing component and how to get the logs for each of them. as Resolved.
Nov 27 2024, 10:33 AM · Toolforge, cloud-services-team, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro
dcaro closed T325172: [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service, a subtask of T267374: [tbs.beta] Create a toolforge build service beta release, as Resolved.
Nov 27 2024, 10:31 AM · cloud-services-team (FY2023/2024-Q1-Q2), Goal, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, Toolforge Build Service, User-dcaro
dcaro closed T325172: [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service as Resolved.
Nov 27 2024, 10:31 AM · Toolforge, cloud-services-team, Epic, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro
dcaro closed T325174: [builds-builder,harbor,bulid-service,docs] user-story 11: Add section to admin docs on how to debug the service, how to pin-point the failing component and how to get the logs for each of them., a subtask of T325172: [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service, as Resolved.
Nov 27 2024, 10:30 AM · Toolforge, cloud-services-team, Epic, Cloud-Services-Worktype-Project, Cloud-Services-Origin-Team, User-dcaro
dcaro added a comment to T380832: [jobs-api] crashing.

why no alerts?

Nov 27 2024, 8:43 AM · Toolforge (Toolforge iteration 16), User-aborrero, cloud-services-team
dcaro updated the task description for T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components.
Nov 27 2024, 8:36 AM · Toolforge (Toolforge iteration 16), cloud-services-team

Nov 26 2024

dcaro added a comment to T380890: jobs-api: Impersonate user instead of loading certs from NFS.

Could it use a service account instead? (would be simpler)

Nov 26 2024, 4:57 PM · cloud-services-team, Toolforge
dcaro added a comment to T380503: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54.

Looks good on my side 👍

Nov 26 2024, 3:30 PM · SRE, ops-eqiad, Cloud-Services, netops, DC-Ops, Infrastructure-Foundations
dcaro added a comment to T380877: Kernel error Server cloudcephmon1004 may have kernel errors.

The first is expected, the second seems harmless too:
https://hetzbiz.cloud/2024/06/11/those-damn-mpt3sas_cm0-messages/

Nov 26 2024, 3:20 PM · cloud-services-team
dcaro added a comment to T380877: Kernel error Server cloudcephmon1004 may have kernel errors.

Current errors:

root@cloudcephmon1004:~# journalctl -k -p err
-- Journal begins at Tue 2024-11-26 12:46:45 UTC, ends at Tue 2024-11-26 15:12:54 UTC. --
Nov 26 13:11:41 cloudcephmon1004 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS
Nov 26 13:11:41 cloudcephmon1004 kernel: mpt3sas_cm0: Trace buffer memory 2048 KB allocated
Nov 26 2024, 3:14 PM · cloud-services-team
dcaro moved T362867: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28 from In Review to In Progress on the Toolforge (Toolforge iteration 16) board.
Nov 26 2024, 3:08 PM · Patch-For-Review, Toolforge (Toolforge iteration 16), cloud-services-team
dcaro moved T361120: [jobs-cli,jobs-api] quota shows different units for limit and usage from In Review to In Progress on the Toolforge (Toolforge iteration 16) board.
Nov 26 2024, 3:07 PM · Toolforge (Toolforge iteration 16), Patch-For-Review
dcaro reopened T364870: Q4:rack/setup/install new cloudcephmon hosts as "Open".

Node up and running

Nov 26 2024, 1:35 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
dcaro closed T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements as Resolved.

Last node added

Nov 26 2024, 1:34 PM · Cloud-VPS, cloud-services-team (FY2024/2025-Q1-Q2)
dcaro updated the task description for T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements.
Nov 26 2024, 1:33 PM · Cloud-VPS, cloud-services-team (FY2024/2025-Q1-Q2)
dcaro closed T364870: Q4:rack/setup/install new cloudcephmon hosts as Resolved.
Nov 26 2024, 1:33 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
dcaro closed T364870: Q4:rack/setup/install new cloudcephmon hosts, a subtask of T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements, as Resolved.
Nov 26 2024, 1:33 PM · Cloud-VPS, cloud-services-team (FY2024/2025-Q1-Q2)
dcaro reopened T364870: Q4:rack/setup/install new cloudcephmon hosts, a subtask of T374005: [ceph] install and put in the cluster the cloudcephmon100[1-3] replacements, as Open.
Nov 26 2024, 1:33 PM · Cloud-VPS, cloud-services-team (FY2024/2025-Q1-Q2)
dcaro added a comment to T380827: tools-nfs outage 2024-11-25.

This one never got resolved https://bugs.launchpad.net/neutron/+bug/1868098 :/

Nov 26 2024, 1:16 PM · Toolforge, cloud-services-team
dcaro added a comment to T380827: tools-nfs outage 2024-11-25.

An upstream report says it's a "harmless log message" https://bugzilla.redhat.com/show_bug.cgi?id=1506035

Nov 26 2024, 1:09 PM · Toolforge, cloud-services-team
dcaro added a comment to T380827: tools-nfs outage 2024-11-25.

So my theory is maybe a ceph network hiccup?

Nov 26 2024, 12:51 PM · Toolforge, cloud-services-team
dcaro closed T380834: [bug] <your request here> as Declined.

Forgot to fill it in, I guess

Nov 26 2024, 11:23 AM · PAWS
dcaro added a comment to T380844: 2024-11-26 Toolforge DNS incident.

The current status is stable again. We are still investigating the root causes, but the cluster is up and running.

Nov 26 2024, 11:12 AM · Wikimedia-Incident, cloud-services-team, Toolforge
dcaro triaged T380844: 2024-11-26 Toolforge DNS incident as High priority.
Nov 26 2024, 11:12 AM · Wikimedia-Incident, cloud-services-team, Toolforge

Nov 25 2024

dcaro moved T378500: [components-api] Add feature flag to disable user endpoints for deployment in tools from Next Up to Done on the Toolforge (Toolforge iteration 16) board.
Nov 25 2024, 6:17 PM · Toolforge (Toolforge iteration 16)
dcaro moved T366579: [infra,k8s,monitoring] Add an alert to warn when the prometheus k8s cert is about to expire from In Progress to In Review on the Toolforge (Toolforge iteration 16) board.
Nov 25 2024, 6:17 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro closed T378500: [components-api] Add feature flag to disable user endpoints for deployment in tools as Resolved.

I think it's not needed anymore yep

Nov 25 2024, 5:12 PM · Toolforge (Toolforge iteration 16)
dcaro closed T378500: [components-api] Add feature flag to disable user endpoints for deployment in tools, a subtask of T362867: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.28, as Resolved.
Nov 25 2024, 5:11 PM · Patch-For-Review, Toolforge (Toolforge iteration 16), cloud-services-team
dcaro closed T380239: CephSlowOps Ceph cluster in eqiad has 1 slow ops as Resolved.

This one had gotten stuck on cloudcephmon1001; restarting the mon process there made it go away.

Nov 25 2024, 2:46 PM · cloud-services-team
dcaro closed T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops as Resolved.

The cable and adapters have been replaced, and the connection is now up and running \o/

Nov 25 2024, 2:46 PM · cloud-services-team
dcaro triaged T380706: [components-api, components-cli] deploy-token: separate create from update as Medium priority.
Nov 25 2024, 11:11 AM · Patch-For-Review, Toolforge (Toolforge iteration 16)
dcaro added a comment to T380703: versions.toolforge.org is down.

There might be a mixture of issues here, as the original error seemed to happen before the network outage (November 25, 2024 at 11:28:51 AM GMT+1 https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/132#note_8339f66545cb4552cd41ab090f629bde43e09aba).

Nov 25 2024, 11:09 AM · Tools
dcaro added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

There is an ongoing network outage on cloudvps (caused by T380174), working on it.

Nov 25 2024, 11:02 AM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
dcaro closed T380283: [components-api] Limit the amount of deployments to (say) 25, a subtask of T362051: [components-api] First iteration of the component API, as Resolved.
Nov 25 2024, 9:32 AM · Toolforge (Toolforge iteration 16), cloud-services-team (FY2024/2025-Q1-Q2), User-aborrero, Epic
dcaro closed T380283: [components-api] Limit the amount of deployments to (say) 25 as Resolved.
Nov 25 2024, 9:32 AM · Toolforge (Toolforge iteration 16)

Nov 21 2024

dcaro added a parent task for T380503: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54: T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.
Nov 21 2024, 5:47 PM · SRE, ops-eqiad, Cloud-Services, netops, DC-Ops, Infrastructure-Foundations
dcaro added a subtask for T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops: T380503: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54.
Nov 21 2024, 5:47 PM · cloud-services-team
dcaro reassigned T380511: [horizon] failing to show the proxies tab for some projects from dcaro to Andrew.
Nov 21 2024, 5:46 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

it's managed by puppet:
https://gerrit.wikimedia.org/g/operations/puppet/+/741f356e6294467c88c183bb01c3339e0abb27e3/hieradata/eqiad/profile/openstack/eqiad1/horizon.yaml

Nov 21 2024, 5:36 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

:facepalm: I'm adding docker pull there xd

Nov 21 2024, 5:33 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

Yep, we are not running latest:

Nov 21 2024, 5:31 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

This should have fixed it actually:
https://gerrit.wikimedia.org/r/c/openstack/horizon/wmf-proxy-dashboard/+/1091859

Nov 21 2024, 5:26 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

From our code for the special view:

/opt/lib/python/site-packages/wikimediaproxydashboard/views.py:
Nov 21 2024, 5:25 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a comment to T380511: [horizon] failing to show the proxies tab for some projects.

The original exception is:

[Thu Nov 21 17:09:40.191562 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Error while rendering table rows.
[Thu Nov 21 17:09:40.191580 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Traceback (most recent call last):                                                                                                                                                                                                                        
[Thu Nov 21 17:09:40.191582 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1933, in get_rows
[Thu Nov 21 17:09:40.191584 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     row = self._meta.row_class(self, datum)                                                                                                                                                                                                               
[Thu Nov 21 17:09:40.191586 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 590, in __init__
[Thu Nov 21 17:09:40.191588 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     self.load_cells()                                                                                                                                                                                                                                     
[Thu Nov 21 17:09:40.191589 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 640, in load_cells
[Thu Nov 21 17:09:40.191591 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_name = table.get_object_display(datum)                                                                                                                                                                                                        
[Thu Nov 21 17:09:40.191592 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1835, in get_object_display
[Thu Nov 21 17:09:40.191594 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_key = self.get_object_display_key(datum)                                                                                                                                                                                                      
[Thu Nov 21 17:09:40.191595 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] TypeError: get_object_display_key() takes 0 positional arguments but 2 were given
[Thu Nov 21 17:09:40.192967 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Internal Server Error: /project/proxy/                                                                                                                                                                                                                    
[Thu Nov 21 17:09:40.192978 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Traceback (most recent call last): 
[Thu Nov 21 17:09:40.192980 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1933, in get_rows
[Thu Nov 21 17:09:40.192982 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     row = self._meta.row_class(self, datum)
[Thu Nov 21 17:09:40.192984 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 590, in __init__        
[Thu Nov 21 17:09:40.192985 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     self.load_cells()  
[Thu Nov 21 17:09:40.192987 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 640, in load_cells
[Thu Nov 21 17:09:40.192989 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_name = table.get_object_display(datum)           
[Thu Nov 21 17:09:40.192990 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1835, in get_object_display
[Thu Nov 21 17:09:40.192992 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_key = self.get_object_display_key(datum)
[Thu Nov 21 17:09:40.192994 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] TypeError: get_object_display_key() takes 0 positional arguments but 2 were given
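
For reference, Python produces this message when a function defined without any parameters is looked up on an instance and called with one argument: the instance is passed implicitly, so two positional arguments arrive. A minimal reproduction follows, with hypothetical class names. It is not the dashboard's code, and whether this was the exact cause here is an assumption.

```python
class BrokenTable:
    # Defined without `self` or `datum`, so it accepts no arguments at all.
    def get_object_display_key():
        return "name"


class FixedTable:
    # Signature matching the call seen in the traceback.
    def get_object_display_key(self, datum):
        return "name"


def display_key_error(table, datum):
    """Return the TypeError message raised by the call, or None if it succeeds."""
    try:
        table.get_object_display_key(datum)
    except TypeError as exc:
        return str(exc)
    return None
```

Calling `display_key_error(BrokenTable(), {})` returns a message ending in "takes 0 positional arguments but 2 were given", while the `FixedTable` variant returns `None`.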
Nov 21 2024, 5:14 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro triaged T380511: [horizon] failing to show the proxies tab for some projects as High priority.
Nov 21 2024, 5:14 PM · Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, cloud-services-team (FY2024/2025-Q1-Q2), User-dcaro
dcaro added a parent task for T380503: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54: T380239: CephSlowOps Ceph cluster in eqiad has 1 slow ops.
Nov 21 2024, 4:42 PM · SRE, ops-eqiad, Cloud-Services, netops, DC-Ops, Infrastructure-Foundations
dcaro added a subtask for T380239: CephSlowOps Ceph cluster in eqiad has 1 slow ops: T380503: Replace optics in cloudsw1-d5-eqiad et-0/0/52 and cloudsw1-e4-eqiad et-0/0/54.
Nov 21 2024, 4:41 PM · cloud-services-team
dcaro changed the status of T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops from Open to In Progress.
Nov 21 2024, 4:21 PM · cloud-services-team
dcaro changed the status of T380239: CephSlowOps Ceph cluster in eqiad has 1 slow ops from Open to In Progress.
Nov 21 2024, 4:21 PM · cloud-services-team
dcaro added a comment to T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.

Might be related to T380239

Nov 21 2024, 3:55 PM · cloud-services-team
dcaro added a comment to T380239: CephSlowOps Ceph cluster in eqiad has 1 slow ops.

From the switch graphs, something made it blip:

image.png (1×596 px, 77 KB)

Nov 21 2024, 3:54 PM · cloud-services-team
dcaro added a comment to T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.

image.png (1×1 px, 121 KB)

https://grafana.wikimedia.org/d/5p97dAASz/network-interface-queue-and-error-stats?orgId=1&var-site=eqiad+prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=et-0%2F0%2F52&viewPanel=43&from=1732193323379&to=1732204123379

Nov 21 2024, 3:51 PM · cloud-services-team
dcaro added a comment to T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.

From the switch stats:

image.png (271×1 px, 23 KB)

https://grafana.wikimedia.org/d/f61a7d56-e132-44dc-b9da-d722b11566cf/network-totals-by-site?orgId=1&refresh=30s&var-site=eqiad%20prometheus%2Fops

Nov 21 2024, 3:50 PM · cloud-services-team
dcaro claimed T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.
Nov 21 2024, 3:48 PM · cloud-services-team
dcaro added a comment to T380489: CephSlowOps Ceph cluster in eqiad has 253 slow ops.

Having slow heartbeats:

Slow OSD heartbeats on back (longest 6757.204ms)
Slow OSD heartbeats on front (longest 6091.493ms)
Nov 21 2024, 3:48 PM · cloud-services-team

Nov 20 2024

dcaro closed T348633: [api-gateway] add alert for uptime as Resolved.
Nov 20 2024, 6:24 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro closed T348633: [api-gateway] add alert for uptime, a subtask of T348634: ceph slow ops 2023-10-11, as Resolved.
Nov 20 2024, 6:20 PM · cloud-services-team, Cloud-VPS
dcaro reopened T348633: [api-gateway] add alert for uptime as "In Progress".
Nov 20 2024, 4:56 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro moved T348633: [api-gateway] add alert for uptime from In Review to In Progress on the Toolforge (Toolforge iteration 16) board.
Nov 20 2024, 4:55 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro reopened T348633: [api-gateway] add alert for uptime, a subtask of T348634: ceph slow ops 2023-10-11, as In Progress.
Nov 20 2024, 4:55 PM · cloud-services-team, Cloud-VPS
dcaro closed T348633: [api-gateway] add alert for uptime as Resolved.
Nov 20 2024, 3:22 PM · Toolforge (Toolforge iteration 16), cloud-services-team
dcaro closed T348633: [api-gateway] add alert for uptime, a subtask of T348634: ceph slow ops 2023-10-11, as Resolved.
Nov 20 2024, 3:18 PM · cloud-services-team, Cloud-VPS