User Details
- User Since
- Nov 2 2020, 11:59 AM (217 w, 3 d)
- Availability
- Available
- IRC Nick
- dcaro
- LDAP User
- David Caro
- MediaWiki User
- DCaro (WMF)
Thu, Dec 12
Yep, the documentation shifted and we did not (yet) update the buildpack underneath. This will be solved with the next buildpack upgrade; related tasks:
Wed, Dec 11
Is this considered a lot, or not much at all, compared to what you have available for all of cloud VPS?
Nov 29 2024
Nov 28 2024
Up and running!
I think we can add a flag like '--new-builder' or similar for people to try (the tekton pipeline already allows customizing it; it might have to be added to the API too). Then, when we are happy with it, make it the default and move the flag to '--old-builder' for people who have not migrated yet, and eventually deprecate it. We can probably be a bit more relaxed than with the API changes and give tool maintainers a bit more time to migrate.
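On the CLI side it could look roughly like this (a minimal argparse sketch in Python; the flag names, the 'builder' value and how it gets forwarded to the builds API are placeholders, not the actual toolforge-cli code):

import argparse


def make_parser() -> argparse.ArgumentParser:
    # Hypothetical `build start`-style command, showing only builder selection.
    parser = argparse.ArgumentParser(prog="toolforge-build-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--new-builder",
        dest="builder",
        action="store_const",
        const="new",
        help="Opt in to the new buildpack builder (hypothetical flag).",
    )
    group.add_argument(
        "--old-builder",
        dest="builder",
        action="store_const",
        const="old",
        help="Stay on the old builder while migrating (hypothetical flag).",
    )
    # For now the old builder stays the default; once we are happy with the new
    # one this single line flips and --old-builder becomes the escape hatch.
    parser.set_defaults(builder="old")
    return parser


if __name__ == "__main__":
    args = make_parser().parse_args()
    # The chosen builder would then be forwarded to the builds API / tekton
    # pipeline, which already allows customizing it.
    print(f"builder={args.builder}")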
I tried with:
resolver = Resolv::DNS.new(
  :nameserver => '127.0.0.1',
  :raise_timeout_errors => true,
)
That would be nice to have. But I would not know how to implement this off the top of my head. How would you implement it?
Nov 27 2024
A quick search did not find any reference for the mon option in upstream Ceph, but it did find a commit on a clone:
Another possibility (maybe on top of that) would be to be able to acknowledge the errors, for example by reading a timestamp from a file and ignoring any errors from before it (e.g. if an issue might happen again, but the current event is not relevant anymore).
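A rough sketch of that acknowledgment idea (assuming the check runs in Python; the file path and the expectation of a timezone-aware ISO 8601 timestamp in the file are made-up details):

from datetime import datetime, timezone
from pathlib import Path

# Made-up path; a human writes a timezone-aware ISO 8601 timestamp here to
# acknowledge (silence) every error that happened before that moment.
ACK_FILE = Path("/etc/error-report.acknowledged-until")


def is_acknowledged(error_time: datetime) -> bool:
    """Return True if the error predates the acknowledged-until timestamp."""
    if not ACK_FILE.exists():
        return False
    acknowledged_until = datetime.fromisoformat(ACK_FILE.read_text().strip())
    return error_time <= acknowledged_until


if __name__ == "__main__":
    # Example: an old error gets ignored once someone acknowledges everything
    # up to a point in time after it.
    error_time = datetime(2024, 11, 27, 9, 0, tzinfo=timezone.utc)
    print("ignore error:", is_acknowledged(error_time))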
Noting that this is not specific to this project; on tools we can see failures spread throughout the day from apiserver pods on k8s:
root@tools-k8s-control-9:~# kubectl logs --timestamps -n kube-system pod/kube-apiserver-tools-k8s-control-7 | grep timeout
2024-11-26T19:26:14.837618565Z W1126 19:26:14.837267 1 logging.go:59] [core] [Channel #18526 SubChannel #18527] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-23.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:54305->172.20.255.1:53: i/o timeout"
...
2024-11-27T15:19:17.460516016Z W1127 15:19:17.460205 1 logging.go:59] [core] [Channel #59647 SubChannel #59648] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:57729->172.20.255.1:53: i/o timeout"
2024-11-27T15:29:19.548211117Z W1127 15:29:19.547805 1 logging.go:59] [core] [Channel #59987 SubChannel #59988] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:56126->172.20.255.1:53: i/o timeout"
2024-11-27T15:35:20.845977552Z W1127 15:35:20.845689 1 logging.go:59] [core] [Channel #60194 SubChannel #60195] grpc: addrConn.createTransport failed to connect to {Addr: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud:2379", ServerName: "tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup tools-k8s-etcd-24.tools.eqiad1.wikimedia.cloud on 172.20.255.1:53: read udp 172.16.0.144:39192->172.20.255.1:53: i/o timeout"
Full list of tools without an artifact or with no project in harbor (8 in total, half of them are ours):
dcaro@urcuchillay$ grep '###' out
################ unable to find project tool-mbh
################ unable to find project tool-milhistbot
################ unable to find project tool-nodejs-flask-buildpack-sample
################ unable to find project tool-teddybot
################ unable to find project tool-containers
################ unable to find project tool-lebot
################ unable to find project tool-sample-complex-app
################ unable to find project tool-containers
A quick scan of the harbor images running on the cluster reveals 3 tools with a missing harbor project:
dcaro@urcuchillay$ grep '###' out
################ unable to find project tool-nodejs-flask-buildpack-sample
################ unable to find project tool-teddybot
################ unable to find project tool-lebot
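For reference, the check per tool boils down to something like this (a sketch only: the harbor URL is a placeholder and it assumes anonymous read access to the Harbor v2 REST API):

import requests

# Placeholder URL; assumes anonymous read access to the Harbor v2 REST API.
HARBOR_URL = "https://harbor.example.wikimedia.cloud"


def project_exists(tool: str) -> bool:
    # Harbor keeps one project per tool, named "tool-<toolname>".
    response = requests.get(f"{HARBOR_URL}/api/v2.0/projects/tool-{tool}", timeout=10)
    return response.status_code == 200


if __name__ == "__main__":
    for tool in ("nodejs-flask-buildpack-sample", "teddybot", "lebot"):
        if not project_exists(tool):
            print(f"################ unable to find project tool-{tool}")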
why no alerts?
Nov 26 2024
Could it use a service account instead? (would be simpler)
Looks good on my side 👍
The first is expected, the second seems harmless too:
https://hetzbiz.cloud/2024/06/11/those-damn-mpt3sas_cm0-messages/
Current errors:
root@cloudcephmon1004:~# journalctl -k -p err
-- Journal begins at Tue 2024-11-26 12:46:45 UTC, ends at Tue 2024-11-26 15:12:54 UTC. --
Nov 26 13:11:41 cloudcephmon1004 kernel: x86/cpu: VMX (outside TXT) disabled by BIOS
Nov 26 13:11:41 cloudcephmon1004 kernel: mpt3sas_cm0: Trace buffer memory 2048 KB allocated
Node up and running
Last node added
This one never got resolved https://bugs.launchpad.net/neutron/+bug/1868098 :/
An upstream comment says it's a "harmless log message": https://bugzilla.redhat.com/show_bug.cgi?id=1506035
So my theory is maybe a ceph network hiccup?
Forgot to fill up I guess
The current status is stable again; we are still investigating the root causes, but the cluster is up and running.
Nov 25 2024
Yep, I think it's not needed anymore.
This one had gotten stuck on cloudcephmon1001; restarting the mon process there made it go away.
The cable and adapters have been replaced, and the connection is now up and running \o/
There might be a mixture of issues here, as the original error seemed to happen before the network outage (November 25, 2024 at 11:28:51 AM GMT+1 https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/132#note_8339f66545cb4552cd41ab090f629bde43e09aba).
There is an ongoing network outage on cloudvps (caused by T380174), working on it.
Nov 21 2024
:facepalm: I'm adding docker pull there xd
Yep, we are not running latest:
This should have fixed it actually:
https://gerrit.wikimedia.org/r/c/openstack/horizon/wmf-proxy-dashboard/+/1091859
From our code for the special view:
/opt/lib/python/site-packages/wikimediaproxydashboard/views.py:
The original exception is:
[Thu Nov 21 17:09:40.191562 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Error while rendering table rows.
[Thu Nov 21 17:09:40.191580 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Traceback (most recent call last):
[Thu Nov 21 17:09:40.191582 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1933, in get_rows
[Thu Nov 21 17:09:40.191584 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     row = self._meta.row_class(self, datum)
[Thu Nov 21 17:09:40.191586 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 590, in __init__
[Thu Nov 21 17:09:40.191588 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     self.load_cells()
[Thu Nov 21 17:09:40.191589 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 640, in load_cells
[Thu Nov 21 17:09:40.191591 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_name = table.get_object_display(datum)
[Thu Nov 21 17:09:40.191592 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1835, in get_object_display
[Thu Nov 21 17:09:40.191594 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_key = self.get_object_display_key(datum)
[Thu Nov 21 17:09:40.191595 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] TypeError: get_object_display_key() takes 0 positional arguments but 2 were given
[Thu Nov 21 17:09:40.192967 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Internal Server Error: /project/proxy/
[Thu Nov 21 17:09:40.192978 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] Traceback (most recent call last):
[Thu Nov 21 17:09:40.192980 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1933, in get_rows
[Thu Nov 21 17:09:40.192982 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     row = self._meta.row_class(self, datum)
[Thu Nov 21 17:09:40.192984 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 590, in __init__
[Thu Nov 21 17:09:40.192985 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     self.load_cells()
[Thu Nov 21 17:09:40.192987 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 640, in load_cells
[Thu Nov 21 17:09:40.192989 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_name = table.get_object_display(datum)
[Thu Nov 21 17:09:40.192990 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]   File "/opt/lib/python/site-packages/horizon/tables/base.py", line 1835, in get_object_display
[Thu Nov 21 17:09:40.192992 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596]     display_key = self.get_object_display_key(datum)
[Thu Nov 21 17:09:40.192994 2024] [wsgi:error] [pid 9:tid 124] [remote 208.80.155.117:56596] TypeError: get_object_display_key() takes 0 positional arguments but 2 were given
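For reference, that TypeError is the classic symptom of an override declared without the expected parameters; a minimal standalone illustration (not the actual Horizon or wikimediaproxydashboard code):

# The override takes no parameters, but Python still passes the instance plus
# `datum`, i.e. two positional arguments.
class Table:
    def get_object_display_key():  # missing `self` (and `datum`)
        return "name"

    def get_object_display(self, datum):
        display_key = self.get_object_display_key(datum)
        return getattr(datum, display_key, None)


try:
    Table().get_object_display(object())
except TypeError as error:
    # "get_object_display_key() takes 0 positional arguments but 2 were given"
    print(error)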
Might be related to T380239
From the switch graphs, something made it blip:
Having slow heartbeats:
Slow OSD heartbeats on back (longest 6757.204ms)
Slow OSD heartbeats on front (longest 6091.493ms)