User Details
- User Since: Nov 2 2014, 11:35 PM (530 w, 6 d)
- Availability: Available
- IRC Nick: andrewbogott
- LDAP User: Unknown
- MediaWiki User: Andrewbogott
Fri, Jan 3
So either we need to get toolforge roots automatic access to cloudcuminxxxx hosts, or we need to split out a subset of cookbooks to run on cloud-vps (which I fear we just got done with un-splitting). cc'ing @fnegri who likely has already thought this through.
Thu, Jan 2
For those following along with extra energy, here are things I'd be interested in:
I worked on this a bit over the break. I'm pretty happy with the static site that httrack produces. It was generated like this:
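The invocation was along these lines (the output directory and filter shown here are illustrative, not necessarily the exact options used):

    # hedged sketch: mirror wikitech into a local static copy, staying on the same host
    httrack "https://wikitech.wikimedia.org/" -O ./wikitech-static \
        "+https://wikitech.wikimedia.org/*" -v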
We're now using the project-local recursor for integration-agent-docker-*. This won't really give us an answer but it will at least provide a bit more diagnostic info.
Sat, Dec 28
This DB filled up and got stuck; I increased the size (and quota) to 85GB in order to provide room to maneuver and get things unstuck.
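For the record, the resize itself is roughly the following (INSTANCE is a placeholder; the quota bump was a separate step on the project's volume quota):

    # rough sketch, assuming the trove plugin for the openstack CLI is installed
    openstack database instance resize volume INSTANCE 85
    # legacy client equivalent: trove resize-volume INSTANCE 85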
Mon, Dec 23
For science (and to be totally sure that this failure is upstream from the integration project), I have made a project-local DNS recursor in the Integration project.
Sat, Dec 21
I don't know if that's what fixed it, but I enabled 'hard-disk failover' and BIOS booting, reimaged, and it made it to the end.
Fri, Dec 20
I will have another go!
Done (for 'search')
Due to a sad but well-known upstream issue (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HENIUNVB475QPYFALDTSMTN67Q32J2X5/) this project probably cannot use object storage. We can definitely create a new project for you to use as an object-storage container (or move you to a new project entirely if you prefer).
A CI job failed with a DNS failure at 8:13
Thank you for pursuing this, @cmooney
Thu, Dec 19
I designated every drive as a non-RAID drive in the BIOS and now the install is completing. I can't make it stop installing, though; it just keeps installing Debian over and over despite repeated attempts at
2024-12-19 16:33:54.691933412
2024-12-19 16:41:52.098975077
2024-12-19 16:41:57.195007683
This server seems to have a RAID controller, which is different from all the other standard Ceph OSD nodes. Not sure how that happened, but I should be able to work around it -- I'll update the task when I get somewhere.
Using hashar's test
Wed, Dec 18
The two small drives should be mirrored (RAID 1) and used for the OS, with the larger drives left unformatted for Ceph to manage.
As passed labtestwikitech, so passed these clouddb hosts.
Created T382412 about relocating the cloudnet-dev servers.
Tue, Dec 17
@hashar do the containers in question map the system /etc/resolv.conf such that if I alter it the change will take effect immediately in the containers? Or do I need to alter the container config somehow?
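One quick way to check, for reference (CONTAINER is a placeholder name):

    # hedged sketch: show which resolv.conf file the container actually uses,
    # and whether it matches the host's current file
    docker inspect -f '{{ .ResolvConfPath }}' CONTAINER
    diff "$(docker inspect -f '{{ .ResolvConfPath }}' CONTAINER)" /etc/resolv.conf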
Just setting
Updated docs look good. Thanks for polishing up this already-done task!
Can't promise that I'll finish this task but I'm currently working on making an httrack copy of wikitech
indeed, that same postgres crash is happening again
Mon, Dec 16
this seems to be working now
@SLyngshede-WMF one last little thing, can you please update the docs at https://office.wikimedia.org/wiki/Security/LDAP#Disabling_a_Non-Staff/Non-NDA_User_for_Production for what I presume is a new slightly different workflow? thank you!
This is no longer happening, but is concerning! I'm mentally filing it in the same box as T374830 because it's one more reason I don't totally trust our network setup.
Hi @Jclark-ctr -- ideally I would consult with David about racking details but he's out until the first. Are these servers already gathering dust or do we have a while before they show up?
*now working correctly :D
This warning is no longer displayed, and having lots of facts doesn't seem to actually break anything.
@bd808 suspects that this is a conntrack table issue, which fits the pattern of the failure. A table overflow should show up in logs, though, and I can't turn up anything with greps, e.g.
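Roughly this sort of thing (log locations and tool availability on the host are assumed):

    # hedged sketch: look for conntrack overflow messages and compare table usage to the limit
    journalctl -k | grep -i nf_conntrack              # kernel messages via journald
    dmesg -T | grep -i 'table full'                   # e.g. "nf_conntrack: table full, dropping packet"
    sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max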
Wed, Dec 11
@Lucas_Werkmeister_WMDE I don't think it's necessary to post more examples; we're able to reproduce the issue, albeit not with any regularity. Most of the SREs are at an offsite and/or out sick this week, so progress towards a fix will unfortunately not be super fast.
awesome
These hosts have a somewhat unusual vlan setup, so my guess is something is tripping on that -- paging @cmooney for manual cleanup.
Oh, I should add that there was a speed bump with running resize2fs: the volume was mounted both at the VM OS level and within the docker container. I had to unmount it and also stop the db container before resizing.
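For posterity, the workaround was roughly this (device, mountpoint, and container names are placeholders):

    # rough sketch of the unmount-and-resize dance; names are placeholders
    docker stop dbcontainer          # stop the database container holding the volume open
    umount /srv/dbvolume             # unmount from the VM side
    e2fsck -f /dev/sdb               # resize2fs wants a clean check on an unmounted fs
    resize2fs /dev/sdb               # grow the filesystem to fill the enlarged volume
    mount /srv/dbvolume              # remount (assuming an fstab entry) and restart
    docker start dbcontainer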
@Magnus sorry for the holdup; it looks like Trove automation isn't great at this yet. I did a bit of hands-on repair and I think you're good to go now (at least, all the status flags say that things are up and working). Do things look good from your end?
Tue, Dec 10
I was just looking over @Sarai-WMF's shoulder and wikitech is for some reason not editable for her. I gave her the confirmed user right and that didn't change anything.
Timing is flexible although I'd like to do a graceful shutdown and check after the fact. Can this wait until next week when I'm back from the offsite? If you'd like to do it sooner just please do a graceful on-console shutdown of the server and then ping me here when it's back up. Thanks!
Resizing may or may not be graceful but has worked for me recently
Fri, Dec 6
Not directly related but likely to suffer fallout from this one way or another: T381293
Dec 5 2024
@taavi -- for future reference, regenerating the config file just consists of deleting replica.my.cnf on the NFS server and waiting for it to be dropped back in place by maintain-dbusers? Or were there more steps?
Hello again @Multichill -- this request is stalled pending your response.
It completed.
Dec 4 2024
sudo cumin --force O{name:"canary*"} 'set -e; for i in {1..10000}; do dig +short @172.20.255.1 gerrit.wikimedia.org; done'
The primary/reported issue is failing DNS lookups. I can reproduce that from a VM with e.g.
Dec 3 2024
I've replaced a lot of these metrics, but maybe not all of them. @aborrero can you tell me how to reproduce the panel in the screenshot?
I've checked all the resolv.confs and they all look fine. I'm passing this task over to @ssingh to review what fnegri and dcaro determined about the low-level Puppet resolv functions... I don't see obvious low-level ways to prevent this sort of thing from happening in the future, but it's worth digging deeper.
The set of 'slow' hypervisors includes cloudvirt1033, which is currently drained and running only a canary. So that rules out noisy-neighbor issues within a cloudvirt.
I was hoping the slow lookups would correspond to rack or row, but it seems not.
andrew@bastion-restricted-eqiad1-3:~$ mtr -b -w -c 1000 172.20.255.1
Start: 2024-12-03T18:00:13+0000
HOST: bastion-restricted-eqiad1-3                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs (172.16.0.1)                   0.0%  1000    0.2   0.2   0.2   1.8   0.1
  2.|-- vlan1107.cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.234)       0.9%  1000    0.3   0.3   0.2 217.8   6.9
  3.|-- irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org (185.15.56.243) 1.1%  1000    2.7   7.0   0.8 104.8  10.4
  4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)       0.7%  1000    0.4  22.7   0.2 918.4 128.4
I had some hope that work on T381373 would help with this but it doesn't seem to. I'm still seeing ping hesitations today:
This has been recurring for some time (e.g. T368211) so probably needs DC attention. @Jhancock.wm, it's OK to power down this system if necessary, with a small bit of notice to the wmcs team.
Dec 2 2024
This is possibly related to https://phabricator.wikimedia.org/T381078 although the timing doesn't really line up