Andrew (Andrew Bogott)
User

Projects (13)

User Details

User Since
Nov 2 2014, 11:35 PM (530 w, 6 d)
Availability
Available
IRC Nick
andrewbogott
LDAP User
Unknown
MediaWiki User
Andrewbogott [ Global Accounts ]

Recent Activity

Fri, Jan 3

Andrew updated subscribers of T382977: Allow Toolforge roots to use the cookbook to reboot k8s worker nodes (without wmcs-root).

So either we need to get toolforge roots automatic access to cloudcuminxxxx hosts, or we need to split out a subset of cookbooks to run on cloud-vps (which I fear we just got done with un-splitting). cc'ing @fnegri who likely has already thought this through.

Fri, Jan 3, 10:15 PM · cloud-services-team, Kubernetes, Cloud-VPS, Toolforge
Andrew created T382957: Clean up horizon/deploy branches.
Fri, Jan 3, 4:15 PM · Horizon, cloud-services-team
Andrew updated the task description for T374129: openstack: consider removing labs-ip-aliaser.
Fri, Jan 3, 3:26 PM · Patch-For-Review, Cloud-VPS, User-aborrero, cloud-services-team

Thu, Jan 2

Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

For those following along with extra energy, here are things I'd be interested in:

Thu, Jan 2, 6:53 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T376400: Redesign wikitech-static.

I worked on this a bit over the break. I'm pretty happy with the static site that httrack produces. It was generated like this:

Thu, Jan 2, 6:44 PM · serviceops-radar, SRE-Unowned, SRE, wikitech.wikimedia.org
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

We're now using the project-local recursor for integration-agent-docker-*. This won't really give us an answer but it will at least provide a bit more diagnostic info.

Thu, Jan 2, 6:38 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Sat, Dec 28

Andrew added a comment to T363901: Project WP1.0/mwoffliner requests Trove instance with 75 GB.

This DB filled up and got stuck; I increased the size (and quota) to 85GB in order to provide room to maneuver and get things unstuck.

Sat, Dec 28, 11:31 PM · affects-Kiwix-and-openZIM, User-aborrero, cloud-services-team, Cloud-VPS (Quota-requests)

Mon, Dec 23

Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

For science (and to be totally sure that this failure is upstream from the integration project), I have made a project-local dns recursor in the Integration project.

Mon, Dec 23, 5:03 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Sat, Dec 21

Andrew added a comment to T378825: Q2:rack/setup/install cloudcephosd2004-dev.

I don't know if that's what fixed it, but I enabled 'hard-disk failover' and bios booting and reimaged and it made it to the end.

Sat, Dec 21, 12:07 AM · SRE, cloud-services-team (Hardware), ops-codfw, DC-Ops

Fri, Dec 20

Andrew added a comment to T378825: Q2:rack/setup/install cloudcephosd2004-dev.

I will have another go!

Fri, Dec 20, 8:53 PM · SRE, cloud-services-team (Hardware), ops-codfw, DC-Ops
Andrew closed T382601: Object storage quota increase request for search project as Resolved.

Done (for 'search')

Fri, Dec 20, 5:54 PM · Wikidata, Cloud-VPS (Quota-requests), Wikidata-Query-Service, Data-Platform-SRE
Andrew claimed T382601: Object storage quota increase request for search project.

Due to a sad but well-known upstream issue (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HENIUNVB475QPYFALDTSMTN67Q32J2X5/) this project probably cannot use object storage. We can definitely create you a new project for use as an object-storage container (or move you to a new project entirely if you prefer).

Fri, Dec 20, 5:25 PM · Wikidata, Cloud-VPS (Quota-requests), Wikidata-Query-Service, Data-Platform-SRE
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

A CI job failed with a DNS failure at 8:13

Fri, Dec 20, 2:15 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

Thank you for pursuing this, @cmooney

Fri, Dec 20, 2:14 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Thu, Dec 19

Andrew added a comment to T378825: Q2:rack/setup/install cloudcephosd2004-dev.

I designated every drive a non-raid drive in the bios and now the install is completing. I can't make it stop installing though, it just keeps installing debian over and over despite repeated attempts at

Thu, Dec 19, 6:44 PM · SRE, cloud-services-team (Hardware), ops-codfw, DC-Ops
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.
2024-12-19 16:33:54.691933412
2024-12-19 16:41:52.098975077
2024-12-19 16:41:57.195007683
Thu, Dec 19, 5:08 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T378825: Q2:rack/setup/install cloudcephosd2004-dev.

This server seems to have a raid controller, which is different from all the other standard ceph OSD nodes. Not sure how that happened but I should be able to work around it -- will update on the task when I get somewhere.

Thu, Dec 19, 3:23 PM · SRE, cloud-services-team (Hardware), ops-codfw, DC-Ops
Andrew placed T382492: Q2:rack/setup/install cloudvirt10[68-76] up for grabs.
Thu, Dec 19, 3:02 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Andrew updated the task description for T382492: Q2:rack/setup/install cloudvirt10[68-76].
Thu, Dec 19, 3:01 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Andrew renamed T382492: Q2:rack/setup/install cloudvirt10[68-76] from Q2:rack/setup/install cloudvirt10[68-74] to Q2:rack/setup/install cloudvirt10[68-76].
Thu, Dec 19, 2:57 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
Andrew updated subscribers of T382492: Q2:rack/setup/install cloudvirt10[68-76].

@Andrew,

Two call outs! The original ordering task had a bad hostname range provided by you for racking "Hostnames: cloudvirt1068 - cloudvirt1066" but that makes no sense as it goes down by 2, and even a swap of the numbers isn't accurate. If we order 9 hosts and start with cloudvirt1068 as a new hostname, then it's cloudvirt1068 to cloudvirt1074 for a total of 9 hosts.

Thu, Dec 19, 2:55 PM · SRE, ops-eqiad, cloud-services-team (Hardware), DC-Ops
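
The host count in the comment above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `expand_hosts` is hypothetical and not part of any WMF tooling. It shows why the task title was renamed from cloudvirt10[68-74] to cloudvirt10[68-76]: an inclusive range starting at 1068 needs to end at 1076 to cover 9 hosts.

```python
def expand_hosts(prefix: str, first: int, last: int) -> list[str]:
    """Expand an inclusive numeric range into hostnames (hypothetical helper)."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


original_title = expand_hosts("cloudvirt", 1068, 1074)
renamed_title = expand_hosts("cloudvirt", 1068, 1076)

print(len(original_title))  # 7 hosts: 1068 through 1074
print(len(renamed_title))   # 9 hosts: 1068 through 1076
```
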
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

The last DNS failures I saw were indeed just over 24 hours ago (13:30 UTC yesterday), but slightly after that Puppet change was merged IIUC (but maybe before it rolled out everywhere?).

Thu, Dec 19, 2:12 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

Using hashar's test

Thu, Dec 19, 1:29 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Wed, Dec 18

Andrew reassigned T380893: decommission cloudcephmon100[1-3].eqiad.wmnet from Andrew to cmooney.
Wed, Dec 18, 9:39 PM · SRE, DC-Ops, ops-eqiad, decommission-hardware, cloud-services-team
Andrew added a comment to T378825: Q2:rack/setup/install cloudcephosd2004-dev.

The two small drives should be mirrored (raid 1) and used for the OS, the larger drives left unformatted for Ceph to manage.

Wed, Dec 18, 5:22 PM · SRE, cloud-services-team (Hardware), ops-codfw, DC-Ops
Andrew closed T229559: CloudVPS: codfw1dev: database backup for clouddb2001-dev.codfw.wmnet, a subtask of T220096: codfw1dev: decide which DBs to reallocate to clouddb2001-dev, as Invalid.
Wed, Dec 18, 4:21 PM · Cloud-VPS, cloud-services-team (Kanban)
Andrew closed T229559: CloudVPS: codfw1dev: database backup for clouddb2001-dev.codfw.wmnet as Invalid.

As passed labtestwikitech, so passed these clouddb hosts.

Wed, Dec 18, 4:21 PM · cloud-services-team, Cloud-VPS
Andrew updated the task description for T382412: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename.
Wed, Dec 18, 2:47 PM · User-aborrero, cloud-services-team (Hardware), SRE, ops-eqiad, Cloud-VPS, DC-Ops
Andrew renamed T382412: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename from Relocate cloudnet1007-dev and cloudnet1008-dev to new racks to Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename.
Wed, Dec 18, 2:46 PM · User-aborrero, cloud-services-team (Hardware), SRE, ops-eqiad, Cloud-VPS, DC-Ops
Andrew added a comment to T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev.

Created T382412 about relocating the cloudnet-dev servers.

Wed, Dec 18, 1:07 PM · cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Andrew created T382412: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename.
Wed, Dec 18, 1:04 PM · User-aborrero, cloud-services-team (Hardware), SRE, ops-eqiad, Cloud-VPS, DC-Ops

Tue, Dec 17

Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

@hashar do the containers in question map the system /etc/resolv.conf such that if I alter it the change will take effect immediately in the containers? Or do I need to alter the container config somehow?

Tue, Dec 17, 10:19 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a subtask for T342455: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev: T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev.
Tue, Dec 17, 6:47 PM · User-aborrero, SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Andrew added a parent task for T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev: T342455: Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev.
Tue, Dec 17, 6:47 PM · cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Andrew created T382356: replace cloudgw100[12] with spare 'second region' dev servers cloudnet100[78]-dev.
Tue, Dec 17, 6:46 PM · cloud-services-team (FY2024/2025-Q1-Q2), Cloud-VPS
Andrew added a comment to T381548: 'backy2 cleanup' fails on cloudbackup1004.

Just setting

Tue, Dec 17, 3:17 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Andrew closed T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu, a subtask of T367287: Update Wikitech's LDAP credentials to be read-only, as Resolved.
Tue, Dec 17, 1:12 PM · Infrastructure-Foundations, cloud-services-team, LDAP, wikitech.wikimedia.org
Andrew closed T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu, a subtask of T371592: LdapAuthentication: Disable extension from Wikitech, as Resolved.
Tue, Dec 17, 1:12 PM · serviceops, Infrastructure-Foundations, cloud-services-team, wikitech.wikimedia.org
Andrew closed T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu as Resolved.

Updated docs look good. Thanks for polishing up this already-done task!

Tue, Dec 17, 1:12 PM · collaboration-services, Infrastructure-Foundations, Bitu
Andrew claimed T376400: Redesign wikitech-static.

Can't promise that I'll finish this task but I'm currently working on making an httrack copy of wikitech

Tue, Dec 17, 1:09 PM · serviceops-radar, SRE-Unowned, SRE, wikitech.wikimedia.org
Andrew added a comment to T381548: 'backy2 cleanup' fails on cloudbackup1004.

indeed, that same postgres crash is happening again

Tue, Dec 17, 2:15 AM · Patch-For-Review, cloud-services-team, Cloud-VPS

Mon, Dec 16

Andrew added a comment to T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu.

@SLyngshede-WMF one last little thing, can you please update the docs at https://office.wikimedia.org/wiki/Security/LDAP#Disabling_a_Non-Staff/Non-NDA_User_for_Production for what I presume is a new slightly different workflow? thank you!

The capability is part of Bitu, but initially was only enabled for these users: anticomposite, deltaquad, urbanecm, jjmc89 (stewards), bd808, Simon, myself and Taavi.

We should also add a few more people from Wikimedia Cloud Services, can you sort out with the team who to add?

Mon, Dec 16, 9:27 PM · collaboration-services, Infrastructure-Foundations, Bitu
Andrew closed T381545: SystemdUnitDown The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1002-dev has been failing for more than two hours. as Resolved.

this seems to be working now

Mon, Dec 16, 7:23 PM · cloud-services-team
Andrew reopened T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu, a subtask of T367287: Update Wikitech's LDAP credentials to be read-only, as Open.
Mon, Dec 16, 7:15 PM · Infrastructure-Foundations, cloud-services-team, LDAP, wikitech.wikimedia.org
Andrew reopened T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu, a subtask of T371592: LdapAuthentication: Disable extension from Wikitech, as Open.
Mon, Dec 16, 7:15 PM · serviceops, Infrastructure-Foundations, cloud-services-team, wikitech.wikimedia.org
Andrew reopened T359820: Developer Account Blocking: Migrate the one-stop Developer (un)Blocking from Wikitech to Bitu as "Open".

@SLyngshede-WMF one last little thing, can you please update the docs at https://office.wikimedia.org/wiki/Security/LDAP#Disabling_a_Non-Staff/Non-NDA_User_for_Production for what I presume is a new slightly different workflow? thank you!

Mon, Dec 16, 7:14 PM · collaboration-services, Infrastructure-Foundations, Bitu
Andrew updated subscribers of T382220: KernelError Server cloudgw1002 may have kernel errors.

This is no longer happening, but is concerning! I'm mentally filing it in the same box as T374830 because it's one more reason I don't totally trust our network setup.

Mon, Dec 16, 6:38 PM · cloud-services-team (FY2024/2025-Q1-Q2)
Andrew added a comment to T378828: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g.

Hi @Jclark-ctr -- ideally I would consult with David about racking details but he's out until the first. Are these servers already gathering dust or do we have a while before they show up?

Mon, Dec 16, 5:27 PM · DC-Ops
Andrew renamed T381548: 'backy2 cleanup' fails on cloudbackup1004 from 'backy2 cleanup' fails on backy2 cleanup to 'backy2 cleanup' fails on cloudbackup1004.
Mon, Dec 16, 5:06 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Andrew claimed T381508: Subdomain for catalyst-dev project.

*now working correctly :D

Mon, Dec 16, 4:43 PM · User-bd808, Cloud-VPS (Quota-requests), cloud-services-team (FY2024/2025-Q1-Q2)
Andrew closed T381293: Too many puppet facts on toolforge k8s workers as Resolved.

This warning is no longer displayed, and having lots of facts doesn't seem to actually break anything.

Mon, Dec 16, 3:19 PM · Toolforge, cloud-services-team, Puppet
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

@bd808 suspects that this is a conntrack table issue, which fits the pattern of the failure. A table overflow should show up in logs, though, and I can't turn up anything with greps e.g.

Mon, Dec 16, 12:38 AM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Wed, Dec 11

Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

@Lucas_Werkmeister_WMDE I don't think it's necessary to post more examples, we're able to reproduce the issue albeit not with regularity. Most of the SREs are at an offsite and/or out sick this week so progress towards a fix will not be super fast unfortunately.

Wed, Dec 11, 3:24 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew closed T381745: Trove DB full as Resolved.

awesome

Wed, Dec 11, 10:57 AM · Cloud-VPS (Quota-requests), cloud-services-team
Andrew updated subscribers of T380893: decommission cloudcephmon100[1-3].eqiad.wmnet.

These hosts have a somewhat unusual vlan setup, so my guess is something is tripping on that -- paging @cmooney for manual cleanup.

Wed, Dec 11, 10:30 AM · SRE, DC-Ops, ops-eqiad, decommission-hardware, cloud-services-team
Andrew added a comment to T376267: ☂ Wikitech account linking and SUL error reporting.

I was just looking over @Sarai-WMF's shoulder and wikitech is for some reason not editable for her.

What exactly does this mean? Is there some permission error on the edit/"view source" pages?

Wed, Dec 11, 10:28 AM · wikitech.wikimedia.org
Andrew added a comment to T381959: Trove volume resize doesnt always (ever?) work.

Oh, I should add that there was a roadbump with running resize2fs because the volume was mounted both on the VM OS level and also within the docker container. Had to unmount and also kill the db container before resizing.

Wed, Dec 11, 10:27 AM · Cloud-VPS, cloud-services-team
Andrew placed T381959: Trove volume resize doesnt always (ever?) work up for grabs.
Wed, Dec 11, 10:27 AM · Cloud-VPS, cloud-services-team
Andrew created T381959: Trove volume resize doesnt always (ever?) work.
Wed, Dec 11, 10:26 AM · Cloud-VPS, cloud-services-team
Andrew added a comment to T381745: Trove DB full.

@Magnus sorry for the hangup, looks like Trove automation isn't great at this yet. I did a bit of hands-on repair and I think you're good to go now (at least, all the status flags say that things are up and working). Do things look good from your end?

Wed, Dec 11, 10:18 AM · Cloud-VPS (Quota-requests), cloud-services-team

Tue, Dec 10

Andrew updated subscribers of T376267: ☂ Wikitech account linking and SUL error reporting.

I was just looking over @Sarai-WMF's shoulder and wikitech is for some reason not editable for her. I gave her the confirmed user right and that didn't change anything.

Tue, Dec 10, 5:16 PM · wikitech.wikimedia.org
Andrew added a comment to T380479: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290.

Timing is flexible although I'd like to do a graceful shutdown and check after the fact. Can this wait until next week when I'm back from the offsite? If you'd like to do it sooner just please do a graceful on-console shutdown of the server and then ping me here when it's back up. Thanks!

Tue, Dec 10, 3:34 PM · SRE, DC-Ops, ops-codfw, cloud-services-team
Andrew added a comment to T381745: Trove DB full.

Resizing may or may not be graceful but has worked for me recently

Tue, Dec 10, 2:22 PM · Cloud-VPS (Quota-requests), cloud-services-team

Fri, Dec 6

Andrew added a comment to T381639: Facter 4 upgrade removed 'mountpoints' fact, breaking cinderutils::ensure.

Not directly related but likely to suffer fallout from this one way or another: T381293

Fri, Dec 6, 4:50 PM · Infrastructure-Foundations, Puppet-Core, cloud-services-team, Cloud-VPS

Dec 5 2024

Andrew added a comment to T348259: Bad credentials for tools.

@taavi -- for future reference, regenerating the config file just consists of deleting replica.my.cnf on the nfs server and waiting for it to be dropped back in place by maintain-dbusers? Or were there more steps?

Dec 5 2024, 3:32 PM · cloud-services-team, Data-Services
Andrew triaged T380099: Audit WMCS compute capacity as Medium priority.
Dec 5 2024, 3:29 PM · Cloud-VPS, cloud-services-team
Andrew triaged T380339: puppet: partman comments in cephosd.cfg are misleading as Low priority.
Dec 5 2024, 3:29 PM · Ceph, Cloud-VPS, cloud-services-team
Andrew triaged T380479: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290 as Medium priority.
Dec 5 2024, 3:29 PM · SRE, DC-Ops, ops-codfw, cloud-services-team
Andrew renamed T380902: Increase kubernetes quota for tools.multichill from Increase kurbernetes quota for tools.multichill to Increase kubernetes quota for tools.multichill.
Dec 5 2024, 3:28 PM · Toolforge (Quota-requests), cloud-services-team
Andrew triaged T381418: Future catalyst cloud-vps usage as Medium priority.
Dec 5 2024, 3:28 PM · Catalyst, Cloud-VPS, cloud-services-team
Andrew triaged T381419: Future testing-infra growth on cloud-vps as Medium priority.
Dec 5 2024, 3:28 PM · collaboration-services, Continuous-Integration-Infrastructure, QTE-TestingOverview, GitLab (CI & Job Runners), Cloud-VPS, cloud-services-team
Andrew triaged T381420: Future growth of deployment-prep? as Medium priority.
Dec 5 2024, 3:27 PM · Beta-Cluster-Infrastructure, Cloud-VPS, cloud-services-team
Andrew triaged T381499: Upgrade cloud-vps openstack to version 'Dalmation' as Medium priority.
Dec 5 2024, 3:06 PM · Cloud-VPS, cloud-services-team
Andrew claimed T381545: SystemdUnitDown The systemd unit remove_dangling_cinder_snapshots.service on node cloudbackup1002-dev has been failing for more than two hours..
Dec 5 2024, 3:06 PM · cloud-services-team
Andrew triaged T381548: 'backy2 cleanup' fails on cloudbackup1004 as Medium priority.
Dec 5 2024, 2:50 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Andrew added a comment to T380902: Increase kubernetes quota for tools.multichill.

Hello again @Multichill -- this request is stalled pending your response.

Dec 5 2024, 2:33 PM · Toolforge (Quota-requests), cloud-services-team
Andrew added a comment to T381548: 'backy2 cleanup' fails on cloudbackup1004.

It completed.

Dec 5 2024, 1:24 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
Andrew created T381548: 'backy2 cleanup' fails on cloudbackup1004.
Dec 5 2024, 4:49 AM · Patch-For-Review, cloud-services-team, Cloud-VPS

Dec 4 2024

Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.
sudo cumin --force O{name:"canary*"} 'set -e; for i in {1..10000}; do dig +short @172.20.255.1 gerrit.wikimedia.org; done'
Dec 4 2024, 9:28 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
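
The dig loop above repeats one lookup many times to catch transient failures. A rough Python equivalent is sketched below, under stated assumptions: it uses the system resolver through `socket.gethostbyname` rather than querying 172.20.255.1 directly as dig does, and the function name `probe_dns`, the `slow_after` threshold, and the injectable `resolve` parameter are all hypothetical choices made for illustration.

```python
import socket
import time
from typing import Callable


def probe_dns(
    name: str,
    attempts: int,
    slow_after: float = 1.0,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> dict:
    """Resolve `name` repeatedly and count ok, slow, and failed lookups."""
    stats = {"ok": 0, "slow": 0, "failed": 0}
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolve(name)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            stats["failed"] += 1
            continue
        elapsed = time.monotonic() - start
        if elapsed > slow_after:
            stats["slow"] += 1
        else:
            stats["ok"] += 1
    return stats
```

Calling `probe_dns("gerrit.wikimedia.org", 10000)` from a VM would give a per-host failure count comparable in spirit to the cumin run. Passing a different `resolve` callable makes it possible to exercise the counting logic without network access.
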
Andrew created P71552 facts from tools-k8s-worker-nfs-14 -- T381293.
Dec 4 2024, 5:23 PM · Puppet, Cloud-VPS
Andrew updated the task description for T380893: decommission cloudcephmon100[1-3].eqiad.wmnet.
Dec 4 2024, 5:13 PM · SRE, DC-Ops, ops-eqiad, decommission-hardware, cloud-services-team
Andrew updated the task description for T380893: decommission cloudcephmon100[1-3].eqiad.wmnet.
Dec 4 2024, 3:30 PM · SRE, DC-Ops, ops-eqiad, decommission-hardware, cloud-services-team
Andrew created T381499: Upgrade cloud-vps openstack to version 'Dalmation'.
Dec 4 2024, 2:20 PM · Cloud-VPS, cloud-services-team
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

The primary/reported issue is failing dns lookups. I can reproduce that from a VM with e.g.

Dec 4 2024, 1:52 AM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)

Dec 3 2024

Andrew placed T380499: Q2:rack/setup/install cloudcontrol1011 up for grabs.
Dec 3 2024, 9:12 PM · SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops
Andrew added a comment to T373878: openstack: fix missing prometheus metrics.

I've replaced a lot of these metrics, but maybe not all of them. @aborrero can you tell me how to reproduce the panel in the screenshot?

Dec 3 2024, 8:58 PM · Cloud-VPS, cloud-services-team
Andrew reassigned T379927: Puppet removed "nameserver" line from /etc/resolv.conf from Andrew to ssingh.

I've checked all the resolv.confs and they all look fine. I'm passing this task over to @ssingh to review what fnegri and dcaro determined about the low-level puppet resolv functions... I don't see obvious low-level ways to prevent this sort of thing happening in the future but it's worth digging deeper.

Dec 3 2024, 8:56 PM · Puppet, Infrastructure-Foundations, cloud-services-team, Cloud-VPS
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

The set of 'slow' hypervisors includes cloudvirt1033 which is currently drained and running only a canary. So that rules out noisy neighbor issues within a cloudvirt.

Dec 3 2024, 7:44 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

I was hoping the slow lookups would correspond to rack or row, but it seems not.

Dec 3 2024, 7:37 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.
andrew@bastion-restricted-eqiad1-3:~$ mtr -b -w -c 1000 172.20.255.1
Start: 2024-12-03T18:00:13+0000
HOST: bastion-restricted-eqiad1-3                                          Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- cloudinstances2b-gw.svc.eqiad.wmflabs (172.16.0.1)                    0.0%  1000    0.2   0.2   0.2   1.8   0.1
  2.|-- vlan1107.cloudgw1002.eqiad1.wikimediacloud.org (185.15.56.234)        0.9%  1000    0.3   0.3   0.2 217.8   6.9
  3.|-- irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org (185.15.56.243)  1.1%  1000    2.7   7.0   0.8 104.8  10.4
  4.|-- ns-recursor.openstack.eqiad1.wikimediacloud.org (172.20.255.1)        0.7%  1000    0.4  22.7   0.2 918.4 128.4
Dec 3 2024, 7:25 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew created P71503 dns lookup times from cloudvirt canary VMs, one per cloudvirt.
Dec 3 2024, 7:22 PM · Cloud-VPS
Andrew added a comment to T381419: Future testing-infra growth on cloud-vps.

We have 18 integration-agent-docker instances each at 24G of RAM (432G). If we wanted to fit 4 builds of 12G in memory, that would be 48G more per host or a 864G increase for a total usage of 1296G. Maybe it is overkill, but maybe we can do that on a couple of hosts and see whether it might be worth it.

Dec 3 2024, 5:51 PM · collaboration-services, Continuous-Integration-Infrastructure, QTE-TestingOverview, GitLab (CI & Job Runners), Cloud-VPS, cloud-services-team
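
The memory estimate in the T381419 comment above can be reproduced step by step. The sketch below uses only the figures quoted in that comment; the variable names are made up for illustration.

```python
agents = 18            # integration-agent-docker instances
current_gb = 24        # RAM per instance today
builds_per_agent = 4   # builds to hold in memory at once
build_gb = 12          # RAM per build

current_total = agents * current_gb            # 18 * 24
extra_per_agent = builds_per_agent * build_gb  # 4 * 12
increase = agents * extra_per_agent            # 18 * 48
new_total = current_total + increase

print(current_total)    # 432
print(extra_per_agent)  # 48
print(increase)         # 864
print(new_total)        # 1296
```
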
Andrew created T381420: Future growth of deployment-prep?.
Dec 3 2024, 5:18 PM · Beta-Cluster-Infrastructure, Cloud-VPS, cloud-services-team
Andrew created T381419: Future testing-infra growth on cloud-vps.
Dec 3 2024, 5:14 PM · collaboration-services, Continuous-Integration-Infrastructure, QTE-TestingOverview, GitLab (CI & Job Runners), Cloud-VPS, cloud-services-team
Andrew created T381418: Future catalyst cloud-vps usage.
Dec 3 2024, 5:08 PM · Catalyst, Cloud-VPS, cloud-services-team
Andrew updated the task description for T380099: Audit WMCS compute capacity.
Dec 3 2024, 5:00 PM · Cloud-VPS, cloud-services-team
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

I had some hope that work on T381373 would help with this but it doesn't seem to. I'm still seeing ping hesitations today:

Dec 3 2024, 4:03 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)
Andrew added a project to T380479: PowerSupplyFailure Power Supply - Status - issue on cloudbackup2003:9290: ops-codfw.

This has been recurring for some time (e.g. T368211) so probably needs DC attention. @Jhancock.wm, it's OK to power down this system if necessary, with a small bit of notice to the wmcs team.

Dec 3 2024, 2:39 PM · SRE, DC-Ops, ops-codfw, cloud-services-team

Dec 2 2024

Andrew created T381293: Too many puppet facts on toolforge k8s workers.
Dec 2 2024, 5:19 PM · Toolforge, cloud-services-team, Puppet
Andrew added a comment to T374830: Various CI jobs running in the integration Cloud VPS project failing due to transient DNS lookup failures, often for our own hosts such as gerrit.wikimedia.org.

This is possibly related to https://phabricator.wikimedia.org/T381078 although the timing doesn't really line up

Dec 2 2024, 4:31 PM · Patch-For-Review, User-aborrero, Cloud-VPS, cloud-services-team, Continuous-Integration-Infrastructure, Release-Engineering-Team (Seen), User-brennen, ci-test-error (WMF-deployed Build Failure)