[cumin1001:~] $ sudo cookbook sre.hosts.decommission -t T274023 mwdebug1002.eqiad.wmnet
START - Cookbook sre.hosts.decommission
>>> ATTENTION: destructive action for 1 hosts: mwdebug1002.eqiad.wmnet
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
----- OUTPUT of 'cd /var/lib/git/...46[^0-9A-Za-z])'' -----
conftool-data/node/eqiad.yaml: mwdebug1002.eqiad.wmnet: [apache2]
modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200: fixed-address mwdebug1002.eqiad.wmnet;
modules/profile/files/trafficserver/x-wikimedia-debug-routing.lua: ["mwdebug1002.eqiad.wmnet"] = "mwdebug1002.eqiad.wmnet",
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.23hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /var/lib/git/...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
----- OUTPUT of 'cd /srv/private ...46[^0-9A-Za-z])'' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.72hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/private ...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
----- OUTPUT of 'cd /srv/mediawik...46[^0-9A-Za-z])'' -----
debug.json: "mwdebug1002.eqiad.wmnet",
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.16s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/mediawik...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for Kerberos credentials on KDC kadmin node.
----- OUTPUT of 'find /srv/kerber...02.eqiad.wmnet*"' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.79hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'find /srv/kerber...02.eqiad.wmnet*"'.
----- OUTPUT of '/usr/local/sbin/...02.eqiad.wmnet*"' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.88hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...02.eqiad.wmnet*"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
No Kerberos credentials found.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['mwdebug1002.eqiad.wmnet']
----- OUTPUT of 'icinga-downtime ...n1001 - T274023"' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 2.45hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...n1001 - T274023"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Downtimed host on Icinga
Found Ganeti VM
Shutting down VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet
----- OUTPUT of 'gnt-instance shu...1002.eqiad.wmnet' -----
Waiting for job 1134523 for mwdebug1002.eqiad.wmnet ...
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.66s/hosts]
FAIL | | 0% (0/1) [00:10<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance shu...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM shutdown
----- OUTPUT of 'systemctl start ...iad_sync.service' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.87hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Sleeping for 20s to avoid race conditions...
Removed host mwdebug1002.eqiad.wmnet from Debmonitor
Removed from DebMonitor
----- OUTPUT of 'puppet node clea...1002.eqiad.wmnet' -----
Notice: Revoked certificate with serial 2340
Notice: Revoked certificate with serial 3962
Notice: Revoked certificate with serial 5498
mwdebug1002.eqiad.wmnet
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.96s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1002.eqiad.wmnet'.
----- OUTPUT of 'puppet node deac...1002.eqiad.wmnet' -----
Submitted 'deactivate node' for mwdebug1002.eqiad.wmnet with UUID db4a3b35-c781-4438-ae8e-618a1689d227
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.83s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet master and PuppetDB
Issuing Ganeti remove command, it can take up to 15 minutes...
Removing VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet. This may take a few minutes.
----- OUTPUT of 'gnt-instance rem...1002.eqiad.wmnet' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00, 5.50s/hosts]
FAIL | | 0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance rem...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM removed
----- OUTPUT of 'systemctl start ...iad_sync.service' -----
================
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00, 1.93hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data.
It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-02-12 22:56:23,745 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-02-12 23:02:15,871 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect
    self._collect_device(device, True)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 197, in _collect_device
    if self.addresses[primary.id].dns_name:
KeyError: 6398
================
PASS | | 0% (0/1) [05:53<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████| 100% (1/1) [05:53<00:00, 353.00s/hosts]
100.0% (1/1) of nodes failed to execute command 'cd /tmp && runus...e asset tag one"': netbox1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'cd /tmp && runus...e asset tag one"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 365, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 73, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T274023
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
Details
| Subject | Repo | Branch | Lines +/- |
| --- | --- | --- | --- |
| sre.hosts.decommission: temporary fix for Netbox | operations/cookbooks | master | +2 -0 |
Event Timeline
Correct me if I'm wrong, but is this the VM that is replacing another VM of the same name? There may be some assumptions in Netbox about how such things are connected.
It's just an attempt to remove an existing VM (after reimaging it, the existing VM did not come back from reboot).
Mentioned in SAL (#wikimedia-operations) [2021-02-13T00:08:32Z] <mutante> ganeti1011 - manually deleting VM mwdebug1002 - T274689 T274023
Mentioned in SAL (#wikimedia-operations) [2021-02-13T00:26:49Z] <mutante> ganeti - attempting to recreate VM mwdebug1002 with cookbook that was previously deleted manually (T274689 T274023)
summary:
- existing VM, actually very old (from 2016), works fine
- tried to install a new distro version on it, which was no problem on other VMs in codfw, but this one simply did not come back from the reboot
- checked the console: nothing; gnt-instance says the status is UP, Icinga disagrees and says it is down, and I can't ssh to it
- tried to restart it again: nothing on the console, nothing happens
- decided to just delete it, using the decom script I am supposed to use for removing VMs
- the decom script fails with the errors reported here
- manually deleted it with gnt-instance, to then attempt to recreate it with makevm under the same name
- the makevm cookbook suggests adding a public IP even though I said I need a private one:
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 4 --disk 50 --network private eqiad_A mwdebug1002.eqiad.wmnet
+mwdebug1002 1H IN A 208.80.154.6
+mwdebug1002 1H IN AAAA 2620:0:861:1:208:80:154:6
+mwdebug1002 1H IN AAAA 2620:0:861:101:10:64:0:46
I ABORT because that is clearly wrong; it's not supposed to have a public IP.
I am stuck now on how I should properly resolve this.
The last part about the public IP might have just been due to the ordering of the parameters I passed to makevm... give me a minute, trying that one more time.
Nope, it wasn't. It is trying to assign a public IP again.
Cas is cleaning up Netbox and we will try it again.
+mwdebug1002 1H IN A 10.64.0.93
+mwdebug1002 1H IN A 208.80.154.6
+mwdebug1002 1H IN AAAA 2620:0:861:1:208:80:154:6
+mwdebug1002 1H IN AAAA 2620:0:861:101:10:64:0:46
+mwdebug1002 1H IN AAAA 2620:0:861:101:10:64:0:93
 mwdebug1003 1H IN A 10.64.32.9
 mwdebug1003 1H IN AAAA 2620:0:861:103:10:64:32:9
I've successfully gotten makevm to work as expected after deleting the IP addresses for mwdebug1002 that were left behind when the DNS generation failed during the decom step. That code failure in the DNS generation may be some VM/physical confusion, but it's not obvious from the code why it would have happened.
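For illustration only (this is not the actual netbox-extras code), a minimal sketch of the pattern that can produce a "KeyError: 6398" like the one in the traceback above, assuming the script builds an id-indexed address map and then dereferences each device's primary IP by id:

```python
# Illustrative sketch only, not the actual generate_dns_snippets.py code: it
# models how a device whose primary-IP id is missing from the collected
# address map would trigger a KeyError like the one in the traceback.

addresses = {
    # address id -> record, as gathered from the Netbox API
    6397: {"address": "10.64.0.46/22", "dns_name": "mwdebug1001.eqiad.wmnet"},
    # id 6398 is absent, e.g. because the IP was removed while collection ran
}

devices = [
    {"name": "mwdebug1002", "primary_ip_id": 6398},  # stale primary-IP reference
]

for device in devices:
    primary_id = device["primary_ip_id"]
    record = addresses.get(primary_id)
    if record is None:
        # A plain `addresses[primary_id]` here (the shape of the line in
        # _collect_device) raises KeyError; a guard like this would instead
        # skip or report the inconsistent device.
        print(f"skipping {device['name']}: primary IP {primary_id} not collected")
        continue
    if record["dns_name"]:
        print(f"{record['dns_name']} -> {record['address']}")
```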
To be clear about the timeline:
- Daniel attempted to reimage the ganeti host, which failed
- Attempted to reboot the ganeti host, which failed
- Attempted to decom the ganeti host, which mostly worked (the host was removed, but DNS generation failed)
- Attempted to create a new ganeti host under the same name, which added a public IP address for some reason and was aborted at DNS generation due to the weird diff above
- Repeated the attempt, which was aborted for the same reason
- I removed the leftover IP addresses from Netbox and reattempted makevm, which worked (a sketch of that kind of cleanup follows this list)
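Purely for illustration, and not necessarily how the cleanup was actually performed: a minimal pynetbox sketch of finding and deleting leftover IP addresses whose DNS name still points at the removed VM (the URL and token are placeholders):

```python
# Hypothetical cleanup sketch using pynetbox; the URL and token are placeholders.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")

# Find IP addresses still carrying the decommissioned VM's DNS name...
leftover = nb.ipam.ip_addresses.filter(dns_name="mwdebug1002.eqiad.wmnet")

# ...and delete them so a later makevm run can allocate fresh ones.
for ip in leftover:
    print(f"deleting {ip.address} ({ip.dns_name})")
    ip.delete()
```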
So I think Daniel is unblocked, but there are open questions:
- Why did decom fail for this box?
- Why would makevm pick a public address if the private address already existed, and why would it do this at all if --network private is passed? Or did that actually happen?
There's also a notable UX issue with makevm: when you are prompted to review the diff from the DNS generation, aborting does not actually clean up the changes made up to that point, so the addresses have already been allocated whether you like it or not, which is not the expected behavior.
This is also not an expected use case (basically recreating the same box), but I think it's a reasonable one, and currently it may always require some manual cleanup of leftover IPs. It might be better if makevm could optionally create a new VM reusing existing IPs, with safeguards such as checking that the IP is not currently assigned to any VM and that no Ganeti host exists under the requested name; that seems a reasonable compromise. A rough sketch of what such safeguards could look like follows.
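For what it's worth, this is only a sketch under assumptions, not existing makevm/spicerack code; pynetbox, the URL/token, the short-name convention, and the exact field names are all assumptions:

```python
# Hypothetical safeguard sketch: only allow reusing an existing Netbox IP for a
# new VM if no device/VM already uses the requested name and the IP is not
# assigned to any interface. Not actual cookbook code.
import pynetbox

def can_reuse_ip(nb, fqdn, address):
    """Return True if `address` can safely be reused for a VM named after `fqdn`."""
    name = fqdn.split(".")[0]  # assumes Netbox stores short names, e.g. "mwdebug1002"
    if nb.virtualization.virtual_machines.get(name=name):
        return False
    if nb.dcim.devices.get(name=name):
        return False
    ip = nb.ipam.ip_addresses.get(address=address)
    # `assigned_object_id` is the field name in recent Netbox versions; older
    # versions expose `interface` instead.
    return ip is not None and ip.assigned_object_id is None

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")
print(can_reuse_ip(nb, "mwdebug1002.eqiad.wmnet", "10.64.0.93/22"))
```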
Yes, adding a revert of the Netbox changes on failure in the makevm cookbook was already on the TODO list; I didn't check whether it already has a task, but it is something we should definitely do.
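As a rough illustration of that revert-on-failure idea (a hedged sketch, not the planned implementation; the helper names are made up):

```python
# Hypothetical rollback pattern: remember which Netbox IP records a run
# allocated and delete them again if a later step fails or the operator aborts
# at the DNS diff prompt.

def run_with_netbox_rollback(allocate_ips, run_remaining_steps):
    allocated = allocate_ips()      # assumed to return the allocated Netbox IP records
    try:
        run_remaining_steps()       # DNS diff review, Ganeti VM creation, ...
    except BaseException:           # includes an operator abort
        for ip in allocated:
            ip.delete()             # revert the Netbox side before re-raising
        raise
```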
@crusnov Anything I should do for this task?
For the record, yes, I am unblocked, and this issue did NOT show up when I reimaged mwdebug1001, simply because that VM came back from the reboot as expected, so nothing triggered an attempt to delete a VM.
The question that remains, though, is whether this will happen again when the next VM is decommissioned.
Change 668505 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox
Change 668505 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox
The additional sleep in the above patch should work around the issue. Resolving this for now; feel free to reopen if it happens again.
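For context, the merged change is described above as a two-line temporary fix; a hedged sketch of how such a sleep-based workaround might sit in a cookbook (illustrative only, not the actual patch) could look like this:

```python
# Illustrative sketch, not the actual operations/cookbooks change: give the
# forced Ganeti -> Netbox sync time to settle before regenerating DNS records
# from Netbox data, avoiding the race behind the KeyError above.
import time

def dns_netbox_run(args, spicerack):
    """Stand-in for the sre.dns.netbox cookbook entry point named in the traceback."""
    print("Generating the DNS records from Netbox data.")

def decommission_vm(spicerack=None, dns_netbox_args=None):
    # ... VM removal and the forced Ganeti -> Netbox sync would happen here ...
    time.sleep(60)  # assumed shape of the temporary fix: an extra safety sleep
    dns_netbox_run(dns_netbox_args, spicerack)
```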