This tag is a catch-all for Puppet tasks that don't align with Puppet-Core, Puppet-Infrastructure, or Puppet CI. It is not specifically assigned to any team and can be used by any team to conveniently tag Puppet-related tasks.
See also:
My intent was to remove the umask parameter (T338277), which was completed. While doing so, Elukey wanted to keep the 0440 mode, which I have split out into https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056981 and filed this task for. That goes further than just removing umask, and I don't intend to proceed any further, so I am closing this task.
Change #1056981 abandoned by Hashar:
[operations/puppet@production] cumin: use defaults to clone homer public repo
Reason:
My intent was to remove the `umask` parameter (T338277) which was completed.
This warning is no longer displayed, and having lots of facts doesn't seem to actually break anything.
Change #1104656 merged by Andrew Bogott:
[operations/puppet@production] profile::puppet::agent: actually pass facts_soft_limit to puppet::agent
Change #1104656 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] profile::puppet::agent: actually pass facts_soft_limit to puppet::agent
Change #1099748 merged by Andrew Bogott:
[operations/puppet@production] Puppet agent: allow hiera config of number_of_facts_soft_limit
I've checked all the resolv.confs and they all look fine. I'm passing this task over to @ssingh to review what fnegri and dcaro determined about the low-level puppet resolv functions... I don't see obvious low-level ways to prevent this sort of thing happening in the future but it's worth digging deeper.
Change #1099748 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):
[operations/puppet@production] Puppet agent: allow hiera config of number_of_facts_soft_limit
But with perccli the battery is reported to be fine (command is /opt/MegaRAID/perccli/perccli64), I'll report this upstream.
The underlying failing check is defined in the headers, but not otherwise used in the driver, so it must be something set/managed exclusively by the hardware:
It differentiates states already, ms-be2082 has "module missing, pack missing, charge failed" while an-presto1016 has the controller detected and only "charge failed". I'll poke at the megaraid_sas driver tomorrow to see what "charge failed" could actually mean.
In T377853#10366080, @MatthewVernon wrote: Megactl is correct that the battery is missing, but obviously on nodes where we expect that, it shouldn't flag as an error...
Megactl is correct that the battery is missing, but obviously on nodes where we expect that, it shouldn't flag as an error...
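The fix Matthew is asking for amounts to an allow-list: treat a missing BBU as OK on hosts where we expect no battery, while still alerting elsewhere. A hypothetical standalone sketch (not the actual WMF check; the host list and state strings are illustrative, taken from the statuses quoted in this task):

```ruby
# Hypothetical sketch: decide a RAID-battery alert state, treating an
# absent BBU as OK on hosts where no battery is expected.
EXPECT_NO_BBU = ['ms-be2082'].freeze # assumed host list, illustrative only

def battery_state(host, status)
  missing = status.include?('pack missing') || status.include?('module missing')
  if missing
    # Missing battery is fine only on hosts where we expect that.
    EXPECT_NO_BBU.include?(host) ? 'OK' : 'CRITICAL'
  elsif status.include?('charge failed')
    'WARNING'
  else
    'OK'
  end
end

puts battery_state('ms-be2082', 'module missing, pack missing, charge failed')
puts battery_state('an-presto1016', 'charge failed')
```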
Tried megactl (packaged by Moritz) on ms-be2082, this is the result:
kinda weird behavior if you ask me
I tried with:
resolver = Resolv::DNS.new(:nameserver => '127.0.0.1', :raise_timeout_errors => true)
@dcaro thanks for that analysis! I had a look at the source code for Resolv::DNS and apparently ResolvTimeout is only triggered if you enable the optional :raise_timeout_errors, see also https://bugs.ruby-lang.org/issues/18151
In T379927#10354355, @Andrew wrote: From Gerrit, @dcaro writes:
Did a quick test, there's three functions we use to resolve names, and only one of them actually fails if it can't resolve:
wmflib::hosts2ips -> does not fail
dnsquery::lookup (used by hosts2ips) -> does not fail
ipresolve -> fails

Maybe hosts2ips should use ipresolve? Or have a flag to fail if it's empty? It's only used here and in the firewall (where I'm not sure if we want to fail or not).
It's weird though, as it seems from the code that dnsquery::lookup should raise if there's an error:
https://gerrit.wikimedia.org/g/operations/puppet/+/cfc612449876b7ff631492b5f2c32b3c2e762ac3/vendor_modules/dnsquery/lib/puppet/functions/dnsquery/lookup.rb#30
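The raise-vs-swallow split dcaro describes mirrors Ruby's own stdlib resolver API, where the plural lookup returns an empty array on failure and the singular one raises. A minimal offline sketch using Resolv::Hosts with a throwaway hosts file (purely illustrative; the Puppet functions wrap Resolv differently):

```ruby
require 'resolv'
require 'tempfile'

# Throwaway hosts file so the demo needs no network.
hosts_file = Tempfile.new('hosts')
hosts_file.write("127.0.0.1 known.example\n")
hosts_file.flush

resolver = Resolv::Hosts.new(hosts_file.path)

# The plural form swallows the failure and returns an empty array...
empty = resolver.getaddresses('missing.example')
puts empty.inspect

# ...while the singular form raises Resolv::ResolvError.
begin
  resolver.getaddress('missing.example')
rescue Resolv::ResolvError => e
  puts "raised: #{e.class}"
end
```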
This has not recurred. Nevertheless we should figure out what's happening with the ruby functions that don't raise when they should.
In T377853#10360601, @elukey wrote: I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).
One other option is to try https://github.com/namiltd/megactl with this controller. (The underlying chipset is usually the same and on ms-be2081 I can also see /dev/megaraid_sas_ioctl_node).
There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren
In T377853#10360612, @MoritzMuehlenhoff wrote: There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren
I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).
I tried to download and install perccli == 007.2616.0000.0000 on ms-be2081, but no luck, same issue.
Change #1091249 merged by Ssingh:
[operations/puppet@production] resolvconf: don't update resolv.conf with 0 nameservers
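The guard in change #1091249 boils down to refusing to install a resolv.conf candidate that contains zero nameserver lines. A hypothetical standalone sketch of that check (not the actual Puppet template logic; addresses are placeholders):

```ruby
# Hypothetical safety check: never replace resolv.conf with content
# that lists no nameservers at all.
def safe_to_write?(content)
  content.lines.any? { |l| l.strip.start_with?('nameserver ') }
end

good = "search eqiad1.wikimedia.cloud\nnameserver 127.0.0.53\n"
bad  = "search eqiad1.wikimedia.cloud\n"

puts safe_to_write?(good)
puts safe_to_write?(bad)
```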
Nameserver is missing from the following hosts:
Assigning this task to @Andrew as he's currently working on a patch.
From Gerrit, @dcaro writes:
In T379927#10351489, @fnegri wrote:
In T377853#10351959, @MatthewVernon wrote: It's worth noting here that this is causing icinga to never be happy on the new nodes - it'll always be in state "Unknown" for the RAID controller check - you can look at e.g. thanos-be2005 in icinga to see what I mean.
It's worth noting here that this is causing icinga to never be happy on the new nodes - it'll always be in state "Unknown" for the RAID controller check - you can look at e.g. thanos-be2005 in icinga to see what I mean.
This has just caused a WMCS proxy outage, because the nameserver was removed from both /etc/resolv.conf and the Nginx config files in /etc/nginx/sites-available/.
Just found another VM where this happened: liwa3-2.linkwatcher.eqiad1.wikimedia.cloud
Change #1091733 merged by Majavah:
[operations/puppet@production] keepalived::failover: Support IPv6
Change #1091732 merged by Majavah:
[operations/puppet@production] keepalived: Split failover config template to new class
With BBU:
root@backup1012:~$ ./storcli64 show all J
{
  "Controllers": [
    {
      "Command Status": {
        "CLI Version": "007.3103.0000.0000 Aug 22, 2024",
        "Operating system": "Linux 6.1.0-26-amd64",
        "Status Code": 0,
        "Status": "Success",
        "Description": "None"
      },
      "Response Data": {
        "Number of Controllers": 1,
        "Host Name": "backup1012",
        "Operating System ": "Linux 6.1.0-26-amd64",
        "System Overview": [
          {
            "Ctl": 0,
            "Model": "SAS3908",
            "Ports": 8,
            "PDs": 24,
            "DGs": 1,
            "DNOpt": 0,
            "VDs": 1,
            "VNOpt": 0,
            "BBU": "Opt",
            "sPR": "On",
            "DS": "1&2",
            "EHS": "Y",
            "ASOs": 4,
            "Hlth": "Opt"
          }
        ],
        "ASO": [
          {
            "Ctl": 0, "Cl": "X", "SAS": "U", "MD": "U", "R6": "U",
            "WC": "U", "R5": "U", "SS": "U", "FP": "U", "Re": "X",
            "CR": "X", "RF": "X", "CO": "X", "CW": "X", "HA": "X",
            "SSHA": "X"
          }
        ]
      }
    }
  ]
}
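Since `storcli64 ... J` emits JSON, the overview fields (BBU and overall health, both "Opt" here, i.e. optimal) can be pulled out programmatically rather than scraped. A small sketch parsing the paste above (the JSON below is a trimmed copy keeping only the keys the sketch reads):

```ruby
require 'json'

# Trimmed copy of the storcli JSON output shown above.
output = <<~JSON
  {"Controllers":[{"Command Status":{"Status":"Success"},
    "Response Data":{"System Overview":[
      {"Ctl":0,"Model":"SAS3908","BBU":"Opt","Hlth":"Opt"}]}}]}
JSON

data = JSON.parse(output)
overview = data['Controllers'][0]['Response Data']['System Overview'][0]

# "Opt" in both BBU and Hlth means battery and controller report healthy.
puts "model=#{overview['Model']} bbu=#{overview['BBU']} health=#{overview['Hlth']}"
```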