
Puppet (Tag)
Active · Public

Details

Description

This tag is a catch-all for Puppet tasks that don't align with Puppet-Core, Puppet-Infrastructure, or Puppet CI. It is not specifically assigned to any team and can be used by any team to conveniently tag Puppet-related tasks.

See also:

Recent Activity

Tue, Dec 17

hashar closed T371980: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 as Declined.

My intent was to remove the umask parameter (T338277), which was completed. While doing so, Elukey wanted to keep the 0440 mode, which I split out into https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056981 and filed this task for. That goes further than just removing umask, and I don't intend to proceed any further. I am thus closing this task.

Tue, Dec 17, 2:48 PM · Infrastructure-Foundations, Puppet, Release-Engineering-Team
gerritbot added a comment to T338277: Puppet git::clone probably does not need `umask` parameter.

Change #1056981 abandoned by Hashar:

[operations/puppet@production] cumin: use defaults to clone homer public repo

Reason:

My intent was to remove the `umask` parameter (T338277) which was completed.

Tue, Dec 17, 2:48 PM · Patch-For-Review, Puppet, Release-Engineering-Team

Mon, Dec 16

Maintenance_bot removed a project from T381293: Too many puppet facts on toolforge k8s workers: Patch-For-Review.
Mon, Dec 16, 3:31 PM · Toolforge, cloud-services-team, Puppet
Andrew closed T381293: Too many puppet facts on toolforge k8s workers as Resolved.

This warning is no longer displayed, and having lots of facts doesn't seem to actually break anything.

Mon, Dec 16, 3:19 PM · Toolforge, cloud-services-team, Puppet
gerritbot added a comment to T381293: Too many puppet facts on toolforge k8s workers.

Change #1104656 merged by Andrew Bogott:

[operations/puppet@production] profile::puppet::agent: actually pass facts_soft_limit to puppet::agent

https://gerrit.wikimedia.org/r/1104656

Mon, Dec 16, 3:02 PM · Toolforge, cloud-services-team, Puppet
gerritbot added a project to T381293: Too many puppet facts on toolforge k8s workers: Patch-For-Review.
Mon, Dec 16, 2:46 PM · Toolforge, cloud-services-team, Puppet
gerritbot added a comment to T381293: Too many puppet facts on toolforge k8s workers.

Change #1104656 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] profile::puppet::agent: actually pass facts_soft_limit to puppet::agent

https://gerrit.wikimedia.org/r/1104656

Mon, Dec 16, 2:46 PM · Toolforge, cloud-services-team, Puppet
Maintenance_bot removed a project from T381293: Too many puppet facts on toolforge k8s workers: Patch-For-Review.
Mon, Dec 16, 2:30 PM · Toolforge, cloud-services-team, Puppet
gerritbot added a comment to T381293: Too many puppet facts on toolforge k8s workers.

Change #1099748 merged by Andrew Bogott:

[operations/puppet@production] Puppet agent: allow hiera config of number_of_facts_soft_limit

https://gerrit.wikimedia.org/r/1099748

Mon, Dec 16, 2:18 PM · Toolforge, cloud-services-team, Puppet

Dec 4 2024

Andrew created P71552 facts from tools-k8s-worker-nfs-14 -- T381293.
Dec 4 2024, 5:23 PM · Puppet, Cloud-VPS

Dec 3 2024

Andrew reassigned T379927: Puppet removed "nameserver" line from /etc/resolv.conf from Andrew to ssingh.

I've checked all the resolv.confs and they all look fine. I'm passing this task over to @ssingh to review what fnegri and dcaro determined about the low-level puppet resolv functions... I don't see obvious low-level ways to prevent this sort of thing happening in the future but it's worth digging deeper.

Dec 3 2024, 8:56 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team

Dec 2 2024

gerritbot added a project to T381293: Too many puppet facts on toolforge k8s workers: Patch-For-Review.
Dec 2 2024, 5:32 PM · Toolforge, cloud-services-team, Puppet
gerritbot added a comment to T381293: Too many puppet facts on toolforge k8s workers.

Change #1099748 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Puppet agent: allow hiera config of number_of_facts_soft_limit

https://gerrit.wikimedia.org/r/1099748

Dec 2 2024, 5:32 PM · Toolforge, cloud-services-team, Puppet
taavi edited projects for T381293: Too many puppet facts on toolforge k8s workers, added: Toolforge; removed Tools.
Dec 2 2024, 5:23 PM · Toolforge, cloud-services-team, Puppet
Andrew created T381293: Too many puppet facts on toolforge k8s workers.
Dec 2 2024, 5:19 PM · Toolforge, cloud-services-team, Puppet

Nov 29 2024

MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

But with perccli the battery is reported to be fine (command is /opt/MegaRAID/perccli/perccli64), I'll report this upstream.

Nov 29 2024, 9:39 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

The underlying failing check is defined in the headers, but not otherwise used in the driver, so it must be something set/managed exclusively by the hardware:

Nov 29 2024, 9:35 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations

Nov 28 2024

MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

It differentiates states already, ms-be2082 has "module missing, pack missing, charge failed" while an-presto1016 has the controller detected and only "charge failed". I'll poke at the megaraid_sas driver tomorrow to see what "charge failed" could actually mean.

Nov 28 2024, 7:05 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Megactl is correct that the battery is missing, but obviously on nodes where we expect that, it shouldn't flag as an error...

Nov 28 2024, 4:57 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MatthewVernon added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Megactl is correct that the battery is missing, but obviously on nodes where we expect that, it shouldn't flag as an error...

Nov 28 2024, 4:55 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Tried megactl (packaged by Moritz) on ms-be2082, this is the result:

Nov 28 2024, 4:43 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
fnegri added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

kinda weird behavior if you ask me

Nov 28 2024, 1:27 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
dcaro added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

I tried with:

resolver = Resolv::DNS.new(
  :nameserver => '127.0.0.1',
  :raise_timeout_errors => true,
)
Nov 28 2024, 11:41 AM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
fnegri added a subtask for T379927: Puppet removed "nameserver" line from /etc/resolv.conf: T381092: 2024-11-25 ProjectProxyMainProxyDown.
Nov 28 2024, 11:10 AM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
fnegri added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

@dcaro thanks for that analysis! I had a look at the source code for Resolv::DNS and apparently ResolvTimeout is only triggered if you enable the optional :raise_timeout_errors, see also https://bugs.ruby-lang.org/issues/18151

Nov 28 2024, 10:44 AM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
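
The distinction described in the comment above can be sketched in Ruby. This is a minimal illustration under stated assumptions, not code from operations/puppet: `strict_resolver` and `classify_lookup` are hypothetical helper names. The only library pieces relied on are `Resolv::DNS.new` with the `:nameserver` and `:raise_timeout_errors` options, `getaddresses`, and the `Resolv::ResolvTimeout` exception from Ruby's standard `resolv` library.

```ruby
require 'resolv'

# Hypothetical helper: build a resolver that raises on timeouts instead of
# silently treating them like "no such name".
def strict_resolver(nameserver)
  Resolv::DNS.new(nameserver: [nameserver], raise_timeout_errors: true)
end

# Hypothetical helper: keep "timed out" separate from "answered, but empty".
# `resolver` is anything that responds to getaddresses(name).
def classify_lookup(resolver, name)
  addresses = resolver.getaddresses(name)
  addresses.empty? ? :no_answer : :resolved
rescue Resolv::ResolvTimeout
  :timeout
end
```

A caller could then decide per call site whether `:timeout` and `:no_answer` should abort a catalog compile or merely be logged.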
dcaro added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

From Gerrit, @dcaro writes:

Did a quick test, there's three functions we use to resolve names, and only one of them actually fails if it can't resolve:

wmflib::hosts2ips -> does not fail
dnsquery::lookup (used by hosts2ips) -> does not fail
ipresolve -> fails

maybe hosts2ips should use ipresolve? or have a flag to fail if it's empty? It's only used here and in the firewall (where I'm not sure if we want to fail or not).

It's weird though, as it seems from the code that dnsquery::lookup should raise if there's an error:
https://gerrit.wikimedia.org/g/operations/puppet/+/cfc612449876b7ff631492b5f2c32b3c2e762ac3/vendor_modules/dnsquery/lib/puppet/functions/dnsquery/lookup.rb#30

Nov 28 2024, 9:21 AM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
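
The "flag to fail if it's empty" idea from the comment above might look roughly like the following Ruby sketch. It is not the real `wmflib::hosts2ips` implementation: `hosts_to_ips`, its `lookup:` callable, and the `fail_on_empty:` flag are all hypothetical names chosen for illustration.

```ruby
# Hypothetical stand-in for a hosts2ips-style helper.
# `lookup` is any callable mapping a hostname to an array of IP strings.
def hosts_to_ips(hosts, lookup:, fail_on_empty: false)
  ips = hosts.flat_map do |host|
    found = lookup.call(host)
    # Opt-in strictness: callers such as a firewall rule generator can keep
    # the lenient default, while resolv.conf generation can demand a result.
    if found.empty? && fail_on_empty
      raise ArgumentError, "no addresses found for #{host}"
    end
    found
  end
  ips.uniq.sort
end
```

Keeping the flag off by default preserves the current lenient behaviour for existing callers, which matches the uncertainty voiced above about whether the firewall use should fail.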

Nov 27 2024

Andrew added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

This has not recurred. Nevertheless we should figure out what's happening with the ruby functions that don't raise when they should.

Nov 27 2024, 7:21 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
fnegri added a parent task for T379927: Puppet removed "nameserver" line from /etc/resolv.conf: T380882: openstack network problems (November 2024).
Nov 27 2024, 4:35 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).

Nov 27 2024, 12:04 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

One other option is to try https://github.com/namiltd/megactl with this controller. (The underlying chipset is usually the same and on ms-be2081 I can also see /dev/megaraid_sas_ioctl_node).

Nov 27 2024, 9:36 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren

Nov 27 2024, 9:30 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

There are debs available in the Thomas Krenn repo (German server vendor):
https://www.thomas-krenn.com/de/wiki/StorCLI_unter_Ubuntu_installieren

Nov 27 2024, 9:29 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I think we could easily try to swap perccli with storcli for the hosts with SAS3908 onboard, but I am struggling to download the binary from the website (it doesn't show up in the search).

Nov 27 2024, 9:08 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

I tried to download and install perccli == 007.2616.0000.0000 on ms-be2081 but no luck, same issue.

Nov 27 2024, 8:59 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations

Nov 26 2024

Maintenance_bot removed a project from T379927: Puppet removed "nameserver" line from /etc/resolv.conf: Patch-For-Review.
Nov 26 2024, 4:31 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
gerritbot added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

Change #1091249 merged by Ssingh:

[operations/puppet@production] resolvconf: don't update resolv.conf with 0 nameservers

https://gerrit.wikimedia.org/r/1091249

Nov 26 2024, 4:22 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
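
The guard named in the change title above ("don't update resolv.conf with 0 nameservers") can be illustrated with a short Ruby sketch. This is an assumption about the general shape of such a guard, not the content of change #1091249; `render_resolv_conf` is a hypothetical function, and the addresses in the usage note are documentation-range placeholders.

```ruby
# Hypothetical renderer: returns the file content, or nil when there is
# nothing safe to write, so the caller leaves the existing file untouched.
def render_resolv_conf(nameservers, search: [])
  return nil if nameservers.empty?

  lines = []
  lines << "search #{search.join(' ')}" unless search.empty?
  nameservers.each { |ns| lines << "nameserver #{ns}" }
  lines.join("\n") + "\n"
end
```

For example, `render_resolv_conf(['192.0.2.53'], search: ['example.org'])` yields a two-line file, while `render_resolv_conf([])` yields `nil` and the caller is expected to skip the write.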

Nov 25 2024

Andrew added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

Nameserver is missing from the following hosts:

Nov 25 2024, 6:22 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
fnegri reassigned T379927: Puppet removed "nameserver" line from /etc/resolv.conf from fnegri to Andrew.

Assigning this task to @Andrew as he's currently working on a patch.

Nov 25 2024, 5:04 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
Andrew updated subscribers of T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

From Gerrit, @dcaro writes:

Nov 25 2024, 5:03 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
fnegri added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

@bd808, yes kinda: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Incident_response_process#Writing_an_incident_report

Nov 25 2024, 4:50 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
bd808 added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.
Nov 25 2024, 4:45 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team
jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

It's worth noting here that this is causing icinga to never be happy on the new nodes - it'll always be in state "Unknown" for the RAID controller check - you can look at e.g. thanos-be2005 in icinga to see what I mean.

Nov 25 2024, 10:39 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
MatthewVernon added a project to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool: SRE-swift-storage.

It's worth noting here that this is causing icinga to never be happy on the new nodes - it'll always be in state "Unknown" for the RAID controller check - you can look at e.g. thanos-be2005 in icinga to see what I mean.

Nov 25 2024, 9:55 AM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
fnegri reopened T379927: Puppet removed "nameserver" line from /etc/resolv.conf as "Open".

This has just caused a WMCS proxy outage, because the nameserver was removed from both /etc/resolv.conf and the Nginx config files in /etc/nginx/sites-available/.

Nov 25 2024, 4:00 AM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team

Nov 23 2024

Andrew added a comment to T379927: Puppet removed "nameserver" line from /etc/resolv.conf.

Just found another VM where this happened: liwa3-2.linkwatcher.eqiad1.wikimedia.cloud

Nov 23 2024, 1:39 PM · Puppet, Infrastructure-Foundations, Cloud-VPS, cloud-services-team

Nov 22 2024

Maintenance_bot removed a project from T380057: Keepalived Puppet module: Support IPv6: Patch-For-Review.
Nov 22 2024, 4:31 PM · IPv6, Puppet
taavi closed T380057: Keepalived Puppet module: Support IPv6 as Resolved.
Nov 22 2024, 4:26 PM · IPv6, Puppet
gerritbot added a comment to T380057: Keepalived Puppet module: Support IPv6.

Change #1091733 merged by Majavah:

[operations/puppet@production] keepalived::failover: Support IPv6

https://gerrit.wikimedia.org/r/1091733

Nov 22 2024, 4:16 PM · IPv6, Puppet
gerritbot added a comment to T380057: Keepalived Puppet module: Support IPv6.

Change #1091732 merged by Majavah:

[operations/puppet@production] keepalived: Split failover config template to new class

https://gerrit.wikimedia.org/r/1091732

Nov 22 2024, 4:16 PM · IPv6, Puppet

Nov 19 2024

jcrespo added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

With BBU:

root@backup1012:~$ ./storcli64 show all J
{
"Controllers":[
{
        "Command Status" : {
                "CLI Version" : "007.3103.0000.0000 Aug 22, 2024",
                "Operating system" : "Linux 6.1.0-26-amd64",
                "Status Code" : 0,
                "Status" : "Success",
                "Description" : "None"
        },
        "Response Data" : {
                "Number of Controllers" : 1,
                "Host Name" : "backup1012",
                "Operating System " : "Linux 6.1.0-26-amd64",
                "System Overview" : [
                        {
                                "Ctl" : 0,
                                "Model" : "SAS3908",
                                "Ports" : 8,
                                "PDs" : 24,
                                "DGs" : 1,
                                "DNOpt" : 0,
                                "VDs" : 1,
                                "VNOpt" : 0,
                                "BBU" : "Opt",
                                "sPR" : "On",
                                "DS" : "1&2",
                                "EHS" : "Y",
                                "ASOs" : 4,
                                "Hlth" : "Opt"
                        }
                ],
                "ASO" : [
                        {
                                "Ctl" : 0,
                                "Cl" : "X",
                                "SAS" : "U",
                                "MD" : "U",
                                "R6" : "U",
                                "WC" : "U",
                                "R5" : "U",
                                "SS" : "U",
                                "FP" : "U",
                                "Re" : "X",
                                "CR" : "X",
                                "RF" : "X",
                                "CO" : "X",
                                "CW" : "X",
                                "HA" : "X",
                                "SSHA" : "X"
                        }
                ]
        }
}
]
}
Nov 19 2024, 1:35 PM · SRE-swift-storage, DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
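
A monitoring check could consume JSON of the shape shown above along these lines. This Ruby sketch is illustrative only: `controller_problems` and its `expect_bbu:` flag are hypothetical, and it assumes that `"Opt"` is the healthy value for the `BBU` and `Hlth` fields, as in the backup1012 output. The flag reflects the earlier point in this task that a missing battery should not be flagged on nodes where that is expected.

```ruby
require 'json'

# Hypothetical check: list fields that are not "Opt" in the
# "System Overview" section of `storcli64 show all J`-style output.
def controller_problems(json_text, expect_bbu: true)
  # Skip the battery field on hosts that are known to ship without one.
  fields = expect_bbu ? %w[BBU Hlth] : %w[Hlth]
  controllers = JSON.parse(json_text).fetch('Controllers', [])
  controllers.flat_map do |entry|
    overview = entry.dig('Response Data', 'System Overview') || []
    overview.flat_map do |ctl|
      fields.reject { |field| ctl[field] == 'Opt' }
            .map { |field| "controller #{ctl['Ctl']}: #{field}=#{ctl[field].inspect}" }
    end
  end
end
```

An empty return value means no problems were found for the selected fields; any other value can be joined into the check's status message.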