
Decision Request - Toolforge policy agent
Closed, ResolvedPublic

Description

Problem

We need to decide on the implementation of a policy agent for Toolforge Kubernetes. This policy agent should replace the Pod Security Policy mechanism, a core Toolforge security function that is being removed from Kubernetes in version 1.25. See also: T279110: [infra] Replace PodSecurityPolicy in Toolforge Kubernetes

Constraints and risks

  • TBD.

Decision record

https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T362233_Toolforge_policy_agent

Options

Option 1

Adopt Kyverno https://kyverno.io/ without a date to migrate out of it.

This is a CNCF incubating project, with about 300 different contributors since its inception in 2019.

This software was originally created by Nirmata, a company that offers a number of additional enterprise services based on it, in particular:

image.png (244×973 px, 57 KB)

Kyverno is simple to work with, was designed specifically for Kubernetes, and supports writing policies in CEL, which has since been adopted by upstream Kubernetes for Validating Admission Policies (VAPs, starting in k8s 1.26).

This means that if we adopt Kyverno today, then once we get to k8s 1.26 we can consider migrating our policies, unchanged, to the native VAPs, thus removing the need for Kyverno itself.

In either Kyverno's original policy language or CEL, policies are simple and straightforward to work with.

If we adopted Kyverno, the events could be:

  1. in k8s 1.24, adopt kyverno 1.10, with policies in the native language
  2. we migrate from k8s 1.24 to 1.25
  3. we upgrade kyverno to 1.11
  4. we translate policies from the native language to CEL (or, fetch them from the policy registry, where they will likely be)
  5. we migrate from k8s 1.25 to 1.26
  6. we evaluate dropping kyverno in favor of VAPs.
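
The version constraints behind this sequence can be sketched as a small lookup. This is only an illustration, not part of any Toolforge tooling: the function name `policy_mechanisms` and the mechanism labels are hypothetical, and the thresholds are the ones stated in this task (PSP removed in k8s 1.25, Kyverno 1.11 with CEL requiring k8s 1.25, VAPs starting in k8s 1.26).

```python
def policy_mechanisms(k8s_minor: int) -> list[str]:
    """Return the policy mechanisms usable on Kubernetes 1.<k8s_minor>.

    Thresholds are taken from this task's description, not queried from
    any cluster; the labels are hypothetical names for illustration.
    """
    mechanisms = []
    if k8s_minor < 25:
        # PodSecurityPolicy is removed in k8s 1.25.
        mechanisms.append("PodSecurityPolicy")
    # Kyverno's own policy language is assumed usable on all versions here.
    mechanisms.append("Kyverno (native language)")
    if k8s_minor >= 25:
        # Kyverno 1.11, the first release with CEL, requires k8s 1.25.
        mechanisms.append("Kyverno (CEL)")
    if k8s_minor >= 26:
        # ValidatingAdmissionPolicies start in k8s 1.26.
        mechanisms.append("ValidatingAdmissionPolicy")
    return mechanisms


for minor in (24, 25, 26):
    print(f"1.{minor}: {', '.join(policy_mechanisms(minor))}")
```

On 1.25 neither PSP nor VAPs are available in this model, which is why a third-party agent is needed for at least that one release.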

Pros:

  • Simplified workflow for writing policies, compared to OPA gatekeeper (no template indirection).
  • Apparently stable native CEL language support.
  • Has a more or less sensible migration path towards VAPs.

Cons:

  • The native CEL language is only available starting with Kyverno 1.11, which has the requirement of k8s 1.25, meaning we cannot adopt CEL directly in k8s 1.24.
  • Kyverno is pushed mainly by a single company that has an enterprise version on top of it.
  • CNCF incubating project (some risk of the project changing direction)

Option 2

Adopt Open Policy Agent Gatekeeper. https://open-policy-agent.github.io/gatekeeper/website/

This is a CNCF graduated project, with about 450 different contributors since its inception in 2015.

This software was originally created by Styra, a company that offers many additional enterprise services based on it.

OPA Gatekeeper is more complex and apparently a bit "uglier" compared to Kyverno. Policies have a template indirection, which means you need to create a policy template, then a policy instance.

Policies are written in Rego, a domain-specific language. Apparently, Rego is still receiving stabilization changes, as reported by the maintainers at KubeCon EU Paris.

Pros:

  • Apparently a more mature CNCF project compared to Kyverno (graduated vs incubating)

Cons:

  • Template indirection makes policies more cumbersome to work with, compared to Kyverno.
  • CEL-written policies only in pre-alpha support phase, not available for k8s 1.24 anyway.
  • Not a clear migration path to VAPs.

Option 3

Goal k8s native: Option 1 + replacing kyverno with migration to k8s native VAPs on the 1.26 k8s upgrade

Pros:

  • All the ones with option 1
  • No risks involving the project changing direction
  • One less component to maintain in the mid-long term

Cons:

  • Same as option 1, without the risk of long-term maintenance

Event Timeline

/me really interested in the option of dropping the 3rd party component on the 1.26 upgrade

we evaluate dropping kyverno in favor of VAPs

Can we add an option where this is "drop kyverno" instead?


This is very speculative at this point. Why would you like that statement to be included?


Because any option that ends without a 3rd party in the mid-run (it's actually just one k8s upgrade away!) is way better than any other option that does. Most of the risk and long-term maintenance involved in having a 3rd party just vanishes.

If that's not part of the solution, then you have to start weighing long-term support, stability and such.


I don't think a commitment to drop kyverno at point X in the future makes sense. We haven't done anything like this for any of the other 3rd party components we have in Toolforge.
I'm fine with evaluating our options regarding kyverno vs VAPs when we get to the point where that's an actual possibility. Today this feels a bit like a guessing game.

I don't have any data at this point that indicates that long-term support or stability of kyverno could introduce risks for Toolforge. If you do, please share :-P


I don't think a commitment to drop kyverno at point X in the future makes sense. We haven't done anything like this for any of the other 3rd party components we have in Toolforge.
I'm fine with evaluating our options regarding kyverno vs VAPs when we get to the point where that's an actual possibility. Today this feels a bit like a guessing game.

I don't agree with this. The upgrade to kubernetes 1.26 would be less than a year away, and that is the clear point at which we would have to drop kyverno. I don't think it's too far in the future.

Please add an option in which we decide to drop kyverno with the 1.26 upgrade.

I don't have any data at this point that indicates that long-term support or stability of kyverno could introduce risks for Toolforge. If you do, please share :-P

It's a CNCF incubating project (that means unstable) pushed by a single company with an enterprise downstream version on top of it (that means they will change it as they see fit, with a non-trivial chance of them either changing the license or dropping the project).

Please add an option in which we decide to drop kyverno with the 1.26 upgrade.

Please feel free to add it yourself :-)


I don't think incubating means unstable, per the CNCF definition:

image.png (539×1 px, 62 KB)

The number of contributors to the kyverno project is high, beyond what the single company does.

You mention an enterprise downstream version, but per the Nirmata docs, this doesn't seem to be the case. They just offer value added services, see here:

image.png (55×220 px, 10 KB)

We use plenty of 3rd party components that are way riskier than kyverno for the reasons you pointed out (nginx, haproxy, calico, gitlab, etc) and we don't seem to have any concrete plan to get rid of them, beyond what I'm defending here: we will adapt when the need arises.

I think the tradeoff here is good enough.


I don't think incubating means unstable, per the CNCF definition:

image.png (539×1 px, 62 KB)

Unstable as in unstable project, not unstable software (it's way more likely that its direction might change).

The number of contributors to the kyverno project is high, beyond what the single company does.

You mention an enterprise downstream version, but per the Nirmata docs, this doesn't seem to be the case. They just offer value added services, see here:

image.png (55×220 px, 10 KB)

They mention that they provide CVE fixes and bugfixes earlier than the OSS version, and that they prioritize features over the OSS ones. To me that means they use a fork of the OSS version with their enterprise support on top of it.

In that same pamphlet they have a side-by-side comparison between Kyverno OSS and Nirmata Enterprise.

They refer to it as 'Nirmata Enterprise for Kyverno is the enterprise-grade distribution'.

"Kyverno OSS is a full-featured and production-ready solution, but organizations with business-critical applications require the additional value of Nirmata Enterprise for Kyverno."

That's not only support and training on top of it. I think there's enough in that same link you shared to think they have their own fork of the OSS version with their enterprise bits on top.

We use plenty of 3rd party components that are way riskier than kyverno for the reasons you pointed out (nginx, haproxy, calico, gitlab, etc) and we don't seem to have any concrete plan to get rid of them, beyond what I'm defending here: we will adapt when the need arises.

That does not mean that we should add even more.

I think the tradeoff here is good enough.

All this is part of the discussion that needs to happen, please add the option.

Please add an option in which we decide to drop kyverno with the 1.26 upgrade.

Please feel free to add it yourself :-)

Missed this, I'll do it.

I'm fine with both option 1 and option 3. I don't see a huge practical difference between "we evaluate dropping kyverno" (option 1) or "replacing kyverno [...] on the 1.26 k8s upgrade" (option 3). In both cases we are going to try dropping Kyverno after the 1.26 upgrade, unless we find some big blockers in doing so.

In both cases, we should keep track of the evolution of Kyverno, and avoid using any Kyverno-only features (if they exist) that could make it harder to migrate to VAP. From a quick search, I found a mention of a policy that cannot be translated.

In both cases we are going to try dropping Kyverno after the 1.26 upgrade, unless we find some big blockers in doing so.

I think this is not true with option 1. As I understand it, "we can consider migrating our policies" does not mean that we will even consider it, and does not force us to commit to doing so either.

With option 3 the commitment is clear, so we can plan accordingly and dedicate the resources needed (as opposed to not planning for it, so that it never happens).

Just acknowledging that I've seen this discussion, but that I don't know enough about this topic to have a preference.

scheduled discussion meeting for 2024-04-30.

aborrero changed the task status from Open to In Progress.Apr 25 2024, 8:32 AM
aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Needs discussion on the cloud-services-team board.

The decision about committing to drop the extra component on the upgrade to k8s 1.26 might become way more relevant with T363683: Decision request - kubernetes upgrade workgroup, if we decide to try to upgrade monthly.
In that case the timeline would be:

  • adopt the policy agent
  • right after, upgrade to 1.25
  • one month later, upgrade to 1.26 and drop the policy engine

The output of the decision meeting was that we go with option 3, with an additional caveat:

We want to drop Kyverno in favor of ValidatingAdmissionPolicies after we upgrade K8s to 1.26 and before we upgrade it to 1.29. If we get to the point where we upgrade to 1.29 and we're still using Kyverno, we will hold a new decision request to agree on a new plan.
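
As a rough illustration of that caveat, the agreed window could be expressed as a check like the one below. This is a sketch under assumptions: the helper name `kyverno_plan_status` and the returned labels are hypothetical, and only the 1.26 and 1.29 thresholds come from the decision itself.

```python
def kyverno_plan_status(k8s_minor: int, kyverno_in_use: bool) -> str:
    """Classify where a cluster stands against the agreed caveat.

    Hypothetical helper: assumes Kyverno was adopted at some point.
    The 1.26 and 1.29 thresholds come from the decision above; the
    returned labels are made up for illustration.
    """
    if not kyverno_in_use:
        return "done"  # Kyverno already dropped in favor of VAPs
    if k8s_minor < 26:
        return "keep-kyverno"  # VAPs are not available yet
    if k8s_minor < 29:
        return "migrate-to-vap"  # the agreed window for dropping Kyverno
    return "new-decision-request"  # reached 1.29 while still on Kyverno


print(kyverno_plan_status(27, kyverno_in_use=True))
```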

aborrero claimed this task.
aborrero updated the task description. (Show Details)