Add more suggestions for avoiding deadlocks to webhook docs #46798
Conversation
Welcome @fasaxc!
I recommend clearly calling out ValidatingAdmissionPolicies as a way to limit the impact.
If you run an admission webhook in a namespace scrutineer, you can exempt the scrutineer namespace from the webhook admission checks, and you can still restrict changes to scrutineer using a ValidatingAdmissionPolicy.
You should consider Pod security admission for that scrutineer namespace, too (at least baseline, Restricted if you can make it work).
For example, have a ValidatingAdmissionPolicy that restricts who can remove the scrutineer namespace and also limits writes to the ConfigMap that configures the webhook. Have another ValidatingAdmissionPolicy that prevents making the scrutineer namespace privileged, and a further one that prevents creating new ClusterRoleBindings inside the scrutineer namespace.
That's just an example but I hope you get the idea.
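To make that a bit more concrete, here is a rough sketch, purely for illustration: the namespace name, the break-glass username, and the policy names are all placeholders, and it assumes a cluster where ValidatingAdmissionPolicy is available (admissionregistration.k8s.io/v1, so Kubernetes 1.30 or later):

```yaml
# Illustrative only. The namespace, policy names, and break-glass username
# below are made up for this sketch.
apiVersion: v1
kind: Namespace
metadata:
  name: scrutineer
  labels:
    # Pod Security admission: enforce at least the baseline profile here.
    pod-security.kubernetes.io/enforce: baseline
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: protect-scrutineer.example.com
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["DELETE"]
      resources: ["namespaces"]
  validations:
  - expression: >-
      oldObject.metadata.name != 'scrutineer' ||
      request.userInfo.username == 'break-glass-admin'
    message: "only the break-glass account may delete the scrutineer namespace"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: protect-scrutineer-binding.example.com
spec:
  policyName: protect-scrutineer.example.com
  validationActions: [Deny]
```

The same shape would apply to the other policies mentioned above (for example, one limiting writes to the ConfigMap that configures the webhook).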
It is recommended to exclude cluster infratructure namespaces from webhooks, including kube-system,
any namespaces used by CNI plugins, etc.
- typo (“infratructure”)
- let's mention ValidatingAdmissionPolicies as a way to provide (at least) some protection without having the admission webhook protect itself
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Thanks. Here's some feedback.
example, due to a cluster-wide outage or a node failure where both pods run on the same node)
deadlock occurs because neither webhook pod can be recreated without the other already running.
It is recommended to establish a desired order for webhooks to start, then to exclude "earlier"
I don't think this is a recommendation that the Kubernetes project makes.
OK, I've toned it down from recommendation to an example
It is recommended to exclude cluster infrastructure namespaces from webhooks, including kube-system,
any namespaces used by CNI plugins, etc.
This isn't a Kubernetes project recommendation, although in many (not all) cases it is a good idea.
toned down.
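For reference (not part of the suggested doc wording), this kind of exclusion is typically expressed with a namespaceSelector on the webhook configuration. A minimal sketch, with made-up webhook, Service, and namespace names, and caBundle omitted for brevity:

```yaml
# Sketch only: all names below are placeholders.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-checks.example.com
webhooks:
- name: example-checks.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  clientConfig:
    service:
      name: example-webhook
      namespace: example-webhook-system
      path: /validate
    # caBundle omitted for brevity
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["pods"]
  # Skip cluster infrastructure namespaces so this webhook can never block them.
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "kube-node-lease", "kube-public"]
```

The kubernetes.io/metadata.name label is set automatically on every namespace, so the selector does not rely on anyone labelling kube-system by hand.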
@sftim do you want to take over this PR? Seems like you want to rewrite this section of the docs.
I'd much rather not, honestly. If I can help you follow our style guide, great - I'll provide advice. If you've found that you don't have capacity to move this forward, that's OK; please let us know. If you would like to get this merged, the feedback so far will help you make the right kind of changes.
/lgtm
A couple of nits, but we should merge this; it will make the docs better.
LGTM label has been added. Git tree hash: f01bff874888c599957b47bf82ef0db5adc994c2
Applied your markups and tweaked the language to avoid making "official recommendations". PTAL.
One way to prevent this is to establish a desired order for webhooks to start, then to exclude
"earlier" webhooks' resources from being inspected by "later" webhooks. This ensures that the
"earliest" webhook can start, which in turn allows the next.
I don't understand how this could work and don't think it's an effective control. See https://stackoverflow.com/a/69166466 for informal confirmation that admission webhooks are not ordered.
I think you've taken the wrong idea from that paragraph, which suggests I could explain it more clearly! I'm talking about the order in which the webhook pods start rather than the order that webhooks apply to a particular resource.
The deadlock that can happen is if
- webhook A is unavailable
- webhook B is unavailable
- webhook A is needed to approve webhook B's pod
- webhook B is needed to approve webhook A's pod.
So the suggestion is:
- Pick the webhook that needs to apply most broadly. Let's say that's A
- Make sure that no other webhooks can interfere with starting A.
- Therefore A can always start, even if no other webhooks are available
- Therefore B can start, because it depends on A but we know that A can start.
If you have 3 webhooks, A, B, C, then (see the sketch after this list):
- Make sure B and C exclude A's pods
- Therefore A can start
- Make sure C excludes B's pods
- Therefore B can start, since we know that A can start
- Therefore C can start, since A and B can start.
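A sketch of what that could look like for webhook B (all names and labels here are invented for the example; C's configuration would do the same for both A's and B's pods):

```yaml
# Sketch only: webhook, Service, namespace names, and labels are placeholders.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: webhook-b.example.com
webhooks:
- name: webhook-b.example.com
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
  clientConfig:
    service:
      name: webhook-b
      namespace: webhook-b-system
      path: /mutate
    # caBundle omitted for brevity
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  # Ignore pods that belong to webhook A, so A can always (re)start even while
  # B is unavailable. Webhook C's configuration would exclude A's and B's pods.
  objectSelector:
    matchExpressions:
    - key: app.kubernetes.io/name
      operator: NotIn
      values: ["webhook-a"]
```

Whether you carve out A's pods with an objectSelector (as above) or with a namespaceSelector on A's namespace is a matter of taste; the point is just that the "later" webhooks never gate the "earlier" ones.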
I'll try to reword it, but suggestions also welcome
This kind of workload sequencing doesn't really hold because nodes are not considered durable and resilient. Also, do try to reword.
I get the point we want to make. Maybe that's worth shifting to its own piece (e.g. a blog article)?
I think as long as you've got an ordering, node failure is fine:
- webhook A comes up
- webhook A's node dies
- webhook A is rescheduled somewhere else
- now webhook B can start... and so on.
It should get there eventually, whereas without some sort of ordering like this, node failures will cause deadlocks and you'll be forced to intervene by temporarily removing your webhook
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
container with name "foo-sidecar" with the expected configuration exists in the to-be-created
object.
### Avoiding deadlocks in self-hosted webhooks |
Maybe this scenario can also be incorporated: kubernetes/enhancements#5068 (comment)
This PR feels like it's in danger of not converging; I'd rather not add anything more to it.
/lgtm We should merge this; it will make the docs better.
LGTM label has been added. Git tree hash: 9647147c1f415d781b8078b9ce710d5f81f366f4
@fasaxc would you squash this to 1 commit and force-push? (if you keep the same merge base, Prow won't drop the LGTM label).
Webhooks can cause deadlocks in several ways, expand the list to cover more subtle cases. Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com> Co-authored-by: Tim Bannister <tim@scalefactory.com>
Source PR: kubernetes#46798 Co-authored-by: Shaun Crampton <shaun@tigera.io> Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com> Co-authored-by: Tim Bannister <tim@scalefactory.com>
I agree; I don't see any reason not to merge this in as is. Any unaddressed feedback can be left for a follow-up PR. Thanks for this @fasaxc, and thanks to all the reviewers! /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: nate-double-u
Webhooks can cause deadlocks in several ways; expand the list to cover more subtle cases.
Was recently debugging an issue where this came up and I thought perhaps I could improve the docs to call out more cases. Not sure we've got the root cause of the issue correct yet, but I can't see why it wouldn't be possible to cause deadlocks in these ways (and I think I was seeing these problems in a customer cluster).