Add more suggestions for avoiding deadlocks to webhook docs by fasaxc · Pull Request #46798 · kubernetes/website · GitHub

Conversation

fasaxc
Contributor
@fasaxc fasaxc commented Jun 12, 2024

Webhooks can cause deadlocks in several ways; expand the list to cover more subtle cases.

I was recently debugging an issue where this came up, and I thought perhaps I could improve the docs to call out more cases. I'm not sure we've got the root cause of that issue correct yet, but I can't see why it wouldn't be possible to cause deadlocks in these ways (and I think I was seeing these problems in a customer cluster).

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2024
@k8s-ci-robot
Contributor

Welcome @fasaxc!

It looks like this is your first PR to kubernetes/website 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/website has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. language/en Issues or PRs related to English language labels Jun 12, 2024
@k8s-ci-robot k8s-ci-robot requested review from jpbetz and liggitt June 12, 2024 10:42
netlify bot commented Jun 12, 2024

Pull request preview available for checking

Built without sensitive environment variables

🔨 Latest commit: 0c40ece
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/67b3381a09ead600088c678c
😎 Deploy Preview: https://deploy-preview-46798--kubernetes-io-main-staging.netlify.app

Contributor
@sftim sftim left a comment


I recommend clearly calling out ValidatingAdmissionPolicies as a way to limit the impact.
If you run an admission webhook in a namespace named scrutineer, you can exempt the scrutineer namespace from the webhook admission checks, and still restrict changes to scrutineer using a ValidatingAdmissionPolicy.
You should consider Pod Security admission for that scrutineer namespace too (at least baseline, or restricted if you can make it work).

For example, have a ValidatingAdmissionPolicy that restricts who can remove the scrutineer namespace and also limits writes to the ConfigMap that configures the webhook. Have another ValidatingAdmissionPolicy that prevents making the scrutineer namespace privileged, and a further one that prevents creating new ClusterRoleBindings inside the scrutineer namespace.

That's just an example but I hope you get the idea.
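To make that concrete, here is a minimal sketch of the kind of ValidatingAdmissionPolicy that could protect such a namespace. The namespace name (scrutineer), group name (infra-admins), and object names are hypothetical, and this assumes a cluster where ValidatingAdmissionPolicy is available in admissionregistration.k8s.io/v1 (GA in Kubernetes 1.30):

```yaml
# Sketch only: "scrutineer" and "infra-admins" are hypothetical names.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: protect-scrutineer-namespace
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["DELETE"]
      resources: ["namespaces"]
  matchConditions:
  # Only evaluate this policy for the namespace we want to protect.
  - name: is-scrutineer-namespace
    expression: "request.name == 'scrutineer'"
  validations:
  # Reject the delete unless the requester is in the (hypothetical) infra-admins group.
  - expression: "'infra-admins' in request.userInfo.groups"
    message: "Only infra-admins may delete the scrutineer namespace."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: protect-scrutineer-namespace-binding
spec:
  policyName: protect-scrutineer-namespace
  validationActions: ["Deny"]
```

Similar policy/binding pairs could cover the other cases sketched above, such as limiting writes to the ConfigMap that configures the webhook.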

Comment on lines 1268 to 1269
It is recommended to exclude cluster infratructure namespaces from webhooks, including kube-system,
any namespaces used by CNI plugins, etc.
Contributor


  • typo (“infratructure”)
  • let's mention ValidatingAdmissionPolicies as a way to provide (at least) some protection without having the admission webhook protect itself
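As a sketch of what excluding infrastructure namespaces can look like in practice (the webhook, service, and namespace names here are placeholders, and the exact exclusion list depends on the cluster), a webhook configuration can use the automatically applied kubernetes.io/metadata.name namespace label:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy          # placeholder name
webhooks:
- name: policy.example.com      # placeholder webhook name
  # Skip cluster infrastructure namespaces so that an outage of this webhook
  # cannot block core components (or the CNI plugin) from restarting.
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "kube-node-lease", "calico-system"]  # adjust to your cluster
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: example-system  # placeholder
      name: example-webhook      # placeholder
      path: /validate
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
```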

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2024
@fasaxc
Contributor Author
fasaxc commented Oct 2, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2024
Contributor
@sftim sftim left a comment


Thanks. Here's some feedback.

example, due to a cluster-wide outage or a node failure where both pods run on the same node)
deadlock occurs because neither webhook pod can be recreated without the other already running.
It is recommended to establish a desired order for webhooks to start, then to exclude "earlier"
Contributor


I don't think this is a recommendation that the Kubernetes project makes.

Contributor Author


OK, I've toned it down from a recommendation to an example.

Comment on lines 1271 to 1272
It is recommended to exclude cluster infrastructure namespaces from webhooks, including kube-system,
any namespaces used by CNI plugins, etc.
Contributor


This isn't a Kubernetes project recommendation, although in many (not all) cases it is a good idea.

Contributor Author


toned down.

@fasaxc
Contributor Author
fasaxc commented Oct 31, 2024

@sftim do you want to take over this PR? Seems like you want to rewrite this section of the docs.

@sftim
Contributor
sftim commented Oct 31, 2024

@sftim do you want to take over this PR?

I'd much rather not, honestly. If I can help you follow our style guide, great - I'll provide advice. If you've found that you don't have capacity to move this forward, that's OK; please let us know.

If you would like to get this merged, the feedback so far will help you make the right kind of changes.

Contributor
@sftim sftim left a comment


/lgtm

A couple of nits, but: we should merge this; it will make the docs better.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: f01bff874888c599957b47bf82ef0db5adc994c2

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2024
@k8s-ci-robot k8s-ci-robot requested a review from sftim November 1, 2024 10:11
@fasaxc
Contributor Author
fasaxc commented Nov 1, 2024

Applied your markups and tweaked the language to avoid making "official recommendations". PTAL.

Comment on lines 1262 to 1264
One way to prevent this is to establish a desired order for webhooks to start, then to exclude
"earlier" webhooks' resources from being inspected by "later" webhooks. This ensures that the
"earliest" webhook can start, which in turn allows the next.
Contributor


I don't understand how this could work and don't think it's an effective control. See https://stackoverflow.com/a/69166466 for informal confirmation that admission webhooks are not ordered.

Contributor Author


I think you've taken the wrong idea from that paragraph, which suggests I could explain it more clearly! I'm talking about the order in which the webhook pods start, rather than the order in which webhooks apply to a particular resource.

The deadlock that can happen is if

  • webhook A is unavailable
  • webhook B is unavailable
  • webhook A is needed to approve webhook B's pod
  • webhook B is needed to approve webhook A's pod.

So the suggestion is:

  • Pick the webhook that needs to apply most broadly. Let's say that's A.
  • Make sure that no other webhooks can interfere with starting A.
  • Therefore A can always start, even if no other webhooks are available.
  • Therefore B can start, because it depends on A, and we know that A can start.

If you have 3 webhooks, A, B, C, then

  • Make sure B and C exclude A's pods
  • Therefore A can start
  • Make sure C excludes B's pods
  • Therefore B can start, since we know that A can start
  • Therefore C can start, since A and B can start.
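To illustrate the ordering idea (all namespaces and names below are hypothetical), webhook B's configuration can simply never match pods in webhook A's namespace, for example via a namespaceSelector; webhook C would likewise exclude both A's and B's namespaces:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: webhook-b                # the "later" webhook in the ordering
webhooks:
- name: b.example.com
  # Never inspect pods in webhook A's namespace (or B's own), so A can always
  # start first and B's own pods are not blocked by B being unavailable.
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["webhook-a-system", "webhook-b-system"]
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: webhook-b-system
      name: webhook-b
      path: /mutate
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail
```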

Contributor Author


I'll try to reword it, but suggestions also welcome

Contributor


This kind of workload sequencing doesn't really hold because nodes are not considered durable and resilient. Also, do try to reword.

Contributor


I get the point we want to make. Maybe that's worth shifting to its own piece (e.g. a blog article)?

Contributor Author


I think as long as you've got an ordering, node failure is fine:

  • webhook A comes up
  • webhook A's node dies
  • webhook A is rescheduled somewhere else
  • now webhook B can start... and so on.

Contributor Author


It should get there eventually, whereas without some sort of ordering like this, node failures will cause deadlocks and you'll be forced to intervene by temporarily removing your webhook.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2025
container with name "foo-sidecar" with the expected configuration exists in the to-be-created
object.

### Avoiding deadlocks in self-hosted webhooks
Member


Maybe this scenario can also be incorporated: kubernetes/enhancements#5068 (comment)

Contributor Author


This PR feels like it's in danger of not converging; I'd rather not add anything more to it.

@sftim
Contributor
sftim commented Feb 14, 2025

/lgtm

We should merge this; it will make the docs better.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 14, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9647147c1f415d781b8078b9ce710d5f81f366f4

@sftim
Contributor
sftim commented Feb 14, 2025

@fasaxc would you squash this to 1 commit and force-push? (if you keep the same merge base, Prow won't drop the LGTM label).

Webhooks can cause deadlocks in several ways, expand the list to cover more subtle cases.

Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
shannonxtreme added a commit to shannonxtreme/website that referenced this pull request Feb 19, 2025
Source PR: kubernetes#46798

Co-authored-by: Shaun Crampton <shaun@tigera.io>
Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
shannonxtreme added a commit to shannonxtreme/website that referenced this pull request Feb 19, 2025
Source PR: kubernetes#46798

Co-authored-by: Shaun Crampton <shaun@tigera.io>
Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
@nate-double-u
Contributor

/lgtm

We should merge this; it will make the docs better.

I agree; I don't see any reason not to merge this in as-is. Any unaddressed feedback can be left for a follow-up PR.

Thanks for this @fasaxc, and thanks to all the reviewers!

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nate-double-u

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 20, 2025
@k8s-ci-robot k8s-ci-robot merged commit 91919bd into kubernetes:main Feb 20, 2025
6 checks passed
shannonxtreme added a commit to shannonxtreme/website that referenced this pull request Mar 10, 2025
Source PR: kubernetes#46798

Co-authored-by: Shaun Crampton <shaun@tigera.io>
Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
shannonxtreme added a commit to shannonxtreme/website that referenced this pull request Mar 21, 2025
Source PR: kubernetes#46798

Co-authored-by: Shaun Crampton <shaun@tigera.io>
Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>
shannonxtreme added a commit to shannonxtreme/website that referenced this pull request Mar 21, 2025
Source PR: kubernetes#46798

Co-authored-by: Shaun Crampton <shaun@tigera.io>
Co-authored-by: Kat Cosgrove <kat.cosgrove@gmail.com>
Co-authored-by: Tim Bannister <tim@scalefactory.com>