Make container removal fail if platform-specific clean ups fail #80320
Conversation
@wk8 Why are the previous comments being deleted? Are you still scripting retests against kubernetes/kubernetes?
/test pull-kubernetes-bazel-test
@cblecker: deactivated the automated process. Sorry.
Thanks for your PR :)
Before I start a review, can you share more about this line in your PR:
that could potentially (in pathological cases) cause memory leaks - for containers that use GMSA cred specs, get created successfully, but then never get started nor removed.
Can you share more about these pathological cases and the size/scope of the memory leaks?
@mattjmcnaughton: the only way cred specs can leak is if a container is created with a cred spec, then somehow never gets either started or removed. That's not very likely to occur, but it was deemed a sufficient concern to require some kind of GC before GMSA support is considered mature enough to be graduated to beta (see #74737 (comment)). Please also note that this whole thing should become moot and superfluous once we can deprecate support for versions of docker older than the current latest release, which includes moby/moby#38777, a patch that makes it possible to pass cred specs directly to docker instead of having to write them somewhere and then clean up.
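For readers without the dockershim context, here is a minimal, self-contained sketch of the bookkeeping being discussed: a per-container cleanup entry is recorded at creation time and only dropped once the container is started or removed, so an entry for a container that never reaches either state lingers forever. All names below are illustrative; this is not the actual dockershim code.

```go
package main

import (
	"fmt"
	"sync"
)

// containerCleanupInfo is a hypothetical stand-in for the per-container state
// the dockershim tracks (e.g. the registry value holding a GMSA cred spec that
// must be deleted later).
type containerCleanupInfo struct {
	gMSARegistryValueName string
}

// dockerServiceSketch mimics the in-memory bookkeeping described above.
type dockerServiceSketch struct {
	mu                    sync.Mutex
	containerCleanupInfos map[string]*containerCleanupInfo
}

// createContainer records a cleanup entry for the new container.
func (ds *dockerServiceSketch) createContainer(containerID, credSpecRegistryValue string) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	ds.containerCleanupInfos[containerID] = &containerCleanupInfo{gMSARegistryValueName: credSpecRegistryValue}
}

// startOrRemoveContainer is where the cleanup (e.g. deleting the registry
// value) would happen; only then is the map entry dropped.
func (ds *dockerServiceSketch) startOrRemoveContainer(containerID string) {
	ds.mu.Lock()
	defer ds.mu.Unlock()
	delete(ds.containerCleanupInfos, containerID)
}

func main() {
	ds := &dockerServiceSketch{containerCleanupInfos: map[string]*containerCleanupInfo{}}
	ds.createContainer("abc123", "k8s-cred-spec-abc123")
	// If startOrRemoveContainer is never called for "abc123", its entry (and
	// the underlying cred spec) is never cleaned up -- the leak in question.
	fmt.Println(len(ds.containerCleanupInfos), "entry still tracked")
}
```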
Gotcha, thanks @wk8 - just wanted to confirm that the failure was bad/common enough to warrant a fix, and it sounds like we've already had that discussion. cc @lavalamp - how much context do you have on this part of the code base? I have fairly little. Happy to try and wade in if necessary, but just wanted to check with you first :)
nit: it might be good to prefix this with unsafe so it's consistent with the other functions that assume the caller holds the mutex
Good idea, done, thanks :)
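For illustration, a minimal sketch of the naming convention this nit refers to, using a hypothetical store type rather than the actual dockershim code: helpers prefixed with unsafe assume the caller already holds the mutex, while the locking wrapper takes it itself.

```go
package main

import "sync"

type store struct {
	mu    sync.Mutex
	items map[string]string
}

// unsafeGet assumes the caller already holds s.mu; the "unsafe" prefix
// signals that convention to readers.
func (s *store) unsafeGet(key string) string {
	return s.items[key]
}

// Get is the locking wrapper that external callers use.
func (s *store) Get(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.unsafeGet(key)
}

func main() {
	s := &store{items: map[string]string{"a": "b"}}
	_ = s.Get("a")
}
```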
@wk8 thanks for the PR. For the reasons you stated above, I think a simpler cleanup routine would suffice. We don't need to worry about containers that were never started or removed -- kubelet should try to remove them, which would trigger the cleanup anyway. The reviewer comments from the previous PR were mainly about the case where the cleanup fails (for whatever reason) and dockershim never attempts to clean up again. To handle that specific corner case, dockershim can periodically look at the cleanups that failed previously and attempt them again. This should be a much smaller code change.
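A rough sketch of what that periodic-retry suggestion could look like. This only illustrates the idea being proposed here, not what the PR ultimately implemented, and all names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// failedCleanups tracks container IDs whose platform-specific cleanup failed,
// so a background loop can retry them later.
type failedCleanups struct {
	mu  sync.Mutex
	ids map[string]bool
}

// record remembers a container whose cleanup failed.
func (f *failedCleanups) record(containerID string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.ids[containerID] = true
}

// retryLoop periodically re-attempts the cleanups that failed previously,
// dropping an entry only once its cleanup succeeds.
func (f *failedCleanups) retryLoop(interval time.Duration, cleanup func(string) error, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			f.mu.Lock()
			for id := range f.ids {
				if err := cleanup(id); err == nil {
					delete(f.ids, id)
				}
			}
			f.mu.Unlock()
		}
	}
}

func main() {
	f := &failedCleanups{ids: map[string]bool{}}
	f.record("abc123")
	stop := make(chan struct{})
	go f.retryLoop(50*time.Millisecond, func(id string) error {
		fmt.Println("retrying cleanup for", id)
		return nil
	}, stop)
	time.Sleep(120 * time.Millisecond)
	close(stop)
}
```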
@yujuhong: GitHub didn't allow me to request PR reviews from the following users: liyanhui1228. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@yujuhong: that's indeed what this change is trying to accomplish, in a way that also stays reasonably efficient if for whatever reason we end up with a large number of such failures. Could you please be a little more specific about which parts should be simplified? Thank you!
Force-pushed from 870531c to 60671ff
/test pull-kubernetes-integration
1 similar comment
/test pull-kubernetes-integration
@yujuhong: simplified the implementation at the cost of slightly decreased performance; please let me know what you think, thanks!
cleanup associated -> associated cleanup
Thanks, fixed :)
/assign @Random-Liu
After discussing it with @Random-Liu on Slack, it seems that just running clean ups at container removal, and failing those removals if the clean up fails, is the cleaner way to go here :) Will make that change as soon as @yujuhong okays it.
SGTM. Thanks!
containerCleanupInfos
kubernetes#74737 introduced a new in-memory map in the dockershim that could potentially (in pathological cases) cause memory leaks for containers that use GMSA cred specs and get created successfully, but then never get started or removed. This patch addresses the issue by making container removal fail altogether when platform-specific clean ups fail: this allows clean ups to be retried later, when the kubelet attempts to remove the container again. Resolves issue kubernetes#74843. Signed-off-by: Jean Rouge <rougej+github@gmail.com>
@yujuhong & @Random-Liu: amended as requested to fail container removal when the clean up fails. PTAL, thanks!
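For context, a simplified sketch of the behavior the amended PR describes, using hypothetical names rather than the actual dockershim API: if the platform-specific cleanup fails, the removal itself returns an error and the cleanup info is kept around, so the kubelet's next removal attempt retries the cleanup.

```go
package main

import (
	"errors"
	"fmt"
)

// removeContainerSketch runs the platform-specific cleanup before removing
// the container. If the cleanup fails, the entry stays in the map and the
// removal returns an error, so a later removal attempt retries the cleanup.
func removeContainerSketch(
	containerID string,
	cleanupInfos map[string]string,
	platformCleanup func(info string) error,
	dockerRemove func(id string) error,
) error {
	if info, present := cleanupInfos[containerID]; present {
		if err := platformCleanup(info); err != nil {
			// Keep the entry around and fail the removal so it gets retried.
			return fmt.Errorf("failed to clean up after container %q: %v", containerID, err)
		}
		delete(cleanupInfos, containerID)
	}
	return dockerRemove(containerID)
}

func main() {
	infos := map[string]string{"abc123": "cred-spec-registry-value"}
	failingCleanup := func(string) error { return errors.New("registry value deletion failed") }
	err := removeContainerSketch("abc123", infos, failingCleanup, func(string) error { return nil })
	fmt.Println("removal error:", err, "- entries still tracked:", len(infos))
}
```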
LGTM with one question
 	if cleanupInfo, present := ds.containerCleanupInfos[containerID]; present {
-		ds.performPlatformSpecificContainerCleanupAndLogErrors(containerID, cleanupInfo)
 		delete(ds.containerCleanupInfos, containerID)
+		errors = ds.performPlatformSpecificContainerCleanupAndLogErrors(containerID, cleanupInfo)
Shouldn't you call performPlatformSpecificContainerCleanup instead?
Or do you intentionally want to log the error?
Yes, that's to log any errors that might occur; it seems better to do it here, as performPlatformSpecificContainerCleanup lives in platform-specific files, so logging here centralizes it for all platforms.
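A simplified sketch of the split being described, reusing the function names from the thread but not their actual dockershim bodies: the platform-specific cleanup returns errors, and a shared wrapper logs them once for all platforms.

```go
package main

import "log"

// performPlatformSpecificContainerCleanup stands in for the per-platform
// implementations (a no-op on Linux, GMSA registry cleanup on Windows, etc.).
func performPlatformSpecificContainerCleanup(cleanupInfo string) []error {
	// Platform-specific work would go here.
	return nil
}

// performPlatformSpecificContainerCleanupAndLogErrors wraps the call above and
// logs whatever errors come back, so the logging lives in shared code rather
// than being duplicated in each platform-specific file.
func performPlatformSpecificContainerCleanupAndLogErrors(containerID, cleanupInfo string) []error {
	errs := performPlatformSpecificContainerCleanup(cleanupInfo)
	for _, err := range errs {
		log.Printf("error when cleaning up after container %q: %v", containerID, err)
	}
	return errs
}

func main() {
	_ = performPlatformSpecificContainerCleanupAndLogErrors("abc123", "cred-spec-registry-value")
}
```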
/priority important-soon
/test pull-kubernetes-e2e-gce
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: PatrickLang, Random-Liu, wk8
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce-100-performance
Concerns from alpha-phase reviews have been addressed in kubernetes#80320 and kubernetes#82109, and early adopters have given positive feedback, so it seems there are no blockers to graduating GMSA support to beta. This patch also enables GMSA support by default, mainly for the sake of making it easier for cluster admins to use, as well as for e2e tests in nightly builds. Signed-off-by: Jean Rouge <rougej+github@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
#74737 introduced a new in-memory map in the dockershim that could potentially
(in pathological cases) cause memory leaks for containers that use GMSA cred
specs and get created successfully, but then never get started or removed.
This patch addresses this issue by making container removal fail altogether
when platform-specific clean ups fail: this allows clean ups to be retried
later, when the kubelet attempts to remove the container again.
Which issue(s) this PR fixes:
Fixes #74843
Does this PR introduce a user-facing change?: