Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641
base: master
Conversation
Welcome @ngopalak-redhat!
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @ngopalak-redhat. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ngopalak-redhat. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from fbc789d to f04522d
Force-pushed from f04522d to b70990e
gomega.Expect(response).NotTo(gomega.BeEmpty())
// Each attempt to grab metrics from the Kubelet has an internal proxyTimeout (2 minutes).
// The gomega.Eventually ensures we retry the entire grab operation (including potential internal proxy timeouts)
// for up to 7 minutes, polling every 10 seconds. This allows for multiple retries if the Kubelet is temporarily unavailable.
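For context, a minimal sketch of the retry pattern this comment describes, using the standard Gomega API. The `grab` callback and `waitForKubeletMetrics` wrapper are hypothetical stand-ins, not the PR's actual helpers:

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/onsi/gomega"
)

// waitForKubeletMetrics is a sketch only: grab stands in for the e2e
// metrics-grabber call, which internally enforces a ~2 minute proxy
// timeout per attempt.
func waitForKubeletMetrics(ctx context.Context, grab func(context.Context) (string, error)) {
	gomega.Eventually(ctx, func() error {
		response, err := grab(ctx)
		if err != nil {
			return err // proxy timeouts land here; Eventually retries
		}
		if response == "" {
			return fmt.Errorf("empty metrics response")
		}
		return nil
	}).WithTimeout(7 * time.Minute).WithPolling(10 * time.Second).Should(gomega.Succeed())
}
```

Because the error is returned rather than asserted inside the polled function, each proxy timeout simply triggers another polling cycle instead of failing the spec outright.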
7 minutes is too much to get metrics; 5 retries in one minute sounds more than enough.
/hold
That is the symptom; can you please explain the root cause? Have you checked the kubelet, runtime, and apiserver logs to see where the connection times out? We should not hide errors by implementing retries in a test until we evaluate whether the problem is really something that requires retries to solve.
Thanks @aojea for the review. We agree that retries shouldn't simply hide errors, so we investigated this specific flakiness.

Due to its low occurrence and its dependency on parallel execution, we unfortunately couldn't reproduce the specific proxy timeout error with increased logging enabled. This behavior strongly suggests that other concurrent e2e tests occasionally cause temporary resource exhaustion or minor network delays within the cluster, leading to intermittent proxy timeouts. It is not necessarily a bug in the proxy itself, but rather an occasional symptom of contention.

Given the difficulty of finding the root cause under parallel load, and the fact that the test itself takes only 1-2 seconds when successful, we believe adding a targeted retry for proxy timeouts is the most feasible solution here; it specifically addresses only the proxy timeout issue. The retry logic is designed to choose a different node (a rough sketch follows), which should help bypass any network or resource issue on a particular node at the moment the code executes.

We welcome suggestions if you have ideas on how to deterministically capture logs for such highly intermittent, parallel-execution-dependent failures.
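A rough illustration of the node re-selection idea described above. All names here (`grabFromRandomNode`, `grab`, `nodes`) are hypothetical stand-ins, not the PR's actual code:

```go
import (
	"context"
	"math/rand"
)

// grabFromRandomNode is a hypothetical sketch: every retry picks a fresh
// random node, so a transient problem on one node doesn't fail the whole
// run. nodes is assumed to be the list of schedulable node names.
func grabFromRandomNode(ctx context.Context, nodes []string,
	grab func(ctx context.Context, nodeName string) (string, error)) (string, error) {
	node := nodes[rand.Intn(len(nodes))] // re-rolled on every attempt
	return grab(ctx, node)
}
```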
Once you get an occurrence, get the kubelet logs to check whether it is a problem in the container runtime (which I assume is CRI-O) or in Kubernetes itself. Also correlate to see if this fails in the Kubernetes CI; we need to check the current list of flakes to see whether these failures show up there.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR addresses the flakiness observed in the "should grab all metrics from kubelet /metrics/resource endpoint" test when run in parallel.
Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism: if a proxy timeout occurs, the test now attempts to fetch metrics/resource from a different random node in a multi-node cluster.
The test's existing proxy timeout is 2 minutes per attempt. With the new retry logic, retries continue for up to 7 minutes, allowing for approximately 3 attempts. In ideal conditions, this test still completes within 1-2 seconds.
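To make the arithmetic explicit, a sketch of how the timing values relate; the constant names are illustrative, not the PR's actual identifiers:

```go
import "time"

// Illustrative constants only.
const (
	proxyTimeout = 2 * time.Minute  // per-attempt timeout inside the grabber
	retryWindow  = 7 * time.Minute  // overall gomega.Eventually timeout
	pollInterval = 10 * time.Second // delay between attempts
)

// In the worst case each attempt consumes the full proxyTimeout, so the
// window fits roughly retryWindow/proxyTimeout = 3 attempts, matching the
// "approximately 3 attempts" figure above; in the happy path the first
// attempt returns in seconds and no retry happens.
```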
Here's a copy of the log for reference:
Which issue(s) this PR is related to:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: