8000 Add retries to proxy timeout for metrics resource endpoint in e2e tests by ngopalak-redhat · Pull Request #132641 · kubernetes/kubernetes · GitHub
[go: up one dir, main page]

Skip to content

Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ngopalak-redhat
Copy link
@ngopalak-redhat ngopalak-redhat commented Jul 1, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR addresses the flakiness observed in the "should grab all metrics from kubelet /metrics/resource endpoint" test when run in parallel.

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

The test's existing proxy timeout is 2 minutes. With the new retry logic, the retries are configured to last up to 7 minutes, allowing for approximately 3 retry attempts. In ideal conditions, this test still completes within 1-2 seconds.

Here's the copy of the log for reference:

  STEP: Creating a kubernetes client @ 06/15/25 11:48:02.606
  STEP: Building a namespace api object, basename metrics @ 06/15/25 11:48:02.607
I0615 11:48:02.634122 15781 namespace.go:59] About to run a Kube e2e test, ensuring namespace/e2e-metrics-3900 is privileged
  STEP: Waiting for a default service account to be provisioned in namespace @ 06/15/25 11:48:03.033
  STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 06/15/25 11:48:03.036
  STEP: Connecting to kubelet's /metrics/resource endpoint @ 06/15/25 11:48:03.043
I0615 11:50:03.050424 15781 metrics.go:67] Unexpected error: 
    <*errors.errorString | 0xc00321ae90>: 
    Timed out when waiting for proxy to gather metrics from ip-10-0-143-38.ec2.internal
    {
        s: "Timed out when waiting for proxy to gather metrics from ip-10-0-143-38.ec2.internal",
    }
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/instrumentation/metrics.go:67 @ 06/15/25 11:50:03.05
  STEP: dump namespace information after failure @ 06/15/25 11:50:03.05
  STEP: Collecting events from namespace "e2e-metrics-3900". @ 06/15/25 11:50:03.05
  STEP: Found 0 events. @ 06/15/25 11:50:03.052

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 1, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @ngopalak-redhat!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

Hi @ngopalak-redhat. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 1, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ngopalak-redhat
Once this PR has been reviewed and has the lgtm label, please assign dashpole for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 1, 2025
@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/e2etest_metrics branch from fbc789d to f04522d Compare July 1, 2025 04:15
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 1, 2025
@ngopalak-redhat ngopalak-redhat changed the title Add retries to proxy timeout for metrics resource endpoint Add retries to proxy timeout for metrics resource endpoint in e2e tests Jul 1, 2025
@ngopalak-redhat ngopalak-redhat force-pushed the ngopalak/e2etest_metrics branch from f04522d to b70990e Compare July 1, 2025 09:47
@ngopalak-redhat ngopalak-redhat marked this pull request as ready for review July 1, 2025 09:50
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2025
gomega.Expect(response).NotTo(gomega.BeEmpty())
// Each attempt to grab metrics from the Kubelet has an internal proxyTimeout (2 minutes).
// The Gomega.Eventually ensures we retry the entire grab operation (including potential internal proxy timeouts)
// for up to 7 minutes, polling every 10 seconds. This allows for multiple retries if the Kubelet is temporarily
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7 minutes is too much to get a metrics, 5 retries in one minute sounds more than enough,

@aojea
Copy link
Member
aojea commented Jul 1, 2025

/hold

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

that is the symptom, can you please explain the root cause, have you checked the kubelet, runtime and apiserver logs to see where the connection times out? we should not hide errors implementing retries in test until we evaluate the problem is really something that requires retries to solve

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 1, 2025
@ngopalak-redhat
Copy link
Author

/hold

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

that is the symptom, can you please explain the root cause, have you checked the kubelet, runtime and apiserver logs to see where the connection times out? we should not hide errors implementing retries in test until we evaluate the problem is really something that requires retries to solve

Thanks @aojea for the review. We agree that retries shouldn't simply hide errors. We investigated this specific flakiness:
The test's success rate drops to 91% when running e2e tests in parallel, but it's 100% reliable when run serially. This strong correlation with parallel execution points away from a consistent bug in a single component (like kubelet or apiserver) and towards resource issues or network issues due to parallel load.

Due to its low occurance and dependency on parallel execution, we unfortunately couldn't reproduce the specific proxy timeout error with increased logging enabled.

This behavior strongly suggests that other concurrent e2e tests are occasionally causing temporary resource exhaustion or minor network delays within the cluster, leading to intermittent proxy timeouts. It's not necessarily a bug in the proxy itself, but rather an occasional symptom of contention.

Given the difficulty in finding the root cause under parallel load, and the fact that the test itself takes only 1-2 seconds when successful, we believe adding a targeted retry for proxy timeouts is the most feasible solution here. It specifically addresses only the proxy timeout issue.

The retry logic is designed to choose a different node, which should help bypass any network or resource issues on a particular node during that specific moment when the code executed.

We are welcome to suggestions if you have ideas on how to deterministically capture logs for such highly intermittent, parallel-execution-dependent failures.

@aojea
Copy link
Member
aojea commented Jul 1, 2025

once you get an oc 8000 currence get the kubelet logs to check if is a problem on the controller runtime, that I assume is CRIO or in kubernetes itself.

Also, correlate to see if this fails in kubernetes CI, we need to check in current list of flakes if we see this failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0