Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

ngopalak-redhat · 2025-07-01T04:09:13Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR addresses the flakiness observed in the "should grab all metrics from kubelet /metrics/resource endpoint" test when run in parallel.

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

The test's existing proxy timeout is 2 minutes. With the new retry logic, the retries are configured to last up to 7 minutes, allowing for approximately 3 retry attempts. In ideal conditions, this test still completes within 1-2 seconds.

Here's the copy of the log for reference:

  STEP: Creating a kubernetes client @ 06/15/25 11:48:02.606
  STEP: Building a namespace api object, basename metrics @ 06/15/25 11:48:02.607
I0615 11:48:02.634122 15781 namespace.go:59] About to run a Kube e2e test, ensuring namespace/e2e-metrics-3900 is privileged
  STEP: Waiting for a default service account to be provisioned in namespace @ 06/15/25 11:48:03.033
  STEP: Waiting for kube-root-ca.crt to be provisioned in namespace @ 06/15/25 11:48:03.036
  STEP: Connecting to kubelet's /metrics/resource endpoint @ 06/15/25 11:48:03.043
I0615 11:50:03.050424 15781 metrics.go:67] Unexpected error: 
    <*errors.errorString | 0xc00321ae90>: 
    Timed out when waiting for proxy to gather metrics from ip-10-0-143-38.ec2.internal
    {
        s: "Timed out when waiting for proxy to gather metrics from ip-10-0-143-38.ec2.internal",
    }
  [FAILED] in [It] - k8s.io/kubernetes/test/e2e/instrumentation/metrics.go:67 @ 06/15/25 11:50:03.05
  STEP: dump namespace information after failure @ 06/15/25 11:50:03.05
  STEP: Collecting events from namespace "e2e-metrics-3900". @ 06/15/25 11:50:03.05
  STEP: Found 0 events. @ 06/15/25 11:50:03.052

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2025-07-01T04:09:23Z

Welcome @ngopalak-redhat!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-07-01T04:09:23Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-07-01T04:09:23Z

Hi @ngopalak-redhat. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-07-01T04:09:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ngopalak-redhat
Once this PR has been reviewed and has the lgtm label, please assign dashpole for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aojea · 2025-07-01T10:03:06Z

test/e2e/instrumentation/metrics.go

-		gomega.Expect(response).NotTo(gomega.BeEmpty())
+		// Each attempt to grab metrics from the Kubelet has an internal proxyTimeout (2 minutes).
+		// The Gomega.Eventually ensures we retry the entire grab operation (including potential internal proxy timeouts)
+		// for up to 7 minutes, polling every 10 seconds. This allows for multiple retries if the Kubelet is temporarily


7 minutes is too much to get a metrics, 5 retries in one minute sounds more than enough,

aojea · 2025-07-01T10:08:55Z

/hold

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

that is the symptom, can you please explain the root cause, have you checked the kubelet, runtime and apiserver logs to see where the connection times out? we should not hide errors implementing retries in test until we evaluate the problem is really something that requires retries to solve

ngopalak-redhat · 2025-07-01T10:31:40Z

/hold

Our analysis shows the failures are due to proxy timeouts. To mitigate this, we're implementing a retry mechanism. If a proxy timeout occurs, the test will now attempt to fetch metrics/resource from a random node in a multi-node cluster.

that is the symptom, can you please explain the root cause, have you checked the kubelet, runtime and apiserver logs to see where the connection times out? we should not hide errors implementing retries in test until we evaluate the problem is really something that requires retries to solve

Thanks @aojea for the review. We agree that retries shouldn't simply hide errors. We investigated this specific flakiness:
The test's success rate drops to 91% when running e2e tests in parallel, but it's 100% reliable when run serially. This strong correlation with parallel execution points away from a consistent bug in a single component (like kubelet or apiserver) and towards resource issues or network issues due to parallel load.

Due to its low occurance and dependency on parallel execution, we unfortunately couldn't reproduce the specific proxy timeout error with increased logging enabled.

This behavior strongly suggests that other concurrent e2e tests are occasionally causing temporary resource exhaustion or minor network delays within the cluster, leading to intermittent proxy timeouts. It's not necessarily a bug in the proxy itself, but rather an occasional symptom of contention.

Given the difficulty in finding the root cause under parallel load, and the fact that the test itself takes only 1-2 seconds when successful, we believe adding a targeted retry for proxy timeouts is the most feasible solution here. It specifically addresses only the proxy timeout issue.

The retry logic is designed to choose a different node, which should help bypass any network or resource issues on a particular node during that specific moment when the code executed.

We are welcome to suggestions if you have ideas on how to deterministically capture logs for such highly intermittent, parallel-execution-dependent failures.

aojea · 2025-07-01T10:44:56Z

once you get an oc 8000 currence get the kubelet logs to check if is a problem on the controller runtime, that I assume is CRIO or in kubernetes itself.

Also, correlate to see if this fails in kubernetes CI, we need to check in current list of flakes if we see this failures

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 1, 2025

k8s-ci-robot requested review from dgrisonnet and logicalhan July 1, 2025 04:09

ngopalak-redhat force-pushed the ngopalak/e2etest_metrics branch from fbc789d to f04522d Compare July 1, 2025 04:15

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 1, 2025

ngopalak-redhat changed the title ~~Add retries to proxy timeout for metrics resource endpoint~~ Add retries to proxy timeout for metrics resource endpoint in e2e tests Jul 1, 2025

Add retries to proxy timeout for metrics resource endpoint

b70990e

ngopalak-redhat force-pushed the ngopalak/e2etest_metrics branch from f04522d to b70990e Compare July 1, 2025 09:47

ngopalak-redhat marked this pull request as ready for review July 1, 2025 09:50

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2025

k8s-ci-robot requested review from richabanker and s-urbaniak July 1, 2025 09:50

aojea reviewed Jul 1, 2025

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 1, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

Are you sure you want to change the base?

Add retries to proxy timeout for metrics resource endpoint in e2e tests #132641

Conversation

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!