client-go/discovery: Migrate disk cache to RFC 9111 compliant implementation by bartventer · Pull Request #132681 · kubernetes/kubernetes

client-go/discovery: Migrate disk cache to RFC 9111 compliant implementation #132681


Open
wants to merge 7 commits into
base: master

Conversation

bartventer
@bartventer bartventer commented Jul 2, 2025

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

This PR migrates the discovery disk cache from gregjones/httpcache to bartventer/httpcache for RFC 9111 HTTP Caching compliance, active maintenance, and improved caching behavior in terms of hit rates.

Motivation:
gregjones/httpcache is archived and implements the obsolete RFC 7234. bartventer/httpcache implements RFC 9111 §4.1 with proper normalization of Vary headers and URIs when generating cache keys; this covers recommended normalization steps from RFC 7230 §2.7.3 and RFC 3986 §6. In contrast, gregjones/httpcache uses (*url.URL).String() as the cache key, which cannot support Vary-based caching and may produce unpredictable behavior for requests with varying headers.
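To make the limitation concrete, here is a minimal sketch of a URL-string-only cache key, using only the Go standard library. `rawKey` and `mustRequest` are hypothetical helpers written for this illustration; they are not code from either library. The sketch shows two effects: requests that differ only in headers share a key, and equivalent URLs that differ in host case or default port get different keys.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// rawKey mirrors the approach this description attributes to
// gregjones/httpcache: the key is the URL string and nothing else,
// so request headers never influence it. (Hypothetical helper.)
func rawKey(req *http.Request) string {
	return req.URL.String()
}

// mustRequest builds a GET request with a single Accept header.
// (Hypothetical helper for this example only.)
func mustRequest(rawURL, accept string) *http.Request {
	u, err := url.Parse(rawURL)
	if err != nil {
		panic(err)
	}
	return &http.Request{
		Method: http.MethodGet,
		URL:    u,
		Header: http.Header{"Accept": {accept}},
	}
}

func main() {
	a := mustRequest("https://example.com/openapi/v3", "application/json")
	b := mustRequest("https://example.com/openapi/v3", "application/yaml")
	// Same key although Accept differs, so a response that varies on
	// Accept can only be stored once per URL.
	fmt.Println("same key for different Accept:", rawKey(a) == rawKey(b))

	c := mustRequest("HTTPS://EXAMPLE.COM:443/openapi/v3", "application/json")
	// Equivalent URL, but host case and the default port are preserved
	// by url.Parse, so the key differs and the entry is missed.
	fmt.Println("same key for equivalent URL:", rawKey(a) == rawKey(c))
}
```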

bartventer/httpcache also:

  • Caches a wider range of HTTP status codes (200, 203, 301, 304, 308, 404, 405, 410, 414, 501)
  • Supports stale-while-revalidate and immutable cache-control extensions
  • Provides more detailed cache status headers (X-Httpcache-Status: HIT, MISS, STALE, REVALIDATED, BYPASS), in addition to X-From-Cache (for compatibility with any existing consumers)

Technical changes:

  • Minor refactoring to adapt the current disk-based cache implementation
    to satisfy the driver.Conn interface
  • Introduces DSN-based cache configuration with sumDiskScheme
  • Improved test coverage for the new cache implementation

Which issue(s) this PR is related to:

References #120276

Special notes for your reviewer:

  • Added a comment and a TODO for the UTC/GMT date header workaround. It compensates for an RFC 9110 violation in kube-openapi, which emits "UTC" instead of "GMT" in Expires headers. The workaround can be removed once kube-openapi is fixed upstream (I will create a tracking issue after this PR merges).
  • bartventer/httpcache has its own acceptance tests for cache implementations, which are run here. I removed the overlapping tests, summarized below.

    Acceptance Test      Description
    SetAndGet            Replaces need for "NoSuchKey"
    Overwrite            Replaces need for "OverwriteExistingKey"
    Delete               Replaces need for "DeleteKey"
    GetNonexistent       Replaces need for "NoSuchKey"
    DeleteNonexistent    Additional case not covered before
  • Cache compatibility: Existing cache from gregjones/httpcache cannot be safely reused due to incompatible key formats and potential collision risks. See discussion below for migration options.

Does this PR introduce a user-facing change?

NONE

Replace `gregjones/httpcache` with `bartventer/httpcache` for RFC 9111 compliance and active maintenance.

Refs kubernetes#120276
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jul 2, 2025
linux-foundation-easycla bot commented Jul 2, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 2, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 2, 2025
@k8s-ci-robot
Contributor

Welcome @bartventer!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2025
@k8s-ci-robot
Contributor

Hi @bartventer. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested review from andrewsykim, bart0sh and a team July 2, 2025 12:53
@k8s-ci-robot k8s-ci-robot added area/apiserver area/cloudprovider area/dependency Issues or PRs related to dependency changes area/kubectl area/kubelet sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 2, 2025
@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label Jul 2, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG CLI Jul 2, 2025
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 2, 2025
@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jul 2, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 2, 2025
@aojea
Member
aojea commented Jul 2, 2025

/area code-organization

@k8s-ci-robot k8s-ci-robot added the area/code-organization Issues or PRs related to kubernetes code organization label Jul 2, 2025
@bartventer bartventer changed the title feat(client-go/discovery): Migrate disk cache to RFC 9111 compliant implementation client-go/discovery: Migrate disk cache to RFC 9111 compliant implementation Jul 2, 2025
@lmktfy
lmktfy commented Jul 2, 2025

This change does sound to me like it is user visible, albeit only indirectly.

@BenTheElder
Member

This sounds like a pretty big change, one missing detail in the PR body: How did you approach when there is a cache from the previous implementation already on disk? Does it get cleaned up? Re-used? Or abandoned to eat disk forever?

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 2, 2025
@bartventer
Author

> This change does sound to me like it is user visible, albeit only indirectly.

@lmktfy You're right that this could be indirectly user-visible. The main impacts:

Potential benefits:

  • Better cache hit rates (faster response times due to proper Vary header support)
  • More accurate HTTP caching per RFC 9111

Potential disruptions:

  • Initial cache misses until new cache warms up
  • Different caching behavior for responses with Vary headers

Since the discovery client API stays the same and this is internal to the caching layer, I marked it as NONE. Users call the same methods and get the same results, just with better HTTP compliance behind the scenes.

Do you think we should add a brief release note about the cache improvements, or is NONE appropriate given the unchanged API?

@bartventer
Author
bartventer commented Jul 3, 2025

> This sounds like a pretty big change, one missing detail in the PR body: How did you approach when there is a cache from the previous implementation already on disk? Does it get cleaned up? Re-used? Or abandoned to eat disk forever?
>
> /ok-to-test

Thanks for pointing this out.

The existing cache can't be reused because of incompatible key formats and collision risks between gregjones/httpcache and bartventer/httpcache. The new implementation uses a different storage structure that would conflict with existing entries.

Key Differences in Cache Key Generation:

gregjones/httpcache:

  • Uses (*url.URL).String() directly as the cache key (including fragments) (see httpcache.go)
  • Can only store one representation per URL
  • Overwrites previous entries when Vary headers differ

bartventer/httpcache:

  • Normalizes URLs per RFC 3986 §6 (lowercase scheme/host, remove default ports, normalize paths/encoding) (see urlkeyer.go)
  • Uses two-tier storage:
    • Normalized URL keys (without vary hash) that reference full response entries
    • Full response keys with Vary header hash: <normalized-url>#<vary-hash> (or #0 if no Vary headers) (see normalization.go)
  • Stores multiple representations of the same URL based on different Vary header combinations

Cache Collision Risks:

Manual cleanup isn't safe because of potential key collisions between old and new cache entries:

  1. Normalized URL collisions: When URL normalization doesn't change anything, old raw URL keys could collide with new normalized reference keys.
  2. Fragment collisions: Old entries with URL fragments could collide with new entries using #0 suffix (when no Vary headers exist).
  3. Two-tier lookup conflicts: New reference keys might accidentally resolve to old cache entries, corrupting data.
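The first two collision cases can be reproduced with a few lines of Go. The sketch assumes the key formats exactly as described in this comment; `oldKey`, `newReferenceKey`, and `newResponseKey` are hypothetical helpers, and normalization is omitted because the example URLs are already in normalized form.

```go
package main

import (
	"fmt"
	"net/url"
)

// oldKey follows the old scheme described above: the parsed URL's
// string form, fragment included. (Hypothetical helper.)
func oldKey(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u.String()
}

// newReferenceKey is the first-tier key: the normalized URL itself.
func newReferenceKey(normalizedURL string) string {
	return normalizedURL
}

// newResponseKey is the second-tier key: <normalized-url>#<vary-hash>.
func newResponseKey(normalizedURL, varyHash string) string {
	return normalizedURL + "#" + varyHash
}

func main() {
	const normalized = "https://example.com/apis"

	// Case 1: normalization changes nothing, so an old entry and a new
	// reference entry land on the same key.
	fmt.Println(oldKey("https://example.com/apis") == newReferenceKey(normalized))

	// Case 2: an old entry stored for a URL with fragment "0" has the
	// same key as a new response entry without Vary headers.
	fmt.Println(oldKey("https://example.com/apis#0") == newResponseKey(normalized, "0"))
}
```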

Possible Migration Approaches:

  1. Separate cache storage: Use a different cache directory/subdirectory to avoid collisions. The old cache can be cleaned up later, but expect slower performance until the new cache warms up (see footnote 1).
  2. Migration script: Convert old keys and strip synthetic headers (complex due to collision detection). Would require careful handling to avoid collisions, but potentially allows less disruption.
  3. Do nothing: Accept collision risks and let implementations coexist. Not really viable due to the risks outlined above.

@BenTheElder, some questions for you:

  • Which approach makes the most sense for Kubernetes users?
  • Are there existing patterns in the codebase for handling cache format migrations?
  • Should we document this incompatibility in release notes or migration guides?

I'm leaning toward option 1 (separate storage) given the collision risks, but would appreciate your thoughts on the best path forward.

Footnotes

  1. For example, TempDir (round_tripper.go#L80) could be updated to use a new name like .diskv-temp-9111. Cleanup of the old .diskv-temp directory can then be handled in a separate, non-blocking goroutine.

@bartventer
Author

/retest

@enj enj moved this to Needs Triage in SIG Auth Jul 7, 2025
@dgrisonnet
Member

/remove-sig instrumentation

@k8s-ci-robot k8s-ci-robot removed the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Jul 10, 2025
Labels
area/apiserver area/cloudprovider area/code-organization Issues or PRs related to kubernetes code organization area/dependency Issues or PRs related to dependency changes area/kubectl area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
6 participants