Description
What is the issue?
After the linkerd-proxy sidecar container in the linkerd-destination control plane pod was OOMKilled, all data plane sidecar proxies entered a permanently broken state in which every outbound request failed with buffer's worker closed unexpectedly. The proxies did not self-heal even after the destination pod fully recovered; a manual restart of the data plane sidecar proxies (via pod restart) was required to restore connectivity.
The root cause is a panic in control.rs at line 118:
thread 'main' panicked at linkerd/app/core/src/control.rs:118:49: period must be non-zero.
This panic was triggered when DNS resolution for linkerd-dst-headless.linkerd.svc.cluster.local returned zero results during the brief window when the destination pod was recovering from OOM. The zero-result DNS response caused a resolution period of zero to be computed, which triggered the panic.
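As an illustration (this is a hypothetical sketch, not the proxy's actual control.rs code), a re-resolution period derived from a DNS response can collapse to zero when the response contains no records, and passing a zero Duration to a timer (e.g. tokio's time::interval, which asserts that its period is non-zero) panics with exactly this kind of message. The sketch below shows the failure shape and a defensive floor on the period; refresh_period and MIN_PERIOD are made-up names for illustration:

```rust
use std::time::Duration;

// Hypothetical sketch: derive a re-resolution period from the minimum TTL
// across returned DNS records. With zero records, `min()` yields `None`, and
// a naive fallback of 0 would later panic when handed to a timer that
// requires a non-zero period.
fn refresh_period(ttls: &[u64]) -> Duration {
    const MIN_PERIOD: Duration = Duration::from_secs(1); // defensive floor
    let ttl = ttls.iter().copied().min().unwrap_or(0);
    Duration::from_secs(ttl).max(MIN_PERIOD)
}

fn main() {
    // An empty (zero-record) response no longer produces a zero period.
    assert_eq!(refresh_period(&[]), Duration::from_secs(1));
    assert_eq!(refresh_period(&[30, 5]), Duration::from_secs(5));
    println!("ok");
}
```

Clamping at the point where the period is computed would prevent the panic regardless of what the DNS layer returns during the recovery window.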
The panic killed the internal balance queue worker task (at linkerd/proxy/balance/queue/src/service.rs:73), after which the proxy's outbound pipeline was permanently broken. Every subsequent outbound connection attempt failed with:
WARN outbound: linkerd_app_core::serve: Server failed to become ready error=buffer's worker closed unexpectedly
This is related to #14333
How can it be reproduced?
Restarting the linkerd-destination deployment several times may help reproduce the issue, since the panic depends on a data plane proxy's DNS lookup landing in the window where linkerd-dst-headless resolves to zero endpoints.
Logs, error output, etc
Phase 1: Normal startup
INFO linkerd2_proxy: release 2.316.0 (0a932ea) by linkerd on 2025-08-27T03:53:54Z
INFO dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.x.x.x:x
INFO dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.x.x.x:x
INFO dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.x.x.x:x
INFO linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:x
Phase 2: OOM event on destination pod proxy - gRPC streams break
WARN watch{port=8080}: linkerd_app_inbound::policy::api: Unexpected policy controller response; retrying with a backoff grpc.status=Unknown error grpc.message="h2 protocol error: error reading a body from connection"
WARN policy:controller:endpoint{addr=10.x.x.x:x}: linkerd_reconnect: Service failed error=endpoint 10.x.x.x:x: channel closed
WARN policy:controller:endpoint{addr=10.x.x.x:x}: linkerd_reconnect: Failed to connect error=endpoint 10.x.x.x:x: Connection refused (os error 111)
Phase 3: DNS resolution fails for destination service
WARN dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_app_core::control: Failed to resolve control-plane component error=failed SRV and A record lookups: failed to resolve SRV record: proto error: no records found for Query { name: Name("linkerd-dst-headless.linkerd.svc.cluster.local."), query_type: SRV, query_class: IN }; failed to resolve A record: proto error: no records found for Query { name: Name("linkerd-dst-headless.linkerd.svc.cluster.local."), query_type: AAAA, query_class: IN }
Phase 4: Fatal panic - period must be non-zero
thread 'main' panicked at linkerd/app/core/src/control.rs:118:49: period must be non-zero.
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Phase 5: Permanent broken state
thread 'main' panicked at /__w/linkerd2-proxy/linkerd2-proxy/linkerd/proxy/balance/queue/src/service.rs:73:18: worker must set a failure if it exits prematurely
WARN outbound: linkerd_app_core::serve: Server failed to become ready error=buffer's worker closed unexpectedly client.addr=10.x.x.x:x
output of linkerd check -o short
NA
Environment
Linkerd version: stable-2.316.0 (proxy release 2.316.0, built 2025-08-27)
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None