Description
Summary
Coder users are experiencing intermittent and prolonged failures during SSH connections, especially in VPN environments where DNS servers are partially or intermittently unresponsive. These issues are directly tied to the behavior of Go’s standard resolver, which lacks in-process caching and has limited timeout and failover behavior.
This issue proposes more resilient DNS resolution, in-memory caching, enhanced diagnostics via coder netcheck, and tighter integration with Coder's SSH ProxyCommand to mitigate these failures and improve the user experience.
Problem Statement
Current Pain Points
- SSH ProxyCommand fails with errors like:
  could not get canonical name
  Did not find remote IP address (is SSH ProxyCommand disabled?)
- coder netcheck logs "no address for node" or timeouts
- VPN environments inject multiple DNS servers, some of which are non-responsive
- Go’s net.Resolver has no DNS cache, leading to repetitive failed lookups
- DNS queries (e.g., gethostbyname) can hang without enforced timeouts
- Lack of CLI tooling for diagnosing DNS-related failures
Root Cause Analysis
- Unresponsive or Hanging DNS Servers
  VPN clients often assign multiple DNS servers, but not all are reachable or fast.
  The operating system's regular DNS handling can then lead to long timeouts.
  With A healthy servers out of B configured in resolv.conf (at least one and at most three IP addresses), the expected success rate is A/B and the failure rate is 1 - A/B.
  The failure rate is therefore 100% when the only server is bad, 1/2 for one bad server out of two, and 1/3 or 2/3 for one or two failing servers out of three.
  Because of round-robin selection, individual end users can see varying success, but their reports typically match the scenario above.
- Go DNS Resolver Limitations
  - No built-in in-process DNS cache
  - Each call to net.LookupHost() performs a fresh network resolution
  - Go honours the GODEBUG environment variable (netdns=go or netdns=cgo), which selects the Go or system DNS resolver and could provide improved behaviour
  - net: DNS lookup timeout only when using go resolver golang/go#28419
  - cgo may be disabled by default on macOS builds (Coder is likely not enabling cgo as it's not enabled when cross compiling or compiling on macOS)
  - net: Support the /etc/resolver DNS resolution configuration hierarchy on OS X when cgo is disabled golang/go#12524
  - EDNS0 is not discussed here (tunnel issues or packet size)
  - net: go DNS resolver fails to connect to local DNS server golang/go#67925
- Lack of Observability & Failover
  - No health scoring or metrics for DNS servers
  - No fallback to alternate resolvers on failure
- Error Visibility
  - Error output from coder netcheck does not surface DNS issues clearly
  - No user-facing tools to diagnose DNS resolution failures in SSH flows
Proposed Solution
1. Enhanced DNS Resolution with Resilience
type DNSConfig struct {
ServerTimeout time.Duration `json:"server_timeout"` // e.g. 2s per DNS server
TotalTimeout time.Duration `json:"total_timeout"` // e.g. 10s total
RetryAttempts int `json:"retry_attempts"` // e.g. 2 retries
ParallelQuery bool `json:"parallel_query"` // Query all resolvers concurrently
}
Features:
- Per-server timeout with total query deadline
- Parallel or sequential failover query modes
- Retry logic for transient errors
- Health scoring of DNS servers for future prioritization
2. In-Process DNS Caching
Inspired by rs/dnscache, which wraps Go’s net.Resolver:
type CacheConfig struct {
TTL time.Duration `json:"ttl"` // Default: 5m
RefreshInterval time.Duration `json:"refresh_interval"` // Background refresh
MaxEntries int `json:"max_entries"` // Default: 1000
PersistentCache bool `json:"persistent_cache"` // Optional
BackgroundRefresh bool `json:"background_refresh"` // Optional
}
Features:
- TTL-based expiration
- Background refresh of entries to maintain freshness
- Stale result fallback on DNS outage
- Manual cache clearing (coder dns-cache --clear)
- Optional persistent cache across CLI sessions
3. Enhanced Diagnostics in coder netcheck
# New DNS-focused netcheck options
coder netcheck --dns-detailed
coder netcheck --dns-servers-only
coder netcheck --dns-timeout 3s
Diagnostic Output:
- Server-by-server resolution time and success/failure status
- Identification of dead/stale servers
- Parallel test to detect mixed-responsiveness issues
- DNS configuration tips and fix suggestions
4. ProxyCommand Integration
Since hosts matching .coder and coder. already go through the ProxyCommand, we have the option to add new parameters there to encourage failover or retries:
# New parameters:
dns:
timeout: 2s
cache_ttl: 5m
parallel_queries: true
max_retries: 3
ssh:
dns_resilience: true
connection_timeout: 30s
Features:
- Retry logic in the face of DNS resolution failure
- Parallel resolution at SSH connection start
- Graceful fallback to cached DNS data
- DNS-aware keep-alive and connection pooling
Alternatively, we could use coder config-ssh to add a HostName entry for sensitive hosts and avoid DNS lookups altogether.
Fixed IP addresses in the .ssh/config file could help, but they can also go stale, since load balancers often change IP addresses, hosts are dynamically provisioned, and so on.
Implementation Plan
Phase 1: Core DNS Enhancements
- Add configurable DNS timeout and retry logic
- Implement in-memory DNS caching
- Support DNS server failover and parallel queries
Phase 2: Enhanced Diagnostics
- Expand netcheck to report per-resolver behavior
- Add CLI flags for DNS tests and timeout controls
- Implement NXDOMAIN detection (if a query for the access URL or wildcard URL returns NXDOMAIN, tell the end user)
Phase 3: SSH Integration
- Add DNS retry and cache fallback to ProxyCommand
- Expose new DNS options in CLI and config file
- Support persistent DNS cache storage
- Check if HostName can be used for DNS avoidance (with %h and %p parameters)
Benefits
Immediate
- Reliable SSH in VPN environments with bad DNS
- Faster reconnections due to caching
- Clear error messages and troubleshooting guidance
Long-Term
- Infrastructure for DNS-over-HTTPS/TLS support
- Enhanced resilience in mixed-network setups
- Less user frustration and fewer support cases
Testing Strategy
- Simulate VPN with one working and one broken DNS server
- Measure SSH connection startup time with and without cache
- Verify fallback behavior when all DNS resolvers fail
- Benchmark success rate and latency with new parallel resolver
- Test CLI output clarity and usefulness under failure conditions
Backward Compatibility
- New features are opt-in via configuration flags
- Existing behavior remains default
- Graceful fallback to Go’s default resolver if advanced config fails
Related Issues and Logs
netcheck: [v1] measuring ICMP latency of xyz:us-west-2b (2): no address for node uswest2b
netcheck: netcheck.runProbe: named node "uswest2a" has no address
ssh: could not get canonical name
ssh: Did not find remote IP address (is SSH ProxyCommand disabled?)