Bug: Enhanced DNS Resilience and Diagnostics for Coder SSH ProxyCommand #18616
Open
@bjornrobertsson

Description

Summary

Coder users are experiencing intermittent and prolonged failures during SSH connections, especially in VPN environments where DNS servers are partially or intermittently unresponsive. These issues are directly tied to the behavior of Go’s standard resolver, which lacks in-process caching and has limited timeout and failover behavior.

This issue proposes more resilient DNS resolution, in-memory caching, enhanced diagnostics via coder netcheck, and tighter integration with Coder's SSH ProxyCommand to mitigate these failures and improve the user experience.


Problem Statement

Current Pain Points

  • SSH ProxyCommand fails with errors like:
    • could not get canonical name
    • Did not find remote IP address (is SSH ProxyCommand disabled?)
  • coder netcheck logs: "no address for node" or timeouts
  • VPN environments inject multiple DNS servers, some of which are non-responsive
  • Go’s net.Resolver has no DNS cache, so every connection repeats the same failed lookups
  • DNS queries (e.g., gethostbyname) can hang without enforced timeouts
  • Lack of CLI tooling for diagnosing DNS-related failures

Root Cause Analysis

  1. Unresponsive or Hanging DNS Servers
    VPN clients often assign multiple DNS servers, but not all of them are reachable or fast.
    The operating system's default DNS handling can then lead to long timeouts.
    If each lookup is effectively answered by a single server from the list, the success rate is A/B, where A is the number of healthy servers and B is the total number of DNS servers in resolv.conf (at least one, at most three).
    The failure rate is therefore 100% when the only server is bad, 1/2 for one bad server out of two, and 1/3 or 2/3 for one or two bad servers out of three.
    Because servers may be picked round-robin, individual end users see varying success rates, but they typically match the scenario above.

  2. Go DNS Resolver Limitations

  3. Lack of Observability & Failover

    • No health scoring or metrics for DNS servers
    • No fallback to alternate resolvers on failure

  4. Error Visibility

    • Error output from coder netcheck does not surface DNS issues clearly
    • No user-facing tools to diagnose DNS resolution failures in SSH flows
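
The failure-rate arithmetic in item 1 of the root cause analysis can be checked with a small sketch. It assumes a simplified model in which each lookup goes to exactly one configured server, chosen uniformly, with no retry against another server. `failureRate` is a hypothetical helper for illustration, not part of Coder.

```go
package main

import "fmt"

// failureRate returns the expected fraction of failed lookups when each
// lookup goes to one of `total` configured servers, chosen uniformly, and
// only `healthy` of them answer. Illustrative model only (assumption).
func failureRate(healthy, total int) float64 {
	if total <= 0 || healthy < 0 || healthy > total {
		return 1 // treat invalid input as "nothing resolves"
	}
	return float64(total-healthy) / float64(total)
}

func main() {
	// resolv.conf honors at most three nameserver entries.
	for total := 1; total <= 3; total++ {
		for healthy := 0; healthy < total; healthy++ {
			fmt.Printf("%d of %d servers bad: %.0f%% of lookups fail\n",
				total-healthy, total, 100*failureRate(healthy, total))
		}
	}
}
```

Real resolvers usually retry against the next server after a timeout, so in practice a bad server tends to show up as added latency as well as outright failure.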

Proposed Solution

1. Enhanced DNS Resolution with Resilience

type DNSConfig struct {
    ServerTimeout     time.Duration `json:"server_timeout"`     // e.g. 2s per DNS server
    TotalTimeout      time.Duration `json:"total_timeout"`      // e.g. 10s total
    RetryAttempts     int           `json:"retry_attempts"`     // e.g. 2 retries
    ParallelQuery     bool          `json:"parallel_query"`     // Query all resolvers concurrently
}

Features:

  • Per-server timeout with total query deadline
  • Parallel or sequential failover query modes
  • Retry logic for transient errors
  • Health scoring of DNS servers for future prioritization

2. In-Process DNS Caching

Inspired by rs/dnscache, which wraps Go’s net.Resolver:

type CacheConfig struct {
    TTL                time.Duration `json:"ttl"`                // Default: 5m
    RefreshInterval    time.Duration `json:"refresh_interval"`   // Background refresh
    MaxEntries         int           `json:"max_entries"`        // Default: 1000
    PersistentCache    bool          `json:"persistent_cache"`   // Optional
    BackgroundRefresh  bool          `json:"background_refresh"` // Optional
}

Features:

  • TTL-based expiration
  • Background refresh of entries to maintain freshness
  • Stale result fallback on DNS outage
  • Manual cache clearing (coder dns-cache --clear)
  • Optional persistent cache across CLI sessions

3. Enhanced Diagnostics in coder netcheck

# New DNS-focused netcheck options
coder netcheck --dns-detailed
coder netcheck --dns-servers-only
coder netcheck --dns-timeout 3s

Diagnostic Output:

  • Server-by-server resolution time and success/failure status
  • Identification of dead/stale servers
  • Parallel test to detect mixed-responsiveness issues
  • DNS configuration tips and fix suggestions

4. ProxyCommand Integration

Since the ProxyCommand already applies to hosts matching .coder and coder., we have the option to pass new parameters there to enable failover or retries:

# New parameters:
dns:
  timeout: 2s
  cache_ttl: 5m
  parallel_queries: true
  max_retries: 3

ssh:
  dns_resilience: true
  connection_timeout: 30s

Features:

  • Retry logic in the face of DNS resolution failure
  • Parallel resolution at SSH connection start
  • Graceful fallback to cached DNS data
  • DNS-aware keep-alive and connection pooling

Alternatively, 'coder config-ssh' could add a HostName entry for sensitive hosts to avoid DNS lookups.
Fixed IP addresses in the .ssh/config file could help, but they can go stale when load balancers change IP addresses, with dynamic hosting, and so on.


Implementation Plan

Phase 1: Core DNS Enhancements

  • Add configurable DNS timeout and retry logic
  • Implement in-memory DNS caching
  • Support DNS server failover and parallel queries

Phase 2: Enhanced Diagnostics

  • Expand netcheck to report per-resolver behavior
  • Add CLI flags for DNS tests and timeout controls
  • Implement NXDOMAIN detection (if a query for the access URL or wildcard URL returns NXDOMAIN, tell the end user)

Phase 3: SSH Integration

  • Add DNS retry and cache fallback to ProxyCommand
  • Expose new DNS options in CLI and config file
  • Support persistent DNS cache storage
  • Check if HostName can be used for DNS avoidance (with %h and %p parameters)

Benefits

Immediate

  • Reliable SSH in VPN environments with bad DNS
  • Faster reconnections due to caching
  • Clear error messages and troubleshooting guidance

Long-Term

  • Infrastructure for DNS-over-HTTPS/TLS support
  • Enhanced resilience in mixed-network setups
  • Less user frustration and fewer support cases

Testing Strategy

  • Simulate VPN with one working and one broken DNS server
  • Measure SSH connection startup time with and without cache
  • Verify fallback behavior when all DNS resolvers fail
  • Benchmark success rate and latency with new parallel resolver
  • Test CLI output clarity and usefulness under failure conditions

Backward Compatibility

  • New features are opt-in via configuration flags
  • Existing behavior remains default
  • Graceful fallback to Go’s default resolver if advanced config fails

Related Issues and Logs

netcheck: [v1] measuring ICMP latency of xyz:us-west-2b (2): no address for node uswest2b
netcheck: netcheck.runProbe: named node "uswest2a" has no address
ssh: could not get canonical name
ssh: Did not find remote IP address (is SSH ProxyCommand disabled?)

Metadata

    Labels

    • customer-reported: Bugs reported by enterprise customers. Only humans may set this.
    • customer-requested: Features requested by enterprise customers. Only humans may set this.
    • s4: Internal bugs (e.g. test flakes), extreme edge cases, and bug risks
