Setting Up Fallback Tile Routing in Production

Deploying resilient tile delivery pipelines in production geospatial portals requires deterministic routing logic that gracefully degrades when primary cache layers or rendering backends experience partial failure. Tile servers operating at scale frequently encounter transient cache invalidations, mbtiles fragmentation, or GeoWebCache seed interruptions that manifest as intermittent 404 and 502 responses. Without explicit fallback routing, client-side map libraries trigger cascading request storms, degrading user experience and exhausting upstream compute. Establishing a robust fallback topology ensures that coordinate requests are seamlessly redirected to secondary renderers, pre-seeded archive stores, or simplified vector fallbacks while preserving strict security boundaries and audit compliance.

Architectural Positioning and Security Alignment

The routing layer must operate as a stateless, high-throughput reverse proxy positioned between client applications and the tile generation stack. When designing this topology, engineers should align proxy behavior with the broader Core Portal Architecture & Security Boundaries framework to ensure that fallback paths do not inadvertently bypass authentication gates, rate-limiting policies, or network segmentation controls. In practice, this means configuring upstream health checks that evaluate tile response codes rather than generic TCP connectivity, and implementing request normalization to handle both TMS and WMTS coordinate conventions before routing decisions are applied.

Proxy deployments must enforce identical ingress policies across all upstream tiers. If the primary tile cache resides behind an OAuth2/OIDC gateway, the fallback renderer must inherit the same token validation middleware. Network segmentation rules should explicitly permit east-west traffic between the proxy and secondary backends while blocking direct client access to fallback endpoints. This prevents shadow routing and maintains audit trails for compliance-driven geospatial platforms.

Deterministic Routing Logic and Error Stratification

Configuration of the fallback chain relies on explicit error-code differentiation and timeout stratification. Primary tile caches should return 404 Not Found for genuinely missing coordinates and 5xx series codes for transient rendering failures. The proxy must be instructed to intercept 5xx responses and route identical coordinate requests to a secondary upstream, while allowing 4xx responses to pass through to the client to prevent infinite retry loops against non-existent tiles.

Coordinate normalization is critical before routing evaluation. TMS uses a bottom-left origin, while WMTS and standard web map implementations use a top-left origin. Proxies should apply a consistent Y-axis inversion rule (y = 2^z - 1 - y) at the ingress layer to ensure that fallback backends receive standardized requests regardless of client library behavior. Health checks must validate tile integrity by requesting a known baseline coordinate (e.g., z/0/0/0) and verifying both HTTP status and Content-Type: image/png or application/vnd.mapbox-vector-tile.

The decision flow below summarizes the routing logic the proxy configurations implement: normalize coordinates, try the primary upstream, pass 4xx straight through, and divert only 5xx/timeouts to the fallback.

flowchart TB
    Req["Tile request z/x/y"] --> Norm["Normalize TMS / WMTS Y axis"]
    Norm --> P["Primary upstream"]
    P --> C{"Response?"}
    C -->|"2xx tile"| Done["Serve tile"]
    C -->|"4xx"| Pass["Pass 4xx to client, no retry"]
    C -->|"5xx or timeout"| F["Fallback upstream"]
    F --> Done

Implementation Patterns for Reverse Proxies

Production deployments typically standardize on Nginx, HAProxy, or Envoy. Each requires precise directive tuning to avoid compounding latency or masking upstream failures.

Nginx Configuration

Nginx handles fallback routing through upstream groups and error interception. The proxy_next_upstream directive must be scoped to specific failure conditions, and proxy_intercept_errors must be enabled to allow the proxy to evaluate backend status codes before forwarding them.

upstream tile_primary {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}

upstream tile_fallback {
    server 10.0.2.20:8080;
}

server {
    listen 443 ssl;
    server_name tiles.example.gov;

    location /tiles/ {
        proxy_pass http://tile_primary;
        proxy_intercept_errors on;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_connect_timeout 2s;
        proxy_read_timeout 5s;

        error_page 500 502 503 504 = @fallback;
    }

    location @fallback {
        proxy_pass http://tile_fallback;
        proxy_set_header X-Fallback-Active true;
    }
}

HAProxy Configuration

HAProxy leverages the backup keyword and option redispatch to enforce deterministic failover. The http-check directive ensures that routing decisions are based on application-layer tile responses rather than socket availability.

frontend ft_tiles
    bind *:443 ssl crt /etc/haproxy/certs/portal.pem
    default_backend bk_tile_primary

backend bk_tile_primary
    option httpchk GET /healthz
    http-check expect status 200
    option redispatch 1
    server primary-1 10.0.1.10:8080 check inter 3s fall 3 rise 2
    server primary-2 10.0.1.11:8080 check inter 3s fall 3 rise 2
    server fallback-1 10.0.2.20:8080 check inter 5s fall 5 rise 3 backup

Envoy Configuration

Envoy utilizes retry_policy blocks with explicit status code matching. Capping num_retries at one prevents compounding latency during partial outages. For detailed cluster-level orchestration patterns, refer to the Fallback Routing Strategies for Tile Servers documentation.

clusters:
  - name: tile_primary
    connect_timeout: 2s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: tile_primary
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 10.0.1.10, port_value: 8080 }
    health_checks:
      - timeout: 1s
        interval: 3s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check: { path: "/healthz" }

  - name: tile_fallback
    connect_timeout: 3s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: tile_fallback
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 10.0.2.20, port_value: 8080 }

route_config:
  virtual_hosts:
    - name: tile_route
      domains: ["*"]
      routes:
        - match: { prefix: "/tiles/" }
          route:
            cluster: tile_primary
            retry_policy:
              retry_on: "5xx,connect-failure,retriable-status-codes"
              retriable_status_codes: [500, 502, 503]
              num_retries: 1
              per_try_timeout: 4s

Header Propagation and Cache Consistency

Each routing tier must enforce identical Cache-Control and ETag headers to prevent downstream CDN layers from caching error responses or mismatched tile versions. When a fallback upstream serves a tile, it must preserve the original Vary: Accept-Encoding directive and inject a X-Cache-Status: FALLBACK header for observability without altering client caching behavior.

Misaligned headers frequently cause CDN edge nodes to cache 5xx responses as valid tiles, resulting in prolonged rendering gaps even after primary backends recover. Configure proxy response rewriting to strip backend-specific headers and standardize cache lifetimes:

proxy_hide_header X-Powered-By;
proxy_hide_header Server;
proxy_hide_header X-Backend-Id;
add_header Cache-Control "public, max-age=86400, stale-while-revalidate=3600" always;
add_header X-Proxy-Routing "primary|fallback" always;

Production Debugging and Edge Case Mitigation

Debugging fallback activation in production requires isolating proxy decision logic from backend rendering latency. Common edge cases include cache stampedes during simultaneous seed invalidations, where thousands of concurrent requests bypass the primary cache and overwhelm the fallback renderer. Mitigation strategies include implementing request coalescing at the proxy layer, enabling stale-while-revalidate on the CDN, and introducing exponential backoff with jitter on client-side tile fetchers.

To isolate routing failures, enable structured access logging that captures upstream response times, retry counts, and fallback activation flags. In Nginx, use $upstream_response_time and $proxy_add_x_forwarded_for to trace request paths. In HAProxy, configure log-format to output %{+Q}r (request line) and %{+Q}ST (status code). Envoy requires access_log configuration with grpc_status and response_flags fields to distinguish between UF (upstream failure) and RL (rate limit) events.

Monitor fallback activation rates using time-series metrics. A sustained fallback rate above 5% typically indicates primary cache degradation or seed pipeline failures rather than transient network issues. Correlate proxy logs with GeoWebCache seed job statuses and mbtiles file integrity checks (sqlite3 integrity_check) to identify root causes before they cascade into client-side map rendering failures.