Setting Up Fallback Tile Routing in Production

This guide walks through configuring a production reverse proxy so that tile requests degrade deterministically to a secondary backend when the primary cache or renderer fails, without breaking security boundaries.

It is the hands-on implementation companion to Fallback Routing Strategies for Tile Servers, and sits within the broader Core Portal Architecture & Security Boundaries framework that governs how routing, identity, and network segmentation interact across the portal. Where that parent page covers the three-tier strategy at a conceptual level, the steps below produce a working proxy_next_upstream chain on Nginx, a backup-server failover on HAProxy, and an equivalent Envoy retry_policy, each validated against real tile responses rather than raw socket health.

Tile servers at scale routinely encounter transient cache invalidations, MBTiles fragmentation, or GeoWebCache seed interruptions that surface as intermittent 404 and 502 responses. Without explicit fallback routing, client map libraries trigger cascading request storms that exhaust upstream compute. A correctly configured fallback topology redirects coordinate requests to a secondary renderer or a pre-seeded archive store while preserving authentication, rate limiting, and audit trails.

Prerequisites

Confirm the following before changing any routing configuration. Each item maps directly to a step below.

Proxy software: Nginx 1.21+ (for stale-while-revalidate support), HAProxy 2.4+, or Envoy 1.24+. Pick one; the steps cover all three.
Two reachable tile backends: a primary cache/renderer (for example a MapProxy or GeoServer tier) and a secondary fallback renderer on a separate host or pod. The examples use 10.0.1.10/10.0.1.11 (primary) and 10.0.2.20 (fallback).
A health endpoint on each backend that returns 200 only when the renderer can actually serve a baseline tile — not a generic TCP listener check.
Identity propagation in place. If the primary tier sits behind an OAuth2/OIDC gateway, the fallback tier must inherit the same token validation. Align this with Implementing RBAC for Multi-Tenant GIS Portals so failover paths never widen access scope.
Network policy that permits east-west traffic between the proxy and the fallback backend while blocking direct client ingress to the fallback endpoint, consistent with Security Boundary Mapping for OGC Services.
Access: permission to reload the proxy (nginx -s reload, systemctl reload haproxy, or an Envoy xDS/config push) and to read its access and error logs.
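
The health-endpoint requirement above can be made concrete with a small check. The following is a minimal sketch, assuming PNG tiles (as in the `.png` URLs used later in this guide); the names `tile_is_healthy` and `healthz_status` are hypothetical and not part of any tile server's API.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def tile_is_healthy(status: int, content_type: str, body: bytes) -> bool:
    """Return True only if the baseline tile response looks like a real PNG tile."""
    media_type = content_type.split(";")[0].strip().lower()
    return (
        status == 200
        and media_type == "image/png"
        and body.startswith(PNG_SIGNATURE)
        and len(body) > len(PNG_SIGNATURE)
    )


def healthz_status(status: int, content_type: str, body: bytes) -> int:
    """Map the baseline-tile probe to the status code /healthz should return."""
    return 200 if tile_is_healthy(status, content_type, body) else 503
```

A handler built this way renders or fetches one known tile per probe and reports 503 when the result is an error page, an empty body, or the wrong content type, which a TCP listener check would miss.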

The routing decision the configuration implements is: normalize coordinates, try the primary upstream, pass 4xx straight through to the client, and divert only 5xx and timeouts to the fallback.
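
As an illustration of that decision only (the proxies below implement it in configuration, not in application code), the classification can be sketched as a pure function. The `Action` enum and `decide` function are hypothetical names introduced for this example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    SERVE = "serve"                # return the primary response as-is
    PASS_THROUGH = "pass_through"  # client error: return it, never retry
    FALLBACK = "fallback"          # transient failure: divert to the secondary


def decide(primary_status: Optional[int]) -> Action:
    """Classify a primary-upstream outcome.

    None stands for a timeout or connection error, where no status was received.
    """
    if primary_status is None or 500 <= primary_status <= 599:
        return Action.FALLBACK
    if 400 <= primary_status <= 499:
        return Action.PASS_THROUGH
    return Action.SERVE
```

Keeping 4xx on the pass-through branch is the property that stops requests for non-existent tiles from reaching the fallback tier.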

Step 1 — Normalize coordinates and stratify error codes

Before any routing decision, standardize the request. TMS uses a bottom-left tile origin, while WMTS and standard web maps use a top-left origin. Apply a single Y-axis inversion rule at ingress so both the primary and fallback backends receive identical, canonical requests regardless of which client library issued them:

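# Map a TMS-origin Y to a WMTS/web-origin Y at zoom z: y_wmts = 2^z - 1 - y_tms
# Skeleton only: map{} selects values but cannot do arithmetic, so the inverted
# value has to come from a scripting module or a precomputed lookup.
# $tms_y is assumed to be captured from the request URI before proxy_pass.
map $tms_y $wmts_y {
    default $tms_y;          # already top-left, pass through
    # inverted values for /tms/ prefixed requests are populated here
}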
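
The inversion itself is a one-line formula. A minimal Python sketch of it follows; the function name `flip_y` is hypothetical, and the code only illustrates the arithmetic that the ingress layer has to apply.

```python
def flip_y(zoom: int, y: int) -> int:
    """Convert a tile row between bottom-left (TMS) and top-left (WMTS/XYZ) origins.

    At zoom z there are 2**z rows, so the flipped row is 2**z - 1 - y.
    The same function converts in both directions.
    """
    if zoom < 0:
        raise ValueError("zoom must be non-negative")
    rows = 1 << zoom
    if not 0 <= y < rows:
        raise ValueError(f"y={y} is outside 0..{rows - 1} at zoom {zoom}")
    return rows - 1 - y
```

For example, `flip_y(3, 2)` returns `5`, since zoom 3 has eight rows. Applying the function twice returns the original row, which makes a convenient sanity check for whatever mechanism performs the rewrite at the proxy.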

Equally important is error stratification at the source. The primary cache must return 404 Not Found for genuinely missing coordinates and 5xx codes only for transient rendering failures. This distinction is what lets the proxy retry recoverable errors while letting non-existent tiles fail fast — without it, the fallback tier is hammered by requests for tiles that will never exist at any backend.

Step 2 — Configure the fallback chain on your proxy

Pick the directive set matching your edge. Each scopes retries to specific failure conditions and caps retry depth at one so a partial outage cannot compound latency.

Nginx

Nginx routes fallbacks through upstream groups plus error interception. proxy_intercept_errors on lets the proxy evaluate backend status codes before forwarding them, and proxy_next_upstream restricts retries to real failures:

upstream tile_primary {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}

upstream tile_fallback {
    server 10.0.2.20:8080;
}

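server {
    listen 443 ssl;
    server_name tiles.example.gov;
    # ssl_certificate and ssl_certificate_key are also required; paths omitted here

    location /tiles/ {
        proxy_pass http://tile_primary;
        proxy_intercept_errors on;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;      # two tries in total, i.e. one retry within tile_primary
        proxy_connect_timeout 2s;
        proxy_read_timeout 5s;

        # Only 5xx is diverted; 4xx has no error_page entry and passes through unchanged
        error_page 500 502 503 504 = @fallback;
    }

    location @fallback {
        proxy_pass http://tile_fallback;
        proxy_set_header X-Fallback-Active true;     # request header, seen by the fallback backend
        add_header X-Fallback-Active true always;    # response header, seen by the client
    }
}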

HAProxy

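HAProxy uses the backup keyword and option redispatch for deterministic failover. A backup server receives traffic only once every non-backup server in the backend has been marked down, so the divert is driven by health checks rather than by individual 5xx responses. The http-check directive bases that health decision on application-layer tile health, not socket availability: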

frontend ft_tiles
    bind *:443 ssl crt /etc/haproxy/certs/portal.pem
    default_backend bk_tile_primary

backend bk_tile_primary
    option httpchk GET /healthz
    http-check expect status 200
    option redispatch 1
    server primary-1 10.0.1.10:8080 check inter 3s fall 3 rise 2
    server primary-2 10.0.1.11:8080 check inter 3s fall 3 rise 2
    server fallback-1 10.0.2.20:8080 check inter 5s fall 5 rise 3 backup

Envoy

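Envoy expresses the retry half of this logic with a retry_policy block and explicit status-code matching. Capping num_retries at one prevents compounding latency during partial outages. Note that a retry_policy retries against hosts in the route's own cluster: the route below references only tile_primary, so sending retries to tile_fallback additionally requires a mechanism such as an aggregate cluster or priority-based endpoints, which this example does not show: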

clusters:
  - name: tile_primary
    connect_timeout: 2s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: tile_primary
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 10.0.1.10, port_value: 8080 }
            - endpoint:
                address:
                  socket_address: { address: 10.0.1.11, port_value: 8080 }
    health_checks:
      - timeout: 1s
        interval: 3s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check: { path: "/healthz" }

  - name: tile_fallback
    connect_timeout: 3s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: tile_fallback
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 10.0.2.20, port_value: 8080 }

route_config:
  virtual_hosts:
    - name: tile_route
      domains: ["*"]
      routes:
        - match: { prefix: "/tiles/" }
          route:
            cluster: tile_primary
            retry_policy:
              retry_on: "5xx,connect-failure,retriable-status-codes"
              retriable_status_codes: [500, 502, 503]
              num_retries: 1
              per_try_timeout: 4s

Step 3 — Align headers and cache lifetimes across tiers

Every routing tier must emit identical Cache-Control and ETag headers, or downstream CDN nodes will cache error responses and mismatched tile versions. A frequent failure mode is an edge node caching a 5xx as if it were a valid tile, leaving rendering gaps long after the primary recovers. When the fallback serves a tile it must preserve the original Vary: Accept-Encoding directive and add an observability header without altering client caching behaviour:

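proxy_hide_header X-Powered-By;
proxy_hide_header Server;
proxy_hide_header X-Backend-Id;
# No "always" here: add_header then applies to successful and redirect responses only,
# so a 5xx never receives the long cache lifetime.
add_header Cache-Control "public, max-age=86400, stale-while-revalidate=3600";
# Use the literal value for the serving tier: "primary" in /tiles/, "fallback" in @fallback.
# add_header is inherited only by locations that define no add_header of their own.
add_header X-Proxy-Routing "primary" always;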

Step 4 — Enable structured access logging

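Capture upstream timing, retry counts, and the fallback flag so activation is traceable in production. In Nginx, log $upstream_response_time, $upstream_status, and $proxy_add_x_forwarded_for; when several upstream servers are contacted, $upstream_status lists their statuses separated by commas, with a colon marking the jump to another upstream group after an error_page redirect. In HAProxy, set a log-format that emits %{+Q}r (request line), %ST (status), %s (server name), and %rc (retry count). In Envoy, configure access_log with %RESPONSE_FLAGS% so you can distinguish UF (upstream connection failure) from RL (rate limited) events. This logging is what turns the checks in the Verification section from guesswork into evidence.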
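
To turn those log fields into a number, the following sketch computes a fallback rate from Nginx `$upstream_status` values. It relies on the documented Nginx behaviour that statuses from several servers are comma-separated and that a colon separates upstream groups after an internal redirect such as `error_page`. The function names are hypothetical, and the sketch assumes the field has already been extracted from each log line.

```python
from typing import Iterable, List


def parse_upstream_status(value: str) -> List[List[str]]:
    """Split an $upstream_status value into groups (":") of attempts (",")."""
    return [
        [status.strip() for status in group.split(",") if status.strip()]
        for group in value.split(":")
    ]


def used_fallback(value: str) -> bool:
    """True when a second upstream group was contacted for the request."""
    return len(parse_upstream_status(value)) > 1


def fallback_rate(values: Iterable[str]) -> float:
    """Fraction of requests that were diverted to another upstream group."""
    values = list(values)
    if not values:
        return 0.0
    return sum(used_fallback(v) for v in values) / len(values)
```

For instance, `parse_upstream_status("502, 502 : 200")` yields `[["502", "502"], ["200"]]`: two failed attempts in the primary group followed by a successful response from the fallback group. A rate computed this way over a time window is what the 5% threshold in the troubleshooting table refers to.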

Verification

Confirm the change took effect before shifting real traffic onto it.

Baseline tile loads from the primary. Request a known coordinate and verify the routing header is absent or marked primary:

```
curl -s -o /dev/null -D - \
  "https://tiles.example.gov/tiles/0/0/0.png" | grep -i 'x-proxy-routing\|x-fallback-active'
```

Force a primary failure and confirm divert. Stop the primary renderer (or block its port) and re-request the same tile; the response must still be 200 and now carry X-Fallback-Active: true:
```
curl -s -o /dev/null -D - \
  "https://tiles.example.gov/tiles/0/0/0.png" | grep -i -E '^HTTP/|x-fallback-active'
```
Confirm 4xx is not retried. Request a non-existent zoom/coordinate and verify it returns 404 quickly with no fallback header — proof that missing tiles fail fast instead of stampeding the fallback.
Check health-endpoint behaviour. Hit /healthz on each backend directly and confirm it returns 200 only when a baseline tile is actually renderable.
Verify cache headers match between a primary-served and fallback-served tile so the CDN treats them as interchangeable:
```
curl -sI "https://tiles.example.gov/tiles/3/4/2.png" | grep -i 'cache-control\|etag\|vary'
```
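
The header comparison in the last check can also be automated. The sketch below assumes the two sets of response headers have already been captured (for example from the curl output above) and loaded into plain dictionaries; `cache_header_diff` is a hypothetical helper name.

```python
from typing import Dict, Mapping, Optional, Tuple

CACHE_HEADERS = ("cache-control", "etag", "vary")


def cache_header_diff(
    primary: Mapping[str, str], fallback: Mapping[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return the cache-relevant headers whose values differ between the tiers.

    Header names are compared case-insensitively; a missing header is None.
    An empty result means a CDN can treat both responses as interchangeable.
    """
    def normalize(headers: Mapping[str, str]) -> Dict[str, str]:
        return {name.lower(): value.strip() for name, value in headers.items()}

    p, f = normalize(primary), normalize(fallback)
    return {
        name: (p.get(name), f.get(name))
        for name in CACHE_HEADERS
        if p.get(name) != f.get(name)
    }
```

Any non-empty result names the header to standardize before the fallback path goes live.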

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Tiles stuck loading after failover | Fallback returns a different `ETag`/`Content-Type` than the CDN cached from the primary, triggering validation loops | Standardize `ETag` and `Content-Type` across tiers; strip backend-specific headers in the proxy |
| Sustained fallback rate above 5% | Primary cache degradation or a failed seed pipeline, not transient network noise | Correlate with GeoWebCache seed jobs and run `sqlite3 file.mbtiles "PRAGMA integrity_check"` |
| Infinite retries / request storms | `4xx` responses being retried, or `proxy_next_upstream_tries` unset | Restrict retries to `5xx`/timeout, cap `num_retries` at 1, and set `proxy_next_upstream_tries` to 2 (one retry) |
| Fallback overwhelmed during seed invalidation | Cache stampede: many concurrent misses bypass the primary at once | Enable request coalescing at the proxy, `stale-while-revalidate` on the CDN, and client-side backoff with jitter |
| CDN serving stale errors after recovery | Edge node cached a `5xx` as a valid object | Set `proxy_intercept_errors on` and never add long `Cache-Control` to error responses |
| Fallback path bypasses auth | Fallback upstream not behind the same OIDC/token middleware | Apply identical ingress policy to both tiers per Securing MapProxy with Nginx and ModSecurity |
| Connection limits hit under failover load | Backend database/pool saturated when the fallback renders on demand | Tune pool sizing per Optimizing PostgreSQL/PostGIS Connection Limits |
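
The "client-side backoff with jitter" fix in the table can be sketched as follows. This is one common variant (exponential growth with full jitter), not a scheme prescribed by any particular map library; the function name and the default base and cap values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay in seconds before retry number `attempt` (zero-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling] so that clients do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Randomizing the whole interval, rather than adding a small offset to a fixed delay, spreads retries from many clients across time, which is what relieves the fallback tier during a seed invalidation.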

Once these checks pass in a passive monitoring window, shift traffic gradually using weighted routing or a canary before declaring the fallback path production-ready.
