Fallback Routing Strategies for Tile Servers

Fallback routing keeps map tiles flowing when a primary renderer, vector tile generator, or upstream raster store fails, by steering each request to the healthiest available backend instead of returning a broken map. Without it, a single degraded origin produces grey checkerboards, infinite loading spinners, and cascading timeouts that take an entire geospatial portal offline — affecting public dashboards, agency field crews, and any downstream OGC client that consumes the WMTS or XYZ endpoint. This guide sits inside the Core Portal Architecture & Security Boundaries framework and treats tile routing as a first-class resilience boundary: a deliberate decision layer that platform engineers, GIS administrators, and government technical teams must design, version, and test exactly like any other production control.

Tile routing is inherently dynamic. The routing layer continuously evaluates request latency, cache hit ratios, and upstream health probes before directing client traffic, and it must do so without ever weakening the authentication and tenant-isolation guarantees defined elsewhere in this framework. Whether the data plane is built on MapProxy, TileServer-GL, or a custom Nginx tile proxy, the routing logic has to be declarative, version-controlled, and reproducible across staging and production. Teams still choosing between an integrated portal and a dedicated caching tier should first read the GeoNode vs MapProxy Architecture Comparison, because the component split determines how much routing complexity each tier inherits.

The three-tier hierarchy below shows how the routing layer degrades gracefully as each upstream tier becomes unhealthy or saturated.

Where Fallback Routing Lives in the Stack

Fallback routing is not a property of any single tile server — it lives in the request path between the client and the rendering backends, typically inside the reverse proxy or an edge CDN that fronts the data plane. This placement matters because the routing layer is the only component with a complete view of every backend’s health, and it is also the layer that terminates TLS and carries the caller’s identity. Co-locating routing with the reverse proxy configuration for WMS/WFS means failover decisions, rate limits, and header propagation are enforced in one auditable place rather than scattered across application code.

A production-grade strategy implements a three-tier hierarchy:

Primary tier serves pre-rendered tiles from a distributed object store or edge cache (CloudFront, Fastly, or a local Varnish cluster). This tier handles the overwhelming majority of traffic at single-digit-millisecond latency.
Secondary tier is a lightweight rendering service that generates tiles on the fly from vector data or raster mosaics. The routing layer promotes traffic here when cache miss rates cross a threshold or the primary origin returns consecutive HTTP 5xx responses.
Tertiary tier is a cold-storage or on-demand renderer that activates only when the secondary tier saturates, often serving simplified symbology to preserve throughput under extreme load.

The key architectural decision is that fallback is directional and stateful: once a tier is marked unhealthy by the circuit breaker, traffic does not return to it until probes confirm recovery, preventing the request storm that occurs when a half-recovered backend is immediately flooded again.
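The directional, stateful behaviour described above can be sketched as a small state machine. This is a minimal illustration, not the implementation of any particular proxy: the `Backend` class, the `pick_tier` helper, and the default `fall`/`rise` values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One tier with a hypothetical fall/rise circuit breaker."""
    name: str
    fall: int = 3         # consecutive failed probes before ejection (assumed value)
    rise: int = 2         # consecutive good probes before readmission (assumed value)
    healthy: bool = True
    streak: int = 0       # current run of probe results that contradict the state

    def record_probe(self, ok: bool) -> None:
        # A result that agrees with the current state resets the streak.
        if ok == self.healthy:
            self.streak = 0
            return
        # Only an unbroken run of contrary results flips the state.
        self.streak += 1
        threshold = self.rise if ok else self.fall
        if self.streak >= threshold:
            self.healthy = ok
            self.streak = 0


def pick_tier(tiers: list[Backend]) -> Backend:
    """Return the first healthy tier in priority order, else the last tier."""
    for tier in tiers:
        if tier.healthy:
            return tier
    return tiers[-1]
```

Because readmission needs `rise` consecutive good probes, a half-recovered primary that answers one probe and then fails again never receives live traffic.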

Health Probing and the Circuit-Breaker Model

The routing layer’s correctness depends entirely on how accurately it measures backend health. A naive TCP connect check will happily pass a renderer that accepts connections but returns corrupt PNGs, so probes must validate the application layer: a real tile request against a known coordinate, asserting both the HTTP status and the Content-Type. The circuit breaker then tracks consecutive failures per backend and trips when the fall threshold is reached, redispatching to the next tier. The Nginx configuration below implements the passive form of this model: max_fails and fail_timeout eject a peer after repeated failed client requests, and the backup renderer takes over once every cache peer is marked down.

# /etc/nginx/conf.d/tile-fallback.conf
# Primary edge cache, with on-the-fly renderer as an explicit backup upstream.
upstream tile_primary {
    server cache-edge-1.internal:8080 max_fails=3 fail_timeout=15s;
    server cache-edge-2.internal:8080 max_fails=3 fail_timeout=15s;
    # backup is only used when every non-backup peer is marked down
    server render-secondary.internal:8081 backup;
}

server {
    listen 443 ssl;
    server_name tiles.example.gov;

    location ~ ^/wmts/(?<layer>[^/]+)/ {
        proxy_pass http://tile_primary;

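        # treat upstream errors and timeouts as a trigger to try the next peer
        proxy_next_upstream error timeout http_502 http_503 http_504;
        # three tries: both cache peers, then the backup renderer, within one request
        proxy_next_upstream_tries 3;
        # must cover two 2s connect timeouts and still leave room for the third try
        proxy_next_upstream_timeout 6s;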

        # short connect timeout so a dead primary fails over fast, not after 60s
        proxy_connect_timeout 2s;
        proxy_read_timeout    8s;

        # preserve the caller identity across the fallback hop (see auth section)
        proxy_set_header Authorization $http_authorization;
        proxy_set_header X-Tenant-Id  $tenant_id;
    }
}

For richer behaviour, such as weighted tiers, active health checks, and adaptive thresholds, many teams place HAProxy in front of the renderers. The patterns in Configuring HAProxy for WMS Load Balancing translate directly to tile traffic: option redispatch lets a request move to another server when its first choice fails, while the per-server fall and rise counters govern how quickly a backend is ejected and readmitted. Active probes should always be defined with option httpchk and http-check expect rules that assert a valid tile response rather than a bare open port.
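What "assert a valid tile response" means can be made concrete with a short sketch. The function names and the default timeout are hypothetical. The PNG signature check catches only grossly corrupt bodies such as HTML error pages served with a 200 status. It does not detect rendering defects.

```python
import urllib.request

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte PNG file signature


def tile_probe_ok(status: int, content_type: str, body: bytes) -> bool:
    """Application-layer check: status, media type, and PNG signature."""
    if status != 200:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "image/png":
        return False
    return body.startswith(PNG_MAGIC)


def probe(url: str, timeout: float = 2.0) -> bool:
    """Fetch one known tile and validate it; any network error counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            head = resp.read(len(PNG_MAGIC))
            return tile_probe_ok(resp.status, resp.headers.get("Content-Type", ""), head)
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False
```

A probe would be pointed at a tile known to exist on every tier, so that a failure reflects backend health rather than missing data.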

Security and Tenant Isolation During Failover

The most dangerous failure mode is not an outage — it is a fallback path that silently bypasses access control. When the routing layer reroutes a request, it must carry the same authentication context, security headers, and tenant scoping that the primary path enforced. A fallback route that strips the Authorization header, drops the tenant identifier, or exposes an internal render endpoint to public ingress turns a resilience feature into a data-leak vector, particularly in multi-tenant deployments where one tenant’s cache must never serve another tenant’s tiles.

Three rules keep failover safe:

Identity propagates across every hop. The tenant token and any scoped API key are re-injected on each upstream, as shown by the proxy_set_header directives above, so a tier-2 renderer applies the same dataset visibility rules as tier 1.
Cache keys include the tenant. Pre-rendered tiles are partitioned by tenant in the cache key ($tenant_id$uri), so a fallback miss can never resolve to a sibling tenant’s cached object.
Internal renderers stay private. Secondary and tertiary backends bind to the internal network only and accept requests exclusively from the routing layer, never from public DNS.
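The first two rules can be expressed as a pair of small functions that a test suite could run against every tier. This is a sketch under stated assumptions: `upstream_headers` and `cache_key` are hypothetical helpers, and the `|` delimiter in the key is an illustrative choice rather than something the configuration above prescribes.

```python
from typing import Optional


def upstream_headers(incoming: dict[str, str], tenant_id: str) -> dict[str, str]:
    """Headers sent to any tier; identical for the primary and every fallback hop."""
    if not tenant_id:
        raise PermissionError("no verified tenant; refusing to route")
    # The tenant header is set by the edge from the verified token,
    # never copied from a client-supplied X-Tenant-Id.
    headers = {"X-Tenant-Id": tenant_id}
    auth: Optional[str] = next(
        (v for k, v in incoming.items() if k.lower() == "authorization"), None
    )
    if auth is not None:
        headers["Authorization"] = auth
    return headers


def cache_key(tenant_id: str, uri: str) -> str:
    """Tenant-partitioned key: the same URI maps to a different object per tenant."""
    return f"{tenant_id}|{uri}"
```

Calling the same function for tier 1 and tier 2 is the point of the design: there is no separate code path in which a fallback hop could omit the identity.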

The role-based policies that govern which layers a caller may see are defined once and inherited by every tier; see Implementing RBAC for Multi-Tenant GIS Portals for how those roles are modelled, and Security Boundary Mapping for OGC Services for how the surrounding WMTS/WMS surface is hardened. Health probes must follow the application-layer validation patterns documented in the official Nginx HTTP health-check guide.

Declarative Routing Configuration

Routing thresholds, circuit-breaker intervals, and fallback chains must be codified, never hand-tuned on a live box. MapProxy has no dedicated failover directive: the sources of a cache are merged in list order, from bottom to top, and on_error decides what a failing tile source contributes. A fallback chain is therefore approximated by placing the live renderer at the bottom and the edge cache on top, so that an edge error returned as a transparent, uncached tile lets the live render show through. The trade-off is that MapProxy requests both sources on every cache miss.

# mapproxy.yaml: approximate fallback chain for a single cached layer.
sources:
  primary_cache:
    type: tile
    # the tile matrix set is fixed in the path (assumed here to match the grid name);
    # the URL template only substitutes z/x/y
    url: https://edge-cache.internal/wmts/webmercator/%(z)d/%(x)d/%(y)d.png
    grid: webmercator
    # on a 5xx, contribute a transparent, uncached tile so the source beneath shows through
    on_error:
      502: { response: transparent, cache: false }
      503: { response: transparent, cache: false }
      504: { response: transparent, cache: false }

  live_render:
    type: wms
    req:
      url: http://render-secondary.internal:8081/wms
      layers: basemap
    # cap how long we wait before declaring this tier saturated
    http:
      client_timeout: 8

caches:
  basemap_cache:
    grids: [webmercator]
    # sources are merged left (bottom) to right (top): live renderer underneath, edge tiles on top
    sources: [live_render, primary_cache]
    cache:
      type: s3
      bucket_name: tiles-primary
      directory: basemap/

grids:
  webmercator:
    srs: 'EPSG:3857'
    origin: nw

Always validate the chain before promotion with mapproxy-util check-config -f mapproxy.yaml, which catches invalid source references and unreachable cache definitions at build time rather than at 3 a.m. The same declarative discipline applies when the renderer itself is the unit of redundancy: running multiple replicas behind the router, as described in Containerizing TileServer-GL for High Availability, turns a single point of failure into a pool the circuit breaker can drain and refill safely.
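A pipeline can add a lightweight pre-check of its own alongside the MapProxy tooling. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example with a YAML library, which is not part of the standard library). The helper name `dangling_sources` is hypothetical, and the function checks only that every source a cache references is actually defined.

```python
def dangling_sources(config: dict) -> list[tuple[str, str]]:
    """Return (cache_name, source_name) pairs that reference an undefined source."""
    # A cache may use entries from both the sources and the caches sections.
    defined = set(config.get("sources", {})) | set(config.get("caches", {}))
    missing = []
    for cache_name, cache in config.get("caches", {}).items():
        for source in cache.get("sources", []):
            # Tagged names such as "wms_source:layer1,layer2" refer to "wms_source".
            base_name = source.split(":", 1)[0]
            if base_name not in defined:
                missing.append((cache_name, source))
    return missing
```

An empty result is a necessary condition for a valid chain, not a sufficient one, so this complements the configuration check rather than replacing it.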

Authentication and API Boundary Enforcement

Because the routing layer is where TLS terminates, it is also the correct place to validate credentials before any spatial work is dispatched. A scoped, short-lived token is verified once at the edge, the tenant claim is extracted into a variable, and that variable becomes both the cache-partition key and an upstream header. Renderers downstream trust the header only because it arrives over the internal network from the authenticated edge — they never accept it from public ingress.

# Map a verified JWT claim to an internal tenant header used for cache keys.
# (auth_jwt provided by the Nginx Plus / njs JWT module or an auth_request subrequest)
map $jwt_claim_tenant $tenant_id {
    default        "";
    "~^[a-z0-9-]+$" $jwt_claim_tenant;   # allow-list safe tenant slugs only
}

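# Reject any request whose token did not resolve to a known tenant. The check sits
# in the tile location itself; a named location such as @verify_tenant would only
# run if something explicitly redirected to it.
location ~ ^/wmts/(?<layer>[^/]+)/ {
    if ($tenant_id = "") { return 403; }

    # Tenant-partitioned cache key: a fallback miss can never cross tenants.
    # The "|" delimiter keeps the tenant slug from running into the next field.
    proxy_cache_key "$tenant_id|$scheme$request_method$host$uri$is_args$args";

    # proxy_pass and the fallback directives from tile-fallback.conf go here
}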

This boundary holds even during failover: every tier receives X-Tenant-Id and a forwarded Authorization header, so a request that drops from the edge cache to the live renderer is authorized identically. For multi-tenant role modelling and the agency-team workflows that issue these tokens, follow Implementing RBAC for Multi-Tenant GIS Portals.
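The allow-list applied by the map block can be mirrored in a unit test, so the accepted slug format is pinned down in one place. This is a minimal sketch: `resolve_tenant` is a hypothetical helper, and the claim name `tenant` is inferred from the `$jwt_claim_tenant` variable used above.

```python
import re
from typing import Optional

# Same character class as the map block: lowercase letters, digits, and hyphens.
TENANT_SLUG = re.compile(r"[a-z0-9-]+")


def resolve_tenant(claims: dict) -> Optional[str]:
    """Return the tenant slug from verified token claims, or None to signal a 403."""
    tenant = claims.get("tenant")
    if isinstance(tenant, str) and TENANT_SLUG.fullmatch(tenant):
        return tenant
    return None
```

Anything outside the allow-list resolves to no tenant at all, which the edge treats as a rejected request rather than falling back to a shared or default partition.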

CI/CD Integration and Drift Detection

Fallback routing is only trustworthy if its configuration in production matches what the repository declares. Treat proxy templates, MapProxy YAML, and circuit-breaker thresholds as immutable artifacts promoted through a pipeline, with a phased rollout that prioritizes observability over immediate traffic shifting:

Validate — run nginx -t and mapproxy-util check-config as required pipeline gates; a syntactically invalid fallback chain must never reach a runner.
Shadow — deploy fallback routes in passive mode first, logging simulated failover events without serving real client requests, so baseline circuit-breaker behaviour is observed before any traffic shifts.
Canary — shift a weighted slice of traffic to the new routing config, watching failover activation flags and 5xx rates before full promotion.
Detect drift — reconcile the live config against the committed manifest on a schedule and alert on divergence.
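The drift-detection step can be sketched as a digest comparison between the committed manifest and what is actually deployed. How the live files are collected is deployment-specific and left out. The function names are hypothetical, and both inputs are assumed to be dictionaries mapping a config path to its text.

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 of the config text, ignoring trailing whitespace and trailing blank lines."""
    normalized = "\n".join(line.rstrip() for line in text.splitlines()).rstrip("\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_drift(committed: dict[str, str], live: dict[str, str]) -> list[str]:
    """Return the sorted paths that are missing on either side or differ in content."""
    drifted = []
    for path in sorted(set(committed) | set(live)):
        if path not in committed or path not in live:
            drifted.append(path)
        elif digest(committed[path]) != digest(live[path]):
            drifted.append(path)
    return drifted
```

A scheduled job would alert whenever the returned list is non-empty, which covers hand edits on a live host as well as files that exist in production but not in the repository.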

Keeping staging and production routing identical is itself an architectural control; the parity patterns in Environment Parity in Geospatial CI Pipelines and the Terraform-driven workflow in Syncing GeoNode Environments with Terraform ensure a fallback path verified in staging behaves the same in production. For the full deployment runbook — circuit-breaker tuning, retry-limit calibration, and traffic-weight configuration — work through Setting Up Fallback Tile Routing in Production. Fallback chain and retry syntax should follow the MapProxy configuration reference.

Operational Troubleshooting

When failover misbehaves, structured access logs that capture $upstream_response_time, $upstream_status, retry counts, and a fallback-activation flag are the difference between a five-minute fix and a multi-hour incident. The matrix below maps the symptoms most teams encounter to their usual cause and resolution.

Tiles stuck loading after failover fires: the fallback backend returns a different ETag or Content-Type than the CDN cached from the primary, triggering revalidation loops. Normalize headers across tiers and assert Content-Type: image/png in the health probe.
Intermittent grey tiles under load: the secondary renderer is saturating but the circuit-breaker fall threshold is too high, so traffic keeps hitting a dying backend. Lower max_fails (Nginx) or fall (HAProxy) and add the tertiary tier as a backup peer.
Failover never recovers to primary: fail_timeout is too long, the rise count is too high, or active probes are disabled, so the recovered primary is never readmitted. Shorten fail_timeout, lower rise, and enable active probes, for example HAProxy’s http-check expect status 200.
One tenant sees another tenant’s basemap: the cache key omits $tenant_id, so a fallback miss resolved to a shared object. Repartition the cache key and purge.
All traffic collapses to the cold tier: an expired upstream TLS certificate makes both primary and secondary fail the probe. Check the proxy error_log for SSL_do_handshake failures and rotate the certificate.
Latency spikes only on failover: proxy_connect_timeout is too long, so each request waits on the dead primary before retrying. Shorten the connect timeout to 1–2s so the breaker trips fast.
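The fallback-activation flag mentioned at the start of this section can be derived from the logged `$upstream_status` value. When Nginx contacts several servers for one request, it records their statuses separated by commas, and by colons after an internal redirect between server groups. The sketch below treats both separators as additional attempts, which is an assumption of this example, and the function names are hypothetical.

```python
import re


def upstream_attempts(upstream_status: str) -> list[str]:
    """Split a logged $upstream_status value into one status per contacted server."""
    return [s for s in re.split(r"\s*[,:]\s*", upstream_status.strip()) if s]


def failover_fired(upstream_status: str) -> bool:
    """True when more than one upstream was contacted for a single request."""
    return len(upstream_attempts(upstream_status)) > 1
```

Counting the requests where `failover_fired` is true, grouped by the first status in the list, shows quickly whether failover is being driven by timeouts or by error responses.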

Final deployment validation should include chaos exercises that simulate upstream degradation, cache corruption, and network partitioning against the staging routing config. By codifying fallback routing as configuration-as-code, enforcing strict security inheritance across every tier, and continuously validating failover thresholds, platform teams convert reactive incident response into proactive resilience engineering — keeping the portal serviceable under load conditions no one predicted.
