Containerizing TileServer GL for High Availability

Running TileServer GL from a single container is a common cause of avoidable tile outages: one node restart, one corrupt MBTiles mount, or one traffic spike that saturates the renderer, and the entire basemap goes dark. For government mapping teams, open-source portal maintainers, and platform engineers, that single point of failure is unacceptable when the tile service backs interactive web GIS frontends, embedded dashboards, and downstream OGC clients. This page sits within Infrastructure Orchestration & Configuration Management and shows how to convert a monolithic TileServer GL instance into a stateless, horizontally scalable fleet with read-only assets, a shared cache tier, and automated lifecycle management, so that any pod can die and be replaced without configuration drift or data loss.

The high-availability topology keeps the rendering tier stateless: assets are mounted read-only, a shared cache absorbs spikes, and the fleet scales horizontally behind the proxy.

Architectural Placement: Where TileServer GL Sits in the Stack

TileServer GL is a stateless rendering concern. It lives behind the ingress and in front of the spatial data tier, consuming pre-built styles, fonts, sprites, and tilesets while never owning durable state itself. In a portal architecture, the same reverse proxy that fronts your OGC endpoints — configured per Reverse Proxy Configuration for WMS/WFS — also fronts the tile fleet, normalizing /styles/, /data/, and /fonts/ routes and absorbing repeated requests at the cache layer before they ever reach a renderer.

Placing the renderer in the stateless tier has three consequences that shape every decision below. First, scaling is purely additive: more pods means more rendering throughput, with no leader election or quorum to coordinate. Second, lifecycle events become routine — a SIGTERM during a rolling update should be a non-event because no pod holds unique state. Third, all configuration must arrive from outside the container at startup, which is what makes immutable images and externalized assets non-negotiable. When dynamic feature rendering or on-the-fly vectorization is required, the renderer reaches across the trust boundary into the spatial database tier, which is deliberately kept stateful and isolated — see Kubernetes StatefulSets for PostGIS Databases for the pod-identity and persistent-volume guarantees that backing tier depends on.

Stateless Container Architecture and Resource Isolation

The foundation of a highly available TileServer GL deployment is treating the application container as strictly stateless. All style definitions, glyph (.pbf) font resources, sprite sheets, raster overlays, and vector .mbtiles datasets must be externalized from the container filesystem. In practice that means mounting read-only volumes backed by distributed object storage or a network filesystem, so any instance can be terminated and replaced without configuration drift.

Resource isolation matters because TileServer GL runs on Node.js and rasterizes through a native GL renderer, so its memory use is heavy and bursty under concurrent rendering, and part of it sits outside the V8 heap. CPU requests should be calibrated to expected concurrent render load, and memory limits set high enough to absorb tile-generation overhead without triggering OOM kills mid-render. Two safeguards keep a single bad request from cascading across the fleet:

A liveness probe on the /health endpoint so the orchestrator restarts a wedged renderer.
A startup probe that waits for font and style assets to finish mounting, preventing readiness from flapping during rolling updates. The official Kubernetes probe configuration guidelines describe the startup/liveness/readiness ordering this relies on.

# tileserver-deployment.yaml — stateless renderer with read-only assets
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tileserver-gl
  namespace: geo-tiles
spec:
  replicas: 3
  selector:
    matchLabels: { app: tileserver-gl }
  template:
    metadata:
      labels: { app: tileserver-gl }
    spec:
      terminationGracePeriodSeconds: 90   # must exceed p99 render latency
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: tileserver
          image: registry.example.org/tileserver-gl@sha256:<pinned-digest>
          args: ["--config", "/config/config.json", "--public_url", "https://tiles.example.org/"]
          ports:
            - { containerPort: 8080, name: http }
          resources:
            requests: { cpu: "500m", memory: "512Mi" }
            limits:   { cpu: "2",    memory: "1536Mi" }   # headroom for render bursts (heap plus native memory)
          startupProbe:                 # block traffic until styles/fonts mount
            httpGet: { path: /health, port: http }
            failureThreshold: 30
            periodSeconds: 2
          livenessProbe:
            httpGet: { path: /health, port: http }
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet: { path: /styles/basemap/style.json, port: http }
            periodSeconds: 5
          volumeMounts:
            - { name: assets, mountPath: /data, readOnly: true }   # MBTiles, fonts, sprites
            - { name: config, mountPath: /config, readOnly: true } # style JSON via ConfigMap
      volumes:
        - name: assets
          persistentVolumeClaim: { claimName: tile-assets-ro }
        - name: config
          configMap: { name: tileserver-config }

The readiness probe deliberately targets a real style.json rather than only /health: a pod can be process-alive yet unable to compile its style because a font glyph range failed to mount, and routing tile traffic to it would surface as blank or 500-returning tiles to the client.

Data Isolation and Read-Only Asset Boundaries

High availability depends on a clean separation between the immutable assets a renderer reads and the dynamic data it queries. Treat every artifact the renderer consumes as immutable and version it: a tileset revision is a new object key, never an in-place overwrite. This is what lets you roll a fleet forward and back deterministically, and it pairs naturally with the dataset versioning discipline described in Version Tagging & Sync for Spatial Datasets.
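
One way to make "a revision is a new object, never an overwrite" concrete on the Kubernetes side is to bind the tile-assets-ro claim used by the Deployment to a pre-populated volume named after the revision. The following is a minimal sketch that assumes a statically provisioned NFS export; the volume name tile-assets-r2024-06, the NFS server, the export path, and the sizes are all hypothetical.

```yaml
# tile-assets-pv.yaml: sketch only; server, path, names, and sizes are assumptions
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tile-assets-r2024-06             # hypothetical revision-named volume
spec:
  capacity: { storage: 50Gi }
  accessModes: ["ReadOnlyMany"]
  persistentVolumeReclaimPolicy: Retain  # keep earlier revisions available for rollback
  storageClassName: ""                   # static binding, no dynamic provisioner
  nfs:
    server: nfs.example.internal         # hypothetical NFS endpoint
    path: /exports/tiles/r2024-06
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tile-assets-ro                   # claim name referenced by the Deployment
  namespace: geo-tiles
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: ""
  volumeName: tile-assets-r2024-06       # pin the claim to exactly one revision
  resources:
    requests: { storage: 50Gi }
```

Because a bound claim cannot be re-pointed at a different volume, promoting a revision under this scheme would mean creating a new volume and claim pair and changing the Deployment's claimName in Git, which keeps rollback a one-line revert.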

The data isolation model has two concrete mechanisms:

Read-only mounts. Asset volumes are mounted readOnly: true. A compromised or buggy renderer cannot mutate, encrypt, or delete the tile corpus, and an accidental write fails loudly instead of corrupting shared state.
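Network segmentation. A NetworkPolicy confines the renderer’s egress to exactly the backends it needs and nothing else. In the policy that follows, those are the database pooler and cluster DNS; the asset volume is mounted by the node rather than by the pod, so it needs no pod-level egress rule. This prevents a renderer that proxies remote basemap providers from becoming an open egress hop.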

# networkpolicy-tileserver.yaml — least-privilege egress for the renderer
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tileserver-egress
  namespace: geo-tiles
spec:
  podSelector:
    matchLabels: { app: tileserver-gl }
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: edge-proxy } }  # only the proxy may call renderers
      ports:
        - { protocol: TCP, port: 8080 }
  egress:
    - to:
        - podSelector: { matchLabels: { app: postgis } }       # dynamic feature queries
      ports:
        - { protocol: TCP, port: 6432 }                        # PgBouncer, not direct 5432
    - to:                                                      # DNS only
        - namespaceSelector: {}
          podSelector: { matchLabels: { k8s-app: kube-dns } }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }                          # DNS falls back to TCP for large responses

When the rendering pipeline issues dynamic feature queries, route them through a connection pooler rather than opening direct sessions against the primary. PgBouncer in transaction-pooling mode caps concurrent backend connections so a render spike cannot exhaust the database’s connection slots, and read replicas absorb heavy analytical reads off the primary render path. Decoupling the spatial backend this way is the single most effective defence against I/O contention bleeding into tile latency during peak query windows.
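
The pooling behaviour described above could be configured roughly as follows. This is a sketch, not a tuned configuration: the ConfigMap name, the postgis-primary host, the gis database name, and every pool size are assumptions to be replaced with values derived from the primary's max_connections.

```yaml
# pgbouncer-config.yaml: sketch only; host, database name, and pool sizes are assumptions
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
  namespace: geo-tiles
data:
  pgbouncer.ini: |
    [databases]
    ; hypothetical primary service and database name
    gis = host=postgis-primary port=5432 dbname=gis

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = scram-sha-256
    auth_file = /etc/pgbouncer/userlist.txt
    ; release the server connection after each transaction
    pool_mode = transaction
    ; client-side ceiling: renderers may queue here
    max_client_conn = 500
    default_pool_size = 20
    ; hard cap on backend connections, kept well under max_connections
    max_db_connections = 40
```

In transaction mode, session-level state such as SET or advisory locks does not carry over between transactions, so renderer queries should not rely on it.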

Declarative Configuration and Drift Management

In multi-service geospatial stacks, TileServer GL rarely operates alone — it interfaces with metadata catalogs, authentication proxies, and web GIS frontends. Managing configuration drift across those services requires disciplined override management and environment-variable injection rather than hand-edited containers.

Mount style JSON and tileserver config.json from a ConfigMap and inject any credentials from a Secret, both as read-only volumes, so the container image itself carries no environment-specific configuration and credential rotation never requires a rebuild:

# tileserver-config.yaml — style + server config as immutable, mounted state
apiVersion: v1
kind: ConfigMap
metadata:
  name: tileserver-config
  namespace: geo-tiles
data:
  config.json: |
    {
      "options": {
        "paths": { "root": "/data", "styles": "styles", "fonts": "fonts", "mbtiles": "mbtiles" },
        "serveStaticMaps": false,
        "formatQuality": { "jpeg": 80, "webp": 80 }
      },
      "styles": {
        "basemap": { "style": "styles/basemap/style.json", "tilejson": { "bounds": [-180,-85,180,85] } }
      },
      "data": {
        "osm": { "mbtiles": "mbtiles/osm.mbtiles" }
      }
    }

For edge or hybrid deployments that use Docker Compose instead of Kubernetes, keep the base docker-compose.yml environment-agnostic and layer a per-environment docker-compose.override.yml on top, so local development settings never leak into a production manifest. Whichever runtime you target, the rule is the same: the desired state lives in version control, and the running fleet is reconciled toward it — never edited live.
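
A per-environment override might look like the sketch below. It assumes the base docker-compose.yml defines a service named tileserver; that service name, the local directories, and the published port are hypothetical.

```yaml
# docker-compose.override.yml: sketch only; service name, paths, and port are assumptions
services:
  tileserver:
    ports:
      - "8080:8080"              # publish the renderer directly for local testing
    volumes:
      - ./dev-assets:/data:ro    # hypothetical local asset directory, still read-only
      - ./dev-config:/config:ro  # local config.json in place of the production one
    restart: "no"                # fail visibly during development instead of restarting
```

Docker Compose merges docker-compose.override.yml onto the base file automatically when both sit in the same directory, so the production manifest never has to mention these settings.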

Authentication and API Boundary Enforcement

TileServer GL has no native multi-tenant authorization model, so the API boundary must be enforced at the ingress rather than inside the renderer. The renderer should trust only requests that arrive through the proxy and carry a verified identity, which keeps token validation, rate limiting, and scope checks in one auditable place. This boundary is where the tile fleet inherits the portal-wide access model — align the scopes and tenant claims here with Implementing RBAC for Multi-Tenant GIS Portals so a tile request is governed by the same roles as the rest of the portal, and with the request-classification rules in Security Boundary Mapping for OGC Services.

The enforcement pattern at the proxy is:

Validate, then strip. The proxy verifies the bearer token (OAuth2/OIDC), then strips the client Authorization header before forwarding, so the renderer never sees raw credentials.
Inject scoped context. The proxy injects a signed X-Tenant-Id and X-Style-Scope header derived from the token claims; the renderer (or a sidecar) restricts which styles/ and data/ paths that tenant may request.
Scope outbound credentials. API keys for external basemap providers live in a mounted Secret, are scoped to the minimum referer/origin, and are rotated on a schedule independent of deployments.

# edge-proxy snippet — auth_request gate in front of the tile fleet
location /tiles/ {
    auth_request /_authz;                       # OAuth2 introspection sidecar
    auth_request_set $tenant $upstream_http_x_tenant_id;

    proxy_set_header   X-Tenant-Id $tenant;     # tenant context from the authz subrequest
    proxy_set_header   Authorization "";        # never forward client creds to renderer
    proxy_pass         http://tileserver_gl;
    proxy_cache        tiles_zone;
    proxy_cache_key    "$tenant$uri";           # tenant-scoped cache, no cross-tenant bleed
    add_header         Cache-Control "private, max-age=86400, stale-while-revalidate=3600";  # private: keep tenant-scoped tiles out of shared downstream caches
}

Note the $tenant-prefixed proxy_cache_key: without it, a cached tile from one tenant’s scoped style could be served to another. Tenant-scoping the cache key is the difference between a shared cache that accelerates everyone and one that silently leaks restricted imagery across the boundary.

CI/CD Integration and Environment Parity

Reproducible deployments depend on strict environment parity between developer workstations, CI runners, and production clusters. Container images must be built from deterministic Dockerfiles that pin base-image digests, compile native dependencies from source, and verify checksums for every external asset. The full discipline for keeping those stages identical is covered in Environment Parity in Geospatial CI Pipelines; the tile-specific gates layered on top are what stop a broken tileset or unrenderable style from ever reaching production.

# .ci/tile-gates.yml — pre-promotion validation for the tile image
stages: [validate, build, deploy]

validate-assets:
  stage: validate
  script:
    - sqlite3 osm.mbtiles "PRAGMA integrity_check;" | grep -qx "ok"   # corrupt MBTiles -> fail
    - sqlite3 osm.mbtiles "SELECT name,value FROM metadata;" | grep -q "format"
    - npx --package @maplibre/maplibre-gl-style-spec gl-style-validate styles/basemap/style.json
    - tilelive-copy --help >/dev/null                                # toolchain present

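build-image:
  stage: build
  script:
    - docker build --pull -t "$IMAGE:$CI_COMMIT_SHA" .
    - docker run --rm "$IMAGE:$CI_COMMIT_SHA" --version                # smoke: binary starts
    - docker push "$IMAGE:$CI_COMMIT_SHA"                              # the registry digest exists only after push
    - DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' "$IMAGE:$CI_COMMIT_SHA" | cut -d@ -f2)
    - cosign sign --yes "$IMAGE@$DIGEST"                               # sign the digest, not the mutable tag
    - echo "DIGEST=$DIGEST" > build.env                                # hand the digest to the deploy stage
  artifacts:
    reports: { dotenv: build.env }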

deploy-gitops:
  stage: deploy
  script:
    - yq -i ".spec.template.spec.containers[0].image = \"$IMAGE@$DIGEST\"" k8s/tileserver-deployment.yaml
    - git commit -am "tiles: promote $DIGEST" && git push   # reconciler applies; no kubectl apply

Two practices keep this honest in production. First, the deploy stage commits an image digest to a Git-tracked manifest and lets the GitOps reconciler apply it — there is no imperative kubectl apply from CI, so the live cluster’s state always equals what is in version control and drift is detectable as a diff. Second, the reconciler runs continuously: if someone edits the running Deployment by hand, the next reconcile loop reverts it to the committed manifest, which is what makes “the renderer is stateless and reproducible” an enforced property rather than an aspiration.
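
The source does not name a reconciler, so the following sketch assumes Argo CD purely for illustration; the repository URL, branch, and argocd namespace are hypothetical. It shows the two properties the paragraph above relies on: the cluster follows the committed manifest, and manual edits are reverted.

```yaml
# argocd-tileserver-app.yaml: sketch only, assuming Argo CD; repo URL and branch are hypothetical
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tileserver-gl
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.org/geo/platform.git   # hypothetical repository
    targetRevision: main
    path: k8s                  # directory holding tileserver-deployment.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: geo-tiles
  syncPolicy:
    automated:
      prune: true              # remove objects deleted from Git
      selfHeal: true           # revert hand edits made to the live Deployment
```

Other reconcilers offer equivalent settings; what matters is that automated sync and drift reversal are both switched on.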

Operational Troubleshooting and Observability

Scrape renderer metrics with Prometheus and track tile render latency, renderer pool utilization, cache miss ratio, and style-compilation failures. Confirm which of these your TileServer GL build actually exposes, and derive the rest from proxy logs or an exporter sidecar. Correlate those with the proxy’s proxy_cache_status and the database’s session counts so a latency spike can be attributed to the renderer, the cache, or the spatial backend rather than guessed at. Configure Horizontal Pod Autoscalers to scale on active HTTP connections or cache-miss ratio rather than CPU alone, since CPU lags the real saturation signal during a render storm; both signals require a custom or external metrics adapter, because the autoscaler reads only CPU and memory natively. Implement graceful-shutdown (SIGTERM) handling so in-flight renders complete before a pod exits, with terminationGracePeriodSeconds set above your p99 render latency. When a renderer must hand off because it cannot serve a layer at all, pair the fleet with the degradation logic in Fallback Routing Strategies for Tile Servers.
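
An autoscaler that follows connection load rather than CPU alone could be sketched as below. The metric name http_active_connections is hypothetical and only works if a custom metrics adapter publishes it per pod; the replica bounds and targets are placeholders, not recommendations.

```yaml
# tileserver-hpa.yaml: sketch only; metric name, bounds, and targets are assumptions
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tileserver-gl
  namespace: geo-tiles
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tileserver-gl
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_active_connections   # hypothetical per-pod custom metric
        target:
          type: AverageValue
          averageValue: "40"
    - type: Resource                      # CPU kept as a secondary signal
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300     # avoid shedding pods between bursts
```

With several metrics listed, the autoscaler acts on whichever one asks for the most replicas, so the CPU rule acts as a backstop rather than the primary trigger.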

Use this symptom-to-fix matrix as the first diagnostic pass:

Blank or transparent tiles, pod is Ready. Style compiled but a glyph range or sprite failed to mount. Check the renderer log for ENOENT on a fonts/ or sprite path; confirm the read-only asset PVC is bound and the style.json glyph URL template resolves. Tighten the readiness probe to a real style.json so the pod is not marked ready before assets mount.
HTTP 500 on /styles/<id>/.... Invalid or partially-written style.json. Validate with gl-style-validate in CI; verify the ConfigMap mounted completely (a truncated mount yields a parse error at startup, visible as a style-compilation failure metric).
Intermittent 502/504 at the proxy under load. Renderer saturation or OOM kill. Look for OOMKilled in pod status and container_memory_working_set_bytes climbing toward the limit; increase memory limit headroom and scale the HPA on connections, not CPU.
Tiles from the wrong tenant. Cache key not tenant-scoped. Confirm proxy_cache_key includes the $tenant claim and that the proxy strips the client Authorization header before caching.
Stale tiles after a dataset update. Cache not invalidated on tileset version bump. Roll the cache namespace or key prefix when promoting a new .mbtiles revision; never overwrite an existing object key in place.
High DB latency only during render spikes. Direct-connection exhaustion against PostGIS. Route renderer queries through PgBouncer (transaction pooling), cap pool size below the primary’s max_connections, and offload analytical reads to a replica.
Renderers flap during rolling updates. Grace period shorter than in-flight render time. Raise terminationGracePeriodSeconds above p99 render latency and confirm the container handles SIGTERM by draining rather than exiting immediately.

Beyond the live matrix, keep a documented manual-failover runbook for when automated recovery thresholds are breached, audit read-only mounts for stale assets on a schedule, and rotate the external basemap provider credentials independently of image releases. A TileServer GL fleet built this way degrades gracefully under load instead of failing all at once, which is the entire point of moving off a single node.

Related

Infrastructure Orchestration & Configuration Management: the parent overview this page belongs to.
Kubernetes StatefulSets for PostGIS Databases: the stateful spatial backend the renderer queries.
Reverse Proxy Configuration for WMS/WFS: the ingress and cache tier in front of the fleet.
Environment Parity in Geospatial CI Pipelines: keeping build, staging, and production identical.
Fallback Routing Strategies for Tile Servers: graceful degradation when a renderer cannot serve a layer.
