Kubernetes StatefulSets for PostGIS Databases

Run a spatial database from a stateless Deployment and the failure is not subtle: a rescheduled pod reattaches to the wrong volume — or no volume at all — and your GiST indexes, table partitions, and write-ahead logs vanish or fork into split-brain. For GIS administrators, open-source portal maintainers, and government platform engineers, that is the difference between a routine node drain and a multi-hour outage of every WMS, WFS, and WCS endpoint the portal serves. This page sits within Infrastructure Orchestration & Configuration Management and explains how the StatefulSet controller gives PostGIS the three guarantees it actually needs — stable network identity, ordered lifecycle, and one-to-one persistent storage binding — and how to wire those guarantees into declarative config, access control, and CI/CD so a production geospatial data layer becomes reproducible and auditable rather than hand-nursed.

Figure: StatefulSet topology with stable identities. An ordinal primary writer and its replicas are each bound to their own PVC, with reads and writes split via the headless service.

Architectural Placement: Where the StatefulSet Sits in the Stack

A StatefulSet is the bottom of the data plane — every higher tier in the portal ultimately resolves to a query against one of its pods. Unlike a Deployment, which treats its pods as interchangeable cattle behind a single load-balanced Service, a StatefulSet issues each pod a sticky ordinal identity (postgis-0, postgis-1, postgis-2) and a matching DNS record under a headless service. That stable identity is what lets you designate postgis-0 as the primary writer and the remaining ordinals as streaming replicas without an external service-discovery layer.
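
As a minimal sketch of the headless service this identity depends on, the manifest below reuses the postgis-headless name, the geoportal namespace, and the app: postgis label that appear elsewhere on this page; the port name is an assumption. Setting clusterIP: None is what makes cluster DNS return per-pod records such as postgis-0.postgis-headless.geoportal.svc.cluster.local instead of a single virtual IP.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgis-headless
  namespace: geoportal
spec:
  # Headless: no virtual IP, so DNS resolves to the individual pod records
  clusterIP: None
  selector:
    app: postgis
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
```

The StatefulSet refers to this object through its serviceName field, so the Service should exist before the first ordinal is created.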

The controller also manages persistent volume claims through its volumeClaimTemplates block. Each replica receives a deterministic PVC name derived from the template name and the pod’s zero-based index — data-postgis-0, data-postgis-1 — and that PVC is never reassigned to a different ordinal. This binding is the architectural core: it guarantees that spatial indexes, partitions, and WAL segments survive pod rescheduling, node failure, or a rolling update without manual volume reattachment. When postgis-1 is evicted and rescheduled to another node, Kubernetes reattaches data-postgis-1 to the new pod, and PostGIS replays its WAL from exactly where it left off.

Storage class selection is part of the architecture, not an afterthought. GiST and SP-GiST index maintenance, bulk geometry loads, and raster ingestion all generate heavy random I/O, so the storageClassName must point at a provisioner that delivers predictable IOPS and low p99 latency. The procedures for binding dynamic provisioners, choosing reclaim policies, and validating volume topology against node zones live in the operational walkthrough Deploying PostGIS on Kubernetes with Persistent Volumes. The rest of this page assumes those volumes exist and focuses on the controller, isolation, and operational behavior layered on top.

Read traffic from the rendering tier never touches postgis-0 directly. Heavy ST_AsMVT and ST_TileEnvelope calls from the vector-tile fleet described in Containerizing TileServer GL for High Availability are routed to replica ordinals, so a spike in basemap rendering cannot starve transactional writes. The StatefulSet is therefore the pivot point where the portal’s read/write separation physically materializes.

Data Isolation and the Replication Trust Model

Stable identity is only useful if the data it protects is correctly isolated. Three isolation boundaries matter for a PostGIS StatefulSet, and each maps to a concrete Kubernetes mechanism.

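Storage isolation is enforced by the one-PVC-per-ordinal rule above, reinforced by an accessModes: ["ReadWriteOnce"] claim. RWO restricts a volume to a single node at a time, and because each ordinal owns its own claim, no two PostGIS pods are ever handed the same PostgreSQL data directory, which is the most direct route to catastrophic heap corruption. RWO does not by itself stop a second pod on the same node from mounting the claim; where the CSI driver supports it, ReadWriteOncePod tightens the guarantee to exactly one pod.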
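
For clusters that want the stricter single-pod guarantee, a hedged variant of the claim template might look like the fragment below. It assumes a CSI driver and Kubernetes version that support the ReadWriteOncePod access mode; the claim name, storage class, and size are taken from the manifest later on this page.

```yaml
# Fragment of the StatefulSet spec, not a complete manifest
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      # Assumption: the CSI driver supports ReadWriteOncePod.
      # Only one pod cluster-wide may mount the volume at a time.
      accessModes: ["ReadWriteOncePod"]
      storageClassName: fast-ssd-retain
      resources:
        requests:
          storage: 200Gi
```

Because volumeClaimTemplates cannot be changed on an existing StatefulSet, the access mode is best decided before the first rollout.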

Network isolation is enforced with a NetworkPolicy. Without one, any pod in the cluster, not only those in the namespace, can open a TCP connection to port 5432. For a multi-tenant portal that is unacceptable: a compromised tile renderer should not be able to reach the primary writer. A policy that selects the PostGIS pods and admits only labelled clients closes that gap, because once a pod is selected by an Ingress policy all other inbound traffic to it is denied:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgis-restrict-ingress
  namespace: geoportal
spec:
  # Apply to every pod managed by the StatefulSet
  podSelector:
    matchLabels:
      app: postgis
  policyTypes:
    - Ingress
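  ingress:
    # Only pods explicitly labelled as DB clients may reach 5432
    - from:
        - podSelector:
            matchLabels:
              db-access: "postgis"
        # StatefulSet peers must also be admitted, or replicas cannot
        # stream WAL from the primary once this policy is in force
        - podSelector:
            matchLabels:
              app: postgis
      ports:
        - protocol: TCP
          port: 5432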

Replication-role isolation is the trust boundary between primary and replicas. WAL streaming uses a dedicated role that holds only the REPLICATION privilege, never the application superuser, and its pg_hba.conf entry is limited to the pod network range from which the StatefulSet’s pods draw their addresses. Replicas connect to postgis-0.postgis-headless.geoportal.svc.cluster.local over that role, and the application roles that back the portal’s tenants are governed separately. Where those application roles enforce per-tenant row visibility, the model mirrors the row-level-security and role design covered in Implementing RBAC for Multi-Tenant GIS Portals; the database tier is where that policy is ultimately adjudicated, so the StatefulSet must expose the roles intact, not collapse them behind a shared connection.
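
One way to express that scoping is to ship pg_hba.conf in a ConfigMap. The sketch below is illustrative only: the ConfigMap name postgis-hba, the role name replicator, and the 10.244.0.0/16 range are assumptions that must be replaced with the cluster's real values.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgis-hba            # hypothetical name
  namespace: geoportal
data:
  pg_hba.conf: |
    # TYPE  DATABASE     USER        ADDRESS         METHOD
    # Local socket access for maintenance inside the pod
    local   all          postgres                    peer
    # WAL streaming: dedicated role, pod network only (CIDR is an assumption)
    host    replication  replicator  10.244.0.0/16   scram-sha-256
    # Application roles reach the geoportal database from the same range
    host    geoportal    all         10.244.0.0/16   scram-sha-256
```

pg_hba.conf is evaluated top to bottom and rejects anything that matches no line, so the absence of a catch-all entry is deliberate. PostgreSQL only reads the file if the hba_file setting points at the path where the ConfigMap is mounted.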

Declarative Configuration: The Annotated StatefulSet

The controller manifest is the single source of truth for the data tier. The block below is a production-shaped definition — abbreviated only where the persistent-volume guide already covers the detail — with the load-bearing fields commented inline.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgis
  namespace: geoportal
spec:
  # Bind pods to the headless service so each ordinal gets stable DNS
  serviceName: postgis-headless
  replicas: 3
  # Boot/terminate one ordinal at a time so the primary settles before replicas join
  podManagementPolicy: OrderedReady
  selector:
    matchLabels:
      app: postgis
  template:
    metadata:
      labels:
        app: postgis
    spec:
      # Spread ordinals across zones; a single AZ loss must not take all pods
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgis
      initContainers:
        # Apply schema + extensions deterministically before postgres accepts traffic
        - name: bootstrap-schema
          image: postgis/postgis:16-3.4
          command: ["/bin/bash", "/bootstrap/init.sh"]
          volumeMounts:
            - name: bootstrap
              mountPath: /bootstrap
            - name: data
              mountPath: /var/lib/postgresql/data
      containers:
        - name: postgis
          image: postgis/postgis:16-3.4
          ports:
            - containerPort: 5432
              name: postgres
          # Tune autovacuum for high-churn spatial tables (see troubleshooting)
          args:
            - -c
            - autovacuum_vacuum_scale_factor=0.05
            - -c
            - maintenance_work_mem=512MB
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              memory: 8Gi
          readinessProbe:
            # Gate Service endpoints on PostgreSQL accepting connections, not just an open socket
            exec:
              command: ["pg_isready", "-U", "geoportal", "-d", "geoportal"]
            initialDelaySeconds: 15
            periodSeconds: 10
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: bootstrap
          configMap:
            name: postgis-bootstrap
  # One PVC per ordinal, never reassigned — the durability guarantee
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd-retain
        resources:
          requests:
            storage: 200Gi

Three fields carry most of the operational weight. podManagementPolicy: OrderedReady forces postgis-0 to become Ready before postgis-1 is created, so replicas never attempt to stream WAL from a primary that has not finished its own bootstrap. The readinessProbe runs pg_isready rather than a bare TCP check, which keeps a pod out of the Service endpoint list until PostgreSQL is accepting connections. pg_isready does not execute a query, and a hot standby that is still replaying WAL reports ready, so replica lag has to be monitored separately (for example with pg_stat_replication on the primary). And the volumeClaimTemplates block, with storageClassName: fast-ssd-retain, pairs the durability guarantee with a Retain reclaim policy: deleting a StatefulSet leaves its PVCs in place by default, and Retain additionally keeps the underlying volume and its geometry data if a PVC itself is deleted.
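
The fast-ssd-retain class itself is not defined on this page. As a hedged sketch, it could be declared as below; the provisioner and its parameters assume the AWS EBS CSI driver and are placeholders for whatever the cluster actually runs.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
# Assumption: AWS EBS CSI driver; substitute the cluster's own provisioner
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"
  throughput: "250"
# Keep the backing volume when its PVC is deleted
reclaimPolicy: Retain
# Provision only after scheduling so the volume lands in the pod's zone
volumeBindingMode: WaitForFirstConsumer
# Permit growing a volume later by editing the PVC itself
allowVolumeExpansion: true
```

WaitForFirstConsumer matters here because the pods are spread across zones: the volume is created where the scheduler placed the pod, not in an arbitrary zone chosen ahead of time.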

The init.sh mounted from the postgis-bootstrap ConfigMap is what makes the schema reproducible. It runs CREATE EXTENSION IF NOT EXISTS postgis;, installs custom topology functions, and applies versioned migrations — so development, staging, and production clusters converge on identical structure. Treating the schema as immutable infrastructure this way is the same discipline enforced in Environment Parity in Geospatial CI Pipelines, where a drifted SRID or a missing extension is caught at the pipeline gate rather than in a broken production map.

Connection Boundary: Routing Writes and Reads Safely

The application tier must never guess which pod is the writer. The boundary is enforced by DNS and a connection pooler, not by application logic. PgBouncer sits in front of the StatefulSet with two pools: a write pool whose host is the pinned primary DNS name, and a read pool that fans out across replica ordinals.

[databases]
# Writes are pinned to the ordinal-0 primary's stable headless DNS record
geoportal_rw = host=postgis-0.postgis-headless.geoportal.svc.cluster.local port=5432 dbname=geoportal
# Reads target replicas; rotate ordinals at the service layer
geoportal_ro = host=postgis-1.postgis-headless.geoportal.svc.cluster.local port=5432 dbname=geoportal

[pgbouncer]
listen_port = 6432
auth_type = scram-sha-256
# Transaction pooling maximizes reuse for short OGC SELECTs
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 40

Credentials are scoped, not shared. The write pool authenticates as a role that holds INSERT/UPDATE/DELETE on the spatial schemas; the read pool authenticates as a role with SELECT-only grants and a default_transaction_read_only = on setting, so even a misrouted write fails closed at the database rather than silently corrupting a replica. SCRAM credentials are injected from a Kubernetes Secret mounted into the PgBouncer pod — never baked into the image or the ConfigMap. This is the same principle the portal applies at its HTTP edge, where the proxy that terminates OGC requests injects scoped headers as described in Reverse Proxy Configuration for WMS/WFS: the credential that reaches the database is the narrowest one that can satisfy the request.
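
A minimal sketch of that Secret follows, assuming PgBouncer reads its users from an auth_file. The Secret name and the two role names are hypothetical, and the angle-bracket values are placeholders to be supplied by a secret manager, never committed to Git.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: pgbouncer-userlist     # hypothetical name
  namespace: geoportal
type: Opaque
stringData:
  # PgBouncer auth_file format: "username" "password-or-SCRAM-secret"
  userlist.txt: |
    "geoportal_writer" "<SCRAM secret for the write role>"
    "geoportal_reader" "<SCRAM secret for the read-only role>"
```

The file would be mounted into the PgBouncer pod and referenced with auth_file in the [pgbouncer] section, keeping the ConfigMap that holds pgbouncer.ini free of credentials.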

For OGC services specifically, the read pool is the default and the write pool is reserved for transactional WFS-T operations and ingestion jobs. Aligning that split with the service-level trust zones documented in Security Boundary Mapping for OGC Services keeps the public-facing read path on replicas and the privileged write path on the primary, behind the NetworkPolicy shown earlier.

Failover and Primary Promotion

The StatefulSet guarantees stable identity, but it does not, on its own, know which ordinal is the PostgreSQL primary — that role is application-level state layered on top of the controller. When postgis-0 fails, promotion is a deliberate, gated operation, never an automatic side effect of pod rescheduling. Treating it otherwise is how a Postgres cluster ends up with two writers and a forked timeline.

A safe promotion flow respects the ordinal contract. The recovering primary must rejoin as a replica, not race the new one for write authority:

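# 1. Confirm the old primary is genuinely down, not just network-partitioned,
#    before touching the writer role; a false positive causes split-brain.
kubectl exec postgis-0 -n geoportal -- pg_isready -t 5 || echo "primary unreachable"

# 2. Promote the most-caught-up replica (lowest pg_last_wal_receive_lsn gap).
#    pg_ctl refuses to run as root, so drop to the postgres OS user first.
kubectl exec postgis-1 -n geoportal -- \
  gosu postgres pg_ctl promote -D /var/lib/postgresql/data

# 3. Repoint the write pool. An annotation alone changes no routing, so switch the
#    postgis-rw selector to the promoted pod (assumes geoportal_rw targets this
#    Service); reads continue uninterrupted on the remaining replicas.
kubectl patch service postgis-rw -n geoportal --type merge \
  -p '{"spec":{"selector":{"statefulset.kubernetes.io/pod-name":"postgis-1"}}}'

# 4. Re-introduce the recovered node as a replica with pg_rewind, never as a writer.
#    pg_rewind needs the target PostgreSQL cleanly shut down and must not run as
#    root; when postgres is the container's main process, run it from a maintenance
#    container that mounts data-postgis-0. Afterwards create standby.signal and set
#    primary_conninfo before starting the server.
kubectl exec postgis-0 -n geoportal -- gosu postgres pg_rewind \
  --source-server="host=postgis-1.postgis-headless.geoportal.svc.cluster.local" \
  --target-pgdata=/var/lib/postgresql/data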

In production this sequence is automated by Kubernetes-aware high-availability tooling such as Patroni, Stolon, or CloudNativePG, which holds a distributed lock so only one promotion can win, watches the replica WAL positions to pick the least-lagged candidate, and rewrites the writer endpoint atomically. Patroni and Stolon are typically run on top of a StatefulSet: they consume the stable ordinal DNS the controller provides and add the consensus layer the controller deliberately leaves out. CloudNativePG is the exception, since it manages pods and PVCs directly instead of through a StatefulSet, so adopting it replaces the manifest on this page rather than extending it. The topologySpreadConstraints in the manifest are what make automated failover worthwhile: with ordinals spread across availability zones, the loss of a single zone leaves a healthy, caught-up replica in another zone ready to take writes.
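
The postgis-rw Service named in step 3 of the promotion flow is not defined on this page. One possible shape, offered as an assumption and not as the portal's actual definition, is a Service whose selector pins a single pod through the statefulset.kubernetes.io/pod-name label that the StatefulSet controller attaches to every pod it creates.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgis-rw
  namespace: geoportal
spec:
  selector:
    app: postgis
    # Set automatically by the StatefulSet controller on each pod;
    # change the value to the promoted ordinal to move the write path
    statefulset.kubernetes.io/pod-name: postgis-0
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
```

With this shape, repointing writes is a single selector change, and the write pool can target postgis-rw.geoportal.svc.cluster.local instead of a fixed ordinal.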

CI/CD Integration: Gating the Rollout

A PostGIS StatefulSet should never be kubectl apply-ed by hand. It is reconciled by a GitOps controller such as Argo CD or Flux from a versioned manifest directory, so the live cluster state is always a function of a Git commit. That gives you an auditable history and an automatic drift alarm: if someone edits the live StatefulSet with kubectl edit, the controller detects the drift (Argo CD marks the application OutOfSync) and, where automated reconciliation is enabled, reverts it.
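
For Argo CD, that reconciliation could be declared roughly as follows. This is a sketch under stated assumptions: the Application name and repository URL are hypothetical, and the overlays/production path is borrowed from the dry-run step in the pipeline on this page.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: postgis                # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    # Hypothetical repository; replace with the real manifest repo
    repoURL: https://git.example.org/geoportal/platform.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: geoportal
  syncPolicy:
    automated:
      # Revert manual edits to live resources back to the Git state
      selfHeal: true
      # Do not delete resources that vanish from Git without review
      prune: false
```

Leaving prune off is a conservative choice for a data tier: a manifest removed by mistake is reported as drift and is not deleted from the cluster.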

The pipeline that promotes a manifest change runs three gates before the sync is allowed to proceed:

# .gitlab-ci.yml — gates that must pass before Argo CD syncs the StatefulSet
validate-manifests:
  stage: test
  script:
    # 1. Schema and reference-system integrity against an ephemeral PostGIS
    - ./ci/spin-ephemeral-postgis.sh
    - psql "$EPHEMERAL_URL" -f schema/migrations.sql
    - psql "$EPHEMERAL_URL" -c "SELECT postgis_full_version();"
    - ./ci/assert-srid-present.sh 4326 3857
    # 2. Dry-run the rendered manifests against the live API server
    - kubectl apply --dry-run=server -k overlays/production
    # 3. Confirm volumeClaimTemplates were not mutated (immutable field guard)
    - ./ci/assert-pvc-template-unchanged.sh
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

The third gate matters because volumeClaimTemplates is immutable on an existing StatefulSet: the API server rejects a change to the storage size or class with a Forbidden validation error, which in a GitOps flow surfaces as a failed or stuck sync, not as a resized volume. Catching that diff in CI turns a confusing production stall into a clear merge-request failure. Backup verification belongs in the same pipeline: a nightly restore of a pg_basebackup into an ephemeral pod, validated with a known-row SELECT, proves the backup and its WAL are actually recoverable before anyone needs them. These gates extend the parity discipline of Environment Parity in Geospatial CI Pipelines down to the storage layer.

Operational Troubleshooting

Most StatefulSet incidents present as a stuck rollout, a degraded query path, or a replica that will not catch up. Work the matrix below symptom-first.

Pod stuck in Pending, event pod has unbound immediate PersistentVolumeClaims. The provisioner cannot satisfy the claim. Check kubectl get pvc -n geoportal for a Pending PVC, then confirm the storageClassName exists and that its zone topology matches the pod’s topologySpreadConstraints. A volume that can only provision in zone-a will never bind a pod the scheduler placed in zone-b; setting volumeBindingMode: WaitForFirstConsumer on the storage class defers provisioning until the pod has been scheduled, so the volume is created in the pod’s zone.
Rollout halted at postgis-0, replicas never created. With OrderedReady, a primary that never passes its readinessProbe blocks every later ordinal. Inspect kubectl logs postgis-0 -n geoportal (add -c bootstrap-schema for the init container) and, if logging_collector is enabled, the server log under /var/lib/postgresql/data/log/; a failed bootstrap-schema init container or a corrupt WAL replay is the usual cause.
Replica postgis-1 lagging or LOG: started streaming WAL then disconnecting. Verify the replication role and the pg_hba.conf CIDR admit the replica pod IP, and check pg_stat_replication on the primary. Lag that grows under load usually means replica IOPS are saturated — confirm the replica’s PVC uses the same fast storageClassName as the primary.
Queries slow, pg_stat_user_tables shows n_dead_tup climbing. Autovacuum is not keeping pace with high-churn spatial tables. Lower autovacuum_vacuum_scale_factor and reduce autovacuum_vacuum_cost_delay per table; left unchecked this drives transaction-ID wraparound risk and bloated GiST indexes. Cross-reference the PostgreSQL documentation on routine vacuuming before overriding defaults.
too many connections under render load. The tile fleet is bypassing the pooler, or the pools are oversized. Confirm every client resolves through PgBouncer’s 6432 port and that, for each PostGIS pod, default_pool_size × the number of database/user pairs × the number of PgBouncer instances pointed at it stays under that pod’s max_connections; never let renderers open direct 5432 connections to a replica.
Write succeeds but vanishes after failover. Either the write was committed on the old primary but had not yet been streamed to the promoted replica (asynchronous replication acknowledges a commit before replicas receive it), or a manual promotion left two writers. Confirm the write pool’s host still resolves to the live primary ordinal and that pg_controldata reports the cluster state in production on exactly one node. Consult the Kubernetes StatefulSet controller reference for the ordinal guarantees that promotion tooling must respect.

Adopting the StatefulSet controller turns spatial database administration from a reactive, manual chore into a predictable workflow: deterministic storage binding, codified initialization, tuned background maintenance, and an enforced read/write boundary together let platform teams meet enterprise and public-sector SLAs without heroics.

Related

Deploying PostGIS on Kubernetes with Persistent Volumes: storage provisioning and volume topology in depth.
Environment Parity in Geospatial CI Pipelines: schema and SRS gates that protect every rollout.
Containerizing TileServer GL for High Availability: the read-heavy rendering tier that sits on top of the replicas.
Reverse Proxy Configuration for WMS/WFS: scoped-credential injection at the OGC HTTP edge.
Implementing RBAC for Multi-Tenant GIS Portals: the role and row-security model the database tier adjudicates.
