Metadata Catalog Automation & Ingestion Workflows: Operational Guide

Within the wider open-source geospatial portal operations program, metadata catalog automation is the discoverability control plane: the subsystem that decides which datasets are findable, how they are described, and whether downstream agencies can trust what they query. This guide gives GIS administrators, open-source maintainers, platform engineers, and government technology teams a production baseline for running catalog ingestion as a deterministic, configuration-as-code pipeline — one that harvests from heterogeneous upstreams, validates against published standards, indexes for low-latency search, and keeps an immutable audit trail, without ever requiring manual curation to keep the catalog correct.

This discoverability surface sits alongside two companion guides in the same program: runtime topology and trust boundaries are owned by Core Portal Architecture & Security Boundaries, and the provisioning and convergence machinery lives in Infrastructure Orchestration & Config Management. The catalog inherits its network segmentation, RBAC model, and GitOps reconciliation from those two surfaces and adds the metadata-specific concerns — protocol harvesting, schema normalization, and lineage tracking — on top.

End to end, records move from upstream providers through validation and normalization into the searchable catalog, and invalid records are diverted rather than halting the stream.

Architectural Foundations and Service Decomposition

Production-grade catalog ingestion must remain decoupled from the monolithic portal runtime. The reference architecture splits the workload into four single-responsibility services so that each can be scaled, deployed, and reasoned about independently: harvest workers that speak the upstream protocols, a stateless transform-and-validation tier, an indexing broker that fans records into the search store, and a synchronization service that reconciles catalog state against the underlying datasets. Each service is an ephemeral, horizontally scalable container scheduled by Kubernetes — the same orchestration substrate described in Infrastructure Orchestration & Config Management — and none of them holds durable local state beyond an in-flight batch.

The central design trade-off is between coupling and operational blast radius. A naive implementation harvests, validates, and indexes inside a single worker process; this is simple to deploy but means an upstream that returns malformed XML can stall the indexing thread and starve unrelated providers. The decomposed design instead routes every record through a message broker (Kafka, RabbitMQ, or Redis Streams) between stages, so that a slow validator or a saturated index applies backpressure to the harvesters rather than dropping records. Throughput is bounded by the slowest stage, which is almost always XML/JSON parsing during validation — a CPU-bound workload that benefits from per-stage horizontal scaling rather than scaling the pipeline as one unit.
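
The backpressure behaviour described above can be illustrated without a real broker. The sketch below is a minimal stand-in, assuming a bounded in-process queue in place of Kafka, RabbitMQ, or Redis Streams; `run_pipeline`, `stage`, and `capacity` are hypothetical names chosen for the example.

```python
import queue
import threading


def run_pipeline(records, stage, capacity=4):
    """Push records through one stage via a bounded queue; return the results."""
    buf = queue.Queue(maxsize=capacity)   # bounded: put() blocks when full
    done = object()                        # sentinel marking end of stream
    results = []

    def consumer():
        # Stand-in for a validation worker draining the ingest topic.
        while True:
            item = buf.get()
            if item is done:
                break
            results.append(stage(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for record in records:
        # When the stage is slow the queue fills and this call blocks,
        # so the producer slows down instead of dropping records.
        buf.put(record)
    buf.put(done)
    worker.join()
    return results
```

With a single consumer the output order matches the input order; a real broker adds partitioning and durability that this sketch leaves out.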

The four services map to concrete responsibilities:

Harvest workers — protocol clients that page through upstream endpoints, honour resumption tokens, and emit raw records onto the ingest topic. They are I/O-bound and tuned for connection reuse and backoff, covered in depth in Automated Metadata Ingestion via OAI-PMH.
Transform and validation tier — stateless workers that map external payloads onto the internal schema and gate them through structural and semantic checks, detailed in CSW Catalog Schema Mapping & Validation.
Indexing broker — bulk-writes validated records into the search store with checkpointing and idempotent document IDs, tuned for query performance in Search Indexing Optimization with Elasticsearch.
Synchronization service — watches dataset version events and propagates lineage and extent changes into catalog entries, the focus of Version Tagging & Sync for Spatial Datasets.
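
As an illustration of the harvest worker's paging loop, the sketch below follows resumption tokens until the upstream signals the end of the list. It assumes a hypothetical `fetch_page` callable that performs the HTTP request and parsing and returns the records plus the next token; only the token-following logic is shown.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# (records on this page, resumption token or None/"" when the list is complete)
Page = Tuple[List[str], Optional[str]]


def harvest(fetch_page: Callable[[Optional[str]], Page],
            max_pages: int = 10_000) -> Iterator[str]:
    """Yield every record from a paged upstream, following resumption tokens."""
    token: Optional[str] = None            # None requests the first page
    for _ in range(max_pages):
        records, token = fetch_page(token)
        yield from records
        if not token:                      # missing or empty token: done
            return
    # Guard against an upstream that keeps handing back tokens forever.
    raise RuntimeError("page budget exhausted; upstream may be looping tokens")
```

Injecting `fetch_page` keeps the loop testable against canned pages and leaves connection reuse and backoff to the caller.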

All pipeline definitions — harvest schedules, transformation rules, routing logic, and resource quotas — live in version-controlled repositories so that identical ingestion logic can be promoted across development, staging, and production with strict separation of duties. Nothing reaches a running worker except through a reviewed commit.

Security Boundary Mapping

The catalog spans three trust zones, and the boundary between them must be enforced at the network layer rather than only in application code. Harvest workers sit in an egress-only zone: they initiate outbound connections to untrusted upstream providers but accept no inbound traffic, so a compromised upstream cannot pivot into the Kubernetes cluster. The transform, index, and sync services sit in a restricted internal zone reachable only from the broker and from each other. The public-facing search API — the only surface end users touch — sits behind the same edge proxy and TLS termination point described in Security Boundary Mapping for OGC Services.

Treat every upstream payload as hostile until proven otherwise. Geospatial metadata frequently arrives as XML, which makes the pipeline a target for XML External Entity (XXE) and billion-laughs entity-expansion attacks. Harden every parser at the harvest boundary: disable DTD processing and external entity resolution entirely, cap document size, and bound expansion depth.

# Harden the XML parser at the harvest boundary before any upstream
# payload is touched. Applies to every harvest worker.
from lxml import etree

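# resolve_entities=False blocks XXE and entity expansion; load_dtd=False
# and no_network=True stop the parser from loading or fetching DTDs;
# huge_tree=False keeps libxml2's built-in depth and size limits in force.
SAFE_PARSER = etree.XMLParser(
    resolve_entities=False,   # neutralise XXE / entity-expansion
    load_dtd=False,           # never load a DTD, local or remote
    dtd_validation=False,     # no DTD validation pass
    no_network=True,          # never fetch remote DTDs/entities
    huge_tree=False,          # keep parser safety limits; reject huge trees
    remove_comments=True,
    remove_pis=True,
)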

def parse_upstream(raw: bytes) -> etree._Element:
    if len(raw) > 8 * 1024 * 1024:        # 8 MiB hard cap per record
        raise ValueError("record exceeds size budget")
    return etree.fromstring(raw, parser=SAFE_PARSER)

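Network segmentation is enforced declaratively with Kubernetes NetworkPolicy so that the boundary is auditable and survives redeployment. The policy below pins the egress zone: harvest workers may reach the broker, cluster DNS, and arbitrary external HTTPS endpoints, but nothing in the Kubernetes cluster may connect to them. The DNS rule is needed because an egress policy otherwise blocks the hostname lookups for upstream endpoints; its selector labels are assumptions to adjust for the cluster's DNS deployment.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: harvest-egress-only
  namespace: catalog
spec:
  podSelector:
    matchLabels: { tier: harvest }
  policyTypes: [Ingress, Egress]
  ingress: []                     # deny all inbound to harvest workers
  egress:
    - to:                          # the message broker only
        - podSelector: { matchLabels: { tier: broker } }
      ports:
        - { protocol: TCP, port: 9092 }
    - to:                          # cluster DNS (labels are an assumption)
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
          podSelector:
            matchLabels: { k8s-app: kube-dns }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    - to:                          # outbound HTTPS to upstream providers
        - ipBlock: { cidr: 0.0.0.0/0 }
      ports:
        - { protocol: TCP, port: 443 }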

Secrets — upstream API tokens, broker credentials, index write keys — are injected at runtime from the Kubernetes secret store and never baked into images or committed to the repository. Mutual TLS between internal services ensures that even a process that lands inside the restricted zone cannot read the ingest stream without a valid client certificate.

Identity and Access Control

The catalog has two distinct identity surfaces, and conflating them is a common source of over-privilege. The read surface is the public search API, where most users are anonymous or hold a low-privilege token; the write surface is the ingestion and administration plane, where only the pipeline service accounts and a small set of operators may mutate catalog state. Role assignments are inherited from the portal-wide model described in Implementing RBAC for Multi-Tenant GIS Portals, so an agency editor who can publish a dataset is automatically scoped to the matching catalog records and cannot touch another tenant’s metadata.

Authentication for the write surface flows through an OIDC provider (Keycloak in most reference deployments). Operators and CI runners present a short-lived JWT; the catalog admin API validates the token signature against the provider’s JWKS endpoint, checks the aud claim, and maps the token’s group claims onto catalog scopes — catalog:harvest:write, catalog:index:admin, catalog:audit:read. Service-to-service calls inside the pipeline use the same mechanism with client-credentials grants rather than interactive logins, so every write is attributable to a named principal in the audit log.

# Scope check applied to every write-surface request. The token is
# already signature-verified upstream; here we enforce least privilege.
REQUIRED_SCOPES = {
    "POST /harvest/run":   "catalog:harvest:write",
    "PUT /index/settings": "catalog:index:admin",
    "DELETE /records":     "catalog:index:admin",
}

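def authorize(method_path: str, token_scopes: set[str]) -> None:
    needed = REQUIRED_SCOPES.get(method_path)
    if needed is None:
        # Fail closed: a write route with no declared scope is denied
        # rather than silently allowed.
        raise PermissionError(f"no scope declared for: {method_path}")
    if needed not in token_scopes:
        raise PermissionError(f"missing scope: {needed}")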
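
The claim checks that precede the scope check can be sketched in the same style. This is a minimal illustration, assuming the token signature has already been verified against the JWKS endpoint and the claims decoded into a dict; the group names in `GROUP_SCOPES` and the audience value are hypothetical.

```python
import time
from typing import Optional, Set

# Hypothetical mapping from identity-provider groups to catalog scopes.
GROUP_SCOPES = {
    "catalog-harvesters": {"catalog:harvest:write"},
    "catalog-admins": {"catalog:harvest:write", "catalog:index:admin"},
    "catalog-auditors": {"catalog:audit:read"},
}


def scopes_from_claims(claims: dict, expected_aud: str,
                       now: Optional[float] = None) -> Set[str]:
    """Check aud and exp on verified claims, then map groups to scopes."""
    now = time.time() if now is None else now
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_aud not in audiences:
        raise PermissionError("token audience does not match")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    scopes: Set[str] = set()
    for group in claims.get("groups", []):
        scopes |= GROUP_SCOPES.get(group, set())   # unknown groups grant nothing
    return scopes
```

The returned set is what a scope check such as `authorize` would receive as `token_scopes`.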

The read surface stays deliberately thin: rather than embed the full RBAC engine in the hot query path, the indexer stamps each document with a visibility field at ingest time (public, tenant:<id>, or internal). The search API then injects a filter clause derived from the caller’s token, so authorization collapses into an index filter that the search engine evaluates natively. This keeps query latency flat as the catalog grows into the millions of records, because access control is a precomputed attribute rather than a per-request policy evaluation.
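
A minimal sketch of that filter injection follows, assuming an Elasticsearch-style query DSL and a `visibility` field indexed as an exact-match keyword; `visibility_filter` and `secure_query` are hypothetical helper names.

```python
def visibility_filter(tenant_ids=(), internal=False):
    """Build the filter clause for the visibility values a caller may see."""
    allowed = ["public"] + [f"tenant:{t}" for t in sorted(tenant_ids)]
    if internal:
        allowed.append("internal")
    return {"terms": {"visibility": allowed}}


def secure_query(user_query, tenant_ids=(), internal=False):
    """Wrap the caller's query so the visibility filter always applies."""
    return {
        "bool": {
            "must": [user_query],
            # Filter context: not scored, and applied on every request.
            "filter": [visibility_filter(tenant_ids, internal)],
        }
    }
```

An anonymous caller gets only `public`; a tenant token adds that tenant's value. The wrapper is applied server-side, so a caller cannot omit it.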

Resilience and Routing

Upstream metadata providers are the least reliable component in the system — government and academic endpoints routinely time out, return partial pages, or serve stale resumption tokens. The pipeline must degrade gracefully when an upstream misbehaves and recover automatically when it returns, without an operator in the loop.

Three patterns carry that resilience:

Bounded retries with exponential backoff and jitter. Harvest workers retry transient failures (HTTP 429, 503, connection resets) with capped exponential backoff, but treat hard failures (HTTP 400, malformed schema) as terminal and route them to the dead-letter queue immediately rather than burning retries.
Circuit breakers per upstream. Each provider gets its own breaker so that one failing endpoint cannot exhaust the worker pool. After a threshold of consecutive failures the breaker opens, the worker stops dialing that upstream for a cooldown window, and an alert fires. Other providers continue harvesting unaffected.
Idempotent writes. Every record carries a deterministic document ID derived from its source URN plus a content hash, so a re-harvested or replayed record overwrites in place rather than duplicating. This is what makes “retry the whole batch” a safe default.
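
The retry classification, backoff delay, and deterministic ID from the list above can be sketched as follows. This is an illustration under stated assumptions: the "full jitter" variant of backoff, SHA-256 for both hashes, and hypothetical function names; the source does not specify any of these.

```python
import hashlib
import random

TRANSIENT_STATUSES = {429, 503}   # retried; anything else is treated as terminal


def is_retryable(status: int) -> bool:
    """True for transient upstream failures worth retrying."""
    return status in TRANSIENT_STATUSES


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0,
                  rng=random.random) -> float:
    """Full-jitter delay: a random fraction of min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * 2 ** attempt)


def document_id(source_urn: str, payload: bytes) -> str:
    """Deterministic ID from the source URN plus a hash of the content."""
    content_hash = hashlib.sha256(payload).hexdigest()
    return hashlib.sha256(f"{source_urn}\n{content_hash}".encode()).hexdigest()
```

Replaying the same record always yields the same ID, which is what makes a batch retry safe. One consequence of hashing the content into the ID is that an upstream edit to a record yields a different ID, so retiring the superseded document for that URN would be a separate step.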

# Per-upstream circuit breaker guarding a harvest call. Open breakers
# fail fast and shed load instead of queueing doomed requests.
import time

class Breaker:
    def __init__(self, threshold=5, cooldown=60):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

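    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let traffic through again, but leave the failure
            # count one short of the threshold so a single failed probe
            # re-opens the breaker immediately.
            self.opened_at, self.failures = None, self.threshold - 1
            return True
        return False                                   # still open: shed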

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

Validation failures route to a dedicated dead-letter queue carrying a structured error payload (the offending record, the failing rule, and the source endpoint) so that remediation is targeted rather than a full re-harvest. Because the queue preserves the original record, fixing a transform rule and replaying the queue is a routine, non-destructive operation. Upstream outages are contained by the same decoupling: a provider that goes dark simply stops contributing new records while everything already in the catalog stays queryable.
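
The dead-letter payload and the replay step can be sketched as below. The field names mirror the three items the text lists; the `DeadLetter` class, the `replay` function, and the convention that `validate` returns the failing rule name or `None` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DeadLetter:
    record: bytes            # the original payload, preserved untouched
    rule: str                # the validation rule that rejected it
    source_endpoint: str     # where the record was harvested from


def replay(entries: Iterable[DeadLetter],
           validate: Callable[[bytes], Optional[str]]
           ) -> Tuple[List[bytes], List[DeadLetter]]:
    """Re-run validation over parked records after a rule has been fixed."""
    accepted: List[bytes] = []
    still_dead: List[DeadLetter] = []
    for entry in entries:
        failing_rule = validate(entry.record)
        if failing_rule is None:
            accepted.append(entry.record)          # now valid: forward to indexing
        else:
            # Still failing: park it again with the current reason attached.
            still_dead.append(DeadLetter(entry.record, failing_rule,
                                         entry.source_endpoint))
    return accepted, still_dead
```

Because the entries are immutable and the original bytes are kept, a replay can be repeated without losing anything.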

Health probes close the loop. Each service exposes a /healthz liveness endpoint and a /readyz readiness endpoint; readiness gates whether Kubernetes routes work to a pod, so a worker that has lost its broker connection is pulled from rotation rather than silently dropping batches. Horizontal pod autoscalers are keyed to broker queue depth rather than CPU alone, because a backlog of unvalidated records is the earliest signal that the validation tier needs more replicas.
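
To make the queue-depth scaling rule concrete, the sketch below computes a replica count from broker backlog. It is a simplified model of the autoscaler's arithmetic, not the HPA implementation; `target_per_replica` and the replica bounds are hypothetical tuning values.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed so each one owns at most target_per_replica records."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

A backlog of 450 records with a target of 100 per replica asks for five replicas; a backlog that would need more than the ceiling is the case where the ceiling itself needs review.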

Configuration-as-Code and Drift Control

The entire ingestion subsystem is declared as code and reconciled continuously, so that the running cluster always matches a reviewed commit. Harvest schedules, transform rule sets, broker topics, and index templates are templated with Helm and provisioned with Terraform, following the same GitOps discipline established in Syncing GeoNode Environments with Terraform. The catalog store and search index are themselves stateful workloads, so their persistence is governed by the patterns in Kubernetes StatefulSets for PostGIS Databases.

A harvest source is a declarative object, not an imperative script. Adding a new upstream provider is a pull request that adds a record to a manifest, reviewed and merged like any other change:

# harvest-sources.yaml — the single source of truth for every upstream.
# A GitOps reconciler renders these into CronJobs; no manual kubectl.
sources:
  - id: usgs-national-map
    protocol: oai-pmh
    endpoint: https://example-upstream.gov/oai
    metadataPrefix: iso19139         # harvested format
    schedule: "0 */6 * * *"          # every 6 hours
    incremental: true                # resume from last datestamp; page with resumptionToken
    backoff: { maxRetries: 5, baseSeconds: 2, jitter: true }
  - id: regional-csw
    protocol: csw
    endpoint: https://example-region.org/csw
    outputSchema: http://www.isotc211.org/2005/gmd
    schedule: "30 2 * * *"           # nightly off-peak window
    incremental: false
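
How a reconciler might render one manifest entry into a CronJob can be sketched as follows. This is an illustration only: the image name, the `--source-id` argument, and the `render_cronjob` function are hypothetical, and a real renderer would more likely be a Helm template.

```python
def render_cronjob(source: dict,
                   image: str = "registry.example/catalog-harvester:1.0.0") -> dict:
    """Render one harvest-sources entry into a batch/v1 CronJob manifest."""
    schedule = source["schedule"]
    if len(schedule.split()) != 5:
        raise ValueError(f"schedule must have five cron fields: {schedule!r}")
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": f"harvest-{source['id']}", "namespace": "catalog"},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",   # never overlap runs of one source
            "jobTemplate": {"spec": {"template": {
                # The tier label is what places the pod in the egress-only zone.
                "metadata": {"labels": {"tier": "harvest"}},
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "harvester",
                        "image": image,                          # hypothetical image
                        "args": ["--source-id", source["id"]],   # hypothetical flag
                    }],
                },
            }}},
        },
    }
```

Rejecting a malformed schedule at render time surfaces the error in review rather than at the first missed run.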

Drift detection runs on a schedule and on every merge: the GitOps controller diffs live cluster state against the rendered manifests and either auto-corrects or raises an alert when a hand-edit has crept in. A kubectl edit to bump an index replica count, for example, is reverted on the next reconcile unless it is also committed. Index templates and analyzer settings are version-controlled the same way, so a mapping change is reviewable and reversible rather than a one-way live mutation. The principle is identical to the rest of the program: the cluster is expected to converge on the commit without manual intervention, and any divergence is a defect to be surfaced, not a state to be tolerated.
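
The diff at the heart of drift detection can be sketched as a recursive comparison. This is a simplified model, assuming manifests are plain nested dicts and that only fields declared in the commit are compared; `find_drift` is a hypothetical name, not a GitOps controller API.

```python
def find_drift(desired, live, path=""):
    """Return dotted paths where live state differs from the declared state."""
    drift = []
    if isinstance(desired, dict) and isinstance(live, dict):
        for key in desired:
            child = f"{path}.{key}" if path else str(key)
            if key not in live:
                drift.append(child)                  # declared but missing live
            else:
                drift.extend(find_drift(desired[key], live[key], child))
    elif desired != live:
        drift.append(path)                           # value was changed by hand
    return drift
```

Fields that exist only in the live object, such as server-populated status, are ignored, so only divergence from what was committed is reported.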

Operational Troubleshooting

Diagnosis follows the record’s path through the pipeline: confirm it was harvested, confirm it validated, confirm it indexed, confirm it synced. Work the stages in order and the failing boundary reveals itself quickly. The matrix below keys common symptoms to the log path and HTTP signal that confirms the cause.

Symptom	Likely cause	Where to look	Fix
Records missing from search but no errors	Harvester throttled by the upstream, or its breaker is open	`/var/log/catalog/harvest.log`; upstream returns HTTP 429/503	Check the breaker state; widen backoff window; verify upstream SLA
Harvest stalls mid-run, resumes from start	Resumption token expired or not persisted	`harvest.log` shows repeated `resumptionToken` resets	Persist token between pages; shorten harvest window below token TTL
Records land in the dead-letter queue	Schema validation failure	`/var/log/catalog/validate.log`; DLQ payload `rule` field	Inspect the failing XSD/Schematron rule; patch the transform; replay DLQ
Search API returns HTTP 403 for valid users	`visibility` stamp or token scope mismatch	API access log; token `groups` claim vs index filter	Reconcile RBAC group mapping; re-index affected `visibility` field
Indexing throws HTTP 429 from the search store	Bulk write rate exceeds index capacity	`/var/log/catalog/index.log`; search store `rejected` thread-pool counter	Lower bulk batch size; add indexing capacity (data nodes or primary shards, not replicas, which add write load); throttle the broker consumer
Catalog entry points to a relocated dataset	Sync service missed a version event	`/var/log/catalog/sync.log`; compare entry tag to dataset HEAD	Replay the version event; verify the sync hook subscription
Pipeline healthy but queue depth climbing	Validation tier under-provisioned	Broker lag metric; HPA at max replicas	Raise HPA ceiling; profile the CPU-bound parser; shard the topic
Duplicate records after a replay	Non-deterministic document ID	`index.log` shows differing IDs for one source URN	Restore the URN-plus-content-hash ID; idempotent writes dedupe on replay

Two rules keep diagnosis fast. First, every log line carries the record’s source URN and a correlation ID that follows it across all four services, so a single grep reconstructs the full journey of one record. Second, the dead-letter queue is the first place to look for “missing” records — a record that failed validation is not lost, it is parked with the reason attached, and that reason is almost always the real answer.
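
The first rule can be illustrated with structured log lines. The sketch assumes one JSON object per line with `service`, `urn`, `correlation_id`, and `event` fields; those field names and the two helpers are hypothetical, since the source does not define the log format.

```python
import json


def log_line(service: str, urn: str, correlation_id: str, event: str) -> str:
    """Emit one structured log line carrying the URN and correlation ID."""
    return json.dumps({"service": service, "urn": urn,
                       "correlation_id": correlation_id, "event": event},
                      sort_keys=True)


def trace(lines, correlation_id: str):
    """Reconstruct one record's journey from interleaved service logs."""
    journey = []
    for line in lines:
        entry = json.loads(line)
        if entry.get("correlation_id") == correlation_id:
            journey.append((entry["service"], entry["event"]))
    return journey
```

Filtering on the correlation ID rather than the URN separates two harvests of the same record, which matters when a replay is being compared against the original run.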

Operational Maturity Checklist

Treat the following as auditable runbook items; each maps to a concern above and should have a named owner and a verification command.

Harvest — Every upstream is declared in harvest-sources.yaml; incremental harvesting uses persisted resumption tokens; each provider has its own circuit breaker and alert.
Validation — Every record passes XSD structural and Schematron semantic gates before indexing; the XML parser is hardened against XXE and entity expansion; failures carry a structured payload to the dead-letter queue.
Security — Harvest workers are egress-only via NetworkPolicy; internal services use mutual TLS; secrets are injected at runtime and never committed; the search API enforces a visibility filter on every query.
Identity — Write-surface access requires a scoped, signature-verified JWT; service accounts use client-credentials grants; every mutation is attributable to a named principal in the audit log.
Resilience — Writes are idempotent on a deterministic document ID; retries use bounded backoff with jitter; readiness probes gate routing; autoscaling keys on queue depth.
Configuration — Harvest schedules, transform rules, and index templates are version-controlled and rendered by GitOps; drift detection runs on a schedule and reverts uncommitted edits.
Auditing — Every ingestion event, schema transformation, and manual override emits an immutable, correlation-tagged audit record; logs are retained per the records-management schedule and mapped to a governance baseline such as the NIST SP 800-53 Rev. 5 audit and accountability controls.
Standards — External integrations conform to the OGC Catalogue Service for the Web (CSW) specification; custom adapters bridge protocol gaps without forking the internal schema.

By treating metadata ingestion as a stateless, observable, strictly versioned subsystem — harvested behind hardened boundaries, validated before it persists, and reconciled against the datasets it describes — a portal achieves the discoverability and traceability that mission-critical geospatial operations demand.

Related

CSW Catalog Schema Mapping & Validation — normalization and structural/semantic validation gates.
Automated Metadata Ingestion via OAI-PMH — resilient harvest cycles and token-based resumption.
Search Indexing Optimization with Elasticsearch — shard allocation, analyzers, and low-latency spatial queries.
Version Tagging & Sync for Spatial Datasets — lineage tracking and dataset-to-catalog synchronization.
Core Portal Architecture & Security Boundaries and Infrastructure Orchestration & Config Management — the runtime and provisioning surfaces this catalog builds on.

