Automated Metadata Ingestion via OAI-PMH: Production Pipeline Architecture & Scaling Strategies

When metadata ingestion stalls, discovery breaks: federated portals start serving stale bounding boxes, dead download links, and records that no longer match their source datasets. Platform engineers running multi-agency geospatial catalogs feel this first, because a single unattended harvest failure silently desynchronizes thousands of records before anyone notices. This guide treats automated OAI-PMH ingestion as production infrastructure rather than a cron one-liner, and sits inside the broader Metadata Catalog Automation & Ingestion Workflows reference as the inbound edge of every catalog pipeline — the component that decides whether the rest of the stack ever sees a record at all.

The Open Archives Initiative Protocol for Metadata Harvesting is deliberately minimal: six verbs, stateless XML responses, and resumption tokens for paging. That minimalism is exactly why naive implementations fail at scale. Real agency endpoints rate-limit, restart mid-harvest, expire tokens, and emit non-compliant XML, so the burden of correctness shifts entirely onto the harvester. The sections below cover where the harvester sits in the stack, how it isolates and quarantines bad payloads, how harvest configuration becomes version-controlled policy, how transport is authenticated against agency security policy, how the whole pipeline is gated in CI/CD, and how to diagnose the failure modes that actually page on-call.

The production harvester architecture described in this guide persists resumption-token state so harvests survive restarts, validates every payload, and quarantines failures without stalling the batch.

Architectural Placement: Where Ingestion Lives in the Stack

OAI-PMH ingestion is the boundary between systems you do not control (remote agency repositories) and systems you do (the catalog, the search index, the discovery API). Architecturally it belongs in its own service tier — never embedded inside the catalog application — because its failure characteristics are completely different from request-serving workloads. A harvester spends most of its life blocked on slow network I/O, handling partial responses, and replaying state; the catalog spends its life answering low-latency spatial queries. Co-locating them couples a long-running, fault-prone batch process to a latency-sensitive service and guarantees that a hung harvest degrades discovery.

The reference decomposition separates four responsibilities. A scheduler decides when each endpoint is harvested and enforces per-agency concurrency caps. A protocol client performs the raw ListIdentifiers, ListRecords, and GetRecord calls and owns retry, backoff, and resumption-token lifecycle. A transformation and validation layer maps Dublin Core or ISO 19139 payloads to the catalog’s canonical schema and gates them. A persistence layer performs idempotent upserts into the catalog and emits bulk-index events. Each tier scales independently: the protocol client scales with the number of slow endpoints, the transformation layer scales with record volume, and the index writers scale with catalog write throughput.
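The protocol client's paging responsibility can be sketched as follows. This is a minimal illustration rather than the harvester's actual client: `iter_identifiers` and the injected `fetch` callable are hypothetical names, and the standard-library `xml.etree.ElementTree` stands in for a hardened lxml parser so the example has no dependencies. Retry and backoff are assumed to live inside whatever `fetch` wraps.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from typing import Callable, Iterator

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def iter_identifiers(
    fetch: Callable[[dict[str, str]], bytes],  # hypothetical: performs the HTTP GET
    metadata_prefix: str,
    from_ts: str | None = None,
) -> Iterator[str]:
    """Yield record identifiers page by page until no resumption token is returned."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    if from_ts:
        params["from"] = from_ts
    while True:
        root = ET.fromstring(fetch(params))
        error = root.find("oai:error", OAI_NS)
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # an empty delta window is not a failure
            raise RuntimeError(f"oai_error:{error.get('code')}")
        for header in root.iterfind(".//oai:header", OAI_NS):
            identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
            if identifier:
                yield identifier
        token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
        if not token or not token.strip():
            return  # a missing or empty token marks the final page
        # resumptionToken is an exclusive argument: it replaces prefix and from.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.strip()}
```

Keeping `fetch` injectable lets the paging logic be tested against canned XML without touching a live agency endpoint.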

This concern pairs tightly with schema work downstream. Where this guide stops at “deliver a validated record to the persistence layer,” the CSW Catalog Schema Mapping & Validation reference takes over for canonical-schema design, and Search Indexing Optimization with Elasticsearch owns what happens once records hit the index. The implementation-level mechanics of token storage, delta windows, and FIPS-aligned transport for compliance-driven deployments are detailed in Automating OAI-PMH Harvesting for Government Geospatial Portals. Treat that page as the runbook for this section’s architecture.

State is the architectural crux. Because resumption tokens are server-bound, opaque, and short-lived, the harvester must externalize harvest cursors to a durable store so that a pod restart resumes from the last committed timestamp rather than re-harvesting from epoch or, worse, silently dropping a window. Redis is appropriate for in-flight token caching; a relational store (often the same PostgreSQL/PostGIS cluster the catalog already runs — see Kubernetes StatefulSets for PostGIS Databases for the durable-storage pattern) is appropriate for the authoritative per-endpoint cursor.
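A minimal sketch of the authoritative per-endpoint cursor, assuming a single table keyed by source name. The `CursorStore` class and the `harvest_cursor` table are hypothetical, and `sqlite3` stands in for the PostgreSQL store so the example runs anywhere; the upsert statement would need only minor changes for PostgreSQL.

```python
from __future__ import annotations

import sqlite3


class CursorStore:
    """Durable per-endpoint harvest cursor (sqlite3 as a stand-in for PostgreSQL)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS harvest_cursor ("
            " source TEXT PRIMARY KEY, last_ts TEXT NOT NULL)"
        )

    def load(self, source: str) -> str | None:
        """Return the last committed timestamp, or None for a first harvest."""
        row = self.conn.execute(
            "SELECT last_ts FROM harvest_cursor WHERE source = ?", (source,)
        ).fetchone()
        return row[0] if row else None

    def commit(self, source: str, last_ts: str) -> None:
        """Advance the cursor; a replayed older page never moves it backwards.

        Timestamps are assumed to be same-format UTC ISO 8601 strings, which
        compare correctly as plain text.
        """
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO harvest_cursor (source, last_ts) VALUES (?, ?) "
                "ON CONFLICT(source) DO UPDATE SET last_ts = excluded.last_ts "
                "WHERE excluded.last_ts > harvest_cursor.last_ts",
                (source, last_ts),
            )
```

Calling `commit` after every page rather than at batch end bounds the work lost to a restart to a single page.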

Data Isolation and the Quarantine Model

The single most important reliability property of an ingestion pipeline is that one bad record cannot poison a batch. Agency endpoints routinely return unescaped ampersands in identifiers, mixed encoding declarations, ISO 19139 records with proprietary CRS definitions absent from the EPSG registry, and geometries that violate ring orientation. If the pipeline treats the batch as atomic, a single malformed record fails the entire harvest window and blocks every valid record behind it.

The isolation mechanism is a quarantine (dead-letter) queue with per-record fault boundaries. Each record is parsed, validated, and transformed in isolation; on failure it is routed to quarantine with a structured error payload — the failing rule, the source identifier, the raw XML, and the harvest cursor — rather than aborting the batch. Validation proceeds in two stages with different jobs: XSD enforces structural correctness (namespaces, cardinality, datatypes), and Schematron enforces business logic that XSD cannot express, such as a mandatory non-empty temporal extent or a bounding box whose westBoundLongitude is genuinely west of its eastBoundLongitude. Records failing either gate are quarantined, never indexed.

# harvester/validate.py — per-record fault isolation, never batch-atomic
from lxml import etree

# Disable network and entity expansion: OAI-PMH payloads are untrusted input,
# and DTD/entity loading is a direct XXE vector for agency-supplied XML.
SECURE_PARSER = etree.XMLParser(
    resolve_entities=False,   # block external entity expansion (XXE)
    no_network=True,          # never fetch remote DTDs/schemas at parse time
    load_dtd=False,
    huge_tree=False,          # cap tree size; reject billion-laughs payloads
    recover=False,            # structural faults must fail loudly, not be patched
)

def validate_record(raw_xml: bytes, xsd: etree.XMLSchema,
                    schematron: etree.Schematron) -> tuple[bool, str | None]:
    """Return (ok, reason). A False result quarantines ONE record; the batch lives."""
    try:
        doc = etree.fromstring(raw_xml, SECURE_PARSER)
    except etree.XMLSyntaxError as exc:
        return False, f"xml_syntax:{exc.msg}"
    if not xsd.validate(doc):           # stage 1: structural (XSD)
        return False, f"xsd:{xsd.error_log.last_error.message}"
    if not schematron.validate(doc):    # stage 2: business rules (Schematron)
        return False, f"schematron:{schematron.error_log.last_error.message}"
    return True, None

This same record-level isolation is what the schema team relies on downstream; the rule matrices and Schematron templates that back stage two are catalogued in Validating ISO 19115 Metadata Before Ingestion. The harvester’s only contract is: a record either reaches the persistence layer fully valid, or it lands in quarantine with enough context to be replayed after a fix — nothing in between.
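The contract above can be sketched as a batch loop with a per-record fault boundary. `QuarantineEntry` and `process_batch` are hypothetical names, and the validator is injected as a plain callable so the sketch does not depend on lxml; in practice it would wrap a function such as `validate_record`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class QuarantineEntry:
    """Everything needed to replay one failed record after a fix."""
    identifier: str   # source OAI-PMH identifier
    reason: str       # failing rule, e.g. "xsd:..." or "schematron:..."
    raw_xml: bytes    # untouched payload as received
    cursor: str       # harvest cursor of the window the record came from


def process_batch(
    records: Iterable[tuple[str, bytes]],
    validate: Callable[[bytes], tuple[bool, str | None]],
    cursor: str,
) -> tuple[list[tuple[str, bytes]], list[QuarantineEntry]]:
    """Split a batch into accepted records and quarantine entries."""
    accepted: list[tuple[str, bytes]] = []
    quarantined: list[QuarantineEntry] = []
    for identifier, raw_xml in records:
        try:
            ok, reason = validate(raw_xml)
        except Exception as exc:
            # A crashing validator is itself a per-record fault, not a batch fault.
            ok, reason = False, f"validator_crash:{type(exc).__name__}"
        if ok:
            accepted.append((identifier, raw_xml))
        else:
            quarantined.append(
                QuarantineEntry(identifier, reason or "unknown", raw_xml, cursor)
            )
    return accepted, quarantined
```

Only the accepted list moves on to the persistence layer; the quarantine list goes to the dead-letter queue unchanged.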

Idempotency closes the isolation loop. Every upsert is keyed on the source and OAI-PMH identifier and carries a content hash, so a record reharvested after a token replay or a network partition is either recognized as unchanged or updated in place, never duplicated. Without this, retry logic, which is mandatory given how often agency endpoints drop connections, would multiply records on every recovered failure.
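One way to read this is sketched below, under stated assumptions: the row is keyed on source plus OAI-PMH identifier, and a SHA-256 content hash decides whether a write is a no-op or an in-place update. The `catalog_record` table, `init_schema`, and `upsert_record` are hypothetical, and `sqlite3` again stands in for the catalog database.

```python
from __future__ import annotations

import hashlib
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog_record ("
        " source TEXT NOT NULL, identifier TEXT NOT NULL,"
        " content_hash TEXT NOT NULL, raw_xml BLOB NOT NULL,"
        " PRIMARY KEY (source, identifier))"
    )


def upsert_record(conn: sqlite3.Connection, source: str,
                  identifier: str, raw_xml: bytes) -> str:
    """Return 'inserted', 'updated', or 'unchanged'; replays never add rows."""
    digest = hashlib.sha256(raw_xml).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM catalog_record WHERE source = ? AND identifier = ?",
        (source, identifier),
    ).fetchone()
    if row is not None and row[0] == digest:
        return "unchanged"  # identical replay: skip the write entirely
    with conn:
        conn.execute(
            "INSERT INTO catalog_record (source, identifier, content_hash, raw_xml) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(source, identifier) DO UPDATE SET "
            "content_hash = excluded.content_hash, raw_xml = excluded.raw_xml",
            (source, identifier, digest, raw_xml),
        )
    return "inserted" if row is None else "updated"
```

Returning the outcome lets the caller emit index events only for inserted and updated records, so a replayed window does not trigger needless reindexing.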

Policy-as-Code: Harvest Configuration as Version-Controlled Manifests

Harvest behavior must never live in operator memory or hand-edited config on a box. Endpoints, polling cadence, concurrency caps, metadata prefixes, and retry budgets are policy, and policy belongs in a version-controlled repository with branch protection and mandatory review. A declarative manifest per endpoint makes harvest intent auditable and diffable, and lets a reviewer reason about blast radius before a change merges.

# harvest/endpoints/state-gis.yaml — one declarative manifest per agency endpoint.
# Reviewed via pull request; rendered into the scheduler at deploy time.
apiVersion: ingest.geoportal.io/v1
kind: HarvestSource
metadata:
  name: state-gis-clearinghouse
  labels:
    jurisdiction: state
    compliance: fips-moderate
spec:
  baseUrl: https://metadata.state.example.gov/oai
  metadataPrefix: iso19139           # request ISO 19139; reject anything else
  schedule: "0 */6 * * *"            # incremental delta every 6 hours
  delta:
    mode: from-cursor                # ListIdentifiers using the persisted last_ts
    overlapSeconds: 300              # re-scan a 5-min window to absorb clock skew
  concurrency:
    maxRequests: 50                  # hard cap to stay under the agency WAF limit
    respectRetryAfter: true          # honour 429 Retry-After before backing off
  backoff:
    initialMs: 1000
    maxMs: 60000
    jitter: full                     # full jitter prevents thundering-herd retries
  token:
    cacheTtlSeconds: 900             # local resumption-token TTL before cursor reset
  validation:
    xsd: schemas/iso19139.xsd
    schematron: rules/geo-business-rules.sch
    onFailure: quarantine            # never batch-abort on a single bad record
  transport:
    mtls: true                       # client-cert auth where the agency mandates it
    minTlsVersion: "1.3"

Because the manifest is data, a CI job can statically check it before it ever runs: that maxRequests is within the agency’s published fair-use ceiling, that metadataPrefix is one the transformation layer actually maps, and that referenced XSD and Schematron files exist. The provisioning of the compute, secrets, and networking these manifests run on is itself codified — the same configuration-as-code discipline described in Syncing GeoNode Environments with Terraform — so staging and production harvesters are byte-for-byte reproducible and drift is detectable rather than discovered during an incident.
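The three static checks named above can be sketched as a small lint function. This is an illustration, not the behaviour of any real tool: `lint_manifest`, `SUPPORTED_PREFIXES`, and the `ceiling` argument are hypothetical, and the manifest is assumed to have already been parsed into a dictionary.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical: the prefixes the transformation layer is assumed to map.
SUPPORTED_PREFIXES = {"oai_dc", "iso19139"}


def lint_manifest(manifest: dict, ceiling: int, repo_root: Path) -> list[str]:
    """Return a list of human-readable violations; an empty list means pass."""
    errors: list[str] = []
    spec = manifest.get("spec", {})

    max_requests = spec.get("concurrency", {}).get("maxRequests")
    if not isinstance(max_requests, int) or max_requests < 1:
        errors.append("concurrency.maxRequests must be a positive integer")
    elif max_requests > ceiling:
        errors.append(
            f"concurrency.maxRequests {max_requests} exceeds fair-use ceiling {ceiling}"
        )

    prefix = spec.get("metadataPrefix")
    if prefix not in SUPPORTED_PREFIXES:
        errors.append(f"metadataPrefix {prefix!r} has no transformer")

    for key in ("xsd", "schematron"):
        rel = spec.get("validation", {}).get(key)
        if not rel or not (repo_root / rel).is_file():
            errors.append(f"validation.{key} not found: {rel}")
    return errors
```

A CI step would fail the build whenever the returned list is non-empty and print each entry for the reviewer.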

Authentication and the Transport Boundary

Many government and research endpoints sit behind mutual TLS, IP allow-lists, or authenticating proxies, so the harvester is not just an HTTP client — it is a credential-bearing service crossing a trust boundary. The authentication model has two distinct planes: how the harvester proves identity to the agency endpoint outbound, and how it carries scoped credentials inbound to the catalog and state store.

Outbound, where an agency mandates client certificates, the harvester presents a per-source client cert and pins minTlsVersion: "1.3". Credentials are never baked into images or manifests; they are mounted from a secret store at runtime and rotated out of band. The protocol client must also strip and never forward agency-supplied response headers into internal systems, the same hygiene that header-injection-aware gateways enforce for OGC services in Security Boundary Mapping for OGC Services.

# harvester/transport.py — scoped, rotated credentials; no secrets in the image.
import os
import ssl

import httpx

def build_client(source: dict) -> httpx.Client:
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    if source["transport"].get("mtls"):
        # Cert + key are mounted from the secret store, NOT shipped in the image.
        ctx.load_cert_chain(
            certfile=os.environ["OAI_CLIENT_CERT"],
            keyfile=os.environ["OAI_CLIENT_KEY"],
        )
    return httpx.Client(
        verify=ctx,
        timeout=httpx.Timeout(connect=10.0, read=60.0, write=10.0, pool=5.0),
        headers={"User-Agent": "geoportal-harvester/2.x (+ops@geoportal.example)"},
        limits=httpx.Limits(max_connections=source["concurrency"]["maxRequests"]),
    )

Inbound, the harvester authenticates to the catalog with a narrowly scoped service credential — write access to the ingestion staging tables and the quarantine queue, and nothing else. It has no business holding catalog-admin rights. This least-privilege posture is the ingestion-tier application of the role scoping covered in Implementing RBAC for Multi-Tenant GIS Portals: a compromised harvester should be able to fill a quarantine queue, not rewrite the catalog. Connections to the shared PostGIS state store run through a bounded pool so a burst of parallel harvest jobs cannot exhaust backend connections, following the sizing model in Optimizing PostgreSQL/PostGIS Connection Limits.

CI/CD Integration: Gating Harvest Changes Before They Run

Because harvest manifests and transformation rules are code, every change flows through the same pipeline gates as application code. The defining gate is a dry-run harvest against a sandbox endpoint: the pipeline executes the changed configuration end to end against a fixture repository, runs the full XSD and Schematron suite, and emits a configuration diff report so a reviewer sees exactly which endpoints, prefixes, or rules changed before approving. A change that would raise maxRequests above an agency’s ceiling, or reference a metadata prefix with no transformer, fails the build.

# .github/workflows/harvest-gate.yml — no harvest config merges un-tested.
name: harvest-config-gate
on:
  pull_request:
    paths: ["harvest/**", "schemas/**", "rules/**"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint harvest manifests
        run: harvestctl lint harvest/endpoints/   # schema + fair-use ceiling checks
      - name: Dry-run against sandbox repository
        env:
          OAI_SANDBOX_URL: ${{ secrets.OAI_SANDBOX_URL }}   # assumption: sandbox URL held as a repository secret of this name
        run: harvestctl dry-run --source harvest/endpoints/ --limit 200 --no-index
      - name: Validate fixtures through full XSD + Schematron suite
        run: harvestctl validate-fixtures tests/fixtures/ --strict
      - name: Emit config diff report
        run: harvestctl diff --base origin/main --out diff-report.md

On merge, a GitOps controller reconciles the desired manifest set into the running scheduler; an operator never kubectl edits a harvest source by hand. Any divergence between Git and the live scheduler is treated as drift and either auto-reverted or alerted, the same self-healing reconciliation pattern applied across the platform’s infrastructure tier. This is what makes a configuration change to a production harvester a reviewable, revertible event rather than a midnight SSH session.

Multi-Region Catalog Synchronization

Portals serving multiple jurisdictions need geographic redundancy, and replicated catalogs introduce write-conflict risk that ingestion must account for. An active-passive topology with timestamp-based last-write-wins and a single authoritative writer per record is the pragmatic baseline: the harvester writes only to the primary region, and replication carries records to read replicas. Active-active configurations — where more than one region accepts catalog writes — require conflict-free replicated data types or an external coordination layer to reconcile concurrent updates to the same record, and should not be adopted before automated failover drills prove that catalog consistency holds under partition. Bake those drills into the operational runbook; redundancy that has never been exercised under a partition is an assumption, not a guarantee.

Observability for Harvest Pipelines

A harvest pipeline that cannot be observed cannot be operated. Structured logging (JSON, one event per line) must capture, at minimum, the harvest initiation timestamp, source identifier, record counts, the HTTP status distribution, and every resumption-token lifecycle event, all correlated by a per-run request ID. Metrics should expose queue depth, transformation latency, validation-failure rate per source, quarantine accumulation, and index refresh duration; distributed tracing should let an operator follow a single record from the GetRecord call through validation to the index write. For agency deployments, audit logs are cryptographically signed, retained per records-management schedule, and exported to a central SIEM. The protocol and interoperability baselines for all of this remain the Open Archives Initiative Protocol for Metadata Harvesting specification and the OGC Catalogue Services Specification.
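A minimal sketch of the logging requirement, using only the standard library: one JSON object per line, with every event in a run carrying the same run ID and source. The names `JsonLineFormatter`, `RunAdapter`, `configure_json_logging`, and `start_run`, and the `fields=` keyword, are hypothetical conventions for this example.

```python
from __future__ import annotations

import json
import logging
import uuid
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


class RunAdapter(logging.LoggerAdapter):
    """Merge per-run context (run_id, source) with per-event fields."""

    def process(self, msg, kwargs):
        fields = {**self.extra, **kwargs.pop("fields", {})}
        kwargs["extra"] = {"fields": fields}
        return msg, kwargs


def configure_json_logging(stream=None) -> None:
    """Attach one JSON handler to the 'harvester' logger (stderr by default)."""
    logger = logging.getLogger("harvester")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False


def start_run(source: str) -> RunAdapter:
    """Return a logger whose events all share a fresh per-run request ID."""
    context = {"run_id": uuid.uuid4().hex, "source": source}
    return RunAdapter(logging.getLogger("harvester"), context)
```

A harvest job would call `start_run("state-gis")` once and then log events such as `log.info("page_fetched", fields={"records": 200, "http_status": 200})`, which keeps resumption-token and status events for one run joinable on `run_id`.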

Operational Troubleshooting

Most ingestion incidents reduce to a handful of recurring failure modes. The matrix below keys each symptom to its usual cause, the log path or config flag to inspect, and the corrective action.

Symptom	Likely cause	Where to look	Fix
Harvest resumes from epoch after a restart	Cursor held only in memory; durable `last_ts` never committed	State store row for the source; `harvester.log` cursor-commit events	Commit the cursor to the durable store after each page, not at batch end
`noRecordsMatch` on every incremental run	`from` date format or granularity mismatch with the repository	Request log; compare `from` to the endpoint’s `Identify` granularity	Match `granularity` from `Identify`; send UTC `YYYY-MM-DD` or `YYYY-MM-DDThh:mm:ssZ` accordingly
Harvest hangs mid-batch, never completes	Resumption token expired past its TTL while paging a large set	`harvester.log` token age vs `token.cacheTtlSeconds`	Reissue `ListIdentifiers` from the persisted cursor instead of resuming a stale token
Repeated `503` / `429` from one endpoint	Concurrency above the agency WAF/fair-use ceiling	Response status distribution metric; `concurrency.maxRequests`	Lower `maxRequests`; honour `Retry-After`; confirm full-jitter backoff is active
One bad record blocks a whole window	Batch treated as atomic instead of per-record isolation	`validation.onFailure` flag; quarantine queue depth	Set `onFailure: quarantine`; validate and upsert per record
Duplicate records after a network partition	Upsert not keyed on the OAI-PMH `identifier`, or content hash not compared	Catalog rows with duplicate source identifiers	Key idempotent upserts on `identifier` and compare the content hash so replays update in place
`XMLSyntaxError` spikes from one source	Unescaped ampersands or mixed encoding in agency XML	Quarantine payloads; `validation:xml_syntax` reasons	Quarantine and alert the provider; do not enable `recover=True` in production
CRS-related Schematron failures	ISO 19139 records carry proprietary CRS codes absent from EPSG	Quarantine `schematron:` reasons; transformer CRS map	Map legacy codes to EPSG in the transform; quarantine truly unmappable records
State-store connection errors under load	Parallel harvest jobs exhausting the PostGIS connection pool	PostgreSQL `log_connections`; pool saturation metric	Bound the pool; size per the connection-limits guidance
Live scheduler disagrees with Git	GitOps drift; a harvest source edited out of band	Argo CD / Flux sync status; config diff report	Enforce auto-sync with self-heal; treat manual edits as drift
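For the `503` / `429` row in the matrix above, the backoff behaviour can be sketched as exponential backoff with full jitter, treating `Retry-After` as a floor. The function `next_delay_ms` is hypothetical; its defaults mirror the `initialMs` and `maxMs` values from the manifest, and it handles only the delta-seconds form of `Retry-After`, not the HTTP-date form.

```python
from __future__ import annotations

import random


def next_delay_ms(
    attempt: int,
    initial_ms: int = 1000,
    max_ms: int = 60000,
    retry_after_s: float | None = None,
    rng: random.Random | None = None,
) -> int:
    """Delay before retry number `attempt` (0-based), in milliseconds."""
    if rng is None:
        rng = random.Random()
    # Exponential ceiling, clamped to the configured maximum.
    ceiling = min(max_ms, initial_ms * (2 ** attempt))
    # Full jitter: draw uniformly from [0, ceiling] so that many clients
    # retrying at once do not synchronize on the same schedule.
    delay = int(rng.uniform(0, ceiling))
    if retry_after_s is not None:
        # Never retry sooner than the server asked.
        delay = max(delay, int(retry_after_s * 1000))
    return delay
```

Passing a seeded `random.Random` makes the schedule reproducible in tests while production callers use the default.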

Worked diagnostic walkthroughs for the token-lifecycle and XML-recovery cases above — including the parser hardening and cursor-reset logic — are expanded in Automating OAI-PMH Harvesting for Government Geospatial Portals.
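As a small companion to the cursor-reset and `noRecordsMatch` cases, the sketch below derives the `from` argument from a persisted cursor, an overlap window, and the granularity the repository reports in `Identify`. The function name `from_argument` is hypothetical, and the cursor is assumed to be stored as a UTC timestamp at seconds precision.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

# The two granularity values defined by OAI-PMH 2.0.
DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def from_argument(last_ts: str, granularity: str, overlap_seconds: int = 300) -> str:
    """Format the `from` date for a reissued ListIdentifiers request."""
    cursor = datetime.strptime(last_ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    # Step back by the overlap window to absorb clock skew between systems.
    start = cursor - timedelta(seconds=overlap_seconds)
    if granularity == SECONDS:
        return start.strftime("%Y-%m-%dT%H:%M:%SZ")
    if granularity == DAY:
        # Day-granularity repositories reject timestamps, so truncate to the date.
        return start.strftime("%Y-%m-%d")
    raise ValueError(f"unsupported granularity: {granularity!r}")
```

Truncating to the date re-scans up to a full day of records on day-granularity endpoints, which is safe only because upserts are idempotent.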

Automated OAI-PMH ingestion turns a deliberately minimal harvesting protocol into resilient catalog infrastructure. Externalized harvest state, per-record quarantine isolation, version-controlled harvest policy, scoped and rotated transport credentials, and pipeline-gated configuration are what let a platform team keep discovery accurate across distributed agency endpoints — and, just as importantly, prove it was accurate when an auditor asks.

Related

Metadata Catalog Automation & Ingestion Workflows: the parent reference this ingestion edge feeds into.
Automating OAI-PMH Harvesting for Government Geospatial Portals: the implementation-level runbook for token storage, delta windows, and parser hardening.
CSW Catalog Schema Mapping & Validation: canonical-schema design and the validation gates this harvester hands records to.
Search Indexing Optimization with Elasticsearch: what happens to validated records once they reach the index.
Version Tagging & Sync for Spatial Datasets: reconciling harvested metadata versions against the underlying dataset lineage.
