Automating OAI-PMH Harvesting for Government Geospatial Portals

A step-by-step procedure for building a fault-tolerant, incremental OAI-PMH harvester that synchronizes metadata from distributed government agency endpoints into an open-source geospatial catalog without losing state across restarts.

This guide is the hands-on companion to Automated Metadata Ingestion via OAI-PMH and sits inside the wider Metadata Catalog Automation & Ingestion Workflows practice; read the parent page first if you need the pipeline-level rationale for why harvest state must be persisted and why validation runs as a separate gate. Here the focus is narrow and operational: how to drive the verb sequence safely, how to survive resumption-token expiry and rate limiting, and how to verify that a delta harvest actually advanced the cursor before any records reach the catalog.

Prerequisites

Before running a harvest against a production agency endpoint, confirm the following. Each is a common root cause of a stalled or duplicated harvest.

Python 3.11+ with requests (or httpx) and lxml >= 5.0 for defensive XML parsing.
A reachable repository base URL and a successful Identify response — confirm the granularity (YYYY-MM-DD vs YYYY-MM-DDThh:mm:ssZ) and earliestDatestamp before choosing date windows.
The agency’s supported metadataPrefix values from a ListMetadataFormats call — government portals typically expose oai_dc (Dublin Core) and iso19139.
Durable state storage for the harvest cursor: a small SQLite file, Redis, or a PostGIS-backed table as described in Kubernetes StatefulSets for PostGIS Databases. Never hold cursor state in memory only.
Outbound network policy that permits TLS 1.3 to the agency host, plus any proxy authentication; respect a published fair-use ceiling (default to a maximum of 50 concurrent requests per endpoint).
Write access to a quarantine path (/var/lib/harvester/quarantine/) and a structured log sink (/var/log/harvester/harvest.jsonl).
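
As a rough illustration of the granularity check in the list above, the sketch below reads granularity and earliestDatestamp from an Identify response and renders a cursor in the matching form. The function names read_identify and format_cursor and the sample XML are hypothetical, and the standard-library ElementTree is used only to keep the example dependency-free; the harvester itself parses with lxml.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def read_identify(xml_text: str) -> dict:
    """Extract granularity and earliestDatestamp from an Identify response."""
    root = ET.fromstring(xml_text)
    return {
        "granularity": root.findtext(f".//{OAI}granularity"),
        "earliest": root.findtext(f".//{OAI}earliestDatestamp"),
    }

def format_cursor(ts: datetime, granularity: str) -> str:
    """Render a timezone-aware timestamp in the form the repository advertises."""
    ts = ts.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return ts.strftime("%Y-%m-%d")
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")

# Hypothetical Identify payload, trimmed to the two fields that matter here.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <earliestDatestamp>2001-01-01</earliestDatestamp>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""

info = read_identify(SAMPLE)
when = datetime(2026, 6, 1, 12, 30, tzinfo=timezone.utc)
cursor = format_cursor(when, info["granularity"])
```

Formatting the cursor from the advertised granularity, rather than hard-coding one form, avoids the badArgument responses that a mismatched date window produces.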

The sequence below traces one incremental harvest cycle — reading saved state, listing changed identifiers, fetching records, and persisting the new cursor. On token expiry the harvester reissues ListIdentifiers from the cursor rather than resuming a stale token.

Step-by-step implementation

1. Build a resilient HTTP session

Decouple network I/O from transformation by isolating all transport concerns in one session object. Persistent connections with keep-alive prevent socket exhaustion during high-frequency polling, and a urllib3 retry policy honours Retry-After on 429/503 responses before a circuit breaker trips.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(user_agent: str) -> requests.Session:
    """HTTP session with bounded retries, exponential backoff, and pooling."""
    retry = Retry(
        total=5,
        connect=3,
        backoff_factor=2.0,            # sleeps of roughly 0s, 4s, 8s, 16s, 32s (no delay before the first retry)
        status_forcelist=(429, 500, 502, 503, 504),
        respect_retry_after_header=True,
        allowed_methods=frozenset({"GET"}),
    )
    adapter = HTTPAdapter(max_retries=retry, pool_connections=10, pool_maxsize=50)
    session = requests.Session()
    session.mount("https://", adapter)
    session.headers.update({
        "User-Agent": user_agent,       # identify the harvester to agency ops
        "Connection": "keep-alive",
    })
    return session
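
The session above handles retries but does not itself include the circuit breaker the paragraph mentions. A minimal sketch of one follows, assuming a consecutive-failure threshold and a fixed cooldown; the class name, the default values, and the injectable clock are illustrative choices, not part of requests or urllib3.

```python
import time

class CircuitBreaker:
    """Stop calling an endpoint after repeated failures, then probe after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 300.0, clock=time.monotonic):
        self.threshold = threshold      # consecutive failures before the breaker opens
        self.cooldown_s = cooldown_s    # assumed pause before a single probe is allowed
        self.clock = clock              # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Half-open: permit one probe; one more failure re-opens immediately.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

A harvest loop would call allow() before each request and record_success() or record_failure() afterwards, so that a failing endpoint is left alone once the urllib3 retries are exhausted.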

2. Persist incremental harvest state

Production harvesting is strictly incremental. Track the last successful datestamp, the in-flight resumption token, and a per-endpoint error counter so a restart resumes from the cursor instead of re-harvesting the full catalog.

import sqlite3

class HarvestState:
    """Durable cursor: last datestamp, live token, consecutive error count."""
    def __init__(self, db_path: str, endpoint: str):
        self.db = sqlite3.connect(db_path)
        self.endpoint = endpoint
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS state ("
            "endpoint TEXT PRIMARY KEY, last_ts TEXT, token TEXT, errors INTEGER DEFAULT 0)"
        )
        self.db.commit()

    def read(self) -> dict:
        row = self.db.execute(
            "SELECT last_ts, token, errors FROM state WHERE endpoint = ?",
            (self.endpoint,),
        ).fetchone()
        return {"last_ts": row[0], "token": row[1], "errors": row[2]} if row else {}

    def commit(self, last_ts: str, token: str | None, errors: int = 0) -> None:
        self.db.execute(
            "INSERT INTO state(endpoint, last_ts, token, errors) VALUES(?,?,?,?) "
            "ON CONFLICT(endpoint) DO UPDATE SET last_ts=?, token=?, errors=?",
            (self.endpoint, last_ts, token, errors, last_ts, token, errors),
        )
        self.db.commit()

3. Drive the verb sequence with resumption-token paging

Use ListIdentifiers for delta detection (it is far cheaper than ListRecords), then issue targeted GetRecord calls. The token returned by the server is stateful and server-bound: page through it until it is absent, but treat any token error as a signal to restart from the saved from timestamp.

from lxml import etree

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

class OaiError(Exception):
    """Protocol-level failure reported in an OAI-PMH <error> element."""
    def __init__(self, code, message=None):
        super().__init__(f"{code}: {message or ''}")
        self.code = code

def list_identifiers(session, base_url, parser, prefix, from_ts, token=None):
    """Yield (identifier, datestamp) per live header, then ("__TOKEN__", next_token).

    The final sentinel tuple carries the resumption token for the next page,
    or None when the list is complete. A noRecordsMatch response yields nothing.
    """
    params = {"verb": "ListIdentifiers"}
    if token:
        params["resumptionToken"] = token        # token alone, no other args
    else:
        params.update({"metadataPrefix": prefix, "from": from_ts})

    resp = session.get(base_url, params=params, timeout=30)
    resp.raise_for_status()
    root = etree.fromstring(resp.content, parser)

    error = root.find("oai:error", OAI_NS)
    if error is not None:
        code = error.get("code")
        if code == "noRecordsMatch":
            return                                  # nothing new: clean exit
        raise OaiError(code, error.text)

    for hdr in root.iterfind(".//oai:header", OAI_NS):
        if hdr.get("status") == "deleted":
            continue                                # handle tombstones separately
        yield (
            hdr.findtext("oai:identifier", namespaces=OAI_NS),
            hdr.findtext("oai:datestamp", namespaces=OAI_NS),
        )

    # An empty <resumptionToken/> marks the last page, so normalise "" to None.
    next_token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    yield ("__TOKEN__", next_token or None)

When a token expires mid-cycle, catch the badResumptionToken error and reissue ListIdentifiers with the original from cursor. This keeps delta application idempotent and prevents the catalog desynchronization that follows a naive token resume across a network partition.
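
The restart behaviour described above might look like the following sketch. It assumes a hypothetical fetch_page(from_ts, token) callable that returns a dict of identifier to datestamp plus the next token, and raises an error carrying the OAI code; the OaiError class here is a stand-in defined only so the example runs on its own.

```python
class OaiError(Exception):
    """Stand-in for a protocol error that carries the OAI error code."""
    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code

def harvest_identifiers(fetch_page, from_ts, max_restarts=2):
    """Page through one delta window, restarting from from_ts if the token goes stale."""
    seen = {}        # keyed on identifier, so replayed pages overwrite instead of duplicating
    restarts = 0
    while True:
        token = None                 # every attempt starts again from the saved cursor
        try:
            while True:
                headers, token = fetch_page(from_ts, token)
                seen.update(headers)
                if not token:
                    return seen      # no token left: the list is complete
        except OaiError as exc:
            if exc.code != "badResumptionToken" or restarts >= max_restarts:
                raise                # other codes, or too many restarts, propagate
            restarts += 1

# Simulated endpoint: the second request fails with an expired token.
calls = {"n": 0}
def fake_fetch(from_ts, token):
    calls["n"] += 1
    if calls["n"] == 2:
        raise OaiError("badResumptionToken", "token expired")
    if token is None:
        return {"oai:a": "2026-06-01"}, "t1"
    return {"oai:b": "2026-06-02"}, None

result = harvest_identifiers(fake_fetch, "2026-06-01")
```

Bounding the number of restarts keeps an endpoint that expires every token from trapping the worker in a loop.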

4. Parse defensively and map error codes

Government endpoints routinely emit malformed XML, unescaped ampersands in identifier fields, or mixed encoding declarations. Configure one hardened parser and reuse it: recover=True isolates recoverable faults, while no_network and resolve_entities=False close the XML External Entity (XXE) vector. ISO 19139 records carrying proprietary CRS definitions must have unrecognized <gco:CharacterString> blocks stripped and legacy codes mapped to EPSG identifiers before persistence — the same normalization documented in Validating ISO 19115 Metadata Before Ingestion.

import lxml.etree as etree

# One reusable, XXE-safe parser for every response in the harvest loop.
parser = etree.XMLParser(
    recover=True,          # keep parsing past recoverable well-formedness errors
    no_network=True,       # block network access when resolving external DTDs and entities
    resolve_entities=False,
    huge_tree=False,       # bound memory on hostile or oversized payloads
)

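Map protocol error codes before deciding whether to retry. badVerb and badArgument mean the request itself is malformed, most often a date window that does not match the repository granularity, and an unsupported metadataPrefix is reported as cannotDisseminateFormat. None of these is a transient fault, so they must not be retried blindly. noRecordsMatch is not a failure at all: it means the window contained no changes.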
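
One way to encode that policy is a small lookup from error code to action, sketched below. The action labels and the function name are hypothetical. The codes come from the OAI-PMH 2.0 specification, where cannotDisseminateFormat is the code returned for an unsupported metadataPrefix and badResumptionToken marks an expired or invalid token.

```python
# Action to take for each protocol-level error code.
OAI_ERROR_ACTIONS = {
    "noRecordsMatch": "empty",              # nothing changed in the window
    "badResumptionToken": "restart",        # reissue from the saved cursor
    "badVerb": "fatal",                     # request is malformed
    "badArgument": "fatal",                 # bad or mismatched arguments
    "cannotDisseminateFormat": "fatal",     # metadataPrefix not supported
}

def classify_oai_error(code: str) -> str:
    """Map an OAI error code to 'empty', 'restart', or 'fatal'."""
    # Unknown codes are treated as fatal so they surface instead of looping.
    return OAI_ERROR_ACTIONS.get(code, "fatal")
```

Only HTTP-level faults such as 429 or 503 go through the retry policy on the session; the codes above arrive in a successful response and need their own handling.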

5. Schedule, shard, and quarantine

Run the harvester as a scheduled worker rather than a cron one-shot so backoff and circuit-breaker state survive between cycles. For high-volume portals, shard harvest jobs by metadataPrefix or date range and fan them out across workers; the orchestration baseline for that lives in Automated Metadata Ingestion via OAI-PMH. Quarantine any endpoint whose consecutive error counter crosses a threshold, and route failed records to a dead-letter path instead of halting the batch.

# harvest-schedule.yaml — declarative job definition consumed by the scheduler
harvest:
  endpoint: "https://catalog.agency.gov/oai"
  metadata_prefix: "iso19139"
  interval_minutes: 60
  max_concurrency: 50            # fair-use ceiling per agency endpoint
  token_ttl_minutes: 15          # refresh window before treating a token as stale
  quarantine_after_errors: 5     # consecutive failures before endpoint is paused
  state_backend: "sqlite:///var/lib/harvester/state.db"
  quarantine_path: "/var/lib/harvester/quarantine/"
  log_sink: "/var/log/harvester/harvest.jsonl"
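
The date-range sharding mentioned in this step could be sketched as below, assuming day-level granularity. The helper name date_windows is hypothetical. Each pair is meant to be passed as from and until, which OAI-PMH treats as inclusive bounds, so consecutive windows do not overlap.

```python
from datetime import date, timedelta

def date_windows(start: date, end: date, days: int):
    """Yield inclusive (from, until) ISO date pairs that cover start..end."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        stop = min(current + timedelta(days=days - 1), end)
        yield current.isoformat(), stop.isoformat()
        current = stop + timedelta(days=1)   # next window starts the following day

windows = list(date_windows(date(2026, 6, 1), date(2026, 6, 10), 4))
```

Each window can then be handed to a separate worker, with the per-endpoint concurrency ceiling still enforced across all of them.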

Records that pass parsing flow into schema mapping and indexing; that downstream contract is defined in CSW Catalog Schema Mapping & Validation and the search layer in Search Indexing Optimization with Elasticsearch.

Verification

Confirm each layer before trusting an automated cycle. Run these against a staging endpoint first.

# 1. The endpoint answers Identify and exposes the expected granularity
curl -s "https://catalog.agency.gov/oai?verb=Identify" \
  | xmllint --xpath '//*[local-name()="granularity"]/text()' -

# 2. The metadata prefix you configured is actually supported
curl -s "https://catalog.agency.gov/oai?verb=ListMetadataFormats" \
  | xmllint --xpath '//*[local-name()="metadataPrefix"]/text()' -

# 3. A bounded delta probe returns identifiers (or a clean noRecordsMatch)
curl -s "https://catalog.agency.gov/oai?verb=ListIdentifiers&metadataPrefix=iso19139&from=2026-06-01" \
  | grep -o "<identifier>" | wc -l   # count occurrences; grep -c counts lines and under-reports single-line XML

# 4. The cursor advanced after a harvest run — last_ts must move forward
sqlite3 /var/lib/harvester/state.db \
  "SELECT endpoint, last_ts, token, errors FROM state;"

# 5. No records landed in quarantine, and the log shows zero fatal parses
ls -1 /var/lib/harvester/quarantine/ | wc -l
grep -c '"level":"error"' /var/log/harvester/harvest.jsonl

A monotonically advancing last_ts, an empty quarantine directory, and a zero fatal-error count are consistent with an idempotent harvest and sound cursor logic. Wire checks 3 and 4 into a pipeline gate so that a harvest is promoted only when the delta probe succeeds and the cursor has advanced.
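
A pipeline gate along these lines could be as small as the sketch below. The function name and its inputs are hypothetical: the two cursor values would come from the state store before and after the run, and the counts from the quarantine directory and the log. It also assumes that an unchanged cursor is acceptable, since a run that finds no new records legitimately leaves it in place.

```python
def gate_failures(before_ts, after_ts, quarantined: int, error_lines: int) -> list[str]:
    """Return the reasons a harvest run should not be promoted (empty list means pass)."""
    failures = []
    if after_ts is None:
        failures.append("cursor was never written")
    elif before_ts is not None and after_ts < before_ts:
        # Same-granularity UTC strings compare chronologically as plain text.
        failures.append("cursor moved backwards")
    if quarantined > 0:
        failures.append(f"{quarantined} record(s) in quarantine")
    if error_lines > 0:
        failures.append(f"{error_lines} error line(s) in the log")
    return failures

problems = gate_failures("2026-06-01", "2026-06-02", quarantined=0, error_lines=0)
```

Returning the reasons rather than a bare boolean lets the pipeline log exactly which check blocked the promotion.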

Troubleshooting matrix

Symptom	Likely cause	Fix
Harvest re-fetches the full catalog every run	Cursor never persisted, or `last_ts` written before records committed	Commit `last_ts` only after successful ingestion; verify the row in the state store
`badResumptionToken` error mid-cycle	Token TTL exceeded or the harvester restarted between pages	Catch the code, discard the token, reissue `ListIdentifiers` from the saved `from` cursor
Repeated `503` / `429` then a stalled job	Exceeding the agency fair-use ceiling or ignoring `Retry-After`	Lower `max_concurrency`; confirm `respect_retry_after_header=True`; quarantine after N failures
`badArgument` on every request	`from`/`until` format does not match the endpoint’s `granularity`	Read granularity from `Identify`; emit `YYYY-MM-DD` or full UTC `YYYY-MM-DDThh:mm:ssZ` accordingly
Parser raises `XMLSyntaxError` and the batch halts	Unescaped `&` or mixed encoding in agency XML, with a strict parser	Use the shared `recover=True` parser; route the offending record to quarantine, do not stop
Records ingest but geometries misalign	Legacy or proprietary CRS codes lacking EPSG mappings	Normalize CRS before persistence per the ISO 19115 validation guide; reject unmapped codes
`noRecordsMatch` when changes are expected	`from` is ahead of the server clock, or `until` falls before `earliestDatestamp`	Normalize to UTC; clamp the window to `earliestDatestamp`; allow a small clock-skew margin
Duplicate catalog entries after a partition	Non-idempotent insert instead of upsert on the identifier	Upsert keyed on the OAI `identifier`; treat `status="deleted"` headers as tombstones
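
The upsert-and-tombstone fix in the last row might be sketched as follows, using SQLite for illustration. The records table, its columns, and the apply_header helper are hypothetical, and deleting the row is only one possible way to honour a tombstone.

```python
import sqlite3

def apply_header(db, identifier, datestamp, deleted=False, payload=None):
    """Apply one harvested header idempotently, keyed on the OAI identifier."""
    if deleted:
        # Tombstone: remove the record instead of inserting a placeholder.
        db.execute("DELETE FROM records WHERE identifier = ?", (identifier,))
    else:
        db.execute(
            "INSERT INTO records(identifier, datestamp, payload) VALUES(?,?,?) "
            "ON CONFLICT(identifier) DO UPDATE SET "
            "datestamp = excluded.datestamp, payload = excluded.payload",
            (identifier, datestamp, payload),
        )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE records (identifier TEXT PRIMARY KEY, datestamp TEXT, payload TEXT)"
)
apply_header(db, "oai:agency:1", "2026-06-01", payload="<v1/>")
apply_header(db, "oai:agency:1", "2026-06-02", payload="<v2/>")   # replay updates in place
row_count = db.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Because the identifier is the primary key, replaying the same page after a partition rewrites the existing row rather than adding a second one.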

For strict compliance with verb sequencing and error-response structure, the official OAI-PMH 2.0 Specification is authoritative, and the lxml parsing documentation covers production-safe parser tuning and memory limits.

Related

Automated Metadata Ingestion via OAI-PMH — the pipeline architecture, token storage backends, and scaling model this procedure plugs into.
Validating ISO 19115 Metadata Before Ingestion — the validation gate harvested records pass through before indexing.
CSW Catalog Schema Mapping & Validation — schema-aware transformation of harvested Dublin Core and ISO 19139 payloads.
Search Indexing Optimization with Elasticsearch — where validated harvest output becomes discoverable.
