CSW Catalog Schema Mapping & Validation: An Operational Guide

When a Catalog Service for the Web (CSW) endpoint ingests metadata from many agencies without a deterministic mapping-and-validation gate, the failure is silent: malformed gmd:MD_Metadata records index without error, coordinate reference systems drift, and GetRecords responses begin returning records that no client can parse. The people who feel this are the GIS administrators fielding “why can’t I find this layer” tickets and the platform engineers forced into full reindexes when a single bad batch corrupts a field mapping. Schema mapping and validation is the control that stops heterogeneous metadata from reaching the index in the first place. This topic sits inside the broader Metadata Catalog Automation & Ingestion Workflows reference, alongside the harvest and sync mechanisms that feed it; everything here assumes records arrive continuously and must be normalized, gated, and routed without manual triage.

The mapping-and-validation pipeline below normalizes heterogeneous inputs to a single internal schema, then gates them through structural and semantic checks before indexing.

Architectural Placement: Where Mapping and Validation Live in the Stack

Mapping and validation belong on the ingestion edge, strictly upstream of any write to the primary catalog database or search index. The harvest layer — whether it pulls via OAI-PMH, CSW Harvest, or a file drop — should hand raw payloads to a transformation service and never to the index directly. That ordering is what makes the catalog auditable: every record that reaches the index has provably passed both a structural and a semantic gate, and every record that did not is recoverable from quarantine with a structured reason attached.

CSW implementations rarely consume metadata in a single canonical format. Government datasets, scientific repositories, and municipal GIS layers typically arrive as ISO 19115/19139, Dublin Core, FGDC, or custom JSON-LD profiles. A production-grade mapping layer must decouple raw ingestion from normalization so that adding a new source profile is a configuration change, not a code rewrite. The recommended pattern employs a stateless transformation service that applies versioned XSLT or schema-aware Python transformers to map incoming payloads to one unified internal representation. Statelessness matters operationally: it lets transformation workers scale horizontally against queue depth and be killed and rescheduled without losing in-flight records. Where the records originate from external harvesters, automated metadata ingestion via OAI-PMH supplies the standardized pull and delta mechanism that feeds this transform stage: datestamp-based selective harvesting gives incremental pulls and deterministic reconciliation against the last successful harvest, while resumption tokens page through large result sets.

Mapping rules themselves should be codified in the same repositories that hold the catalog deployment manifests. By treating transformation logic as configuration rather than as embedded application code, platform teams enforce schema evolution through pull requests, run automated diff tests against a fixture corpus, and roll back an incompatible mapping without redeploying the service.

Data Isolation: The Quarantine and Namespace Model

The security and integrity model for an ingestion gate is built on two isolation mechanisms: namespace isolation during parsing, and quarantine isolation for anything that fails.

Namespace isolation prevents the single most common silent corruption in CSW pipelines: XPath expressions matching the wrong element, or nothing at all, because a record declares gmd, gco, or csw prefixes differently from the validator’s assumptions. Every transform and every Schematron rule must bind namespaces explicitly and resolve elements by namespace URI, never by raw prefix string. A record that uses iso where the canonical profile uses gmd is still valid XML; under URI binding both prefixes resolve to the same elements, whereas a match on the literal prefix finds nothing and yields an empty mapped field.
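As a minimal sketch of URI-bound lookup, the snippet below parses the same record twice, once with the conventional gmd/gco prefixes and once with iso/c, using only the Python standard library. The sample records and the `file_identifier` helper are illustrative assumptions, not part of any catalog codebase.

```python
import xml.etree.ElementTree as ET
from typing import Optional

GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"

# The same record written with two different prefix choices (illustrative data).
RECORD_GMD = f"""<gmd:MD_Metadata xmlns:gmd="{GMD}" xmlns:gco="{GCO}">
  <gmd:fileIdentifier><gco:CharacterString>rec-001</gco:CharacterString></gmd:fileIdentifier>
</gmd:MD_Metadata>"""

RECORD_ISO = f"""<iso:MD_Metadata xmlns:iso="{GMD}" xmlns:c="{GCO}">
  <iso:fileIdentifier><c:CharacterString>rec-001</c:CharacterString></iso:fileIdentifier>
</iso:MD_Metadata>"""

# Prefixes in this map are local to the query; only the URIs have to match the record.
NS = {"gmd": GMD, "gco": GCO}


def file_identifier(xml_text: str) -> Optional[str]:
    """Return the record's fileIdentifier, resolved by namespace URI."""
    root = ET.fromstring(xml_text)
    node = root.find("gmd:fileIdentifier/gco:CharacterString", NS)
    return node.text if node is not None else None


def naive_has_identifier(xml_text: str) -> bool:
    """Anti-pattern for contrast: matches the literal prefix string."""
    return "<gmd:fileIdentifier>" in xml_text
```

Both records yield the same identifier through `file_identifier`, while `naive_has_identifier` misses the iso-prefixed record entirely, which is the empty-field failure described above.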

Quarantine isolation is what keeps a bad batch from contaminating the index. A failed record is written to a dead-letter queue, never to the catalog, and never silently dropped. The quarantine payload carries the original record, the failing gate (XSD vs. Schematron vs. business rule), the precise validation error, and the source identity — enough for automated alerting and a self-healing retry once the upstream profile or the mapping is corrected. This is the same isolation discipline applied to the search tier; the field-type and analyzer guarantees described in search indexing optimization with Elasticsearch only hold because nothing reaches the index that has not already been normalized and validated here.
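One possible shape for the quarantine payload is sketched below. The field names (`source_id`, `gate`, `error`, `original_b64`, `quarantined_at`) and the base64 encoding of the original record are assumptions chosen for illustration; the text above only requires that the four pieces of information travel together.

```python
import base64
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QuarantineRecord:
    source_id: str        # identity of the harvest source
    gate: str             # e.g. "parse", "xsd", "schematron", "business_rule"
    error: str            # precise validation error
    original_b64: str     # original record, base64-encoded so any bytes survive JSON
    quarantined_at: str   # UTC timestamp, ISO 8601


def quarantine(raw: bytes, source_id: str, gate: str, error: str) -> str:
    """Build the JSON message that would be published to the dead-letter queue."""
    record = QuarantineRecord(
        source_id=source_id,
        gate=gate,
        error=error,
        original_b64=base64.b64encode(raw).decode("ascii"),
        quarantined_at=datetime.now(timezone.utc).isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the original bytes intact in the payload is what makes a later retry possible without going back to the upstream source.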

Schema mapping without strict validation introduces corruption that compounds at scale, so the gate enforces a deliberate ordering: structural validation against XSD first (cheap, rejects malformed XML early), semantic validation via Schematron second, and agency-specific business rules last (mandatory CRS declarations, licensing fields, spatial-extent bounds). For the full rule matrix and reusable Schematron templates, see validating ISO 19115 metadata before ingestion, which expands each of these stages into concrete, copy-ready checks.
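The ordering can be sketched as a short-circuiting chain: each gate either passes or returns an error, and the first failure stops the chain so the more expensive stages never run. The `Gate` type and `run_gates` function are hypothetical names; real gates would wrap the XSD, Schematron, and business-rule checks.

```python
from typing import Callable, List, Optional, Tuple

# A gate takes the raw record and returns an error message, or None on success.
Gate = Callable[[bytes], Optional[str]]


def run_gates(record: bytes, gates: List[Tuple[str, Gate]]) -> dict:
    """Run gates in the given order and stop at the first failure."""
    for name, gate in gates:
        error = gate(record)
        if error is not None:
            # Later (more expensive) gates are skipped for this record.
            return {"status": "FAIL", "gate": name, "error": error}
    return {"status": "PASS"}
```

Passing the gates as an ordered list of (name, callable) pairs keeps the cheap-first ordering visible at the call site, and the failing gate's name lands directly in the verdict that routes the record to quarantine.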

Mapping and Validation as Declarative Config

Treat both the field mapping and the validation gates as version-controlled configuration. The mapping registry below maps each external profile to internal canonical fields and pins the validation artifacts it must pass, so onboarding a new provider is a reviewed change rather than a code edit.

# ingestion/mappings.yaml — single source of truth for profile -> canonical mapping.
# Versioned alongside the catalog deployment; every change is a reviewed pull request.
apiVersion: catalog.ingest/v1
profiles:
  iso-19115:
    matcher:
      namespace: "http://www.isotc211.org/2005/gmd"   # bind by URI, never by prefix
      root: "MD_Metadata"
    transform: transforms/iso19115_to_canonical.xsl    # pinned XSLT, semver-tagged
    validate:
      xsd: schemas/gmd/gmd.xsd                          # structural gate (stage 1)
      schematron: rules/iso19115_business.sch           # semantic gate (stage 2)
    business_rules:
      require_crs: true                                 # reject records with no CRS
      require_extent: true                              # reject records with no bbox
      allowed_licenses: [CC-BY-4.0, OGL-3.0, CC0-1.0]
  dublin-core:
    matcher:
      namespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
      root: "dc"
    transform: transforms/dc_to_canonical.xsl
    validate:
      xsd: schemas/oai_dc/oai_dc.xsd
      schematron: rules/dc_minimal.sch
    business_rules:
      require_crs: false                                # DC has no CRS; default applied
      default_crs: "EPSG:4326"

routing:
  on_pass: kafka://catalog.ingest.normalized            # validated records only
  on_fail: kafka://catalog.ingest.dead-letter           # quarantine with error payload
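
To show how the `matcher` entries in the registry above might be used, here is a small sketch that selects a profile from a record's root element. The matcher values are copied into a plain dict so the example needs no YAML parser, and `match_profile` is a hypothetical helper rather than an existing API.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Mirrors the `matcher` blocks of the registry (inlined here for self-containment).
PROFILES = {
    "iso-19115": {"namespace": "http://www.isotc211.org/2005/gmd", "root": "MD_Metadata"},
    "dublin-core": {"namespace": "http://www.openarchives.org/OAI/2.0/oai_dc/", "root": "dc"},
}


def match_profile(xml_bytes: bytes) -> Optional[str]:
    """Return the profile whose namespace URI and root name match, else None."""
    root = ET.fromstring(xml_bytes)
    # ElementTree renders qualified names as "{namespace-uri}localname".
    if root.tag.startswith("{"):
        namespace, local = root.tag[1:].split("}", 1)
    else:
        namespace, local = "", root.tag
    for name, matcher in PROFILES.items():
        if matcher["namespace"] == namespace and matcher["root"] == local:
            return name
    return None
```

A record that matches no profile returns `None`; in this design that would most naturally be treated as a gate failure and routed to `on_fail` rather than guessed at.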

The structural stage itself is intentionally boring and deterministic. Each submission or batch triggers a linting step that runs xmllint --schema <xsd> --noout <record.xml> (or the in-process lxml equivalent) against the pinned schema before any semantic rule executes:

# ingest/validate.py: structural gate, called per record before Schematron.
# Returns a structured verdict so the broker can route pass/fail deterministically.
from functools import lru_cache

from lxml import etree

@lru_cache(maxsize=None)
def _load_schema(xsd_path: str) -> etree.XMLSchema:
    # Compile each pinned XSD once per worker process, not once per record.
    return etree.XMLSchema(etree.parse(xsd_path))         # XMLSchema wants a parsed tree

def structural_gate(xml_bytes: bytes, xsd_path: str) -> dict:
    try:
        doc = etree.fromstring(xml_bytes)                 # raises on malformed XML
    except etree.XMLSyntaxError as exc:
        return {"status": "FAIL", "gate": "parse", "error": str(exc)}

    schema = _load_schema(xsd_path)
    if not schema.validate(doc):
        # error_log carries line-accurate reasons -> goes into the quarantine payload
        return {"status": "FAIL", "gate": "xsd",
                "error": str(schema.error_log)}
    return {"status": "PASS", "gate": "xsd"}

Because the mapping registry and the gates are declarative, a reviewer can see exactly which licenses an agency is permitted to publish, which CRS is defaulted for Dublin Core records that carry none, and which schema version a profile is pinned to, all without reading transform code.
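
As a sketch of how the `business_rules` keys from the registry could be evaluated, the function below applies them to an already-mapped canonical record. The canonical field names `crs`, `bbox`, and `license` are assumptions for illustration; the registry only defines the rule keys, not the internal schema.

```python
from typing import Optional, Tuple


def apply_business_rules(record: dict, rules: dict) -> Tuple[dict, Optional[str]]:
    """Return (record with defaults applied, error message or None)."""
    out = dict(record)  # never mutate the caller's record

    if not out.get("crs"):
        if rules.get("require_crs", False):
            return out, "missing CRS"
        if "default_crs" in rules:
            out["crs"] = rules["default_crs"]

    if rules.get("require_extent", False) and not out.get("bbox"):
        return out, "missing spatial extent"

    allowed = rules.get("allowed_licenses")
    if allowed is not None and out.get("license") not in allowed:
        return out, f"license {out.get('license')!r} not permitted"

    return out, None
```

With the `iso-19115` rules a record lacking a CRS is rejected, while the same gap under the `dublin-core` rules is filled with the configured `default_crs`.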

API Boundary: Authenticating Harvest and GetRecords

The CSW endpoint and the harvest path are distinct trust boundaries and must be enforced separately. Anonymous discovery traffic (GetCapabilities, GetRecords, GetRecordById) is read-only against the validated index; ingestion (Harvest, Transaction, or the internal submit API) is privileged and must carry scoped credentials. Collapsing the two — for example exposing Transaction on the same unauthenticated path as GetRecords — is how untrusted records bypass the gate entirely.

Harvest workers should authenticate with short-lived, narrowly scoped credentials rather than a shared service account. A worker that pulls from source A should hold a token scoped to write only to that source’s normalized topic and quarantine, so a compromised or buggy worker cannot overwrite another agency’s records. The same token discipline that governs portal users in implementing RBAC for multi-tenant GIS portals applies to ingestion principals: every harvest job is an identity with an explicit, least-privilege grant, and the source identity it carries is the same value stamped onto the quarantine payload for audit.
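
The least-privilege grant can be illustrated with a small authorization check. Everything here is a hypothetical model: the `HarvestToken` shape, the per-source topic naming, and the 15-minute lifetime in the usage below are assumptions, and a real deployment would delegate this to its identity provider and broker ACLs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HarvestToken:
    source_id: str
    writable_topics: frozenset
    expires_at: datetime


def issue_token(source_id: str, ttl: timedelta, now: datetime) -> HarvestToken:
    """Grant write access only to this source's normalized and quarantine topics."""
    topics = frozenset({
        f"catalog.ingest.normalized.{source_id}",   # hypothetical per-source topic names
        f"catalog.ingest.dead-letter.{source_id}",
    })
    return HarvestToken(source_id, topics, now + ttl)


def may_write(token: HarvestToken, topic: str, now: datetime) -> bool:
    """Allow a write only while the token is unexpired and the topic is in scope."""
    return now < token.expires_at and topic in token.writable_topics


issued_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
token = issue_token("agency-a", timedelta(minutes=15), issued_at)
```

A worker holding `token` can write to agency-a's topics for fifteen minutes and to nothing else, so a compromised worker is bounded in both scope and time.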

On the response side, the GetCapabilities document and GetRecords filters should reflect only the records a requesting principal is entitled to see. Where the catalog is multi-tenant, the query layer rewrites the capabilities document down to the authorized record set rather than broadcasting the full catalog — the discovery boundary mirrors the OGC service boundaries documented for the WMS/WFS tier, and validated metadata is what makes that rewrite trustworthy.

CI/CD Integration and Drift Detection

Validation artifacts must live in the same delivery pipeline as the rest of the platform. Each change to a mapping, an XSD pin, or a Schematron rule runs against a fixture corpus of known-good and known-bad records, and the pipeline fails closed if any previously passing fixture now fails or any known-bad fixture now passes.

# .github/workflows/ingest-gate.yml — gate mapping/validation changes before merge.
name: ingest-gate
on:
  pull_request:
    paths: ["ingestion/**", "transforms/**", "rules/**", "schemas/**"]
jobs:
  validate-fixtures:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the structural + semantic gate over the fixture corpus
        run: python -m ingest.gate --corpus tests/fixtures --fail-on-regression
      - name: Diff canonical output against golden snapshots
        # catches a mapping change that silently drops or renames a field
        run: python -m ingest.diff --golden tests/golden --report diff.json
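
The fail-closed rule described before the workflow can be sketched as a comparison between each fixture's expected verdict and the verdict the current gate produces. This is one way such a check could work, not the implementation behind the `ingest.gate` module named in the workflow; `find_regressions` and the fixture names are hypothetical.

```python
from typing import Dict, List


def find_regressions(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Compare fixture verdicts ("PASS"/"FAIL") and list every mismatch.

    A known-good fixture that now fails and a known-bad fixture that now
    passes are both reported; a fixture with no verdict counts as a mismatch.
    """
    problems = []
    for name, want in sorted(expected.items()):
        got = actual.get(name, "MISSING")
        if got != want:
            problems.append(f"{name}: expected {want}, got {got}")
    return problems


def exit_code(problems: List[str]) -> int:
    """Fail closed: any mismatch at all yields a non-zero exit status."""
    return 1 if problems else 0
```

Treating a missing verdict as a failure matters: a fixture that silently stops being evaluated would otherwise look like a pass.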

Distributed catalogs run more than one regional node, and the artifacts those nodes load must stay in lockstep. Schema drift — node A pinned to an older XSD or a stale Schematron rule than node B — produces index fragmentation and inconsistent GetRecords responses for the same query. A centralized configuration registry should be the single source from which every node loads its mapping and validation artifacts, propagated atomically through the same GitOps sync that delivers the rest of the platform. The environment-parity discipline in environment parity in geospatial CI pipelines is the mechanism that keeps staging and every production region pinned to identical transformation libraries and rule sets, tracked with semantic versioning so a breaking change to a mapping is visible as a major bump.
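
A minimal sketch of the artifact-version comparison, assuming each node can report the versions it has loaded and the registry publishes the intended ones. The node names, artifact names, and version strings in the test data are invented for illustration.

```python
from typing import Dict, Optional, Tuple


def find_drift(
    registry: Dict[str, str],
    nodes: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, Tuple[Optional[str], str]]]:
    """Map each drifting node to {artifact: (loaded version, published version)}.

    An artifact the node has not loaded at all is reported with None.
    """
    drift = {}
    for node, loaded in nodes.items():
        stale = {
            artifact: (loaded.get(artifact), published)
            for artifact, published in registry.items()
            if loaded.get(artifact) != published
        }
        if stale:
            drift[node] = stale
    return drift
```

An empty result means every node matches the registry; anything else names the node, the artifact, and both versions, which is enough to drive a targeted resync.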

Drift detection then runs as a scheduled job rather than a hope: periodically compare per-region index checksums and the loaded artifact versions, and trigger a reconciliation workflow when divergence crosses a threshold. For cross-region replication, resolve conflicts with idempotent upsert and a monotonic version counter so that out-of-order or replayed records converge to the same state, and schedule latency-aware harvest jobs that account for replication lag before treating a record as authoritative.
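
The convergence property of an idempotent upsert with a monotonic version counter can be shown in a few lines. The in-memory `store` dict stands in for a regional index, and the sketch assumes a single writer assigns versions per record, so equal versions always carry equal payloads.

```python
def upsert(store: dict, record_id: str, version: int, payload: dict) -> bool:
    """Apply a write only if it is newer than what the store already holds.

    Returns True when the store changed, False for stale or replayed writes.
    """
    current = store.get(record_id)
    if current is not None and current["version"] >= version:
        return False  # out-of-order or duplicate delivery: ignore
    store[record_id] = {"version": version, "payload": payload}
    return True


# The same writes delivered in two different orders, including a replay.
writes = [
    ("rec-1", 1, {"title": "old"}),
    ("rec-1", 2, {"title": "new"}),
    ("rec-1", 2, {"title": "new"}),
]
region_a, region_b = {}, {}
for write in writes:
    upsert(region_a, *write)
for write in reversed(writes):
    upsert(region_b, *write)
```

Both regions end in the same state regardless of delivery order, which is the property that lets replication retry freely.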

Operational Troubleshooting

Diagnose ingestion-gate failures by symptom, the queue or log that reveals the cause, and the config surface that fixes it:

Symptom	Likely cause	Where to look	Fix
Records vanish, no error logged	Empty mapped fields from prefix mismatch	Transform output for canonical `title`/`identifier` being null	Bind namespaces by URI in the XSLT/Schematron; never match raw prefixes
Dead-letter queue accumulating	Schematron or business rule rejecting a new source profile	Quarantine payload `gate` + `error` fields	Add/adjust the profile in `mappings.yaml`; backfill from quarantine after re-test
`GetRecords` differs between regions	Artifact version drift across nodes	Loaded `schematron`/`xsd` version per node vs. registry	Force GitOps resync; pin all nodes to the registry-published version
Reindex required after a batch	Field type conflict from auto-mapping	Index mapping conflict in the search-tier logs	Pre-declare analyzers and geo-shape fields; never rely on dynamic mapping in production
XSD passes but record is unusable	Structurally valid yet semantically wrong (missing CRS/extent)	Stage-2 Schematron results, not stage-1 XSD	Promote the missing constraint to a `business_rules` check
Gate latency spiking under load	Stateful workers not scaling with queue depth	Worker autoscaler metrics vs. broker lag	Keep transform workers stateless; autoscale on queue depth, expose lag to Prometheus

Operationally, deploy transformation and validation workers as ephemeral containers with horizontal autoscaling tied to queue-depth metrics, and expose pipeline health via Prometheus endpoints tracking validation success rate, dead-letter accumulation, and mapping latency. Audit the rule sets against evolving ISO 19115 amendments and agency profiles on a schedule, so the gate tightens as the standard moves rather than silently aging out of compliance.
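
As a rough sketch of the health signals mentioned above, the class below counts gate verdicts and renders them in the Prometheus text exposition format using only the standard library. The metric name `ingest_gate_verdicts_total` and its labels are assumptions, and a real deployment would more likely use a Prometheus client library than hand-rendered text.

```python
from collections import Counter


class GateMetrics:
    """Counts verdicts per (gate, status) pair."""

    def __init__(self) -> None:
        self.verdicts = Counter()

    def observe(self, gate: str, status: str) -> None:
        self.verdicts[(gate, status)] += 1

    def success_rate(self) -> float:
        """Fraction of observed verdicts that passed; 1.0 when nothing was observed."""
        total = sum(self.verdicts.values())
        passed = sum(n for (_, status), n in self.verdicts.items() if status == "PASS")
        return passed / total if total else 1.0

    def render(self) -> str:
        """Render counters as Prometheus text exposition lines."""
        lines = ["# TYPE ingest_gate_verdicts_total counter"]
        for (gate, status), count in sorted(self.verdicts.items()):
            lines.append(
                f'ingest_gate_verdicts_total{{gate="{gate}",status="{status}"}} {count}'
            )
        return "\n".join(lines) + "\n"
```

Labelling by gate as well as status lets an alert distinguish a burst of malformed XML from a Schematron rule that has started rejecting a whole source.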

Related

Metadata Catalog Automation & Ingestion Workflows: the parent reference this mapping-and-validation gate fits inside.
Validating ISO 19115 Metadata Before Ingestion: the full rule matrix and Schematron templates behind the semantic gate.
Automated Metadata Ingestion via OAI-PMH: the harvest mechanism that feeds raw records into the transform stage.
Search Indexing Optimization with Elasticsearch: the field-type and analyzer guarantees on the index tier downstream of this gate.
Version Tagging & Sync for Spatial Datasets: how validated records stay consistent across regional nodes during synchronization.
