Validating ISO 19115 Metadata Before Ingestion

A step-by-step procedure for gating ISO 19115 metadata through strict XSD, Schematron, and referential checks so that only conformant records reach a production geospatial catalog.

This guide is the hands-on companion to CSW Catalog Schema Mapping & Validation and sits inside the wider Metadata Catalog Automation & Ingestion Workflows practice; read the parent page first if you need the pipeline-level rationale for why validation runs as a gate that is decoupled from transformation. Here the focus is narrow and operational: how to parse defensively, how to enforce structural and business rules in distinct stages, and how to verify that a failing record is quarantined rather than silently corrupting the index. Unvalidated XML payloads introduce silent corruption, degrade CSW query performance, and trigger downstream indexing failures that are expensive to remediate after the fact.

Prerequisites

Before wiring validation into an ingestion stream, confirm the following. A gap in any of them is a common root cause of a validation stage that passes bad records or stalls under load.

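Python 3.11+ with lxml >= 5.0 (it ships the libxml2 and libxslt bindings used for XSD and ISO Schematron validation respectively).
The XSD bundle for the encoding you ingest, on disk: ISO/TS 19115-3:2016 for ISO 19115-1:2014 records, or the legacy ISO/TS 19139 schemas for records in the gmd/gco/gmx namespaces used in the examples below. All imported schemas must resolve locally; never let the parser fetch schema includes over the network.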
A Schematron rule set (.sch) encoding your agency or INSPIRE profile constraints, compiled or loaded through lxml.isoschematron.
An authoritative CRS registry (a local EPSG snapshot or a pinned pyproj database) so coordinate-reference checks never depend on a live lookup.
A writable quarantine path (/var/lib/ingest/quarantine/) and a structured log sink (/var/log/ingest/validate.jsonl) for per-record XPath-keyed failures.
When running under Kubernetes, the XSD and .sch rule sets mounted as immutable ConfigMaps so every replica validates against an identical ruleset across scaling events — the StatefulSet patterns in Kubernetes StatefulSets for PostGIS Databases cover the volume-mount mechanics.

A single schema check is never sufficient. The parser establishes well-formedness and XSD enforces structural compliance, including simple minOccurs/maxOccurs cardinality, but XSD cannot express conditional or cross-element constraints, membership in external codelists, or domain logic. Schematron supplies those contextual assertions, and CRS and temporal checks catch referential faults that pass both. The harness implements these as a multi-stage gate where any failure short-circuits to quarantine, and only records passing every stage are ingested.
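
The short-circuit behaviour can be sketched independently of any XML library. The following is a minimal illustration, not the harness itself: `run_gate` and the stage signature are hypothetical names introduced here, and each stage is assumed to return a dict with a `status` key in the same shape the later steps use.

```python
from typing import Callable, Iterable

# Hypothetical stage signature: takes a record, returns {"status": "PASS" | "FAIL", ...}.
Stage = Callable[[object], dict]


def run_gate(record: object, stages: Iterable[tuple[str, Stage]]) -> dict:
    """Run stages in order and stop at the first one that does not pass."""
    for name, stage in stages:
        try:
            result = stage(record)
        except Exception as exc:
            # Fail closed: an unexpected error in a stage counts as a rejection.
            return {"status": "FAIL", "stage": name, "error": repr(exc)}
        if result.get("status") != "PASS":
            # Tag the failure with the stage that produced it.
            return {**result, "status": "FAIL", "stage": name}
    return {"status": "PASS"}
```

Later stages never run once an earlier one fails, so the cheap checks should come first in the list.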

Step-by-step implementation

1. Configure an XXE-safe, non-recovering parser

The first stage is the cheapest filter and the most security-sensitive. Configure one parser with recover=False so malformed payloads fail loudly instead of being silently repaired, and disable entity resolution and network access to close the XML External Entity (XXE) vector. Reuse this single parser for every record in the loop.

from lxml import etree

# One reusable, XXE-safe parser. recover=False means malformed XML raises
# rather than being silently "fixed" into a record that passes later stages.
STRICT_PARSER = etree.XMLParser(
    recover=False,          # no lenient recovery — fail malformed payloads
    no_network=True,        # never fetch DTDs, schema includes, or entities
    resolve_entities=False, # close the XXE vector
    huge_tree=False,        # bound memory on hostile or oversized payloads
)
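
To see why `recover=False` matters, the sketch below parses the same malformed payload under both settings. The `try_parse` helper and the sample payload are illustrative assumptions, not part of the harness.

```python
from lxml import etree

BROKEN = b"<a><b></a>"  # <b> is never closed


def try_parse(payload: bytes, recover: bool):
    """Return (root, error) for one payload under the given recovery setting."""
    parser = etree.XMLParser(recover=recover, no_network=True, resolve_entities=False)
    try:
        return etree.fromstring(payload, parser), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


# Strict mode raises, so the record can be rejected at the parse stage.
strict_root, strict_err = try_parse(BROKEN, recover=False)

# Recovery mode hands back a repaired tree with no exception raised.
lenient_root, lenient_err = try_parse(BROKEN, recover=True)
```

In recovery mode the caller receives a tree and no exception, which is exactly how a malformed record can reach the later stages unnoticed.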

2. Enforce strict namespace resolution

The most persistent failure vector is namespace collision, especially when legacy FGDC, INSPIRE, or custom agency profiles are intermixed with standard ISO 19115 records. What matters is the namespace URI, not the prefix: a record that binds gmd or gmx to the wrong URI can look correct to a reviewer while containing none of the elements the schema expects, and a parser left in lenient recovery makes such records harder to catch. Declare the namespaces you require explicitly and assert that the document root carries an xsi:schemaLocation, so a payload that does not declare its schema is rejected before schema validation rather than passing by accident.

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gmx": "http://www.isotc211.org/2005/gmx",
    "gco": "http://www.isotc211.org/2005/gco",
    "gml": "http://www.opengis.net/gml/3.2",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

def require_schema_location(doc):
    """Reject records that omit xsi:schemaLocation — these slip past lenient parsers."""
    root = doc.getroot()
    if root.get(f"{{{NS['xsi']}}}schemaLocation") is None:
        return {"status": "FAIL", "error": "missing xsi:schemaLocation"}
    return {"status": "PASS"}
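
The schemaLocation check does not by itself confirm that the root element lives in the expected namespace. A complementary check could compare the namespace URI directly, as in this sketch. `require_root_namespace` is a hypothetical helper, and the expected root of `MD_Metadata` in the gmd namespace is an assumption about the records being ingested.

```python
from lxml import etree

GMD_NS = "http://www.isotc211.org/2005/gmd"


def require_root_namespace(doc, expected_ns=GMD_NS, expected_local="MD_Metadata"):
    """Hypothetical helper: compare the root's namespace URI, never its prefix."""
    qname = etree.QName(doc.getroot())
    if qname.namespace != expected_ns or qname.localname != expected_local:
        return {"status": "FAIL", "error": f"unexpected root element: {qname.text}"}
    return {"status": "PASS"}
```

A record that uses an unusual prefix bound to the correct URI passes, while one that uses the familiar `gmd` prefix bound to a different URI fails.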

3. Validate structure against the ISO 19115 XSD, then business rules with Schematron

Run XSD first to confirm structural compliance, then Schematron for contextual assertions such as requiring at least one gmd:MD_DataIdentification block inside gmd:identificationInfo, or constraining temporal extents to gco:Date / gco:DateTime. Note the lxml specifics: etree.XMLSchema takes a parsed tree, not raw bytes; ISO Schematron lives in lxml.isoschematron.Schematron (etree.Schematron is the older libxml2-based pre-ISO validator and is not a substitute); and store_report=True keeps the SVRL report available as validation_report.

from lxml import etree, isoschematron

def validate_iso19115(xml_path, xsd_path, schematron_path=None):
    """Gate one ISO 19115 record through parse -> XSD -> Schematron stages."""
    # Stage 1: strict parse
    try:
        doc = etree.parse(xml_path, STRICT_PARSER)
    except etree.XMLSyntaxError as e:
        return {"status": "FAIL", "stage": "parse", "error": f"malformed XML: {e}"}

    loc = require_schema_location(doc)
    if loc["status"] == "FAIL":
        return {"stage": "namespace", **loc}

    # Stage 2: XSD structural validation
    schema = etree.XMLSchema(etree.parse(xsd_path))  # takes a tree, not bytes
    if not schema.validate(doc):
        return {"status": "FAIL", "stage": "xsd", "details": str(schema.error_log)}

    # Stage 3: Schematron business rules (isoschematron, not etree.Schematron)
    if schematron_path:
        sch = isoschematron.Schematron(etree.parse(schematron_path), store_report=True)
        if not sch.validate(doc):
            return {
                "status": "FAIL",
                "stage": "schematron",
                "details": str(sch.validation_report),
            }

    return {"status": "PASS", "doc": doc}
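
For XPath-keyed logging it helps to pull individual failures out of the SVRL report rather than storing it whole. The sketch below shows one way to do that with an inline rule. The rule text, the sample record, and the `failed_asserts` helper are illustrative assumptions; a real profile would live in the mounted .sch file.

```python
from lxml import etree, isoschematron

SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}

# Illustrative rule only: identificationInfo must hold an MD_DataIdentification.
RULE = b"""<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="gmd" uri="http://www.isotc211.org/2005/gmd"/>
  <pattern>
    <rule context="gmd:identificationInfo">
      <assert test="gmd:MD_DataIdentification">identificationInfo needs an MD_DataIdentification</assert>
    </rule>
  </pattern>
</schema>"""

# Sample record that violates the rule above.
RECORD = b"""<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:identificationInfo/>
</gmd:MD_Metadata>"""


def failed_asserts(schematron, doc):
    """Return [(xpath_location, message)] for every failed assertion."""
    if schematron.validate(doc):
        return []
    report = schematron.validation_report  # SVRL tree, kept by store_report=True
    return [
        (fa.get("location"), (fa.findtext("svrl:text", namespaces=SVRL_NS) or "").strip())
        for fa in report.iterfind(".//svrl:failed-assert", SVRL_NS)
    ]


sch = isoschematron.Schematron(etree.XML(RULE), store_report=True)
failures = failed_asserts(sch, etree.XML(RECORD))
```

Each tuple pairs the location of the offending node with the assertion message, which maps directly onto one line of a structured log.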

4. Reject malformed CRS and temporal extents referentially

XSD and Schematron pass syntactically valid but referentially broken records. The two that most often break downstream indexing are invalid coordinate reference systems and ambiguous temporal extents. Check gmd:referenceSystemInfo against the authoritative EPSG snapshot, and enforce ISO 8601 with UTC normalization on gmd:EX_TemporalExtent, rejecting partial dates and any range where the start does not precede the end.

import datetime as dt
from pyproj import CRS
from pyproj.exceptions import CRSError

def check_references(doc):
    """Referential checks that schema validation cannot express."""
    # CRS: every referenceSystem identifier must resolve in the EPSG registry.
    for ident in doc.iterfind(".//gmd:referenceSystemInfo//gmd:code/gco:CharacterString", NS):
        code = (ident.text or "").strip()
        try:
            CRS.from_user_input(code)          # e.g. "EPSG:4326" or an OGC URN
        except CRSError:
            return {"status": "FAIL", "stage": "crs", "error": f"unresolvable CRS: {code}"}

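    # Temporal: ISO 8601, full dates only, normalized to UTC, start < end.
    for ext in doc.iterfind(".//gmd:EX_TemporalExtent//gml:TimePeriod", NS):
        begin = ext.findtext("gml:beginPosition", namespaces=NS)
        end = ext.findtext("gml:endPosition", namespaces=NS)
        try:
            b = dt.datetime.fromisoformat(begin.strip().replace("Z", "+00:00"))
            e = dt.datetime.fromisoformat(end.strip().replace("Z", "+00:00"))
        except (AttributeError, ValueError):
            return {"status": "FAIL", "stage": "temporal", "error": "non-ISO-8601 extent"}
        # Assumption: values without an offset are UTC. Normalizing both bounds
        # avoids a TypeError when one is offset-naive and the other offset-aware.
        utc = dt.timezone.utc
        b = b.replace(tzinfo=utc) if b.tzinfo is None else b.astimezone(utc)
        e = e.replace(tzinfo=utc) if e.tzinfo is None else e.astimezone(utc)
        if b >= e:
            return {"status": "FAIL", "stage": "temporal", "error": "start not before end"}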

    return {"status": "PASS"}
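
The temporal rule is easier to reason about on concrete values. This standalone sketch uses only the standard library; `to_utc` and `valid_period` are hypothetical helpers, and treating offset-less values as UTC is an assumption that an agency profile may need to change.

```python
import datetime as dt

UTC = dt.timezone.utc


def to_utc(value: str) -> dt.datetime:
    """Parse a full ISO 8601 date or date-time and normalize it to UTC.

    Assumption: values without an offset are treated as UTC.
    Raises ValueError for reduced-precision input such as "2020-01".
    """
    parsed = dt.datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        return parsed.replace(tzinfo=UTC)
    return parsed.astimezone(UTC)


def valid_period(begin: str, end: str) -> bool:
    """True only when both bounds parse and the start strictly precedes the end."""
    try:
        return to_utc(begin) < to_utc(end)
    except ValueError:
        return False
```

Comparing wall-clock values without normalization would reject a period that begins at 10:00+02:00 and ends at 09:00Z, although the start is an hour earlier in UTC.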

5. Route through an asynchronous gate with structured logging

Synchronous validation becomes a throughput bottleneck at thousands of records per minute, so stage payloads on a queue and route each to an approved ingestion lane or a quarantine directory. Log every failure with its stage and XPath so triage never requires re-running the batch — the same decoupling principle the parent CSW Catalog Schema Mapping & Validation guide applies to transformation. Records that pass flow on to schema mapping and the search layer described in Search Indexing Optimization with Elasticsearch.

# validate-gate.yaml — declarative config consumed by the validation worker
validation:
  xsd_bundle: "/etc/ingest/schema/iso19115-3.xsd"   # mounted as an immutable ConfigMap
  schematron: "/etc/ingest/rules/agency-profile.sch"
  encoding: "utf-8"                # reject UTF-8 with BOM and other encodings
  max_concurrency: 32             # bound the async validation pool
  quarantine_path: "/var/lib/ingest/quarantine/"
  log_sink: "/var/log/ingest/validate.jsonl"
  fail_closed: true               # any stage error routes to quarantine, never ingest
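
A worker consuming this configuration might bound concurrency with a semaphore and run each validation off the event loop. The sketch below uses only the standard library and takes the validator as a parameter; `gate_one` and `gate_batch` are hypothetical names, the log directory is assumed to exist, and moving the file into quarantine is one possible policy rather than a prescribed one.

```python
import asyncio
import json
import shutil
from pathlib import Path


async def gate_one(path, validate, sem, quarantine_dir, log_path):
    """Validate one file in a worker thread; quarantine and log on any failure."""
    async with sem:
        try:
            result = await asyncio.to_thread(validate, path)
        except Exception as exc:
            # Fail closed: unexpected errors are quarantined, never ingested.
            result = {"status": "FAIL", "stage": "internal", "error": repr(exc)}
    if result.get("status") == "PASS":
        return True
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(quarantine_dir / Path(path).name))
    # Drop the parsed tree, if present, before serializing the log entry.
    entry = {"record": str(path), **{k: v for k, v in result.items() if k != "doc"}}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return False


async def gate_batch(paths, validate, quarantine_dir, log_path, max_concurrency=32):
    """Bound the pool with a semaphore, mirroring max_concurrency in the config."""
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(
        *(gate_one(p, validate, sem, Path(quarantine_dir), Path(log_path)) for p in paths)
    )
```

Calling `asyncio.run(gate_batch(paths, validator, quarantine_path, log_sink))` returns one boolean per input path, in input order, so the caller can forward only the approved records.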

Verification

Confirm each stage in isolation before trusting the gate end to end. Run these against a staging ruleset first.

# 1. The XSD bundle is self-contained: with --nonet, any include that needs the network fails
xmllint --nonet --noout --schema /etc/ingest/schema/iso19115-3.xsd sample-record.xml

# 2. A known-good record passes every stage
python -c "from validate import validate_iso19115 as v; \
  print(v('good.xml', '/etc/ingest/schema/iso19115-3.xsd', '/etc/ingest/rules/agency-profile.sch'))"

# 3. A record with a bad EPSG code is rejected at the crs stage, not ingested
python -c "from validate import validate_iso19115 as v, check_references as c; \
  r = v('bad-crs.xml', '/etc/ingest/schema/iso19115-3.xsd'); \
  print(c(r['doc']) if r['status']=='PASS' else r)"

# 4. Once the worker has processed bad-crs.xml, it sits in quarantine and the log records its stage
ls -1 /var/lib/ingest/quarantine/ | grep bad-crs
grep -E '"stage": ?"crs"' /var/log/ingest/validate.jsonl | tail -1

A clean xmllint schema check, a PASS on the good record, a crs-stage FAIL on the bad one, and a matching quarantine entry confirm the gate is fail-closed and every rejection is auditable. Wire checks 2 and 3 into the deployment pipeline so that no ruleset change is promoted without passing both.

Troubleshooting matrix

Symptom	Likely cause	Fix
Malformed XML passes validation	Parser left in recovery mode	Set `recover=False`; reuse the single `STRICT_PARSER`; do not let callers override it
ISO Schematron rules fail to load or are only partly enforced	Rules loaded through `etree.Schematron`, lxml's older libxml2-based pre-ISO validator	Load the rule set with `lxml.isoschematron.Schematron` instead
`XMLSchemaParseError` on load	XSD includes resolved over the network or missing locally	Mount the full bundle; keep `no_network=True`; pin all `xsd:import` locations to disk
Records with bad EPSG codes reach the index	CRS checked only structurally, never resolved	Resolve every `referenceSystemInfo` code against the EPSG snapshot; reject on `CRSError`
Harvester breaks on temporal extents	Partial dates, mixed timezones, or start ≥ end accepted	Enforce full ISO 8601, normalize to UTC, assert start < end; reject otherwise
`gmd`/`gmx` prefix mismatch slips through	Lenient parser ignored the missing `xsi:schemaLocation`	Assert `xsi:schemaLocation` on the root before XSD; fail unqualified payloads
External entity fetched during parse (XXE)	Entity resolution or network left enabled	Set `resolve_entities=False` and `no_network=True` on every parser
Validation stalls the whole pipeline at peak	Synchronous, in-line validation	Move validation onto the async queue with bounded `max_concurrency`; quarantine, don’t block
Replicas disagree on what is valid	Rule sets drift between pods	Mount XSD and `.sch` as immutable ConfigMaps so every replica loads identical rules

For authoritative rule semantics, the OGC Catalogue Service (CSW) standard defines the interoperability contract harvested records must hold, and the lxml validation documentation covers production-safe parser tuning and memory limits.

Related

CSW Catalog Schema Mapping & Validation — the parent practice: schema-aware transformation and the gate this procedure plugs into.
Automating OAI-PMH Harvesting for Government Geospatial Portals — where harvested records originate before they hit this validation gate.
Search Indexing Optimization with Elasticsearch — where validated records become discoverable.
Version Tagging & Sync for Spatial Datasets — keeping validated catalog records consistent across dataset revisions.
