Validate Overture data on Spark against the Pydantic schema #517

@sethfitz

Description

Problem

The Pydantic models in overture-schema-* packages are the canonical definition of every constraint in the schema -- value ranges, string patterns, require_any_of/forbid_if/require_if model rules, discriminator gating, unique-items, geometry types, country/region codes, linear-reference ordering, and so on. They run against single Python objects, one at a time.

Most consumers of Overture data work at the scale of a release: tens to hundreds of millions of rows per theme, partitioned across hundreds of Parquet files in S3. There is currently no way to ask "does this release conform to the schema" without round-tripping every row through Python. In practice that means schema drift is caught by downstream consumers, not by us.

We need a validator that:

  • Runs the same constraints that Pydantic does, on the same data, at Parquet/Spark scale.
  • Stays in lockstep with the Pydantic models -- no second source of truth to maintain.
  • Reports violations at row granularity (feature ID, field path, check name, offending value) so problems are actionable.
  • Surfaces schema drift (extra columns, missing columns, type mismatches) separately from value-level violations.
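The last requirement, reporting drift separately from value-level violations, can be sketched in plain Python. This is a minimal illustration, not the package's schema-comparison code: `SchemaDrift`, `compare_schemas`, and the flat `{column: type-name}` mappings are all hypothetical, and real Spark schemas are nested.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Column-level differences between an expected and an actual schema."""

    missing: list[str] = field(default_factory=list)  # expected but absent
    extra: list[str] = field(default_factory=list)  # present but not expected
    # column name -> (expected type, actual type)
    mismatched: dict[str, tuple[str, str]] = field(default_factory=dict)

    @property
    def ok(self) -> bool:
        return not (self.missing or self.extra or self.mismatched)


def compare_schemas(expected: dict[str, str], actual: dict[str, str]) -> SchemaDrift:
    """Compare two flat {column: type-name} mappings (hypothetical layout)."""
    drift = SchemaDrift()
    for name, expected_type in expected.items():
        if name not in actual:
            drift.missing.append(name)
        elif actual[name] != expected_type:
            drift.mismatched[name] = (expected_type, actual[name])
    drift.extra = [name for name in actual if name not in expected]
    return drift


# Illustrative column names and types only.
expected = {"id": "string", "confidence": "double", "names": "struct"}
actual = {"id": "string", "confidence": "float", "source_tag": "string"}
drift = compare_schemas(expected, actual)
print(drift)
```

Keeping drift in its own result type means a type mismatch never has to be phrased as a per-row violation.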

Proposal

Generate PySpark Column expressions from the Pydantic models, and ship them as a runtime package with a CLI.
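To make the idea concrete, here is a stdlib-only sketch of turning declarative constraints into violation predicates. It emits Spark SQL predicate strings (the kind that could be handed to `pyspark.sql.functions.expr`) rather than Column objects, so it runs without Spark. `RangeRule`, `PatternRule`, and `violation_predicate` are hypothetical names; the proposed generator works from FeatureSpec trees, not from these classes.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class RangeRule:
    """Hypothetical stand-in for a numeric ge/le constraint."""

    field: str
    ge: Optional[float] = None
    le: Optional[float] = None


@dataclass(frozen=True)
class PatternRule:
    """Hypothetical stand-in for a string pattern constraint."""

    field: str
    regex: str


def violation_predicate(rule: Union[RangeRule, PatternRule]) -> str:
    """Return a SQL predicate that is true when a row violates the rule.

    Assumes flat column names. Nulls are not treated as violations here;
    required-ness would be a separate check.
    """
    col = f"`{rule.field}`"
    if isinstance(rule, RangeRule):
        bounds = []
        if rule.ge is not None:
            bounds.append(f"{col} < {rule.ge}")
        if rule.le is not None:
            bounds.append(f"{col} > {rule.le}")
        if not bounds:
            raise ValueError("RangeRule needs at least one bound")
        return f"{col} IS NOT NULL AND ({' OR '.join(bounds)})"
    if isinstance(rule, PatternRule):
        # Escape backslashes and quotes for a SQL string literal.
        escaped = rule.regex.replace("\\", "\\\\").replace("'", "\\'")
        return f"{col} IS NOT NULL AND NOT ({col} RLIKE '{escaped}')"
    raise TypeError(f"unsupported rule: {rule!r}")


range_sql = violation_predicate(RangeRule("confidence", ge=0.0, le=1.0))
pattern_sql = violation_predicate(PatternRule("country", "^[A-Z]{2}$"))
print(range_sql)
print(pattern_sql)
```

Because each predicate is derived mechanically from the rule, there is no hand-maintained second copy of the constraint to fall out of date.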

Two packages working together (one new, one extended)

overture-schema-pyspark -- runtime. Hand-written shared utilities (constraint_expressions, column_patterns, schema-comparison, validation pipeline, CLI) plus a generated per-feature expression module tree under expressions/generated/. Public API: validate_feature(df, type) -> ValidationResult, plus explain_errors() to UNPIVOT row-level results into one row per violation.
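The unpivot step can be illustrated without Spark. The sketch below assumes a wide layout with one boolean result column per check, which is an assumption about the intermediate shape rather than something the proposal specifies; `unpivot_violations` and the `<field>__<check>` column names are hypothetical.

```python
from typing import Any, Iterable, Iterator


def unpivot_violations(
    rows: Iterable[dict[str, Any]],
    check_columns: dict[str, tuple[str, str]],
) -> Iterator[dict[str, Any]]:
    """Turn wide per-row check results into one record per violation.

    check_columns maps a result-column name to (field path, check name).
    A truthy result column means the check failed for that row.
    """
    for row in rows:
        for column, (field_path, check) in check_columns.items():
            if row.get(column):
                yield {
                    "id": row["id"],
                    "field": field_path,
                    "check": check,
                    "value": row.get(field_path),
                }


# Illustrative data: two checks evaluated over three rows.
check_columns = {
    "confidence__range": ("confidence", "range"),
    "country__pattern": ("country", "pattern"),
}
wide_rows = [
    {"id": "a", "confidence": 1.4, "country": "US",
     "confidence__range": True, "country__pattern": False},
    {"id": "b", "confidence": 0.5, "country": "usa",
     "confidence__range": False, "country__pattern": True},
    {"id": "c", "confidence": 0.9, "country": "DE",
     "confidence__range": False, "country__pattern": False},
]
violations = list(unpivot_violations(wide_rows, check_columns))
for violation in violations:
    print(violation)
```

Rows that pass every check produce no output records, so the long form stays proportional to the number of violations rather than the number of rows.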

overture-schema-codegen (extended) -- a pyspark/ subpackage that walks the existing FeatureSpec trees and emits per-feature expression modules and per-feature conformance test modules. The pipeline reuses the extraction layer already in place for the Markdown renderer; PySpark is just a new output target.

CLI

overture-validate <feature-type> <parquet-or-directory>

The output is one row per violation: feature ID, theme/type, failing field, check name, offending value. Evaluation is single-pass, which keeps memory bounded. The CLI supports --count-only, --head, --suppress, --skip-schema-check, --ignore-columns, and --skip-extra-columns for working with real releases that have known drift.

Conformance gate

The codegen also emits a conformance test module per feature (tests/generated/.../test_<feature>.py). Each scenario mutates a valid base row to violate exactly one constraint and asserts the expected check fires. The expectations are derived from the same Pydantic models the expressions are derived from, frozen at codegen time -- so a codegen change that produces different expressions breaks the regenerated tests until the expected scenarios are also regenerated. This is what keeps the two surfaces from drifting.
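The scenario pattern described above might look roughly like this in plain Python. The checks, base row, and scenario names are invented for illustration; the generated modules derive theirs from the Pydantic models and evaluate Spark expressions rather than lambdas.

```python
import re

# Hypothetical checks: each returns True when the row VIOLATES the constraint.
CHECKS = {
    ("confidence", "range"): lambda row: not 0.0 <= row["confidence"] <= 1.0,
    ("country", "pattern"): lambda row: re.fullmatch(r"[A-Z]{2}", row["country"]) is None,
}

# A row that satisfies every check.
BASE_ROW = {"id": "base", "confidence": 0.5, "country": "US"}

# Each scenario changes the base row to break exactly one constraint.
SCENARIOS = [
    {"name": "confidence_above_max", "mutate": {"confidence": 1.5},
     "expect": ("confidence", "range")},
    {"name": "country_lowercase", "mutate": {"country": "us"},
     "expect": ("country", "pattern")},
]


def failing_checks(row: dict) -> set:
    """Return the (field, check) keys of every check the row violates."""
    return {key for key, violated in CHECKS.items() if violated(row)}


def run_scenario(scenario: dict) -> set:
    """Apply the scenario's mutation to a copy of the base row and evaluate."""
    return failing_checks({**BASE_ROW, **scenario["mutate"]})


# The base row must be clean, and each scenario must fire only its own check.
assert failing_checks(BASE_ROW) == set()
for scenario in SCENARIOS:
    assert run_scenario(scenario) == {scenario["expect"]}, scenario["name"]
print("all scenarios fired exactly the expected check")
```

Asserting equality with a one-element set, rather than membership, is what catches a mutation that accidentally trips a second check.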

Conformance tests are run once per feature on a shared DataFrame (O(checks), not O(checks * scenarios)) so the suite stays affordable even for segment, which has the deepest nesting and produces three arm-specific test files.
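A plain-Python analogue of the shared-DataFrame approach: all scenario rows go into one batch, each check makes a single pass over that batch, and per-scenario assertions are made afterwards. The batch functions stand in for column expressions, and every name here is hypothetical.

```python
import re
from collections import defaultdict


def range_check(rows: list) -> set:
    """One pass over the batch; returns the scenarios that violate the range."""
    return {r["scenario"] for r in rows if not 0.0 <= r["confidence"] <= 1.0}


def pattern_check(rows: list) -> set:
    """One pass over the batch; returns the scenarios that violate the pattern."""
    return {r["scenario"] for r in rows if re.fullmatch(r"[A-Z]{2}", r["country"]) is None}


CHECKS = {"confidence.range": range_check, "country.pattern": pattern_check}

# Every scenario contributes one row to the same shared batch.
batch = [
    {"scenario": "valid_base", "confidence": 0.5, "country": "US"},
    {"scenario": "confidence_above_max", "confidence": 1.5, "country": "US"},
    {"scenario": "country_lowercase", "confidence": 0.5, "country": "us"},
]
EXPECTED = {
    "valid_base": set(),
    "confidence_above_max": {"confidence.range"},
    "country_lowercase": {"country.pattern"},
}

passes = 0
fired = defaultdict(set)
for name, check in CHECKS.items():
    passes += 1  # one evaluation per check, regardless of scenario count
    for scenario in check(batch):
        fired[scenario].add(name)

assert passes == len(CHECKS)
for scenario, expected in EXPECTED.items():
    assert fired[scenario] == expected, scenario
print(f"{passes} passes covered {len(batch)} scenarios")
```

Adding a scenario adds a row to the batch but no extra evaluation pass, which is the O(checks) behaviour described above.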

Non-goals

  • Not a replacement for Pydantic validation in producer pipelines -- Pydantic still owns single-row validation, and is the source of truth.
  • Not a JSON Schema generator. JSON Schema baselines already exist via a separate path.
  • No new constraint types in this work. Anything Pydantic can express today, the generator targets; anything it can't is out of scope here.

How to experience it

After the PR lands and make install runs:

  1. overture-validate --help lists every supported feature type.
  2. Point it at a small local Parquet file: overture-validate segment samples/segment.parquet --count-only. The output is a count of violations per (field, check) pair.
  3. Drop --count-only and the first 20 violation rows print with feature ID, field path, check name, message, and offending value.
  4. Run it against a real release prefix: overture-validate place s3a://overturemaps-us-west-2/release/<release>/. Expect schema mismatches (e.g. Float vs Double on bbox); --skip-schema-check lets you push past them.
  5. Browse packages/overture-schema-pyspark/src/overture/schema/pyspark/expressions/generated/overture/schema/transportation/segment.py to see the generated expressions for the hardest feature in the schema -- nested vehicle scoping, discriminator gating, three subtype arms -- and the matching test_segment_*.py files that pin every check to a scenario.
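The per-(field, check) summary mentioned in step 2 is a simple aggregation over violation records. The sketch below uses made-up records and is only meant to show the shape of that summary, not the CLI's actual output format.

```python
from collections import Counter

# Made-up violation records in the one-row-per-violation shape.
violations = [
    {"id": "a", "field": "confidence", "check": "range", "value": 1.4},
    {"id": "b", "field": "country", "check": "pattern", "value": "usa"},
    {"id": "c", "field": "confidence", "check": "range", "value": -0.2},
]

# Count violations per (field, check) pair.
counts = Counter((v["field"], v["check"]) for v in violations)

for (field, check), n in counts.most_common():
    print(f"{field}\t{check}\t{n}")
```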

Acceptance

  • overture-schema-pyspark installs and exposes the documented public API.
  • overture-validate runs against a local Parquet file and a real release prefix.
  • Codegen pipeline reuses FeatureSpec extraction and emits generated trees inside a single generated/ boundary that make generate-pyspark wipes and recreates.
  • Per-feature conformance test modules exist for every feature and pass.
  • Documented in packages/overture-schema-pyspark/README.md and packages/overture-schema-codegen/docs/design.md.

Labels: enhancement
