Problem
The Pydantic models in overture-schema-* packages are the canonical definition of every constraint in the schema -- value ranges, string patterns, require_any_of/forbid_if/require_if model rules, discriminator gating, unique-items, geometry types, country/region codes, linear-reference ordering, and so on. They run against single Python objects, one at a time.
Most consumers of Overture data work at the scale of a release: tens to hundreds of millions of rows per theme, partitioned across hundreds of Parquet files in S3. There is currently no way to ask "does this release conform to the schema" without round-tripping every row through Python. In practice that means schema drift is caught by downstream consumers, not by us.
We need a validator that:
- Runs the same constraints that Pydantic does, on the same data, at Parquet/Spark scale.
- Stays in lockstep with the Pydantic models -- no second source of truth to maintain.
- Reports violations at row granularity (feature ID, field path, check name, offending value) so problems are actionable.
- Surfaces schema drift (extra columns, missing columns, type mismatches) separately from value-level violations.
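The last requirement, keeping schema drift apart from value-level violations, can be illustrated with a minimal sketch. It is plain Python over `{column name: type name}` mappings rather than Spark schemas, and `SchemaDiff` and `compare_schemas` are hypothetical names used only for illustration, not the package's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    """Column-level drift between an expected and an observed schema (hypothetical)."""

    missing: list[str] = field(default_factory=list)  # expected but absent
    extra: list[str] = field(default_factory=list)  # present but not expected
    # column name -> (expected type, actual type)
    mismatched: dict[str, tuple[str, str]] = field(default_factory=dict)

    @property
    def ok(self) -> bool:
        return not (self.missing or self.extra or self.mismatched)


def compare_schemas(expected: dict[str, str], actual: dict[str, str]) -> SchemaDiff:
    """Compare two {column name: type name} mappings and report drift."""
    diff = SchemaDiff()
    diff.missing = sorted(expected.keys() - actual.keys())
    diff.extra = sorted(actual.keys() - expected.keys())
    for name in sorted(expected.keys() & actual.keys()):
        if expected[name] != actual[name]:
            diff.mismatched[name] = (expected[name], actual[name])
    return diff


# Illustrative column names and types only.
expected = {"id": "string", "confidence": "double", "names": "struct"}
actual = {"id": "string", "confidence": "float", "ext_source": "string"}
diff = compare_schemas(expected, actual)
print(diff)
```

Reporting the three drift categories separately lets a caller decide, for example, to tolerate extra columns while still failing on type mismatches.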
Proposal
Generate PySpark Column expressions from the Pydantic models, and ship them as a runtime package with a CLI.
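As a rough sketch of what "generate PySpark Column expressions" could mean, the snippet below turns simplified constraint metadata into PySpark expression source text. `FieldConstraint` and `emit_checks` are hypothetical stand-ins, not the real extraction layer or generator. The emitted strings assume `from pyspark.sql import functions as F` in the generated module, and null handling is deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldConstraint:
    """Hypothetical, simplified constraint metadata for one field."""

    path: str
    ge: Optional[float] = None  # value must be >= ge
    le: Optional[float] = None  # value must be <= le
    pattern: Optional[str] = None  # value must match this regex


def emit_checks(c: FieldConstraint) -> dict[str, str]:
    """Return {check name: expression source}; each expression is true on violation."""
    col = f"F.col({c.path!r})"
    checks: dict[str, str] = {}
    if c.ge is not None:
        checks[f"{c.path}:ge"] = f"{col} < F.lit({c.ge!r})"
    if c.le is not None:
        checks[f"{c.path}:le"] = f"{col} > F.lit({c.le!r})"
    if c.pattern is not None:
        checks[f"{c.path}:pattern"] = f"~{col}.rlike({c.pattern!r})"
    return checks


# Illustrative field and bounds only.
generated = emit_checks(FieldConstraint("confidence", ge=0.0, le=1.0))
for name, source in generated.items():
    print(f"{name}: {source}")
```

Emitting source text rather than building expressions at runtime keeps the generated modules reviewable as ordinary Python files.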
Two packages working together (one new, one extended)
overture-schema-pyspark -- runtime. Hand-written shared utilities (constraint_expressions, column_patterns, schema-comparison, validation pipeline, CLI) plus a generated per-feature expression module tree under expressions/generated/. Public API: validate_feature(df, type) -> ValidationResult, plus explain_errors() to UNPIVOT row-level results into one row per violation.
overture-schema-codegen (extended) -- a pyspark/ subpackage that walks the existing FeatureSpec trees and emits per-feature expression modules and per-feature conformance test modules. The pipeline reuses the extraction layer already in place for the Markdown renderer; PySpark is just a new output target.
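The UNPIVOT step that explain_errors() performs can be pictured in plain Python. The sketch assumes, purely for illustration, a wide result with one column per check that holds the offending value or None; the real encoding of row-level results is not specified here, and `explain_rows` is a hypothetical name.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator


def explain_rows(
    rows: Iterable[dict[str, Any]], check_columns: list[str]
) -> Iterator[dict[str, Any]]:
    """Turn one wide row per feature into one long row per violation.

    Assumes check columns are named "<field>:<check>" and hold the
    offending value, or None when the check passed (an assumption).
    """
    for row in rows:
        for column in check_columns:
            value = row.get(column)
            if value is None:
                continue  # check passed for this row
            field_path, _, check = column.partition(":")
            yield {"id": row["id"], "field": field_path, "check": check, "value": value}


# Illustrative data only.
wide = [
    {"id": "a1", "confidence:le": 1.5, "country:pattern": None},
    {"id": "b2", "confidence:le": None, "country:pattern": "usa"},
    {"id": "c3", "confidence:le": None, "country:pattern": None},
]
long_rows = list(explain_rows(wide, ["confidence:le", "country:pattern"]))
for violation in long_rows:
    print(violation)
```

Rows with no violations produce no output rows, which is what makes the long form convenient for counting and filtering.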
CLI
overture-validate <feature-type> <parquet-or-directory>
One row per violation: feature ID, theme/type, failing field, check name, offending value. Single-pass evaluation keeps memory bounded. Supports --count-only, --head, --suppress, --skip-schema-check, --ignore-columns, and --skip-extra-columns for working with real releases that have known drift.
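To make the relationship between the per-violation rows and a --count-only style summary concrete, here is a small plain-Python sketch. The row shape and the `count_by_check` helper are assumptions for illustration; the CLI itself would aggregate in Spark.

```python
from collections import Counter

# Illustrative violation rows, one per violation.
violations = [
    {"id": "a1", "field": "confidence", "check": "le", "value": 1.5},
    {"id": "b2", "field": "confidence", "check": "le", "value": 2.0},
    {"id": "c3", "field": "country", "check": "pattern", "value": "usa"},
]


def count_by_check(rows):
    """Count violations per (field, check) pair."""
    return Counter((row["field"], row["check"]) for row in rows)


counts = count_by_check(violations)
for (field_path, check), n in counts.most_common():
    print(f"{field_path}\t{check}\t{n}")
```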
Conformance gate
The codegen also emits a conformance test module per feature (tests/generated/.../test_<feature>.py). Each scenario mutates a valid base row to violate exactly one constraint and asserts the expected check fires. The expectations are derived from the same Pydantic models the expressions are derived from, frozen at codegen time -- so a codegen change that produces different expressions breaks the regenerated tests until the expected scenarios are also regenerated. This is what keeps the two surfaces from drifting.
Conformance tests are run once per feature on a shared DataFrame (O(checks), not O(checks * scenarios)) so the suite stays affordable even for segment, which has the deepest nesting and produces three arm-specific test files.
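The scenario pattern described above can be sketched in plain Python: each scenario patches a valid base row to break exactly one constraint, all scenario rows go into one shared batch, and every check is evaluated once over that batch. The checks, field names, and helpers below are hypothetical; the generated tests would use a Spark DataFrame and the generated expressions instead.

```python
import re

# A valid base row (illustrative fields only).
BASE_ROW = {"id": "base", "confidence": 0.5, "country": "US"}

# check name -> predicate that returns True when the row violates the check
CHECKS = {
    "confidence:ge": lambda r: r["confidence"] < 0.0,
    "confidence:le": lambda r: r["confidence"] > 1.0,
    "country:pattern": lambda r: re.fullmatch(r"[A-Z]{2}", r["country"]) is None,
}

# (expected check, patch applied to the base row)
SCENARIOS = [
    ("confidence:ge", {"confidence": -0.1}),
    ("confidence:le", {"confidence": 1.5}),
    ("country:pattern", {"country": "usa"}),
]


def build_batch(base, scenarios):
    """Build one shared batch: the base row plus one mutated row per scenario."""
    rows = [dict(base)]
    for i, (_, patch) in enumerate(scenarios):
        rows.append({**base, **patch, "id": f"scenario-{i}"})
    return rows


def evaluate(rows, checks):
    """Evaluate each check once over the whole batch -> {row id: fired checks}."""
    fired = {row["id"]: set() for row in rows}
    for name, violates in checks.items():
        for row in rows:
            if violates(row):
                fired[row["id"]].add(name)
    return fired


results = evaluate(build_batch(BASE_ROW, SCENARIOS), CHECKS)
assert results["base"] == set()  # the unmodified row is clean
for i, (expected, _) in enumerate(SCENARIOS):
    assert results[f"scenario-{i}"] == {expected}  # exactly one check fires
```

Asserting set equality rather than membership also catches a mutation that accidentally trips a second, unrelated check.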
Non-goals
- Not a replacement for Pydantic validation in producer pipelines -- Pydantic still owns single-row validation, and is the source of truth.
- Not a JSON Schema generator. JSON Schema baselines already exist via a separate path.
- No new constraint types in this work. Anything Pydantic can express today, the generator targets; anything it can't is out of scope here.
How to experience it
After the PR lands and make install runs:
- overture-validate --help lists every supported feature type.
- Point it at a small local Parquet file: overture-validate segment samples/segment.parquet --count-only. The output is a count of violations per (field, check) pair.
- Drop --count-only and the first 20 violation rows print with feature ID, field path, check name, message, and offending value.
- Run it against a real release prefix: overture-validate place s3a://overturemaps-us-west-2/release/<release>/ -- expect schema mismatches (e.g. Float vs Double on bbox) that --skip-schema-check lets you push past.
- Browse packages/overture-schema-pyspark/src/overture/schema/pyspark/expressions/generated/overture/schema/transportation/segment.py to see the generated expressions for the hardest feature in the schema -- nested vehicle scoping, discriminator gating, three subtype arms -- and the matching test_segment_*.py files that pin every check to a scenario.
Acceptance
- overture-schema-pyspark installs and exposes the documented public API.
- overture-validate runs against a local Parquet file and a real release prefix.
- ... FeatureSpec extraction and emits generated trees inside a single generated/ boundary that make generate-pyspark wipes and recreates.
- ... packages/overture-schema-pyspark/README.md and packages/overture-schema-codegen/docs/design.md.