Validate Overture data on Spark against the Pydantic schema #517

## Problem

The Pydantic models in `overture-schema-*` packages are the canonical definition of every constraint in the schema -- value ranges, string patterns, `require_any_of`/`forbid_if`/`require_if` model rules, discriminator gating, unique-items, geometry types, country/region codes, linear-reference ordering, and so on. They run against single Python objects, one at a time.

Most consumers of Overture data work at the scale of a release: tens to hundreds of millions of rows per theme, partitioned across hundreds of Parquet files in S3. There is currently no way to ask "does this release conform to the schema" without round-tripping every row through Python. In practice that means schema drift is caught by downstream consumers, not by us.

We need a validator that:

- Runs the same constraints that Pydantic does, on the same data, at Parquet/Spark scale.
- Stays in lockstep with the Pydantic models -- no second source of truth to maintain.
- Reports violations at row granularity (feature ID, field path, check name, offending value) so problems are actionable.
- Surfaces schema drift (extra columns, missing columns, type mismatches) separately from value-level violations.
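The last requirement, reporting schema drift separately from value-level violations, can be sketched in plain Python. This is a minimal illustration, not the package's schema-comparison code: `SchemaDrift`, `compare_schemas`, and the flat `name -> type string` mapping are all hypothetical stand-ins for whatever representation the real utility uses.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDrift:
    """Column-level differences between an expected and an actual schema."""

    missing: list[str] = field(default_factory=list)
    extra: list[str] = field(default_factory=list)
    # column name -> (expected type, actual type)
    type_mismatches: dict[str, tuple[str, str]] = field(default_factory=dict)

    @property
    def ok(self) -> bool:
        return not (self.missing or self.extra or self.type_mismatches)


def compare_schemas(expected: dict[str, str], actual: dict[str, str]) -> SchemaDrift:
    """Classify drift into missing columns, extra columns, and type mismatches."""
    drift = SchemaDrift()
    drift.missing = sorted(expected.keys() - actual.keys())
    drift.extra = sorted(actual.keys() - expected.keys())
    for name in sorted(expected.keys() & actual.keys()):
        if expected[name] != actual[name]:
            drift.type_mismatches[name] = (expected[name], actual[name])
    return drift


# Hypothetical column names and types, for illustration only.
expected = {"id": "string", "bbox.xmin": "double", "subtype": "string"}
actual = {"id": "string", "bbox.xmin": "float", "legacy_flag": "boolean"}
drift = compare_schemas(expected, actual)
print(drift)
```

Keeping drift in its own structure means a type mismatch on one column does not get mixed into the per-row violation output.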

## Proposal

Generate PySpark Column expressions from the Pydantic models, and ship them as a runtime package with a CLI.

### Two new packages working together

`overture-schema-pyspark` -- runtime. Hand-written shared utilities (`constraint_expressions`, `column_patterns`, schema-comparison, validation pipeline, CLI) plus a generated per-feature expression module tree under `expressions/generated/`. Public API: `validate_feature(df, type)` -> `ValidationResult`, plus `explain_errors()` to UNPIVOT row-level results into one row per violation.
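To make the unpivot step concrete, here is a plain-Python sketch of turning wide row-level results into one record per violation. The wide layout (one column per `field:check` pair holding the offending value, or `None` when the check passed), the column names, and the `unpivot` helper are assumptions made for illustration; the actual `ValidationResult` layout and `explain_errors()` implementation may differ.

```python
from typing import Any, Iterable, Iterator

# Hypothetical wide layout: one column per "field:check" pair.
# None means the check passed; any other value is the offending value.
WIDE_ROWS = [
    {"id": "f1", "confidence:range": None, "country:pattern": None},
    {"id": "f2", "confidence:range": 1.5, "country:pattern": "usa"},
]


def unpivot(
    rows: Iterable[dict[str, Any]], check_columns: list[str]
) -> Iterator[dict[str, Any]]:
    """Yield one record per failed check instead of one record per feature."""
    for row in rows:
        for column in check_columns:
            value = row.get(column)
            if value is None:
                continue  # check passed, nothing to report
            field_path, check = column.split(":", 1)
            yield {"id": row["id"], "field": field_path, "check": check, "value": value}


violations = list(unpivot(WIDE_ROWS, ["confidence:range", "country:pattern"]))
for v in violations:
    print(v)
```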

`overture-schema-codegen` (extended) -- a `pyspark/` subpackage that walks the existing `FeatureSpec` trees and emits per-feature expression modules and per-feature conformance test modules. The pipeline reuses the extraction layer already in place for the Markdown renderer; PySpark is just a new output target.
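As a rough picture of what emitting an expression module could involve, the sketch below renders Python source text from a list of constraints. It does not use the real `FeatureSpec` types: `RangeConstraint`, `PatternConstraint`, `render_check`, `render_module`, the field names, and the `CHECKS` dictionary shape are all hypothetical. The only PySpark calls referenced in the emitted text are `Column.between` and `Column.rlike`, and the sketch itself does not import PySpark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeConstraint:
    field: str
    minimum: float
    maximum: float


@dataclass(frozen=True)
class PatternConstraint:
    field: str
    pattern: str


def render_check(constraint) -> tuple[str, str]:
    """Return (check name, PySpark expression source) for one constraint."""
    if isinstance(constraint, RangeConstraint):
        expr = f"F.col({constraint.field!r}).between({constraint.minimum!r}, {constraint.maximum!r})"
        return f"{constraint.field}:range", expr
    if isinstance(constraint, PatternConstraint):
        expr = f"F.col({constraint.field!r}).rlike({constraint.pattern!r})"
        return f"{constraint.field}:pattern", expr
    raise TypeError(f"unsupported constraint: {constraint!r}")


def render_module(feature: str, constraints) -> str:
    """Render the source text of a per-feature expression module."""
    lines = [
        f'"""Generated checks for {feature}. Do not edit by hand."""',
        "from pyspark.sql import functions as F",
        "",
        "CHECKS = {",
    ]
    for constraint in constraints:
        name, expr = render_check(constraint)
        lines.append(f"    {name!r}: {expr},")
    lines.append("}")
    return "\n".join(lines) + "\n"


source = render_module(
    "place",
    [RangeConstraint("confidence", 0.0, 1.0), PatternConstraint("country", "^[A-Z]{2}$")],
)
print(source)
```

Because the output is ordinary source text, the same extracted constraints can feed another renderer, which is the sense in which PySpark is one more output target.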

### CLI

```bash
overture-validate <feature-type> <parquet-or-directory>
```

By default the output is one row per violation: feature ID, theme/type, failing field, check name, and offending value. Single-pass evaluation keeps memory bounded. The CLI supports `--count-only`, `--head`, `--suppress`, `--skip-schema-check`, `--ignore-columns`, and `--skip-extra-columns` for working with real releases that have known drift.

### Conformance gate

The codegen also emits a conformance test module per feature (`tests/generated/.../test_<feature>.py`). Each scenario mutates a valid base row to violate exactly one constraint and asserts the expected check fires. The expectations are derived from the same Pydantic models the expressions are derived from, frozen at codegen time -- so a codegen change that produces different expressions breaks the regenerated tests until the expected scenarios are also regenerated. This is what keeps the two surfaces from drifting.
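The scenario pattern described above can be sketched without Spark. In this illustration the checks are plain Python predicates standing in for the generated Column expressions, and `CHECKS`, `BASE`, `SCENARIOS`, and the field names are hypothetical; the generated test modules will look different.

```python
import re

# Stand-ins for generated checks: name -> predicate that is True when the row passes.
CHECKS = {
    "confidence:range": lambda row: 0.0 <= row["confidence"] <= 1.0,
    "country:pattern": lambda row: re.fullmatch(r"[A-Z]{2}", row["country"]) is not None,
}

# A valid base row, and scenarios that each break exactly one constraint.
BASE = {"id": "f1", "confidence": 0.9, "country": "US"}
SCENARIOS = [
    ("confidence:range", {"confidence": 1.5}),
    ("country:pattern", {"country": "usa"}),
]


def failing_checks(row: dict) -> set[str]:
    """Names of all checks the row fails."""
    return {name for name, passes in CHECKS.items() if not passes(row)}


def run_scenarios() -> int:
    """Assert the base row is clean and each mutation trips only its own check."""
    assert failing_checks(BASE) == set()
    for expected_check, patch in SCENARIOS:
        mutated = {**BASE, **patch}
        assert failing_checks(mutated) == {expected_check}
    return len(SCENARIOS)


print(run_scenarios(), "scenarios passed")
```

Asserting on the full set of failing checks, rather than only on the expected one, also catches a mutation that accidentally trips a second constraint.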

Conformance tests run once per feature on a shared DataFrame, so the cost is O(checks) evaluations rather than O(checks * scenarios). That keeps the suite affordable even for `segment`, which has the deepest nesting and produces three arm-specific test files.
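The shared-DataFrame idea can be illustrated in plain Python: stack every scenario row into one table, evaluate each check once over the whole table, then compare per-row outcomes with expectations. Everything here is a hypothetical stand-in (a list of dicts instead of a DataFrame, predicates instead of Column expressions, and an `evaluate` helper that counts passes).

```python
import re

CHECKS = {
    "confidence:range": lambda row: 0.0 <= row["confidence"] <= 1.0,
    "country:pattern": lambda row: re.fullmatch(r"[A-Z]{2}", row["country"]) is not None,
}
BASE = {"id": "f1", "confidence": 0.9, "country": "US"}
SCENARIOS = [
    ("confidence:range", {"confidence": -0.1}),
    ("country:pattern", {"country": "usa"}),
]

# One shared table: the base row plus one mutated row per scenario.
# Each entry is (label, expected failing check or None, row).
table = [("base", None, dict(BASE))] + [
    (f"scenario-{i}", expected, {**BASE, **patch})
    for i, (expected, patch) in enumerate(SCENARIOS)
]

passes_over_table = 0


def evaluate(predicate, rows):
    """Evaluate one check over every row in a single pass."""
    global passes_over_table
    passes_over_table += 1
    return [predicate(row) for row in rows]


rows = [row for _, _, row in table]
results = {name: evaluate(predicate, rows) for name, predicate in CHECKS.items()}

# Compare each row's failures with what its scenario expects.
for index, (label, expected, _) in enumerate(table):
    failed = {name for name, column in results.items() if not column[index]}
    assert failed == ({expected} if expected else set()), label

print(f"{len(table)} rows checked in {passes_over_table} passes")
```

The number of passes equals the number of checks regardless of how many scenarios are stacked into the table.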

## Non-goals

- Not a replacement for Pydantic validation in producer pipelines -- Pydantic still owns single-row validation, and is the source of truth.
- Not a JSON Schema generator. JSON Schema baselines already exist via a separate path.
- No new constraint types in this work. Anything Pydantic can express today, the generator targets; anything it can't is out of scope here.

## How to experience it

After the PR lands and `make install` runs:

1. `overture-validate --help` lists every supported feature type.
2. Point it at a small local Parquet file: `overture-validate segment samples/segment.parquet --count-only`. The output is a count of violations per `(field, check)` pair.
3. Drop `--count-only` and the first 20 violation rows print with feature ID, field path, check name, message, and offending value.
4. Run it against a real release prefix: `overture-validate place s3a://overturemaps-us-west-2/release/<release>/` -- expect schema mismatches (e.g. Float vs Double on bbox) that `--skip-schema-check` lets you push past.
5. Browse `packages/overture-schema-pyspark/src/overture/schema/pyspark/expressions/generated/overture/schema/transportation/segment.py` to see the generated expressions for the hardest feature in the schema -- nested vehicle scoping, discriminator gating, three subtype arms -- and the matching `test_segment_*.py` files that pin every check to a scenario.
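For step 2, the per-`(field, check)` count is a simple aggregation over violation records. The sketch below shows that aggregation with `collections.Counter`; the record shape and values are made up for illustration and do not reproduce the CLI's actual output format.

```python
from collections import Counter

# Hypothetical violation records, one per failed check.
violations = [
    {"id": "f2", "field": "confidence", "check": "range"},
    {"id": "f3", "field": "confidence", "check": "range"},
    {"id": "f2", "field": "country", "check": "pattern"},
]

# Group by (field, check) and count occurrences.
counts = Counter((v["field"], v["check"]) for v in violations)

for (field_path, check), n in counts.most_common():
    print(f"{field_path}\t{check}\t{n}")
```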

## Acceptance

- [ ] `overture-schema-pyspark` installs and exposes the documented public API.
- [ ] `overture-validate` runs against a local Parquet file and a real release prefix.
- [ ] Codegen pipeline reuses `FeatureSpec` extraction and emits generated trees inside a single `generated/` boundary that `make generate-pyspark` wipes and recreates.
- [ ] Per-feature conformance test modules exist for every feature and pass.
- [ ] Documented in `packages/overture-schema-pyspark/README.md` and `packages/overture-schema-codegen/docs/design.md`.
