Promote operator expansion to an ExpandOperators AST pass and reduce the generator to a stock serializer #137

Description

@conradbzura

Make the SQL target dialect a first-class concept that drives both engine-compatible emission and engine-specific optimization, by promoting GIQL operator expansion out of the string-emitting generator into a registry of per-dialect, AST-producing operator expanders. The end state: BaseGIQLGenerator becomes a stock sqlglot serializer, every operator emits standard sqlglot AST chosen per target, and a public registration hook lets users add or override targets.

parse → ResolveOperatorRefs → CanonicalizeCoordinates → ExpandOperators → <per-target stock serializer>
         (pass 1, #114)         (pass 2, #114)            (this epic)        (this epic)
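
The ordering in the diagram above can be read as plain function composition over the tree. The following is a minimal sketch only: `Tree`, `make_pass`, `PIPELINE`, and `run_pipeline` are hypothetical names, and the placeholder passes merely record that they ran, whereas the real passes operate on sqlglot AST.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Tree:
    """Toy stand-in for a parsed query; the real pipeline passes sqlglot AST."""
    sql: str
    applied: List[str] = field(default_factory=list)


Pass = Callable[[Tree], Tree]


def make_pass(name: str) -> Pass:
    """Build a placeholder pass that only records that it ran."""
    def run(tree: Tree) -> Tree:
        tree.applied.append(name)
        return tree
    return run


# Order matters: expansion runs on resolved, canonicalized input.
PIPELINE: List[Pass] = [
    make_pass("ResolveOperatorRefs"),
    make_pass("CanonicalizeCoordinates"),
    make_pass("ExpandOperators"),
]


def run_pipeline(tree: Tree) -> Tree:
    """Apply each pass in order, feeding the output of one into the next."""
    for step in PIPELINE:
        tree = step(tree)
    return tree
```

Serialization is deliberately absent from the list: in the end state it is the stock per-target serializer applied to whatever `run_pipeline` returns.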

Architecture (four seams):

  1. Target model + capabilities. Each dialect is a class — GenericTarget, DuckDBTarget, DataFusionTarget — carrying a capability set (supports_lateral, supports_star_replace, supports_qualify, range_join_strategy, …) and the sqlglot output dialect for serialization. Portable choices become capability lookups, not scattered if dialect == ... branches: the star-REPLACE-vs-explicit-projection decision and the LATERAL-vs-window-function decision are driven by capabilities.

  2. Operator-expander protocol + registry. An OperatorExpander is expand(node, ctx) -> exp.Expression: it takes a GIQL operator node plus an ExpansionContext (pass-1 resolved metadata, the active target/capabilities, alias minting, tables) and returns standard sqlglot AST. The registry is keyed by (target, operator_type) with a fallback chain (target, op) → (generic, op) → legacy *_sql emitter. Write a generic expander once; override per-dialect only where the engine genuinely differs (a limitation or an optimization). The registry is the public extension hook: users register their own target or override (their_target, op) via a decorator, e.g. @register(DuckDBTarget, GIQLDisjoin).

  3. ExpandOperators pass. Runs after CanonicalizeCoordinates; walks the AST and replaces each GIQL operator node with the expansion the registry returns for the active target.

  4. transpile() wiring + feature flag. The dialect param resolves to a Target (backward-compatible: None→Generic, "duckdb"→DuckDB; add "datafusion"). Incremental migration uses a per-operator GIQL_EXPAND class attribute — mirroring the proven GIQL_CANONICALIZE pattern from epic Introduce pre-generation AST normalization pipeline for operator resolution and coordinate canonicalization #114 — so an operator takes the new path only when it is flagged AND an expander is registered, otherwise the legacy emitter. Each migration PR flips one operator, behavior-preserving until the last.
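
The four seams can be sketched together in a few lines. This is an illustration under stated assumptions, not the planned implementation: the class and flag names come from the text above, but `Capabilities`, `resolve`, `projection_style`, the capability values, and the string-returning expanders are hypothetical (the real protocol returns `exp.Expression`).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Capabilities:
    # Field names follow the issue text; the default values are assumptions.
    supports_lateral: bool = False
    supports_star_replace: bool = False


class GenericTarget:
    capabilities = Capabilities()


class DuckDBTarget(GenericTarget):
    capabilities = Capabilities(supports_lateral=True, supports_star_replace=True)


class DataFusionTarget(GenericTarget):
    # The issue names DataFusion as a no-REPLACE target; the rest is assumed.
    capabilities = Capabilities()


class GIQLDisjoin:
    GIQL_EXPAND = True   # flagged: may take the new path


class GIQLNearest:
    GIQL_EXPAND = False  # not migrated yet: always the legacy emitter


# Placeholder signature; real expanders return sqlglot AST, not strings.
Expander = Callable[[object, object], str]
_REGISTRY: Dict[Tuple[type, type], Expander] = {}


def register(target: type, operator: type):
    """Decorator that stores an expander under (target, operator_type)."""
    def wrap(fn: Expander) -> Expander:
        _REGISTRY[(target, operator)] = fn
        return fn
    return wrap


def resolve(target: type, operator: type) -> Optional[Expander]:
    """Fallback chain: (target, op), then (generic, op), else None (legacy)."""
    if not getattr(operator, "GIQL_EXPAND", False):
        return None
    for key in ((target, operator), (GenericTarget, operator)):
        if key in _REGISTRY:
            return _REGISTRY[key]
    return None


def projection_style(target: type) -> str:
    """A capability lookup instead of an `if dialect == ...` branch."""
    return "star_replace" if target.capabilities.supports_star_replace else "explicit"


@register(GenericTarget, GIQLDisjoin)
def generic_disjoin(node, ctx):
    return "generic disjoin expansion"


@register(DuckDBTarget, GIQLDisjoin)
def duckdb_disjoin(node, ctx):
    return "duckdb disjoin expansion"


@register(GenericTarget, GIQLNearest)
def generic_nearest(node, ctx):
    return "generic nearest expansion"
```

In this sketch `resolve(DataFusionTarget, GIQLDisjoin)` falls through to the generic expander, while `resolve(DuckDBTarget, GIQLNearest)` returns `None` even though an expander is registered, because the operator is not flagged. That is the "flagged AND registered" rule that keeps each migration PR to a single operator.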

Motivation

  • Dialect portability. sqlglot serializes standard AST per target for free (identifier quoting, function spelling, supported-construct syntax). The two dialect walls already hit in Introduce pre-generation AST normalization pipeline for operator resolution and coordinate canonicalization #114 (SUPPORTS_LATERAL/SQLite for NEAREST, and SELECT * REPLACE portability for non-canonical canonicalization) move from f-string special-cases into capability-driven expanders. Caveat: AST expansion makes the syntactic layer cheap; semantic fallbacks (e.g. LATERAL→window-function) still need real work, but that work lives in one centralized place per target rather than tangled into templates.
  • Engine-specific optimization is a first-class slot. The existing hardcoded DuckDB IEJoin path is the precedent: IntersectsDuckDBIEJoinTransformer becomes the (DuckDBTarget, Intersects) expander and IntersectsBinnedJoinTransformer the (GenericTarget, Intersects) one — the dialect="duckdb" early-return in transpile() dissolves into the registry.
  • Pure generator + extensibility. The generator stops being a string-template macro expander; output de-canonicalization (the Introduce pre-generation AST normalization pipeline for operator resolution and coordinate canonicalization #114 leftover, decanonical_* on synthesized columns) dissolves into AST. The registration pattern is exposed as a supported extension hook for users targeting their own engines.

Staged migration plan

Each step is a child sub-issue; each lands independently and is behaviour-preserving (guarded by the per-operator GIQL_EXPAND flag and checked against the result-snapshot oracle) until step 10 removes the flag and the legacy path.

  1. Target + capability model; dialect param → Target. (Make dialect a first-class target selector driving engine-specific optimization and compatible SQL emission #132) Define GenericTarget/DuckDBTarget/DataFusionTarget, the capability descriptors, and resolve transpile()'s dialect param to a Target. Backward compatible; no expansion yet.
  2. Expander protocol + registry + ExpandOperators pass scaffolding + GIQL_EXPAND flag. Land the registration decorator, the registry with the (target → generic → legacy) fallback chain, the ExpansionContext, the pass, and the per-operator flag. With nothing flagged it is a strict no-op. Includes registry unit tests and an extension-hook test (register a fake custom target, assert dispatch).
  3. Result-oracle test harness + DataFusion integration lane. A cross-target result-identity helper and a DataFusion integration suite (deps already present; only test_binned_join.py exercises it today) so every later migration proves DuckDB≡DataFusion≡expected.
  4. DISTANCE (proof-of-concept — single CASE). Generic expander; flip GIQL_EXPAND; verify cross-target identity; delete its *_sql method.
  5. INTERSECTS / CONTAINS / WITHIN + set predicates. Generic expanders; fold IntersectsDuckDBIEJoinTransformer → (DuckDB, Intersects) and IntersectsBinnedJoinTransformer → (Generic, Intersects); remove the duckdb early-return in transpile().
  6. NEAREST. Generic LATERAL expander + capability-driven window-function fallback for no-lateral targets; row passthrough as AST.
  7. DISJOIN. WITH-CTE expansion as AST; full-row passthrough and output de-canonicalization as AST; capability-driven canonicalization output (explicit portable projection for no-REPLACE targets like DataFusion, * REPLACE for DuckDB) — this resolves the star-REPLACE portability limitation documented in Port DISJOIN from in-emitter canonicalization to CanonicalizeCoordinates output #122.
  8. CLUSTER / MERGE. Relocate ClusterTransformer/MergeTransformer into the registry as generic expanders for consistency.
  9. DataFusion target completion + dialect-aware canonicalization finalization. Verify every operator has a DataFusion path; move the canonicalizer's REPLACE-vs-explicit decision fully behind capabilities; complete DataFusion integration coverage.
  10. Generator reduction + remove feature flag + extension-hook docs. Delete the now-dead *_sql methods; reduce/replace BaseGIQLGenerator with per-target stock serializers; remove the migration flag and legacy path; document the registration extension hook (registering a custom target, overriding an operator). Closes the epic.
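
The cross-target result-identity helper from step 3 might look roughly like the sketch below. The name `assert_result_identity`, its signature, and the choice to compare rows as an order-insensitive multiset are all assumptions; the real harness may need ordered comparison for queries with ORDER BY.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Row = Tuple[Hashable, ...]


def assert_result_identity(
    results: Dict[str, Iterable[Row]],
    expected: Iterable[Row],
) -> None:
    """Raise AssertionError unless every target's rows match `expected`.

    Rows are compared as multisets, so row order is ignored but
    duplicate rows still count.
    """
    want = Counter(expected)
    mismatched = sorted(
        name for name, rows in results.items() if Counter(rows) != want
    )
    if mismatched:
        raise AssertionError(f"targets differ from expected: {mismatched}")
```

A migration test would run the same GIQL query through each target's engine, collect the rows under keys such as `"duckdb"` and `"datafusion"`, and pass them in with the snapshot rows as `expected`.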

Non-goals

  • Re-implementing CLUSTER/MERGE/binned-join logic — those already produce AST and are relocated, not rewritten.
  • Adding SQLite/Postgres targets in this epic (the registry is designed so they are pure additions later: register a target + capability set, override only divergent operators).
  • Operator semantic changes — expansions are behaviour-preserving against the result oracle.


Labels

refactor: Code restructuring without behavior change
