Migrate DISJOIN to a registered expander and solve the star-REPLACE portability limitation (Closes #143) #159

Draft
conradbzura wants to merge 4 commits into main from 143-migrate-disjoin-expander

@conradbzura

Collaborator

Summary

Migrate DISJOIN from the emit-time BaseGIQLGenerator.giqldisjoin_sql string special-case (left by epic #114) to a generic registered AST expander, and make the full-row passthrough capability-driven so non-canonical encodings produce portable SQL on engines that lack SELECT * REPLACE.

Register one expand_disjoin against GenericTarget, so every target resolves to it through the registry's generic chain. The expander assembles the same __giql_dj_* WITH-CTE subquery, parses it back into a sqlglot expression, and returns that node for the active target's serializer to render — dissolving the emit-time string special-case. A single capability branch on ctx.capabilities.supports_star_replace selects the passthrough projection form: emit t.* REPLACE (...) on a target that supports it (DuckDB), and the portable t.* EXCEPT (start, end), <recomputed start>, <recomputed end> form otherwise, which every * EXCEPT-capable engine plans. The portable branch is what adds DataFusion support for non-canonical DISJOIN passthrough. Input canonicalization stays owned by CanonicalizeCoordinates (pass 2, #122) — the expander consumes already-canonical 0-based half-open columns and only round-trips the output back into the target's declared encoding.
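
The capability branch can be pictured with a minimal sketch. Everything named here is illustrative rather than taken from the PR's code: `Capabilities` stands in for `ctx.capabilities`, `passthrough_projection` is a hypothetical helper, the quoted `"start"` / `"end"` column names are assumed, and the real expander builds a sqlglot expression rather than a string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical stand-in for ctx.capabilities; only this flag matters here.
    supports_star_replace: bool


def passthrough_projection(caps, start_sql, end_sql, identity=False):
    """Return the SELECT list for the full-row passthrough (sketch only).

    start_sql / end_sql are expressions that map the canonical 0-based
    half-open bounds back into the target's declared encoding.
    """
    if identity:
        # 0-based half-open targets need no rewrite: plain star fast path.
        return "t.*"
    if caps.supports_star_replace:
        # REPLACE swaps the two columns in place, so column order is kept.
        return f't.* REPLACE ({start_sql} AS "start", {end_sql} AS "end")'
    # Portable form: drop the two columns, then append the recomputed ones.
    return (
        f't.* EXCEPT ("start", "end"), '
        f'{start_sql} AS "start", {end_sql} AS "end"'
    )


# A 1-based closed target: start shifts by one, end is unchanged.
duckdb_form = passthrough_projection(
    Capabilities(True), "disjoin_start + 1", "disjoin_end"
)
portable_form = passthrough_projection(
    Capabilities(False), "disjoin_start + 1", "disjoin_end"
)
```

Note that the two forms are not column-order equivalent: REPLACE keeps the interval columns where they were, while the EXCEPT form moves them to the end of the row.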

Fix #153 by aliasing all four projected columns (kc / ks / ke / pos) in every __giql_dj_cuts UNION branch. Previously the bare de-canonicalized end column and the end-cut expression collided under one output name in the default 0-based half-open identity case; DuckDB tolerated the duplicate, but DataFusion rejected it as a non-unique projection name. With every branch aliased, the projection is internally unique on strict engines and behaviour-preserving on DuckDB. As a result, the cross-target oracle's previously-pinned _pending_153 expected-failure is promoted to a real three-target identity test.

Part of epic #137, wave 3. This PR carries the same ExpanderRegistry.snapshot()/restore() seam as its sibling wave-3 PRs, so the duplicate needs deduplicating on merge.

Closes #143
Fixes #153

Proposed changes

Registry save/restore seam (src/giql/expander.py)

Add the public ExpanderRegistry.snapshot() / ExpanderRegistry.restore() methods, first introduced for these fixtures. snapshot() returns a fresh shallow copy of the (target, operator) → expander registrations; restore() drops all current entries and re-installs exactly the snapshot contents. This lets an isolating test fixture (or a plugin) capture the import-time baseline, mutate the process-wide REGISTRY around a body, and hand the baseline back afterward so the built-in expanders survive a fixture that would otherwise clear() them permanently.
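
As a rough illustration of the save/restore semantics described above, here is a toy registry. It is not the giql class: the `register` / `resolve` method names, the private `_entries` dict, and the `isolated` context manager are assumptions made for the sketch.

```python
from contextlib import contextmanager


class ExpanderRegistry:
    """Toy registry keyed by (target, operator); a sketch, not giql's class."""

    def __init__(self):
        self._entries = {}

    def register(self, target, operator, expander):
        self._entries[(target, operator)] = expander

    def resolve(self, target, operator):
        return self._entries.get((target, operator))

    def clear(self):
        self._entries.clear()

    def snapshot(self):
        # Fresh shallow copy: later registrations do not leak into it.
        return dict(self._entries)

    def restore(self, snapshot):
        # Drop everything, then reinstall exactly the captured entries.
        self._entries.clear()
        self._entries.update(snapshot)


@contextmanager
def isolated(registry):
    """Hypothetical fixture: empty registry inside, baseline back afterward."""
    baseline = registry.snapshot()
    registry.clear()
    try:
        yield registry
    finally:
        registry.restore(baseline)


registry = ExpanderRegistry()
registry.register("generic", "DISJOIN", "builtin-expander")

with isolated(registry) as scratch:
    scratch.register("generic", "OTHER", "test-only-expander")
    inside = scratch.resolve("generic", "DISJOIN")  # None: registry was cleared

after_builtin = registry.resolve("generic", "DISJOIN")
after_test_only = registry.resolve("generic", "OTHER")
```

Without `snapshot()` the fixture could only `clear()` on exit, which would drop the import-time built-ins for the rest of the process.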

giql.expanders package + DISJOIN expander (src/giql/expanders/__init__.py, src/giql/expanders/disjoin.py)

Add the giql.expanders package whose __init__ auto-imports every submodule via pkgutil.iter_modules, so dropping a <operator>.py into the package registers its expander as an import side effect without editing the package file. Add disjoin.py with the @register(GenericTarget, GIQLDisjoin) expander and its helpers (_build_disjoin_sql, _disjoin_passthrough, _disjoin_output_encoding, _disjoin_resolution), carrying over the original resolution-unpacking and historical diagnostics verbatim. The passthrough is the capability-driven form described in the summary; the identity 0-based half-open case stays a plain t.* fast path.
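
The auto-import pattern can be sketched as follows. The helper name `import_submodules` is hypothetical and the actual `__init__` may be written differently; only `pkgutil.iter_modules` and `importlib.import_module` are real standard-library calls.

```python
import importlib
import pkgutil


def import_submodules(package_name, package_path):
    """Import every direct submodule of a package and return their names.

    Importing a module runs its top-level code, which is where an
    @register(...) decorator would add the expander to the registry.
    """
    imported = []
    for info in pkgutil.iter_modules(package_path):
        full_name = f"{package_name}.{info.name}"
        importlib.import_module(full_name)
        imported.append(full_name)
    return sorted(imported)


# Inside a package's __init__.py this would be called as:
#     import_submodules(__name__, __path__)
```

The trade-off is that registration becomes an import side effect: adding a module is enough, but nothing is registered until the package itself has been imported.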

#153 alias fix (isolated in disjoin.py's __giql_dj_cuts assembly)

Alias kc / ks / ke / pos in all three __giql_dj_cuts UNION branches. This is an isolated, cherry-pickable change: it only adds aliases to existing projections and does not depend on the migration or the capability branch.
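
A small sketch of the per-branch aliasing, run on SQLite only because it ships with Python. The table `t`, the two branches shown, and the `cut_branch` helper are invented for illustration; the real CTE has three branches whose exact expressions are not spelled out here, and SQLite, like DuckDB, does not reproduce DataFusion's rejection.

```python
import sqlite3


def cut_branch(pos_sql, source="t"):
    # Every branch names all four outputs, so no branch depends on the
    # first branch's aliases to keep its own projection names unique.
    return (
        f'SELECT {source}.chrom AS kc, {source}."start" AS ks, '
        f'{source}."end" AS ke, {pos_sql} AS pos FROM {source}'
    )


def cuts_sql():
    # Hypothetical start-cut and end-cut branches. In the end-cut branch
    # both ke and pos read t."end", which is the pair that shared one
    # output name when the branch was left unaliased.
    return " UNION ".join([cut_branch('t."start"'), cut_branch('t."end"')])


def output_names(conn, sql):
    return [col[0] for col in conn.execute(sql).description]


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE t (chrom TEXT, "start" INTEGER, "end" INTEGER)')
conn.execute("INSERT INTO t VALUES ('chr1', 10, 20)")

names = output_names(conn, cuts_sql())
rows = sorted(conn.execute(cuts_sql()).fetchall())
```

The UNION's output names still come from the first branch, which is why the added aliases do not change results on an engine that already accepted the query.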

GIQLDisjoin.GIQL_EXPAND flip + legacy deletion (src/giql/expressions.py, src/giql/generators/base.py, src/giql/transpile.py)

Flip GIQLDisjoin.GIQL_EXPAND from the disabled sentinel to True, so the ExpandOperators pass replaces the node with the expander's AST. Delete BaseGIQLGenerator.giqldisjoin_sql and the DISJOIN-only generator helpers (_disjoin_resolution, _disjoin_passthrough, _disjoin_output_encoding) plus the now-unused GIQLDisjoin import. Wire import giql.expanders in transpile.py so the registry is populated before the first transpile.
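
The dispatch that the flag enables might look roughly like this. The node classes, the dict-based `REGISTRY`, and the use of the target class's MRO to model the generic chain are all assumptions for the sketch; leaving a flagged node untouched when no expander resolves is also a guess, not documented behaviour.

```python
class GenericTarget:
    pass


class DuckDBTarget(GenericTarget):
    pass


class Node:
    GIQL_EXPAND = False  # stand-in for the disabled sentinel

    def __init__(self, *children):
        self.children = list(children)


class Disjoin(Node):
    GIQL_EXPAND = True  # the flip that hands the node to the pass


class Expanded(Node):
    def __init__(self, label):
        super().__init__()
        self.label = label


REGISTRY = {}


def register(target, operator):
    def decorator(fn):
        REGISTRY[(target, operator)] = fn
        return fn
    return decorator


def resolve(target, operator):
    # Walk the target's MRO so a specific target falls back to GenericTarget.
    for cls in target.__mro__:
        expander = REGISTRY.get((cls, operator))
        if expander is not None:
            return expander
    return None


def expand_operators(node, target):
    # Expand children first, then replace this node if it is flagged.
    node.children = [expand_operators(c, target) for c in node.children]
    if type(node).GIQL_EXPAND is True:
        expander = resolve(target, type(node))
        if expander is not None:
            return expander(node)
    return node


@register(GenericTarget, Disjoin)
def expand_disjoin(node):
    return Expanded("disjoin-subquery")


tree = expand_operators(Node(Disjoin(), Node()), DuckDBTarget)
```

Here `DuckDBTarget` has no registration of its own, so the lookup lands on the single `GenericTarget` entry, mirroring the one-expander-for-every-target arrangement.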

Test updates

Update test_disjoin_transpilation.py, test_canonicalizer.py, and test_expander.py for the registry-driven path, and add the capability-passthrough and snapshot()/restore() coverage. Two execute-on-engine harnesses now transpile with the engine dialect: test_usage_patterns.py (_execute) and coordinate_space/conftest.py (giql_query) pass dialect=engine/dialect="duckdb", because a non-canonical DISJOIN passthrough emits * EXCEPT for the generic target and * EXCEPT is not DuckDB-runnable — the SQL must be shaped for the engine it executes on. Promote the cross-target oracle's test_disjoin_on_datafusion_unsupported_pending_153 expected-failure to test_disjoin_agrees_across_all_targets, a real three-target identity test.

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestDisjoinCanonicalization | A self-mode DISJOIN over a 1-based closed target, transpiled for the generic target | Transpiling to SQL | The passthrough de-canonicalizes via a portable * EXCEPT projection that drops and re-projects the interval columns | Portable passthrough on engines without REPLACE |
| 2 | TestDisjoinCanonicalization | A self-mode DISJOIN over a 1-based closed target, transpiled for the DuckDB target | Transpiling to SQL | The passthrough de-canonicalizes via a star REPLACE on the final projection | REPLACE passthrough on DuckDB |
| 3 | TestDisjoinTranspilation | A DISJOIN over a registered target | Transpiling to SQL | Emits a parenthesized WITH-CTE subquery with the disjoin_chrom / disjoin_start / disjoin_end columns | Registry-expanded CTE shape |
| 4 | TestDisjoinTranspilation | A DISJOIN with the reference omitted | Transpiling to SQL | Defaults the reference to the target set and skips the coverage EXISTS clause | Self-reference coverage skip |
| 5 | TestDisjoinTranspilation | A DISJOIN whose reference is a distinct table or shadowing CTE | Transpiling to SQL | Emits the coverage EXISTS clause against the reference | Coverage filter emission |
| 6 | TestDisjoinTranspilation | A DISJOIN target or reference using the reserved __giql_dj_ prefix, or an unknown reference name | Transpiling to SQL | Re-raises DISJOIN's historical diagnostics verbatim | Diagnostic parity |
| 7 | TestExpanderRegistry | A registry with one entry captured by snapshot() | A second entry is registered afterward | The snapshot holds only the first entry, being a copy not a live view | snapshot() independence |
| 8 | TestExpanderRegistry | A snapshot taken, then the registry cleared and a different entry registered | restore() is called with the snapshot | The original entry resolves again and the post-snapshot entry is gone | restore() semantics |
| 9 | TestExpandOperatorsPass | A flagged operator with a registered expander | The pass transforms the AST | Dispatches to the registered expander and replaces the node | GIQL_EXPAND dispatch |
| 10 | TestNoOpWhenFlagsOff | A DISJOIN query with pass 2 bypassed but pass 3 kept | Comparing canonicalizer output to the expanded baseline | Pass 2 contributes nothing, the byte-identical comparison isolating it | Canonicalizer no-op isolation |
| 11 | TestCrossTargetOracleDisjoin | Two overlapping intervals on chr1 | The oracle runs generic, datafusion, and duckdb targets | Every target returns identical sub-segments, proving DISJOIN runs on DataFusion and agrees with DuckDB | Three-target identity (#143 / #153) |
| 12 | TestCrossTargetOracleDisjoin | Two overlapping intervals | DISJOIN splits them on DuckDB | Returns the expected split sub-segments | DuckDB split correctness |

ExpanderRegistry only exposed clear() for resetting state, which forces an
isolating test fixture to choose between leaving the registry empty
afterward or knowing its prior contents. Once built-in expanders register
themselves at import, clearing permanently drops them for the rest of the
process.

Add snapshot() to capture the current registrations as a fresh mapping and
restore() to reinstall a captured baseline, giving fixtures and plugins a
public save/restore seam that survives a clear() in the middle.

DISJOIN emitted its WITH-CTE subquery from the giqldisjoin_sql string
emitter on BaseGIQLGenerator, an emit-time special case left by the
operator-pass epic. Move it onto the ExpandOperators pass so a registered
expander returns the subquery as a sqlglot AST and the active target's
serializer renders it.

Add the giql.expanders package, which auto-imports every submodule so each
expander self-registers at import; wire that import into transpile so the
registry is populated before the first transpile. Register expand_disjoin
for the generic target as the portable fallback every target resolves to
through the registry chain.

Make the full-row passthrough capability-driven: a target that supports
SELECT * REPLACE (DuckDB) keeps the REPLACE form, while a target without it
(DataFusion, the generic baseline) emits the portable * EXCEPT projection
plus the two recomputed interval columns. Flip GIQLDisjoin.GIQL_EXPAND on
so the pass takes over, and delete the now-dead giqldisjoin_sql emitter
along with its DISJOIN-only resolution and passthrough helpers.

Update the tests to transpile for the executing engine's dialect, drive the
expansion pass in the canonicalizer no-op checks, exercise both passthrough
forms, and treat the import-time built-in registrations as the registry
baseline rather than an empty registry.

The __giql_dj_cuts CTE built its cut positions from three UNION branches,
but only the first branch aliased its four projected columns. In the
default 0-based half-open identity case the end column de-canonicalizes to
the bare physical column, so the unaliased branches projected t."end"
alongside the end-cut expression under the same output name. DuckDB
tolerated the duplicate; DataFusion rejected the projection for non-unique
expression names, so DISJOIN could not run there.

Alias all four columns in every branch. The output names still come from
the first branch, so this is behaviour-preserving on DuckDB while making
each branch internally unique for strict engines. Promote the oracle's
previously expected-failure DISJOIN case to a real cross-target identity
test now that the query runs and agrees on every target.
@conradbzura conradbzura self-assigned this Jun 28, 2026
Add a non-canonical cross-target oracle case so the portable star-EXCEPT passthrough executes on DataFusion, plus an engine-free regression pinning the per-branch cuts-CTE aliases. Document the REPLACE-vs-EXCEPT column-order divergence, centralize the DISJOIN prefix in a constants module, parse with parse_one, type the expander node, and restore the dropped rationale comments. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.