Description
After the gating fix in PR #93 (commit f7dc527) the DuckDB IEJoin dialect falls back to the binned plan whenever the top-level query carries a WITH clause or uses a non-base-table operand on either side of the INTERSECTS join. Extend the dialect to handle these natively. This is the IEJoin-dialect analogue of #104 / #105 / #106 (subquery / CTE / nested-operator support, there for DISJOIN's target).
The mechanism splits along operand kind:
WITH clause — preserve the top-level WITH and emit it on the outer SELECT ... FROM query(getvariable(...)). A CTE backing an INTERSECTS-join input then requires no further work once WITH is preserved and the CTE name is recognized by operand resolution.
- Subquery as a join input (e.g.
JOIN (SELECT * FROM genes WHERE score > 100) b ON ...) — materialize the subquery as an additional internal CTE (__giql_iejoin_sub_<token>), then reference that CTE name in the per-chromosome subqueries' filtered SELECTs. Columns and coordinate-system metadata come from the subquery's projected columns; non-canonical encodings on subquery outputs are unsupported unless the user supplies a wrapping Table declaration.
- CTE as a join input — same handling as subquery, but reference the user's existing CTE name directly under the preserved
WITH.
- GIQL table-function operand (e.g.
JOIN DISJOIN(genes) b ON ...) — recursively transpile the inner operator (DISJOIN / NEAREST / etc.) and inline its SQL as an internal CTE, then proceed as for a subquery operand.
transform_to_sql's with_ and empty-name guards added in f7dc527 are relaxed as each operand kind lands.
Motivation
Composability with the rest of the GIQL operator family (DISJOIN, NEAREST, CLUSTER, MERGE) is the natural next step once the dialect handles base-table operands robustly. Subquery and CTE operands are also common in interval-filtering pipelines (pre-filter genes by score, then overlap against peaks). Until this lands, every such query loses the IEJoin speedup.
Expected outcome
transpile("WITH big AS (...) SELECT a.start FROM peaks a JOIN big b ON a.interval INTERSECTS b.interval", dialect="duckdb") produces dialect SQL with WITH preserved.
transpile("SELECT ... FROM peaks a JOIN (SELECT ... FROM genes) b ON a.interval INTERSECTS b.interval", dialect="duckdb") produces dialect SQL with the subquery materialized as an internal CTE.
transpile("SELECT ... FROM DISJOIN(genes) a, peaks b WHERE a.interval INTERSECTS b.interval", dialect="duckdb") produces dialect SQL with the inner DISJOIN transpiled and inlined.
- New execution tests in
tests/test_duckdb_iejoin.py for each operand kind, including custom-column subquery / CTE outputs.
- Cross-plan equivalence property tests.
docs/transpilation/performance.rst updated to remove these shapes from the fallback list.
Description
After the gating fix in PR #93 (commit f7dc527) the DuckDB IEJoin dialect falls back to the binned plan whenever the top-level query carries a
WITHclause or uses a non-base-table operand on either side of the INTERSECTS join. Extend the dialect to handle these natively. This is the IEJoin-dialect analogue of #104 / #105 / #106 (subquery / CTE / nested-operator support, there for DISJOIN's target).The mechanism splits along operand kind:
WITHclause — preserve the top-levelWITHand emit it on the outerSELECT ... FROM query(getvariable(...)). A CTE backing an INTERSECTS-join input then requires no further work onceWITHis preserved and the CTE name is recognized by operand resolution.JOIN (SELECT * FROM genes WHERE score > 100) b ON ...) — materialize the subquery as an additional internal CTE (__giql_iejoin_sub_<token>), then reference that CTE name in the per-chromosome subqueries' filtered SELECTs. Columns and coordinate-system metadata come from the subquery's projected columns; non-canonical encodings on subquery outputs are unsupported unless the user supplies a wrappingTabledeclaration.WITH.JOIN DISJOIN(genes) b ON ...) — recursively transpile the inner operator (DISJOIN / NEAREST / etc.) and inline its SQL as an internal CTE, then proceed as for a subquery operand.transform_to_sql'swith_and empty-name guards added in f7dc527 are relaxed as each operand kind lands.Motivation
Composability with the rest of the GIQL operator family (DISJOIN, NEAREST, CLUSTER, MERGE) is the natural next step once the dialect handles base-table operands robustly. Subquery and CTE operands are also common in interval-filtering pipelines (pre-filter genes by score, then overlap against peaks). Until this lands, every such query loses the IEJoin speedup.
Expected outcome
transpile("WITH big AS (...) SELECT a.start FROM peaks a JOIN big b ON a.interval INTERSECTS b.interval", dialect="duckdb")produces dialect SQL withWITHpreserved.transpile("SELECT ... FROM peaks a JOIN (SELECT ... FROM genes) b ON a.interval INTERSECTS b.interval", dialect="duckdb")produces dialect SQL with the subquery materialized as an internal CTE.transpile("SELECT ... FROM DISJOIN(genes) a, peaks b WHERE a.interval INTERSECTS b.interval", dialect="duckdb")produces dialect SQL with the inner DISJOIN transpiled and inlined.tests/test_duckdb_iejoin.pyfor each operand kind, including custom-column subquery / CTE outputs.docs/transpilation/performance.rstupdated to remove these shapes from the fallback list.