Description
After the gating fix in PR #93 (commit f7dc527) the DuckDB IEJoin dialect falls back to the binned plan whenever the top-level query has more than the two tables connected by the INTERSECTS predicate — for example:
FROM peaks a, genes b, annotations c WHERE a.interval INTERSECTS b.interval
FROM peaks a JOIN genes b ON a.interval INTERSECTS b.interval JOIN samples c ON a.sample_id = c.id
These are common shapes: take the overlap result and join it to an auxiliary table on an equality / scalar key. Extend the dialect to keep the IEJoin speedup for the INTERSECTS portion while threading the auxiliary table(s) through the outer SELECT.
The natural mechanism: keep the inner query(getvariable(...)) block as it is today (handling only the two INTERSECTS-connected tables), then express the auxiliary join(s) as ordinary SQL JOINs on the outer SELECT. The dynamic CTE carries the columns each auxiliary join needs — those columns are already determined by _resolve_projections from the user's SELECT list.
This is distinct from #96 (multiple INTERSECTS predicates), which is about chaining multiple range-overlap edges. Here there is one INTERSECTS edge plus auxiliary equality / scalar joins.
transform_to_sql's len(joins) != 1 guard added in f7dc527 is relaxed once the outer-join wiring is in place.
Motivation
Joining the overlap result to a sample-metadata table, a sequencing-track table, or an arbitrary annotation table is the natural follow-on to "find overlaps." Forcing those queries to the binned plan defeats the dialect's perf advantage when the auxiliary join is the cheaper part of the query.
Expected outcome
transpile("SELECT a.start, c.label FROM peaks a JOIN genes b ON a.interval INTERSECTS b.interval JOIN samples c ON a.sample_id = c.id", dialect="duckdb") produces dialect SQL where the INTERSECTS is the inner IEJoin and samples is joined on the outer SELECT.
- The comma-style form (
FROM peaks a, genes b, extra c WHERE a.interval INTERSECTS b.interval) is also supported.
- Three or more auxiliary tables work.
- New execution tests in
tests/test_duckdb_iejoin.py covering each form, plus a cross-plan equivalence property test.
docs/transpilation/performance.rst updated to remove "more than two tables" from the fallback list.
Description
After the gating fix in PR #93 (commit f7dc527) the DuckDB IEJoin dialect falls back to the binned plan whenever the top-level query has more than the two tables connected by the INTERSECTS predicate — for example:
FROM peaks a, genes b, annotations c WHERE a.interval INTERSECTS b.intervalFROM peaks a JOIN genes b ON a.interval INTERSECTS b.interval JOIN samples c ON a.sample_id = c.idThese are common shapes: take the overlap result and join it to an auxiliary table on an equality / scalar key. Extend the dialect to keep the IEJoin speedup for the INTERSECTS portion while threading the auxiliary table(s) through the outer SELECT.
The natural mechanism: keep the inner
query(getvariable(...))block as it is today (handling only the two INTERSECTS-connected tables), then express the auxiliary join(s) as ordinary SQL JOINs on the outer SELECT. The dynamic CTE carries the columns each auxiliary join needs — those columns are already determined by_resolve_projectionsfrom the user's SELECT list.This is distinct from #96 (multiple INTERSECTS predicates), which is about chaining multiple range-overlap edges. Here there is one INTERSECTS edge plus auxiliary equality / scalar joins.
transform_to_sql'slen(joins) != 1guard added in f7dc527 is relaxed once the outer-join wiring is in place.Motivation
Joining the overlap result to a sample-metadata table, a sequencing-track table, or an arbitrary annotation table is the natural follow-on to "find overlaps." Forcing those queries to the binned plan defeats the dialect's perf advantage when the auxiliary join is the cheaper part of the query.
Expected outcome
transpile("SELECT a.start, c.label FROM peaks a JOIN genes b ON a.interval INTERSECTS b.interval JOIN samples c ON a.sample_id = c.id", dialect="duckdb")produces dialect SQL where the INTERSECTS is the inner IEJoin andsamplesis joined on the outer SELECT.FROM peaks a, genes b, extra c WHERE a.interval INTERSECTS b.interval) is also supported.tests/test_duckdb_iejoin.pycovering each form, plus a cross-plan equivalence property test.docs/transpilation/performance.rstupdated to remove "more than two tables" from the fallback list.