Migrate NEAREST to a registered expander with capability-driven LATERAL fallback (Closes #142) #158

Draft
conradbzura wants to merge 4 commits into main from 142-migrate-nearest-expander


@conradbzura

Collaborator

Summary

Migrate NEAREST off the string emitter and onto the ExpandOperators pass as a capability-driven expander. Emit the portable correlated LATERAL subquery where capabilities.supports_lateral holds (DuckDB and the generic target), and a decorrelated ROW_NUMBER() window-function form where it does not. This adds DataFusion support for correlated NEAREST, which previously failed because DataFusion has no correlated-LATERAL physical plan. Prove that the fallback returns row-for-row identical results to the LATERAL form across the k=1, k=2, max_distance, stranded, signed, and duplicate-reference-row cases, and promote the cross-target oracle's previously pinned _unsupported_pending_142 expected failure to a real three-target identity test. Flip GIQL_EXPAND on for GIQLNearest and delete the legacy giqlnearest_sql emitter, keeping its shared _generate_distance_case / _nearest_* helpers.

This is wave 3 of epic #137. It carries the shared ExpanderRegistry.snapshot() / restore() seam that sibling wave-3 PRs also include, so the duplicate copies need to be deduplicated on merge.

Closes #142

Proposed changes

Registry save/restore seam

Add ExpanderRegistry.snapshot() and ExpanderRegistry.restore() — a public save/restore pair. snapshot() returns a fresh shallow copy of the (target, operator) -> expander registrations; restore() drops the current entries and re-installs exactly a captured snapshot. A test fixture (or plugin) that mutates the process-wide REGISTRY around a body captures the baseline first and hands it back afterward, so the built-in expanders registered at import survive an isolating fixture that would otherwise clear() them permanently.

giql.expanders package and capability-driven nearest.py

Add the giql.expanders package. Its __init__ walks its own submodules with pkgutil.iter_modules and imports each, so dropping a <operator>.py into the package registers its @register(...) expander as an import side effect with no edit to the package file. giql.transpile imports the package once so REGISTRY is populated before the first transpile.
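
The auto-discovery pattern can be sketched with a throwaway package built on disk. The package name demo_expanders, the module demo_registry, and the register decorator below are made up for the illustration and are assumptions, not the giql sources; only the pkgutil.iter_modules walk in __init__ mirrors the description.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a temporary package; every name here is hypothetical.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_expanders"
pkg.mkdir()

# A tiny registry module with a decorator that records (target, operator).
(root / "demo_registry.py").write_text(textwrap.dedent('''
    REGISTRY = {}

    def register(target, operator):
        def decorator(fn):
            REGISTRY[(target, operator)] = fn
            return fn
        return decorator
'''))

# The package __init__ walks its own submodules and imports each one,
# so their @register(...) decorators run as an import side effect.
(pkg / "__init__.py").write_text(textwrap.dedent('''
    import importlib
    import pkgutil

    for _info in pkgutil.iter_modules(__path__):
        importlib.import_module(f"{__name__}.{_info.name}")
'''))

# Dropping an operator module into the package is all that is needed.
(pkg / "nearest.py").write_text(textwrap.dedent('''
    from demo_registry import register

    @register("generic", "NEAREST")
    def expand_nearest(node, ctx):
        return "expanded"
'''))

sys.path.insert(0, str(root))
importlib.invalidate_caches()
importlib.import_module("demo_expanders")  # one import populates the registry

from demo_registry import REGISTRY

registered = sorted(REGISTRY)
print(registered)
```

Importing the package once is enough to register nearest.py, with no edit to __init__ when a further operator module is added.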

Add nearest.py. expand_nearest branches on ctx.capabilities.supports_lateral and on whether the node is correlated (parent is a LATERAL): lateral-capable targets and every standalone literal-reference placement get the portable LATERAL/standalone subquery, byte-identical to the legacy emitter; a correlated NEAREST on a target without LATERAL support gets the decorrelated fallback. Three load-bearing design points in the fallback:

  • Pre-projected reference relation — the outer relation's reference columns are projected under fresh __giql_x_rk_* names into a renamed derived relation that the target is cross-joined against. DataFusion's planner cannot resolve a window ordering over a join whose two sides share column names (both expose start / end), so the renamed columns keep every reference column distinct from the target's.
  • Separate query levels for join and window — the cross-join, distance, and reference-key projection are computed in an inner subquery, and ROW_NUMBER() is added in the enclosing one. Fused into one level, DataFusion's optimizer mis-derives the window's sort order from the chromosome-equality prefilter and trips SanityCheckPlan.
  • DISTINCT-on-key with top-k fan-out — the reference relation is de-duplicated on the reference key (position, plus strand in stranded mode) with DISTINCT, candidates are ranked once per distinct reference value, and the rewritten join re-associates the top-k back to every outer row sharing that key. Ranking depends only on the reference value, so ranking once and re-joining is identical to the per-row LATERAL form even when the outer table holds duplicate reference rows.

The fallback rewrites <outer> AS a CROSS JOIN LATERAL (nearest) AS b into <outer> AS a JOIN (<ranked subquery>) AS b ON <ref-key match> AND b.<rn> <= k in place.
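
The shape of the decorrelated form can be illustrated with Python's bundled sqlite3, used here only because it ships with Python and supports ROW_NUMBER() (SQLite 3.25 or newer). It is a stand-in and not one of the targets named above. The table layout, the rk_* column names, the simplified gap distance, and the (distance, name) tiebreaker are all assumptions for this sketch, and CTEs stand in for the nested subquery levels. It is not the SQL the expander emits. The sketch compares the DISTINCT-key ranking plus re-join against a per-row top-k computed in plain Python.

```python
import sqlite3

peaks = [("chr1", 100, 200), ("chr1", 100, 200),   # duplicate reference row
         ("chr1", 500, 600), ("chr2", 10, 20)]
genes = [("g1", "chr1", 50, 90), ("g2", "chr1", 150, 160),
         ("g3", "chr1", 650, 700), ("g4", "chr2", 100, 120)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE peaks (chrom TEXT, pos_start INT, pos_end INT)")
con.execute("CREATE TABLE genes (name TEXT, chrom TEXT, pos_start INT, pos_end INT)")
con.executemany("INSERT INTO peaks VALUES (?, ?, ?)", peaks)
con.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", genes)

FALLBACK_SQL = """
WITH ref AS (      -- distinct reference keys, projected under fresh names
    SELECT DISTINCT chrom AS rk_chrom, pos_start AS rk_start, pos_end AS rk_end
    FROM peaks
),
cand AS (          -- inner level: cross join, prefilter, distance
    SELECT ref.rk_chrom, ref.rk_start, ref.rk_end, g.name,
           CASE WHEN g.pos_end <= ref.rk_start THEN ref.rk_start - g.pos_end
                WHEN g.pos_start >= ref.rk_end THEN g.pos_start - ref.rk_end
                ELSE 0 END AS distance
    FROM ref CROSS JOIN genes AS g
    WHERE g.chrom = ref.rk_chrom
),
ranked AS (        -- enclosing level: rank once per distinct reference key
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY rk_chrom, rk_start, rk_end
        ORDER BY distance, name) AS rn
    FROM cand
)
SELECT a.rowid AS peak_id, b.name, b.distance
FROM peaks AS a
JOIN ranked AS b   -- fan the top k back out to every outer row with that key
  ON a.chrom = b.rk_chrom AND a.pos_start = b.rk_start
 AND a.pos_end = b.rk_end AND b.rn <= ?
ORDER BY peak_id, b.distance, b.name
"""


def distance(ref, gene):
    """Simplified gap distance; 0 when the intervals overlap."""
    _, r_start, r_end = ref
    _, _, g_start, g_end = gene
    if g_end <= r_start:
        return r_start - g_end
    if g_start >= r_end:
        return g_start - r_end
    return 0


def per_row_top_k(k):
    """Reference result: rank candidates separately for every outer row."""
    out = []
    for peak_id, ref in enumerate(peaks, start=1):
        cands = sorted((distance(ref, g), g[0]) for g in genes if g[1] == ref[0])
        out.extend((peak_id, name, d) for d, name in cands[:k])
    return out


fallback_rows = con.execute(FALLBACK_SQL, (2,)).fetchall()
print(fallback_rows == per_row_top_k(2))
```

Because the ranking depends only on the reference key, both duplicate peak rows receive the same top-k after the re-join, matching the per-row computation.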

GIQLNearest.GIQL_EXPAND flip and emitter deletion

Flip GIQLNearest.GIQL_EXPAND to True so NEAREST expands through its registered expander, and delete BaseGIQLGenerator.giqlnearest_sql (including the old SQLite "LATERAL not supported" ValueError branch). The self-free _generate_distance_case (shared with DISTANCE, #140) and the _nearest_* resolution / passthrough / output-encoding helpers stay on BaseGIQLGenerator and are reused by the expander, so distance, passthrough, and encoding round-tripping remain byte-for-byte identical.
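
As a rough sketch of how the flag and the capability check combine, under the assumption that the pass consults GIQL_EXPAND first and the expander then branches on supports_lateral and correlation. Capabilities, SketchNearest, and choose_form are hypothetical names for this illustration; the real pass and expander operate on AST nodes and return SQL expressions, not strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    supports_lateral: bool


@dataclass(frozen=True)
class SketchNearest:
    """Hypothetical stand-in for a NEAREST node."""
    correlated: bool     # True when the node's parent is a LATERAL
    GIQL_EXPAND = True   # plain class attribute: the operator opts in


def choose_form(node, caps: Capabilities) -> str:
    # Operators that have not opted in would still use their legacy emitter.
    if not getattr(node, "GIQL_EXPAND", False):
        return "legacy-emitter"
    # Lateral-capable targets and standalone placements keep the portable form.
    if caps.supports_lateral or not node.correlated:
        return "lateral-or-standalone"
    # Correlated NEAREST on a target without LATERAL support.
    return "decorrelated-row-number"


lateral_target = Capabilities(supports_lateral=True)
no_lateral_target = Capabilities(supports_lateral=False)

print(choose_form(SketchNearest(correlated=True), lateral_target))
print(choose_form(SketchNearest(correlated=False), no_lateral_target))
print(choose_form(SketchNearest(correlated=True), no_lateral_target))
```

Only the last combination, a correlated node on a target without LATERAL support, reaches the window-function fallback.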

Test updates and the promoted oracle test

Update the emitter-level NEAREST tests to run pass 3 (ExpandOperators) before generating, matching transpile. Add snapshot / restore registry tests and route the registry-isolation fixtures through the new seam. Add a parametrized check that every migrated operator ships GIQL_EXPAND=True. Promote the cross-target oracle's test_nearest_on_datafusion_unsupported_pending_142 pytest.raises(match="OuterReferenceColumn") pin to test_correlated_nearest_k1_agrees_across_all_targets, a real three-target identity test (generic and duckdb on the LATERAL form via DuckDB, datafusion on the decorrelated fallback).

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestNearestTranspilation | A query with NEAREST(genes, reference := peaks.interval, k := 3) | Transpiling through passes 1-3 | A LATERAL join with a distance column, ORDER BY, and LIMIT 3 is generated | Correlated LATERAL form |
| 2 | TestNearestTranspilation | A query with NEAREST(..., k := 5, max_distance := 100000) | Transpiling | A LATERAL subquery carrying the 100000 distance filter and LIMIT 5 is generated | max_distance filter |
| 3 | TestNearestTranspilation | A query with a literal reference := 'chr1:1000-2000' | Transpiling | A standalone subquery with no LATERAL and the literal coordinates is generated | Standalone literal reference |
| 4 | TestNearestTranspilation | A query with stranded := true | Transpiling | A LATERAL subquery with strand filtering is generated | Stranded mode |
| 5 | TestNearestTranspilation | A query with signed := true | Transpiling | A LATERAL subquery with the signed-distance calculation is generated | Signed distance |
| 6 | TestExpanderRegistryFallbackGaps | A registry with one entry captured by snapshot() | A second entry is registered after the snapshot | The snapshot holds only the first entry, being a copy not a live view | snapshot() independence |
| 7 | TestExpanderRegistryFallbackGaps | A snapshot taken before the registry is cleared and a different entry registered | Calling restore() | The original entry resolves again and the post-snapshot entry is gone | restore() semantics |
| 8 | TestOperatorOptOut | Each migrated GIQL operator class | Reading its GIQL_EXPAND attribute | It is True, so the operator expands through its registered expander | Migrated opt-in flag |
| 9 | TestOperatorOptOut | Each unmigrated GIQL operator class | Reading its GIQL_EXPAND attribute | It is False, so the operator still uses the legacy emitter | Unmigrated opt-out flag |
| 10 | TestCrossTargetOracleNearest | A single-row peaks table and three candidate genes at varying distances on chr1 | A correlated CROSS JOIN LATERAL NEAREST(..., k := 1) runs on generic, duckdb, and datafusion | Every target returns the single nearest gene and agrees, with datafusion on the decorrelated window-function fallback | Cross-target identity (promoted from _unsupported_pending_142) |

Add public snapshot and restore methods to ExpanderRegistry, a
save/restore seam over the process-wide registry. A test fixture or a
plugin that mutates the registry around a body can capture the baseline
and reinstate it afterward, so the built-in expanders registered at
import survive an isolating fixture that would otherwise clear them
permanently.

This is the infrastructure the migrated test fixtures depend on to treat
the import-time built-in registrations as their baseline rather than an
empty registry.

Move NEAREST expansion off the legacy giqlnearest_sql emitter and onto
the ExpandOperators pass as a capability-driven expander. Lateral-capable
targets and every standalone literal-reference placement get the portable
correlated LATERAL subquery, byte-identical to the legacy emitter. A
correlated NEAREST on a target without LATERAL support now gets a
decorrelated ROW_NUMBER() window-function fallback that returns identical
rows: it ranks candidates once per distinct reference key and re-joins the
top k back to every outer row sharing that key.

This adds DataFusion support for correlated NEAREST, which previously had
no physical plan for the LATERAL form and failed outright.

Add the giql.expanders package, whose import registers every built-in
expander as a side effect and auto-discovers new operator modules, and
wire that import into transpile so the registry is populated before the
first transpile. Flip GIQLNearest.GIQL_EXPAND on so the pass owns NEAREST,
and delete the giqlnearest_sql emitter. The shared _generate_distance_case
and _nearest_* resolution helpers are retained and reused by the expander,
keeping the distance, passthrough, and encoding logic unchanged.

Rework the NEAREST and registry tests for the capability-driven expander.

Drive the emitter-level NEAREST tests through the ExpandOperators pass
instead of calling the deleted giqlnearest_sql directly, and update the
pinned SQL to the expander's reserialized output (semantically unchanged
from the legacy emitter). Drop the obsolete SUPPORTS_LATERAL=False
hard-error test, since lateral support is now a target capability with a
window-function fallback rather than a generator-level error.

Promote the cross-target oracle's _unsupported_pending_142 expected-
failure into a real three-target identity test: DataFusion now plans
correlated NEAREST through the decorrelated fallback, so the LATERAL and
window forms are verified to return identical rows on every target.

Update the registry leak guards and clean_registry fixture to treat the
import-time built-in registrations as the baseline through the new
snapshot/restore seam, add coverage for snapshot/restore, and account for
operators that now ship GIQL_EXPAND=True via an _opted_out helper.
@conradbzura conradbzura self-assigned this Jun 28, 2026
Fix the literal-reference NEAREST crash on DataFusion by gating the decorrelated fallback on genuine correlation and materializing the distance in a two-level subquery. Add executing cross-target oracle cases (k>1, duplicate references, multi-key, max_distance, stranded, signed) and a deterministic tiebreaker so the LATERAL and window forms are set-equivalent. Delete dead helpers and SUPPORTS_LATERAL, make borrowed helpers static, mint fallback aliases via ctx.alias, add invariant asserts, and document DataFusion support. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.