Migrate NEAREST to a registered expander with capability-driven LATERAL fallback — Closes #142#158
Draft
conradbzura wants to merge 4 commits into
Draft
Migrate NEAREST to a registered expander with capability-driven LATERAL fallback — Closes #142#158conradbzura wants to merge 4 commits into
conradbzura wants to merge 4 commits into
Conversation
Add public snapshot and restore methods to ExpanderRegistry, a save/restore seam over the process-wide registry. A test fixture or a plugin that mutates the registry around a body can capture the baseline and reinstate it afterward, so the built-in expanders registered at import survive an isolating fixture that would otherwise clear them permanently. This is the infrastructure the migrated test fixtures depend on to treat the import-time built-in registrations as their baseline rather than an empty registry.
Move NEAREST expansion off the legacy giqlnearest_sql emitter and onto the ExpandOperators pass as a capability-driven expander. Lateral-capable targets and every standalone literal-reference placement get the portable correlated LATERAL subquery, byte-identical to the legacy emitter. A correlated NEAREST on a target without LATERAL support now gets a decorrelated ROW_NUMBER() window-function fallback that returns identical rows: it ranks candidates once per distinct reference key and re-joins the top k back to every outer row sharing that key. This adds DataFusion support for correlated NEAREST, which previously had no physical plan for the LATERAL form and failed outright. Add the giql.expanders package, whose import registers every built-in expander as a side effect and auto-discovers new operator modules, and wire that import into transpile so the registry is populated before the first transpile. Flip GIQLNearest.GIQL_EXPAND on so the pass owns NEAREST, and delete the giqlnearest_sql emitter. The shared _generate_distance_case and _nearest_* resolution helpers are retained and reused by the expander, keeping the distance, passthrough, and encoding logic unchanged.
Rework the NEAREST and registry tests for the capability-driven expander. Drive the emitter-level NEAREST tests through the ExpandOperators pass instead of calling the deleted giqlnearest_sql directly, and update the pinned SQL to the expander's reserialized output (semantically unchanged from the legacy emitter). Drop the obsolete SUPPORTS_LATERAL=False hard-error test, since lateral support is now a target capability with a window-function fallback rather than a generator-level error. Promote the cross-target oracle's _unsupported_pending_142 expected- failure into a real three-target identity test: DataFusion now plans correlated NEAREST through the decorrelated fallback, so the LATERAL and window forms are verified to return identical rows on every target. Update the registry leak guards and clean_registry fixture to treat the import-time built-in registrations as the baseline through the new snapshot/restore seam, add coverage for snapshot/restore, and account for operators that now ship GIQL_EXPAND=True via an _opted_out helper.
Fix the literal-reference NEAREST crash on DataFusion by gating the decorrelated fallback on genuine correlation and materializing the distance in a two-level subquery. Add executing cross-target oracle cases (k>1, duplicate references, multi-key, max_distance, stranded, signed) and a deterministic tiebreaker so the LATERAL and window forms are set-equivalent. Delete dead helpers and SUPPORTS_LATERAL, make borrowed helpers static, mint fallback aliases via ctx.alias, add invariant asserts, and document DataFusion support. Apply the shared registry-docstring, restore-in-place, and auto-discovery fixes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Migrate NEAREST off the string emitter and onto the
ExpandOperatorspass as a capability-driven expander. Emit the portable correlatedLATERALsubquery wherecapabilities.supports_lateralholds (DuckDB, the generic target), and a decorrelatedROW_NUMBER()window-function form where it does not. This ADDS DataFusion support for correlated NEAREST, which previously failed because DataFusion has no correlated-LATERAL physical plan. Prove the fallback returns row-for-row identical results to the LATERAL form across the k=1/2,max_distance,stranded,signed, and duplicate-reference-row cases, and promote the cross-target oracle's previously-pinned_unsupported_pending_142expected-failure to a real three-target identity test. FlipGIQL_EXPANDonGIQLNearestand delete the legacygiqlnearest_sqlemitter, keeping its shared_generate_distance_case/_nearest_*helpers.This is epic #137 wave 3; it carries the shared
ExpanderRegistry.snapshot()/restore()seam that sibling wave-3 PRs also have (dedupe on merge).Closes #142
Proposed changes
Registry save/restore seam
Add
ExpanderRegistry.snapshot()andExpanderRegistry.restore()— a public save/restore pair.snapshot()returns a fresh shallow copy of the(target, operator) -> expanderregistrations;restore()drops the current entries and re-installs exactly a captured snapshot. A test fixture (or plugin) that mutates the process-wideREGISTRYaround a body captures the baseline first and hands it back afterward, so the built-in expanders registered at import survive an isolating fixture that would otherwiseclear()them permanently.giql.expanderspackage and capability-drivennearest.pyAdd the
giql.expanderspackage. Its__init__walks its own submodules withpkgutil.iter_modulesand imports each, so dropping a<operator>.pyinto the package registers its@register(...)expander as an import side effect with no edit to the package file.giql.transpileimports the package once soREGISTRYis populated before the first transpile.Add
nearest.py.expand_nearestbranches onctx.capabilities.supports_lateraland on whether the node is correlated (parent is aLATERAL): lateral-capable targets and every standalone literal-reference placement get the portable LATERAL/standalone subquery, byte-identical to the legacy emitter; a correlated NEAREST on a target without LATERAL support gets the decorrelated fallback. Three load-bearing design points in the fallback:__giql_x_rk_*names into a renamed derived relation that the target is cross-joined against. DataFusion's planner cannot resolve a window ordering over a join whose two sides share column names (both exposestart/end), so the renamed columns keep every reference column distinct from the target's.ROW_NUMBER()is added in the enclosing one. Fused into one level, DataFusion's optimizer mis-derives the window's sort order from the chromosome-equality prefilter and tripsSanityCheckPlan.DISTINCT, candidates are ranked once per distinct reference value, and the rewritten join re-associates the top-kback to every outer row sharing that key. Ranking depends only on the reference value, so ranking once and re-joining is identical to the per-row LATERAL form even when the outer table holds duplicate reference rows.The fallback rewrites
<outer> AS a CROSS JOIN LATERAL (nearest) AS binto<outer> AS a JOIN (<ranked subquery>) AS b ON <ref-key match> AND b.<rn> <= kin place.GIQLNearest.GIQL_EXPANDflip and emitter deletionFlip
GIQLNearest.GIQL_EXPANDtoTrueso NEAREST expands through its registered expander, and deleteBaseGIQLGenerator.giqlnearest_sql(including the old SQLite "LATERAL not supported"ValueErrorbranch). The self-free_generate_distance_case(shared with DISTANCE, #140) and the_nearest_*resolution / passthrough / output-encoding helpers stay onBaseGIQLGeneratorand are reused by the expander, so distance, passthrough, and encoding round-tripping remain byte-for-byte identical.Test updates and the promoted oracle test
Update the emitter-level NEAREST tests to run pass 3 (
ExpandOperators) before generating, matchingtranspile. Addsnapshot/restoreregistry tests and route the registry-isolation fixtures through the new seam. Add a parametrized check that every migrated operator shipsGIQL_EXPAND=True. Promote the cross-target oracle'stest_nearest_on_datafusion_unsupported_pending_142pytest.raises(match="OuterReferenceColumn")pin totest_correlated_nearest_k1_agrees_across_all_targets, a real three-target identity test (generic and duckdb on the LATERAL form via DuckDB, datafusion on the decorrelated fallback).Test cases
TestNearestTranspilationNEAREST(genes, reference := peaks.interval, k := 3)ORDER BY, andLIMIT 3is generatedTestNearestTranspilationNEAREST(..., k := 5, max_distance := 100000)100000distance filter andLIMIT 5is generatedmax_distancefilterTestNearestTranspilationreference := 'chr1:1000-2000'TestNearestTranspilationstranded := trueTestNearestTranspilationsigned := trueTestExpanderRegistryFallbackGapssnapshot()snapshot()independenceTestExpanderRegistryFallbackGapsrestore()restore()semanticsTestOperatorOptOutGIQL_EXPANDattributeTrue, so the operator expands through its registered expanderTestOperatorOptOutGIQL_EXPANDattributeFalse, so the operator still uses the legacy emitterTestCrossTargetOracleNearestCROSS JOIN LATERAL NEAREST(..., k := 1)runs on generic, duckdb, and datafusion_unsupported_pending_142)