Migrate DISTANCE to a registered operator expander (proof-of-concept) — Closes #140#156
Draft
conradbzura wants to merge 4 commits into
Draft
Migrate DISTANCE to a registered operator expander (proof-of-concept) — Closes #140#156conradbzura wants to merge 4 commits into
conradbzura wants to merge 4 commits into
Conversation
Provide a public save/restore pair on the expander registry so a caller can capture the current registrations and later re-install exactly that baseline. This lets an isolating test fixture (or a plugin) clear and mutate the process-wide registry around a body without permanently losing the built-in expanders registered at import. snapshot returns a fresh mapping, so mutating it does not affect the registry; restore drops every current entry and re-installs the captured contents.
Move DISTANCE generation off the legacy giqldistance_sql string emitter and onto the registry's AST-expansion pass. A new auto-discovering giql.expanders package holds a generic expander that builds the same CASE expression as an AST subtree; transpile imports the package so the process-wide registry is populated before the first transpile. DISTANCE is the proof-of-concept for the expander protocol: it is a single CASE with no joins or per-target divergence, so one generic expander registered for GenericTarget serves every target. The four shapes (unsigned/signed by non-stranded/stranded) and the bedtools closest -d parity offset are preserved verbatim from the deleted emitter. The change is behavior-preserving. Because the CASE is now reserialized by the active target's serializer rather than spliced in as a raw string, the emitted text changes cosmetically only — most visibly the chrom-mismatch guard renders the SQL-standard <> instead of !=. Flip GIQLDistance.GIQL_EXPAND to True and delete the now-dead giqldistance_sql and _distance_operand methods; the shared _generate_distance_case helper stays for NEAREST.
Run the DISTANCE emitter-level tests through the expansion pass so they exercise the new path, and update their pinned SQL to the reserialized form (<> for the chrom-mismatch guard). The literal-range error tests now assert the diagnostic is raised by the expander rather than the deleted emitter. Rework the registry leak guards in test_expander to treat the import-time built-in registrations as the baseline rather than an empty registry, using the new snapshot and restore seam so isolating fixtures do not wipe the built-ins. Add an _opted_out helper and migrated-vs- unmigrated operator parametrization so a shipped GIQL_EXPAND=True operator can be held as a control, and cover snapshot/restore directly.
Mark the DuckDB-executing distance UDF tests as integration, add a drift-guard parity test between the AST expander and the retained _generate_distance_case, a cross-target byte-identity test, and a Hypothesis property test for the distance invariants. Hoist the shared downstream branch in the stranded CASE, align helper docstrings, and refresh stale giqldistance_sql references. Make the registry docstrings mechanistic, restore the registry in place, harden expander auto-discovery, and key the opt-out control on a dynamically derived migrated operator.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Migrate the DISTANCE operator off its bespoke string emitter and onto the registry-driven AST expansion pass. Add a generic, auto-discovered expander that builds the distance
CASEas asqlglotAST, register it for every target, flipGIQLDistance.GIQL_EXPANDtoTrue, and delete the legacygiqldistance_sqlemitter and its_distance_operandhelper. DISTANCE is the simplest operator — a singleCASEwith no joins or per-target divergence — so it serves as the proof-of-concept that validates the expander protocol, registry dispatch, and the cross-target result oracle before the harder operators migrate.The expander reuses the existing pass-1 resolved-column metadata and pass-2 coordinate canonicalization, and preserves the bedtools
closest -dparity semantics verbatim across all four shapes (unsigned/signed x non-stranded/stranded). Because theCASEis now reserialized by the active target's serializer rather than spliced in as a raw string, the emitted SQL changes cosmetically in one behavior-preserving way:!=renders as the SQL-standard<>.This is wave 3 of epic #137 and introduces the shared
ExpanderRegistry.snapshot()/restore()test seam that the sibling wave-3 PRs also carry; the duplicated seam is to be de-duplicated on merge.Closes #140
Proposed changes
Add a public
snapshot()/restore()seam toExpanderRegistryAdd two methods to
ExpanderRegistry(src/giql/expander.py):snapshot()returns a fresh shallow copy of the current(target, operator) -> expanderregistrations, andrestore()replaces all registrations with a previously captured snapshot. Together they form a public save/restore seam so an isolating test fixture (or a plugin) can mutate the process-wideREGISTRYaround a body and return it to a captured baseline afterward — letting the built-in expanders registered at import survive a fixture that would otherwiseclear()them permanently. This seam is shared across the wave-3 PRs and will be de-duplicated when they merge.Add the auto-discovering
giql.expanderspackage and the generic DISTANCE expanderAdd a new
giql.expanderspackage (src/giql/expanders/__init__.py) that imports every submodule viapkgutil.iter_modulesat import time, so each module's@register(...)decorator runs as a side effect and new operator modules are picked up by dropping a file in — no edit to__init__.pyrequired. Wire the package import once ingiql.transpileso the process-wideREGISTRYis populated before the first transpile.Add
src/giql/expanders/distance.pywithexpand_distance, registered forGenericTarget(one portable expander serves every target, since DISTANCE emits identical SQL on DuckDB, DataFusion, and the generic baseline). The expander reads thestranded/signedarguments and the pass-1 resolved interval operands, then builds the matching one of the fourCASEshapes from AST primitives. It mirrors the legacy emitter's contracts exactly: the bedtools-parity+ 1gap magnitude, the chrom-mismatch / overlap / downstream / upstreamWHENordering, the strand-validity guards (NULL or./?strands yield NULL), and the historical "Literal range as {first,second} argument not yet supported" diagnostic for deferred operands. The deferred-operand fall-through to the unstranded path is preserved.Flip
GIQLDistance.GIQL_EXPANDand delete the legacy emitterSet
GIQLDistance.GIQL_EXPAND = True(src/giql/expressions.py) so the operator routes through the AST-expansion pass. Deletegiqldistance_sqland_distance_operandfromBaseGIQLGenerator(src/giql/generators/base.py) and drop the now-unusedGIQLDistanceimport. The shared_generate_distance_caseand_extract_bool_paramhelpers are kept — NEAREST still depends on_generate_distance_case, and the new expander mirrors_extract_bool_param's coercion.Update the tests for the new pass
Update the DISTANCE emitter-level test helpers in
tests/test_distance_transpilation.py,tests/test_distance_udf.py, andtests/generators/test_base.pyto run theExpandOperatorspass (pass 3) before generation, matching the realtranspilepipeline, and update the pinned expected SQL from!=to<>. Move the two literal-range error assertions to run against the expander pass instead of the generator. Intests/test_expander.py, route theclean_registryfixture and the registry/flag leak guards throughsnapshot()/restore()against the import-time baseline (so built-in registrations survive isolation and leaks are still caught), split the operator opt-out test into migrated/unmigrated parametrizations, add direct coverage ofsnapshot()/restore(), and add an_opted_outcontext manager so control tests can hold a migrated operator off.Test cases
TestDistanceTranspilationCASErendered with<>TestDistanceTranspilationCASEover a commaFROMlistTestDistanceTranspilationTestDistanceTranspilation+ 1gap magnitude is preservedTestDistanceUDFTestBaseGIQLGeneratorCASEwith<>TestBaseGIQLGeneratorCASEwith NULL guardsTestBaseGIQLGeneratorTestBaseGIQLGeneratorTestExpanderRegistryFallbackGapssnapshot()TestExpanderRegistryFallbackGapsrestore()TestExpandOperatorsPassTestNoOpWhenInertTestOperatorOptOutGIQL_EXPANDFalseTestOperatorOptOutGIQL_EXPANDTrue