Migrate DISTANCE to a registered operator expander (proof-of-concept) — Closes #140 #156

Draft
conradbzura wants to merge 4 commits into main from 140-migrate-distance-expander

@conradbzura

Collaborator

Summary

Migrate the DISTANCE operator off its bespoke string emitter and onto the registry-driven AST expansion pass. Add a generic, auto-discovered expander that builds the distance CASE as a sqlglot AST, register it for every target, flip GIQLDistance.GIQL_EXPAND to True, and delete the legacy giqldistance_sql emitter and its _distance_operand helper. DISTANCE is the simplest operator — a single CASE with no joins or per-target divergence — so it serves as the proof-of-concept that validates the expander protocol, registry dispatch, and the cross-target result oracle before the harder operators migrate.

The expander reuses the existing pass-1 resolved-column metadata and pass-2 coordinate canonicalization, and preserves the bedtools closest -d parity semantics verbatim across all four shapes (unsigned/signed × non-stranded/stranded). Because the CASE is now reserialized by the active target's serializer rather than spliced in as a raw string, the emitted SQL changes cosmetically in one behavior-preserving way: != renders as the SQL-standard <>.

This is wave 3 of epic #137 and introduces the shared ExpanderRegistry.snapshot()/restore() test seam that the sibling wave-3 PRs also carry; the duplicated seam is to be de-duplicated on merge.

Closes #140

Proposed changes

Add a public snapshot()/restore() seam to ExpanderRegistry

Add two methods to ExpanderRegistry (src/giql/expander.py): snapshot() returns a fresh shallow copy of the current (target, operator) -> expander registrations, and restore() replaces all registrations with a previously captured snapshot. Together they form a public save/restore seam so an isolating test fixture (or a plugin) can mutate the process-wide REGISTRY around a body and return it to a captured baseline afterward — letting the built-in expanders registered at import survive a fixture that would otherwise clear() them permanently. This seam is shared across the wave-3 PRs and will be de-duplicated when they merge.
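The save/restore contract described above can be sketched in a few lines. This is a minimal illustration, not the code in src/giql/expander.py: the class name SketchRegistry, the string keys, and the register/resolve/clear method names are assumptions made for the example. Only the snapshot()/restore() behavior (fresh shallow copy, full replacement) is taken from the description.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical key shape: plain strings stand in for the real target/operator types.
Key = Tuple[str, str]
Expander = Callable[[object], object]


class SketchRegistry:
    """Illustrative stand-in for ExpanderRegistry (names are assumptions)."""

    def __init__(self) -> None:
        self._entries: Dict[Key, Expander] = {}

    def register(self, target: str, operator: str, expander: Expander) -> None:
        self._entries[(target, operator)] = expander

    def resolve(self, target: str, operator: str) -> Optional[Expander]:
        return self._entries.get((target, operator))

    def clear(self) -> None:
        self._entries.clear()

    def snapshot(self) -> Dict[Key, Expander]:
        # Fresh shallow copy: later registrations do not leak into it.
        return dict(self._entries)

    def restore(self, snapshot: Dict[Key, Expander]) -> None:
        # Drop every current entry, then re-install the captured contents in place.
        self._entries.clear()
        self._entries.update(snapshot)


def builtin_expander(node):
    return node


def transient_expander(node):
    return node


registry = SketchRegistry()
registry.register("generic", "DISTANCE", builtin_expander)

baseline = registry.snapshot()           # capture the import-time state
registry.clear()                         # an isolating fixture wipes the registry
registry.register("generic", "SCRATCH", transient_expander)
registry.restore(baseline)               # built-in is back, transient entry is gone
```

Mutating the dictionary in place during restore(), rather than rebinding it, keeps any other reference to the same mapping valid; whether the real registry needs that property is not stated in the description.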

Add the auto-discovering giql.expanders package and the generic DISTANCE expander

Add a new giql.expanders package (src/giql/expanders/__init__.py) that imports every submodule via pkgutil.iter_modules at import time, so each module's @register(...) decorator runs as a side effect and new operator modules are picked up by dropping a file in — no edit to __init__.py required. Wire the package import once in giql.transpile so the process-wide REGISTRY is populated before the first transpile.
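The import-time discovery pattern can be demonstrated with a throwaway package written to a temporary directory. Everything named sketch_* below is hypothetical and exists only for this demo; the real package is giql.expanders and its registration decorator may take different arguments. The loop inside INIT_SRC is the part that corresponds to the description: iterate pkgutil.iter_modules over the package's own __path__ and import each submodule so its decorator runs.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A tiny stand-in registry module (hypothetical, not giql.expander).
REGISTRY_SRC = '''
REGISTRY = {}

def register(target, operator):
    def decorator(func):
        REGISTRY[(target, operator)] = func
        return func
    return decorator
'''

# The package __init__: import every submodule so decorators run as a side effect.
INIT_SRC = '''
import importlib
import pkgutil

for _info in pkgutil.iter_modules(__path__):
    importlib.import_module(f"{__name__}.{_info.name}")
'''

# An operator module that is picked up just by existing in the package directory.
DISTANCE_SRC = '''
from sketch_registry import register

@register("generic", "DISTANCE")
def expand_distance(node):
    return node
'''


def build_and_import_demo():
    """Write the demo package to disk, import it, and return the populated registry."""
    root = Path(tempfile.mkdtemp())
    (root / "sketch_registry.py").write_text(REGISTRY_SRC)
    pkg = root / "sketch_expanders"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(INIT_SRC)
    (pkg / "distance.py").write_text(DISTANCE_SRC)

    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    importlib.import_module("sketch_expanders")   # triggers discovery
    return importlib.import_module("sketch_registry").REGISTRY


registry = build_and_import_demo()
print(sorted(registry))
```

Adding a second file next to distance.py would register another operator without touching __init__.py, which is the property the description relies on.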

Add src/giql/expanders/distance.py with expand_distance, registered for GenericTarget (one portable expander serves every target, since DISTANCE emits identical SQL on DuckDB, DataFusion, and the generic baseline). The expander reads the stranded/signed arguments and the pass-1 resolved interval operands, then builds the matching one of the four CASE shapes from AST primitives. It mirrors the legacy emitter's contracts exactly: the bedtools-parity + 1 gap magnitude, the chrom-mismatch / overlap / downstream / upstream WHEN ordering, the strand-validity guards (NULL or ./? strands yield NULL), and the historical "Literal range as {first,second} argument not yet supported" diagnostic for deferred operands. The deferred-operand fall-through to the unstranded path is preserved.
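As a reading aid for the WHEN ordering and the + 1 gap magnitude, the unstranded shapes can be written as a plain Python reference function. This is a sketch under stated assumptions, not the expander itself, which builds a sqlglot AST: it assumes 0-based half-open coordinates after canonicalization, that a chromosome mismatch yields NULL (None here), that overlap yields 0, and that "downstream" means the second interval starts at or after the first one ends. The Interval type and function name are hypothetical, and the stranded shapes are deliberately left out.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive (half-open)


def reference_distance(a: Interval, b: Interval, signed: bool = False) -> Optional[int]:
    """Mirror the branch order: chrom mismatch, overlap, downstream, upstream."""
    # 1. Different chromosomes: no defined distance (assumed to map to SQL NULL).
    if a.chrom != b.chrom:
        return None
    # 2. Overlapping intervals are at distance 0.
    if a.start < b.end and b.start < a.end:
        return 0
    # 3. b lies downstream of a: gap plus the bedtools-parity + 1, always positive.
    if b.start >= a.end:
        return b.start - a.end + 1
    # 4. b lies upstream of a: same magnitude, negated only in signed mode.
    gap = a.start - b.end + 1
    return -gap if signed else gap


a = Interval("chr1", 100, 200)
print(reference_distance(a, Interval("chr1", 250, 300)))               # downstream
print(reference_distance(a, Interval("chr1", 10, 50), signed=True))    # upstream, signed
```

Under these assumptions, book-ended intervals such as [100, 200) and [200, 300) come out at distance 1 rather than 0, which is the effect of the + 1 offset.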

Flip GIQLDistance.GIQL_EXPAND and delete the legacy emitter

Set GIQLDistance.GIQL_EXPAND = True (src/giql/expressions.py) so the operator routes through the AST-expansion pass. Delete giqldistance_sql and _distance_operand from BaseGIQLGenerator (src/giql/generators/base.py) and drop the now-unused GIQLDistance import. The shared _generate_distance_case and _extract_bool_param helpers are kept — NEAREST still depends on _generate_distance_case, and the new expander mirrors _extract_bool_param's coercion.
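The effect of the flag can be pictured as a per-class gate in front of the registry lookup. The sketch below is an assumption about the general shape of that gate, not the ExpandOperators pass itself: the Node, Distance, and Nearest classes and the expand_if_enabled helper are hypothetical, and the expander returns a placeholder string instead of an AST.

```python
from typing import Callable, Dict, Type


class Node:
    # Operators are opted out unless a subclass flips the flag.
    GIQL_EXPAND = False


class Distance(Node):
    GIQL_EXPAND = True   # migrated: routed through the expansion pass


class Nearest(Node):
    pass                 # unmigrated: still handled by the legacy generator path


def expand_if_enabled(node: Node, expanders: Dict[Type[Node], Callable[[Node], object]]):
    """Expand a node only when its class opts in AND an expander is registered."""
    if not type(node).GIQL_EXPAND:
        return node                      # gate closed: leave the node untouched
    expander = expanders.get(type(node))
    return expander(node) if expander is not None else node


# Placeholder expanders; a real one would return an AST subtree.
expanders = {
    Distance: lambda node: "CASE ... END",
    Nearest: lambda node: "should not be used while the flag is off",
}

print(expand_if_enabled(Distance(), expanders))
```

Keeping the flag separate from registration is what lets an operator have an expander registered yet stay on the legacy path until it is deliberately switched over.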

Update the tests for the new pass

Update the DISTANCE emitter-level test helpers in tests/test_distance_transpilation.py, tests/test_distance_udf.py, and tests/generators/test_base.py to run the ExpandOperators pass (pass 3) before generation, matching the real transpile pipeline, and update the pinned expected SQL from != to <>. Move the two literal-range error assertions to run against the expander pass instead of the generator. In tests/test_expander.py, route the clean_registry fixture and the registry/flag leak guards through snapshot()/restore() against the import-time baseline (so built-in registrations survive isolation and leaks are still caught), split the operator opt-out test into migrated/unmigrated parametrizations, add direct coverage of snapshot()/restore(), and add an _opted_out context manager so control tests can hold a migrated operator off.
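A helper like _opted_out can be as small as a context manager that flips the class flag and puts it back. The version below is a guess at the shape, not the helper in tests/test_expander.py; the MigratedOperator class and the function name opted_out are hypothetical.

```python
from contextlib import contextmanager


class MigratedOperator:
    """Hypothetical stand-in for an operator that ships with GIQL_EXPAND = True."""
    GIQL_EXPAND = True


@contextmanager
def opted_out(operator_cls):
    """Temporarily hold a migrated operator off the expansion pass."""
    previous = operator_cls.GIQL_EXPAND
    operator_cls.GIQL_EXPAND = False
    try:
        yield operator_cls
    finally:
        # Always restore, even if the test body raises, so the flag cannot leak.
        operator_cls.GIQL_EXPAND = previous


with opted_out(MigratedOperator):
    flag_inside = MigratedOperator.GIQL_EXPAND   # False while the control test runs

flag_after = MigratedOperator.GIQL_EXPAND        # back to True afterwards
```

The try/finally restore serves the same purpose as the flag leak guard mentioned above: a failing control test should not leave the operator opted out for the rest of the session.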

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | TestDistanceTranspilation | A column-to-column DISTANCE query | Transpiling through passes 1-3 | Emits the unsigned distance CASE rendered with <> | Default DISTANCE expansion |
| 2 | TestDistanceTranspilation | A DISTANCE query using comma-join syntax | Transpiling through passes 1-3 | Emits the same CASE over a comma FROM list | Join-syntax invariance |
| 3 | TestDistanceTranspilation | A signed DISTANCE query | Transpiling through passes 1-3 | Upstream gap is negated and downstream stays positive | Signed branch |
| 4 | TestDistanceTranspilation | A DISTANCE query | Comparing against the legacy semantics | Bedtools-parity + 1 gap magnitude is preserved | Distance-math parity |
| 5 | TestDistanceUDF | A transpiled DISTANCE query | Executing the generated SQL on DuckDB | Returns the expected genomic distances | End-to-end behavior |
| 6 | TestBaseGIQLGenerator | A DISTANCE query run through passes 1-3 | Generating SQL | Emits the canonicalized distance CASE with <> | Emitter-level expanded output |
| 7 | TestBaseGIQLGenerator | A stranded DISTANCE query | Generating SQL | Emits the strand-guarded CASE with NULL guards | Stranded branch |
| 8 | TestBaseGIQLGenerator | A DISTANCE with a literal first operand | Running the expander pass | Raises "Literal range as first argument" | First-operand error contract |
| 9 | TestBaseGIQLGenerator | A DISTANCE with a literal second operand | Running the expander pass | Raises "Literal range as second argument" | Second-operand error contract |
| 10 | TestExpanderRegistryFallbackGaps | A registry entry captured by snapshot() | Registering a second entry afterward | The snapshot still holds only the first entry | Snapshot independence |
| 11 | TestExpanderRegistryFallbackGaps | A snapshot taken, then the registry cleared and re-registered | Calling restore() | The original entry resolves and the transient one is gone | Restore semantics |
| 12 | TestExpandOperatorsPass | A registered expander with the operator's flag held off | Running the pass | The operator node is left unexpanded | Per-type opt-in gate |
| 13 | TestNoOpWhenInert | An unmigrated DISJOIN query with the default registry | Transpiling with the pass versus a bypassed reference | SQL matches exactly with no expander alias prefix | Pass inertness |
| 14 | TestOperatorOptOut | An unmigrated operator class | Reading GIQL_EXPAND | It is False | Unmigrated opt-out |
| 15 | TestOperatorOptOut | A migrated operator class | Reading GIQL_EXPAND | It is True | Migrated opt-in |

Provide a public save/restore pair on the expander registry so a caller
can capture the current registrations and later re-install exactly that
baseline. This lets an isolating test fixture (or a plugin) clear and
mutate the process-wide registry around a body without permanently
losing the built-in expanders registered at import.

snapshot returns a fresh mapping, so mutating it does not affect the
registry; restore drops every current entry and re-installs the captured
contents.

Move DISTANCE generation off the legacy giqldistance_sql string emitter
and onto the registry's AST-expansion pass. A new auto-discovering
giql.expanders package holds a generic expander that builds the same
CASE expression as an AST subtree; transpile imports the package so the
process-wide registry is populated before the first transpile.

DISTANCE is the proof-of-concept for the expander protocol: it is a
single CASE with no joins or per-target divergence, so one generic
expander registered for GenericTarget serves every target. The four
shapes (unsigned/signed by non-stranded/stranded) and the bedtools
closest -d parity offset are preserved verbatim from the deleted
emitter.

The change is behavior-preserving. Because the CASE is now reserialized
by the active target's serializer rather than spliced in as a raw
string, the emitted text changes cosmetically only — most visibly the
chrom-mismatch guard renders the SQL-standard <> instead of !=.

Flip GIQLDistance.GIQL_EXPAND to True and delete the now-dead
giqldistance_sql and _distance_operand methods; the shared
_generate_distance_case helper stays for NEAREST.

Run the DISTANCE emitter-level tests through the expansion pass so they
exercise the new path, and update their pinned SQL to the reserialized
form (<> for the chrom-mismatch guard). The literal-range error tests
now assert the diagnostic is raised by the expander rather than the
deleted emitter.

Rework the registry leak guards in test_expander to treat the
import-time built-in registrations as the baseline rather than an empty
registry, using the new snapshot and restore seam so isolating fixtures
do not wipe the built-ins. Add an _opted_out helper and migrated-vs-
unmigrated operator parametrization so a shipped GIQL_EXPAND=True
operator can be held as a control, and cover snapshot/restore directly.

conradbzura self-assigned this Jun 28, 2026

Mark the DuckDB-executing distance UDF tests as integration, add a drift-guard parity test between the AST expander and the retained _generate_distance_case, a cross-target byte-identity test, and a Hypothesis property test for the distance invariants. Hoist the shared downstream branch in the stranded CASE, align helper docstrings, and refresh stale giqldistance_sql references. Make the registry docstrings mechanistic, restore the registry in place, harden expander auto-discovery, and key the opt-out control on a dynamically derived migrated operator.