17 changes: 17 additions & 0 deletions docs/transpilation/schema-mapping.rst
@@ -173,6 +173,23 @@ If your data uses 1-based coordinates (like VCF or GFF), configure the
],
)

.. note::

**Non-canonical encodings currently require a DuckDB-compatible engine.**
When a table declares an encoding other than the default 0-based half-open
(for example ``coordinate_system="1based"`` or ``interval_type="closed"``),
GIQL canonicalizes its coordinates by wrapping the relation in a hidden CTE
that uses ``SELECT * REPLACE (...)`` syntax. That syntax is supported by
DuckDB, BigQuery, Snowflake, and ClickHouse, but **not** by PostgreSQL,
SQLite, or DataFusion. Tables in the default 0-based half-open encoding are
unaffected -- they take an identity fast path that emits portable SQL.

To target a non-``REPLACE`` engine today, store your data in 0-based
half-open form, or convert it explicitly in a CTE and reference that CTE
(which GIQL treats as already canonical). Making canonicalization emit
portable SQL on every engine is tracked in
`#132 <https://github.com/abdenlab/giql/issues/132>`_.
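
The explicit-conversion workaround can be sketched outside GIQL. This is a minimal illustration, assuming a hypothetical ``variants`` table stored as 1-based closed intervals with columns ``chrom``, ``start``, ``end``, and ``name``; the CTE name ``variants_canon`` is also made up. It lists every column explicitly instead of using ``* REPLACE``, so it runs on SQLite through Python's standard ``sqlite3`` module:

```python
import sqlite3

# Hypothetical source table holding 1-based closed intervals (VCF/GFF style).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE variants (chrom TEXT, "start" INTEGER, "end" INTEGER, name TEXT)'
)
conn.execute("INSERT INTO variants VALUES ('chr1', 101, 200, 'a')")

# Portable canonicalization: an explicit column list, no `SELECT * REPLACE`.
# 1-based closed -> 0-based half-open shifts start by -1 and keeps end as is.
PORTABLE_SQL = """
WITH variants_canon AS (
    SELECT chrom, ("start" - 1) AS "start", "end" AS "end", name
    FROM variants
)
SELECT chrom, "start", "end", name FROM variants_canon
"""

rows = conn.execute(PORTABLE_SQL).fetchall()
print(rows)
```

The cost of this form is that the column list must be kept in sync with the source table by hand, which is the trade-off the ``* REPLACE`` wrapper avoids on engines that support it.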

Working with Point Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~

109 changes: 80 additions & 29 deletions src/giql/canonicalizer.py
@@ -34,6 +34,20 @@
tradeoff the epic calls out (only synthesize a wrapper when canonicalization
actually changes columns).

Engine portability (known limitation)
-------------------------------------
The wrapper projection uses ``SELECT * REPLACE (...)`` to canonicalize the
interval columns in place while passing every other source column through
untouched (the registry declares only the genomic columns, so an explicit
full-column projection is not available). ``* REPLACE`` is supported by DuckDB,
BigQuery, Snowflake, and ClickHouse, but **not** by PostgreSQL, SQLite, or
DataFusion — so a non-canonical encoding currently transpiles to
engine-incompatible SQL on those targets. Identity-encoded (default 0-based
half-open) relations are unaffected: they skip wrapping entirely and emit
portable SQL. Making the emit strategy dialect-aware (an explicit portable
projection when the target lacks ``REPLACE`` or the full schema is declared) is
tracked in https://github.com/abdenlab/giql/issues/132.

Gating (epic #114, step 6)
--------------------------
The pass is gated per operator by a ``GIQL_CANONICALIZE`` class attribute on the
@@ -48,11 +62,17 @@

De-canonicalization hook
-------------------------
-The outermost ``SELECT`` projection receives a de-canonicalization rewrite for
-any output column that a migrated operator emitted in canonical form but that
-must land in the user's preferred encoding. With no operator migrated in this
-issue that rewrite has nothing to act on; :func:`_decanonicalize_outputs` is the
-designed-but-inert hook the port issues will fill in.
+A migrated operator's *output* columns must land back in the target relation's
+declared encoding. Epic #114 step 6 envisioned a rewrite of the outermost
+``SELECT`` projection, but that placement is wrong for a table function: DISJOIN
+synthesizes its ``disjoin_*`` output and its passed-through interval at
+*generation* time, so those columns do not exist as AST in this pass, and a
+``SELECT *`` consumer hides them from any outer-projection rewrite. So
+:func:`_decanonicalize_outputs` instead records each wrapped slot's *original*
+:class:`~giql.table.Table` on the operator's
+:class:`~giql.resolver.OperatorResolution`, and the operator's emitter reads it
+to de-canonicalize those synthesized columns where it generates them (DISJOIN,
+issue #122).
"""

from __future__ import annotations
@@ -246,22 +266,39 @@ def _fresh_name(next_name, taken: set[str]) -> str:
def _canonical_projection(ref: ResolvedRef) -> exp.Select:
"""Build the ``SELECT`` body that projects *ref*'s table to canonical form.

-The projection exposes the canonical ``chrom`` / ``start`` / ``end`` columns
-under their original physical names, with ``start`` / ``end`` rewritten by
-the :mod:`giql.canonical` arithmetic for the table's declared encoding. This
-is the interval contract every CTE / subquery reference is assumed to satisfy
-(canonical 0-based half-open ``chrom`` / ``start`` / ``end``); operator port
-issues #122 / #123 may extend it with pass-through columns as their emitters
-require.
+The projection is a **full-row passthrough**: ``SELECT *`` keeps every
+physical column of the source relation, and a star ``REPLACE`` rewrites only
+the two interval columns — ``start`` / ``end``, under their original physical
+names — with the :mod:`giql.canonical` arithmetic for the table's declared
+encoding. ``chrom`` and every non-interval column flow through the star
+untouched.
+
+The full row (rather than a bare ``chrom`` / ``start`` / ``end`` triple) is
+required by table-function operators whose final projection passes the whole
+source row through — DISJOIN's ``SELECT t.*`` (#122) — and by their join-back
+semantics, which key on the source's physical columns. A CTE / subquery
+reference that only needs the canonical interval triple still reads those
+three columns from the same wrapper.
"""
-chrom, start, end = ref.cols
+_chrom, start, end = ref.cols
table = ref.table
relation = ref.name
-return exp.select(
-    exp.alias_(exp.column(chrom), chrom),
-    exp.alias_(_canonical_start_expr(start, table), start),
-    exp.alias_(_canonical_end_expr(end, table), end),
-).from_(exp.to_table(relation))
+# Quote the interval identifiers: the canonical column names are physical and
+# routinely reserved words (the default genomic layout's ``start`` / ``end``),
+# so the executed wrapper must quote them.
+star = exp.Star(
+    replace=[
+        exp.alias_(
+            _canonical_start_expr(start, table),
+            exp.to_identifier(start, quoted=True),
+        ),
+        exp.alias_(
+            _canonical_end_expr(end, table),
+            exp.to_identifier(end, quoted=True),
+        ),
+    ]
+)
+return exp.Select(expressions=[star]).from_(exp.to_table(relation))
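
As a worked illustration of the arithmetic the wrapper applies, here is a minimal sketch on plain integers. The function names `to_canonical` and `from_canonical` are hypothetical stand-ins, not the `giql.canonical` API. The two 1-based rules match the docstrings shown in this diff; the 0-based closed rule (`end + 1`) does not appear in the visible hunks and is an assumption derived from the interval definitions.

```python
# Shift added to `end` when converting TO canonical 0-based half-open.
# The 1-based rows mirror the docstrings in this diff; the 0-based closed
# row (+1) is an assumption, since that case is not visible in the hunks.
END_SHIFT = {
    ("0based", "half_open"): 0,
    ("0based", "closed"): 1,
    ("1based", "half_open"): -1,
    ("1based", "closed"): 0,
}


def to_canonical(start, end, coordinate_system="0based", interval_type="half_open"):
    """Convert a declared-encoding interval to 0-based half-open."""
    start_shift = -1 if coordinate_system == "1based" else 0
    return start + start_shift, end + END_SHIFT[(coordinate_system, interval_type)]


def from_canonical(start, end, coordinate_system="0based", interval_type="half_open"):
    """Inverse of to_canonical: restore the declared encoding."""
    start_shift = -1 if coordinate_system == "1based" else 0
    return start - start_shift, end - END_SHIFT[(coordinate_system, interval_type)]


# Bases 101..200 written as a 1-based closed interval.
canonical = to_canonical(101, 200, "1based", "closed")
print(canonical)
```

Every encoding maps the same set of bases to the same canonical pair, and `from_canonical` undoes the shift exactly, which is the round-trip property the operator output relies on.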


def _canonical_start_expr(start: str, table: Table | None) -> exp.Expression:
@@ -272,7 +309,7 @@ def _canonical_start_expr(start: str, table: Table | None) -> exp.Expression:
- ``0based``: ``start`` (identity)
- ``1based``: ``start - 1``
"""
-col = exp.column(start)
+col = exp.column(exp.to_identifier(start, quoted=True))
if table is None or table.coordinate_system == "0based":
    return col
return exp.paren(exp.Sub(this=col, expression=exp.Literal.number(1)))
@@ -288,7 +325,7 @@ def _canonical_end_expr(end: str, table: Table | None) -> exp.Expression:
- ``1based`` / ``half_open``: ``end - 1``
- ``1based`` / ``closed``: ``end`` (identity)
"""
-col = exp.column(end)
+col = exp.column(exp.to_identifier(end, quoted=True))
if table is None:
    return col
key = (table.coordinate_system, table.interval_type)
@@ -343,13 +380,27 @@ def _decanonicalize_outputs(
expression: exp.Expression,
targets: list[tuple[exp.Expression, str, ResolvedRef]],
) -> None:
-"""De-canonicalize migrated operator outputs in the outermost projection.
-
-Inert hook (epic #114, step 6). The outermost ``SELECT`` projection list
-should rewrite any output column a migrated operator emitted in canonical
-form back into the user's preferred encoding. No operator is migrated in
-issue #121, so there is nothing to rewrite; the operator port issues (#122,
-#123) fill this in alongside flipping their ``GIQL_CANONICALIZE`` flags.
+"""Preserve each wrapped slot's original encoding for the emitter's output.
+
+A wrapped slot's :class:`~giql.resolver.ResolvedRef` is rewritten to a
+``Table``-free canonical-CTE ref, which would otherwise lose the
+(non-canonical) encoding the operator's *output* must round-trip back into.
+
+The de-canonicalization itself cannot be applied on the AST in this pass for
+a table-function operator: DISJOIN synthesizes its ``disjoin_*`` columns and
+its passed-through interval at *generation* time, so those columns do not
+exist as AST here, and a ``SELECT *`` consumer hides them from any
+outer-projection rewrite. The originally-envisioned outermost-projection
+rewrite (epic #114, step 6) is therefore wrong for projected
+table-function columns; instead this hook records the per-slot original
+:class:`~giql.table.Table` on the :class:`~giql.resolver.OperatorResolution`,
+and the operator's emitter reads it to de-canonicalize those synthesized
+columns where it generates them (see :issue:`122`).
+
+*targets* carries the original (pre-rewrite) refs, so ``ref.table`` is the
+source relation's declared encoding.
"""
-# Intentionally empty until an operator opts in (see module docstring).
-return None
+for node, arg, ref in targets:
+    resolution = node.meta.get(META_KEY)
+    if isinstance(resolution, OperatorResolution):
+        resolution.output_tables[arg] = ref.table
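
The record-then-read handoff described in this docstring can be sketched with plain dataclasses. Everything below is a simplified, hypothetical model: `Encoding`, `SlotRef`, and `Resolution` only stand in for `giql.table.Table`, `ResolvedRef`, and `OperatorResolution`, the CTE name suffix is invented, and wrapping and recording are merged into one helper although the real pass performs them in separate steps.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Encoding:
    """Hypothetical stand-in for the encoding fields of giql.table.Table."""
    coordinate_system: str = "0based"
    interval_type: str = "half_open"


@dataclass
class SlotRef:
    """Hypothetical stand-in for ResolvedRef: a relation name plus its table."""
    name: str
    table: Optional[Encoding]


@dataclass
class Resolution:
    """Hypothetical stand-in for OperatorResolution."""
    slots: Dict[str, SlotRef]
    output_tables: Dict[str, Encoding] = field(default_factory=dict)


def wrap_slot(resolution: Resolution, arg: str, cte_name: str) -> None:
    """Pass-2 side: record the original encoding, then point at the CTE."""
    original = resolution.slots[arg]
    if original.table is not None:
        resolution.output_tables[arg] = original.table
    resolution.slots[arg] = SlotRef(cte_name, table=None)  # Table-free ref


def output_encoding(resolution: Resolution, arg: str) -> Optional[Encoding]:
    """Emitter side: prefer the preserved encoding, else the slot's own."""
    preserved = resolution.output_tables.get(arg)
    if preserved is not None:
        return preserved
    return resolution.slots[arg].table


vcf_like = Encoding("1based", "closed")
res = Resolution(slots={"this": SlotRef("variants", vcf_like)})
wrap_slot(res, "this", "__giql_canon_0")  # suffix "_0" is made up
print(res.slots["this"].table, output_encoding(res, "this"))
```

The fallback branch matters for unwrapped targets: a slot that was never rewritten has no entry in `output_tables`, so the emitter reads the encoding straight from the slot.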
9 changes: 9 additions & 0 deletions src/giql/expressions.py
@@ -336,6 +336,15 @@ class GIQLDisjoin(exp.Func):
"reference": False, # Optional: reference table/CTE name or subquery
}

#: Opt DISJOIN into the CanonicalizeCoordinates pass (epic #114 step 7,
#: issue #122). With this flag set, pass 2 wraps every non-canonical
#: interval-bearing operand in a canonical ``__giql_canon_*`` CTE and
#: rewrites the slot to point at it, so the emitter consumes already-canonical
#: 0-based half-open columns instead of canonicalizing inline. Identity
#: (0-based half-open) operands are left unwrapped and the emitted SQL stays
#: byte-identical.
GIQL_CANONICALIZE = True

GIQL_SLOTS = (
SlotSpec("this", frozenset({"registered_table"}), required=True),
SlotSpec(
120 changes: 97 additions & 23 deletions src/giql/generators/base.py
@@ -280,41 +280,56 @@ def giqldisjoin_sql(self, expression: GIQLDisjoin) -> str:
filter drops sub-intervals overlapping no reference interval. When no
``reference`` is given it defaults to the target set.

-Coordinate-system round-tripping is handled by
-:func:`giql.canonical.decanonical_start` /
-:func:`giql.canonical.decanonical_end`.
+Input canonicalization is owned by ``CanonicalizeCoordinates`` (pass 2,
+issue #122): every non-canonical interval-bearing operand is rewritten to
+a canonical ``__giql_canon_*`` CTE before generation, so this emitter
+consumes already-canonical 0-based half-open columns and applies no
+in-emitter canonicalization arithmetic. The output round-trip back to the
+target's declared encoding stays here — the ``disjoin_*`` columns and the
+passed-through interval are synthesized at generation time and cannot be
+reached by a pass-2 outermost-projection rewrite — driven by the original
+encoding the pass preserves on the resolution.

:param expression:
GIQLDisjoin expression node
:return:
SQL string (a parenthesized WITH-CTE subquery) for the DISJOIN table
"""
-# Unpack the resolution metadata attached by ResolveOperatorRefs (pass 1).
+# Unpack the resolution metadata attached by ResolveOperatorRefs (pass 1)
+# and rewritten by CanonicalizeCoordinates (pass 2).
target_ref, ref, ref_from = self._disjoin_resolution(expression)
target_name = target_ref.name
target_chrom, target_start, target_end = target_ref.cols
-target_table = target_ref.table
ref_chrom, ref_start, ref_end = ref.cols
-ref_table = ref.table
is_self_reference = ref.coverage_skippable

-# Canonical target endpoints, qualified by the __giql_dj_tgt alias.
+# The target's *declared* encoding, which disjoin_* output and the
+# passed-through interval must round-trip back into. Pass 2 preserves it
+# on the resolution when it wraps a non-canonical target (the slot's own
+# Table is then None); a canonical target is left unwrapped and its slot
+# Table carries the (identity) encoding.
+output_table = self._disjoin_output_encoding(expression, target_ref)
+
+# Post-pass every operand is canonical 0-based half-open (a registered
+# table is either identity-encoded or rewritten to a canonical CTE), so
+# the physical columns are consumed verbatim with no canonicalization.
t_chrom = f't."{target_chrom}"'
-t_start = canonical_start(f't."{target_start}"', target_table)
-t_end = canonical_end(f't."{target_end}"', target_table)
-
-# Canonical reference endpoints: unqualified for the breakpoint CTE,
-# qualified by 'r' for the coverage EXISTS filter.
-bp_start = canonical_start(f'"{ref_start}"', ref_table)
-bp_end = canonical_end(f'"{ref_end}"', ref_table)
-r_start = canonical_start(f'r."{ref_start}"', ref_table)
-r_end = canonical_end(f'r."{ref_end}"', ref_table)
-
-# disjoin_start / disjoin_end are emitted in the target table's
-# coordinate system so an output row carries one convention; the cut
-# math above stays canonical internally.
-out_start = decanonical_start("s.seg_start", target_table)
-out_end = decanonical_end("s.seg_end", target_table)
+t_start = f't."{target_start}"'
+t_end = f't."{target_end}"'
+
+# Reference endpoints: unqualified for the breakpoint CTE, qualified by
+# 'r' for the coverage EXISTS filter.
+bp_start = f'"{ref_start}"'
+bp_end = f'"{ref_end}"'
+r_start = f'r."{ref_start}"'
+r_end = f'r."{ref_end}"'
+
+# disjoin_start / disjoin_end are emitted in the target's declared
+# coordinate system so an output row carries one convention; the cut math
+# stays canonical internally.
+out_start = decanonical_start("s.seg_start", output_table)
+out_end = decanonical_end("s.seg_end", output_table)
+passthrough = self._disjoin_passthrough(target_start, target_end, output_table)

# Build the WITH clause one named fragment per __giql_dj_* CTE so each
# block reads on its own. The `seg_end > seg_start` guard in the final
@@ -361,7 +376,8 @@ def giqldisjoin_sql(self, expression: GIQLDisjoin) -> str:
)
where_sql = " AND ".join(where_clauses)
final_select = (
-f"SELECT t.*, s.kc AS disjoin_chrom, {out_start} AS disjoin_start, "
+f"SELECT {passthrough}, s.kc AS disjoin_chrom, "
+f"{out_start} AS disjoin_start, "
f"{out_end} AS disjoin_end FROM __giql_dj_tgt AS t "
f'JOIN __giql_dj_segs AS s ON t."{target_chrom}" = s.kc '
f'AND t."{target_start}" = s.ks AND t."{target_end}" = s.ke '
@@ -372,6 +388,64 @@ def giqldisjoin_sql(self, expression: GIQLDisjoin) -> str:
f"{cuts_cte}, {segs_cte} {final_select})"
)

def _disjoin_output_encoding(
self, expression: GIQLDisjoin, target_ref: ResolvedRef
) -> Table | None:
"""Return the target's declared encoding for DISJOIN's output round-trip.

``CanonicalizeCoordinates`` (pass 2) records the original
:class:`~giql.table.Table` on the resolution when it wraps a non-canonical
target (blanking the slot's own ``table``). For an unwrapped target — a
canonical registered table, or any target when the pass did not run — the
slot's own ``table`` carries the (identity) encoding.

:param expression:
GIQLDisjoin expression node
:param target_ref:
The resolved target reference (post pass 2)
:return:
The target's declared :class:`~giql.table.Table`, or ``None``
"""
resolution = expression.meta.get(META_KEY)
if isinstance(resolution, OperatorResolution):
    preserved = resolution.output_tables.get("this")
    if preserved is not None:
        return preserved
return target_ref.table

def _disjoin_passthrough(
self, target_start: str, target_end: str, output_table: Table | None
) -> str:
"""Project the target's full row, de-canonicalizing the interval columns.

When the target's declared encoding is canonical 0-based half-open the
row passes through as a plain ``t.*`` — the byte-identical identity fast
path. When it is non-canonical the interval columns, canonical inside
``__giql_dj_tgt``, are de-canonicalized back into that encoding via a star
``REPLACE`` so the passed-through interval matches the target's own
convention. (Only non-canonical targets are wrapped, so the ``REPLACE``
appears only where a canonical CTE already shapes the SQL.)

:param target_start:
Physical start column name
:param target_end:
Physical end column name
:param output_table:
The target's declared :class:`~giql.table.Table`, or ``None``
:return:
The passthrough projection fragment (``t.*`` or a star ``REPLACE``)
"""
if output_table is None or (
    output_table.coordinate_system == "0based"
    and output_table.interval_type == "half_open"
):
    return "t.*"
pt_start = decanonical_start(f't."{target_start}"', output_table)
pt_end = decanonical_end(f't."{target_end}"', output_table)
return (
    f't.* REPLACE ({pt_start} AS "{target_start}", {pt_end} AS "{target_end}")'
)
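
To make the two branches concrete, here is a standalone sketch that builds the same kind of passthrough fragment from strings. The helpers `shifted` and `passthrough_sql` and the shift tables are hypothetical; in particular, how `giql.canonical.decanonical_start` / `decanonical_end` actually render their arithmetic is not shown in this diff, so the parenthesised `+ 1` / `- 1` form below is an assumption.

```python
# Shift applied to a canonical column to restore the declared encoding.
# Hypothetical stand-ins for giql.canonical.decanonical_start / _end; the
# real helpers may format their SQL differently.
START_SHIFT = {"0based": 0, "1based": 1}
END_SHIFT = {
    ("0based", "half_open"): 0,
    ("0based", "closed"): -1,
    ("1based", "half_open"): 1,
    ("1based", "closed"): 0,
}


def shifted(expr: str, delta: int) -> str:
    """Render `expr` offset by `delta`, leaving identity shifts untouched."""
    if delta == 0:
        return expr
    sign = "+" if delta > 0 else "-"
    return f"({expr} {sign} {abs(delta)})"


def passthrough_sql(start_col, end_col, coordinate_system="0based",
                    interval_type="half_open"):
    """Return `t.*` for the identity encoding, else a star REPLACE fragment."""
    key = (coordinate_system, interval_type)
    if key == ("0based", "half_open"):
        return "t.*"  # identity fast path: emitted SQL is unchanged
    pt_start = shifted(f't."{start_col}"', START_SHIFT[coordinate_system])
    pt_end = shifted(f't."{end_col}"', END_SHIFT[key])
    return f't.* REPLACE ({pt_start} AS "{start_col}", {pt_end} AS "{end_col}")'


identity = passthrough_sql("start", "end")
one_based_closed = passthrough_sql("start", "end", "1based", "closed")
print(identity)
print(one_based_closed)
```

Because only non-canonical targets reach the `REPLACE` branch, the engine-portability limitation noted for the canonicalizing wrapper applies to this fragment in exactly the same cases.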

def giqldistance_sql(self, expression: GIQLDistance) -> str:
"""Generate SQL CASE expression for DISTANCE function.

11 changes: 11 additions & 0 deletions src/giql/resolver.py
@@ -346,12 +346,23 @@ class OperatorResolution:
resolve (a literal range, or an unqualified column outside a
current-table context) is left out and the generator raises its existing
error.
output_tables : dict[str, Table]
Mapping from a slot's arg key to the *original* :class:`~giql.table.Table`
whose declared encoding that slot carried before
:func:`giql.canonicalizer.canonicalize_coordinates` (pass 2) wrapped it
in a canonical CTE and blanked its ``ResolvedRef.table``. The pass
populates this for every slot it wraps so the operator's emitter can
round-trip *synthesized* output columns (DISJOIN's ``disjoin_*`` and its
passed-through interval) back into that encoding — columns that do not
exist as AST in pass 2 and that a ``SELECT *`` consumer hides from any
outer-projection rewrite. Empty until pass 2 wraps a slot.
"""

operator: str
slots: dict[str, ResolvedRef | ResolvedInterval]
deferrals: dict[str, SlotDeferral] = field(default_factory=dict)
columns: dict[str, ResolvedColumn] = field(default_factory=dict)
output_tables: dict[str, Table] = field(default_factory=dict)

def slot(self, arg: str) -> ResolvedRef | ResolvedInterval | None:
"""Return the resolved metadata for slot *arg*, or ``None``."""