Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
aba3779
docs: design spec for symbolic-allele-aware ILEN
d-laub Jun 4, 2026
9a05124
docs: implementation plan for symbolic-allele-aware ILEN
d-laub Jun 4, 2026
010f4b8
feat(exprs): symbolic_ilen helper + is_imprecise expression
d-laub Jun 4, 2026
047c980
test(exprs): multi-ALT symbolic_ilen regression test + docs
d-laub Jun 4, 2026
c038de4
test(fixtures): add symbolic SV builder for ILEN tests
d-laub Jun 4, 2026
d459525
test(fixtures): clarify CNV docstring + assert un-sizable records
d-laub Jun 5, 2026
e1919de
test(oracle): expected_ilen helper from GroundTruth
d-laub Jun 5, 2026
e3c2aa5
test(oracle): mirror symbolic_ilen IMPRECISE + END-fallback semantics
d-laub Jun 5, 2026
1bd9a21
test(oracle): normalize compound SV types; drop unreachable END arm
d-laub Jun 5, 2026
f80ac0d
feat(vcf): size symbolic SVs via SVLEN/END; persist corrected ILEN
d-laub Jun 5, 2026
63ecb18
fix(vcf): don't leak SV INFO placeholder cols; guard ILEN concat alig…
d-laub Jun 5, 2026
013790b
fix(vcf): restrict INFO header detection to INFO; strengthen ILEN ali…
d-laub Jun 5, 2026
72febe0
fix(ilen): coerce null ILEN to 0 at numpy materialization boundaries
d-laub Jun 5, 2026
a1543f0
fix(svar): coerce null ILEN to 0 in with-length read + overlap; test …
d-laub Jun 5, 2026
c2a8ad9
test(ilen): drop tautological assertion; document load-bearing fill_null
d-laub Jun 5, 2026
4167527
test: symbolic INV fixture row + is_symbolic/is_imprecise filter parity
d-laub Jun 5, 2026
228c6bc
test: clarify inert filter callable in parity test
d-laub Jun 5, 2026
8483277
feat(pgen): size symbolic SVs from PVAR INFO (SVLEN/END)
d-laub Jun 5, 2026
8054634
docs(pgen): document IMPRECISE flag regex intent
d-laub Jun 5, 2026
65c7731
test(svar): inherits corrected symbolic ILEN from source
d-laub Jun 5, 2026
50fa0e5
test: property test symbolic ILEN against vcfixture oracle
d-laub Jun 5, 2026
d8eaf43
docs(skill): document symbolic ILEN, is_imprecise, filter guidance
d-laub Jun 5, 2026
aae8ef9
fix(exprs): is_snp/is_indel exclude null-ILEN symbolic SVs; test END …
d-laub Jun 5, 2026
dda1182
test(pgen): filter parity for ~is_symbolic / ~is_imprecise on symboli…
d-laub Jun 5, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
881 changes: 881 additions & 0 deletions docs/superpowers/plans/2026-06-04-symbolic-allele-ilen.md

Large diffs are not rendered by default.

205 changes: 205 additions & 0 deletions docs/superpowers/specs/2026-06-04-symbolic-allele-ilen-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,205 @@
# Symbolic-allele-aware ILEN

**Status:** design approved, ready for implementation plan
**Date:** 2026-06-04 (revised after PR #51 merge)
**Scope:** VCF/BCF **and** PGEN paths (`VCF`, `PGEN`, `SparseVar.from_vcf` / `from_pgen`)
**Builds on:** PR #51 (`genoray.exprs.is_symbolic`, source-filter inheritance in
`SparseVar`, ALT-normalize-before-filter in `_load_index`). PR #51 named this exact
work as its future-work item: *"symbolic-allele expansion (`<DEL>` → precise via
INFO/END / SVLEN)."*

## 1. Problem

`genoray.exprs.ILEN` is defined as the byte-length difference between ALT and REF:

```python
ILEN = pl.col("ALT").list.eval(pl.element().str.len_bytes().cast(pl.Int32)) \
- pl.col("REF").str.len_bytes().cast(pl.Int32)
```

For symbolic ALTs this is meaningless. `<DEL>` against a single padding REF base
yields `len("<DEL>") - len("G") = 5 - 1 = +4` — a *positive* ILEN for a deletion.
The wrong value then propagates, silently and untested, into:

- variant span computation — `_var_ranges.py:65`, `end = POS - ILEN.clip(upper_bound=0)`
- `"variant"` overlap mode — `_svar.py:273`
- `var_counts` / range queries
- haplotype-length deltas — `_utils.py` `hap_ilens`, and downstream GVL `get_diffs_sparse`

PR #51 added `genoray.exprs.is_symbolic` so users *can* filter symbolic alleles out,
but a symbolic allele that is left in still carries a garbage ILEN. There is no
correct-ILEN computation and zero test coverage for it.

## 2. Why this matters (and what it does *not* fix)

genoray's ILEN feeds two kinds of consumer:

1. **Range / span / overlap / count** — `var_ranges`, `"variant"` mode, `var_counts`.
A correct symbolic ILEN makes precise SVs queryable and correctly-spanned.
2. **Haplotype reconstruction** — GenVarLoader (the primary consumer) sums ILEN
for length deltas *and* copies the literal ALT bytes into the haplotype buffer
(`GenVarLoader/python/genvarloader/_dataset/_genotypes.py:311,380`).

For a symbolic allele there is **no literal ALT sequence**, so haplotype
reconstruction cannot emit it regardless of how correct ILEN is. GVL already drops
symbolic alleles upstream (`bcftools norm`) and has no concept of `IMPRECISE` /
`CIPOS` / `CIEND`. Therefore:

- This work makes precise symbolic SVs **queryable and correctly-spanned**.
- It does **not** enable haplotype reconstruction of symbolic alleles — that is
inherent. A haplotype consumer such as GVL filters `~genoray.exprs.is_symbolic`
to drop *all* symbolic alleles (precise or not), since none have literal bytes.

## 3. Correct ILEN for precise SVs

For a **precise** symbolic SV, ILEN is the change in haplotype length, derived from
`SVLEN` (or `END - POS` for `<DEL>`/`<DUP>`), with the sign applied by SV type. This
normalizes the VCF 4.3/4.4 `SVLEN` sign-convention flip to a magnitude plus a
type-determined sign.

| ALT | ILEN | Reference span (`end`) |
|---------|-------------|------------------------------------------|
| `<DEL>` | `-|SVLEN|` | `POS + |SVLEN|` (extends, like a literal deletion) |
| `<INS>` | `+|SVLEN|` | `POS` (point footprint, like a literal insertion) |
| `<DUP>` | `+|SVLEN|` | `POS` (net insertion — matches genoray's insertion model) |

A precise symbolic SV then behaves identically to the equivalent literal indel for
range/span/overlap, which is the only genoray surface that can use it (§2).

`SVLEN` magnitude is taken as `|SVLEN|` when present; otherwise `|END - POS|` for
`<DEL>`/`<DUP>`. `<INS>` has no meaningful END, so it requires `SVLEN`.

## 4. Policy for un-sizable symbolic variants — NO new kwarg

A variant is **un-sizable** when it is symbolic and we cannot derive a precise
length: the `IMPRECISE` flag is set, `SVLEN`/`END` are missing or unparseable, or the
type is unsupported (`<BND>`, `<CNV>`, `<INV>`, `<*>` / `<NON_REF>`).

**genoray stays permissive — it does not drop or error, and adds no mode flag.** This
honors "one and only one way to do things" and PR #51's explicit non-goal of changing
default filtering. Concretely:

- Every variant is kept in the index (as today).
- Precise `<DEL>`/`<INS>`/`<DUP>` get the correct ILEN from §3.
- Un-sizable symbolic variants get **`ILEN = null`** in the polars index (per ALT
element). `null` is honest: it is neither a literal length nor a fake `0` that
would collide with SNPs.
- A new **derived** expression `genoray.exprs.is_imprecise` identifies them:

```python
is_imprecise = pl.col("ILEN").list.eval(pl.element().is_null()).list.any()
```

No persisted column, no constructor kwarg. Filtering is the user's job via the
existing `filter` / `pl_filter` API, exactly as PR #51 established:
`pgen = PGEN("f.pgen", filter=~genoray.exprs.is_imprecise)` keeps precise SVs and
drops only the un-sizable ones; `~is_symbolic` drops all symbolic.

### null → 0 only at the numpy boundary

A `null` ILEN cannot flow into the int32 numba kernels (`var_ranges` end
computation, `SparseVar` start/end/ilen arrays, GVL). Every site that materializes
ILEN to numpy must coerce `null → 0` with `.fill_null(0)` *at that boundary* (a
null/un-sizable variant becomes a zero-length point variant — safe: no false
deletion-span extension, no length delta). Known sites to audit and fix:

- `genoray/_var_ranges.py:65,126,180` — `pl.col("ILEN").list.first().clip(upper_bound=0)`
- `genoray/_svar.py` — the `ILEN.list.first()` → numpy materializations
(e.g. `:720`, `:2245-2249`, and the SEI cache in `_pgen.py`)

These are the "0 otherwise" fallbacks; the polars index keeps `null`.

## 5. Architecture

### 5.1 VCF / BCF path (`genoray/_vcf.py`)

- **Compute-at-construction, persist corrected ILEN.** `_write_gvi_index`
(lines ~1037–1077) currently persists only `{CHROM, POS, REF, ALT}` and ILEN is
computed lazily at load. Change it to compute the corrected ILEN at write time and
persist it, so `_load_index` (the `if "ILEN" not in schema` block, ~line 1108)
finds it on disk and does not recompute.
- **Header-gated SVLEN/END extraction.** To size symbolic SVs we need `SVLEN`/`END`,
which live in INFO. Requesting an INFO field that the VCF header does not declare
can error in oxbow, and most plain VCFs declare neither. So: introspect the cyvcf2
header for which of `SVLEN`/`END` are declared (helper `_declared_info_fields`),
request only those via `get_record_info(info=[...])`, compute ILEN, then **do not
persist SVLEN/END** unless the user explicitly asked for them via `info=`. A
symbolic variant in a VCF that declares neither field is therefore un-sizable →
`null` ILEN → `is_imprecise` (correct).
- INFO fields come back uppercased; `Number=A` fields like `SVLEN` are `list`-typed
(`.list.first()` to read).

### 5.2 PGEN path (`genoray/_pgen.py`)

- plink2 writes symbolic ALTs verbatim into the PVAR (PR #51 empirical finding), and
the PGEN `.gvi` already persists the raw `INFO` string column (`_scan_pvar`,
~lines 1212–1256; sunk by `_write_index`). So at PGEN `_load_index`
(the ILEN block ~line 1161), extract `SVLEN`/`END` from the persisted INFO string
with polars regex (`str.extract(r"(?:^|;)SVLEN=(-?\d+)")`, etc.), detect the
`IMPRECISE` flag, and compute the corrected ILEN with the same shared helper as the
VCF path. No header dependency — a missing `SVLEN=` simply yields null.

### 5.3 Shared ILEN logic (DRY)

A single private helper produces the corrected per-ALT ILEN `List[Int32]` expression
(with `null` for un-sizable), given `ALT`, `REF`, and the `SVLEN`/`END`/`IMPRECISE`
values already extracted into columns. Both paths extract those columns their own way
(oxbow INFO columns for VCF; regex over the INFO string for PGEN) and then call it.
Likely home: `genoray/exprs.py` (private) or `genoray/_utils.py`. The public
`genoray.exprs.ILEN` expression stays **literal** (`len(ALT) - len(REF)`) — GVL and
others import and apply it to already-normalized literal-sequence indexes
(`GenVarLoader/.../_haps.py:26`); requiring an SVLEN column there would break them.

### 5.4 SparseVar (`genoray/_svar.py`)

After PR #51, `_write_filtered_index` passes a persisted-on-disk ILEN straight
through (`ilen_added = False`), so once VCF/PGEN persist the corrected ILEN it flows
into the SVAR unchanged. Only the numpy-boundary `fill_null(0)` fixes (§4) are needed
on the SVAR query/scan side.

## 6. Public API surface & SKILL.md

Public names touched, so `skills/genoray-api/SKILL.md` **must** be updated in the same
PR (per CLAUDE.md):

- corrected `ILEN` semantics for precise symbolic SVs, and `null` ILEN for un-sizable
ones (behavior change)
- new derived expression `genoray.exprs.is_imprecise`
- guidance: filter `~is_symbolic` (haplotype consumers, drops all SVs) vs
`~is_imprecise` (range consumers, keeps precise SVs)

No new constructor kwarg is added.

## 7. Testing (the "harden the tests" goal)

Using vcfixture 0.6.0's symbolic support (`Sym.deletion/insertion/duplication`,
`Bnd`, `SVLEN`/`SVCLAIM`/`END`/`IMPRECISE` in `info=`):

- **Named fixtures** (`tests/data/fixtures.py`): a `symbolic` builder with precise
`<DEL>` / `<INS>` / `<DUP>` (with `SVLEN`, plus `SVCLAIM` where the version
requires it), plus un-sizable cases — an `IMPRECISE` `<DEL>`, a `<BND>`, and a
`<CNV>`. Add a symbolic VCF→PGEN step to `gen_from_vcf.sh` so the PGEN path is
covered (plink2 carries the symbolic ALTs through).
- **Oracle helper** (`tests/_oracle.py`): `expected_ilen(truth, idx)` deriving the
expected per-record ILEN from `GroundTruth.alts_truth[rec][alt]`
(`.sv_type` → DEL `-`, INS/DUP `+`; `.svlen`/`.sv_end`), and `None` for un-sizable.
- **VCF + PGEN ILEN tests**: persisted ILEN equals the oracle expectation for precise
SVs; `is_imprecise` is `True` exactly for the un-sizable rows; the un-sizable rows'
ILEN is `null`.
- **Span / overlap**: a precise `<DEL>` of length N is returned by a query overlapping
`[POS, POS+N)` and excluded just outside it — locks the `_var_ranges` fix and the
`fill_null(0)` boundary.
- **Filter parity**: `~is_symbolic` drops all symbolic; `~is_imprecise` keeps precise
SVs and drops un-sizable; both work on VCF (paired cyvcf2 + pl_filter) and PGEN, and
inherit into `SparseVar`.
- **Property test**: `@given(vs.symbolic_documents())` round-tripped through `VCF`,
asserting ILEN matches the oracle for the sizable subset and the policy
(`null`/`is_imprecise`) holds for the rest.

## 8. Out of scope

- Haplotype reconstruction of symbolic alleles (inherent — no literal sequence).
- Breakend mate / confidence-interval (`CIPOS`/`CIEND`) semantics.
- Multiallelic records mixing precise and symbolic ALTs are handled best-effort
(per-ALT `null` for the un-sizable element); fixtures focus on biallelic symbolic.
- Changing the public literal `genoray.exprs.ILEN` expression.
40 changes: 36 additions & 4 deletions genoray/_pgen.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
from ._types import POS_MAX, POS_TYPE
from ._utils import ContigNormalizer, format_memory, hap_ilens, parse_memory
from ._var_ranges import var_counts, var_indices
from .exprs import ILEN, is_biallelic
from .exprs import ILEN, is_biallelic, symbolic_ilen

V_IDX_TYPE = np.uint32
"""Dtype for PGEN variant indices (uint32). This determines the maximum number of unique variants in a file."""
Expand Down Expand Up @@ -1159,7 +1159,34 @@ def _load_index(
index = index.with_columns(pl.col("ALT").str.split(","))

if "ILEN" not in schema:
index = index.with_columns(ILEN=ILEN)
if "INFO" in schema.names():
# ILEN is intentionally recomputed from the persisted INFO string on
# each _load_index call. PGEN does NOT persist the computed ILEN in
# the .gvi file (unlike the VCF path, which writes ILEN at index-build
# time). Do NOT "optimise" this into persistence without also updating
# _write_index to store ILEN — otherwise old .gvi files would silently
# miss symbolic-SV ILEN corrections.
#
# Regex-extract SVLEN/END/IMPRECISE from the PVAR INFO string, then
# use the shared symbolic_ilen() helper so symbolic SVs get correct
# sign-adjusted lengths (DEL→-|len|, INS/DUP→+|len|, others→null).
info_col = pl.col("INFO").fill_null("")
index = index.with_columns(
info_col.str.extract(r"(?:^|;)SVLEN=(-?\d+)", 1)
.cast(pl.Int64)
.alias("SVLEN"),
info_col.str.extract(r"(?:^|;)END=(\d+)", 1)
.cast(pl.Int64)
.alias("END"),
# IMPRECISE is a VCF Flag (Number=0): match the bare token only.
# Do NOT broaden to IMPRECISE=… — that would wrongly treat
# IMPRECISE=0 as set.
info_col.str.contains(r"(?:^|;)IMPRECISE(?:;|$)").alias("IMPRECISE"),
)
index = index.with_columns(ILEN=symbolic_ilen())
index = index.drop("SVLEN", "END", "IMPRECISE")
else:
index = index.with_columns(ILEN=ILEN)

if filter is None:
has_multiallelics = index.select((~is_biallelic).any()).collect().item()
Expand All @@ -1175,11 +1202,16 @@ def _load_index(
# anyway, so they won't be accessed
# if the filter is changed, the index is invalidated and re-read (see filter setter)
if filter is None:
data = index.select("start", "end", pl.col("ILEN", "ALT").list.first())
data = index.select(
"start",
"end",
pl.col("ILEN").list.first().fill_null(0).alias("ILEN"),
pl.col("ALT").list.first(),
)
else:
data = index.with_columns(
ILEN=pl.when(filter)
.then(pl.col("ILEN").list.first())
.then(pl.col("ILEN").list.first().fill_null(0))
.otherwise(pl.lit(0))
)
data = data.select("start", "end", "ILEN", pl.col("ALT").list.first())
Expand Down
5 changes: 3 additions & 2 deletions genoray/_svar.py
Original file line number Diff line number Diff line change
Expand Up @@ -717,7 +717,7 @@ def _find_starts_ends_with_length(
ends,
var_ranges,
v_starts,
self.index["ILEN"].list.first().to_numpy(),
self.index["ILEN"].list.first().fill_null(0).to_numpy(),
s_idxs,
self.ploidy,
self._c_max_idxs[c],
Expand Down Expand Up @@ -2276,7 +2276,8 @@ def _get_strand_and_codon_pos(
var_id="index",
chrom="CHROM",
start=pl.col("POS"),
end=pl.col("POS") - pl.col("ILEN").list.first().clip(upper_bound=0),
end=pl.col("POS")
- pl.col("ILEN").list.first().clip(upper_bound=0).fill_null(0),
)
var_intervals.config_meta.set(coordinate_system_zero_based=False) # type: ignore

Expand Down
12 changes: 9 additions & 3 deletions genoray/_var_ranges.py
Original file line number Diff line number Diff line change
Expand Up @@ -61,8 +61,12 @@ def var_ranges(
# 0-based
v_starts = var_table["POS"].to_numpy() - 1
# 0-based, exclusive end
# null ILEN = un-sizable symbolic SV (IMPRECISE / unsupported type); treat as a
# point variant. A single null upcasts the Int32 column to Float64/NaN and breaks
# the numba coordinate kernels.
v_ends = (
var_table["POS"] - var_table["ILEN"].list.first().clip(upper_bound=0)
var_table["POS"]
- var_table["ILEN"].list.first().clip(upper_bound=0).fill_null(0)
).to_numpy()
max_v_len = (v_ends - v_starts).max()

Expand Down Expand Up @@ -123,7 +127,8 @@ def var_indices(
pl.col("index").cast(np_to_pl_dtype(idx_dtype)),
chrom=pl.col("CHROM").cast(pl.Utf8),
start=pl.col("POS") - 1,
end=pl.col("POS") - pl.col("ILEN").list.first().clip(upper_bound=0),
end=pl.col("POS")
- pl.col("ILEN").list.first().clip(upper_bound=0).fill_null(0),
)
)
var_table.config_meta.set(coordinate_system_zero_based=True) # type: ignore
Expand Down Expand Up @@ -177,7 +182,8 @@ def var_counts(
.select(
chrom=pl.col("CHROM").cast(pl.Utf8),
start=pl.col("POS") - 1,
end=pl.col("POS") - pl.col("ILEN").list.first().clip(upper_bound=0),
end=pl.col("POS")
- pl.col("ILEN").list.first().clip(upper_bound=0).fill_null(0),
)
)
var_table.config_meta.set(coordinate_system_zero_based=True) # type: ignore
Expand Down
Loading
Loading