feat: symbolic-allele-aware ILEN #52

Merged: d-laub merged 24 commits into main from worktree-test-vcfix on Jun 5, 2026

@d-laub (Owner) commented Jun 5, 2026

Summary

Computes the correct indel length (ILEN) for precise symbolic structural variants (<DEL>/<INS>/<DUP>) on both the VCF and PGEN paths. Un-sizable symbolic variants get ILEN = null and are flagged by a derived is_imprecise expression. Previously, symbolic ALTs received a meaningless ILEN computed as the literal length difference between the ALT and REF strings.

  • genoray.exprs.symbolic_ilen(): per-ALT List[Int32] ILEN. Returns -|SVLEN| for <DEL> and +|SVLEN| for <INS>/<DUP> (falling back to |END - POS| when SVLEN is absent); null for un-sizable symbolic alleles (IMPRECISE, missing SVLEN/END, or unsupported types <BND>/<CNV>/<INV>/<*>); and the literal len(ALT) - len(REF) for non-symbolic alleles.
  • genoray.exprs.is_imprecise: new expression, True when any ALT's ILEN is null.
  • VCF path: _write_gvi_index pulls SVLEN/END/IMPRECISE from header-declared INFO fields (via oxbow) and persists the corrected ILEN. Includes an INFO-vs-FORMAT header guard and a POS-aligned concat check.
  • PGEN path: _load_index regex-extracts SVLEN/END/IMPRECISE from the persisted PVAR INFO string and recomputes ILEN via the shared helper.
  • Null coercion: fill_null(0) at every numpy/numba ILEN materialization boundary (_var_ranges ×3, PGEN SEI ×2, _svar with-length read + overlap), preventing a silent Int32→Float64/NaN upcast that corrupted coordinate math and silently dropped variants.
  • is_snp/is_indel: made null-aware so un-sizable symbolic SVs are classified as neither (previously they matched both).
  • Filter guidance: no new constructor kwarg. ~is_symbolic drops all symbolic variants (for haplotype consumers); ~is_imprecise keeps precise SVs and drops only un-sizable ones (for range/overlap queries).
  • Docs: skills/genoray-api/SKILL.md updated for the new public surface.
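
The sizing rule in the first bullet can be sketched in plain Python for a single ALT. This is an illustrative stand-in, not genoray's implementation: the real symbolic_ilen() is an expression yielding List[Int32], and the function name, argument names, and regex below are assumptions made for the example.

```python
import re
from typing import Optional

# Matches a symbolic ALT such as <DEL> or <DUP:TANDEM> and captures the
# first ':'-delimited token (the base SV type).
_SYMBOLIC_ALT = re.compile(r"^<([^:>]+)[^>]*>$")
# Sign applied to the size magnitude for each supported type.
_SIZABLE = {"DEL": -1, "INS": 1, "DUP": 1}


def sketch_ilen(
    ref: str,
    alt: str,
    pos: Optional[int] = None,
    svlen: Optional[int] = None,
    end: Optional[int] = None,
    imprecise: bool = False,
) -> Optional[int]:
    """Hypothetical per-ALT ILEN following the rule described above."""
    match = _SYMBOLIC_ALT.match(alt)
    if match is None:
        # Non-symbolic allele: literal length difference.
        return len(alt) - len(ref)
    sign = _SIZABLE.get(match.group(1))
    if imprecise or sign is None:
        return None  # IMPRECISE flag set, or unsupported symbolic type
    if svlen is not None:
        return sign * abs(svlen)
    if end is not None and pos is not None:
        return sign * abs(end - pos)  # END fallback when SVLEN is absent
    return None  # neither SVLEN nor END available
```

Under these assumptions, sketch_ilen("A", "<DEL>", svlen=100) and sketch_ilen("A", "<DEL>", svlen=-100) both give -100, since only the magnitude of SVLEN is used and the sign comes from the SV type.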

Test Plan

  • pixi run test (full suite + data regen): 338 passed, 16 xfailed
  • pixi run ruff check genoray tests + ruff format --check: clean
  • New tests/test_symbolic_ilen.py (16 tests): exprs unit tests, vcfixture expected_ilen oracle, VCF + PGEN persisted-ILEN vs oracle, null-ILEN numpy-boundary coverage (incl. a regression proving the lazy path no longer silently drops variants), is_symbolic/is_imprecise filter parity, SparseVar inheritance, is_snp/is_indel null exclusion, END-fallback unit test, and a Hypothesis property test (vs.symbolic_documents()) cross-checked against the oracle.

🤖 Generated with Claude Code

d-laub and others added 24 commits June 5, 2026 00:02
Compute correct ILEN for precise <DEL>/<INS>/<DUP> from SVLEN on both
VCF and PGEN paths; un-sizable symbolic variants get null ILEN and a
derived is_imprecise expr (no new kwarg — filter via the existing
filter/pl_filter API, per PR #51). Reconciles with PR #51 and is its
named future-work item. Test plan uses vcfixture 0.6.0 symbolic fixtures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add symbolic_ilen() to compute per-ALT corrected ILEN for <DEL>/<INS>/<DUP>
structural variants using SVLEN/END magnitudes, and is_imprecise expression
to flag variants with un-sizable symbolic ALTs (null ILEN entries).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Normalize at.sv_type to its first ':'-delimited token before matching in
expected_ilen, mirroring symbolic_ilen's regex+split behaviour so that
subtyped SVs (e.g. DUP:TANDEM, DEL:ME) are correctly sized rather than
falling through to None. Collapse the duplicate DEL / INS+DUP branches
into one. Drop the END-fallback arm (unreachable via vcfixture) and update
the docstring to document this as a known limitation for Task 9.

Add test_oracle_normalizes_compound_sv_type to lock the fix against
regression, building a minimal VcfBuilder fixture with <DUP:TANDEM> and
<DEL:ME> and asserting correct signed ILENs from the oracle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
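
The normalization described in this commit amounts to keeping only the first ':'-delimited token of the SV type before matching. A minimal sketch follows; both helper names are hypothetical, and the real oracle operates on vcfixture objects rather than bare strings.

```python
from typing import Optional


def base_sv_type(sv_type: str) -> str:
    """Return the first ':'-delimited token, e.g. 'DUP:TANDEM' -> 'DUP'."""
    return sv_type.split(":", 1)[0]


def signed_size(sv_type: str, svlen: int) -> Optional[int]:
    """Hypothetical oracle arm: one branch for DEL/INS/DUP, None otherwise."""
    signs = {"DEL": -1, "INS": 1, "DUP": 1}
    sign = signs.get(base_sv_type(sv_type))
    return None if sign is None else sign * abs(svlen)
```

Without the normalization step, a lookup keyed on "DUP:TANDEM" would miss the "DUP" entry and fall through to None, which is the failure mode the commit describes.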
Add _declared_info_fields() and _fetch_info_cols() helpers to VCF.
_write_gvi_index now requests SVLEN/END/IMPRECISE from oxbow when the
header declares them, coerces List-typed SVLEN to scalar via list.first(),
computes ILEN=symbolic_ilen() and persists it in the .gvi index so
symbolic DEL/INS/DUP variants get correct signed lengths instead of
literal byte-difference garbage. Older/PGEN indexes without ILEN fall
back to the existing load-time computation block unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gnment guard

- Fix _declared_info_fields to use header_iter() so FORMAT-only fields
  (e.g. SVLEN declared only in FORMAT, common in DRAGEN/CNV VCFs) are
  not falsely reported as INFO fields and passed to oxbow, which would
  error or mis-fetch on real files.
- Extract _oxbow_reader() helper to eliminate copy-pasted reader-dispatch
  in get_record_info and _fetch_info_cols.
- Strengthen _fetch_info_cols alignment guard: retain POS and cross-check
  it element-wise against the base frame before horizontal concat; replace
  the AssertionError with ValueError.
- Move module-level from .exprs import symbolic_ilen to top of file
  (no circular import: exprs.py does not import _vcf).
- Add no-leak assertions and oracle equality for null rows in
  test_vcf_persisted_ilen_matches_oracle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
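
The two guards in this commit can be illustrated on plain header text and plain lists. This is a sketch under stated assumptions: the function names are hypothetical, the real code reads the header through header_iter() rather than a regex, and the alignment check runs on data frames rather than Python sequences.

```python
import re
from typing import Sequence, Set

# Only ##INFO declarations count; ##FORMAT lines with the same ID do not.
_INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def declared_info_ids(header_lines: Sequence[str]) -> Set[str]:
    """Collect IDs declared on ##INFO header lines."""
    ids = set()
    for line in header_lines:
        match = _INFO_ID.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def check_pos_aligned(base_pos: Sequence[int], info_pos: Sequence[int]) -> None:
    """Raise ValueError unless both POS columns agree element-wise."""
    if len(base_pos) != len(info_pos):
        raise ValueError(f"row count mismatch: {len(base_pos)} != {len(info_pos)}")
    for i, (a, b) in enumerate(zip(base_pos, info_pos)):
        if a != b:
            raise ValueError(f"POS mismatch at row {i}: {a} != {b}")


# SVLEN is declared only under FORMAT here, so it must not be requested as INFO.
header = [
    '##INFO=<ID=END,Number=1,Type=Integer,Description="End position">',
    '##FORMAT=<ID=SVLEN,Number=1,Type=Integer,Description="Per-sample length">',
]
wanted = {"SVLEN", "END", "IMPRECISE"} & declared_info_ids(header)
```

With this header, wanted contains only "END", so a FORMAT-only SVLEN is never passed on as an INFO field.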
Add .fill_null(0) after .clip(upper_bound=0) at all three ILEN sites in
_var_ranges.py and both branches of the SEI materialization block in
_pgen.py, so null ILEN (IMPRECISE / unsupported SV type) stays Int32
instead of upcasting to Float64/NaN and breaking numba ufuncs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
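
A plain-Python analogue of the clip-then-fill step is sketched below, with None standing in for a null ILEN. The helper name is hypothetical. The NaN lines show one way a missing value can make a row disappear from a range query (ordered comparisons against NaN are always false); the PR does not spell out the exact mechanism, so treat that part as an assumption.

```python
from typing import List, Optional


def clip_then_fill(ilens: List[Optional[int]]) -> List[int]:
    """Plain-Python analogue of .clip(upper_bound=0).fill_null(0)."""
    out = []
    for value in ilens:
        # Clipping caps positive lengths at 0 and lets nulls pass through.
        clipped = None if value is None else min(value, 0)
        # Filling turns the remaining nulls into 0, so every entry is an int.
        out.append(0 if clipped is None else clipped)
    return out


# Without the fill, a null that has become NaN poisons coordinate arithmetic,
# and any ordered comparison against the result is False.
nan_end = 1000 - float("nan")
nan_overlaps = 900 < nan_end

filled = clip_then_fill([-100, 50, None, 30])
```

Here filled is [-100, 0, 0, 0]: every entry stays an integer, and the un-sizable variant is treated as contributing no deleted length.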
…lazy/svar paths

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a 6th record (<INV>, POS=6000) to the symbolic() fixture. Sym has no
<BND> constructor (only Bnd produces breakend notation G[chr:pos[, which
does not start with '<' and is not caught by is_symbolic), so <INV> is
used instead per the spec fallback. Adds test_filter_parity_symbolic_vs_imprecise
asserting ~is_symbolic drops all 6 rows and ~is_imprecise keeps exactly the 3
precise SVs. Updates all 5-record count/range assertions to 6-record reality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
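
The difference between the two filters can be shown on a toy table. The rows below are hypothetical: only the <INV> at POS=6000 and the three precise ILEN values come from the PR text, while the other positions and ALT assignments are made up. The predicates are plain-Python stand-ins for the genoray expressions, assuming is_symbolic tests for a leading '<'.

```python
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    pos: int
    alt: str
    ilen: List[Optional[int]]  # one entry per ALT; None means un-sizable


rows = [
    Row(1000, "<DEL>", [-100]),
    Row(2000, "<INS>", [50]),
    Row(3000, "<DUP>", [30]),
    Row(4000, "<DEL>", [None]),  # e.g. flagged IMPRECISE
    Row(5000, "<INS>", [None]),  # e.g. missing SVLEN and END
    Row(6000, "<INV>", [None]),  # unsupported type
]


def is_symbolic(row: Row) -> bool:
    return row.alt.startswith("<")


def is_imprecise(row: Row) -> bool:
    return any(value is None for value in row.ilen)


# ~is_symbolic: drops every symbolic record.
not_symbolic = [r for r in rows if not is_symbolic(r)]
# ~is_imprecise: keeps the sizable SVs, drops only the un-sizable ones.
not_imprecise = [r for r in rows if not is_imprecise(r)]
```

On this table, not_symbolic is empty while not_imprecise keeps the three rows at POS 1000, 2000, and 3000.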
Regex-extract SVLEN/END/IMPRECISE from the persisted PVAR INFO string in
_load_index and delegate to the shared symbolic_ilen() helper so that
<DEL>/<INS>/<DUP> get correct sign-adjusted ILEN instead of literal
len(ALT)-len(REF) byte garbage.  Add gen_from_vcf.sh symbolic PGEN
generation (bcftools pre-filter to precise SV types, temp file for
plink2 seekability) and a PGEN oracle test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
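
Extracting the three keys from a semicolon-delimited INFO string might look like the sketch below. The patterns and the function name are assumptions, not the ones used in _load_index. Each key is anchored at the start of the string or after a ';' so that, for example, CIEND= is not mistaken for END=.

```python
import re
from typing import Dict, Optional, Union

_SVLEN = re.compile(r"(?:^|;)SVLEN=(-?\d+)")  # first value if comma-separated
_END = re.compile(r"(?:^|;)END=(\d+)")
_IMPRECISE = re.compile(r"(?:^|;)IMPRECISE(?:;|$)")  # flag key without a value


def parse_sv_info(info: str) -> Dict[str, Union[Optional[int], bool]]:
    """Pull SVLEN, END, and the IMPRECISE flag out of an INFO string."""
    svlen = _SVLEN.search(info)
    end = _END.search(info)
    return {
        "SVLEN": int(svlen.group(1)) if svlen else None,
        "END": int(end.group(1)) if end else None,
        "IMPRECISE": _IMPRECISE.search(info) is not None,
    }
```

For instance, parse_sv_info("SVTYPE=DEL;SVLEN=-100;END=1100") returns SVLEN -100, END 1100, and IMPRECISE False, and a missing-INFO placeholder "." yields None, None, and False.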
Adds test_svar_inherits_symbolic_ilen to confirm that a SparseVar built
from a symbolic VCF (filtered to 3 precise SVs via ~is_imprecise) carries
the corrected ILEN values [[-100],[50],[30]] through the pass-through
_write_filtered_index path unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fallback

Null ILEN entries (un-sizable symbolic SVs) were silently classified as both
is_snp=True and is_indel=True because list.all() ignores nulls. The fix makes
both predicates null-aware, so any null element causes the row to fail both
tests. Also adds a focused unit test for the END-only (no SVLEN) size
fallback in symbolic_ilen(), adds an explanatory comment in _pgen._load_index
noting that ILEN is intentionally recomputed from INFO on each load, and updates
SKILL.md to document the null-ILEN semantics of is_snp/is_indel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
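
The bug and the fix can be reproduced with plain Python lists, using None for null. This sketch assumes is_snp means "every ALT has ILEN == 0" and is_indel means "every ALT has ILEN != 0"; the real predicates may check more than that, and all function names here are hypothetical.

```python
from typing import List, Optional

Ilen = List[Optional[int]]


def all_ignoring_nulls(flags: List[Optional[bool]]) -> bool:
    """Mimic an all() reduction that skips nulls: an all-null list is True."""
    return all(flag for flag in flags if flag is not None)


def old_is_snp(ilen: Ilen) -> bool:
    # Comparing a null yields a null, which the reduction then skips.
    return all_ignoring_nulls([None if v is None else v == 0 for v in ilen])


def old_is_indel(ilen: Ilen) -> bool:
    return all_ignoring_nulls([None if v is None else v != 0 for v in ilen])


def has_null(ilen: Ilen) -> bool:
    return any(v is None for v in ilen)


def is_snp(ilen: Ilen) -> bool:
    # Null-aware: any null element fails the test outright.
    return not has_null(ilen) and old_is_snp(ilen)


def is_indel(ilen: Ilen) -> bool:
    return not has_null(ilen) and old_is_indel(ilen)
```

With a single null entry, old_is_snp([None]) and old_is_indel([None]) are both True, whereas the null-aware versions both return False.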
…c PGEN

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-laub force-pushed the worktree-test-vcfix branch from af5a093 to dda1182 on June 5, 2026 07:02
@d-laub merged commit 7366962 into main on Jun 5, 2026
7 checks passed
@d-laub deleted the worktree-test-vcfix branch on June 5, 2026 07:05