feat: symbolic-allele-aware ILEN#52
Merged
Merged
Conversation
Compute correct ILEN for precise <DEL>/<INS>/<DUP> from SVLEN on both VCF and PGEN paths; un-sizable symbolic variants get null ILEN and a derived is_imprecise expr (no new kwarg — filter via the existing filter/pl_filter API, per PR #51). Reconciles with PR #51 and is its named future-work item. Test plan uses vcfixture 0.6.0 symbolic fixtures. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add symbolic_ilen() to compute per-ALT corrected ILEN for <DEL>/<INS>/<DUP> structural variants using SVLEN/END magnitudes, and is_imprecise expression to flag variants with un-sizable symbolic ALTs (null ILEN entries). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Normalize at.sv_type to its first ':'-delimited token before matching in expected_ilen, mirroring symbolic_ilen's regex+split behaviour so that subtyped SVs (e.g. DUP:TANDEM, DEL:ME) are correctly sized rather than falling through to None. Collapse the duplicate DEL / INS+DUP branches into one. Drop the END-fallback arm (unreachable via vcfixture) and update the docstring to document this as a known limitation for Task 9. Add test_oracle_normalizes_compound_sv_type to lock the fix against regression, building a minimal VcfBuilder fixture with <DUP:TANDEM> and <DEL:ME> and asserting correct signed ILENs from the oracle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add _declared_info_fields() and _fetch_info_cols() helpers to VCF. _write_gvi_index now requests SVLEN/END/IMPRECISE from oxbow when the header declares them, coerces List-typed SVLEN to scalar via list.first(), computes ILEN=symbolic_ilen() and persists it in the .gvi index so symbolic DEL/INS/DUP variants get correct signed lengths instead of literal byte-difference garbage. Older/PGEN indexes without ILEN fall back to the existing load-time computation block unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gnment guard - Fix _declared_info_fields to use header_iter() so FORMAT-only fields (e.g. SVLEN declared only in FORMAT, common in DRAGEN/CNV VCFs) are not falsely reported as INFO fields and passed to oxbow, which would error or mis-fetch on real files. - Extract _oxbow_reader() helper to eliminate copy-pasted reader-dispatch in get_record_info and _fetch_info_cols. - Strengthen _fetch_info_cols alignment guard: retain POS and cross-check it element-wise against the base frame before horizontal concat; replace the AssertionError with ValueError. - Move module-level from .exprs import symbolic_ilen to top of file (no circular import: exprs.py does not import _vcf). - Add no-leak assertions and oracle equality for null rows in test_vcf_persisted_ilen_matches_oracle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add .fill_null(0) after .clip(upper_bound=0) at all three ILEN sites in _var_ranges.py and both branches of the SEI materialization block in _pgen.py, so null ILEN (IMPRECISE / unsupported SV type) stays Int32 instead of upcasting to Float64/NaN and breaking numba ufuncs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lazy/svar paths Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a 6th record (<INV>, POS=6000) to the symbolic() fixture. Sym has no <BND> constructor (only Bnd produces breakend notation G[chr:pos[, which does not start with '<' and is not caught by is_symbolic), so <INV> is used instead per the spec fallback. Adds test_filter_parity_symbolic_vs_imprecise asserting ~is_symbolic drops all 6 rows and ~is_imprecise keeps exactly the 3 precise SVs. Updates all 5-record count/range assertions to 6-record reality. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Regex-extract SVLEN/END/IMPRECISE from the persisted PVAR INFO string in _load_index and delegate to the shared symbolic_ilen() helper so that <DEL>/<INS>/<DUP> get correct sign-adjusted ILEN instead of literal len(ALT)-len(REF) byte garbage. Add gen_from_vcf.sh symbolic PGEN generation (bcftools pre-filter to precise SV types, temp file for plink2 seekability) and a PGEN oracle test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds test_svar_inherits_symbolic_ilen to confirm that a SparseVar built from a symbolic VCF (filtered to 3 precise SVs via ~is_imprecise) carries the corrected ILEN values [[-100],[50],[30]] through the pass-through _write_filtered_index path unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fallback null ILEN entries (un-sizable symbolic SVs) were silently classified as both is_snp=True and is_indel=True because list.all() ignores nulls. Fix makes both predicates null-aware so any null element causes the row to fail both tests. Also adds a focused unit test for the END-only (no SVLEN) size fallback in symbolic_ilen(), adds an explanatory comment in _pgen._load_index noting ILEN is intentionally recomputed from INFO on each load, and updates SKILL.md to document the null-ILEN semantics of is_snp/is_indel. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c PGEN Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
af5a093 to
dda1182
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Computes the correct indel length (
ILEN) for precise symbolic structural variants (<DEL>/<INS>/<DUP>) on both the VCF and PGEN paths, and marks un-sizable symbolic variants withILEN = null+ a derivedis_impreciseexpression. Previously symbolic ALTs got a garbage literal byte-length ILEN.genoray.exprs.symbolic_ilen()— per-ALTList[Int32]ILEN:-|SVLEN|for<DEL>,+|SVLEN|for<INS>/<DUP>(with|END - POS|fallback);nullfor un-sizable symbolic alleles (IMPRECISE, missingSVLEN/END, or unsupported types<BND>/<CNV>/<INV>/<*>); literallen(ALT)-len(REF)for non-symbolic.genoray.exprs.is_imprecise— new expression,Truewhen any ALT's ILEN is null._write_gvi_indexpullsSVLEN/END/IMPRECISEfrom header-declared INFO fields (via oxbow) and persists the corrected ILEN. Includes an INFO-vs-FORMAT header guard and a POS-aligned concat check._load_indexregex-extractsSVLEN/END/IMPRECISEfrom the persisted PVAR INFO string and recomputes ILEN via the shared helper.fill_null(0)at every numpy/numba ILEN materialization boundary (_var_ranges×3, PGEN SEI ×2,_svarwith-length read + overlap), preventing a silent Int32→Float64/NaN upcast that corrupted coordinate math and silently dropped variants.is_snp/is_indel— made null-aware so un-sizable symbolic SVs are classified as neither (previously matched both).~is_symbolicdrops all symbolic (haplotype consumers),~is_imprecisekeeps precise SVs and drops only un-sizable ones (range/overlap queries).skills/genoray-api/SKILL.mdupdated for the new public surface.Test Plan
pixi run test(full suite + data regen): 338 passed, 16 xfailedpixi run ruff check genoray tests+ruff format --check: cleantests/test_symbolic_ilen.py(16 tests): exprs unit tests, vcfixtureexpected_ilenoracle, VCF + PGEN persisted-ILEN vs oracle, null-ILEN numpy-boundary coverage (incl. a regression proving the lazy path no longer silently drops variants),is_symbolic/is_imprecisefilter parity, SparseVar inheritance,is_snp/is_indelnull exclusion, END-fallback unit test, and a Hypothesis property test (vs.symbolic_documents()) cross-checked against the oracle.🤖 Generated with Claude Code