gvl.write does not validate variant atomization #199

@d-laub

gvl.write documents that variant input must be atomized, but it performs no check. Non-atomic input (e.g. a multi-nucleotide REF/ALT such as a 2-bp MNP, or non-atomized indels) silently corrupts haplotype-length arithmetic via the hardcoded +1 REF/ALT-overlap assumption at:

  • python/genvarloader/_dataset/_genotypes.py:69
  • python/genvarloader/_dataset/_genotypes.py:313
  • python/genvarloader/_dataset/_tracks.py:297

It should raise a clear ValueError (mirroring the existing multi-allelic guard at _dataset/_write.py:389) instructing the user to atomize (bcftools norm -a).
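A minimal sketch of what such a guard could look like follows. It is not the library's implementation: the names `is_atomic` and `check_atomized` are hypothetical, the variant tuple layout is made up for illustration, and the definition of "atomized" used here (one allele is exactly 1 bp, and indels share their leading anchor base, consistent with the +1 overlap assumption) is an assumption that would need to be checked against what the three call sites actually expect.

```python
from typing import Iterable, Tuple

# Hypothetical record layout for illustration: (chrom, pos, ref, alt).
Variant = Tuple[str, int, str, str]


def is_atomic(ref: str, alt: str) -> bool:
    """Return True if a biallelic REF/ALT pair looks atomized.

    Assumed definition (not confirmed by the library):
      * SNP:   len(ref) == len(alt) == 1
      * indel: exactly one allele is 1 bp and both alleles
               share the same first (anchor) base
    """
    if min(len(ref), len(alt)) != 1:
        # Covers MNPs (e.g. AC>GT), complex indels, and empty alleles.
        return False
    if len(ref) != len(alt) and ref[0] != alt[0]:
        # Indel without a shared anchor base.
        return False
    return True


def check_atomized(variants: Iterable[Variant]) -> None:
    """Raise ValueError on the first non-atomic variant encountered."""
    for chrom, pos, ref, alt in variants:
        if not is_atomic(ref, alt):
            raise ValueError(
                f"Non-atomic variant at {chrom}:{pos} (REF={ref!r}, ALT={alt!r}). "
                "Variants must be atomized before writing, "
                "e.g. with `bcftools norm -a`."
            )
```

In a real implementation the check would more likely be vectorized over the allele-length arrays already computed during writing rather than looping over records in Python; the loop form is used here only to keep the condition readable.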

Found via property-based testing (Phase 2 test overhaul): inputs are currently canonicalized with bcftools norm -a --atom-overlaps before gvl.write to sidestep this. A clean-rejection test (tests/integration/dataset/test_haps_property.py) is marked xfail pending this validation.
