with_seqs("variants") (RaggedVariants) does not clip variants to the region window #202

@d-laub

When reading variants via Dataset.with_seqs("variants").with_len("ragged"), the returned RaggedVariants for a [region, sample] cell is NOT clipped to the queried region window.

In python/genvarloader/_dataset/_haps.py, the ragged __call__ path calls:

# _haps.py:471
ragv = self._get_variants(idx=idx, regions=None, shifts=None)

Because regions=None, no region filter is applied; a comment at _haps.py:642 (# TODO: maybe filter variants for region, shifts?) acknowledges the gap. By contrast, the haplotype/annotated path (get_haps_and_shifts, _haps.py:522) passes regions=req.regions, keep=req.keep, so haplotype sequence output is properly windowed, which is why haplotype reconstruction is correct.

Consequences:

  • RaggedVariants output can include variants outside the region window: for example, indels that overlap a region boundary or, with the PGEN backend (which stores a coarser per-cell variant set), variants from elsewhere on the contig.
  • Any consumer counting/inspecting per-region variants via with_seqs("variants") gets an unclipped set.
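
The clipping that is missing can be sketched as a plain overlap filter. This is a minimal illustration, not GenVarLoader's implementation: the Variant record, the clip_to_window helper, and the 0-based half-open coordinate convention are all assumptions made for the example.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical variant record; not a GenVarLoader type."""

    start: int  # 0-based position of the first REF base (assumed convention)
    ref: str
    alt: str

    @property
    def end(self) -> int:
        # Exclusive end of the reference span covered by REF.
        return self.start + len(self.ref)


def clip_to_window(variants, region_start, region_end):
    """Keep variants whose REF span overlaps [region_start, region_end)."""
    return [v for v in variants if v.start < region_end and v.end > region_start]


variants = [
    Variant(90, "A", "G"),      # entirely before the window
    Variant(98, "ACGT", "A"),   # deletion spanning the left boundary
    Variant(150, "C", "T"),     # inside the window
    Variant(5000, "G", "GTT"),  # elsewhere on the contig
]
clipped = clip_to_window(variants, 100, 200)
```

Whether a boundary-overlapping indel should be kept, trimmed, or dropped is a design decision this report leaves open; the sketch keeps it.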

Found via property-based testing (Phase 2). Track 1b was reframed to AF validation because a per-region variant-count oracle is not meaningful against the current unclipped output.
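
For context, a per-region variant-count oracle of the kind mentioned above might look like the sketch below. It counts variants by brute force (intersecting sets of reference positions) and compares that count with the size of an unclipped result. All names are hypothetical and coordinates are assumed to be 0-based and half-open; the sketch only illustrates why an unclipped result cannot satisfy such a check.

```python
def expected_count(variants, region_start, region_end):
    """Brute-force oracle: count (start, ref_len) variants touching the window."""
    window = set(range(region_start, region_end))
    return sum(
        1
        for start, ref_len in variants
        if window & set(range(start, start + ref_len))
    )


# (start, ref_len) pairs for one hypothetical contig
contig_variants = [(90, 1), (98, 4), (150, 1), (5000, 1)]

oracle = expected_count(contig_variants, 100, 200)
unclipped_count = len(contig_variants)  # stand-in for an unclipped result
count_matches = oracle == unclipped_count
```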
