When reading variants via Dataset.with_seqs("variants").with_len("ragged"), the returned RaggedVariants for a [region, sample] cell is NOT clipped to the queried region window.
In python/genvarloader/_dataset/_haps.py, the ragged __call__ path calls:
# _haps.py:471
ragv = self._get_variants(idx=idx, regions=None, shifts=None)
That call passes regions=None, so no window filtering is applied (there is also a # TODO: maybe filter variants for region, shifts? at _haps.py:642). By contrast, the haplotype/annotated path (get_haps_and_shifts, _haps.py:522) passes regions=req.regions, keep=req.keep, so haplotype sequence output is properly windowed. This is why haplotype reconstruction is still correct.
Consequences:
- RaggedVariants output can include variants outside the region window (e.g. boundary-overlapping indels or, for the PGEN backend, which stores a coarser per-cell variant set, variants from elsewhere on the contig).
- Any consumer counting/inspecting per-region variants via with_seqs("variants") gets an unclipped set.
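
Until the ragged path filters by region, a consumer could clip after the fact. The following is only a minimal sketch of such a filter, not genvarloader code: `overlaps_window` is a hypothetical helper, and it assumes the caller can extract 0-based variant start positions and REF allele lengths for a cell, along with the region's half-open `[region_start, region_end)` bounds. Whether boundary-overlapping indels should count as inside the window is a policy choice, so both options are shown.

```python
import numpy as np


def overlaps_window(starts, ref_lens, region_start, region_end, strict=False):
    """Boolean mask of variants relative to the half-open window [region_start, region_end).

    Hypothetical helper (not part of genvarloader). Assumes 0-based starts
    and REF allele lengths >= 1.

    strict=False keeps any variant whose reference span overlaps the window.
    strict=True keeps only variants whose reference span lies fully inside it.
    """
    starts = np.asarray(starts, dtype=np.int64)
    ends = starts + np.asarray(ref_lens, dtype=np.int64)  # exclusive end
    if strict:
        return (starts >= region_start) & (ends <= region_end)
    return (starts < region_end) & (ends > region_start)


# Illustrative (made-up) positions for a single [region, sample] cell.
starts = np.array([95, 98, 100, 150, 199, 200, 5000])
ref_lens = np.array([1, 5, 1, 1, 3, 1, 1])

mask = overlaps_window(starts, ref_lens, 100, 200)
strict_mask = overlaps_window(starts, ref_lens, 100, 200, strict=True)
```

With the overlap policy, the deletions starting at 98 and 199 are kept because they cross the window boundary. With `strict=True` they are dropped. Either way, the variant at 5000 (the "elsewhere on the contig" case) is excluded.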
Found via property-based testing (Phase 2). Track 1b was reframed to AF validation because a per-region variant-count oracle is not meaningful against the current unclipped output.