Skip to content
Merged
499 changes: 499 additions & 0 deletions docs/superpowers/plans/2026-06-05-svar-cli-filter-flags.md

Large diffs are not rendered by default.

137 changes: 137 additions & 0 deletions docs/superpowers/specs/2026-06-05-svar-cli-filter-flags-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# SVAR write CLI: symbolic + breakend filter flags

**Date:** 2026-06-05
**Branch:** `feat/breakend-alts` (current)

## Problem

The `genoray write` CLI (VCF/PGEN → SVAR) exposes one filtering flag,
`--skip-symbolic-alts`, which drops records carrying VCF 4.x symbolic ALT
alleles (`<DEL>`, `<INS>`, …). Breakend (BND) ALTs — a *distinct* ALT class
(`G[chr2:321[`, `]chr2:321]G`, `.TGCA`, `TGCA.`) that `is_symbolic` does **not**
flag — are equally un-injectable into DNA buffers by downstream haplotype
consumers (e.g. genvarloader), but the CLI offers no way to drop them.

Users who want a gvl-ready SVAR currently cannot express "drop symbolic **and**
breakend" through the CLI.

## Goals

- Drop the symbolic ALT records via a clearer flag name, `--no-symbolic`
(rename of `--skip-symbolic-alts`; treated as non-breaking — the flag is new
enough to rename freely).
- Add `--no-breakend` to drop breakend ALT records.
- Make the two flags independent and composable (`AND` when both are passed).
- Preserve current behavior when neither flag is set (no filtering).

## Non-goals

- **No `--compat-gvl` convenience flag.** Bundling these into a single
"gvl-compatible" switch couples genoray to gvl's *versioned* definition of
compatibility (which may change). The gvl-compat recipe — "pass
`--no-symbolic --no-breakend`" — belongs in gvl's own docs, not in a genoray
flag.
- **No `--no-imprecise` flag.** For haplotype consumers, `~is_symbolic &
~is_breakend` already drops everything un-injectable, including precise
`<DEL>`/`<INS>`/`<DUP>` (gvl cannot expand those either). `is_imprecise`
serves a *different* goal (keep precise SVs, drop only un-sizable ones); users
who want that can use the library `filter`/`pl_filter` API with
`genoray.exprs.is_imprecise`. The CLI does not expose it.
- Version bump, CHANGELOG, gvl-side documentation.

## Background: how the CLI builds filters

`genoray/_cli/__main__.py::write` constructs the source reader, then hands it to
`SparseVar.from_vcf` / `from_pgen`, which inherit the source's filter.

- **VCF** requires *both* a `cyvcf2` record-level callable `filter` **and** a
matching polars `pl_filter` — `VCF.__init__` raises if only one is supplied.
Both must express the same predicate so the genotype scan and the index stay
aligned (a divergence shifts variant indices and corrupts the sparse data).
- **PGEN** takes a single polars `filter` expression.

`genoray.exprs` already ships the polars predicates `is_symbolic` and
`is_breakend` (the latter built on the private regex `_BND_PATTERN`).

## Design

### 1. CLI surface (`genoray/_cli/__main__.py`, `write`)

Replace `skip_symbolic_alts: bool = False` with two independent flags:

```python
no_symbolic: Annotated[bool, Parameter(name="--no-symbolic", negative="")] = False,
no_breakend: Annotated[bool, Parameter(name="--no-breakend", negative="")] = False,
```

- `negative=""` suppresses cyclopts' auto-generated inverse
(`--no-no-symbolic`), so the only spellings are `--no-symbolic` /
`--no-breakend`. Flag present ⇒ drop those records.
- Both default `False` ⇒ no filtering, preserving prior behavior.
- Exact cyclopts negation mechanics to be confirmed against the `cyclopts`
skill during implementation; `negative=""` is the documented way to remove the
auto-inverse.

Update the `write` docstring: remove the `skip_symbolic_alts` entry, document
`no_symbolic` and `no_breakend`.

### 2. Filter composition

Build both filter representations from whichever flags are set:

- **polars side** (VCF & PGEN): collect `~exprs.is_symbolic` and/or
`~exprs.is_breakend` into a list; combine with `reduce(operator.and_, ...)`;
empty ⇒ `None`.
- **cyvcf2 callable side** (VCF only): collect the negated record-level
predicates; the composed callable returns
`all(pred(rec.ALT) for pred in preds)`; empty ⇒ `None`.
- **PGEN**: `filter=` the composed polars expr (or `None`).

VCF is always constructed with both `filter` and `pl_filter` set (or both
`None`), satisfying `VCF.__init__`'s paired-filter requirement.

### 3. Parity helpers (`genoray/exprs.py`)

The central correctness risk is the cyvcf2 record callable drifting from the
polars expr it must mirror. Anchor parity by co-locating two **private**
record-level predicates with the exprs they mirror:

```python
import re # new top-level import

def _record_is_symbolic(alts: Iterable[str]) -> bool: # mirrors is_symbolic
return any(a.startswith("<") for a in alts)

def _record_is_breakend(alts: Iterable[str]) -> bool: # mirrors is_breakend
return any(re.search(_BND_PATTERN, a) is not None for a in alts)
```

The CLI imports these and negates them to form the keep-callable.

*Rejected alternative:* inline both lambdas in the CLI (as the current symbolic
lambda is). Rejected because `_BND_PATTERN` lives in `exprs.py`; inlining the
breakend regex invites drift between the callable and the expr.

### 4. Tests (`tests/test_svar_filtering.py`)

- Update `test_cli_write_skip_symbolic_vcf` / `_pgen` to pass `no_symbolic=True`.
- Add a `_breakend_vcf` fixture (a BND record via `vcfixture.Bnd` plus a plain
SNV) and a `_breakend_pgen` derived from it via plink2.
- New tests: `--no-breakend` drops the BND on both the VCF and PGEN paths.
- One combined test (`no_symbolic=True, no_breakend=True`) asserting both ALT
classes are dropped together.

### 5. Docs / SKILL.md

- Update the `write` docstring (see §1).
- **No `skills/genoray-api/SKILL.md` change required.** SKILL.md documents the
Python import API, not CLI flags; the new helpers are underscore-private; and
`is_symbolic` / `is_breakend` semantics are unchanged.

## Files

| File | Change |
|------|--------|
| `genoray/_cli/__main__.py` | Replace `skip_symbolic_alts` with `no_symbolic` + `no_breakend`; compose filters | Modify |
| `genoray/exprs.py` | Add `_record_is_symbolic`, `_record_is_breakend`, `import re` | Modify |
| `tests/test_svar_filtering.py` | Update renamed-flag tests; add breakend + combined tests | Modify |
69 changes: 48 additions & 21 deletions genoray/_cli/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,8 @@ def write(
overwrite: bool = False,
dosages: str | None = None,
threads: int | None = None,
skip_symbolic_alts: bool = False,
no_symbolic: Annotated[bool, Parameter(name="--no-symbolic", negative="")] = False,
no_breakend: Annotated[bool, Parameter(name="--no-breakend", negative="")] = False,
) -> None:
"""
Convert a VCF or PGEN file to a SVAR file.
Expand All @@ -72,13 +73,17 @@ def write(
If not provided, dosages will not be written.
threads
Number of threads to use for conversion. Defaults to the number of available CPU cores.
skip_symbolic_alts
If True, skip records whose ALT contains a symbolic allele
(``<DEL>``, ``<INS>``, etc.) per VCF 4.x. Applies to both VCF and
PGEN sources. Recommended for SV-bearing cohorts (e.g. 1kGP
SNV_INDEL_SV panels) to avoid emitting literal ``<DEL>`` ASCII bytes
into downstream haplotype buffers. Default False preserves prior
behavior.
no_symbolic
If set, drop records whose ALT contains a symbolic allele
(``<DEL>``, ``<INS>``, etc.) per VCF 4.x. Applies to both VCF and PGEN
sources. Recommended for SV-bearing cohorts (e.g. 1kGP SNV_INDEL_SV
panels) to avoid emitting literal ``<DEL>`` ASCII bytes into downstream
haplotype buffers.
no_breakend
If set, drop records whose ALT contains a breakend (BND) in mate-pair or
single-breakend notation (``G[chr2:321[``, ``]chr2:321]G``, ``.TGCA``,
``TGCA.``). A distinct ALT class from symbolic alleles; combine with
``--no-symbolic`` to drop everything haplotype consumers cannot expand.
"""
from genoray import PGEN, VCF, SparseVar, exprs
from genoray._utils import variant_file_type
Expand All @@ -93,32 +98,54 @@ def write(
if threads is None:
threads = -1

# Compose the requested ALT-class filters. Each flag contributes a polars
# expr (used by both VCF index and PGEN) and a record-level predicate (used
# by the VCF cyvcf2 genotype scan). The two representations of each flag are
# kept in parity via genoray.exprs.
pl_terms = []
record_preds = []
if no_symbolic:
pl_terms.append(~exprs.is_symbolic)
record_preds.append(exprs._record_is_symbolic)
if no_breakend:
pl_terms.append(~exprs.is_breakend)
record_preds.append(exprs._record_is_breakend)

if pl_terms:
from functools import reduce
from operator import and_

pl_filter = reduce(and_, pl_terms)

def record_filter(rec, _preds=tuple(record_preds)) -> bool:
return not any(pred(rec.ALT) for pred in _preds)
else:
pl_filter = None
record_filter = None

if file_type == "vcf":
if dosages is not None and Path(dosages).exists():
raise ValueError(
"The `dosages` argument appears to be a path to an existing file, but VCF requires a FORMAT field name."
)

if skip_symbolic_alts:
# VCF needs both filters: a cyvcf2 callable applied during the
# genotype scan and a polars expr applied to the index; both must
# express the same predicate.
vcf = VCF(
source,
dosage_field=dosages,
filter=lambda rec: not any(a.startswith("<") for a in rec.ALT),
pl_filter=~exprs.is_symbolic,
)
else:
vcf = VCF(source, dosage_field=dosages)
# VCF requires both filters together (or neither): a cyvcf2 callable
# applied during the genotype scan and a matching polars expr applied to
# the index; both must express the same predicate.
vcf = VCF(
source,
dosage_field=dosages,
filter=record_filter,
pl_filter=pl_filter,
)
SparseVar.from_vcf(
out, vcf, max_mem, overwrite, with_dosages=with_dosages, n_jobs=threads
)
elif file_type == "pgen":
pgen = PGEN(
source,
dosage_path=dosages,
filter=(~exprs.is_symbolic) if skip_symbolic_alts else None,
filter=pl_filter,
)
SparseVar.from_pgen(
out, pgen, max_mem, overwrite, with_dosages=with_dosages, n_jobs=threads
Expand Down
45 changes: 37 additions & 8 deletions genoray/_vcf.py
Original file line number Diff line number Diff line change
Expand Up @@ -272,12 +272,7 @@ def __init__(
progress: bool = False,
with_gvi_index: bool = True,
):
if (filter is not None and pl_filter is None) or (
filter is None and pl_filter is not None
):
raise ValueError(
"If a filter function is provided, a polars expression must also be provided, and vice versa."
)
self._check_filter_pair(filter, pl_filter)

self.path = Path(path)
if not self.path.exists():
Expand Down Expand Up @@ -307,6 +302,19 @@ def __init__(
def _open(self) -> cyvcf2.VCF:
return cyvcf2.VCF(self.path, samples=self._samples, lazy=True)

@staticmethod
def _check_filter_pair(
filter: Callable[[cyvcf2.Variant], bool] | None,
pl_filter: pl.Expr | None,
) -> None:
"""Enforce the both-or-neither invariant on a (filter, pl_filter) pair."""
if (filter is not None and pl_filter is None) or (
filter is None and pl_filter is not None
):
raise ValueError(
"If a filter function is provided, a polars expression must also be provided, and vice versa."
)

@property
def filter(self) -> Callable[[cyvcf2.Variant], bool] | None:
"""Function to filter variants. Should return True for variants to keep."""
Expand All @@ -321,10 +329,31 @@ def _index_path(self) -> Path:
return base.with_suffix(".gvi.zst")

@filter.setter
def filter(self, filter: Callable[[cyvcf2.Variant], bool] | None):
"""Changing the filter invalidates the in-memory index."""
def filter(
self,
value: tuple[Callable[[cyvcf2.Variant], bool] | None, pl.Expr | None] | None,
):
"""Set the record filter and its matching polars expression together.

Assign a ``(filter, pl_filter)`` pair, or ``None`` to clear both. The VCF
path requires both a cyvcf2 record callable (for the genotype scan) and a
matching polars expression (for the ``.gvi`` index); they must be set
together, mirroring the constructor's both-or-neither invariant. Changing
the filter invalidates the in-memory index.
"""
if value is None:
filter = pl_filter = None
elif isinstance(value, tuple) and len(value) == 2:
filter, pl_filter = value
else:
raise TypeError(
"VCF.filter must be assigned a (filter, pl_filter) tuple or None; "
f"got {type(value).__name__}."
)
self._check_filter_pair(filter, pl_filter)
self._index = None
self._filter = filter
self._pl_filter = pl_filter

@property
def nbytes(self) -> int:
Expand Down
Loading
Loading