05 Jun 09:27

d-laub

9a180cf

v0.27.0 Latest


Highlights: robust on-disk artifacts — gvl.write now creates datasets atomically under an advisory file lock, and Dataset.open validates dataset format version + structural integrity before use. FASTA caches move to a self-describing .gvlfa format that builds atomically, auto-migrates legacy .gvl caches, and auto-rebuilds when stale. Plus correct drop_last handling across all dataloader modes.

✨ Features

Atomic, concurrency-safe dataset creation
gvl.write builds into a private sibling temp directory and publishes via an atomic os.replace, so a destination directory is never observed half-written. A best-effort filelock lets parallel jobs sharing one destination avoid redundant rebuilds. Correctness rests on the atomic rename; the lock is advisory only. This release adds filelock as a new dependency.
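
The temp-build-then-rename pattern described above can be sketched with the standard library alone. This is an illustration, not gvl's actual atomic_dir: the function body, the temp-directory naming, and the "keep the first published copy" policy are assumptions, and the advisory filelock is left out.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_dir(dest):
    """Yield a private temp directory next to `dest`; publish it on success."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # A sibling temp dir lives on the same filesystem, so the rename is atomic.
    tmp = Path(tempfile.mkdtemp(prefix=f".{dest.name}.", dir=dest.parent))
    try:
        yield tmp
        try:
            os.replace(tmp, dest)
        except OSError:
            # Assumed policy: if a non-empty `dest` already exists (for example
            # another builder published first), keep it and discard our copy.
            if not dest.is_dir():
                raise
    finally:
        # No-op after a successful publish, because `tmp` no longer exists.
        shutil.rmtree(tmp, ignore_errors=True)


# Demo in a throwaway directory: readers see either nothing or the full result.
with tempfile.TemporaryDirectory() as base:
    target = Path(base) / "dataset"
    with atomic_dir(target) as build_dir:
        (build_dir / "metadata.json").write_text("{}")
    published = sorted(p.name for p in target.iterdir())
```

If the body of the with block raises, the temp directory is removed and the destination is never created, which is the property the release note relies on.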

Dataset format versioning + integrity validation
Metadata now records a format_version, and Dataset.open validates both the version and structural integrity (file presence and sizes) before returning. An incompatible or corrupt dataset raises a clear ValueError instructing you to regenerate with gvl.write. Datasets do not auto-rebuild.
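
As a rough picture of what "version plus structural integrity" checking involves, here is a minimal sketch. The metadata file name, the `files` manifest, and the version constant are all hypothetical; the notes do not describe gvl's on-disk layout, so none of this should be read as the real format.

```python
import json
from pathlib import Path

SUPPORTED_FORMAT_VERSION = 1  # hypothetical value, not taken from gvl


def validate_dataset(path):
    """Raise ValueError unless `path` looks like a complete, compatible dataset."""
    path = Path(path)
    meta_file = path / "metadata.json"  # hypothetical file name
    if not meta_file.is_file():
        raise ValueError(f"{path}: missing metadata; regenerate with gvl.write")
    meta = json.loads(meta_file.read_text())

    version = meta.get("format_version")
    if version != SUPPORTED_FORMAT_VERSION:
        raise ValueError(
            f"{path}: format_version {version!r} is incompatible "
            f"(expected {SUPPORTED_FORMAT_VERSION}); regenerate with gvl.write"
        )

    # Hypothetical manifest: relative file name -> expected size in bytes.
    for name, expected_size in meta.get("files", {}).items():
        file = path / name
        if not file.is_file():
            raise ValueError(f"{path}: missing {name}; regenerate with gvl.write")
        actual_size = file.stat().st_size
        if actual_size != expected_size:
            raise ValueError(
                f"{path}: {name} is {actual_size} bytes, expected "
                f"{expected_size}; regenerate with gvl.write"
            )
    return meta
```

Checking sizes catches truncated files cheaply without reading their contents, which is presumably why the release validates presence and sizes rather than full checksums.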

New .gvlfa FASTA cache
gvl.Reference.from_path now builds and reuses a self-describing, fingerprint-validated .gvlfa cache directory (and accepts a .gvlfa path directly). The cache is published atomically under a best-effort lock — concurrent builders sharing one reference are safe — and, unlike datasets, auto-rebuilds from its source when stale or missing. Legacy .fa.gvl caches are migrated in place by reusing their bytes.
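
The "reuse if fresh, rebuild if stale or missing" behavior can be illustrated with a small fingerprint check. Everything below is an assumption made for illustration: the fingerprint fields, the fingerprint.json file, and the function names are hypothetical and are not the contents of gvl's _fasta_cache module. Atomic publishing and locking are omitted.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(source):
    """Cheap identity for a source file: size, mtime, and a hash of its first MiB."""
    source = Path(source)
    stat = source.stat()
    with source.open("rb") as f:
        head = hashlib.sha256(f.read(1 << 20)).hexdigest()
    return {"size": stat.st_size, "mtime_ns": stat.st_mtime_ns, "head_sha256": head}


def cache_is_fresh(cache_dir, source):
    """True only if the cache records a fingerprint matching the current source."""
    meta_file = Path(cache_dir) / "fingerprint.json"  # hypothetical layout
    if not meta_file.is_file():
        return False  # missing cache, so it must be built
    try:
        recorded = json.loads(meta_file.read_text())
    except json.JSONDecodeError:
        return False  # unreadable metadata is treated as stale
    return recorded == fingerprint(source)


def ensure_cache_sketch(source, cache_dir, build):
    """Reuse a fresh cache, otherwise rebuild it from `source` via `build`."""
    cache_dir = Path(cache_dir)
    if not cache_is_fresh(cache_dir, source):
        cache_dir.mkdir(parents=True, exist_ok=True)
        build(Path(source), cache_dir)
        # Record the fingerprint last, so a crashed build never looks fresh.
        (cache_dir / "fingerprint.json").write_text(json.dumps(fingerprint(source)))
    return cache_dir
```

Because the cache can always be regenerated from its source FASTA, rebuilding automatically is safe here, whereas a dataset cannot be rebuilt without the original gvl.write inputs.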

🐛 Fixes

to_dataloader(drop_last=...) now honored across all modes: drop_last is no longer forwarded to the underlying DataLoader in default mode (it was double-applied); buffered/double_buffered modes honor drop_last=False; and ChunkPlanner keeps the trailing partial batch.
BatchSampler conflict warning: to_dataloader now warns when a BatchSampler overrides an explicit batch_size.
.gvlfa migration guards against stale/truncated legacy bytes, and a format-too-new sibling cache now raises instead of silently downgrading.
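
To make the drop_last semantics concrete, here is a small batch planner that applies the flag in exactly one place. It is a sketch only: plan_batches is a hypothetical helper, not gvl's ChunkPlanner.

```python
def plan_batches(n_items, batch_size, drop_last=False):
    """Return (start, stop) index pairs covering n_items in batch_size steps."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]
    # Apply drop_last once, here, and nowhere downstream.
    if drop_last and batches and batches[-1][1] - batches[-1][0] < batch_size:
        batches.pop()
    return batches


keep_partial = plan_batches(10, 4)                  # [(0, 4), (4, 8), (8, 10)]
drop_partial = plan_batches(10, 4, drop_last=True)  # [(0, 4), (4, 8)]
```

With drop_last=False the trailing partial batch (8, 10) is kept, which is the behavior the fixes above restore for the buffered modes.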

🔧 Internals

New atomic_dir directory-publish primitive (temp build + os.replace) underpins both dataset and FASTA-cache creation.
_fasta_cache module: FastaCache models + fingerprinting, three-way source resolution, build/load/validity guards, legacy migration, and an ensure_cache orchestrator.
Concurrency regression tests for atomic cache + dataset creation (closes #21); coverage for too-old format versions and the genotypes-without-ploidy branch.

Out of scope: genoray .gvi and pysam .fai/.gzi index files are created by those upstream libraries and are not covered by gvl's atomic/locked creation.

Feat

_open: validate dataset format version + integrity on open
_write: atomic dataset creation + format_version in Metadata
_fasta_cache: publish cache atomically via atomic_dir + locked double-check
_atomic: add atomic_dir directory-publish primitive
fasta: use .gvlfa cache module and accept .gvlfa input
_fasta_cache: add ensure_cache orchestrator and dispatch
_fasta_cache: migrate legacy .gvl caches by reusing bytes
_fasta_cache: add build, load, and validity guards
_fasta_cache: add source hints and three-way resolution
_fasta_cache: add FastaCache models and fingerprint

Fix

torch: warn when BatchSampler overrides explicit batch_size
torch: buffered modes honor drop_last=False
torch: do not forward drop_last to DataLoader in default mode
chunked: keep trailing partial batch in ChunkPlanner
test_fasta: move mid-file imports to top (E402, CI lint)
_fasta_cache: guard legacy migration against stale/truncated bytes
_fasta_cache: raise on format-too-new sibling cache instead of silent downgrade

Refactor

_write: use plain with-atomic_dir; restore warnings filter; add atomicity + format_version on-disk tests
reference: build cache via ensure_cache, accept .gvlfa
_fasta_cache: fix progress bar advance and tighten load status type


01 Jun 21:29

d-laub

v0.26.0

ce2efe9

v0.26.0

Highlights: prefetching dataloaders for higher GPU utilization, zero-copy spliced-haplotype output driven by a GTF/BED, selective loading of variant INFO/dosage fields, and a ground-up flat-buffer rewrite of sequence reconstruction — ~1.5–3× faster indexing with markedly lower peak memory and byte-identical output.

✨ Features

Prefetching dataloaders
Dataset.to_dataloader(...) gains a mode argument with two prefetching strategies that overlap data production with training:

mode="buffered" — a background producer fills an in-process buffer.
mode="double_buffered" — a producer subprocess serializes batches into shared memory, decoupling reconstruction from the training loop entirely.

New related knobs: buffer_bytes (default 2 GiB), copy, and heartbeat_seconds. Both modes require num_workers=0.
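
The idea behind mode="buffered" (a background producer filling an in-process buffer while the consumer trains) can be sketched with a thread and a bounded queue. This is not gvl's implementation: the prefetch function is hypothetical, and it bounds the buffer by item count, whereas the release bounds it by bytes via buffer_bytes.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, max_buffered=4):
    """Iterate `batches` while a background thread runs up to max_buffered ahead."""
    buffer = queue.Queue(maxsize=max_buffered)

    def produce():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
            buffer.put(_DONE)
        except BaseException as exc:
            buffer.put(exc)  # hand producer errors to the consumer

    # Daemon thread: an abandoned producer does not keep the process alive.
    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        if isinstance(item, BaseException):
            raise item
        yield item


squares = list(prefetch(i * i for i in range(5)))
```

The consumer only ever waits when the buffer is empty, so batch production overlaps with whatever work happens between iterations.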

Zero-copy spliced haplotypes
Request spliced output by passing splice_info to Dataset.open(...) or with_settings(...). Exon segments are assembled into a single contiguous haplotype via a sample-agnostic SpliceMap, and the whole splice path is zero-copy — reference, haplotype, and track output are laid out directly in flat buffers (SplicePlan) with no intermediate gather/concat. The new public helper gvl.get_splice_bed builds the splice BED from your annotation.
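
The layout idea (compute each exon's destination offset up front, then write every segment straight into one preallocated buffer) can be sketched as follows. The function and its arguments are hypothetical and use plain bytes rather than gvl's SpliceMap or SplicePlan; it only illustrates avoiding a per-exon gather followed by a concat.

```python
def splice_into_flat(reference, exons):
    """Copy exon segments of `reference` (bytes) into one preallocated buffer.

    `exons` is a list of half-open (start, end) intervals in transcript order.
    """
    # Destination offset of each exon: running sum of the exon lengths.
    offsets = [0]
    for start, end in exons:
        offsets.append(offsets[-1] + (end - start))

    out = bytearray(offsets[-1])  # single allocation for the whole transcript
    src = memoryview(reference)   # slicing a memoryview does not copy
    for (start, end), offset in zip(exons, offsets):
        out[offset:offset + (end - start)] = src[start:end]
    return out, offsets


spliced, exon_offsets = splice_into_flat(b"AACCGGTTAACC", [(2, 4), (8, 10)])
```

Each base is written once into its final position, so the only allocation is the output buffer itself.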

Selective variant fields
Dataset.open(...) and with_settings(...) accept var_fields=[...] to lazily load only the INFO/dosage fields you need; inspect what's available via available_var_fields. Dosage is now loaded only when explicitly requested (#191).

⚡ Performance

Flat-buffer reconstruction: sequence, haplotype, track, and variant reconstruction now return numpy-backed flat ragged buffers instead of routing through awkward kernels. Combined with seqpro to_packed/masked-reverse-complement kernels (seqpro 0.12.1 → 0.14), the indexing hot path is roughly 1.5–3× faster with byte-identical output.
Lower peak memory: the flat-buffer path eliminates the per-batch awkward empty/concat/_carry churn that dominated allocation in profiling — cumulative allocation drops ~80%, and per-operation peak memory falls sharply (e.g. reverse-complement goes from ~11× batch-size churn to ~1× on copy, 0 in-place). This directly relieves the peak-RSS pressure that pushed long-seqlen track dataloading toward OOM.
to_padded and reverse-complement now delegate to seqpro flat-buffer kernels.
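
A flat ragged buffer stores all rows back to back in one array plus an offsets array marking row boundaries. The sketch below shows that representation and a masked reverse-complement over it in pure Python. FlatRagged is a hypothetical stand-in, not gvl's _Flat type or a seqpro kernel, and it makes no claim about their performance.

```python
from dataclasses import dataclass

_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


@dataclass
class FlatRagged:
    data: bytearray  # all rows stored back to back
    offsets: list    # row i is data[offsets[i]:offsets[i + 1]]

    @classmethod
    def from_rows(cls, rows):
        offsets = [0]
        for row in rows:
            offsets.append(offsets[-1] + len(row))
        return cls(bytearray(b"".join(rows)), offsets)

    def row(self, i):
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])

    def reverse_complement(self, mask):
        """Reverse-complement rows where mask[i] is true, overwriting the buffer."""
        for i, flip in enumerate(mask):
            if flip:
                lo, hi = self.offsets[i], self.offsets[i + 1]
                # Uses a row-sized temporary; the batch buffer is not reallocated.
                self.data[lo:hi] = self.data[lo:hi][::-1].translate(_COMPLEMENT)


batch = FlatRagged.from_rows([b"ACGT", b"AAC", b""])
batch.reverse_complement([False, True, False])
```

Row lengths never change under reverse-complement, so the offsets stay valid and no per-row objects need to be rebuilt.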

🐛 Fixes

Reproducible shuffling: the generator passed to to_dataloader is now forwarded to RandomSampler, and with_settings(rng=...) maps correctly to the underlying RNG field.
gvl.Table is temporarily disabled until the upstream non-deterministic polars-bio segfault is resolved (#198 / upstream #395).
Sites-only output uses the correct genoray get_record_info kwargs.
to_padded numeric branch handles clip=True output correctly.
Fixed empty-region handling in the no-extend VCF write path, copying of read-only inputs in idx_like_to_array, and 2-D offset indexing in filter_af / choose_exonic_variants.

🔧 Internals

New testing infrastructure built on vcfixture: synthetic-reference test fixtures plus property-based tests that validate gvl output against independent oracles (e.g. bcftools consensus), replacing the old hg38-coupled toy fixtures.
Migrated the dataset, reconstruction, reference, indexing, splice, ragged, variants, and types modules from attrs to stdlib dataclasses.
Split reconstruction into _query.py (QueryView); introduced OpenRequest / ReconstructionRequest and a _build_reconstructor factory.
New benchmarking harness: thread-pinned orchestration, per-cell measurement protocol, bandwidth/small-multiples plots, and CPU/microarch logging.

Feat

query: flat-aware getitem boundary (legacy Ragged still supported)
flat: masked reverse/RC on flat buffers
flat: _Flat numpy ragged transport (no awkward kernels)
write: add _window_to_sparse dense->sparse dispatch helper
bench: log CPU/system/microarch info for all benchmarks
bench: add MiB/s bandwidth plot; trim 1KG regression test data
bench: add 3x4 small-multiples results plot
bench: add bench.py thread-pinned orchestration
bench: add CSV header/append helpers
bench: add per-cell measurement protocol
bench: add exact output-bytes table helper
bench: add BED resize + per-region-length dataset prep
bench: enumerate deduped dataloader bench cells
bench: scaffold dataloader bench axis constants
mode='double_buffered' dataloader happy path
producer: subprocess entrypoint for double_buffered mode
shm: Ragged and RaggedVariants serialization
shm: hand-rolled slot header + dense round-trip
mode='buffered' dataloader
chunked: ChunkPlanner and slice_chunk
dataset: _output_bytes_per_instance tracks branch
dataset: _output_bytes_per_instance variants branch with var_fields
dataset: _output_bytes_per_instance annotated branch
dataset: _output_bytes_per_instance reference + haplotypes
haps: add _allele_bytes_sum for exact variant footprint
dataset: with_settings lazily loads new var_fields
open: Dataset.open accepts var_fields, forwards to Haps.from_path
haps: from_path honors var_fields for lazy info+dosage loading
haps: _Variants.load_info lazily extends info dict
haps: _Variants.from_table accepts info_fields filter
haps: add _Variants.available_info_fields schema peek
reconstruct: promote view-state to explicit _seqs_kind field
reconstruct: add _build_reconstructor factory
splice: Tracks._call_float32 accepts SplicePlan
splice: Haps._get_haplotypes accepts SplicePlan
splice: Ref.call accepts optional SplicePlan for zero-copy spliced layout
splice: add SplicePlan + build_splice_plan helper
refds: support with_settings(splice_info) and spliced subset_to
refds: implement spliced getitem via SpliceMap + _cat_length
refds: add splice_info field and is_spliced/spliced_regions
splice: add sample-agnostic SpliceMap

Fix

write: handle empty regions in no-extend VCF path; assert chunk/index alignment
table: temporarily disable gvl.Table to avoid polars-bio segfault
double_buffered: release producer+shm per loader, not at process exit
double_buffered: serialize RaggedAnnotatedHaps (annotated output)
double_buffered: size shm slots for serialized ragged footprint
bench: open datasets with hg38 reference; revert out-of-scope _open.py change
double_buffered: replay all Dataset settings in producer subprocess
sitesonly: use correct genoray VCF.get_record_info kwargs (fields= not attrs=)
ragged: to_padded numeric branch handles clip=True RegularArray output
torch: forward generator from DataLoader through to RandomSampler for reproducible shuffle
dataset: with_settings(rng=...) maps to _rng dataclass field
inline offset computation in filter_af 2-D path; reorder test_torch imports
rag_variants: accept positional axis arg in RaggedVariants.squeeze
haps: gate dosage output by var_fields (#191)
utils: copy read-only inputs in idx_like_to_array
types: annotate Fasta.pad as bytes | None
Dataset.open: promote to haplotypes before with_settings(splice_info)
choose_exonic_variants: use (2, n_slices) indexing for 2-D offsets
with_settings: propagate var_filter to _recon, preserving kind

Refactor

variants: remove orphaned _rc_helper/_rc_numba_helper
haps: extract _build_allele_layout/_alt_layout_parts helpers
query: overloads for reverse_complement_ragged/pad match flat runtime contract
flat: document + guard reverse_masked comp as DNA-RC mode selector
write: assemble full PGEN windows, dispatch via _window_to_sparse, fix max_ends
write: assemble full VCF windows, dispatch via _window_to_sparse, fix max_ends
trim dead code and over-commenting per CLAUDE.md
types: remove stale type:ignore and annotate the rest
naming: rename SplicePlan.perm -> SplicePlan.permutation
naming: standardize geno_offset_idxs -> geno_offset_idx
naming: rename rsp_idx -> region_sample_ploid_idx
dataset: split _reconstruct.py + extract _query.py via QueryView (PR6)
reconstruct: delete dead body of write_transformed_track + add roadmap
reconstruct: ReconstructionRequest + restructure _get_haplotypes
open: extract OpenRequest + decompose Dataset.open
write: extract shared phased-chunked writer for VCF/PGEN
impl: migrate _impl.py from attrs to stdlib dataclass
reconstruct: migrate _reconstruct.py from attrs to stdlib dataclass
reference: migrate _reference.py from attrs to stdlib dataclass
indexing: migrate _indexing.py from attrs to stdlib dataclass
splice: migrate _splice.py from attrs to stdlib dataclass
insertion-fill: migrate _insertion_fill.py from attrs to stdlib dataclass
ragged: migrate _ragged.py from attrs to stdlib dataclass
variants: migrate _records.py from attrs to stdlib dataclass
types: migrate _types.py from attrs to stdlib dataclass
impl: route remaining _recon construction/checks through view-state
impl: sequence_type returns the _seqs_kind field directly
impl: collapse with_settings _recon propagation via factory
impl: simplify with_tracks via factory + active_tracks check
impl: simplify with_seqs via _seqs_kind + factory
impl: route Dataset.open construction through factory
splice: Dataset spliced path uses SplicePlan
splice: RefDataset spliced path uses SplicePlan
splice: compose SpliceIndexer from SpliceMap + DatasetIndexer
splice: annotate SpliceMap method signatures
splice: clean up _splice imports per review
splice: move _cat_length helper...


22 May 02:13

d-laub

v0.25.0

bfcd46d

v0.25.0

Feat

add migrate_svar_link for upgrading legacy datasets
dataset: wire svar_link resolver into Haps.from_path; Dataset.open(svar=)
dataset: add _resolve_svar and _verify_fingerprint
write: record SvarLink in metadata, drop link.svar symlink
dataset: add SvarLink / SvarFingerprint pydantic models
write: subtract genoray nbytes from max_mem; warn when index dominates

Fix

ndim guard on geno_offsets in choose_exonic_variants second loop
write: clarify max_mem docstring and skip index-accounting log for SparseVar
write: eager-load variant index for accurate max_mem accounting

Refactor

dataset: switch Haps.from_path version compare to SemanticVersion
dataset: use SemanticVersion in Metadata, add svar_link field


13 May 19:50

github-actions

v0.24.1

dff4723

v0.24.1

Fix

bump genoray, VCF bug


12 May 21:21

github-actions

v0.24.0

ad52efa

v0.24.0

Changelog

v0.24.0 (2026-05-12)

Feat

Dataset.with_insertion_fill + public API exports
route per-track insertion fill into HapsTracks kernel call
per-track insertion-fill on Tracks reconstructor
kernel-level insertion-fill strategy dispatch
add InsertionFill strategy classes and lowering helper

Fix

bump genoray
insertion-fill: strengthen tests, Self return type, clearer error
insertion-fill: lazy fallback to Repeat5p for unpopulated insertion_fill
insertion-fill: derive base_seed from full idx array, use full uint64 range
insertion-fill: require params, document fallback, broaden flank-sample tests
insertion-fill: non-instantiable base, tighter MAX_PARAMS, add test coverage

[main ad52efa] bump: version 0.23.1 → 0.24.0
2 files changed, 23 insertions(+), 1 deletion(-)


11 May 18:59

github-actions

v0.23.1

1d31682

v0.23.1

Changelog

v0.23.1 (2026-05-11)

Fix

perf: benchmarks
perf: bench gvl.Table query algs
perf: use single polars-bio overlap (no xprod) in gvl.Table
types

Refactor

perf: vectorize scatter, use replace_strict and lexsort in gvl.Table

[main 1d31682] bump: version 0.23.0 → 0.23.1
2 files changed, 17 insertions(+), 1 deletion(-)


09 May 00:48

github-actions

v0.23.0

577234e

v0.23.0

Changelog

0.23.0 (2026-05-09)

Feat

generalize gvl.write() to accept Table tracks
rename write() param bigwigs= -> tracks=, support mixed sequences
Table._intervals_from_offsets via polars_bio.overlap
Table.count_intervals via polars_bio.count_overlaps
add Table.from_path for csv/tsv/parquet/arrow files
add Table skeleton with long-form DataFrame init
add IntervalTrack Protocol for unified track sources
export get_splice_bed from package root
add get_splice_bed for GTF→splicing-BED conversion

Fix

normalize contig names in Table; correct unavail set warning
cast chrom and strand to Utf8 in get_splice_bed

Refactor

rename _write_bigwigs -> _write_track
tighten IntervalTrack Protocol annotation and docstring

[main 577234e] bump: version 0.22.3 → 0.23.0
2 files changed, 28 insertions(+), 1 deletion(-)


08 May 06:15

github-actions

v0.22.3

2f7b60e

v0.22.3

Changelog

0.22.3 (2026-05-08)

Fix

ci: exclude _impl.py from debug-statements hook (match syntax)
deps: bump seqpro to 0.11.0 and genoray to 2.3.0
_utils: handle Categorical strand in bed_to_regions

[main 2f7b60e] bump: version 0.22.2 → 0.22.3
2 files changed, 12 insertions(+), 1 deletion(-)


28 Apr 23:12

github-actions

v0.22.2

0b45cb6

v0.22.2

Changelog

0.22.2 (2026-04-28)

Fix

make tbb and pyomp optional dependencies

[main 0b45cb6] bump: version 0.22.1 → 0.22.2
2 files changed, 10 insertions(+), 1 deletion(-)


22 Apr 05:35

github-actions

v0.22.1

49895a2

v0.22.1

Changelog

0.22.1 (2026-04-22)

Fix

skip overlapping variants in get_diffs_sparse to match reconstruction logic

[main 49895a2] bump: version 0.22.0 → 0.22.1
2 files changed, 10 insertions(+), 1 deletion(-)


Releases: mcvickerlab/GenVarLoader