Releases: mcvickerlab/GenVarLoader
v0.27.0
Highlights: robust on-disk artifacts. gvl.write now creates datasets atomically under an advisory file lock, and Dataset.open validates dataset format version + structural integrity before use. FASTA caches move to a self-describing .gvlfa format that builds atomically, auto-migrates legacy .gvl caches, and auto-rebuilds when stale. Plus correct drop_last handling across all dataloader modes.
✨ Features
Atomic, concurrency-safe dataset creation
gvl.write builds into a private sibling temp directory and publishes via an atomic os.replace, so a destination directory is never observed half-written. A best-effort filelock lets parallel jobs sharing one destination avoid redundant rebuilds — correctness rests on the atomic rename; the lock is advisory only (new filelock dependency).
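The temp-build-then-rename pattern can be sketched with the standard library alone. This is an illustration, not gvl's atomic_dir: the helper name publish_dir_atomically is hypothetical, it assumes the destination does not exist yet, and it leaves out the advisory lock.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def publish_dir_atomically(dest):
    """Yield a private temp dir next to `dest`; rename it onto `dest` on success.

    Hypothetical helper for illustration only (not gvl's atomic_dir).
    Assumes `dest` does not already exist.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # A sibling temp dir is on the same filesystem, so the rename below is atomic.
    tmp = Path(tempfile.mkdtemp(prefix=f".{dest.name}.", dir=dest.parent))
    try:
        yield tmp  # the caller writes every file into tmp
        os.replace(tmp, dest)  # one rename: readers see nothing or everything
    except BaseException:
        # A failed build is discarded, so no partial directory is ever published.
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A caller would write its files inside `with publish_dir_atomically(dest) as tmp:`; if the body raises, the destination never appears.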
Dataset format versioning + integrity validation
Metadata now records a format_version, and Dataset.open validates both the version and structural integrity (file presence and sizes) before returning. An incompatible or corrupt dataset raises a clear ValueError instructing you to regenerate with gvl.write. Datasets do not auto-rebuild.
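A version-plus-integrity check of this kind might look like the sketch below. The metadata.json layout, the SUPPORTED_FORMAT_VERSION value, and the function name are all assumptions made for illustration; gvl's real metadata schema and checks may differ.

```python
import json
from pathlib import Path

SUPPORTED_FORMAT_VERSION = 1  # hypothetical value, not gvl's actual version


def validate_dataset_dir(path):
    """Raise ValueError unless the recorded version and file sizes match.

    Assumes a hypothetical metadata.json shaped like
    {"format_version": 1, "files": {"relative/name": size_in_bytes}}.
    """
    path = Path(path)
    meta_path = path / "metadata.json"
    if not meta_path.is_file():
        raise ValueError(f"{path}: missing metadata.json; regenerate the dataset")
    meta = json.loads(meta_path.read_text())
    version = meta.get("format_version")
    if version != SUPPORTED_FORMAT_VERSION:
        raise ValueError(
            f"{path}: format_version {version!r} is unsupported "
            f"(expected {SUPPORTED_FORMAT_VERSION}); regenerate the dataset"
        )
    # Structural integrity: every recorded file must exist with the recorded size.
    for name, size in meta.get("files", {}).items():
        f = path / name
        if not f.is_file():
            raise ValueError(f"{path}: missing file {name}; regenerate the dataset")
        actual = f.stat().st_size
        if actual != size:
            raise ValueError(
                f"{path}: {name} is {actual} bytes, expected {size}; "
                "regenerate the dataset"
            )
    return meta
```

Failing loudly rather than rebuilding matches the note above that datasets do not auto-rebuild.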
New .gvlfa FASTA cache
gvl.Reference.from_path now builds and reuses a self-describing, fingerprint-validated .gvlfa cache directory (and accepts a .gvlfa path directly). The cache is published atomically under a best-effort lock — concurrent builders sharing one reference are safe — and, unlike datasets, auto-rebuilds from its source when stale or missing. Legacy .fa.gvl caches are migrated in place by reusing their bytes.
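The reuse-or-rebuild decision can be illustrated with a small fingerprint check. The release notes do not say which fields the .gvlfa fingerprint records, so the size + mtime fingerprint, the fingerprint.json file name, and ensure_cache_sketch are assumptions; the sketch also leaves out the atomic publish and the lock.

```python
import json
import os
from pathlib import Path


def fingerprint(source):
    """Cheap source fingerprint: size and mtime (an assumption, not gvl's fields)."""
    st = os.stat(source)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def cache_is_fresh(source, cache_dir):
    """True if cache_dir recorded a fingerprint equal to the source's current one."""
    record = Path(cache_dir) / "fingerprint.json"
    if not record.is_file():
        return False
    return json.loads(record.read_text()) == fingerprint(source)


def ensure_cache_sketch(source, cache_dir, build):
    """Reuse a fresh cache; otherwise call build(source, cache_dir) and re-record."""
    if cache_is_fresh(source, cache_dir):
        return "reused"
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    build(source, cache_dir)
    # Record the fingerprint last, so an interrupted build is treated as stale.
    (cache_dir / "fingerprint.json").write_text(json.dumps(fingerprint(source)))
    return "rebuilt"
```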
🐛 Fixes
- to_dataloader(drop_last=...) now honored across all modes: drop_last is no longer forwarded to the underlying DataLoader in default mode (it was double-applied); buffered/double_buffered modes honor drop_last=False; and ChunkPlanner keeps the trailing partial batch.
- BatchSampler conflict warning: to_dataloader now warns when a BatchSampler overrides an explicit batch_size.
- .gvlfa migration guards against stale/truncated legacy bytes, and a format-too-new sibling cache now raises instead of silently downgrading.
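For reference, the drop_last semantics these fixes restore come down to whether the trailing partial batch is kept. A minimal sketch follows; batch_slices is a hypothetical helper, not part of gvl or PyTorch.

```python
def batch_slices(n_items, batch_size, drop_last):
    """Return (start, stop) pairs covering n_items in batches of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    slices = [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]
    # Only a short final batch is ever dropped; full batches are always kept.
    if drop_last and slices and slices[-1][1] - slices[-1][0] < batch_size:
        slices.pop()
    return slices


print(batch_slices(10, 4, drop_last=False))  # [(0, 4), (4, 8), (8, 10)]
print(batch_slices(10, 4, drop_last=True))   # [(0, 4), (4, 8)]
```

With 10 items and a batch size of 4, drop_last=False yields three batches (the last holding 2 items) and drop_last=True yields two.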
🔧 Internals
- New atomic_dir directory-publish primitive (temp build + os.replace) underpins both dataset and FASTA-cache creation.
- _fasta_cache module: FastaCache models + fingerprinting, three-way source resolution, build/load/validity guards, legacy migration, and an ensure_cache orchestrator.
- Concurrency regression tests for atomic cache + dataset creation (closes #21); coverage for too-old format versions and the genotypes-without-ploidy branch.
Out of scope: genoray .gvi and pysam .fai/.gzi index files are created by those upstream libraries and are not covered by gvl's atomic/locked creation.
Feat
- _open: validate dataset format version + integrity on open
- _write: atomic dataset creation + format_version in Metadata
- _fasta_cache: publish cache atomically via atomic_dir + locked double-check
- _atomic: add atomic_dir directory-publish primitive
- fasta: use .gvlfa cache module and accept .gvlfa input
- _fasta_cache: add ensure_cache orchestrator and dispatch
- _fasta_cache: migrate legacy .gvl caches by reusing bytes
- _fasta_cache: add build, load, and validity guards
- _fasta_cache: add source hints and three-way resolution
- _fasta_cache: add FastaCache models and fingerprint
Fix
- torch: warn when BatchSampler overrides explicit batch_size
- torch: buffered modes honor drop_last=False
- torch: do not forward drop_last to DataLoader in default mode
- chunked: keep trailing partial batch in ChunkPlanner
- test_fasta: move mid-file imports to top (E402, CI lint)
- _fasta_cache: guard legacy migration against stale/truncated bytes
- _fasta_cache: raise on format-too-new sibling cache instead of silent downgrade
Refactor
- _write: use plain with-atomic_dir; restore warnings filter; add atomicity + format_version on-disk tests
- reference: build cache via ensure_cache, accept .gvlfa
- _fasta_cache: fix progress bar advance and tighten load status type
v0.26.0
Highlights: prefetching dataloaders for higher GPU utilization, zero-copy spliced-haplotype output driven by a GTF/BED, selective loading of variant INFO/dosage fields, and a ground-up flat-buffer rewrite of sequence reconstruction — ~1.5–3× faster indexing with markedly lower peak memory and byte-identical output.
✨ Features
Prefetching dataloaders
Dataset.to_dataloader(...) gains a mode argument with two prefetching strategies that overlap data production with training:
- mode="buffered": a background producer fills an in-process buffer.
- mode="double_buffered": a producer subprocess serializes batches into shared memory, decoupling reconstruction from the training loop entirely.
New related knobs: buffer_bytes (default 2 GiB), copy, and heartbeat_seconds. Both modes require num_workers=0.
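The idea behind the buffered mode, a producer that stays ahead of the consumer, can be sketched with a thread and a bounded queue. This is a generic illustration rather than gvl's implementation: prefetch and max_buffered are hypothetical names, and the bound here is a batch count, whereas gvl sizes its buffer in bytes via buffer_bytes.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, max_buffered=4):
    """Yield items from `batches` while a background thread stays ahead."""
    buf = queue.Queue(maxsize=max_buffered)
    errors = []

    def produce():
        try:
            for batch in batches:
                buf.put(batch)  # blocks when the buffer is full
        except BaseException as exc:
            errors.append(exc)  # re-raised in the consumer below
        finally:
            buf.put(_DONE)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            break
        yield item
    if errors:
        raise errors[0]
```

If the consumer stops iterating early, the daemon producer simply stays blocked on a full queue until the process exits; a production version would add an explicit shutdown signal.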
Zero-copy spliced haplotypes
Request spliced output by passing splice_info to Dataset.open(...) or with_settings(...). Exon segments are assembled into a single contiguous haplotype via a sample-agnostic SpliceMap, and the whole splice path is zero-copy — reference, haplotype, and track output are laid out directly in flat buffers (SplicePlan) with no intermediate gather/concat. The new public helper gvl.get_splice_bed builds the splice BED from your annotation.
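The flat-buffer layout described here can be illustrated in plain Python: compute each exon's destination offset up front, then write every segment straight into one preallocated output. The function name and the (start, end) tuple format are assumptions made for this sketch; gvl's SplicePlan operates on numpy buffers and its fields are not shown in these notes.

```python
def splice_into_flat(reference, exons):
    """Write exon intervals of `reference` back to back into one buffer.

    `exons` is a list of half-open (start, end) pairs in transcript order.
    Returns the output buffer and the offsets marking where each exon begins.
    """
    # Prefix sums of exon lengths give each exon's destination offset.
    offsets = [0]
    for start, end in exons:
        offsets.append(offsets[-1] + (end - start))
    out = bytearray(offsets[-1])  # allocated once, at the final size
    src = memoryview(reference)   # slicing a memoryview does not copy
    for (start, end), dst in zip(exons, offsets):
        out[dst:dst + (end - start)] = src[start:end]
    return out, offsets


spliced, offsets = splice_into_flat(b"AACCGGTTAACC", [(2, 4), (8, 10)])
print(bytes(spliced), offsets)  # b'CCAA' [0, 2, 4]
```

Because the offsets are known before any bytes move, no per-exon intermediate arrays need to be gathered and concatenated.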
Selective variant fields
Dataset.open(...) and with_settings(...) accept var_fields=[...] to lazily load only the INFO/dosage fields you need; inspect what's available via available_var_fields. Dosage is now loaded only when explicitly requested (#191).
⚡ Performance
- Flat-buffer reconstruction: sequence, haplotype, track, and variant reconstruction now return numpy-backed flat ragged buffers instead of routing through awkward kernels. Combined with seqpro to_packed/masked-reverse-complement kernels (seqpro 0.12.1 → 0.14), the indexing hot path is roughly 1.5–3× faster with byte-identical output.
- Lower peak memory: the flat-buffer path eliminates the per-batch awkward empty/concat/_carry churn that dominated allocation in profiling. Cumulative allocation drops ~80%, and per-operation peak memory falls sharply (e.g. reverse-complement goes from ~11× batch-size churn to ~1× on copy, 0 in-place). This directly relieves the peak-RSS pressure that pushed long-seqlen track dataloading toward OOM.
- to_padded and reverse-complement now delegate to seqpro flat-buffer kernels.
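A flat ragged buffer is one contiguous data array plus an offsets array that marks row boundaries. The pure-Python sketch below shows a masked reverse-complement over that layout; the function name is hypothetical, and unlike the seqpro kernels it allocates a small temporary per flipped row.

```python
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement_flat(data, offsets, mask):
    """Reverse-complement selected rows of a flat ragged buffer.

    data: bytearray holding all rows back to back.
    offsets: row i spans data[offsets[i]:offsets[i + 1]].
    mask: one bool per row; only True rows are modified.
    """
    for i, flip in enumerate(mask):
        if flip:
            lo, hi = offsets[i], offsets[i + 1]
            # Complement each base, reverse the row, and write it back in place.
            data[lo:hi] = data[lo:hi].translate(_COMPLEMENT)[::-1]
    return data


rows = bytearray(b"AACGTTT")  # two rows: "AAC" and "GTTT"
print(bytes(reverse_complement_flat(rows, [0, 3, 7], [True, False])))  # b'GTTGTTT'
```

Row lengths never change under reverse-complement, so the offsets array is reused untouched.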
🐛 Fixes
- Reproducible shuffling: the generator passed to to_dataloader is now forwarded to RandomSampler, and with_settings(rng=...) maps correctly to the underlying RNG field.
- gvl.Table is temporarily disabled until the upstream non-deterministic polars-bio segfault is resolved (#198 / upstream #395).
- Sites-only output uses the correct genoray get_record_info kwargs.
- to_padded numeric branch handles clip=True output correctly.
- Empty-region handling in the no-extend VCF write path; read-only-input copies and 2-D offset indexing fixes in filter_af/choose_exonic_variants.
🔧 Internals
- New testing infrastructure built on vcfixture: synthetic-reference test fixtures plus property-based tests that validate gvl output against independent oracles (e.g. bcftools consensus), replacing the old hg38-coupled toy fixtures.
- Migrated the dataset, reconstruction, reference, indexing, splice, ragged, variants, and types modules from attrs to stdlib dataclasses.
- Split reconstruction into _query.py (QueryView); introduced OpenRequest/ReconstructionRequest and a _build_reconstructor factory.
- New benchmarking harness: thread-pinned orchestration, per-cell measurement protocol, bandwidth/small-multiples plots, and CPU/microarch logging.
Feat
- query: flat-aware getitem boundary (legacy Ragged still supported)
- flat: masked reverse/RC on flat buffers
- flat: _Flat numpy ragged transport (no awkward kernels)
- write: add _window_to_sparse dense->sparse dispatch helper
- bench: log CPU/system/microarch info for all benchmarks
- bench: add MiB/s bandwidth plot; trim 1KG regression test data
- bench: add 3x4 small-multiples results plot
- bench: add bench.py thread-pinned orchestration
- bench: add CSV header/append helpers
- bench: add per-cell measurement protocol
- bench: add exact output-bytes table helper
- bench: add BED resize + per-region-length dataset prep
- bench: enumerate deduped dataloader bench cells
- bench: scaffold dataloader bench axis constants
- mode='double_buffered' dataloader happy path
- producer: subprocess entrypoint for double_buffered mode
- shm: Ragged and RaggedVariants serialization
- shm: hand-rolled slot header + dense round-trip
- mode='buffered' dataloader
- chunked: ChunkPlanner and slice_chunk
- dataset: _output_bytes_per_instance tracks branch
- dataset: _output_bytes_per_instance variants branch with var_fields
- dataset: _output_bytes_per_instance annotated branch
- dataset: _output_bytes_per_instance reference + haplotypes
- haps: add _allele_bytes_sum for exact variant footprint
- dataset: with_settings lazily loads new var_fields
- open: Dataset.open accepts var_fields, forwards to Haps.from_path
- haps: from_path honors var_fields for lazy info+dosage loading
- haps: _Variants.load_info lazily extends info dict
- haps: _Variants.from_table accepts info_fields filter
- haps: add _Variants.available_info_fields schema peek
- reconstruct: promote view-state to explicit _seqs_kind field
- reconstruct: add _build_reconstructor factory
- splice: Tracks._call_float32 accepts SplicePlan
- splice: Haps._get_haplotypes accepts SplicePlan
- splice: Ref.call accepts optional SplicePlan for zero-copy spliced layout
- splice: add SplicePlan + build_splice_plan helper
- refds: support with_settings(splice_info) and spliced subset_to
- refds: implement spliced getitem via SpliceMap + _cat_length
- refds: add splice_info field and is_spliced/spliced_regions
- splice: add sample-agnostic SpliceMap
Fix
- write: handle empty regions in no-extend VCF path; assert chunk/index alignment
- table: temporarily disable gvl.Table to avoid polars-bio segfault
- double_buffered: release producer+shm per loader, not at process exit
- double_buffered: serialize RaggedAnnotatedHaps (annotated output)
- double_buffered: size shm slots for serialized ragged footprint
- bench: open datasets with hg38 reference; revert out-of-scope _open.py change
- double_buffered: replay all Dataset settings in producer subprocess
- sitesonly: use correct genoray VCF.get_record_info kwargs (fields= not attrs=)
- ragged: to_padded numeric branch handles clip=True RegularArray output
- torch: forward generator from DataLoader through to RandomSampler for reproducible shuffle
- dataset: with_settings(rng=...) maps to _rng dataclass field
- inline offset computation in filter_af 2-D path; reorder test_torch imports
- rag_variants: accept positional axis arg in RaggedVariants.squeeze
- haps: gate dosage output by var_fields (#191)
- utils: copy read-only inputs in idx_like_to_array
- types: annotate Fasta.pad as bytes | None
- Dataset.open: promote to haplotypes before with_settings(splice_info)
- choose_exonic_variants: use (2, n_slices) indexing for 2-D offsets
- with_settings: propagate var_filter to _recon, preserving kind
Refactor
- variants: remove orphaned _rc_helper/_rc_numba_helper
- haps: extract _build_allele_layout/_alt_layout_parts helpers
- query: overloads for reverse_complement_ragged/pad match flat runtime contract
- flat: document + guard reverse_masked comp as DNA-RC mode selector
- write: assemble full PGEN windows, dispatch via _window_to_sparse, fix max_ends
- write: assemble full VCF windows, dispatch via _window_to_sparse, fix max_ends
- trim dead code and over-commenting per CLAUDE.md
- types: remove stale type:ignore and annotate the rest
- naming: rename SplicePlan.perm -> SplicePlan.permutation
- naming: standardize geno_offset_idxs -> geno_offset_idx
- naming: rename rsp_idx -> region_sample_ploid_idx
- dataset: split _reconstruct.py + extract _query.py via QueryView (PR6)
- reconstruct: delete dead body of write_transformed_track + add roadmap
- reconstruct: ReconstructionRequest + restructure _get_haplotypes
- open: extract OpenRequest + decompose Dataset.open
- write: extract shared phased-chunked writer for VCF/PGEN
- impl: migrate _impl.py from attrs to stdlib dataclass
- reconstruct: migrate _reconstruct.py from attrs to stdlib dataclass
- reference: migrate _reference.py from attrs to stdlib dataclass
- indexing: migrate _indexing.py from attrs to stdlib dataclass
- splice: migrate _splice.py from attrs to stdlib dataclass
- insertion-fill: migrate _insertion_fill.py from attrs to stdlib dataclass
- ragged: migrate _ragged.py from attrs to stdlib dataclass
- variants: migrate _records.py from attrs to stdlib dataclass
- types: migrate _types.py from attrs to stdlib dataclass
- impl: route remaining _recon construction/checks through view-state
- impl: sequence_type returns the _seqs_kind field directly
- impl: collapse with_settings _recon propagation via factory
- impl: simplify with_tracks via factory + active_tracks check
- impl: simplify with_seqs via _seqs_kind + factory
- impl: route Dataset.open construction through factory
- splice: Dataset spliced path uses SplicePlan
- splice: RefDataset spliced path uses SplicePlan
- splice: compose SpliceIndexer from SpliceMap + DatasetIndexer
- splice: annotate SpliceMap method signatures
- splice: clean up _splice imports per review
- splice: move _cat_length helper...
v0.25.0
Feat
- add migrate_svar_link for upgrading legacy datasets
- dataset: wire svar_link resolver into Haps.from_path; Dataset.open(svar=)
- dataset: add _resolve_svar and _verify_fingerprint
- write: record SvarLink in metadata, drop link.svar symlink
- dataset: add SvarLink / SvarFingerprint pydantic models
- write: subtract genoray nbytes from max_mem; warn when index dominates
Fix
- ndim guard on geno_offsets in choose_exonic_variants second loop
- write: clarify max_mem docstring and skip index-accounting log for SparseVar
- write: eager-load variant index for accurate max_mem accounting
Refactor
- dataset: switch Haps.from_path version compare to SemanticVersion
- dataset: use SemanticVersion in Metadata, add svar_link field
v0.24.1
Fix
- bump genoray, VCF bug
v0.24.0
Changelog
v0.24.0 (2026-05-12)
Feat
- Dataset.with_insertion_fill + public API exports
- route per-track insertion fill into HapsTracks kernel call
- per-track insertion-fill on Tracks reconstructor
- kernel-level insertion-fill strategy dispatch
- add InsertionFill strategy classes and lowering helper
Fix
- bump genoray
- insertion-fill: strengthen tests, Self return type, clearer error
- insertion-fill: lazy fallback to Repeat5p for unpopulated insertion_fill
- insertion-fill: derive base_seed from full idx array, use full uint64 range
- insertion-fill: require params, document fallback, broaden flank-sample tests
- insertion-fill: non-instantiable base, tighter MAX_PARAMS, add test coverage
v0.23.1
Changelog
v0.23.1 (2026-05-11)
Fix
- perf: benchmarks
- perf: bench gvl.Table query algs
- perf: use single polars-bio overlap (no xprod) in gvl.Table
- types
Refactor
- perf: vectorize scatter, use replace_strict and lexsort in gvl.Table
v0.23.0
Changelog
0.23.0 (2026-05-09)
Feat
- generalize gvl.write() to accept Table tracks
- rename write() param bigwigs= -> tracks=, support mixed sequences
- Table._intervals_from_offsets via polars_bio.overlap
- Table.count_intervals via polars_bio.count_overlaps
- add Table.from_path for csv/tsv/parquet/arrow files
- add Table skeleton with long-form DataFrame init
- add IntervalTrack Protocol for unified track sources
- export get_splice_bed from package root
- add get_splice_bed for GTF→splicing-BED conversion
Fix
- normalize contig names in Table; correct unavail set warning
- cast chrom and strand to Utf8 in get_splice_bed
Refactor
- rename _write_bigwigs -> _write_track
- tighten IntervalTrack Protocol annotation and docstring