Releases: mcvickerlab/GenVarLoader

v0.27.0

05 Jun 09:27

Highlights: robust on-disk artifacts — gvl.write now creates datasets atomically under an advisory file lock, and Dataset.open validates dataset format version + structural integrity before use. FASTA caches move to a self-describing .gvlfa format that builds atomically, auto-migrates legacy .gvl caches, and auto-rebuilds when stale. Plus correct drop_last handling across all dataloader modes.

✨ Features

Atomic, concurrency-safe dataset creation
gvl.write builds into a private sibling temp directory and publishes it with an atomic os.replace, so a destination directory is never observed half-written. A best-effort filelock (a new dependency) lets parallel jobs that share one destination avoid redundant rebuilds. The lock is advisory only: correctness rests on the atomic rename.
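
The build-then-rename pattern described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not gvl's actual code: atomic_dir_sketch and build_demo are hypothetical names, the advisory lock is omitted, and the destination is assumed not to exist yet (os.replace cannot overwrite a non-empty directory).

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_dir_sketch(dest: Path):
    """Yield a private temp directory and publish it to `dest` only on success.

    Hypothetical helper for illustration; not part of the gvl API.
    """
    dest = Path(dest)
    # Build next to the destination so the final rename stays on one filesystem.
    tmp = Path(tempfile.mkdtemp(prefix=f".{dest.name}.", dir=dest.parent))
    try:
        yield tmp
        # Single rename: readers see either no directory or the complete one.
        os.replace(tmp, dest)
    except BaseException:
        # Any failure (in the build or the rename) discards the partial build.
        shutil.rmtree(tmp, ignore_errors=True)
        raise


def build_demo(root: Path) -> Path:
    """Write a toy 'dataset' through the atomic publish helper."""
    dest = Path(root) / "dataset.gvl"
    with atomic_dir_sketch(dest) as tmp:
        (tmp / "metadata.json").write_text('{"format_version": 1}')
    return dest
```

If the body of the with block raises, the temp directory is deleted and the destination is never created, which is the property the release note relies on.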

Dataset format versioning + integrity validation
Metadata now records a format_version, and Dataset.open validates both the version and structural integrity (file presence and sizes) before returning. An incompatible or corrupt dataset raises a clear ValueError instructing you to regenerate with gvl.write. Datasets do not auto-rebuild.
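
A minimal sketch of this kind of open-time check follows. The metadata layout (a metadata.json with a "files" mapping of name to byte size), the constant SUPPORTED_FORMAT_VERSION, and the function validate_dataset_sketch are all assumptions made for illustration; the real on-disk schema is not described in these notes.

```python
import json
from pathlib import Path

SUPPORTED_FORMAT_VERSION = 1  # hypothetical value for this sketch


def validate_dataset_sketch(path: Path) -> dict:
    """Return the metadata if the version and file sizes check out, else raise ValueError."""
    path = Path(path)
    meta_path = path / "metadata.json"
    if not meta_path.is_file():
        raise ValueError(f"{path}: metadata.json is missing; regenerate the dataset.")
    meta = json.loads(meta_path.read_text())

    # Version check: refuse anything this reader does not understand.
    version = meta.get("format_version")
    if version != SUPPORTED_FORMAT_VERSION:
        raise ValueError(
            f"{path}: format_version {version!r} is not supported "
            f"(expected {SUPPORTED_FORMAT_VERSION}); regenerate the dataset."
        )

    # Structural check: every listed file must exist with the recorded size.
    for name, expected_size in meta.get("files", {}).items():
        f = path / name
        if not f.is_file() or f.stat().st_size != expected_size:
            raise ValueError(f"{path}: {name} is missing or truncated; regenerate the dataset.")
    return meta
```

Checking sizes rather than hashing contents keeps the check cheap at open time, at the cost of missing same-size corruption.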

New .gvlfa FASTA cache
gvl.Reference.from_path now builds and reuses a self-describing, fingerprint-validated .gvlfa cache directory (and accepts a .gvlfa path directly). The cache is published atomically under a best-effort lock — concurrent builders sharing one reference are safe — and, unlike datasets, auto-rebuilds from its source when stale or missing. Legacy .fa.gvl caches are migrated in place by reusing their bytes.
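
The "rebuild when stale or missing" behavior can be illustrated with a fingerprint stored next to the cached bytes. This sketch makes several assumptions: the names fingerprint and ensure_cache_sketch, the manifest.json and sequence.bin file names, and the choice of a SHA-256 content hash are all hypothetical, and the atomic publish and lock are left out for brevity.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(source: Path) -> str:
    """Hash the source bytes. A real implementation might use size and mtime instead."""
    return hashlib.sha256(Path(source).read_bytes()).hexdigest()


def ensure_cache_sketch(source: Path, cache_dir: Path) -> bool:
    """Make sure `cache_dir` matches `source`. Returns True if a rebuild happened."""
    source, cache_dir = Path(source), Path(cache_dir)
    manifest = cache_dir / "manifest.json"
    current = fingerprint(source)

    # Reuse the cache only when its recorded fingerprint matches the source.
    if manifest.is_file():
        recorded = json.loads(manifest.read_text()).get("fingerprint")
        if recorded == current:
            return False

    # Missing or stale: rebuild from the source and record the new fingerprint.
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "sequence.bin").write_bytes(source.read_bytes())
    manifest.write_text(json.dumps({"fingerprint": current}))
    return True
```

Because the manifest lives inside the cache directory, the cache describes itself and can be validated without consulting any external state.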

🐛 Fixes

  • to_dataloader(drop_last=...) is now honored across all modes: in default mode, drop_last is no longer forwarded to the underlying DataLoader (where it was double-applied); buffered/double_buffered modes honor drop_last=False; and ChunkPlanner keeps the trailing partial batch.
  • BatchSampler conflict warning: to_dataloader now warns when a BatchSampler overrides an explicit batch_size.
  • .gvlfa migration guards against stale/truncated legacy bytes, and a format-too-new sibling cache now raises instead of silently downgrading.
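
For reference, the drop_last semantics that the first fix restores can be written down in a few lines. This is a generic sketch of the batching rule, not gvl's ChunkPlanner; batch_slices and num_batches are hypothetical names.

```python
import math


def num_batches(n_items: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches produced for n_items under the given drop_last setting."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)


def batch_slices(n_items: int, batch_size: int, drop_last: bool) -> list:
    """Half-open (start, end) index ranges, one per batch."""
    slices = [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]
    # Drop the trailing batch only when it is short and drop_last was requested.
    if drop_last and slices and slices[-1][1] - slices[-1][0] < batch_size:
        slices.pop()
    return slices
```

With 10 items and a batch size of 4, drop_last=False yields three batches (the last holding 2 items) and drop_last=True yields two. Applying the drop twice, as the old default mode effectively did, is harmless in this toy but can discard data when two layers each see a different notion of "last batch".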

🔧 Internals

  • New atomic_dir directory-publish primitive (temp build + os.replace) underpins both dataset and FASTA-cache creation.
  • _fasta_cache module: FastaCache models + fingerprinting, three-way source resolution, build/load/validity guards, legacy migration, and an ensure_cache orchestrator.
  • Concurrency regression tests for atomic cache + dataset creation (closes #21); coverage for too-old format versions and the genotypes-without-ploidy branch.

Out of scope: genoray .gvi and pysam .fai/.gzi index files are created by those upstream libraries and are not covered by gvl's atomic/locked creation.


Feat

  • _open: validate dataset format version + integrity on open
  • _write: atomic dataset creation + format_version in Metadata
  • _fasta_cache: publish cache atomically via atomic_dir + locked double-check
  • _atomic: add atomic_dir directory-publish primitive
  • fasta: use .gvlfa cache module and accept .gvlfa input
  • _fasta_cache: add ensure_cache orchestrator and dispatch
  • _fasta_cache: migrate legacy .gvl caches by reusing bytes
  • _fasta_cache: add build, load, and validity guards
  • _fasta_cache: add source hints and three-way resolution
  • _fasta_cache: add FastaCache models and fingerprint

Fix

  • torch: warn when BatchSampler overrides explicit batch_size
  • torch: buffered modes honor drop_last=False
  • torch: do not forward drop_last to DataLoader in default mode
  • chunked: keep trailing partial batch in ChunkPlanner
  • test_fasta: move mid-file imports to top (E402, CI lint)
  • _fasta_cache: guard legacy migration against stale/truncated bytes
  • _fasta_cache: raise on format-too-new sibling cache instead of silent downgrade

Refactor

  • _write: use plain with-atomic_dir; restore warnings filter; add atomicity + format_version on-disk tests
  • reference: build cache via ensure_cache, accept .gvlfa
  • _fasta_cache: fix progress bar advance and tighten load status type

v0.26.0

01 Jun 21:29

Highlights: prefetching dataloaders for higher GPU utilization, zero-copy spliced-haplotype output driven by a GTF/BED, selective loading of variant INFO/dosage fields, and a ground-up flat-buffer rewrite of sequence reconstruction — ~1.5–3× faster indexing with markedly lower peak memory and byte-identical output.

✨ Features

Prefetching dataloaders
Dataset.to_dataloader(...) gains a mode argument with two prefetching strategies that overlap data production with training:

  • mode="buffered" — a background producer fills an in-process buffer.
  • mode="double_buffered" — a producer subprocess serializes batches into shared memory, decoupling reconstruction from the training loop entirely.

New related knobs: buffer_bytes (default 2 GiB), copy, and heartbeat_seconds. Both modes require num_workers=0.
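
The idea behind mode="buffered" can be sketched as a background thread feeding a bounded queue. This is a simplified stand-in, not the library's implementation: buffered_sketch is a hypothetical name, the buffer here is bounded by batch count rather than by bytes (unlike buffer_bytes), and there is no heartbeat.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def buffered_sketch(make_batches: Callable[[], Iterable], max_buffered: int = 4) -> Iterator:
    """Yield batches produced by a background thread, up to max_buffered ahead."""
    buf: queue.Queue = queue.Queue(maxsize=max_buffered)

    def produce() -> None:
        try:
            for batch in make_batches():
                buf.put(batch)  # blocks while the buffer is full
            buf.put(_DONE)
        except BaseException as exc:
            buf.put(exc)  # hand producer errors to the consumer

    threading.Thread(target=produce, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        if isinstance(item, BaseException):
            raise item
        yield item
```

While the consumer works on one batch, the producer is already preparing the next ones, which is the overlap the prefetching modes are meant to provide. A subprocess with shared memory, as in double_buffered mode, takes the same idea further by moving production out of the training process.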

Zero-copy spliced haplotypes
Request spliced output by passing splice_info to Dataset.open(...) or with_settings(...). Exon segments are assembled into a single contiguous haplotype via a sample-agnostic SpliceMap, and the whole splice path is zero-copy — reference, haplotype, and track output are laid out directly in flat buffers (SplicePlan) with no intermediate gather/concat. The new public helper gvl.get_splice_bed builds the splice BED from your annotation.
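
To make the "laid out directly in flat buffers" idea concrete, here is a pure-Python sketch that plans where each exon lands and then writes every exon straight into its final position in one preallocated buffer. The names splice_plan_sketch and splice_into_flat are hypothetical, strand handling and variants are ignored, and the real SplicePlan and SpliceMap structures are not shown in these notes.

```python
def splice_plan_sketch(transcripts: list) -> tuple:
    """Plan the flat layout for spliced output.

    transcripts: one list of half-open (start, end) exon intervals per transcript,
    in transcript order. Returns (offsets, segments), where row i of the output
    spans offsets[i]:offsets[i + 1] and each segment is (src_start, src_end, dst_start).
    """
    offsets = [0]
    segments = []
    for exons in transcripts:
        pos = offsets[-1]
        for start, end in exons:
            segments.append((start, end, pos))
            pos += end - start
        offsets.append(pos)
    return offsets, segments


def splice_into_flat(reference: bytes, transcripts: list) -> tuple:
    """Assemble all spliced sequences into one flat buffer plus row offsets."""
    offsets, segments = splice_plan_sketch(transcripts)
    out = bytearray(offsets[-1])  # single allocation for every transcript
    ref = memoryview(reference)   # slicing a memoryview does not copy
    for start, end, dst in segments:
        out[dst:dst + (end - start)] = ref[start:end]
    return out, offsets
```

Because the plan depends only on the exon coordinates, it can be computed once and reused for every sample, which matches the "sample-agnostic" wording above.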

Selective variant fields
Dataset.open(...) and with_settings(...) accept var_fields=[...] to lazily load only the INFO/dosage fields you need; inspect what's available via available_var_fields. Dosage is now loaded only when explicitly requested (#191).

⚡ Performance

  • Flat-buffer reconstruction: sequence, haplotype, track, and variant reconstruction now return numpy-backed flat ragged buffers instead of routing through awkward kernels. Combined with seqpro to_packed/masked-reverse-complement kernels (seqpro 0.12.1 → 0.14), the indexing hot path is roughly 1.5–3× faster with byte-identical output.
  • Lower peak memory: the flat-buffer path eliminates the per-batch awkward empty/concat/_carry churn that dominated allocation in profiling — cumulative allocation drops ~80%, and per-operation peak memory falls sharply (e.g. reverse-complement goes from ~11× batch-size churn to ~1× on copy, 0 in-place). This directly relieves the peak-RSS pressure that pushed long-seqlen track dataloading toward OOM.
  • to_padded and reverse-complement now delegate to seqpro flat-buffer kernels.
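
A flat ragged buffer is one contiguous data array plus an offsets array marking row boundaries. The sketch below shows a masked, in-place reverse complement on such a buffer using only the standard library. It is a stand-in for illustration: reverse_complement_flat is a hypothetical name, the seqpro kernels are not reproduced here, and the per-row temporaries this version creates are exactly what a compiled kernel would avoid.

```python
# Translation table mapping each DNA base to its complement.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement_flat(data: bytearray, offsets: list, mask: list) -> None:
    """Reverse-complement row i of the ragged buffer in place where mask[i] is true.

    Row i occupies data[offsets[i]:offsets[i + 1]].
    """
    for i, flag in enumerate(mask):
        if flag:
            start, end = offsets[i], offsets[i + 1]
            # Complement, reverse, and write back into the same span.
            data[start:end] = data[start:end].translate(_COMPLEMENT)[::-1]
```

For example, with data b"AACGTTT" and offsets [0, 3, 7], the rows are "AAC" and "GTTT"; masking only the first row rewrites the buffer to b"GTTGTTT" without changing the offsets, since reverse complement preserves row lengths.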

🐛 Fixes

  • Reproducible shuffling: the generator passed to to_dataloader is now forwarded to RandomSampler, and with_settings(rng=...) maps correctly to the underlying RNG field.
  • gvl.Table is temporarily disabled until the upstream non-deterministic polars-bio segfault is resolved (#198 / upstream #395).
  • Sites-only output uses the correct genoray get_record_info kwargs.
  • to_padded numeric branch handles clip=True output correctly.
  • Fixed empty-region handling in the no-extend VCF write path; read-only inputs are now copied in idx_like_to_array; and 2-D offset indexing is corrected in filter_af / choose_exonic_variants.

🔧 Internals

  • New testing infrastructure built on vcfixture: synthetic-reference test fixtures plus property-based tests that validate gvl output against independent oracles (e.g. bcftools consensus), replacing the old hg38-coupled toy fixtures.
  • Migrated the dataset, reconstruction, reference, indexing, splice, ragged, variants, and types modules from attrs to stdlib dataclasses.
  • Split reconstruction into _query.py (QueryView); introduced OpenRequest / ReconstructionRequest and a _build_reconstructor factory.
  • New benchmarking harness: thread-pinned orchestration, per-cell measurement protocol, bandwidth/small-multiples plots, and CPU/microarch logging.

Feat

  • query: flat-aware getitem boundary (legacy Ragged still supported)
  • flat: masked reverse/RC on flat buffers
  • flat: _Flat numpy ragged transport (no awkward kernels)
  • write: add _window_to_sparse dense->sparse dispatch helper
  • bench: log CPU/system/microarch info for all benchmarks
  • bench: add MiB/s bandwidth plot; trim 1KG regression test data
  • bench: add 3x4 small-multiples results plot
  • bench: add bench.py thread-pinned orchestration
  • bench: add CSV header/append helpers
  • bench: add per-cell measurement protocol
  • bench: add exact output-bytes table helper
  • bench: add BED resize + per-region-length dataset prep
  • bench: enumerate deduped dataloader bench cells
  • bench: scaffold dataloader bench axis constants
  • mode='double_buffered' dataloader happy path
  • producer: subprocess entrypoint for double_buffered mode
  • shm: Ragged and RaggedVariants serialization
  • shm: hand-rolled slot header + dense round-trip
  • mode='buffered' dataloader
  • chunked: ChunkPlanner and slice_chunk
  • dataset: _output_bytes_per_instance tracks branch
  • dataset: _output_bytes_per_instance variants branch with var_fields
  • dataset: _output_bytes_per_instance annotated branch
  • dataset: _output_bytes_per_instance reference + haplotypes
  • haps: add _allele_bytes_sum for exact variant footprint
  • dataset: with_settings lazily loads new var_fields
  • open: Dataset.open accepts var_fields, forwards to Haps.from_path
  • haps: from_path honors var_fields for lazy info+dosage loading
  • haps: _Variants.load_info lazily extends info dict
  • haps: _Variants.from_table accepts info_fields filter
  • haps: add _Variants.available_info_fields schema peek
  • reconstruct: promote view-state to explicit _seqs_kind field
  • reconstruct: add _build_reconstructor factory
  • splice: Tracks._call_float32 accepts SplicePlan
  • splice: Haps._get_haplotypes accepts SplicePlan
  • splice: Ref.call accepts optional SplicePlan for zero-copy spliced layout
  • splice: add SplicePlan + build_splice_plan helper
  • refds: support with_settings(splice_info) and spliced subset_to
  • refds: implement spliced getitem via SpliceMap + _cat_length
  • refds: add splice_info field and is_spliced/spliced_regions
  • splice: add sample-agnostic SpliceMap

Fix

  • write: handle empty regions in no-extend VCF path; assert chunk/index alignment
  • table: temporarily disable gvl.Table to avoid polars-bio segfault
  • double_buffered: release producer+shm per loader, not at process exit
  • double_buffered: serialize RaggedAnnotatedHaps (annotated output)
  • double_buffered: size shm slots for serialized ragged footprint
  • bench: open datasets with hg38 reference; revert out-of-scope _open.py change
  • double_buffered: replay all Dataset settings in producer subprocess
  • sitesonly: use correct genoray VCF.get_record_info kwargs (fields= not attrs=)
  • ragged: to_padded numeric branch handles clip=True RegularArray output
  • torch: forward generator from DataLoader through to RandomSampler for reproducible shuffle
  • dataset: with_settings(rng=...) maps to _rng dataclass field
  • inline offset computation in filter_af 2-D path; reorder test_torch imports
  • rag_variants: accept positional axis arg in RaggedVariants.squeeze
  • haps: gate dosage output by var_fields (#191)
  • utils: copy read-only inputs in idx_like_to_array
  • types: annotate Fasta.pad as bytes | None
  • Dataset.open: promote to haplotypes before with_settings(splice_info)
  • choose_exonic_variants: use (2, n_slices) indexing for 2-D offsets
  • with_settings: propagate var_filter to _recon, preserving kind

Refactor

  • variants: remove orphaned _rc_helper/_rc_numba_helper
  • haps: extract _build_allele_layout/_alt_layout_parts helpers
  • query: overloads for reverse_complement_ragged/pad match flat runtime contract
  • flat: document + guard reverse_masked comp as DNA-RC mode selector
  • write: assemble full PGEN windows, dispatch via _window_to_sparse, fix max_ends
  • write: assemble full VCF windows, dispatch via _window_to_sparse, fix max_ends
  • trim dead code and over-commenting per CLAUDE.md
  • types: remove stale type:ignore and annotate the rest
  • naming: rename SplicePlan.perm -> SplicePlan.permutation
  • naming: standardize geno_offset_idxs -> geno_offset_idx
  • naming: rename rsp_idx -> region_sample_ploid_idx
  • dataset: split _reconstruct.py + extract _query.py via QueryView (PR6)
  • reconstruct: delete dead body of write_transformed_track + add roadmap
  • reconstruct: ReconstructionRequest + restructure _get_haplotypes
  • open: extract OpenRequest + decompose Dataset.open
  • write: extract shared phased-chunked writer for VCF/PGEN
  • impl: migrate _impl.py from attrs to stdlib dataclass
  • reconstruct: migrate _reconstruct.py from attrs to stdlib dataclass
  • reference: migrate _reference.py from attrs to stdlib dataclass
  • indexing: migrate _indexing.py from attrs to stdlib dataclass
  • splice: migrate _splice.py from attrs to stdlib dataclass
  • insertion-fill: migrate _insertion_fill.py from attrs to stdlib dataclass
  • ragged: migrate _ragged.py from attrs to stdlib dataclass
  • variants: migrate _records.py from attrs to stdlib dataclass
  • types: migrate _types.py from attrs to stdlib dataclass
  • impl: route remaining _recon construction/checks through view-state
  • impl: sequence_type returns the _seqs_kind field directly
  • impl: collapse with_settings _recon propagation via factory
  • impl: simplify with_tracks via factory + active_tracks check
  • impl: simplify with_seqs via _seqs_kind + factory
  • impl: route Dataset.open construction through factory
  • splice: Dataset spliced path uses SplicePlan
  • splice: RefDataset spliced path uses SplicePlan
  • splice: compose SpliceIndexer from SpliceMap + DatasetIndexer
  • splice: annotate SpliceMap method signatures
  • splice: clean up _splice imports per review
  • splice: move _cat_length helper...

v0.25.0

22 May 02:13

Feat

  • add migrate_svar_link for upgrading legacy datasets
  • dataset: wire svar_link resolver into Haps.from_path; Dataset.open(svar=)
  • dataset: add _resolve_svar and _verify_fingerprint
  • write: record SvarLink in metadata, drop link.svar symlink
  • dataset: add SvarLink / SvarFingerprint pydantic models
  • write: subtract genoray nbytes from max_mem; warn when index dominates

Fix

  • ndim guard on geno_offsets in choose_exonic_variants second loop
  • write: clarify max_mem docstring and skip index-accounting log for SparseVar
  • write: eager-load variant index for accurate max_mem accounting

Refactor

  • dataset: switch Haps.from_path version compare to SemanticVersion
  • dataset: use SemanticVersion in Metadata, add svar_link field

v0.24.1

13 May 19:50

Fix

  • bump genoray, VCF bug

v0.24.0

12 May 21:21

Changelog

v0.24.0 (2026-05-12)

Feat

  • Dataset.with_insertion_fill + public API exports
  • route per-track insertion fill into HapsTracks kernel call
  • per-track insertion-fill on Tracks reconstructor
  • kernel-level insertion-fill strategy dispatch
  • add InsertionFill strategy classes and lowering helper

Fix

  • bump genoray
  • insertion-fill: strengthen tests, Self return type, clearer error
  • insertion-fill: lazy fallback to Repeat5p for unpopulated insertion_fill
  • insertion-fill: derive base_seed from full idx array, use full uint64 range
  • insertion-fill: require params, document fallback, broaden flank-sample tests
  • insertion-fill: non-instantiable base, tighter MAX_PARAMS, add test coverage

[main ad52efa] bump: version 0.23.1 → 0.24.0
2 files changed, 23 insertions(+), 1 deletion(-)

v0.23.1

11 May 18:59

Changelog

v0.23.1 (2026-05-11)

Fix

  • perf: benchmarks
  • perf: bench gvl.Table query algs
  • perf: use single polars-bio overlap (no xprod) in gvl.Table
  • types

Refactor

  • perf: vectorize scatter, use replace_strict and lexsort in gvl.Table

[main 1d31682] bump: version 0.23.0 → 0.23.1
2 files changed, 17 insertions(+), 1 deletion(-)

v0.23.0

09 May 00:48

Changelog

v0.23.0 (2026-05-09)

Feat

  • generalize gvl.write() to accept Table tracks
  • rename write() param bigwigs= -> tracks=, support mixed sequences
  • Table._intervals_from_offsets via polars_bio.overlap
  • Table.count_intervals via polars_bio.count_overlaps
  • add Table.from_path for csv/tsv/parquet/arrow files
  • add Table skeleton with long-form DataFrame init
  • add IntervalTrack Protocol for unified track sources
  • export get_splice_bed from package root
  • add get_splice_bed for GTF→splicing-BED conversion

Fix

  • normalize contig names in Table; correct unavail set warning
  • cast chrom and strand to Utf8 in get_splice_bed

Refactor

  • rename _write_bigwigs -> _write_track
  • tighten IntervalTrack Protocol annotation and docstring

[main 577234e] bump: version 0.22.3 → 0.23.0
2 files changed, 28 insertions(+), 1 deletion(-)

v0.22.3

08 May 06:15

Changelog

v0.22.3 (2026-05-08)

Fix

  • ci: exclude _impl.py from debug-statements hook (match syntax)
  • deps: bump seqpro to 0.11.0 and genoray to 2.3.0
  • _utils: handle Categorical strand in bed_to_regions

[main 2f7b60e] bump: version 0.22.2 → 0.22.3
2 files changed, 12 insertions(+), 1 deletion(-)

v0.22.2

28 Apr 23:12

Changelog

v0.22.2 (2026-04-28)

Fix

  • make tbb and pyomp optional dependencies

[main 0b45cb6] bump: version 0.22.1 → 0.22.2
2 files changed, 10 insertions(+), 1 deletion(-)

v0.22.1

22 Apr 05:35

Changelog

v0.22.1 (2026-04-22)

Fix

  • skip overlapping variants in get_diffs_sparse to match reconstruction logic

[main 49895a2] bump: version 0.22.0 → 0.22.1
2 files changed, 10 insertions(+), 1 deletion(-)