Robust on-disk artifacts: atomic .gvlfa cache + dataset creation/validation (closes #21) #206

Merged
d-laub merged 28 commits into main from worktree-robust-fasta-cache on Jun 4, 2026

@d-laub d-laub commented Jun 2, 2026

Summary

Makes GenVarLoader's generated on-disk artifacts — the .gvlfa FASTA cache and gvl.write dataset directories — self-describing, safe under concurrent creation, and resilient to format drift. Closes #21.

This PR landed in two stages: the robust .gvlfa FASTA cache, then the atomic-creation + dataset-validation + concurrency follow-ups it set up.

Stage 1 — Robust .gvlfa FASTA cache (self-describing, fingerprint-validated)

Replaces the brittle mtime-validated, sibling-only .fa.gvl flat FASTA cache with a self-describing .gvlfa/ directory cache that fingerprints its source, resolves it three ways, auto-rebuilds when stale/corrupt, and auto-migrates legacy caches — fully backwards compatible.

  • New module python/genvarloader/_fasta_cache.py owns the on-disk format, validation, build, and migration behind a single entry point ensure_cache(path) -> (FastaCache, data_path).
    • .gvlfa/ holds metadata.json (pydantic FastaCache: format version, gvl version, contig lengths, source hints, fingerprint) + sequence.bin (numpy memmap of all contigs).
    • Fingerprint: blake2b of the first 1 MiB + total file size.
    • Three-way source resolution: sibling → relative → absolute.
    • Validity states: fresh / stale / unvalidated; auto-rebuild on stale or size-corrupt; format-too-new raises consistently from both entry points (never silently downgrades).
    • Legacy migration: reuses .fa.gvl bytes via move, but only after verifying the legacy byte count matches the current source — a stale/truncated legacy cache is left untouched and rebuilt fresh.
  • Fasta (_fasta.py) and Reference.from_path (_dataset/_reference.py) rewired to ensure_cache; both now also accept a .gvlfa directory directly as their path.
  • Old _valid_cache / _write_to_cache / _get_sequences machinery removed.
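
The fingerprint described above (blake2b of the first 1 MiB plus total file size) can be sketched roughly as follows. This is a minimal illustration, not the code in _fasta_cache.py: the real fingerprint() returns a pydantic Fingerprint model, whereas the tuple return type here is an assumption made to keep the sketch dependency-free.

```python
import hashlib
import os
from pathlib import Path

FINGERPRINT_WINDOW = 1 << 20  # 1 MiB window, per the PR description


def fingerprint(path: Path) -> tuple[str, int]:
    """Return (blake2b hex digest of the first 1 MiB, total size in bytes).

    Assumption: a plain tuple stands in for the PR's Fingerprint model.
    """
    with open(path, "rb") as f:
        head = f.read(FINGERPRINT_WINDOW)  # reads fewer bytes for small files
    digest = hashlib.blake2b(head).hexdigest()
    return digest, os.stat(path).st_size
```

Hashing only a fixed window keeps validation cheap on multi-gigabyte references, and the total size catches truncation or appends past the window that the digest alone would miss.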

This intentionally mirrors the existing SvarLink robustness idiom (fingerprint + three-way resolution + legacy migration) already used for .svar back-references in dataset metadata.
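
For readers unfamiliar with the idiom, the sibling → relative → absolute lookup might look something like the sketch below. The SourceHints fields (name, relative, absolute) and the resolve_source helper are hypothetical; the PR only names the model and the resolution order, so the actual field names and semantics may differ.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SourceHints:
    """Hypothetical stand-in for the PR's pydantic SourceHints model."""

    name: str      # file name of the source FASTA
    relative: str  # source path relative to the cache's parent directory
    absolute: str  # absolute source path recorded at build time


def resolve_source(cache_dir: Path, hints: SourceHints) -> Optional[Path]:
    """Try sibling, then relative, then absolute; return the first existing file."""
    parent = cache_dir.parent
    candidates = [
        parent / hints.name,      # sibling: source sits next to the cache
        parent / hints.relative,  # relative: cache and source moved together
        Path(hints.absolute),     # absolute: only the cache moved
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller decides how to handle a missing source
```

Trying the cheapest, most relocation-tolerant hint first means a cache copied alongside its FASTA keeps validating even when the recorded absolute path no longer exists.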

Stage 2 — Atomic creation, dataset validation, and concurrency safety (closes #21)

  • New single-responsibility primitive python/genvarloader/_atomic.py: atomic_dir(dest, *, overwrite, lock, timeout) builds each artifact into a private sibling temp dir and publishes it with an atomic os.replace. A best-effort filelock avoids N redundant concurrent builds but is never load-bearing for correctness: the atomic rename is the guarantee, so a lock timeout or a network-FS no-op just means "build anyway". SkipPublish aborts publishing to reuse an already-valid dest; overwrite=True uses move-aside-then-rename. Both artifacts reuse this primitive.
  • FASTA cache (_fasta_cache.py): build/migrate_legacy now publish through atomic_dir; ensure_cache rebuild paths go through a locked, double-checked helper (_ensure_built), so concurrent builders neither all rebuild nor corrupt the cache. The cache auto-rebuilds because its source FASTA is available.
  • gvl.write (_dataset/_write.py): the whole dataset is built into a temp dir and published atomically (with atomic_dir(...)), so an interrupted or racing write never leaves a partial dataset. Metadata gains a format_version field (default None for back-compat) and the module records DATASET_FORMAT_VERSION = "1.0.0".
  • Validation on open (_dataset/_validate.py + _open.py): Dataset.open now runs validate_dataset, which applies a format-version major gate (incompatible / too-new / too-old → actionable ValueError; missing → treated as 1.0.0) plus structural/size integrity checks (required files present; regions.npy shape (n_regions, 4); genotype offsets.npy byte length matches n_regions·ploidy·n_samples + 1 int64 entries). Datasets never auto-rebuild (no retained source); they raise an error telling you to regenerate with gvl.write.
  • Out-of-scope (documented): genoray .gvi and pysam .fai/.gzi index files are created by those libraries and are not made atomic/locked by gvl.
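
To make the build-in-a-sibling-temp-dir-then-rename idea concrete, here is a stdlib-only sketch. It is not the PR's atomic_dir: the name atomic_dir_sketch is hypothetical, and the advisory filelock, the timeout, SkipPublish, and concurrent-loser handling are all omitted. It only illustrates the publish step and the move-aside-then-rename overwrite.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def atomic_dir_sketch(dest: Path, *, overwrite: bool = False) -> Iterator[Path]:
    """Yield a private sibling temp dir and publish it to `dest` on clean exit."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        raise FileExistsError(dest)
    # A sibling temp dir (not /tmp) keeps the rename on the same filesystem.
    tmp = Path(tempfile.mkdtemp(prefix=f".{dest.name}.tmp.", dir=dest.parent))
    try:
        yield tmp
        if dest.exists():
            # os.replace cannot overwrite a non-empty directory, so move the
            # old artifact aside first, then rename the new one into place.
            aside = Path(
                tempfile.mkdtemp(prefix=f".{dest.name}.old.", dir=dest.parent)
            )
            os.replace(dest, aside / dest.name)
            os.replace(tmp, dest)
            shutil.rmtree(aside, ignore_errors=True)
        else:
            os.replace(tmp, dest)
    finally:
        # After a successful publish tmp no longer exists, so this is a no-op;
        # after a failure it removes the partial build.
        shutil.rmtree(tmp, ignore_errors=True)
```

A caller would write everything under the yielded path, for example `with atomic_dir_sketch(out_dir, overwrite=True) as tmp: (tmp / "metadata.json").write_text("{}")`; readers then see either the old artifact or the complete new one. Note that in the overwrite path there is a brief window between the two renames in which dest does not exist.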

Adds a filelock>=3.12 dependency. Skill (skills/genvarloader/SKILL.md) updated to document .gvlfa support, atomic/locked creation, the dataset format gate, and the index-file limitation.
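
The format-version major gate from the validation bullet above could be approximated like this. The helper name check_format_version and the error wording are assumptions; the PR's validate_dataset works with a SemanticVersion type rather than the plain string parsing used here.

```python
from typing import Optional

SUPPORTED_FORMAT_VERSION = "1.0.0"  # mirrors DATASET_FORMAT_VERSION in the PR


def _major(version: str) -> int:
    """Extract the MAJOR component of a MAJOR.MINOR.PATCH string."""
    return int(version.split(".")[0])


def check_format_version(found: Optional[str]) -> None:
    """Raise ValueError if the dataset's MAJOR version differs from ours."""
    # A missing version is treated as 1.0.0, per the PR description.
    version = found if found is not None else "1.0.0"
    if _major(version) > _major(SUPPORTED_FORMAT_VERSION):
        raise ValueError(
            f"Dataset format {version} is newer than supported "
            f"{SUPPORTED_FORMAT_VERSION}; upgrade genvarloader."
        )
    if _major(version) < _major(SUPPORTED_FORMAT_VERSION):
        raise ValueError(
            f"Dataset format {version} is older than supported "
            f"{SUPPORTED_FORMAT_VERSION}; regenerate it with gvl.write."
        )
```

Gating on MAJOR only means minor and patch bumps stay readable by older and newer releases, which matches the major-on-break bump policy noted in the commit log.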

Test plan

  • tests/unit/test_fasta_cache.py — fingerprint, source resolution, build/load round-trip, byte-equality vs pysam, stale/unvalidated/format-too-new/corrupt classification, legacy migration (incl. stale-bytes guard), ensure_cache decision matrix, atomic-publish + no-partial-cache + double-check reuse.
  • tests/unit/test_fasta.py: .gvlfa creation, direct .gvlfa input, missing-source on-demand read error, legacy migration, in-memory-no-cache path, Reference.from_path .gvlfa round-trip.
  • tests/unit/test_atomic.py (8) — clean publish, sibling temp, exception cleanup, FileExistsError, overwrite replace, SkipPublish, lock-file persistence, concurrent-loser-discarded.
  • tests/unit/dataset/test_validate.py (9) — valid pass, missing/too-new/too-old format version, missing/wrong-shape regions.npy, genotype offsets wrong/correct length, genotypes-without-ploidy.
  • tests/unit/dataset/test_write_atomic.py (7) — format_version field/constant/round-trip, on-disk stamp, atomic no-temp-left, failure leaves no partial artifacts, overwrite-false raises.
  • tests/unit/test_concurrency.py (2, @slow): the #21 regression ("Lock GVL files to avoid multiple file creation/deletion in multi-job settings"). N processes building the same .gvlfa cache produce a byte-identical result with no orphans; N processes writing the same dataset path (overwrite=True) leave exactly one valid, openable dataset with no orphans.
  • Full fast suite green: 584 passed, 39 skipped, 4 xfailed, 0 failures; slow concurrency tier: 2 passed.
  • ruff check + ruff format clean on all touched files. (Pre-existing pyrefly errors in _bigwig.py/_flat.py/_ragged.py, plus import-resolution false-positives on _fasta_cache/new modules, are unrelated — the Rust ext and seqpro stubs aren't resolvable in the hook env.)

🤖 Generated with Claude Code

d-laub and others added 27 commits June 1, 2026 17:57
Skeleton module with Pydantic models (Fingerprint, SourceHints, FastaCache),
module-level constants (FORMAT_VERSION, FINGERPRINT_WINDOW, suffix/filename
literals), and the fingerprint() function (blake2b, 1 MiB window).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements build() (writes sequence.bin + metadata.json from a source FASTA),
load() (reads metadata and classifies cache as fresh/stale/unvalidated), and
helpers _data_size_ok/_fingerprints_match/_check_format_version. 13 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tus type

Update _write_sequence to use pbar.update(len(c_seq)) so the tqdm bar
advances by nucleotides per contig instead of once per contig. Narrow
load()'s return annotation from str to Literal["fresh","stale","unvalidated"].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds is_gvlfa, ensure_cache, _ensure_from_fasta, _ensure_from_gvlfa,
_cache_dir_for, _legacy_for; restores ensure_cache to __all__. Adds
import warnings + loguru logger. Fixed potential meta unbound-name
via explicit None init. 21/21 tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewires Fasta.__init__ to delegate all cache management to the new
_fasta_cache module, adds .gvlfa directory as a valid path input,
migrates legacy .gvl flat caches automatically, and warns+defers when
source FASTA is missing. Removes _valid_cache, _write_to_cache,
_get_sequences, _get_contig_lengths methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewires Reference.from_path to call ensure_cache() instead of the
removed Fasta._valid_cache/_write_to_cache methods. Accepts both a
.fa source and a pre-built .gvlfa directory as the fasta argument.
Removes now-unused loguru.logger import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move-aside-then-rename overwrite, internal 60s lock timeout, format_version 1.0.0 with major-on-break bump policy. Confirm sibling temp (not /tmp) for atomic os.replace.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…21)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…double-check

build() and migrate_legacy() now write into a private sibling temp dir and
atomically publish via atomic_dir(overwrite=True), so a mid-build crash or
concurrent builder never leaves a partial .gvlfa dir. _ensure_built() wraps
atomic_dir with an in-lock double-check so concurrent callers reuse a freshly
published cache instead of rebuilding redundantly. All ensure_cache rebuild
paths go through _ensure_built instead of build directly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
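
The in-lock double-check described in the commit message above follows a familiar pattern, sketched below. Everything here is illustrative: ensure_built, is_valid, and build are hypothetical names, and a threading.Lock stands in for the advisory file lock (the PR uses filelock, with a timeout after which it builds anyway).

```python
import threading
from pathlib import Path
from typing import Callable

_build_lock = threading.Lock()  # stand-in for the PR's advisory file lock


def ensure_built(
    cache_dir: Path,
    is_valid: Callable[[Path], bool],
    build: Callable[[Path], None],
) -> bool:
    """Return True if this call built the cache, False if it reused one."""
    if is_valid(cache_dir):  # first check, without the lock
        return False
    with _build_lock:
        # Second check: another caller may have published a valid cache
        # while this one was waiting for the lock.
        if is_valid(cache_dir):
            return False
        build(cache_dir)
        return True
```

With N concurrent callers, one builds and the rest fall through the second check and reuse the published result. In the PR the lock only avoids redundant work; correctness still comes from the atomic rename.
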
- Add DATASET_FORMAT_VERSION = "1.0.0" module constant to _write.py
- Add format_version: SemanticVersion | None = None field to Metadata
- Stamp format_version into metadata dict at write time
- Route gvl.write through atomic_dir: build into a temp sibling dir,
  publish via os.replace on success, clean up on failure
- Remove now-unused shutil import (atomic_dir owns cleanup)
- Add tests/unit/dataset/test_write_atomic.py (5 tests)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… add atomicity + format_version on-disk tests

- Fix 1: replace hand-rolled __enter__/__exit__ + try/except/else with a
  plain `with atomic_dir(dest, overwrite=overwrite) as path:` block;
  removed `import sys` (no longer used)
- Fix 2: wrap write body in try/finally so warnings.simplefilter("default")
  runs on both success and failure paths; logger.info("Finished writing.")
  stays inside try (success only), just before the finally
- Fix 3a: test_format_version_stamped_on_disk — real gvl.write to tmp_path
  with synthetic_case VCF, asserts metadata.json["format_version"] == "1.0.0"
- Fix 3b: test_failure_leaves_no_partial_artifacts — samples=["NOT_A_REAL_SAMPLE"]
  triggers ValueError("not found in variants or tracks") after atomic_dir
  creates the temp dir; asserts dest and .tmp.* siblings are absent

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds `_validate.py` with `validate_dataset` which enforces a format-version
gate (incompatible MAJOR → actionable ValueError; missing → treated as 1.0.0)
and structural/size integrity checks (required files, regions.npy shape,
genotypes/offsets.npy byte-length for VCF/PGEN datasets).
Called from `OpenRequest._load_metadata` so every `Dataset.open` is guarded.

Key implementation detail: genotypes/offsets.npy is a raw int64 memmap (no
numpy header), so the check uses st_size bytes rather than np.load. SVAR
datasets (which have svar_meta.json) are excluded from the offsets check
because their offsets.npy has a different 4-D shape.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
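
Since offsets.npy is a raw int64 memmap with no numpy header, the size check in the commit message above reduces to comparing st_size against the expected entry count. A minimal sketch, assuming 8 bytes per int64 entry; the helper name offsets_size_ok is hypothetical.

```python
import os
from pathlib import Path

INT64_BYTES = 8  # size of one raw int64 entry


def offsets_size_ok(
    offsets_path: Path, n_regions: int, ploidy: int, n_samples: int
) -> bool:
    """Check a headerless int64 offsets file has the expected byte length."""
    # One offset per (region, ploid, sample) plus a final end offset.
    expected_entries = n_regions * ploidy * n_samples + 1
    return os.stat(offsets_path).st_size == expected_entries * INT64_BYTES
```

Using the file size avoids reading the array at all, so the check stays cheap even for very large datasets.
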
…loidy branches

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
 #21)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…file limitation

- SKILL.md: gvl.write section notes atomic build + advisory lock + no auto-rebuild;
  Dataset.open section notes format_version/integrity validation + no auto-rebuild;
  Reference.from_path one-liner updated with atomic/lock/auto-rebuild note.
- _write.py: Notes section in write() docstring covers atomic dir, advisory lock,
  and out-of-scope genoray/pysam index files.
- _fasta_cache.py: module-level docstring records atomic build + lock + auto-rebuild
  behaviour.

Fast suite: 584 passed, 39 skipped, 4 xfailed, 0 failed.
Slow concurrency tier: 2 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-laub d-laub marked this pull request as ready for review June 4, 2026 08:09
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-laub d-laub changed the title from "Robust .gvlfa FASTA cache (self-describing, fingerprint-validated)" to "Robust on-disk artifacts: atomic .gvlfa cache + dataset creation/validation (closes #21)" Jun 4, 2026
@d-laub d-laub merged commit d733264 into main Jun 4, 2026
7 checks passed
@d-laub d-laub deleted the worktree-robust-fasta-cache branch June 4, 2026 23:46

Development

Successfully merging this pull request may close these issues.

Lock GVL files to avoid multiple file creation/deletion in multi-job settings

1 participant