Robust on-disk artifacts: atomic .gvlfa cache + dataset creation/validation (closes #21) #206
Merged
Skeleton module with Pydantic models (Fingerprint, SourceHints, FastaCache), module-level constants (FORMAT_VERSION, FINGERPRINT_WINDOW, suffix/filename literals), and the fingerprint() function (blake2b, 1 MiB window). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
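As a rough illustration of the windowed fingerprint this commit describes, here is a minimal sketch. The commit confirms only blake2b and a 1 MiB window; the function name `fingerprint_sketch`, the choice to hash the first window of the file, the digest size, and the returned dict layout are all assumptions made for the example, not the module's actual `Fingerprint` model.

```python
import hashlib
from pathlib import Path

FINGERPRINT_WINDOW = 1 << 20  # 1 MiB, matching the window size named in the commit


def fingerprint_sketch(path: Path, window: int = FINGERPRINT_WINDOW) -> dict:
    """Hypothetical fingerprint: file size plus blake2b of the first `window` bytes."""
    size = path.stat().st_size
    digest = hashlib.blake2b(digest_size=16)  # digest size is an assumption
    with path.open("rb") as f:
        digest.update(f.read(window))  # reads at most `window` bytes
    return {"size": size, "head_blake2b": digest.hexdigest()}
```

Hashing a bounded window keeps the check cheap on multi-gigabyte FASTA files, at the cost of not detecting edits that fall outside the window and leave the size unchanged.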
Implements build() (writes sequence.bin + metadata.json from a source FASTA), load() (reads metadata and classifies cache as fresh/stale/unvalidated), and helpers _data_size_ok/_fingerprints_match/_check_format_version. 13 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tus type Update _write_sequence to use pbar.update(len(c_seq)) so the tqdm bar advances by nucleotides per contig instead of once per contig. Narrow load()'s return annotation from str to Literal["fresh","stale","unvalidated"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds is_gvlfa, ensure_cache, _ensure_from_fasta, _ensure_from_gvlfa, _cache_dir_for, _legacy_for; restores ensure_cache to __all__. Adds import warnings + loguru logger. Fixed potential meta unbound-name via explicit None init. 21/21 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewires Fasta.__init__ to delegate all cache management to the new _fasta_cache module, adds .gvlfa directory as a valid path input, migrates legacy .gvl flat caches automatically, and warns+defers when source FASTA is missing. Removes _valid_cache, _write_to_cache, _get_sequences, _get_contig_lengths methods. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewires Reference.from_path to call ensure_cache() instead of the removed Fasta._valid_cache/_write_to_cache methods. Accepts both a .fa source and a pre-built .gvlfa directory as the fasta argument. Removes now-unused loguru.logger import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
for more information, see https://pre-commit.ci
Move-aside-then-rename overwrite, internal 60s lock timeout, format_version 1.0.0 with major-on-break bump policy. Confirm sibling temp (not /tmp) for atomic os.replace. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
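The sibling-temp and move-aside-then-rename ideas in this commit can be sketched as below. This is a simplified stand-in, not the PR's `atomic_dir`: the name `atomic_dir_sketch`, the `.tmp.`/`.old.` prefixes, and the error handling are assumptions, and the advisory lock, `SkipPublish`, and concurrent-loser handling are left out.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def atomic_dir_sketch(dest: Path, overwrite: bool = False):
    """Build into a sibling temp dir, then publish it at `dest` with os.replace."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        raise FileExistsError(dest)
    # Sibling temp dir (same parent, so same filesystem): the rename stays atomic.
    tmp = Path(tempfile.mkdtemp(prefix=f".tmp.{dest.name}.", dir=dest.parent))
    try:
        yield tmp  # caller writes all files under `tmp`
        if dest.exists():
            # Move-aside-then-rename: a non-empty directory cannot be replaced
            # in place, so park the old one first, then publish the new one.
            aside = Path(tempfile.mkdtemp(prefix=f".old.{dest.name}.", dir=dest.parent))
            os.replace(dest, aside / dest.name)
            os.replace(tmp, dest)
            shutil.rmtree(aside, ignore_errors=True)
        else:
            os.replace(tmp, dest)
    except BaseException:
        # Any failure (in the caller's body or while publishing) discards the temp dir.
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A temp dir under `/tmp` would often sit on a different filesystem than `dest`, where a rename degrades to a copy and is no longer atomic; that is the reason for the sibling location.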
…21) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…double-check build() and migrate_legacy() now write into a private sibling temp dir and atomically publish via atomic_dir(overwrite=True), so a mid-build crash or concurrent builder never leaves a partial .gvlfa dir. _ensure_built() wraps atomic_dir with an in-lock double-check so concurrent callers reuse a freshly published cache instead of rebuilding redundantly. All ensure_cache rebuild paths go through _ensure_built instead of build directly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
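The in-lock double-check mentioned here follows a common pattern, sketched below. To keep the example free of third-party dependencies it swaps the cross-process file lock for an in-process `threading.Lock`; the name `ensure_built_sketch` and the callback-based signature are hypothetical and do not reflect the real `_ensure_built`.

```python
import threading
from typing import Callable

# Stand-in for the cross-process advisory lock used in the PR (assumption).
_build_lock = threading.Lock()


def ensure_built_sketch(is_valid: Callable[[], bool], build: Callable[[], None]) -> bool:
    """Return True if this caller built the artifact, False if it reused one."""
    if is_valid():  # fast path: already published, no lock needed
        return False
    with _build_lock:
        if is_valid():  # double-check: another caller published while we waited
            return False
        build()
        return True
```

With this shape, callers that lose the race wait on the lock, then see a valid artifact and skip the rebuild, so only one build runs.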
- Add DATASET_FORMAT_VERSION = "1.0.0" module constant to _write.py
- Add format_version: SemanticVersion | None = None field to Metadata
- Stamp format_version into metadata dict at write time
- Route gvl.write through atomic_dir: build into a temp sibling dir, publish via os.replace on success, clean up on failure
- Remove now-unused shutil import (atomic_dir owns cleanup)
- Add tests/unit/dataset/test_write_atomic.py (5 tests)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… add atomicity + format_version on-disk tests
- Fix 1: replace hand-rolled __enter__/__exit__ + try/except/else with a
plain `with atomic_dir(dest, overwrite=overwrite) as path:` block;
removed `import sys` (no longer used)
- Fix 2: wrap write body in try/finally so warnings.simplefilter("default")
runs on both success and failure paths; logger.info("Finished writing.")
stays inside try (success only), just before the finally
- Fix 3a: test_format_version_stamped_on_disk — real gvl.write to tmp_path
with synthetic_case VCF, asserts metadata.json["format_version"] == "1.0.0"
- Fix 3b: test_failure_leaves_no_partial_artifacts — samples=["NOT_A_REAL_SAMPLE"]
triggers ValueError("not found in variants or tracks") after atomic_dir
creates the temp dir; asserts dest and .tmp.* siblings are absent
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds `_validate.py` with `validate_dataset` which enforces a format-version gate (incompatible MAJOR → actionable ValueError; missing → treated as 1.0.0) and structural/size integrity checks (required files, regions.npy shape, genotypes/offsets.npy byte-length for VCF/PGEN datasets). Called from `OpenRequest._load_metadata` so every `Dataset.open` is guarded. Key implementation detail: genotypes/offsets.npy is a raw int64 memmap (no numpy header), so the check uses st_size bytes rather than np.load. SVAR datasets (which have svar_meta.json) are excluded from the offsets check because their offsets.npy has a different 4-D shape. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
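The two checks described in this commit can be illustrated with a small sketch. The function names, the `SUPPORTED_MAJOR` constant, and the error wording are hypothetical; the 8-bytes-per-element factor follows from the offsets file being raw int64, and the expected element count `n_regions * ploidy * n_samples + 1` is taken from the PR description.

```python
from pathlib import Path
from typing import Optional

SUPPORTED_MAJOR = 1  # assumption: this reader understands MAJOR version 1 only


def check_format_version(version: Optional[str]) -> None:
    """Raise ValueError when the dataset's MAJOR version differs from ours."""
    # A missing version is treated as 1.0.0, as the commit message states.
    major = int((version or "1.0.0").split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(
            f"dataset format_version {version} is incompatible with this reader "
            f"(supports {SUPPORTED_MAJOR}.x); regenerate the dataset with gvl.write"
        )


def offsets_size_ok(path: Path, n_regions: int, ploidy: int, n_samples: int) -> bool:
    """Compare the file's byte size with the expected raw int64 payload."""
    # No numpy header in a raw memmap, so st_size is the whole story.
    expected_len = n_regions * ploidy * n_samples + 1
    return path.stat().st_size == expected_len * 8  # int64 is 8 bytes
```

Checking `st_size` avoids reading the array at all, which keeps validation cheap on every open.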
…loidy branches Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#21) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…file limitation
- SKILL.md: gvl.write section notes atomic build + advisory lock + no auto-rebuild; Dataset.open section notes format_version/integrity validation + no auto-rebuild; Reference.from_path one-liner updated with atomic/lock/auto-rebuild note.
- _write.py: Notes section in write() docstring covers atomic dir, advisory lock, and out-of-scope genoray/pysam index files.
- _fasta_cache.py: module-level docstring records atomic build + lock + auto-rebuild behaviour.
Fast suite: 584 passed, 39 skipped, 4 xfailed, 0 failed. Slow concurrency tier: 2 passed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Makes GenVarLoader's generated on-disk artifacts, the `.gvlfa` FASTA cache and `gvl.write` dataset directories, self-describing, safe under concurrent creation, and resilient to format drift. Closes #21.

This PR landed in two stages: the robust `.gvlfa` FASTA cache, then the atomic-creation, dataset-validation, and concurrency follow-ups it set up.

Stage 1: Robust `.gvlfa` FASTA cache (self-describing, fingerprint-validated)

Replaces the brittle mtime-validated, sibling-only `.fa.gvl` flat FASTA cache with a self-describing `.gvlfa/` directory cache that fingerprints its source, resolves it three ways, auto-rebuilds when stale or corrupt, and auto-migrates legacy caches, all fully backwards compatible.

- `python/genvarloader/_fasta_cache.py` owns the on-disk format, validation, build, and migration behind a single entry point, `ensure_cache(path) -> (FastaCache, data_path)`.
- `.gvlfa/` holds `metadata.json` (pydantic `FastaCache`: format version, gvl version, contig lengths, source hints, fingerprint) plus `sequence.bin` (numpy memmap of all contigs).
- A cache is classified as `fresh`/`stale`/`unvalidated`; it auto-rebuilds on stale or size-corrupt; format-too-new raises consistently from both entry points (never silently downgrades).
- Legacy migration reuses the `.fa.gvl` bytes via move, but only after verifying the legacy byte count matches the current source; a stale or truncated legacy cache is left untouched and the cache is rebuilt fresh.
- `Fasta` (`_fasta.py`) and `Reference.from_path` (`_dataset/_reference.py`) are rewired to `ensure_cache`; both now also accept a `.gvlfa` directory directly as their path. The `_valid_cache`/`_write_to_cache`/`_get_sequences` machinery is removed.

This intentionally mirrors the existing `SvarLink` robustness idiom (fingerprint + three-way resolution + legacy migration) already used for `.svar` back-references in dataset metadata.

Stage 2: Atomic creation, dataset validation, and concurrency safety (closes #21)
- `python/genvarloader/_atomic.py`: `atomic_dir(dest, *, overwrite, lock, timeout)` builds each artifact into a private sibling temp dir and publishes it with an atomic `os.replace`. A best-effort `filelock` avoids N redundant concurrent builds but is never load-bearing for correctness: the atomic rename is the guarantee, so a lock timeout or a network-FS no-op just means "build anyway". `SkipPublish` aborts publishing to reuse an already-valid `dest`; `overwrite=True` uses move-aside-then-rename. Reused by both artifacts.
- FASTA cache (`_fasta_cache.py`): `build`/`migrate_legacy` now publish through `atomic_dir`; `ensure_cache` rebuild paths go through a locked, double-checked helper (`_ensure_built`) so concurrent builders don't all rebuild and never corrupt. The cache auto-rebuilds (source available).
- `gvl.write` (`_dataset/_write.py`): the whole dataset is built into a temp dir and published atomically (`with atomic_dir(...)`), so an interrupted or racing write never leaves a partial dataset. `Metadata` gains a `format_version` field (default `None` for back-compat) and the module records `DATASET_FORMAT_VERSION = "1.0.0"`.
- Dataset validation (`_dataset/_validate.py` + `_open.py`): `Dataset.open` now runs `validate_dataset`, a format-version major gate (incompatible / too-new / too-old → actionable `ValueError`; missing → treated as `1.0.0`) plus structural/size integrity checks (required files present; `regions.npy` shape `(n_regions, 4)`; genotype `offsets.npy` byte-length matches `n_regions·ploidy·n_samples+1`). Datasets never auto-rebuild (no retained source); they raise an error telling you to regenerate with `gvl.write`.
- Out of scope: genoray `.gvi` and pysam `.fai`/`.gzi` index files are created by those libraries and are not made atomic/locked by gvl.

Adds a `filelock>=3.12` dependency. The skill (`skills/genvarloader/SKILL.md`) is updated to document `.gvlfa` support, atomic/locked creation, the dataset format gate, and the index-file limitation.

Test plan

- `tests/unit/test_fasta_cache.py`: fingerprint, source resolution, build/load round-trip, byte-equality vs pysam, stale/unvalidated/format-too-new/corrupt classification, legacy migration (incl. stale-bytes guard), `ensure_cache` decision matrix, atomic-publish + no-partial-cache + double-check reuse.
- `tests/unit/test_fasta.py`: `.gvlfa` creation, direct `.gvlfa` input, missing-source on-demand read error, legacy migration, in-memory-no-cache path, `Reference.from_path` `.gvlfa` round-trip.
- `tests/unit/test_atomic.py` (8): clean publish, sibling temp, exception cleanup, `FileExistsError`, overwrite replace, `SkipPublish`, lock-file persistence, concurrent-loser-discarded.
- `tests/unit/dataset/test_validate.py` (9): valid pass, missing/too-new/too-old format version, missing/wrong-shape `regions.npy`, genotype offsets wrong/correct length, genotypes-without-ploidy.
- `tests/unit/dataset/test_write_atomic.py` (7): `format_version` field/constant/round-trip, on-disk stamp, atomic no-temp-left, failure leaves no partial artifacts, overwrite-false raises.
- `tests/unit/test_concurrency.py` (2, `@slow`): the #21 regression (Lock GVL files to avoid multiple file creation/deletion in multi-job settings). N processes building the same `.gvlfa` cache produce a byte-identical result with no orphans; N processes writing the same dataset path (`overwrite=True`) leave exactly one valid, openable dataset with no orphans.
- `ruff check` + `ruff format` clean on all touched files. (Pre-existing pyrefly errors in `_bigwig.py`/`_flat.py`/`_ragged.py`, plus import-resolution false-positives on `_fasta_cache`/new modules, are unrelated: the Rust ext and seqpro stubs aren't resolvable in the hook env.)

🤖 Generated with Claude Code