Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
ffe0b19
feat(_fasta_cache): add FastaCache models and fingerprint
d-laub Jun 2, 2026
a920b0a
feat(_fasta_cache): add source hints and three-way resolution
d-laub Jun 2, 2026
ce9a501
feat(_fasta_cache): add build, load, and validity guards
d-laub Jun 2, 2026
1b99759
refactor(_fasta_cache): fix progress bar advance and tighten load sta…
d-laub Jun 2, 2026
1e2215a
feat(_fasta_cache): migrate legacy .gvl caches by reusing bytes
d-laub Jun 2, 2026
483c9e7
feat(_fasta_cache): add ensure_cache orchestrator and dispatch
d-laub Jun 2, 2026
7df826d
fix(_fasta_cache): raise on format-too-new sibling cache instead of s…
d-laub Jun 2, 2026
33d05de
feat(fasta): use .gvlfa cache module and accept .gvlfa input
d-laub Jun 2, 2026
3a7881b
test(fasta): cover in-memory no-cache _read_all_from_fasta path
d-laub Jun 2, 2026
91f0207
refactor(reference): build cache via ensure_cache, accept .gvlfa
d-laub Jun 2, 2026
bc975cd
docs(skill): note Reference.from_path .gvlfa support
d-laub Jun 2, 2026
4eac37c
fix(_fasta_cache): guard legacy migration against stale/truncated bytes
d-laub Jun 2, 2026
e629962
chore(pre-commit): auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 2, 2026
d7ef3d0
docs(specs): design robust on-disk artifacts (atomic creation + valid…
d-laub Jun 2, 2026
021a632
docs(specs): resolve open questions for robust on-disk artifacts
d-laub Jun 3, 2026
a8e8981
docs(plans): implementation plan for robust on-disk artifacts (closes…
d-laub Jun 3, 2026
bd15a0f
build: add filelock dependency for atomic artifact creation
d-laub Jun 3, 2026
ff3092e
feat(_atomic): add atomic_dir directory-publish primitive
d-laub Jun 3, 2026
5dab78f
feat(_fasta_cache): publish cache atomically via atomic_dir + locked …
d-laub Jun 3, 2026
f6d2209
feat(_write): atomic dataset creation + format_version in Metadata
d-laub Jun 3, 2026
59ccf4f
refactor(_write): use plain with-atomic_dir; restore warnings filter;…
d-laub Jun 3, 2026
d7762fb
feat(_open): validate dataset format version + integrity on open
d-laub Jun 3, 2026
c5e16c6
test(_validate): cover too-old format version and genotypes-without-p…
d-laub Jun 3, 2026
6f15e89
test: concurrency regression for atomic cache + dataset creation (clo…
d-laub Jun 3, 2026
83790b3
docs(skill): note atomic/locked creation, dataset format gate, index-…
d-laub Jun 3, 2026
0053a3b
docs(_atomic): clarify overwrite=False discard rationale per consumer
d-laub Jun 3, 2026
93cef1a
style: apply ruff format to _write and _fasta_cache
d-laub Jun 4, 2026
edfba67
fix(test_fasta): move mid-file imports to top (E402, CI lint)
d-laub Jun 4, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,060 changes: 1,060 additions & 0 deletions docs/superpowers/plans/2026-06-03-robust-ondisk-artifacts.md

Large diffs are not rendered by default.

165 changes: 165 additions & 0 deletions docs/superpowers/specs/2026-06-02-robust-ondisk-artifacts-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
# Robust on-disk artifacts: atomic creation + validation — Design

**Status:** Approved design with all implementation decisions resolved; ready for implementation planning.

**Goal:** Make GenVarLoader's generated on-disk artifacts safe under concurrency and resilient to format drift. Two coupled problems, one initiative:

1. **Concurrency (closes [#21]):** Parallel jobs that share an input (classically one reference FASTA across a SLURM array) race to build the same `.gvlfa` cache and corrupt it. The same race can affect `gvl.write` dataset directories.
2. **Validation hardening:** Files written into a `*.gvl/` dataset directory by `gvl.write` are write-once-assume-valid — no real format-version gate and no integrity check on read. A future on-disk layout change would silently load old datasets as wrong data; an interrupted write leaves an undetected partial dataset.

This extends the robust-sidecar idiom already established by the `.gvlfa` FASTA cache (PR #206) and the existing `SvarLink` (`_svar_link.py`): fingerprint/version + graceful resolution + safe creation.

[#21]: https://github.com/mcvickerlab/GenVarLoader/issues/21

---

## Background & decisions

The corruption in #21 stems from a **shared mutable target**: today `_fasta_cache.build()` does `np.memmap(gvlfa_dir / "sequence.bin", mode="w+")` and writes into the live path; `gvl.write()` writes its many files directly into the destination directory. Concurrent builders interleave writes, and readers can observe partially-written artifacts.

Two ways to make creation safe were considered:

- **Advisory locks** (`flock`/`fcntl`) make other jobs wait. On network filesystems (NFS, Lustre, GPFS) — exactly where #21's parallel jobs run — advisory locks are unreliable: `flock` is often local-only across nodes, POSIX locks depend on a frequently-misconfigured network lock manager, and Lustre needs a specific mount option. The failure mode is silent (looks locked, isn't).
- **Atomic build-to-temp-then-rename** makes the *publish* indivisible. Each job builds into its own private sibling temp directory, then `os.replace`s it into place. `rename(2)` is a single metadata operation the server serializes (atomic even on NFS, within one filesystem); there is no shared target during the build, so writes cannot interleave. "Last writer wins" is harmless because every builder produces byte-identical valid content.

**Decisions (from brainstorming):**

| # | Decision |
|---|---|
| Concurrency primitive | **Atomic rename for correctness** (works on any filesystem) **+ advisory lock as a best-effort optimization** (avoid N redundant builds). The lock is never load-bearing for correctness. |
| Lock library | Add **`filelock`** dependency (v1). |
| Validation depth (datasets) | **Format-version gate + structural/size integrity** on open. |
| Dataset mismatch policy | **Always raise an actionable error** — datasets cannot auto-rebuild (no retained source). Only the FASTA cache auto-rebuilds (it has its source). |
| Atomic-creation scope | **Both** the `.gvlfa` cache **and** `gvl.write` dataset creation. |
| Back-compat | A dataset with **no `format_version`** is treated as the current layout (`1.0.0`) and loaded best-effort (no warning, no forced regeneration). |
| Out of scope | genoray `.gvi` and pysam `.fai`/`.gzi` are created by those libraries; making them atomic needs upstream changes. Documented as a limitation. |

---

## Architecture

A new small module **`python/genvarloader/_atomic.py`** owns the single reusable primitive both artifact types need — safely publishing a directory. Everything else reuses it.

### `_atomic.py` — safe directory publish (single responsibility)

```python
@contextmanager
def atomic_dir(
dest: Path,
*,
overwrite: bool = False,
lock: bool = True,
timeout: float = DEFAULT_LOCK_TIMEOUT,
) -> Iterator[Path]:
"""Yield a private temp dir to build into; atomically publish it to `dest` on
clean exit, or remove it on error. Optional best-effort lock avoids redundant
concurrent builds but is never required for correctness."""
```

- **Temp location:** `<dest>.tmp.<pid>-<uniq>/` as a *sibling* of `dest` (same filesystem, so `os.replace` is atomic; avoids `EXDEV`).
- **Publish:** on clean context exit, `os.replace(tmp, dest)`. On exception, remove `tmp` and re-raise; `dest` is never touched.
- **Existing dest:** checked up front — exists & not `overwrite` → `FileExistsError`; exists & `overwrite` → **move-aside then rename**: `os.replace(dest, <dest>.old.<uniq>)`, then `os.replace(tmp, dest)`, then `rmtree(<dest>.old.<uniq>)`. Two fast metadata renames keep the dest-absent window minimal; if the second rename fails the `.old` copy is still recoverable.
- **Lock layer** (`filelock`): a `<dest>.lock` sibling, **internal with a 60s default `DEFAULT_LOCK_TIMEOUT`** (not exposed on the public API — the lock is only a best-effort optimization; atomic rename is the correctness guarantee). Flow is **double-checked**: acquire lock (with `timeout`) → re-validate `dest` (another job may have just published it → reuse, skip build) → else build. On lock timeout or a silent network-FS no-op, fall back to building anyway; atomic rename keeps that correct.

### `_fasta_cache.py` (from PR #206)

`build()` and `migrate_legacy()` refactored to publish through `atomic_dir` rather than writing `sequence.bin`/`metadata.json` into the live directory. `ensure_cache` already has the format-version gate + blake2b fingerprint; it gains the lock + atomic publish and a double-check re-classification inside the lock. Stale/corrupt caches still auto-rebuild (the source FASTA is available).

### `_write.py`

`write()`'s body runs against a temp dir yielded by `atomic_dir(path, overwrite=overwrite, ...)`; the final `os.replace` publishes the complete `*.gvl/`. The existing `overwrite` / `FileExistsError` semantics move into `atomic_dir`. `Metadata` gains a real **`format_version: SemanticVersion`** field — bumped only when the on-disk layout changes, distinct from the existing package `version` field.

### `_open.py`

`_load_metadata()` gains a `_validate(metadata, path)` step:

1. **Format-version gate** — incompatible major (too-new *or* too-old) → actionable `ValueError` ("written by format N; regenerate with `gvl.write`"). Missing `format_version` → treat as `1.0.0`.
2. **Structural + size integrity** — required files exist, and each `.npy`/memmap `stat().st_size` equals the size implied by the shapes in metadata (`n_regions`, `ploidy`, `n_samples`, final offset values, track set). Mismatch → `ValueError` naming the file.

Datasets never auto-rebuild.

---

## Data flow

### A. Building the FASTA cache (`ensure_cache`)

1. Dispatch `.fa` vs `.gvlfa` and decide build / migrate / rebuild / reuse (unchanged).
2. If a build is needed:
- `with atomic_dir(gvlfa_dir, lock=True, timeout=T) as tmp:`
- lock acquired (or timed out → proceed)
- **double-check:** re-classify `gvlfa_dir`; if another job just published a fresh cache, skip building and reuse it
- else write `sequence.bin` + `metadata.json` into `tmp`
- context exit → `os.replace(tmp, gvlfa_dir)`
3. return `(meta, data_path)`.

### B. Writing a dataset (`gvl.write`)

1. `with atomic_dir(path, overwrite=overwrite, lock=True, timeout=T) as tmp:`
- existing-dest handled up front (`FileExistsError` unless `overwrite`)
- every write (`input_regions.arrow`, `regions.npy`, `genotypes/*`, `intervals/**`, `metadata.json` with `format_version`) targets `tmp`
2. exit → atomic publish to `path`. An interrupted write leaves only an orphan `<path>.tmp.*`, never a half-written `path`.

### C. Opening a dataset (`Dataset.open`)

1. `_load_metadata()` reads `metadata.json`.
2. `_validate(metadata, path)`: format-version gate, then structural + size integrity.
3. construct readers as today.

### Orphan / lock-file policy

Temp dirs are removed on exception. A *crashed* process can still leave an orphan `<dest>.tmp.*`; these are harmless (distinctly named, ignored on open) and documented rather than swept (sweeping risks deleting a live job's temp). `<dest>.lock` files persist empty and are reused.

---

## Error handling

| Situation | Behavior |
|---|---|
| Lock timeout / silent network-FS no-op | `log.info`, build anyway — atomic rename keeps it correct (never an error) |
| Build raises mid-way | temp dir removed; exception propagates; live `dest` untouched |
| Dataset `dest` exists, `overwrite=False` | `FileExistsError` (preserved behavior) |
| FASTA format too-new | `ValueError` "newer than supported" (existing behavior) |
| Dataset format too-new / too-old-incompatible | actionable `ValueError` → regenerate with `gvl.write` |
| Dataset missing `format_version` | treat as `1.0.0`, load (no warning) |
| Dataset integrity fail (missing file / size mismatch) | `ValueError` naming the file → regenerate |
| FASTA cache integrity fail | auto-rebuild (source available), else `ValueError` |

**Correctness invariant:** a reader never observes a partially-written `dest`, and two builders never write the same file.

---

## Test plan

### `tests/unit/test_atomic.py` (the primitive)
- publishes on clean exit (dest appears, temp gone); removes temp + leaves no dest on exception
- `overwrite=False` + existing dest → `FileExistsError`; `overwrite=True` replaces
- lock double-check: `build_fn` runs once when dest becomes valid mid-wait; lock timeout → falls back to build, no error

### Concurrency regression (`multiprocessing`, `@pytest.mark.slow`) — the #21 fix
- N processes `ensure_cache` the same `.fa` concurrently → resulting `sequence.bin` is valid and **byte-identical** to a single-process build; no corruption
- N processes `gvl.write` the same path (overwrite) → exactly one valid, openable dataset; no partial dirs left as `path`

### Format / validation
- dataset `format_version` too-new → `ValueError`; missing → loads as `1.0.0`; truncated/missing `.npy` → `ValueError` naming it
- write→open round-trip validates clean; back-compat: a dataset with the `format_version` field stripped still opens

### Regression
- full fast suite green (write + open are now atomic); `gen` + a `Dataset` round-trip

Cross-platform: `filelock` covers posix/Windows; `os.replace` is atomic within a directory on both.

---

## Out of scope (documented limitations)

- **genoray `.gvi`** (VCF/PGEN variant index) and **pysam `.fai`/`.gzi`** (FASTA index) are created by their respective libraries; making their creation atomic/locked requires upstream changes. Concurrent jobs relying on these still depend on those libraries' own behavior.
- No per-region / per-sample integrity beyond total byte-size checks (full-content hashing of multi-GB datasets is intentionally avoided as too expensive — same tradeoff `_data_size_ok` documents for the FASTA cache).

---

## Resolved implementation decisions

1. **Overwrite publish (non-empty existing dir).** `os.replace` cannot replace a non-empty dir on POSIX, so `overwrite=True` uses **move-aside then rename**: `os.replace(dest, <dest>.old.<uniq>)` → `os.replace(tmp, dest)` → `rmtree(<dest>.old.<uniq>)`. Two fast metadata renames keep the dest-absent window minimal, and a failure of the second rename leaves the old data recoverable under `.old` (vs. an `rmtree`-first approach that spans the whole delete and loses old data on a mid-delete crash).
2. **Lock timeout / API surface.** `DEFAULT_LOCK_TIMEOUT = 60s`, kept **internal** — `lock`/`timeout` are not exposed on `gvl.write` / `Reference.from_path` / `Fasta`. The lock is a best-effort optimization only; atomic rename is the correctness guarantee, so tuning is rarely needed.
3. **`format_version`.** Current dataset layout is **`1.0.0`**. Bump **MAJOR** only when an existing dataset can no longer be read correctly by new code (incompatible layout change); minor/patch for backward-compatible additions. A missing `format_version` field is treated as `1.0.0`.
1 change: 1 addition & 0 deletions pixi.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ dependencies = [
"pooch",
"awkward",
"hirola>=0.3,<0.4",
"filelock>=3.12",
]

[project.optional-dependencies]
Expand Down
121 changes: 121 additions & 0 deletions python/genvarloader/_atomic.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,121 @@
"""Safely publish a directory artifact via build-to-temp-then-atomic-rename.

The single primitive both the `.gvlfa` FASTA cache and `gvl.write` dataset
directories need: build into a private sibling temp dir, then publish it to the
destination with an atomic `os.replace`. A best-effort `filelock` avoids N
redundant concurrent builds, but is never required for correctness — the atomic
rename is the correctness guarantee, so a lock timeout or a silent network-FS
no-op simply means "build anyway".
"""

from __future__ import annotations

import os
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator
from uuid import uuid4

from filelock import FileLock, Timeout
from loguru import logger

__all__ = ["atomic_dir", "SkipPublish", "DEFAULT_LOCK_TIMEOUT"]

DEFAULT_LOCK_TIMEOUT = 60.0
"""Seconds to wait for the build lock before giving up and building anyway."""


class SkipPublish(Exception): # noqa: N818 - control-flow sentinel, not an error
"""Raise inside an `atomic_dir` block to abort publishing.

Use when a double-check inside the lock finds `dest` already valid: the temp
dir is removed and `dest` is left untouched, with no exception surfacing to
the caller.
"""


def _publish(tmp: Path, dest: Path, *, overwrite: bool) -> None:
"""Move `tmp` into place at `dest` as atomically as the filesystem allows."""
if dest.exists():
if not overwrite:
# A racing builder published `dest` while we built. Discard ours and
# let theirs win: for the FASTA cache the content is byte-identical so
# this is harmless; for a dataset write it is first-writer-wins.
shutil.rmtree(tmp, ignore_errors=True)
return
aside = dest.with_name(f"{dest.name}.old.{uuid4().hex[:8]}")
try:
os.replace(dest, aside)
except FileNotFoundError:
aside = None # another writer already moved/removed dest
os.replace(tmp, dest)
if aside is not None:
shutil.rmtree(aside, ignore_errors=True)
else:
os.replace(tmp, dest)


@contextmanager
def atomic_dir(
dest: str | os.PathLike[str],
*,
overwrite: bool = False,
lock: bool = True,
timeout: float = DEFAULT_LOCK_TIMEOUT,
) -> Iterator[Path]:
"""Yield a private temp dir to build into; atomically publish it to `dest`.

On clean exit the temp dir is `os.replace`-d into `dest`. On `SkipPublish`
the temp dir is removed and `dest` is left untouched. On any other exception
the temp dir is removed and the exception propagates; `dest` is never
partially written.

Parameters
----------
dest
Final destination directory.
overwrite
If False and `dest` already exists, raise `FileExistsError` up front. If
True, an existing `dest` is replaced via move-aside-then-rename.
lock
Acquire a best-effort `<dest>.lock` to avoid redundant concurrent builds.
timeout
Seconds to wait for the lock before logging and proceeding anyway.
"""
dest = Path(dest)
if dest.exists() and not overwrite:
raise FileExistsError(
f"{dest} already exists; pass overwrite=True to replace it."
)

flock: FileLock | None = None
if lock:
flock = FileLock(str(dest) + ".lock")
try:
flock.acquire(timeout=timeout)
except Timeout:
logger.info(
f"Timed out after {timeout}s waiting for {dest}.lock; "
"building anyway (atomic rename keeps this correct)."
)
flock = None

tmp = dest.with_name(f"{dest.name}.tmp.{os.getpid()}-{uuid4().hex[:8]}")
tmp.mkdir(parents=True)
try:
yield tmp
except SkipPublish:
shutil.rmtree(tmp, ignore_errors=True)
except BaseException:
shutil.rmtree(tmp, ignore_errors=True)
raise
else:
_publish(tmp, dest, overwrite=overwrite)
finally:
if flock is not None:
lock_path = Path(str(dest) + ".lock")
flock.release()
# Ensure the lock file persists as a sibling so future callers can
# reuse it; filelock may remove it on release.
lock_path.touch(exist_ok=True)
5 changes: 4 additions & 1 deletion python/genvarloader/_dataset/_open.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
from ._reconstruct import Haps, Ref, Tracks, _build_reconstructor
from ._reference import Reference
from ._utils import bed_to_regions
from ._validate import validate_dataset
from ._write import Metadata

if TYPE_CHECKING:
Expand Down Expand Up @@ -100,7 +101,9 @@ def _validate_path(self) -> None:

def _load_metadata(self) -> Metadata:
with _py_open(self.path / "metadata.json") as f:
return Metadata.model_validate_json(f.read())
metadata = Metadata.model_validate_json(f.read())
validate_dataset(metadata, self.path)
return metadata

def _build_indexer(
self, metadata: Metadata
Expand Down
Loading
Loading