POC: zip as a built-in post_run hook (optional zipping)#382
Closed
asmacdo wants to merge 16 commits into
Closed
Conversation
Add two optional, ordered hook lists to the participant job script:
- pre_app: spliced before the `datalad run` wrapper
- post_run: spliced after the run, before the push to output storage
Each splice region is a subshell that exports the contract vars
(subid, BRANCH, PROJECT_ROOT, JOB_SCRATCH_DIR; plus sesid at session
level) so separate-process hook scripts read them by name, while the
export cannot leak into the datalad run / container. processing_level
is intentionally not exported: it is a render-time constant, and sesid
presence already distinguishes subject vs session.
Hooks reach the template as list[str]; resolving the YAML hooks: block
into those strings is a later step. With no hooks configured the render
is byte-identical to before (jinja {% if %} guard + trim/lstrip).
Thread hook_pre_app/hook_post_run (default None) through
generate_submit_script(); both existing call sites are unaffected.
Add render tests: no-hooks (None == [], no markers, shellcheck),
pre_app+post_run expansion (contract export, snippet order, position
around the run wrapper, shellcheck), and session-only sesid export.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First resolution step for the participant_job splice points (8b4185a): turn a `hooks:` config block into the command strings those splice points consume, plus the files babs materializes at init. This is what lets cross-cutting concerns -- optional zip, NORDIC, a BIDS-validator gate -- live as user-configurable hooks in one codepath instead of diverged templates. Design: PennLINC#365. babs/hooks.py (pure, no I/O): resolve_hooks(hooks_config, *, source_base) classifies each entry into Verbatim (form a, raw snippet) or CopyIn (form b, `{script: <path>}` copied to code/hooks/<basename>.sh). Render (templated built-ins / container hooks) is defined as the forward-compat seam but not yet produced. Fail-fast ValueError on bad config. CopyIn.as_import() emits the {original_path, analysis_path} shape _init_import_files already consumes, so a later slice wires materialization with no duplicated copy logic. tests/test_hooks.py: 21 unit tests (run in the slurm-docker-ci container; conftest pulls slurm deps at collection). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A hook collision is about the materialized file, not the name alone. Key the check on descriptor equality: the *same* hook reused at multiple splice points (e.g. a BIDS validator at both pre_app and post_run) is copied once and referenced from each list, while two *different* descriptors claiming the same code/hooks/<name>.sh still raise. Because Render carries `context`, this also catches one template rendered two ways into the same name (a real conflict) once Render is wired. tests: same-source-at-both-points materializes once; different-source-same- name still collides; Render equality distinguishes context (pins the invariant the rule depends on, since resolve_hooks doesn't emit Render yet). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow babs's existing convention for copied-in local files instead of
inventing a parallel one. `imported_files.original_path` -- which form (b)
reuses via _init_import_files -- is an absolute local path used as-is
(op.exists + copy, no cwd-join, no abspath validation); mechababs likewise
resolves the inclusion file to absolute before passing it ("so datalad run cp
doesn't need the cwd to match"). So resolve_hooks no longer takes source_base
and no longer joins: CopyIn.original_path is the `script:` value verbatim, and
_init_import_files raises FileNotFoundError on a bad path exactly as it does
for imported_files today.
This also lands on the right side of the path-handling review on
PennLINC#369: source paths are absolute (the explicitly-supported
copy-in case), while the destination (code/hooks/<name>.sh) stays relative
and is containment-validated by _validate_name.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Connect the resolver to babs for the single-app codepath:
- Render: Container.generate_bash_participant_job resolves
`self.config.get('hooks')` and passes hook_pre_app/hook_post_run to
generate_submit_script (container.py).
- Materialize: babs_bootstrap appends the resolved CopyIns'
as_import() dicts to the existing shared _init_import_files call, so
hook scripts copy into code/hooks/ with no duplicated import logic.
- _init_import_files now makedirs the destination's parent (hooks land in
code/hooks/, which -- unlike flat code/ -- doesn't pre-exist).
The pipeline call site (bootstrap.py:508) is left unwired by design: it
defaults hook_pre_app=None so pipeline renders byte-identical, and pipeline
mode is slated for deletion once NORDIC is a hook. No hooks configured =>
no code/hooks/ dir and no splice subshell (both are guarded), so projects
without hooks are untouched.
Tests:
- test_babs_init_raw_bids: splice a form-(b) contract-guard hook at both
pre_app and post_run. It runs as a separate process, so it only sees the
contract vars (subid/BRANCH/PROJECT_ROOT/JOB_SCRATCH_DIR) because the
splice subshell exports them; `${var:?}` fails the job under set -e if any
is unset. Same source at both points also exercises copy-once dedup. Plus
a post-init assertion that the guard is materialized + spliced -- a dropped
hook would otherwise pass silently (no guard left to fail).
- test_babs_init_single_app_hooks: init-only wiring check (no job execution)
-- asserts a configured hook lands in code/hooks/ and is referenced twice.
- test_babs_init_list_sub_file: assert no code/hooks/ dir when no hooks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the alphabetic form labels in docstrings/comments with names derived
from the config syntax, so a reader doesn't have to decode "(a)/(b)/(c)":
(a) raw snippet -> snippet (Verbatim)
(b) user script -> script (CopyIn)
(c) container -> templated built-in (Render)
"templated built-in" names the Render *mode* (babs renders a shipped
*.sh.jinja2 into code/hooks/), which serves both a zip hook and a
container-running hook (e.g. nordic) alike -- they differ only in what the
shipped template does, not in how it's produced. So there is no distinct
"container" form; a container-running hook is just a {builtin: <name>} whose
template composes a singularity run.
Comments/docstrings only -- no behavior change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The two splice points bracket the same `datalad run`, so name them for that boundary: pre_run / post_run (was pre_app / post_run, which mixed "app" and "run" vocab). Renames the config key, the hook_pre_app -> hook_pre_run template param, and the test markers. Config-key churn is cheapest now, before the keys are public. (Review feedback, PR1-review Q8.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a `hooks` section to the config-yaml docs: the pre_run/post_run splice points, the snippet and script entry forms, the exported runtime contract (subid/sesid/BRANCH/PROJECT_ROOT/JOB_SCRATCH_DIR) + subshell/set -e semantics, and -- most importantly -- that hooks splice OUTSIDE the datalad run wrapper, so a hook that writes files leaves uncommitted changes that aren't pushed; the safe default is validation that fails the job, and persisting output requires the hook to own its own datalad run/save. (Review feedback, PR1-review Q1/Q2/Q10/Q11.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two different scripts sharing a basename within one splice point produced "(in 'pre_run' and 'pre_run')". Name the single point when prior_point == point, the pair otherwise. Adds test_same_point_same_name_collide and tightens the cross-point test to assert the full location. (Review feedback, PR1-review Q12.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A {builtin: <name>} entry now resolves to a Render of the shipped
templates/hooks/<name>.sh.jinja2; keys beyond `builtin` become the
per-hook part of the render context (e.g. zip's optional `path`).
Bootstrap still gates with NotImplementedError until the next step
wires the actual rendering; CopyIns keep flowing to _init_import_files.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`_init_render_hooks` renders each Render's shipped template (babs/templates/hooks/) into code/hooks/<name>.sh at init and datalad saves, replacing the NotImplementedError gate. Render context = the per-hook config params plus the derived part (processing_level), with derived values taking precedence. zip.sh.jinja2 is the first built-in: unlock -> 7z inside its own `datalad run --explicit` -> remove the granular outputs in a separate commit (the datalad/datalad#7822 workaround, TODO'd for removal once babs's minimum datalad carries the fix). One hook = one zip; the archive is named ${subid}[_${sesid}]_<basename(path)>.zip and contains basename(path) at its top level. Also wraps a pre-existing >99-char line in hooks.py that failed ruff. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Asserts {builtin: zip, path: ...} renders code/hooks/zip.sh (id, archive
name, 7z line, granular removal), splices once at post_run, and lands
the "Materialize built-in hook scripts" init commit. The TODO-fenced
demo block copies the rendered artifacts into demo-output/ for
inspection after a docker run; remove before PR assembly.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
BUILTIN_PARAMS maps each built-in to its accepted per-hook params
(zip -> {path}). An unknown builtin or a typo'd param (e.g. `paht:`)
now fails at resolve time instead of flowing silently into the render
context (StrictUndefined only catches missing context, not unexpected
keys). The registry also subsumes name validation for built-ins: only
known names resolve, so a path-escaping name can't reach the template
loader.
Also wraps a >99-char line in test_hooks.py that failed ruff.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inner run script stops being named for zipping (zip moves to the post_run hook). generate_submit_script grows an optional `output_dir` (mutually exclusive with the pipeline-only `zip_foldernames`): when given, the datalad run declares the app output folder itself as the output, committing granular results. Single-app still passes zip_foldernames until the config-surface swap lands; pipeline mode is untouched. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t_dir The top-level output_dir (the folder the app writes into, carrying the versioned derivative name) is now the single source for the app write dir, the datalad run -o declaration, and the default zip-hook path. Legacy keys hard-error with a migration message (app_output_settings_from_config stays for pipeline mode; dies in PR 3). - bidsapp_run.sh.jinja2 de-zipped: cmd_zip gone, and the granular outputs survive the script (the run now commits them; the zip hook owns their removal) - the zip built-in becomes a STATIC script taking <path> [<name>] as splice-site arguments (visible in participant_job.sh): path defaults to output_dir at resolve time, name (a free-form archive stem, the old map value's only real job) defaults to basename(path) at runtime, and sesid presence encodes the processing level. All zip instances share one code/hooks/zip.sh; multi-zip differs only in args. The Render seam stays reserved for the container-running built-ins (PR 3). - notebooks/eg_*.yaml + test fixtures migrated; docs: output_dir section replaces zip_foldernames; raw_bids e2e flips post_run to [contract guard, zip] (guard before zip -- ordering is load-bearing) WIP: not yet run in docker (units + e2e pending; docker held by the PR-1 CI session). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Collaborator
Author
|
Folded into #381 — per today's community meeting we're combining the splice-points PR and the zip-as-hook PR into one (a release will be cut before merge, so the additive/behavioral split bought nothing). #381 now carries the full stack including the config swap ( |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft / WIP — stacked on #381 (the first 9 commits are that PR; the 6 on top are this one).
Moves output zipping out of the generated run script and into a built-in
post_runhook, using the splice-point machinery from #381.Zipping becomes opt-in: no zip hook configured = no zipping (#327, #364).
This PR introduces the built-in hook form (zip is the first, shipped as a static script), i.e.
it is partly a demonstration that the hooks design generalizes — flag anything that doesn't.
Config shape
The zip/output config is redefined around one new top-level key,
output_dir— the folder the app writes into (relative to the dataset root), carrying the full versioned derivative name. It replaces bothzip_foldernames(the name was derived from key + version) andall_results_in_one_zip:output_diris the single source for the versioned name — app write dir = zip source = zip name, one string that can't drift. Zipping itself becomes just a hook: no zip hook = no zipping.The zip hook takes two optional parameters:
path:(the folder to zip, relative to the dataset root; defaults to the top-leveloutput_dir) andname:(the archive-name stem — the X in${subid}[_${sesid}]_X.zip; defaults topath's basename). The archive contains the zipped folder at its top level.name:exists because in the old schema the map value only ever versioned the archive name; making it a free-form stem keeps that without any derivation.Non-default case — not all output in one zip (what the multi-entry
zip_foldernamesmap did): one zip hook per folder, each with its ownpath:(+name:when the app controls the folder name and the archive should stay versioned):The zip built-in is a static script taking its params as args at the splice site (
bash ./code/hooks/zip.sh outputs/freesurfer freesurfer-24-1-1— visible inparticipant_job.sh), so all instances share onecode/hooks/zip.sh. Nothing zip varies on needs render time (sesidpresence encodes the processing level at runtime, per the splice contract); theRenderseam from the splice-points PR stays reserved for the container-running built-ins (NORDIC), which genuinely need init-time composition.Question for review: do any of your configs actually use multiple
zip_foldernamesentries? Every public config has exactly one.Built-in names and their params are validated against a registry at resolve time (a typo'd builtin or unknown param fails
babs init, not the job).Done so far in this draft
output_dirreplaceszip_foldernames/all_results_in_one_zip, which hard-error with a migration message — a clean break, no deprecation shim (question for review below). All shipped example configs (notebooks/eg_*.yaml), test fixtures, and the docs (zip_foldernamessection →output_dirsection) are migrated; the legacy derivation survives only for pipeline mode, which PR 3 deletes.{builtin: zip}materializes the packaged staticcode/hooks/zip.sh(git-tracked, copied verbatim) and splices it with its resolved arguments.datalad unlock→7zinside its owndatalad run --explicit→ a separategit rmof the granular outputs (separate becausedatalad run --explicitdoesn't track deletions — datalad run --explicitdoes not save file deletions in--output` paths datalad/datalad#7822, since fixed upstream; TODO in-script to fold it back in once babs's minimum datalad has the fix).<container>_zip.sh→<container>_run.sh, the in-script 7z +rm -rf outputstail is gone, and the participant job'sdatalad run -odeclares the granularoutput_dir(the run commits granular outputs; the zip hook replaces them with the archive).post_run: [<contract guard>, {builtin: zip}]— guard before zip, because hook ordering is load-bearing (a validator must see the outputs before zip flattens them) — so the runtime zip path is exercised end-to-end in CI.Demo: what
babs initproducesArtifacts below are produced by this PR's init-level test (simbids, in the
slurm-docker-cicontainer), from a config withoutput_dir: outputs/fmriprep_anat-24-1-1and an argless{builtin: zip}atpost_run.code/hooks/zip.sh— the static zip built-in, copied in verbatim (git-tracked, so you can read exactly what will run)code/participant_job.sh— the run's-odeclares the granularoutput_dir; thepost_runsplice runs the zip hook with its resolved argumentanalysis dataset history after
babs init— the static hook rides the imported-files commitRemaining before review-ready
babs statuswith zipping off:OutputDatasetcurrently hardcodesis_zipped=True, so merged-results accounting silently reports nothing when no zip hook is configured — needs the zip-presence persisted; follow-up, design open.*prepapps (suspected:sesid="$2"now unused once the in-script zip is gone → SC2034) — to fix, plus a fresh full-e2e run.Known trade-off to weigh at review: post-run zip vs within-run zip
With the hook owning its own
datalad run, each job commits granular outputs first, then the zip (two commits). The granular annex keys stay in history and thegit-annexbranch even aftergit rm— sobabs mergeat ~1000-subject scale gets strictly heavier than today's within-run zip. Is the two-commit shape tolerable at your scale?If it isn't, the design answer we'd propose is not coupling zip back into the run script, but a second seam: command decoration — a configurable prefix/suffix on the command inside the
datalad runwrapper (prepend_run/append_run), so zip happens within the run and only the archive is ever committed. That seam already has a second independent consumer: per-job resource monitoring with duct (#356), which also needs to wrap the wrapped command and can't be expressed as apre_run/post_runhook. Caveat to weigh: a decoration lands in the recorded rerun command, sodatalad rerunthen requires the decorating tool on the host.