Skip to content

POC: zip as a built-in post_run hook (optional zipping)#382

Closed
asmacdo wants to merge 16 commits into
PennLINC:mainfrom
asmacdo:hooks-pr2-zip-builtin
Closed

POC: zip as a built-in post_run hook (optional zipping)#382
asmacdo wants to merge 16 commits into
PennLINC:mainfrom
asmacdo:hooks-pr2-zip-builtin

Conversation

@asmacdo

@asmacdo asmacdo commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Draft / WIP — stacked on #381 (the first 9 commits are that PR; the 6 on top are this one).

Moves output zipping out of the generated run script and into a built-in post_run hook, using the splice-point machinery from #381.
Zipping becomes opt-in: no zip hook configured = no zipping (#327, #364).
This PR introduces the built-in hook form (zip is the first, shipped as a static script), i.e.
it is partly a demonstration that the hooks design generalizes — flag anything that doesn't.

Config shape

The zip/output config is redefined around one new top-level key, output_dir — the folder the app writes into (relative to the dataset root), carrying the full versioned derivative name. It replaces both zip_foldernames (the name was derived from key + version) and all_results_in_one_zip:

# today                                    # this PR
all_results_in_one_zip: true               output_dir: outputs/fmriprep_anat-24-1-1
zip_foldernames:                           hooks:
    fmriprep_anat: "24-1-1"                  post_run:
                                               - builtin: zip   # zips `output_dir`

output_dir is the single source for the versioned name — app write dir = zip source = zip name, one string that can't drift. Zipping itself becomes just a hook: no zip hook = no zipping.

  • The zip hook takes two optional parameters: path: (the folder to zip, relative to the dataset root; defaults to the top-level output_dir) and name: (the archive-name stem — the X in ${subid}[_${sesid}]_X.zip; defaults to path's basename). The archive contains the zipped folder at its top level. name: exists because in the old schema the map value only ever versioned the archive name; making it a free-form stem keeps that without any derivation.

  • Non-default case — not all output in one zip (what the multi-entry zip_foldernames map did): one zip hook per folder, each with its own path: (+ name: when the app controls the folder name and the archive should stay versioned):

    output_dir: outputs    # the app writes here, creating the subfolders below
    hooks:
      post_run:
        - builtin: zip
          path: outputs/fmriprep_func
          name: fmriprep_func-24-1-1     # -> ${subid}_fmriprep_func-24-1-1.zip
        - builtin: zip
          path: outputs/freesurfer
          name: freesurfer-24-1-1        # -> ${subid}_freesurfer-24-1-1.zip

    The zip built-in is a static script taking its params as args at the splice site (bash ./code/hooks/zip.sh outputs/freesurfer freesurfer-24-1-1 — visible in participant_job.sh), so all instances share one code/hooks/zip.sh. Nothing zip varies on needs render time (sesid presence encodes the processing level at runtime, per the splice contract); the Render seam from the splice-points PR stays reserved for the container-running built-ins (NORDIC), which genuinely need init-time composition.

    Question for review: do any of your configs actually use multiple zip_foldernames entries? Every public config has exactly one.

  • Built-in names and their params are validated against a registry at resolve time (a typo'd builtin or unknown param fails babs init, not the job).

Done so far in this draft

  • The config swap is in: top-level output_dir replaces zip_foldernames / all_results_in_one_zip, which hard-error with a migration message — a clean break, no deprecation shim (question for review below). All shipped example configs (notebooks/eg_*.yaml), test fixtures, and the docs (zip_foldernames section → output_dir section) are migrated; the legacy derivation survives only for pipeline mode, which PR 3 deletes.
  • {builtin: zip} materializes the packaged static code/hooks/zip.sh (git-tracked, copied verbatim) and splices it with its resolved arguments.
  • The hook owns its own commit: datalad unlock7z inside its own datalad run --explicit → a separate git rm of the granular outputs (separate because datalad run --explicit doesn't track deletions — datalad run --explicit does not save file deletions in --output` paths datalad/datalad#7822, since fixed upstream; TODO in-script to fold it back in once babs's minimum datalad has the fix).
  • The generated run script no longer zips: <container>_zip.sh<container>_run.sh, the in-script 7z + rm -rf outputs tail is gone, and the participant job's datalad run -o declares the granular output_dir (the run commits granular outputs; the zip hook replaces them with the archive).
  • The raw_bids e2e runs post_run: [<contract guard>, {builtin: zip}] — guard before zip, because hook ordering is load-bearing (a validator must see the outputs before zip flattens them) — so the runtime zip path is exercised end-to-end in CI.
  • Unit + init-level test coverage and user docs throughout.

Demo: what babs init produces

Artifacts below are produced by this PR's init-level test (simbids, in the slurm-docker-ci container), from a config with output_dir: outputs/fmriprep_anat-24-1-1 and an argless {builtin: zip} at post_run.

code/hooks/zip.sh — the static zip built-in, copied in verbatim (git-tracked, so you can read exactly what will run)
#!/bin/bash
# Built-in babs `zip` hook: archive a BIDS App output folder, commit the
# archive, and remove the granular outputs it replaces.
#
# Copied verbatim at `babs init` into `code/hooks/zip.sh`. Runs at the
# `post_run` splice point (cwd = the job's dataset clone) as a separate
# process: `subid` (and `sesid` at session level) arrive via the exported
# splice-point contract; what to zip arrives as arguments:
#
#   zip.sh <path> [<name>]
#
# <path> is the folder to zip, relative to the dataset root. <name> is the
# archive-name stem (the X in ${subid}[_${sesid}]_X.zip), defaulting to
# <path>'s basename. The archive is written to the dataset root and contains
# the folder itself (not its parents), matching the layout of babs zips to
# date.
set -e -u -x

path="$1"
name="${2:-$(basename "$path")}"
zip_dir="$(dirname "$path")"
zip_folder="$(basename "$path")"

# subid is exported by the splice-point subshell in participant_job.sh;
# sesid only at session level, so its presence encodes the processing level.
# shellcheck disable=SC2154
ZIP_ID="${subid}${sesid:+_${sesid}}"
ZIP_NAME="${ZIP_ID}_${name}.zip"

# Zip real file content, not annex symlinks:
datalad unlock "${path}"

# cd into the parent so the archive contains the folder at its top level;
# OLDPWD (the dataset root) is where the archive lands.
datalad run \
	--explicit \
	--output "${ZIP_NAME}" \
	-m "Zip ${path} for ${ZIP_ID}" \
	-- \
	"cd ${zip_dir} && 7z a \"\${OLDPWD}/${ZIP_NAME}\" ${zip_folder}"

# `datalad run --explicit` does not track deletions, so the granular outputs
# are removed in a separate commit (workaround for datalad/datalad#7822,
# since fixed upstream).
# TODO research which datalad version shipped the datalad/datalad#7822 fix;
# once babs's minimum supported datalad is at or above it, fold this removal
# into the datalad run above and drop this step.
git rm -rf -q --sparse "${path}"
git commit -m "Remove ${path} for ${ZIP_ID} (zipped)"
code/participant_job.sh — the run's -o declares the granular output_dir; the post_run splice runs the zip hook with its resolved argument
# datalad run:
datalad run \
	-i "code/simbids-0-0-3_run.sh" \
	-i "inputs/data/BIDS/${subid}" \
	-i "inputs/data/BIDS/dataset_description.json" \
	-i "containers/.datalad/environments/simbids-0-0-3/image" \
	--expand inputs \
	--explicit \
	-o "outputs/fmriprep_anat-24-1-1" \
	-m "simbids-0-0-3 ${subid}" \
    "bash ./code/simbids-0-0-3_run.sh ${subid} "
# post_run hooks: spliced after the run, before push; subshell exports the contract.
(
  export subid BRANCH PROJECT_ROOT JOB_SCRATCH_DIR
bash ./code/hooks/zip.sh outputs/fmriprep_anat-24-1-1
)

# Finish up:
# push result file content to output RIA storage:
echo '# Push result file content to output RIA storage:'
datalad push --to output-storage
analysis dataset history after babs init — the static hook rides the imported-files commit
80bdd80 Save anything in folder code/ that hasn't been saved
2f4b876 Template for job submission
dc34509 Record of inclusion/exclusion of participants/sessions
695f62a Import files
3d4cab1 Generate script of running container
1e4f86f [DATALAD] Added subdataset
465ea59 Register input data dataset 'BIDS' as a subdataset
059b1da Initial save of babs_proj_config.yaml
62358f3 Save .gitignore file
580eaf7 Apply YODA dataset setup
45194cd [DATALAD] new dataset

Remaining before review-ready

  • The clean break needs your sign-off: legacy keys hard-error with a migration message, no deprecation shim. Acceptable, or do you need a deprecation path?
  • babs status with zipping off: OutputDataset currently hardcodes is_zipped=True, so merged-results accounting silently reports nothing when no zip hook is configured — needs the zip-presence persisted; follow-up, design open.
  • Remove the TODO-fenced demo block from the init test (it exists to produce the artifacts above).
  • Test state at last push: 99/103 green; 4 session-level render-shellcheck failures on non-*prep apps (suspected: sesid="$2" now unused once the in-script zip is gone → SC2034) — to fix, plus a fresh full-e2e run.

Known trade-off to weigh at review: post-run zip vs within-run zip

With the hook owning its own datalad run, each job commits granular outputs first, then the zip (two commits). The granular annex keys stay in history and the git-annex branch even after git rm — so babs merge at ~1000-subject scale gets strictly heavier than today's within-run zip. Is the two-commit shape tolerable at your scale?

If it isn't, the design answer we'd propose is not coupling zip back into the run script, but a second seam: command decoration — a configurable prefix/suffix on the command inside the datalad run wrapper (prepend_run/append_run), so zip happens within the run and only the archive is ever committed. That seam already has a second independent consumer: per-job resource monitoring with duct (#356), which also needs to wrap the wrapped command and can't be expressed as a pre_run/post_run hook. Caveat to weigh: a decoration lands in the recorded rerun command, so datalad rerun then requires the decorating tool on the host.

asmacdo and others added 15 commits June 10, 2026 10:02
Add two optional, ordered hook lists to the participant job script:
- pre_app: spliced before the `datalad run` wrapper
- post_run: spliced after the run, before the push to output storage

Each splice region is a subshell that exports the contract vars
(subid, BRANCH, PROJECT_ROOT, JOB_SCRATCH_DIR; plus sesid at session
level) so separate-process hook scripts read them by name, while the
export cannot leak into the datalad run / container. processing_level
is intentionally not exported: it is a render-time constant, and sesid
presence already distinguishes subject vs session.

Hooks reach the template as list[str]; resolving the YAML hooks: block
into those strings is a later step. With no hooks configured the render
is byte-identical to before (jinja {% if %} guard + trim/lstrip).

Thread hook_pre_app/hook_post_run (default None) through
generate_submit_script(); both existing call sites are unaffected.

Add render tests: no-hooks (None == [], no markers, shellcheck),
pre_app+post_run expansion (contract export, snippet order, position
around the run wrapper, shellcheck), and session-only sesid export.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
First resolution step for the participant_job splice points (8b4185a):
turn a `hooks:` config block into the command strings those splice points
consume, plus the files babs materializes at init. This is what lets
cross-cutting concerns -- optional zip, NORDIC, a BIDS-validator gate --
live as user-configurable hooks in one codepath instead of diverged
templates. Design: PennLINC#365.

babs/hooks.py (pure, no I/O): resolve_hooks(hooks_config, *, source_base)
classifies each entry into Verbatim (form a, raw snippet) or CopyIn
(form b, `{script: <path>}` copied to code/hooks/<basename>.sh). Render
(templated built-ins / container hooks) is defined as the forward-compat
seam but not yet produced. Fail-fast ValueError on bad config.
CopyIn.as_import() emits the {original_path, analysis_path} shape
_init_import_files already consumes, so a later slice wires materialization
with no duplicated copy logic.

tests/test_hooks.py: 21 unit tests (run in the slurm-docker-ci container;
conftest pulls slurm deps at collection).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A hook collision is about the materialized file, not the name alone. Key the
check on descriptor equality: the *same* hook reused at multiple splice
points (e.g. a BIDS validator at both pre_app and post_run) is copied once
and referenced from each list, while two *different* descriptors claiming the
same code/hooks/<name>.sh still raise. Because Render carries `context`, this
also catches one template rendered two ways into the same name (a real
conflict) once Render is wired.

tests: same-source-at-both-points materializes once; different-source-same-
name still collides; Render equality distinguishes context (pins the
invariant the rule depends on, since resolve_hooks doesn't emit Render yet).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow babs's existing convention for copied-in local files instead of
inventing a parallel one. `imported_files.original_path` -- which form (b)
reuses via _init_import_files -- is an absolute local path used as-is
(op.exists + copy, no cwd-join, no abspath validation); mechababs likewise
resolves the inclusion file to absolute before passing it ("so datalad run cp
doesn't need the cwd to match"). So resolve_hooks no longer takes source_base
and no longer joins: CopyIn.original_path is the `script:` value verbatim, and
_init_import_files raises FileNotFoundError on a bad path exactly as it does
for imported_files today.

This also lands on the right side of the path-handling review on
PennLINC#369: source paths are absolute (the explicitly-supported
copy-in case), while the destination (code/hooks/<name>.sh) stays relative
and is containment-validated by _validate_name.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Connect the resolver to babs for the single-app codepath:

- Render: Container.generate_bash_participant_job resolves
  `self.config.get('hooks')` and passes hook_pre_app/hook_post_run to
  generate_submit_script (container.py).
- Materialize: babs_bootstrap appends the resolved CopyIns'
  as_import() dicts to the existing shared _init_import_files call, so
  hook scripts copy into code/hooks/ with no duplicated import logic.
- _init_import_files now makedirs the destination's parent (hooks land in
  code/hooks/, which -- unlike flat code/ -- doesn't pre-exist).

The pipeline call site (bootstrap.py:508) is left unwired by design: it
defaults hook_pre_app=None so pipeline renders byte-identical, and pipeline
mode is slated for deletion once NORDIC is a hook. No hooks configured =>
no code/hooks/ dir and no splice subshell (both are guarded), so projects
without hooks are untouched.

Tests:
- test_babs_init_raw_bids: splice a form-(b) contract-guard hook at both
  pre_app and post_run. It runs as a separate process, so it only sees the
  contract vars (subid/BRANCH/PROJECT_ROOT/JOB_SCRATCH_DIR) because the
  splice subshell exports them; `${var:?}` fails the job under set -e if any
  is unset. Same source at both points also exercises copy-once dedup. Plus
  a post-init assertion that the guard is materialized + spliced -- a dropped
  hook would otherwise pass silently (no guard left to fail).
- test_babs_init_single_app_hooks: init-only wiring check (no job execution)
  -- asserts a configured hook lands in code/hooks/ and is referenced twice.
- test_babs_init_list_sub_file: assert no code/hooks/ dir when no hooks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the alphabetic form labels in docstrings/comments with names derived
from the config syntax, so a reader doesn't have to decode "(a)/(b)/(c)":

  (a) raw snippet  -> snippet      (Verbatim)
  (b) user script  -> script       (CopyIn)
  (c) container    -> templated built-in  (Render)

"templated built-in" names the Render *mode* (babs renders a shipped
*.sh.jinja2 into code/hooks/), which serves both a zip hook and a
container-running hook (e.g. nordic) alike -- they differ only in what the
shipped template does, not in how it's produced. So there is no distinct
"container" form; a container-running hook is just a {builtin: <name>} whose
template composes a singularity run.

Comments/docstrings only -- no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The two splice points bracket the same `datalad run`, so name them for that
boundary: pre_run / post_run (was pre_app / post_run, which mixed "app" and
"run" vocab). Renames the config key, the hook_pre_app -> hook_pre_run template
param, and the test markers. Config-key churn is cheapest now, before the keys
are public. (Review feedback, PR1-review Q8.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a `hooks` section to the config-yaml docs: the pre_run/post_run splice
points, the snippet and script entry forms, the exported runtime contract
(subid/sesid/BRANCH/PROJECT_ROOT/JOB_SCRATCH_DIR) + subshell/set -e semantics,
and -- most importantly -- that hooks splice OUTSIDE the datalad run wrapper, so
a hook that writes files leaves uncommitted changes that aren't pushed; the safe
default is validation that fails the job, and persisting output requires the
hook to own its own datalad run/save. (Review feedback, PR1-review Q1/Q2/Q10/Q11.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two different scripts sharing a basename within one splice point produced
"(in 'pre_run' and 'pre_run')". Name the single point when prior_point ==
point, the pair otherwise. Adds test_same_point_same_name_collide and tightens
the cross-point test to assert the full location. (Review feedback, PR1-review Q12.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A {builtin: <name>} entry now resolves to a Render of the shipped
templates/hooks/<name>.sh.jinja2; keys beyond `builtin` become the
per-hook part of the render context (e.g. zip's optional `path`).
Bootstrap still gates with NotImplementedError until the next step
wires the actual rendering; CopyIns keep flowing to _init_import_files.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`_init_render_hooks` renders each Render's shipped template
(babs/templates/hooks/) into code/hooks/<name>.sh at init and datalad
saves, replacing the NotImplementedError gate. Render context = the
per-hook config params plus the derived part (processing_level), with
derived values taking precedence.

zip.sh.jinja2 is the first built-in: unlock -> 7z inside its own
`datalad run --explicit` -> remove the granular outputs in a separate
commit (the datalad/datalad#7822 workaround, TODO'd for removal once
babs's minimum datalad carries the fix). One hook = one zip; the
archive is named ${subid}[_${sesid}]_<basename(path)>.zip and contains
basename(path) at its top level.

Also wraps a pre-existing >99-char line in hooks.py that failed ruff.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Asserts {builtin: zip, path: ...} renders code/hooks/zip.sh (id, archive
name, 7z line, granular removal), splices once at post_run, and lands
the "Materialize built-in hook scripts" init commit. The TODO-fenced
demo block copies the rendered artifacts into demo-output/ for
inspection after a docker run; remove before PR assembly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
BUILTIN_PARAMS maps each built-in to its accepted per-hook params
(zip -> {path}). An unknown builtin or a typo'd param (e.g. `paht:`)
now fails at resolve time instead of flowing silently into the render
context (StrictUndefined only catches missing context, not unexpected
keys). The registry also subsumes name validation for built-ins: only
known names resolve, so a path-escaping name can't reach the template
loader.

Also wraps a >99-char line in test_hooks.py that failed ruff.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The inner run script stops being named for zipping (zip moves to the
post_run hook). generate_submit_script grows an optional `output_dir`
(mutually exclusive with the pipeline-only `zip_foldernames`): when
given, the datalad run declares the app output folder itself as the
output, committing granular results. Single-app still passes
zip_foldernames until the config-surface swap lands; pipeline mode is
untouched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t_dir

The top-level output_dir (the folder the app writes into, carrying the
versioned derivative name) is now the single source for the app write
dir, the datalad run -o declaration, and the default zip-hook path.
Legacy keys hard-error with a migration message
(app_output_settings_from_config stays for pipeline mode; dies in PR 3).

- bidsapp_run.sh.jinja2 de-zipped: cmd_zip gone, and the granular
  outputs survive the script (the run now commits them; the zip hook
  owns their removal)
- the zip built-in becomes a STATIC script taking <path> [<name>] as
  splice-site arguments (visible in participant_job.sh): path defaults
  to output_dir at resolve time, name (a free-form archive stem, the
  old map value's only real job) defaults to basename(path) at runtime,
  and sesid presence encodes the processing level. All zip instances
  share one code/hooks/zip.sh; multi-zip differs only in args. The
  Render seam stays reserved for the container-running built-ins (PR 3).
- notebooks/eg_*.yaml + test fixtures migrated; docs: output_dir
  section replaces zip_foldernames; raw_bids e2e flips post_run to
  [contract guard, zip] (guard before zip -- ordering is load-bearing)

WIP: not yet run in docker (units + e2e pending; docker held by the
PR-1 CI session).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@asmacdo

asmacdo commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Folded into #381 — per today's community meeting we're combining the splice-points PR and the zip-as-hook PR into one (a release will be cut before merge, so the additive/behavioral split bought nothing). #381 now carries the full stack including the config swap (output_dir + optional zip) and the static built-in form.

@asmacdo asmacdo closed this Jun 10, 2026
@asmacdo asmacdo deleted the hooks-pr2-zip-builtin branch June 10, 2026 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant