mechababs

Automation glue for running BIDS apps across many datasets on HPC clusters using BABS.

Concept

A mechababs run is the composition of three things:

A dataset — typically an OpenNeuro raw BIDS study (OpenNeuroDatasets/dsXXXXXX).
A pipeline — one of pipelines/*.yaml (mriqc, fmriprep-anat/minimal/resampling/full, etc.).
A cluster — one of clusters/*.yaml (currently dartmouth.yaml).

merge_config.py combines pipeline + cluster + per-run args into a single babs-config.yaml. execute-dataset.sh drives a single dataset end-to-end; spawn-all.sh fans the same workflow across many datasets in parallel via tmux.

Quick start

Before any long run on Kerberos/NFS clusters (Dartmouth):

Start a tmux session — tmux new -s mecha — so the run survives ssh disconnects. Reattach with tmux attach -t mecha.

Inside tmux, run krenew -b to keep your Kerberos ticket alive. Long runs (>10h) can outlive the ticket, causing stale NFS file handles and crashes.

# One-time setup: creates venv, installs babs + datalad, clones containers
bash setup-dev.sh
source .venv/bin/activate

Per-dataset workflow

1. Sniff the dataset

./sniff.sh https://github.com/OpenNeuroDatasets/<DATASET_ID>

Reports subjects, sessions, scan counts, and sizes per subject. Use the output to choose a processing level (next step).

2. Pick a processing level

Dataset shape	Processing level
No sessions	`subject` (default)
Few sessions (1-4) with light scans	`subject`
Many sessions (10+) or heavy scans per subject	`session`

For datasets with many sessions per subject, check how many sessions the first subject actually has, since it may differ from the dataset average. select-eligible-sub-ses.py (used by spawn-all.sh) picks the appropriate level automatically, based on which TSV OpenNeuroStudies exposes for the study.
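As a rough illustration of the table above, the manual choice can be written as a small function. This is only a sketch: `choose_processing_level` and its parameters are hypothetical names, the cutoff of 10 sessions is taken from the "Many sessions (10+)" row, and the table does not say what to do for 5 to 9 sessions, so this sketch falls back to `subject` there unless the scans are heavy. It is not how select-eligible-sub-ses.py decides.

```python
def choose_processing_level(n_sessions: int, heavy_scans: bool = False,
                            many_sessions: int = 10) -> str:
    """Return 'subject' or 'session' following the table above.

    Assumption: 5-9 sessions are not covered by the table; they are
    treated like the "few sessions" row unless heavy_scans is set.
    """
    if n_sessions == 0:
        # No session directories at all: only subject-level is possible.
        return "subject"
    if heavy_scans or n_sessions >= many_sessions:
        return "session"
    return "subject"
```

For example, `choose_processing_level(12)` selects `session`, which would then be passed as `--processing-level session` in the next step.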

3. Run

DATASET_ID=ds000113
duct -p logs/${DATASET_ID}-mriqc/ \
  bash execute-dataset.sh \
    --dataset-url https://github.com/OpenNeuroDatasets/${DATASET_ID} \
    --pipeline pipelines/mriqc-24.0.2.yaml \
    --cluster clusters/dartmouth.yaml \
    --working-dir processing/${DATASET_ID}-mriqc \
    --output derivative-datasets/${DATASET_ID}-mriqc \
    [--processing-level session] \
    [--inclusion-file <path>] \
    [--submit-only]

execute-dataset.sh does, in order:

1. Merge configs (merge_config.py) → babs-config.yaml
2. babs init: creates the babs project, clones dataset + container dataset
3. Pin inclusion (if --inclusion-file given): datalad run copies the CSV into analysis/code/inclusion.csv so the actually-scheduled subjects are recorded in git history
4. Pull container image (datalad get the SIF)
5. babs submit: submits SLURM jobs (with --inclusion-file if provided)
6. babs status --wait: polls until jobs finish
7. finalize.sh: babs merge, clone from output RIA, datalad-get archives and duct logs, extract zips

--submit-only stops after step 5: jobs are submitted, then the script exits without steps 6–7 (no babs status --wait, no finalize). Used by staged deployments that poll + merge by hand (e.g. deployments/june-1-fmriprep/).

A sentinel file is written at <working-dir>/.status on exit:

exit_code=<int>
completed_at=<ISO-8601 UTC>
dataset_url=<...>
pipeline=<...>

Use this to scan many runs without attaching to each tmux pane.
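One way to do that scan is sketched below. It assumes only the `key=value` layout shown above; the helper names `read_status` and `summarize_runs` are hypothetical and not part of this repository.

```python
from pathlib import Path


def read_status(path):
    """Parse a key=value sentinel file into a dict of strings."""
    status = {}
    for line in Path(path).read_text().splitlines():
        # Split on the first "=" only, so URLs containing "=" stay intact.
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            status[key.strip()] = value.strip()
    return status


def summarize_runs(root):
    """Yield (run directory, exit code or None) for every .status under root."""
    for status_file in sorted(Path(root).rglob(".status")):
        code = read_status(status_file).get("exit_code")
        yield status_file.parent, (int(code) if code is not None else None)
```

Iterating over `summarize_runs("processing")` and printing the pairs whose exit code is not 0 gives a quick list of runs that need attention.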

4. Recover from interruption

If jobs finished but the run was killed before finalize, rerun just finalize:

bash finalize.sh \
  --working-dir processing/${DATASET_ID}-mriqc \
  --output derivative-datasets/${DATASET_ID}-mriqc

5. Troubleshooting

Job failed? Check babs status <working-dir>/babs-project, then look at the SLURM log inside the output RIA.
HTML reports? Serve via python -m http.server from the derivative dir. Don't datalad unlock annexed figures.

add-archive-content failed in finalize? Re-run manually:

cd derivative-datasets/<run>
bash -c 'for f in *.zip; do
  datalad add-archive-content -D --allow-dirty --no-commit \
    --existing overwrite --strip-leading-dirs --leading-dirs-depth 1 \
    --annex-options="--no-check-gitignore" "$f"
done'
datalad save -m "Extract archives"

mriqc INT64 crash? Known issue on some datasets (e.g. ds002685); record and skip.
Container not found? Re-run setup-dev.sh to refresh repronim-containers/.

Parallel runs

spawn-all.sh fans a pipeline across every row in the candidates CSV — one detached tmux session per dataset, each running execute-dataset.sh end-to-end.

bash spawn-all.sh \
    --pipeline pipelines/mriqc-24.0.2.yaml \
    --cluster clusters/dartmouth.yaml \
    --experiment parallel-exp1 \
    [--candidates priority-openneuro-datasets.csv] \
    [--per-dataset-count N] \
    [--dry-run]

For each dataset, spawn-all.sh:

Runs select-eligible-sub-ses.py to produce processing/<experiment>/<ds>-<pipeline>/inclusion.csv (and prints the matching --processing-level).
Skips the dataset if 0 rows are eligible.
Spawns tmux new -d -s mecha-<ds>-<pipeline> 'execute-dataset.sh ...' with --inclusion-file pointing at the CSV.
Sleeps 600s between spawns to avoid datalad/git-annex/NFS contention during babs init (5 min was insufficient on 2026-05-05).

--dry-run writes inclusion CSVs and prints would-spawn commands without launching tmux.
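The per-dataset spawn step can be pictured roughly as follows. This is a sketch, not the contents of spawn-all.sh: `build_spawn_command` is a hypothetical helper, and how the script derives the short pipeline name from the YAML path is not shown here, so the name is passed in explicitly. The flags and path layout are the ones documented in this README.

```python
import shlex


def build_spawn_command(dataset_id, pipeline_name, pipeline_yaml, cluster_yaml,
                        experiment, processing_level="subject"):
    """Return the tmux argv that would launch one dataset in a detached session."""
    run = f"{dataset_id}-{pipeline_name}"
    workdir = f"processing/{experiment}/{run}"
    inner = [
        "bash", "execute-dataset.sh",
        "--dataset-url", f"https://github.com/OpenNeuroDatasets/{dataset_id}",
        "--pipeline", pipeline_yaml,
        "--cluster", cluster_yaml,
        "--working-dir", workdir,
        "--output", f"derivative-datasets/{experiment}/{run}",
        "--processing-level", processing_level,
        "--inclusion-file", f"{workdir}/inclusion.csv",
    ]
    # tmux takes the command to run as a single shell string.
    return ["tmux", "new", "-d", "-s", f"mecha-{run}", shlex.join(inner)]
```

Printing the returned list instead of executing it corresponds to what --dry-run reports as a would-spawn command.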

Per-experiment layout

processing/<experiment>/<ds>-<pipeline>/
    babs-config.yaml
    inclusion.csv                  # staging copy from select-eligible
    babs-project/analysis/code/
        inclusion.csv              # pinned via datalad run (= what babs submit consumed)
    .status                        # sentinel: exit code on completion

derivative-datasets/<experiment>/<ds>-<pipeline>/
    sub-*.zip / extracted contents
    logs/duct_*                    # duct logs of the per-subject jobs

logs/<experiment>/<ds>-<pipeline>/  # duct log of the spawn-all wrapper

The <experiment> namespace lets multiple passes coexist (parallel-exp1/, parallel-exp2/, …).

Eligibility selection

select-eligible-sub-ses.py fetches per-study metadata from OpenNeuroStudies (sourcedata+subjects+sessions.tsv or, on 404, the subject-level TSV) and filters rows by pipeline:

Pipeline	Rule
`mriqc`	`'anat' in datatypes` AND `t1w_num > 0`
`fmriprep`	`'anat' in datatypes` AND `'func' in datatypes` AND `t1w_num > 0` AND `bold_num > 0`

Output CSV has columns sub_id and (optionally) ses_id, matching what babs submit --inclusion-file expects. The processing level (subject or session) is printed to stdout.
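The rules in the table can be sketched as a filter over metadata rows. This is an illustration under stated assumptions, not the code in select-eligible-sub-ses.py: it assumes each row is a dict whose `datatypes` field is a comma-separated string, that counts arrive as strings or integers, and that rows already carry `sub_id` (and possibly `ses_id`) keys. The function names are hypothetical.

```python
import csv
import io


def is_eligible(row, pipeline):
    """Apply the per-pipeline rule from the table above to one metadata row."""
    datatypes = {d.strip() for d in row.get("datatypes", "").split(",")}
    t1w = int(row.get("t1w_num") or 0)
    bold = int(row.get("bold_num") or 0)
    if pipeline == "mriqc":
        return "anat" in datatypes and t1w > 0
    if pipeline == "fmriprep":
        return ("anat" in datatypes and "func" in datatypes
                and t1w > 0 and bold > 0)
    raise ValueError(f"no eligibility rule for pipeline {pipeline!r}")


def inclusion_csv(rows, pipeline):
    """Return inclusion CSV text: sub_id, plus ses_id when rows carry sessions."""
    eligible = [r for r in rows if is_eligible(r, pipeline)]
    has_sessions = any(r.get("ses_id") for r in eligible)
    fields = ["sub_id", "ses_id"] if has_sessions else ["sub_id"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(eligible)
    return out.getvalue()
```

A dataset whose filtered list is empty would produce only the header line, which matches the case where spawn-all.sh skips the dataset.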

For ad-hoc single-subject smoke tests, hand-write a one-row CSV instead:

printf "sub_id\nsub-s003\n" > inclusion.csv
bash execute-dataset.sh ... --inclusion-file inclusion.csv

Configuration

Pipeline YAMLs (pipelines/) hold container info + BIDS-app flags + zip foldernames.
Cluster YAMLs (clusters/) hold SLURM resource templates + script preamble (per-job /tmp bind, etc.).
merge_config.py merges the two plus --dataset-url into a single babs-config.yaml that babs init consumes. It preserves YAML-declared input_datasets (e.g. for chained-anat fmriprep stages).
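The merge described above can be pictured as a recursive dictionary merge over the parsed YAML. The sketch below is an assumption about the general shape, not the actual merge_config.py: the function names are hypothetical, the YAML files are represented as plain dicts, and the `input_datasets` / `BIDS` / `origin_url` layout is assumed for illustration.

```python
import copy


def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def build_babs_config(pipeline, cluster, dataset_url):
    """Combine pipeline and cluster settings, then attach the raw dataset."""
    config = deep_merge(pipeline, cluster)
    # Keep any input_datasets the pipeline already declares (assumed layout)
    # and add or update only the raw BIDS entry.
    inputs = config.setdefault("input_datasets", {})
    inputs.setdefault("BIDS", {})["origin_url"] = dataset_url
    return config
```

Because the merge works on copies, the pipeline and cluster dicts can be reused unchanged for the next dataset.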

To add a new pipeline: copy an existing pipelines/*.yaml, change container + flags + zip foldername, run it.

To add a new cluster: copy clusters/dartmouth.yaml, adjust SLURM resources + script_preamble, run a smoke test on it.

Docs

CLAUDE.md — project conventions, the pipeline, venv rules, working agreement.
design/ — architecture proposals.

Upstream

OpenNeuroStudies — superdataset
OpenNeuroDerivatives — derivative mirrors
BABS — execution engine
ReproNim/containers — container datasets
FAIRly Big processing workflow — the pattern
STAMPED principles — guiding principles
