Adding common_paths by djarecka · Pull Request #376 · PennLINC/babs

djarecka · 2026-05-13T20:20:05Z

closes #374

Adding common_paths to input dataset config for files from dataset top level (not non-subject).
Replaces the hard-coded dataset_description.json in sparse-checkout.

Replaces the hard-coded `dataset_description.json` in sparse-checkout, `datalad get`, and `datalad run -i` with a configurable `common_paths` list on each input dataset entry. Defaults to `["dataset_description.json"]` to preserve existing behaviour. An empty list disables all common-path inclusion. Closes PennLINC#374 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add common_paths to the section overview list and optional sections list, add a stub required_files section, and add a full common_paths section with examples and usage notes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add common paths to the submit script test inputs. No need to add to the zipped ones, they are skipped.

asmacdo · 2026-05-21T14:01:44Z

Opened djarecka#1 to fixup the tests. Otherwise looks good to me!

Fixup tests

asmacdo · 2026-06-05T17:30:33Z

I've found an additional requirement for how this needs to function that changes how the PR should be implemented.

The problem I hit

sparse-checkout doesn't include root-level bids sidecars in the per-subject job. (BIDS Inheritance Principle), e.g. a top-level task-rest_bold.json carrying RepetitionTime/PhaseEncodingDirection. In my case, the fmriprep jobs failed on valid data with KeyError: 'RepetitionTime' and Missing readout timing information across several OpenNeuro datasets.

what we need to do

common_paths (the sparse-checkout + datalad get) needs to respect the BIDS Inheritance Principle. Playing with this briefly, that means:

common_paths needs to respect globbing. The inherited sidecars vary per dataset (task names differ), so users can't enumerate them — a glob like *.json is the only dataset-agnostic way, and keeps the bootstrapped job from carrying hard-coded paths.
The default common_paths should include the root-level sidecars (not just dataset_description.json), so valid BIDS works out of the box instead of silently dropping inherited metadata.
For processing_level: session, the default should also include the subject-level sidecars. A session job sparse-checks-out sub-XX/ses-YY, which misses files sitting directly under sub-XX/ — the subject tier between root and session, which also inherits down. (Subject-level jobs are fine: they get the whole sub-XX/ subtree already.)

In inheritance terms the tiers are root → sub-XX/ → sub-XX/ses-YY/ → modality. The sub-XX/ses-YY subtree already covers the session + modality tiers; root is this PR; the subject tier is the only remaining gap, and only for session-level jobs.

AI-generated pseudocode (one way to implement) + gotchas

Gotchas found while testing, that shape the approach:

datalad get -n "<ds>/*.json" errors on a quoted glob (exit 1). datalad never globs — the shell does, and the template necessarily quotes the arg, so the glob reaches datalad literally and fails. Under the job's set -e that kills the job. So a glob can't be passed through verbatim — resolve it to literal paths first.
In git sparse-checkout --no-cone, a bare *.json over-matches into other subjects' trees (gitignore-style * crosses /). Root-anchoring avoids it.
Resolve with git ls-tree --name-only HEAD, not by globbing the working tree. ls-tree reads the committed tree, so it's independent of checkout/sparse state (robust to step reordering) and, being non-recursive, is naturally root-anchored. Resolution still happens at job runtime, so the scaffolded script carries the pattern and adapts if the dataset changes.

Default (in input_dataset.py): make the root sidecars the default, e.g. common_paths defaults to ['*.json'] (covers dataset_description.json too; add '*.tsv' if you also want scans/sessions/events inheritance). Keep [] to disable, as in the PR.

Template (participant_job.sh.jinja2):

{# Resolve common_paths (globs allowed) at runtime, from git. ls-tree reads HEAD
   -> independent of checkout/sparse state, and non-recursive -> root-anchored.
   We resolve to LITERAL paths because `datalad get -n` can't take a glob. #}
resolve_tier() {            # $1 = dir relative to dataset root ('' = dataset root)
  git -C "{{ input_dataset['path_in_babs'] }}" ls-tree HEAD "$1" \
    | awk '$2 == "blob" { print $4 }' \
    | while IFS= read -r f; do
        for pat in {% for p in input_dataset['common_paths'] %}'{{ p }}' {% endfor %}; do
          case "${f##*/}" in $pat) printf '%s\n' "$f"; break ;; esac
        done
      done
}

mapfile -t common_paths < <(resolve_tier '')                       # root tier (always)
{% if processing_level == 'session' %}
mapfile -O "${#common_paths[@]}" -t common_paths < <(resolve_tier "${subid}")  # subject tier
{% endif %}

# Hand the resolved LITERAL paths to all three consumers:
for c in "${common_paths[@]}"; do
  datalad get -n "{{ input_dataset['path_in_babs'] }}/${c}"        # no-op for in-git, fetches annexed
done
{ echo "${subid}{% if processing_level == 'session' %}/${sesid}{% endif %}"
  printf '%s\n' "${common_paths[@]}"
} | ( cd "{{ input_dataset['path_in_babs'] }}" && git sparse-checkout set --stdin )
inputs=(); for c in "${common_paths[@]}"; do inputs+=( -i "{{ input_dataset['path_in_babs'] }}/${c}" ); done
# datalad containers-run ... "${inputs[@]}" ...

Notes: this must run after the input subdataset is installed (so ls-tree can read its HEAD). It's illustrative — BIDS filenames have no spaces, so the mapfile/word-splitting is fine in practice, but a real impl may want to harden that.

djarecka and others added 3 commits May 13, 2026 15:00

adding echo

fb21ddf

djarecka changed the title ~~Enh/extra paths~~ Adding common_paths May 13, 2026

djarecka requested review from asmacdo and tien-tong May 13, 2026 20:20

Fixup tests

443dcf6

Add common paths to the submit script test inputs. No need to add to the zipped ones, they are skipped.

Merge pull request #1 from asmacdo/pr-376

782685c

Fixup tests

tien-tong added the enhancement New feature or request label May 22, 2026

This was referenced Jun 5, 2026

minimal fmriprep fails: dataset-root inherited BIDS sidecars dropped from per-job sparse-checkout asmacdo/mechababs#41

Open

Epic: upstream babs dependencies (M2) asmacdo/mechababs#38

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding common_paths#376

Adding common_paths#376
djarecka wants to merge 5 commits into
PennLINC:mainfrom
djarecka:enh/extra_paths

djarecka commented May 13, 2026

Uh oh!

asmacdo commented May 21, 2026

Uh oh!

asmacdo commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

djarecka commented May 13, 2026

Uh oh!

asmacdo commented May 21, 2026

Uh oh!

asmacdo commented Jun 5, 2026

The problem I hit

what we need to do

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants