Skip to content

Adding common_paths#376

Open
djarecka wants to merge 5 commits into
PennLINC:mainfrom
djarecka:enh/extra_paths
Open

Adding common_paths#376
djarecka wants to merge 5 commits into
PennLINC:mainfrom
djarecka:enh/extra_paths

Conversation

@djarecka

Copy link
Copy Markdown
Collaborator

closes #374

Adding common_paths to input dataset config for files from dataset top level (not non-subject).
Replaces the hard-coded dataset_description.json in sparse-checkout.

djarecka and others added 3 commits May 13, 2026 15:00
Replaces the hard-coded `dataset_description.json` in sparse-checkout,
`datalad get`, and `datalad run -i` with a configurable `common_paths`
list on each input dataset entry. Defaults to `["dataset_description.json"]`
to preserve existing behaviour. An empty list disables all common-path
inclusion.

Closes PennLINC#374

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add common_paths to the section overview list and optional sections list,
add a stub required_files section, and add a full common_paths section
with examples and usage notes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@djarecka djarecka changed the title Enh/extra paths Adding common_paths May 13, 2026
@djarecka djarecka requested review from asmacdo and tien-tong May 13, 2026 20:20
Add common paths to the submit script test inputs.

No need to add to the zipped ones, they are skipped.
@asmacdo

asmacdo commented May 21, 2026

Copy link
Copy Markdown
Collaborator

Opened djarecka#1 to fixup the tests. Otherwise looks good to me!

@tien-tong tien-tong added the enhancement New feature or request label May 22, 2026
@asmacdo

asmacdo commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

I've found an additional requirement for how this needs to function that changes how the PR should be implemented.

The problem I hit

sparse-checkout doesn't include root-level bids sidecars in the per-subject job. (BIDS Inheritance Principle), e.g. a top-level task-rest_bold.json carrying RepetitionTime/PhaseEncodingDirection. In my case, the fmriprep jobs failed on valid data with KeyError: 'RepetitionTime' and Missing readout timing information across several OpenNeuro datasets.

what we need to do

common_paths (the sparse-checkout + datalad get) needs to respect the BIDS Inheritance Principle. Playing with this briefly, that means:

  • common_paths needs to respect globbing. The inherited sidecars vary per dataset (task names differ), so users can't enumerate them — a glob like *.json is the only dataset-agnostic way, and keeps the bootstrapped job from carrying hard-coded paths.
  • The default common_paths should include the root-level sidecars (not just dataset_description.json), so valid BIDS works out of the box instead of silently dropping inherited metadata.
  • For processing_level: session, the default should also include the subject-level sidecars. A session job sparse-checks-out sub-XX/ses-YY, which misses files sitting directly under sub-XX/ — the subject tier between root and session, which also inherits down. (Subject-level jobs are fine: they get the whole sub-XX/ subtree already.)

In inheritance terms the tiers are root → sub-XX/sub-XX/ses-YY/ → modality. The sub-XX/ses-YY subtree already covers the session + modality tiers; root is this PR; the subject tier is the only remaining gap, and only for session-level jobs.

AI-generated pseudocode (one way to implement) + gotchas

Gotchas found while testing, that shape the approach:

  • datalad get -n "<ds>/*.json" errors on a quoted glob (exit 1). datalad never globs — the shell does, and the template necessarily quotes the arg, so the glob reaches datalad literally and fails. Under the job's set -e that kills the job. So a glob can't be passed through verbatim — resolve it to literal paths first.
  • In git sparse-checkout --no-cone, a bare *.json over-matches into other subjects' trees (gitignore-style * crosses /). Root-anchoring avoids it.
  • Resolve with git ls-tree --name-only HEAD, not by globbing the working tree. ls-tree reads the committed tree, so it's independent of checkout/sparse state (robust to step reordering) and, being non-recursive, is naturally root-anchored. Resolution still happens at job runtime, so the scaffolded script carries the pattern and adapts if the dataset changes.

Default (in input_dataset.py): make the root sidecars the default, e.g. common_paths defaults to ['*.json'] (covers dataset_description.json too; add '*.tsv' if you also want scans/sessions/events inheritance). Keep [] to disable, as in the PR.

Template (participant_job.sh.jinja2):

{# Resolve common_paths (globs allowed) at runtime, from git. ls-tree reads HEAD
   -> independent of checkout/sparse state, and non-recursive -> root-anchored.
   We resolve to LITERAL paths because `datalad get -n` can't take a glob. #}
resolve_tier() {            # $1 = dir relative to dataset root ('' = dataset root)
  git -C "{{ input_dataset['path_in_babs'] }}" ls-tree HEAD "$1" \
    | awk '$2 == "blob" { print $4 }' \
    | while IFS= read -r f; do
        for pat in {% for p in input_dataset['common_paths'] %}'{{ p }}' {% endfor %}; do
          case "${f##*/}" in $pat) printf '%s\n' "$f"; break ;; esac
        done
      done
}

mapfile -t common_paths < <(resolve_tier '')                       # root tier (always)
{% if processing_level == 'session' %}
mapfile -O "${#common_paths[@]}" -t common_paths < <(resolve_tier "${subid}")  # subject tier
{% endif %}

# Hand the resolved LITERAL paths to all three consumers:
for c in "${common_paths[@]}"; do
  datalad get -n "{{ input_dataset['path_in_babs'] }}/${c}"        # no-op for in-git, fetches annexed
done
{ echo "${subid}{% if processing_level == 'session' %}/${sesid}{% endif %}"
  printf '%s\n' "${common_paths[@]}"
} | ( cd "{{ input_dataset['path_in_babs'] }}" && git sparse-checkout set --stdin )
inputs=(); for c in "${common_paths[@]}"; do inputs+=( -i "{{ input_dataset['path_in_babs'] }}/${c}" ); done
# datalad containers-run ... "${inputs[@]}" ...

Notes: this must run after the input subdataset is installed (so ls-tree can read its HEAD). It's illustrative — BIDS filenames have no spaces, so the mapfile/word-splitting is fine in practice, but a real impl may want to harden that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add extra_paths to input dataset config to retain non-subject files (e.g. nidm.ttl)

3 participants