Harden datatype/derivatives extraction in filename parser #28

Merged: gdevenyi merged 1 commit into align-bids-nomenclature from harden-parsing on Jun 11, 2026

@gdevenyi

Stacked on #27 (base align-bids-nomenclature). Fixes latent/fragile parsing in _libBIDSsh_parse_filename surfaced during the #21 review.

Bugs

_libBIDSsh_parse_filename derived the datatype and the derivatives pipeline with substring greps piped through head/awk:

  1. Substring false-match (correctness). grep -oE "(anat|…|func|…)" matched the datatype anywhere in the path. A dataset path like study_func_proj/sub-01/anat/sub-01_T1w.nii.gz was reported as datatype=func instead of anat. The same flaw let myderivatives/… be read as a derivatives dataset.
  2. SIGPIPE fragility. grep … | head -1 || echo "NA" can have grep killed by SIGPIPE when head closes the pipe; under set -o pipefail the pipeline reports failure and silently falls back to NA.
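
Both defects can be reproduced outside the library. The sketch below is illustrative only: the shortened `(anat|func)` alternation and the `derivatives/[^/]+` pattern are stand-ins, not the original patterns from libBIDS.sh, and `pipeA` is a made-up pipeline name. The SIGPIPE part uses `yes` as an endless producer so the failure is deterministic; with a single short path the same failure would depend on timing.

```bash
#!/usr/bin/env bash
# Stand-in reproduction of the two defects; patterns here are assumptions,
# not the code that was removed from libBIDS.sh.

path="study_func_proj/sub-01/anat/sub-01_T1w.nii.gz"

# 1. Substring false-match: grep -o prints matches in order of appearance,
#    so head -1 picks "func" from the dataset directory name.
old_datatype=$(printf '%s\n' "${path}" | grep -oE "(anat|func)" | head -1)
echo "unanchored datatype: ${old_datatype}"      # unanchored datatype: func

# The same happens for a directory that merely ends in "derivatives".
old_pipeline=$(printf '%s\n' "myderivatives/pipeA/sub-01/anat/sub-01_T1w.nii.gz" \
  | grep -oE "derivatives/[^/]+" | head -1 | awk -F/ '{print $2}')
echo "unanchored derivatives: ${old_pipeline}"   # unanchored derivatives: pipeA

# 2. SIGPIPE fragility: head exits after one line, so the next write by grep
#    hits a closed pipe. Under pipefail the whole pipeline then reports a
#    non-zero status (typically 141, i.e. 128 + SIGPIPE).
set -o pipefail
rc=0
yes func | grep -oE "func" | head -1 >/dev/null || rc=$?
echo "pipeline status under pipefail: ${rc}"
```

A non-zero `rc` in the last step is what makes a trailing `|| echo "NA"` fire even though `head` already received a valid match.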

Fix

Anchored bash regex on the path — no subshells, no pipes, deterministic:

  • datatype = file's immediate parent directory, matched as a whole component (^|/)<datatype>$;
  • derivatives pipeline = component right after (^|/)derivatives/.
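
A minimal sketch of the two anchored matches, using the regexes quoted in the review threads on this PR. `parse_path` and `pipeA` are hypothetical names, and the `NA` fallback for the derivatives field is an assumption (the PR states it only for the datatype). This is not the body of `_libBIDSsh_parse_filename`.

```bash
#!/usr/bin/env bash
# Sketch only: parse_path is a hypothetical helper, not a libBIDS.sh function.
parse_path() {
  local path="$1" dir datatype="NA" pipeline="NA"
  # Keeping each regex in a variable avoids quoting surprises inside [[ =~ ]].
  local datatype_re='(^|/)(anat|beh|dwi|eeg|emg|fmap|func|ieeg|meg|micr|motion|mrs|nirs|perf|pet|phenotype)$'
  local derivatives_re='(^|/)derivatives/([^/]+)/'

  # Datatype: the file's immediate parent directory, as a whole component.
  dir=$(dirname "${path}")
  if [[ ${dir} =~ ${datatype_re} ]]; then
    datatype="${BASH_REMATCH[2]}"
  fi

  # Pipeline: the component right after a whole "derivatives" component.
  if [[ ${path} =~ ${derivatives_re} ]]; then
    pipeline="${BASH_REMATCH[2]}"
  fi

  printf 'datatype=%s derivatives=%s\n' "${datatype}" "${pipeline}"
}

parse_path "study_func_proj/sub-01/anat/sub-01_T1w.nii.gz"
# datatype=anat derivatives=NA
parse_path "myderivatives/pipeA/sub-01/func/sub-01_task-rest_bold.nii.gz"
# datatype=func derivatives=NA
parse_path "derivatives/pipeA/sub-01/func/sub-01_task-rest_bold.nii.gz"
# datatype=func derivatives=pipeA
parse_path "sub-01/sub-01_scans.tsv"
# datatype=NA derivatives=NA
```

In both patterns capture group 1 is the `(^|/)` anchor, so the value of interest is `BASH_REMATCH[2]`.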

generate_entity_patterns.sh now emits the anchored datatype regex so the generated block stays in sync.
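
One way a generator could build that anchored regex from a list of datatypes is sketched below. It is a guess at the approach, not the contents of generate_entity_patterns.sh: `join_by_pipe` and the `datatypes` array are hypothetical, and the datatype list is copied from the regex shown in the review thread.

```bash
#!/usr/bin/env bash
# Hypothetical generator sketch; the real generate_entity_patterns.sh may
# obtain its datatype list and emit its output differently.
datatypes=(anat beh dwi eeg emg fmap func ieeg meg micr motion mrs nirs perf pet phenotype)

# Join all arguments with "|" by setting IFS locally and expanding "$*".
join_by_pipe() {
  local IFS='|'
  printf '%s' "$*"
}

# Anchor the alternation as a whole trailing path component.
datatype_regex="(^|/)($(join_by_pipe "${datatypes[@]}"))\$"

# Emit the line intended for the generated block. The single quotes keep
# ${dir} literal in the output; only %s is substituted.
printf 'if [[ ${dir} =~ %s ]]; then\n' "${datatype_regex}"
```

Generating the alternation from a single list means that adding a datatype only requires editing the list and regenerating the block.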

Tests

New test_parse_filename_datatype_derivatives covers: the func-substring false-match, emg detection, missing-datatype → NA, derivatives extraction, and the myderivatives non-match.

Verification

  • bash -n + shellcheck clean on all three files.
  • ./test_libBIDS.sh: 9/9 pass.
  • Repro now returns datatype=anat.
  • Real dataset ds000117 still extracts meg_derivatives / meg; ds001 datatype distribution unchanged (32 anat / 96 func).

🤖 Generated with Claude Code

_libBIDSsh_parse_filename derived the datatype and derivatives pipeline
with substring greps piped through head/awk. Two latent defects:

- Substring false-match: `grep -oE "(anat|...|func|...)"` matched the
  datatype anywhere in the path, so e.g. `study_func_proj/sub-01/anat/...`
  was mis-detected as `func` instead of `anat`. The same flaw let a
  directory like `myderivatives/` be read as a derivatives dataset.
- SIGPIPE fragility: `grep ... | head -1 || echo "NA"` can have grep
  killed by SIGPIPE when head closes the pipe; under `set -o pipefail`
  the pipeline then reports failure and silently falls back to `NA`.

Replace both with anchored bash regex on the path (no subshells, no
pipes): datatype is the file's immediate parent directory matched as a
whole component `(^|/)<datatype>$`; the derivatives pipeline is the
component right after `(^|/)derivatives/`. generate_entity_patterns.sh
now emits the anchored datatype regex to keep the generated block in
sync.

Add regression tests covering the substring false-match, emg detection,
missing-datatype NA, derivatives extraction, and the `myderivatives`
non-match. Suite: 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 11, 2026 16:48
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

Copilot AI left a comment

Pull request overview

This PR hardens _libBIDSsh_parse_filename so datatype and derivatives are extracted from whole path components (avoiding substring false-matches) and avoids grep|head SIGPIPE fragility under set -o pipefail.

Changes:

  • Replace substring/pipe-based datatype + derivatives parsing with anchored bash-regex extraction.
  • Update generate_entity_patterns.sh to emit the anchored datatype regex for keeping generated blocks in sync.
  • Add a focused regression test covering substring false-matches, emg, missing datatype → NA, derivatives extraction, and non-match for myderivatives.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| libBIDS.sh | Switches datatype/derivatives extraction to anchored bash regex matching. |
| generate_entity_patterns.sh | Emits anchored datatype regex intended for _libBIDSsh_parse_filename. |
| test_libBIDS.sh | Adds regression tests for datatype/derivatives extraction edge cases. |

Comment thread libBIDS.sh
Comment on lines +255 to +257
local dir
dir=$(dirname "${path}")
if [[ ${dir} =~ (^|/)(anat|beh|dwi|eeg|emg|fmap|func|ieeg|meg|micr|motion|mrs|nirs|perf|pet|phenotype)$ ]]; then
Comment thread libBIDS.sh
Comment on lines +265 to +266
if [[ ${path} =~ (^|/)derivatives/([^/]+)/ ]]; then
arr[derivatives]="${BASH_REMATCH[2]}"
@gdevenyi merged commit 1b333f2 into align-bids-nomenclature on Jun 11, 2026
2 checks passed
gdevenyi added a commit that referenced this pull request Jun 11, 2026