Harden datatype/derivatives extraction in filename parser#28
Conversation
_libBIDSsh_parse_filename derived the datatype and derivatives pipeline with substring greps piped through head/awk. Two latent defects: - Substring false-match: `grep -oE "(anat|...|func|...)"` matched the datatype anywhere in the path, so e.g. `study_func_proj/sub-01/anat/...` was mis-detected as `func` instead of `anat`. The same flaw let a directory like `myderivatives/` be read as a derivatives dataset. - SIGPIPE fragility: `grep ... | head -1 || echo "NA"` can have grep killed by SIGPIPE when head closes the pipe; under `set -o pipefail` the pipeline then reports failure and silently falls back to `NA`. Replace both with anchored bash regex on the path (no subshells, no pipes): datatype is the file's immediate parent directory matched as a whole component `(^|/)<datatype>$`; the derivatives pipeline is the component right after `(^|/)derivatives/`. generate_entity_patterns.sh now emits the anchored datatype regex to keep the generated block in sync. Add regression tests covering the substring false-match, emg detection, missing-datatype NA, derivatives extraction, and the `myderivatives` non-match. Suite: 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR hardens _libBIDSsh_parse_filename so datatype and derivatives are extracted from whole path components (avoiding substring false-matches) and avoids grep|head SIGPIPE fragility under set -o pipefail.
Changes:
- Replace substring/pipe-based datatype + derivatives parsing with anchored bash-regex extraction.
- Update
generate_entity_patterns.shto emit the anchored datatype regex for keeping generated blocks in sync. - Add a focused regression test covering substring false-matches,
emg, missing datatype →NA, derivatives extraction, and non-match formyderivatives.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
libBIDS.sh |
Switches datatype/derivatives extraction to anchored bash regex matching. |
generate_entity_patterns.sh |
Emits anchored datatype regex intended for _libBIDSsh_parse_filename. |
test_libBIDS.sh |
Adds regression tests for datatype/derivatives extraction edge cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local dir | ||
| dir=$(dirname "${path}") | ||
| if [[ ${dir} =~ (^|/)(anat|beh|dwi|eeg|emg|fmap|func|ieeg|meg|micr|motion|mrs|nirs|perf|pet|phenotype)$ ]]; then |
| if [[ ${path} =~ (^|/)derivatives/([^/]+)/ ]]; then | ||
| arr[derivatives]="${BASH_REMATCH[2]}" |
Bring harden-parsing (#28) into master
Stacked on #27 (base
align-bids-nomenclature). Fixes latent/fragile parsing in_libBIDSsh_parse_filenamesurfaced during the #21 review.Bugs
_libBIDSsh_parse_filenamederived the datatype and the derivatives pipeline with substring greps piped throughhead/awk:grep -oE "(anat|…|func|…)"matched the datatype anywhere in the path. A dataset path likestudy_func_proj/sub-01/anat/sub-01_T1w.nii.gzwas reported asdatatype=funcinstead ofanat. The same flaw letmyderivatives/…be read as a derivatives dataset.grep … | head -1 || echo "NA"can havegrepkilled by SIGPIPE whenheadcloses the pipe; underset -o pipefailthe pipeline reports failure and silently falls back toNA.Fix
Anchored bash regex on the path — no subshells, no pipes, deterministic:
(^|/)<datatype>$;(^|/)derivatives/.generate_entity_patterns.shnow emits the anchored datatype regex so the generated block stays in sync.Tests
New
test_parse_filename_datatype_derivativescovers: thefunc-substring false-match,emgdetection, missing-datatype →NA, derivatives extraction, and themyderivativesnon-match.Verification
bash -n+shellcheckclean on all three files../test_libBIDS.sh→ 9/9 pass.datatype=anat.ds000117still extractsmeg_derivatives/meg;ds001datatype distribution unchanged (32 anat / 96 func).🤖 Generated with Claude Code