fix(sieve): utilize constrained depth file discovery for dependency checks#225
fix(sieve): utilize constrained depth file discovery for dependency checks#225Jaydeep869 wants to merge 2 commits into
Conversation
mlieberman85
left a comment
There was a problem hiding this comment.
Thanks for tackling this! CI lint is red — uv run ruff check --fix . will clear the W293 / SIM114 warnings in the new helper. Once that's clean, two things need a closer look:
-
Possible bug — at
builtin_handlers.py:64,for file in files + dirs:iterates over the combined files-and-directories lists fromos.walk. That means a directory namedpackage.jsonwould match apackage.jsonfile pattern. Did you meanfor file in files:? -
Sieve-wide blast radius —
max_depth=3is now the default for every control usingfile_existsandregex, not just the nested-dependency case from #221. That's a behavioral change across ~60 controls.For this PR, let's keep the current default behavior unchanged (unbounded recursive glob) and make the new depth-limited search opt-in via TOML — controls that need it set
max_depth = 3explicitly, everyone else keeps today's semantics. That keeps scope tight and unblocks #221 without risking regressions elsewhere.I'll open a follow-up issue to explore whether depth-limited search should become the default later (with a proper audit-regression check).
Minor: the pattern.replace("**/", "") strip-and-retry fallback isn't standard glob semantics. If you need recursive matching with a depth cap, pathlib.Path.rglob plus a depth filter would be more predictable than the current branching.
|
Follow-up issue for evaluating depth-limited search as the default: #226 |
|
Thanks for the work here, @Jaydeep869. I dug into this and there's a gap I want to flag before we merge. The PR ships the capability but no control opts inAfter commit 21399f3 made $ grep -c max_depth packages/darnit-baseline/openssf-baseline.toml
0I then reproduced issue #221 directly against this branch: # /tmp/nested-test/src/app/package.json exists
config = {
"files": ["package.json", "package-lock.json", "requirements.txt",
"pyproject.toml", "Cargo.toml", "go.mod", "go.sum",
"pom.xml", "build.gradle", "Gemfile.lock"],
} # exact OSPS-BR-05.01 config from the TOML
ctx = HandlerContext(local_path="/tmp/nested-test", control_id="OSPS-BR-05.01")
file_exists_handler(config, ctx)
# Status: fail
# Msg: None of the required files found
config["max_depth"] = 3
file_exists_handler(config, ctx)
# Status: pass
# Msg: Required file found: src/app/package.jsonSo as currently configured, a user running Why opt-in is correct (don't change this part)Several controls intentionally check root-only because finding the file anywhere would be a false positive: Suggested fixAdd [[controls."OSPS-BR-05.01".passes]]
handler = "file_exists"
max_depth = 3
files = [
"package.json", "package-lock.json", ...
]Worth applying to other dep-adjacent controls too (e.g. Smaller items while we're here
|
21399f3 to
723300e
Compare
|
Hey @mlieberman85, can you review this pr now. |
mlieberman85
left a comment
There was a problem hiding this comment.
Code Review: Constrained Depth File Discovery
Overview
Fixes #221 by replacing root-only file discovery in file_exists_handler and regex_handler with a depth-bounded walk (os.walk + ignore-list pruning). New optional max_depth config field on passes; OSPS-BR-05.01 opts in with max_depth = 3 so monorepo manifests like src/app/package.json are found. Adds three tests.
What works
- Targeted fix lands cleanly on the reported bug (BR-05.01) without changing other controls' default behavior.
os.walk+dirs[:] = [...]pruning is the correct idiom for skippingnode_modules/.gitcheaply.- Sorted, deduped output (
sorted(set(found))) is deterministic. - Schema compatibility:
PassConfigusesextra="allow", somax_depthflows through without schema changes. - Tests cover both directions: nested-found (positive) and
node_modules-pruned (negative).
Issues to address
1. Docstring contradicts behavior (builtin_handlers.py:115)
max_depth: int - Max directory depth to search (default: 3)
Actual default is None (unbounded glob, root-only for non-globs). Either change the docstring to (default: None, no depth limit), or actually default to 3 in the handler. Misleading docstrings here are a footgun for TOML authors.
2. Subtle behavior change for file_exists_handler glob patterns
Old: glob.glob(path) — recursive=False.
New (max_depth=None branch): glob.glob(path, recursive=True).
For * patterns this is identical, but **/... patterns now recurse where before they didn't. Practically no current TOML uses **/ with file_exists, but worth a note in the commit message and ideally a regression test asserting backward-compat for the unbounded path.
3. Code/comment mismatch on boundary-depth dirs (builtin_handlers.py:80–88)
if depth >= max_depth:
dirs.clear()
...
for item in files + dirs: # dirs is now emptyThe comment claims "directory names at exactly the boundary depth are still yielded in dirs and can match" — but dirs.clear() empties the list before iteration, so they can't. Either fix the comment, or move the dirs.clear() to after the match loop if matching boundary-depth dirs is intentional. Low real-world impact, but the comment will mislead the next reader.
4. Issue #221 mentions go.mod and pyproject.toml — those live in other controls and didn't get max_depth = 3 here. If the goal is to fully resolve the bug as filed, audit other dependency-manifest controls (requirements.txt, Cargo.toml, go.mod, pyproject.toml, Gemfile, pom.xml, build.gradle) for the same opt-in. Otherwise the bug recurs for non-JS monorepos.
5. ignore_dirs set is JS/Python-biased
Missing common large dirs: target (Rust/Java), vendor (Go/PHP), bin, obj, .gradle, .next, .nuxt, .cache, .pytest_cache, .mypy_cache. Consider expanding — pruning is essentially free and avoids surprise slowdowns.
Suggestions
- The three-strategy match (pathlib + fnmatch-stripped + plain fnmatch) is over-engineered for the patterns in use.
PurePath(rel_path).match(pattern)alone covers all current TOML patterns and is easier to reason about. Consider simplifying unless there's a concrete pattern that needs the fallbacks. - Hoist imports (
fnmatch,glob,PurePath) to the module top. The existing per-callimport globis a pre-existing wart, not a convention worth preserving. - Default
max_depth = 3for the helper, opt-out viaNone, rather than opt-in. Most controls would benefit from depth-limited search; few benefit from unbounded recursion. Follow-up PR if scope-limiting this one.
Risk
Low for the targeted scope (BR-05.01). Medium-low elsewhere — the recursive=True flip in the unbounded path is the main thing to watch in CI on real repos.
Test coverage
Adequate for the new code paths. Missing: a backward-compat assertion that file_exists_handler({"files": ["pkg.json"]}, ctx) (no max_depth) behaves exactly as before.
Verdict: Docstring fix (#1) and comment fix (#3) as blockers; the rest as follow-ups or commit-message notes.
…ecks Add max_depth support to _find_files_with_depth() so dependency controls can discover nested manifests (e.g. src/app/package.json) without imposing unbounded recursive scans. The default behavior (max_depth=None) remains unchanged—unbounded glob—keeping all existing controls unaffected. Key changes: - builtin_handlers.py: clarify match-expression precedence by splitting the or/and chain into named booleans (pathlib_match, recursive_match, simple_match). Document recursive=True in the unbounded-glob fast path, and add a comment explaining dirs.clear() semantics at the depth boundary. - openssf-baseline.toml: opt OSPS-BR-05.01 into max_depth=3 so nested dependency manifests are correctly discovered. Metadata-file controls (LICENSE, SECURITY, CODE_OF_CONDUCT) are intentionally left unbounded to avoid false positives from vendored copies. - tests: move test_pass_when_file_exists_nested into TestFileExistsHandler (was misplaced in TestProjectUpdateHandler), add coverage for regex_handler with max_depth, and add ignore_dirs pruning assertion. Fixes kusari-oss#221 Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
- Update docstring for max_depth in file_exists_handler - Fix boundary-depth directory iteration bug inside _find_files_with_depth - Expand ignore_dirs list for other language ecosystems (target, vendor, etc.) - Opt-in max_depth=3 for OSPS-QA-02.01 so standard manifests like go.mod are found Note: Unbounded globs now default to passing recursive=True down to glob. Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
e5b8315 to
af83acf
Compare
mlieberman85
left a comment
There was a problem hiding this comment.
Thanks for the diagnosis and the patch -- this was a real bug and your investigation is good. Sharing some context that should have surfaced sooner.
Heads-up: this bug was fixed on main two months ago
While this PR was open, another fix for the same issue (#221) landed on main:
- PR #259 --
fix(sieve): depth-limited file discovery via opt-in max_depth (closes #221) - Commit
2e2d55b, merged in May
Issue #221 was closed by that merge rather than by this PR. Sorry no one tagged you on this PR at the time -- that's on us, not on you.
What main currently has, vs. what your PR adds
Comparing your PR against main:
| Area | main (via #259) | this PR | Action |
|---|---|---|---|
max_depth on file_exists_handler |
yes | yes | overlapping -- would conflict on rebase |
_walk_depth_limited / _find_files_with_depth helper |
yes (generator, opt-in) | yes (different shape, see inline) | overlapping |
TOML opt-in on OSPS-BR-05.01 |
yes (max_depth=2) |
yes (max_depth=3) |
overlapping -- main picked 2, you picked 3 |
max_depth on regex_handler and _regex_exclude_evidence |
no | yes | net-new value -- not on main |
TOML opt-in on OSPS-QA-02.01 |
no | yes (max_depth=3) |
net-new value -- not on main |
Test test_pass_when_file_exists_nested |
(main has similar) | yes | overlapping |
Test test_ignore_dirs_are_pruned |
(main has similar) | yes | overlapping |
Test test_max_depth_finds_nested_files (regex) |
no | yes | net-new value |
So roughly 60% of this PR overlaps with what's already on main, but ~40% is genuinely new and useful -- specifically the regex_handler support and the OSPS-QA-02.01 opt-in. That part shouldn't be lost.
Suggested path forward
Three options, in order of effort:
- Rebase and scope down. Drop the changes that overlap with #259, keep only the
max_depthextension toregex_handlerand theOSPS-QA-02.01TOML opt-in. That would be ~50 lines instead of ~250 and would land cleanly. I'd lean toward this -- the regex_handler depth support is genuinely missing on main and would be welcome. - Close this PR and open a smaller follow-up. Same outcome as option 1, but starting from a clean slate against main might be easier than rebasing through the file_exists conflicts.
- Close this PR without a follow-up. Acceptable -- the bulk of the user-facing bug is fixed.
If you do go with option 1 or 2, also worth reading the inline note on the helper's default behavior (recursive=True for max_depth is None) -- that's a backward-incompatible change for any control already using glob patterns at root level, and it's something main's design specifically avoided.
Brief technical comparison of the two approaches
Both PRs solve the same problem; the design choices differ:
- Backward compatibility: Main's
max_depth=0(int) default is strictly root-only and preserves existing behavior bit-for-bit. Yourmax_depth=Nonedefault switches the underlyingglob.globtorecursive=True, which means any existing control using glob patterns (e.g. a*.ymlpattern at root) will silently start matching nested files. That's the kind of change that can cause surprise re-classifications on next audit. The inline comment expands on this. - Helper shape: Main uses a
_walk_depth_limited(root, max_depth)generator yielding(dir, depth)tuples; the handlers iterate and check files themselves. Yours uses_find_files_with_depth(base_dir, pattern, max_depth)that returns a fully-materialised sorted+deduped list of matches. Both work; the generator approach is slightly more memory-efficient on giant trees but the difference is small in practice. - Pattern matching: Your three-strategy approach (pathlib glob + fnmatch on stripped path + plain fnmatch) handles
**/-prefixed patterns more flexibly than main's plainglob.glob. That's a real point on the for-side that's worth porting if you do a follow-up -- main's pattern handling on the depth-walked path is fairly bare. - Ignore list: Yours has ~17 entries hardcoded in the helper; main uses a module-level constant
_FILE_DISCOVERY_PRUNE_DIRS. Same intent; main's constant is reusable.
None of these design differences are "wrong" -- they're trade-offs, and the design that landed on main is the one to align with at this point.
Constitution alignment (for the parts not yet on main)
| Principle | Status | Note |
|---|---|---|
| I. Plugin Separation | OK | Changes are within the framework package |
| II. Conservative-by-Default | Mostly OK | The default recursive=True change cuts against this -- see inline |
| III. TOML-First Architecture | OK | New behavior is opt-in via TOML max_depth = N |
| IV. Never Guess User Values | OK | No auto-detection; user explicitly sets max_depth |
| V. Sieve Pipeline Integrity | OK | Doesn't change PASS/FAIL/INCONCLUSIVE semantics for the handler outcomes |
Tests
The test additions are solid: test_pass_when_file_exists_nested, test_ignore_dirs_are_pruned, and test_max_depth_finds_nested_files all assert the right behaviour. The regex one (test_max_depth_finds_nested_files) is the most valuable since it covers ground that main doesn't.
TL;DR
Genuine apologies that the bug got fixed in parallel without anyone looping you in. The regex_handler max_depth work and the OSPS-QA-02.01 opt-in are real wins that should land in some form -- I'd happily review a scoped-down follow-up. Otherwise, closing this PR is reasonable. Your call.
| # across directory boundaries, consistent with pre-max_depth behavior. | ||
| if max_depth is None: | ||
| if "*" in pattern or "?" in pattern: | ||
| matches = glob.glob(os.path.join(base_dir, pattern), recursive=True) |
There was a problem hiding this comment.
Behaviour change worth flagging if you rebase: when max_depth is None, this calls glob.glob(..., recursive=True), but the pre-existing code in file_exists_handler called glob.glob(...) without recursive=True. With recursive=True, the ** glob syntax now matches across directories, and even non-** patterns that happen to contain * get evaluated differently against nested files.
Net effect: any control already using a glob pattern at root (e.g., a *.yml pattern that previously only matched files in the repo root) will silently start matching files in subdirectories on the next audit -- a result re-classification with no TOML opt-in. The merged solution (#259) avoided this by keeping glob.glob(...) non-recursive at the default max_depth=0 and only walking subdirs when max_depth > 0. That preserves the previous semantics bit-for-bit for non-opt-in controls.
If you do a follow-up that keeps the regex_handler max_depth support, this is the one behaviour-changing bit I'd recommend dropping -- match main's default-is-root-only posture and only walk when explicitly opted in.
Summary
Fixes #221. Modified the Sieve dependency checkers (
file_exists_handlerandregex_handler) by switching from unbounded recursive globbing to a constrained depth search (os.walklimited to 3 directories deep). This efficiently discovers nested manifests likesrc/app/package.jsonwhile explicitly rejecting large unneeded directories (.git,node_modules,venv, etc.) without incurring the massive performance penalty of full recursive scans.Type of Change
Framework Changes Checklist
If this PR modifies the darnit framework (
packages/darnit/):openspec/specs/framework-design/spec.md) if behavior changeduv run python scripts/validate_sync.py --verboseand it passesuv run python scripts/generate_docs.pyand committed any doc changesControl/TOML Changes Checklist
If this PR modifies controls or TOML configuration:
Testing
uv run pytest tests/ -v)uv run ruff check .)Additional Notes
Added
test_pass_when_file_exists_nestedtotests/darnit/sieve/test_builtin_handlers.pyto assert that files hidden two folders deep (e.g.src/app/package.json) are correctly flagged by the updated file_exists_handler logic.