Skip to content

fix(sieve): utilize constrained depth file discovery for dependency checks#225

Open
Jaydeep869 wants to merge 2 commits into
kusari-oss:mainfrom
Jaydeep869:fix-issue-221-nested-dependencies
Open

fix(sieve): utilize constrained depth file discovery for dependency checks#225
Jaydeep869 wants to merge 2 commits into
kusari-oss:mainfrom
Jaydeep869:fix-issue-221-nested-dependencies

Conversation

@Jaydeep869

Copy link
Copy Markdown
Contributor

Summary

Fixes #221. Modified the Sieve dependency checkers (file_exists_handler and regex_handler) by switching from unbounded recursive globbing to a constrained depth search (os.walk limited to 3 directories deep). This efficiently discovers nested manifests like src/app/package.json while explicitly rejecting large unneeded directories (.git, node_modules, venv, etc.) without incurring the massive performance penalty of full recursive scans.

Type of Change

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)

Framework Changes Checklist

If this PR modifies the darnit framework (packages/darnit/):

  • Updated framework spec (openspec/specs/framework-design/spec.md) if behavior changed
  • Ran uv run python scripts/validate_sync.py --verbose and it passes
  • Ran uv run python scripts/generate_docs.py and committed any doc changes

Control/TOML Changes Checklist

If this PR modifies controls or TOML configuration:

  • Control metadata defined in TOML (not Python code)
  • SARIF fields (description, severity, help_url) included where appropriate
  • Ran validation to confirm TOML schema compliance

Testing

  • Tests pass locally (uv run pytest tests/ -v)
  • Added tests for new functionality (if applicable)
  • Linting passes (uv run ruff check .)

Additional Notes

Added test_pass_when_file_exists_nested to tests/darnit/sieve/test_builtin_handlers.py to assert that files hidden two folders deep (e.g. src/app/package.json) are correctly flagged by the updated file_exists_handler logic.

@mlieberman85 mlieberman85 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! CI lint is red — uv run ruff check --fix . will clear the W293 / SIM114 warnings in the new helper. Once that's clean, two things need a closer look:

  1. Possible bug — at builtin_handlers.py:64, for file in files + dirs: iterates over the combined files-and-directories lists from os.walk. That means a directory named package.json would match a package.json file pattern. Did you mean for file in files:?

  2. Sieve-wide blast radiusmax_depth=3 is now the default for every control using file_exists and regex, not just the nested-dependency case from #221. That's a behavioral change across ~60 controls.

    For this PR, let's keep the current default behavior unchanged (unbounded recursive glob) and make the new depth-limited search opt-in via TOML — controls that need it set max_depth = 3 explicitly, everyone else keeps today's semantics. That keeps scope tight and unblocks #221 without risking regressions elsewhere.

    I'll open a follow-up issue to explore whether depth-limited search should become the default later (with a proper audit-regression check).

Minor: the pattern.replace("**/", "") strip-and-retry fallback isn't standard glob semantics. If you need recursive matching with a depth cap, pathlib.Path.rglob plus a depth filter would be more predictable than the current branching.

@mlieberman85

Copy link
Copy Markdown
Contributor

Follow-up issue for evaluating depth-limited search as the default: #226

@mlieberman85

Copy link
Copy Markdown
Contributor

Thanks for the work here, @Jaydeep869. I dug into this and there's a gap I want to flag before we merge.

The PR ships the capability but no control opts in

After commit 21399f3 made max_depth opt-in (which is the right call — see below), I checked whether anything actually uses it:

$ grep -c max_depth packages/darnit-baseline/openssf-baseline.toml
0

I then reproduced issue #221 directly against this branch:

# /tmp/nested-test/src/app/package.json exists
config = {
    "files": ["package.json", "package-lock.json", "requirements.txt",
              "pyproject.toml", "Cargo.toml", "go.mod", "go.sum",
              "pom.xml", "build.gradle", "Gemfile.lock"],
}  # exact OSPS-BR-05.01 config from the TOML
ctx = HandlerContext(local_path="/tmp/nested-test", control_id="OSPS-BR-05.01")

file_exists_handler(config, ctx)
# Status: fail
# Msg: None of the required files found

config["max_depth"] = 3
file_exists_handler(config, ctx)
# Status: pass
# Msg: Required file found: src/app/package.json

So as currently configured, a user running darnit audit against a monorepo with src/app/package.json will see the same FAIL they did before the PR. The fix needs the corresponding TOML opt-in.

Why opt-in is correct (don't change this part)

Several controls intentionally check root-only because finding the file anywhere would be a false positive: LICENSE, CODE_OF_CONDUCT.md, SECURITY.md, CONTRIBUTING.md. A repo that vendors a dep with vendor/somelib/LICENSE shouldn't pass a license check on the transitive copy — the question is "does this project declare X" not "is there an X anywhere in the tree." The ignore_dirs set doesn't cover Go vendor/, git submodules, third_party/, etc., so a global default-on would silently flip several controls into false positives. That violates our conservative-by-default principle.

Suggested fix

Add max_depth = 3 to the controls that should match nested manifests — the dependency/build controls, not metadata files. For #221 specifically:

[[controls."OSPS-BR-05.01".passes]]
handler = "file_exists"
max_depth = 3
files = [
    "package.json", "package-lock.json", ...
]

Worth applying to other dep-adjacent controls too (e.g. OSPS-BR-04.01 if it checks lockfiles). Please skip LICENSE/SECURITY/CODE_OF_CONDUCT-style controls.

Smaller items while we're here

  • The match expression is hard to scan due to or/and precedence:

    if PurePath(rel_path).match(pattern) or pattern.startswith("**/") and fnmatch.fnmatch(rel_path, pattern[3:]) or fnmatch.fnmatch(rel_path, pattern):

    Python parses this as A or (B and C) or D. Splitting into named branches would make the intent clearer.

  • dirs.clear() at depth >= max_depth means directory matches at the boundary depth are skipped (verified: a target/ dir at depth 2 with max_depth=2 returns [] while a target.txt file at the same depth is found). Probably fine for these handlers since manifests are files, but worth a comment.

  • file_exists_handler now passes recursive=True to glob.glob for unbounded mode — it didn't before. This is a subtle behavior change for patterns like **/foo. Probably benign but uncalled-out.

  • Test coverage: the new test covers file_exists_handler with max_depth=3. No coverage for regex_handler with max_depth, and no test that ignore_dirs actually prunes.

  • New TOML schema field (max_depth) — the spec validation checkbox in the PR description is unchecked. Worth running uv run python scripts/validate_sync.py --verbose and updating the spec if it touches the schema docs.

@Jaydeep869 Jaydeep869 force-pushed the fix-issue-221-nested-dependencies branch from 21399f3 to 723300e Compare May 1, 2026 00:40
@Jaydeep869

Copy link
Copy Markdown
Contributor Author

Hey @mlieberman85, can you review this pr now.

@mlieberman85 mlieberman85 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Constrained Depth File Discovery

Overview

Fixes #221 by replacing root-only file discovery in file_exists_handler and regex_handler with a depth-bounded walk (os.walk + ignore-list pruning). New optional max_depth config field on passes; OSPS-BR-05.01 opts in with max_depth = 3 so monorepo manifests like src/app/package.json are found. Adds three tests.

What works

  • Targeted fix lands cleanly on the reported bug (BR-05.01) without changing other controls' default behavior.
  • os.walk + dirs[:] = [...] pruning is the correct idiom for skipping node_modules/.git cheaply.
  • Sorted, deduped output (sorted(set(found))) is deterministic.
  • Schema compatibility: PassConfig uses extra="allow", so max_depth flows through without schema changes.
  • Tests cover both directions: nested-found (positive) and node_modules-pruned (negative).

Issues to address

1. Docstring contradicts behavior (builtin_handlers.py:115)

max_depth: int - Max directory depth to search (default: 3)

Actual default is None (unbounded glob, root-only for non-globs). Either change the docstring to (default: None, no depth limit), or actually default to 3 in the handler. Misleading docstrings here are a footgun for TOML authors.

2. Subtle behavior change for file_exists_handler glob patterns
Old: glob.glob(path)recursive=False.
New (max_depth=None branch): glob.glob(path, recursive=True).
For * patterns this is identical, but **/... patterns now recurse where before they didn't. Practically no current TOML uses **/ with file_exists, but worth a note in the commit message and ideally a regression test asserting backward-compat for the unbounded path.

3. Code/comment mismatch on boundary-depth dirs (builtin_handlers.py:80–88)

if depth >= max_depth:
    dirs.clear()
...
for item in files + dirs:  # dirs is now empty

The comment claims "directory names at exactly the boundary depth are still yielded in dirs and can match" — but dirs.clear() empties the list before iteration, so they can't. Either fix the comment, or move the dirs.clear() to after the match loop if matching boundary-depth dirs is intentional. Low real-world impact, but the comment will mislead the next reader.

4. Issue #221 mentions go.mod and pyproject.toml — those live in other controls and didn't get max_depth = 3 here. If the goal is to fully resolve the bug as filed, audit other dependency-manifest controls (requirements.txt, Cargo.toml, go.mod, pyproject.toml, Gemfile, pom.xml, build.gradle) for the same opt-in. Otherwise the bug recurs for non-JS monorepos.

5. ignore_dirs set is JS/Python-biased
Missing common large dirs: target (Rust/Java), vendor (Go/PHP), bin, obj, .gradle, .next, .nuxt, .cache, .pytest_cache, .mypy_cache. Consider expanding — pruning is essentially free and avoids surprise slowdowns.

Suggestions

  • The three-strategy match (pathlib + fnmatch-stripped + plain fnmatch) is over-engineered for the patterns in use. PurePath(rel_path).match(pattern) alone covers all current TOML patterns and is easier to reason about. Consider simplifying unless there's a concrete pattern that needs the fallbacks.
  • Hoist imports (fnmatch, glob, PurePath) to the module top. The existing per-call import glob is a pre-existing wart, not a convention worth preserving.
  • Default max_depth = 3 for the helper, opt-out via None, rather than opt-in. Most controls would benefit from depth-limited search; few benefit from unbounded recursion. Follow-up PR if scope-limiting this one.

Risk

Low for the targeted scope (BR-05.01). Medium-low elsewhere — the recursive=True flip in the unbounded path is the main thing to watch in CI on real repos.

Test coverage

Adequate for the new code paths. Missing: a backward-compat assertion that file_exists_handler({"files": ["pkg.json"]}, ctx) (no max_depth) behaves exactly as before.

Verdict: Docstring fix (#1) and comment fix (#3) as blockers; the rest as follow-ups or commit-message notes.

Jaydeep869 added 2 commits May 8, 2026 01:45
…ecks

Add max_depth support to _find_files_with_depth() so dependency controls
can discover nested manifests (e.g. src/app/package.json) without
imposing unbounded recursive scans. The default behavior (max_depth=None)
remains unchanged—unbounded glob—keeping all existing controls unaffected.

Key changes:

- builtin_handlers.py: clarify match-expression precedence by splitting
  the or/and chain into named booleans (pathlib_match, recursive_match,
  simple_match). Document recursive=True in the unbounded-glob fast
  path, and add a comment explaining dirs.clear() semantics at the
  depth boundary.

- openssf-baseline.toml: opt OSPS-BR-05.01 into max_depth=3 so nested
  dependency manifests are correctly discovered. Metadata-file controls
  (LICENSE, SECURITY, CODE_OF_CONDUCT) are intentionally left unbounded
  to avoid false positives from vendored copies.

- tests: move test_pass_when_file_exists_nested into TestFileExistsHandler
  (was misplaced in TestProjectUpdateHandler), add coverage for
  regex_handler with max_depth, and add ignore_dirs pruning assertion.

Fixes kusari-oss#221

Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
- Update docstring for max_depth in file_exists_handler
- Fix boundary-depth directory iteration bug inside _find_files_with_depth
- Expand ignore_dirs list for other language ecosystems (target, vendor, etc.)
- Opt-in max_depth=3 for OSPS-QA-02.01 so standard manifests like go.mod are found
Note: Unbounded globs now default to passing recursive=True down to glob.

Signed-off-by: jaydeep869 <jaydeeppokhariya2106@gmail.com>
@Jaydeep869 Jaydeep869 force-pushed the fix-issue-221-nested-dependencies branch from e5b8315 to af83acf Compare May 7, 2026 20:15

@mlieberman85 mlieberman85 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the diagnosis and the patch -- this was a real bug and your investigation is good. Sharing some context that should have surfaced sooner.

Heads-up: this bug was fixed on main two months ago

While this PR was open, another fix for the same issue (#221) landed on main:

  • PR #259 -- fix(sieve): depth-limited file discovery via opt-in max_depth (closes #221)
  • Commit 2e2d55b, merged in May

Issue #221 was closed by that merge rather than by this PR. Sorry no one tagged you on this PR at the time -- that's on us, not on you.

What main currently has, vs. what your PR adds

Comparing your PR against main:

Area main (via #259) this PR Action
max_depth on file_exists_handler yes yes overlapping -- would conflict on rebase
_walk_depth_limited / _find_files_with_depth helper yes (generator, opt-in) yes (different shape, see inline) overlapping
TOML opt-in on OSPS-BR-05.01 yes (max_depth=2) yes (max_depth=3) overlapping -- main picked 2, you picked 3
max_depth on regex_handler and _regex_exclude_evidence no yes net-new value -- not on main
TOML opt-in on OSPS-QA-02.01 no yes (max_depth=3) net-new value -- not on main
Test test_pass_when_file_exists_nested (main has similar) yes overlapping
Test test_ignore_dirs_are_pruned (main has similar) yes overlapping
Test test_max_depth_finds_nested_files (regex) no yes net-new value

So roughly 60% of this PR overlaps with what's already on main, but ~40% is genuinely new and useful -- specifically the regex_handler support and the OSPS-QA-02.01 opt-in. That part shouldn't be lost.

Suggested path forward

Three options, in order of effort:

  1. Rebase and scope down. Drop the changes that overlap with #259, keep only the max_depth extension to regex_handler and the OSPS-QA-02.01 TOML opt-in. That would be ~50 lines instead of ~250 and would land cleanly. I'd lean toward this -- the regex_handler depth support is genuinely missing on main and would be welcome.
  2. Close this PR and open a smaller follow-up. Same outcome as option 1, but starting from a clean slate against main might be easier than rebasing through the file_exists conflicts.
  3. Close this PR without a follow-up. Acceptable -- the bulk of the user-facing bug is fixed.

If you do go with option 1 or 2, also worth reading the inline note on the helper's default behavior (recursive=True for max_depth is None) -- that's a backward-incompatible change for any control already using glob patterns at root level, and it's something main's design specifically avoided.

Brief technical comparison of the two approaches

Both PRs solve the same problem; the design choices differ:

  • Backward compatibility: Main's max_depth=0 (int) default is strictly root-only and preserves existing behavior bit-for-bit. Your max_depth=None default switches the underlying glob.glob to recursive=True, which means any existing control using glob patterns (e.g. a *.yml pattern at root) will silently start matching nested files. That's the kind of change that can cause surprise re-classifications on next audit. The inline comment expands on this.
  • Helper shape: Main uses a _walk_depth_limited(root, max_depth) generator yielding (dir, depth) tuples; the handlers iterate and check files themselves. Yours uses _find_files_with_depth(base_dir, pattern, max_depth) that returns a fully-materialised sorted+deduped list of matches. Both work; the generator approach is slightly more memory-efficient on giant trees but the difference is small in practice.
  • Pattern matching: Your three-strategy approach (pathlib glob + fnmatch on stripped path + plain fnmatch) handles **/-prefixed patterns more flexibly than main's plain glob.glob. That's a real point on the for-side that's worth porting if you do a follow-up -- main's pattern handling on the depth-walked path is fairly bare.
  • Ignore list: Yours has ~17 entries hardcoded in the helper; main uses a module-level constant _FILE_DISCOVERY_PRUNE_DIRS. Same intent; main's constant is reusable.

None of these design differences are "wrong" -- they're trade-offs, and the design that landed on main is the one to align with at this point.

Constitution alignment (for the parts not yet on main)

Principle Status Note
I. Plugin Separation OK Changes are within the framework package
II. Conservative-by-Default Mostly OK The default recursive=True change cuts against this -- see inline
III. TOML-First Architecture OK New behavior is opt-in via TOML max_depth = N
IV. Never Guess User Values OK No auto-detection; user explicitly sets max_depth
V. Sieve Pipeline Integrity OK Doesn't change PASS/FAIL/INCONCLUSIVE semantics for the handler outcomes

Tests

The test additions are solid: test_pass_when_file_exists_nested, test_ignore_dirs_are_pruned, and test_max_depth_finds_nested_files all assert the right behaviour. The regex one (test_max_depth_finds_nested_files) is the most valuable since it covers ground that main doesn't.

TL;DR

Genuine apologies that the bug got fixed in parallel without anyone looping you in. The regex_handler max_depth work and the OSPS-QA-02.01 opt-in are real wins that should land in some form -- I'd happily review a scoped-down follow-up. Otherwise, closing this PR is reasonable. Your call.

# across directory boundaries, consistent with pre-max_depth behavior.
if max_depth is None:
if "*" in pattern or "?" in pattern:
matches = glob.glob(os.path.join(base_dir, pattern), recursive=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behaviour change worth flagging if you rebase: when max_depth is None, this calls glob.glob(..., recursive=True), but the pre-existing code in file_exists_handler called glob.glob(...) without recursive=True. With recursive=True, the ** glob syntax now matches across directories, and even non-** patterns that happen to contain * get evaluated differently against nested files.

Net effect: any control already using a glob pattern at root (e.g., a *.yml pattern that previously only matched files in the repo root) will silently start matching files in subdirectories on the next audit -- a result re-classification with no TOML opt-in. The merged solution (#259) avoided this by keeping glob.glob(...) non-recursive at the default max_depth=0 and only walking subdirs when max_depth > 0. That preserves the previous semantics bit-for-bit for non-opt-in controls.

If you do a follow-up that keeps the regex_handler max_depth support, this is the one behaviour-changing bit I'd recommend dropping -- match main's default-is-root-only posture and only walk when explicitly opted in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Dependency detection (e.g., go.mod, pyproject.toml) fails if not at absolute repo root

2 participants