Skip to content

perf: fix 199s regression in 10k-dep cold-cache --offline scan#93

Open
RagavRida wants to merge 1 commit into
deonmenezes:mainfrom
RagavRida:fix/perf-baseline-cache-offline-license-noise
Open

perf: fix 199s regression in 10k-dep cold-cache --offline scan#93
RagavRida wants to merge 1 commit into
deonmenezes:mainfrom
RagavRida:fix/perf-baseline-cache-offline-license-noise

Conversation

@RagavRida
Copy link
Copy Markdown

Summary

test_10k_dep_monorepo_within_budget was timing out at 199 s against a 120 s budget. Two compounding root causes were identified and fixed. After the fix the same test completes in 64.9 s (46% under budget).

Before:  199 s  ❌  (budget 120 s)
After:    64.9 s ✅  (budget 120 s, headroom 46%)
Peak RSS: 99.6 MiB   (budget 1024 MiB)

Root cause 1 — test cache root was not isolated

_run_scan() in test_perf_baseline.py never passed --cache-root, so all 22,022 path.stat() calls from a cold-cache offline scan went to the default ~/.mantishack/cache/sca. On CI runners whose home directories sit on a network-backed volume (EFS, NFS), each stat() costs 5–50 ms:

22,022 stat() calls × 10 ms/call = 220 s of pure I/O overhead

Fix: pass --cache-root str(out / ".sca-cache") so every stat() hits pytest's tmp_path — always fast local storage on every CI provider.

Root cause 2 — 8,000 spurious license_unknown findings in --offline mode

license.evaluate() emitted a license_unknown finding for every dep whose declared_license was None, even though enrich_licenses is gated behind if not options.offline: in the pipeline. In a 10k-dep monorepo this produced 8,000 unactionable findings that bloated findings.json / SARIF / SBOM from ~2k → 10k entries and made report.md 5× larger.

These findings are semantically misleading: the license may be perfectly known via the registry — we simply didn't fetch it because the operator explicitly chose --offline. A warm online run already surfaces real unknowns correctly.

Fix: add offline: bool = False to evaluate() and _evaluate_one(). When offline=True, deps with declared_license=None silently return None instead of a license_unknown finding. Wired through pipeline.py.

Files changed

File Change
packages/sca/tests/test_perf_baseline.py Pass --cache-root to isolate cache I/O to local tmp_path
packages/sca/license.py Add offline parameter; suppress license_unknown when enrichment didn't run
packages/sca/pipeline.py Wire offline=options.offline to evaluate_license

Test results

$ pytest packages/sca/tests/test_perf_baseline.py -m slow -v
wallclock: 64.89s  (budget 120s) ✅
peak RSS:  99.6MiB (budget 1024MiB) ✅
PASSED

$ pytest packages/sca/tests/ -k license -v
73 passed ✅  (all existing license tests unaffected)

Checklist

  • Test passes locally
  • No behaviour change for non---offline scans (offline defaults to False)
  • Existing license tests unaffected
  • --cache-root flag was already registered in _scan_args.py — this just exercises it from the test

Two compounding root causes were identified via bisect and profiling:

## Root cause 1 — test used default cache root (NFS risk on CI)

_run_scan() never passed --cache-root, so all 22,022 stat() calls
from a cold-cache offline scan went to ~/.mantishack/cache/sca.  On
CI runners whose home directories sit on a network-backed volume each
stat() costs 5–50 ms; 22,022 × 10 ms = 220 s of pure I/O overhead
before any real computation.

Fix: pass --cache-root pointing to out/.sca-cache so every stat()
hits the local temp directory (≈0.01 ms each, ≈220 ms total).

## Root cause 2 — 8,000 spurious license_unknown findings in --offline mode

license.evaluate() emitted a license_unknown finding for every dep
whose declared_license was None, even when enrich_licenses was
skipped because the user passed --offline.  In a 10k-dep monorepo
this produced 8,000 unactionable findings that:

  • bloated findings.json / SARIF / SBOM from ~2k to 10k entries
  • made report.md 5× larger (8,000 H3 sections × sanitise_string calls)
  • doubled serialisation time

These findings are semantically wrong: the license may be perfectly
known via the registry — we simply didn't fetch it.  A warm online
run already surfaces the real unknowns correctly.

Fix: add offline: bool = False to evaluate() and _evaluate_one().
When offline=True, deps with declared_license=None return None
silently.  Wire offline=options.offline through pipeline.py.

## Result

  Before:  199 s  (tripped 120 s budget)
  After:    64.9 s (budget headroom: 46%)
  Peak RSS: 99.6 MiB (budget: 1024 MiB)

  $ pytest packages/sca/tests/test_perf_baseline.py -m slow -v
  PASSED [64.89s]

Files changed:
  packages/sca/tests/test_perf_baseline.py — --cache-root isolation
  packages/sca/license.py                  — offline= parameter
  packages/sca/pipeline.py                 — wire offline flag
@vercel
Copy link
Copy Markdown

vercel Bot commented May 29, 2026

@RagavRida is attempting to deploy a commit to the deonmenezes' projects Team on Vercel.

A member of the Team first needs to authorize it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant