perf: fix 199s regression in 10k-dep cold-cache --offline scan#93
Open
RagavRida wants to merge 1 commit into
Open
perf: fix 199s regression in 10k-dep cold-cache --offline scan#93RagavRida wants to merge 1 commit into
RagavRida wants to merge 1 commit into
Conversation
Two compounding root causes were identified via bisect and profiling: ## Root cause 1 — test used default cache root (NFS risk on CI) _run_scan() never passed --cache-root, so all 22,022 stat() calls from a cold-cache offline scan went to ~/.mantishack/cache/sca. On CI runners whose home directories sit on a network-backed volume each stat() costs 5–50 ms; 22,022 × 10 ms = 220 s of pure I/O overhead before any real computation. Fix: pass --cache-root pointing to out/.sca-cache so every stat() hits the local temp directory (≈0.01 ms each, ≈220 ms total). ## Root cause 2 — 8,000 spurious license_unknown findings in --offline mode license.evaluate() emitted a license_unknown finding for every dep whose declared_license was None, even when enrich_licenses was skipped because the user passed --offline. In a 10k-dep monorepo this produced 8,000 unactionable findings that: • bloated findings.json / SARIF / SBOM from ~2k to 10k entries • made report.md 5× larger (8,000 H3 sections × sanitise_string calls) • doubled serialisation time These findings are semantically wrong: the license may be perfectly known via the registry — we simply didn't fetch it. A warm online run already surfaces the real unknowns correctly. Fix: add offline: bool = False to evaluate() and _evaluate_one(). When offline=True, deps with declared_license=None return None silently. Wire offline=options.offline through pipeline.py. ## Result Before: 199 s (tripped 120 s budget) After: 64.9 s (budget headroom: 46%) Peak RSS: 99.6 MiB (budget: 1024 MiB) $ pytest packages/sca/tests/test_perf_baseline.py -m slow -v PASSED [64.89s] Files changed: packages/sca/tests/test_perf_baseline.py — --cache-root isolation packages/sca/license.py — offline= parameter packages/sca/pipeline.py — wire offline flag
|
@RagavRida is attempting to deploy a commit to the deonmenezes' projects Team on Vercel. A member of the Team first needs to authorize it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
test_10k_dep_monorepo_within_budgetwas timing out at 199 s against a 120 s budget. Two compounding root causes were identified and fixed. After the fix the same test completes in 64.9 s (46% under budget).Root cause 1 — test cache root was not isolated
_run_scan()intest_perf_baseline.pynever passed--cache-root, so all 22,022path.stat()calls from a cold-cache offline scan went to the default~/.mantishack/cache/sca. On CI runners whose home directories sit on a network-backed volume (EFS, NFS), eachstat()costs 5–50 ms:Fix: pass
--cache-root str(out / ".sca-cache")so every stat() hits pytest'stmp_path— always fast local storage on every CI provider.Root cause 2 — 8,000 spurious
license_unknownfindings in--offlinemodelicense.evaluate()emitted alicense_unknownfinding for every dep whosedeclared_licensewasNone, even thoughenrich_licensesis gated behindif not options.offline:in the pipeline. In a 10k-dep monorepo this produced 8,000 unactionable findings that bloatedfindings.json/ SARIF / SBOM from ~2k → 10k entries and madereport.md5× larger.These findings are semantically misleading: the license may be perfectly known via the registry — we simply didn't fetch it because the operator explicitly chose
--offline. A warm online run already surfaces real unknowns correctly.Fix: add
offline: bool = Falsetoevaluate()and_evaluate_one(). Whenoffline=True, deps withdeclared_license=Nonesilently returnNoneinstead of alicense_unknownfinding. Wired throughpipeline.py.Files changed
packages/sca/tests/test_perf_baseline.py--cache-rootto isolate cache I/O to localtmp_pathpackages/sca/license.pyofflineparameter; suppresslicense_unknownwhen enrichment didn't runpackages/sca/pipeline.pyoffline=options.offlinetoevaluate_licenseTest results
Checklist
--offlinescans (offlinedefaults toFalse)--cache-rootflag was already registered in_scan_args.py— this just exercises it from the test