feat(redact): scrub secrets + entropy substrings before remote LLMs by nirmalgupta · Pull Request #11 · leverj/security-scanner

nirmalgupta · 2026-06-02T22:24:37Z

Closes #3.

Summary

Adds security_scan/redact.py with redact_text, redact_obj, is_local_url.
Wires it into the three places source-derived content can reach a remote model: triage._finding_brief, runners/gemma._build_user_prompt, cross_validate (both _gemma_verdict and _codex_verdict, plus _finding_summary).
All three callsites also gate on is_local_url: Triage disables itself, the gemma runner returns failure, and the gemma direction of cross-validate is skipped when base_url/ollama_url isn't loopback/private/host.docker.internal.
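
The local-only gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in `security_scan/redact.py`: the function name `is_local_url_sketch`, the allow-listed hostnames, and the choice to treat unresolved hostnames as remote are all assumptions.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames accepted as local without DNS resolution (assumption for this sketch).
_LOCAL_HOSTNAMES = {"localhost", "host.docker.internal"}


def is_local_url_sketch(url: str) -> bool:
    """Return True when the URL's host is loopback, private, or an allow-listed name."""
    # Accept bare "host:port" values by adding a scheme so urlparse finds the host.
    parsed = urlparse(url if "://" in url else f"http://{url}")
    host = parsed.hostname
    if not host:
        return False
    if host in _LOCAL_HOSTNAMES:
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal and not allow-listed: treat as remote.
        return False
    return addr.is_loopback or addr.is_private


print(is_local_url_sketch("http://127.0.0.1:11434"))
print(is_local_url_sketch("https://api.example.com/v1"))
```

Failing closed on anything that is not an IP literal or an allow-listed name keeps the sketch free of DNS lookups; the real helper may behave differently.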

Patterns redacted

Known shapes: AWS access keys (AKIA/ASIA), GitHub tokens (ghp_/gho_/ghu_/ghs_/github_pat_), GitLab (glpat-), Stripe (sk_live_/sk_test_/..., whsec_), OpenAI/Anthropic-style sk-…, Slack (xox[abeprs]-, xapp-), Google API (AIza…, ya29.), SendGrid (SG.<id>.<secret>), Age private keys (AGE-SECRET-KEY-1…), JWTs, PEM key blocks, database/broker URLs with embedded credentials (postgres/mysql/mongodb+srv/redis/amqp/jdbc/...), Azure storage AccountKey=/SharedAccessKey=/SAS sig=, 32+ char hex digests.
Heuristic: any contiguous substring (≥ 20 chars from [A-Za-z0-9+/=_-]) with Shannon entropy ≥ 4.0 bits/char.
Assignment shapes: NAME=value or NAME: value where NAME hints at a secret (api_key, secret_access_key, password, token, bearer, jwt_secret, ...). Permissive left-boundary handles prefixed names like AWS_SECRET_ACCESS_KEY.
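
As an illustration of the entropy heuristic, the sketch below scores each maximal run of 20 or more characters from the stated alphabet and replaces runs at or above 4.0 bits/char. The helper names and the `<REDACTED:high-entropy>` placeholder are assumptions, and scoring only maximal runs is a simplification of "any contiguous substring".

```python
import math
import re
from collections import Counter

# Candidate runs: 20+ chars drawn from the alphabet named in the PR description.
_CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{20,}")
_MIN_ENTROPY = 4.0  # bits per character, as stated in the PR


def shannon_entropy(s: str) -> float:
    """Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def redact_high_entropy(text: str) -> str:
    """Replace high-entropy candidate runs with a placeholder (hypothetical label)."""
    def _sub(m: re.Match) -> str:
        token = m.group(0)
        if shannon_entropy(token) >= _MIN_ENTROPY:
            return "<REDACTED:high-entropy>"
        return token

    return _CANDIDATE.sub(_sub, text)


print(redact_high_entropy("BLOB = 'f4Z7q2pHk8wT3sNcRy9LbVxJgQmDeAo5'"))
print(redact_high_entropy("PAD = 'aaaaaaaaaaaaaaaaaaaaaaaa'"))
```

The first sample is the 32-character blob from the test fixture; all of its characters are distinct, so it scores log2(32) = 5 bits/char and is replaced. The repeated-character string scores 0 and passes through.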

Codex peer-review hardening

Ran the diff through codex for a second opinion. Addressed:

Truncate after redact (was truncating first; a credential straddling the cutoff would lose its prefix and slip through).
Path-sanitize _read_snippet against .. / absolute file_path escapes.
Drop sum()-then-list() anti-pattern on the file generator in gemma runner.
Several missed token shapes (GitLab, Stripe webhook, Azure SAS, DB URLs, age keys, hex digests).
Redact title in _finding_brief (was passing through verbatim).
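
The ordering issue in the first item can be reproduced with a small sketch. It assumes a single simplified AWS access key pattern and a head-truncating cutoff; the function names and the `<REDACTED:aws-key>` label are illustrative, not the project's.

```python
import re

# Simplified stand-in for one known-token pattern (AWS access key ID shape).
_AWS_KEY = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def redact(text: str) -> str:
    return _AWS_KEY.sub("<REDACTED:aws-key>", text)


def truncate_then_redact(text: str, limit: int) -> str:
    # Old order: the cut can split a key, so the regex no longer matches the remnant.
    return redact(text[:limit])


def redact_then_truncate(text: str, limit: int) -> str:
    # Fixed order: the whole key is replaced before any cut happens.
    return redact(text)[:limit]


sample = "x" * 10 + " AKIAIOSFODNN7EXAMPLE"
print(truncate_then_redact(sample, 20))  # a fragment of the key survives
print(redact_then_truncate(sample, 20))  # no key material remains
```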

Codex also flagged that cross_validate._codex_verdict runs codex with -C repo_dir and read-only repo access — Codex can inspect raw source via its own tools, so redacting the prompt isn't sufficient to prevent leakage on that direction. Noted as a follow-up; would require feeding Codex a tempdir-of-just-the-snippet rather than the full clone, which is a meaningful architectural change.

Test plan

31 unit tests for the redactor (tests/test_redact.py)
Wire-up assertions in test_triage.py, test_gemma_runner.py, test_cross_validate.py proving no plaintext secret reaches the network
Path-traversal test for _read_snippet
Full suite: 265 passed, ruff clean
Manual run against a repo with planted credentials post-merge

🤖 Generated with Claude Code

Closes #3.

Adds `security_scan/redact.py` with `redact_text`, `redact_obj`, and `is_local_url`. Wires it into the three exit points to remote models:

- triage._finding_brief: snippet, message, and extra dict pass through redact_text/redact_obj before serialization. Triage refuses to operate at all if base_url isn't loopback/private.
- runners.gemma: every file body is redacted before going into the prompt; the runner refuses to send anything when base_url isn't local.
- cross_validate: snippets handed to both gemma and codex validators are redacted; finding messages are redacted; gemma direction skipped when ollama_url isn't local.

Patterns covered: AWS access keys, GitHub tokens/PATs, Stripe, Slack, Google API, OpenAI/Anthropic-style sk-..., JWTs, PEM key blocks, and NAME=value assignments where NAME hints at a secret. Plus high-Shannon-entropy (>=4.0 bits/char over >=20 chars) substrings.

Tests: 31 unit tests for the redactor, plus wire-up assertions in test_triage.py, test_gemma_runner.py, and test_cross_validate.py that verify no plaintext secret reaches the network. Test fixtures deliberately split secret-shaped prefixes with string concat so source files don't contain the literal token shapes GitHub push protection detects.

Full suite 258 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Truncate AFTER redact in _finding_brief and cross_validate prompts (a credential straddling the cutoff would otherwise lose its prefix and slip the known-token regexes).
- Permissive left-boundary on the assignment regex so prefixed names like AWS_SECRET_ACCESS_KEY, DB_PASSWORD, JWT_SECRET match (the prior \b anchor treated `_` as a word char and failed those).
- Path-sanitize cross_validate._read_snippet: a finding emitting an absolute or `..`-escaped file_path no longer lets the validator read outside repo_dir.
- Stop sum()-then-list() on the Iterable in gemma._build_user_prompt; generators would have been silently consumed twice. Materialize once.
- New patterns: GitLab tokens (glpat-), Stripe webhook (whsec_), Slack app tokens (xapp-), Slack xoxe- variant, Google OAuth (ya29.), SendGrid (SG.<id>.<secret>), Age private key, Azure connection-string AccountKey/SharedAccessKey/SAS sig, DB/broker URLs with embedded credentials (postgres/mysql/mongodb/redis/amqp/jdbc/...), and a long hex-digest pattern that the entropy heuristic missed (hex's per-char entropy <= 4).
- Redact `title` in _finding_brief (was passing through verbatim).

258 → 265 tests; full suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
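
The sum()-then-list() pitfall mentioned in this commit can be shown in isolation. The generator and function names below are hypothetical and only mirror the shape of the problem, not the code in `gemma._build_user_prompt`.

```python
from typing import Iterable, Iterator, List, Tuple

FilePair = Tuple[str, str]


def file_bodies() -> Iterator[FilePair]:
    """Hypothetical generator yielding (path, body) pairs."""
    yield "a.py", "print('a')"
    yield "b.py", "print('b')"


def total_then_list(files: Iterable[FilePair]) -> Tuple[int, List[FilePair]]:
    # Anti-pattern: sum() exhausts a generator, so the later list() sees nothing.
    total = sum(len(body) for _, body in files)
    return total, list(files)


def materialize_once(files: Iterable[FilePair]) -> Tuple[int, List[FilePair]]:
    # Fix: consume the iterable a single time, then reuse the list.
    items = list(files)
    total = sum(len(body) for _, body in items)
    return total, items


print(total_then_list(file_bodies()))   # the list part comes back empty
print(materialize_once(file_bodies()))  # both files are retained
```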

Copilot

Pull request overview

Adds a centralized redaction layer to prevent hardcoded credentials (known token shapes and high-entropy substrings) from being sent to remote LLMs, and enforces a “local-only” policy for Ollama-based features to mitigate source/snippet leakage (Issue #3).

Changes:

Introduces security_scan/redact.py (redact_text, redact_obj, is_local_url) and unit tests covering token-shape + entropy redaction.
Wires redaction into triage, Gemma runner prompts, and cross-validation prompts; adds guardrails to skip/disable LLM paths when URLs aren’t considered local.
Updates docs/manifests/versioning for the new redaction behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| `security_scan/redact.py` | New redaction + local-URL gate utilities (patterns + entropy heuristic). |
| `security_scan/triage.py` | Disables triage on non-local base_url; redacts title/message/extra/snippet in `_finding_brief`. |
| `security_scan/runners/gemma.py` | Refuses non-local base_url; redacts file contents before building user prompt. |
| `security_scan/cross_validate.py` | Skips Gemma direction on non-local ollama_url; redacts snippet/message before validator prompts; path-sanitizes `_read_snippet`. |
| `tests/test_redact.py` | Adds comprehensive unit tests for token patterns, entropy heuristic, and is_local_url. |
| `tests/test_triage.py` | Adds assertions that triage brief redacts secrets and disables on remote base_url. |
| `tests/test_gemma_runner.py` | Adds assertions that prompts redact secrets and runner refuses non-local base_url. |
| `tests/test_cross_validate.py` | Adds assertions for redaction, path traversal refusal, and skipping Gemma on remote URL. |
| `README.md` | Documents redaction + local-only behavior for LLM integrations. |
| `SECURITY-SCAN-MANIFEST.yaml` | Bumps version and adds changelog entry describing redaction + refusal behavior. |
| `security_scan/__init__.py` | Bumps package version to 0.2.4. |
| `pyproject.toml` | Bumps project version to 0.2.4. |

+    (re.compile(
+        r"(?i)(?:^|[^A-Za-z0-9])(?P<k>(?:api[_-]?key|secret(?:[_-]?(?:access[_-]?)?key)?"
+        r"|token|password|passwd|auth(?:[_-]?token)?|bearer|client[_-]?secret"
+        r"|access[_-]?token|refresh[_-]?token|private[_-]?key|jwt[_-]?secret"
+        r"|db[_-]?password|database[_-]?password))"
+        r"['\"]?\s*[:=]\s*['\"]?(?P<v>[^\s'\";,]{8,})['\"]?"
+    ), lambda m: f"{m.group('k')}=<REDACTED:secret-like>"),
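
To see why the left boundary matters, the sketch below compares a `\b` anchor with a permissive boundary on prefixed names. It uses a shortened name list and writes the boundary as a lookbehind, which is a choice of this sketch so the character before the name survives substitution; the excerpt above uses a consuming group `(?:^|[^A-Za-z0-9])` instead.

```python
import re

# Shortened name list for illustration; the pattern in the diff covers more names.
_NAMES = r"(?:api[_-]?key|secret(?:[_-]?(?:access[_-]?)?key)?|password|token|jwt[_-]?secret)"
_VALUE = r"['\"]?\s*[:=]\s*['\"]?(?P<v>[^\s'\";,]{8,})['\"]?"

# \b treats "_" as a word character, so there is no boundary inside AWS_SECRET...
WORD_BOUNDARY = re.compile(rf"(?i)\b(?P<k>{_NAMES}){_VALUE}")
# Permissive boundary: any non-alphanumeric character (or start of text) may precede the name.
PERMISSIVE = re.compile(rf"(?i)(?<![A-Za-z0-9])(?P<k>{_NAMES}){_VALUE}")


def redact_assignments(text: str) -> str:
    return PERMISSIVE.sub(lambda m: f"{m.group('k')}=<REDACTED:secret-like>", text)


line = "AWS_SECRET_ACCESS_KEY=abcdefgh12345678"
print(WORD_BOUNDARY.search(line))  # None: the \b variant misses the prefixed name
print(redact_assignments(line))
print(redact_assignments("DB_PASSWORD: 'hunter2hunter2'"))
```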


+    # Long hex digests (32+ char) — common for HMAC keys, session secrets, etc.
+    # Entropy heuristic misses these because hex has only 16 symbols (entropy ≤ 4).
+    (re.compile(r"\b[a-f0-9]{32,}\b"), "<REDACTED:hex-digest>"),
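
A short check of the reasoning in the comment: a hex string draws on 16 symbols, so its per-character entropy cannot exceed log2(16) = 4 bits, which is why a dedicated pattern is useful. The helper `bits_per_char` is hypothetical; the regex is copied from the excerpt above. As excerpted, the character class is lowercase-only, so an uppercase digest would not match this particular pattern.

```python
import hashlib
import math
import re
from collections import Counter

HEX_DIGEST = re.compile(r"\b[a-f0-9]{32,}\b")  # pattern as shown in the diff


def bits_per_char(s: str) -> float:
    """Shannon entropy of s in bits per character (hypothetical helper)."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


digest = hashlib.sha256(b"example").hexdigest()  # 64 lowercase hex characters

print(math.log2(16))          # upper bound for any hex string
print(bits_per_char(digest))  # measured entropy of this digest
print(HEX_DIGEST.sub("<REDACTED:hex-digest>", f"session={digest}"))
print(HEX_DIGEST.search(digest.upper()))  # uppercase hex is not covered by this pattern
```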


    if not file_path:
        return ""
-    p = repo_dir / file_path
+    p = (repo_dir / file_path).resolve()
    try:
+        repo_resolved = repo_dir.resolve()
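
The hunk above is cut off before the containment check. One common way to finish this kind of guard, sketched here under the assumption that the check uses `Path.relative_to` (the actual `_read_snippet` body is not shown), is to resolve both paths and reject anything that does not sit under the resolved repo root. The function name `resolve_inside` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_inside(repo_dir: Path, file_path: str) -> Optional[Path]:
    """Return the resolved path if it stays inside repo_dir, else None."""
    if not file_path:
        return None
    repo_resolved = repo_dir.resolve()
    # Joining with an absolute file_path discards repo_dir, and ".." segments
    # collapse during resolve(), so both escapes show up as paths outside the root.
    candidate = (repo_dir / file_path).resolve()
    try:
        candidate.relative_to(repo_resolved)
    except ValueError:
        return None
    return candidate


print(resolve_inside(Path("."), "../outside.txt"))  # None
print(resolve_inside(Path("."), "/etc/passwd"))     # None on POSIX systems
```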


+- For Gemma (Ollama) and the Gemma direction of cross-validation, the scanner
+  refuses to send anything at all if `base_url` doesn't resolve to loopback or
+  RFC1918. Same for `triage.base_url` — if set to a non-local host, triage is
+  disabled at construction time.


+        if self.enabled and not is_local_url(self.cfg.base_url):
+            print(
+                f"triage: base_url {self.cfg.base_url!r} is not loopback/private — "
+                "triage is disabled to prevent source snippets leaving the host. "
+                "Set triage.base_url to a local Ollama (e.g. host.docker.internal:11434).",


+    if not is_local_url(ollama_url):
+        print(
+            f"cross-validate: gemma validator skipped — ollama_url {ollama_url!r} "
+            "is not loopback/private",
+            file=sys.stderr,
+        )
+        gemma_reachable = False


+    f = Finding(
+        scanner="semgrep", category="sast", rule_id="hardcoded-token",
+        severity="high", file_path="src/a.py", line=1, title="t",
+        message="found AKIAIOSFODNN7EXAMPLE in config",
+        extra={"snippet": "GITHUB_TOKEN = 'ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'"},
+    )


+    (tmp_path / "src").mkdir()
+    (tmp_path / "src" / "creds.py").write_text(
+        "AWS_KEY = 'AKIAIOSFODNN7EXAMPLE'\n"
+        "GH = 'ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'\n"
+        "BLOB = 'f4Z7q2pHk8wT3sNcRy9LbVxJgQmDeAo5'\n"
+    )


+    f = _f("codex", "auth.foo", severity="high")
+    f.message = "exposed token ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa in source"
+    f.extra = {"snippet": "secret = 'AKIAIOSFODNN7EXAMPLE'"}
+    captured = {}


nirmalgupta and others added 2 commits June 2, 2026 17:17

Copilot AI review requested due to automatic review settings June 2, 2026 22:24

Copilot started reviewing on behalf of nirmalgupta June 2, 2026 22:24

Copilot AI reviewed Jun 2, 2026

nirmalgupta merged commit df9216d into main Jun 2, 2026
3 checks passed

nirmalgupta deleted the feat/redact-secrets-before-llm branch June 2, 2026 22:37
