feat(redact): scrub secrets + entropy substrings before remote LLMs #11
Merged
Closes #3.

Adds `security_scan/redact.py` with `redact_text`, `redact_obj`, and `is_local_url`. Wires it into the three exit points to remote models:

- triage._finding_brief: snippet, message, and extra dict pass through redact_text/redact_obj before serialization. Triage refuses to operate at all if base_url isn't loopback/private.
- runners.gemma: every file body is redacted before going into the prompt; the runner refuses to send anything when base_url isn't local.
- cross_validate: snippets handed to both gemma and codex validators are redacted; finding messages are redacted; gemma direction skipped when ollama_url isn't local.

Patterns covered: AWS access keys, GitHub tokens/PATs, Stripe, Slack, Google API, OpenAI/Anthropic-style sk-..., JWTs, PEM key blocks, and NAME=value assignments where NAME hints at a secret. Plus high-Shannon-entropy (>=4.0 bits/char over >=20 chars) substrings.

Tests: 31 unit tests for the redactor, plus wire-up assertions in test_triage.py, test_gemma_runner.py, and test_cross_validate.py that verify no plaintext secret reaches the network. Test fixtures deliberately split secret-shaped prefixes with string concat so source files don't contain the literal token shapes GitHub push protection detects.

Full suite 258 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
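The entropy rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `security_scan/redact.py`: the names `shannon_entropy` and `redact_high_entropy` and the placeholder string are hypothetical. Only the two thresholds and the candidate alphabet come from this PR.

```python
import math
import re
from collections import Counter

# Thresholds taken from the commit message; everything else is an assumption.
MIN_LEN = 20
MIN_BITS_PER_CHAR = 4.0

# Candidate runs: base64/URL-safe characters, at least MIN_LEN long.
_CANDIDATE = re.compile(r"[A-Za-z0-9+/=_-]{%d,}" % MIN_LEN)


def shannon_entropy(s: str) -> float:
    """Empirical Shannon entropy of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def redact_high_entropy(text: str) -> str:
    """Replace candidate runs whose entropy meets the threshold."""
    def _sub(m: re.Match) -> str:
        run = m.group(0)
        if shannon_entropy(run) >= MIN_BITS_PER_CHAR:
            return "<REDACTED:high-entropy>"  # hypothetical placeholder
        return run
    return _CANDIDATE.sub(_sub, text)
```

A 32-character run with no repeated characters scores log2(32) = 5 bits/char and is replaced, while a long run of a single repeated character scores 0 and passes through unchanged.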
- Truncate AFTER redact in _finding_brief and cross_validate prompts (a credential straddling the cutoff would otherwise lose its prefix and slip the known-token regexes).
- Permissive left-boundary on the assignment regex so prefixed names like AWS_SECRET_ACCESS_KEY, DB_PASSWORD, JWT_SECRET match (the prior \b anchor treated `_` as a word char and failed those).
- Path-sanitize cross_validate._read_snippet: a finding emitting an absolute or `..`-escaped file_path no longer lets the validator read outside repo_dir.
- Stop sum()-then-list() on the Iterable in gemma._build_user_prompt; generators would have been silently consumed twice. Materialize once.
- New patterns: GitLab tokens (glpat-), Stripe webhook (whsec_), Slack app tokens (xapp-), Slack xoxe- variant, Google OAuth (ya29.), SendGrid (SG.<id>.<secret>), Age private key, Azure connection-string AccountKey/SharedAccessKey/SAS sig, DB/broker URLs with embedded credentials (postgres/mysql/mongodb/redis/amqp/jdbc/...), and a long hex-digest pattern that the entropy heuristic missed (hex's per-char entropy <= 4).
- Redact `title` in _finding_brief (was passing through verbatim).

258 → 265 tests; full suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
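The ordering issue behind the first bullet can be reproduced with a small sketch. The single AWS-shaped pattern and the helpers `brief_unsafe` and `brief_safe` are hypothetical stand-ins, not the functions in this PR. The example key is assembled by concatenation, in the spirit of the test fixtures described earlier.

```python
import re

# Simplified stand-in for one known-token pattern (AWS access key ID shape).
_AWS_KEY = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact(text: str) -> str:
    return _AWS_KEY.sub("<REDACTED:aws-key>", text)


def brief_unsafe(snippet: str, limit: int) -> str:
    # Truncating first can cut the token, so the regex no longer matches
    # and the surviving fragment is sent as plaintext.
    return redact(snippet[:limit])


def brief_safe(snippet: str, limit: int) -> str:
    # Redact the full text, then cut to length.
    return redact(snippet)[:limit]


snippet = "key = " + "AKIA" + "IOSFODNN7EXAMPLE"
print(brief_unsafe(snippet, 18))  # partial key survives
print(brief_safe(snippet, 18))    # only placeholder text survives
```

With a limit of 18 the unsafe variant emits the first twelve characters of the key, while the safe variant emits a truncated placeholder.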
Pull request overview
Adds a centralized redaction layer to prevent hardcoded credentials (known token shapes and high-entropy substrings) from being sent to remote LLMs, and enforces a “local-only” policy for Ollama-based features to mitigate source/snippet leakage (Issue #3).
Changes:
- Introduces `security_scan/redact.py` (`redact_text`, `redact_obj`, `is_local_url`) and unit tests covering token-shape + entropy redaction.
- Wires redaction into triage, Gemma runner prompts, and cross-validation prompts; adds guardrails to skip/disable LLM paths when URLs aren’t considered local.
- Updates docs/manifests/versioning for the new redaction behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `security_scan/redact.py` | New redaction + local-URL gate utilities (patterns + entropy heuristic). |
| `security_scan/triage.py` | Disables triage on non-local base_url; redacts title/message/extra/snippet in _finding_brief. |
| `security_scan/runners/gemma.py` | Refuses non-local base_url; redacts file contents before building user prompt. |
| `security_scan/cross_validate.py` | Skips Gemma direction on non-local ollama_url; redacts snippet/message before validator prompts; path-sanitizes _read_snippet. |
| `tests/test_redact.py` | Adds comprehensive unit tests for token patterns, entropy heuristic, and is_local_url. |
| `tests/test_triage.py` | Adds assertions that triage brief redacts secrets and disables on remote base_url. |
| `tests/test_gemma_runner.py` | Adds assertions that prompts redact secrets and runner refuses non-local base_url. |
| `tests/test_cross_validate.py` | Adds assertions for redaction, path traversal refusal, and skipping Gemma on remote URL. |
| `README.md` | Documents redaction + local-only behavior for LLM integrations. |
| `SECURITY-SCAN-MANIFEST.yaml` | Bumps version and adds changelog entry describing redaction + refusal behavior. |
| `security_scan/__init__.py` | Bumps package version to 0.2.4. |
| `pyproject.toml` | Bumps project version to 0.2.4. |
Comment on lines +74 to +80

    (re.compile(
        r"(?i)(?:^|[^A-Za-z0-9])(?P<k>(?:api[_-]?key|secret(?:[_-]?(?:access[_-]?)?key)?"
        r"|token|password|passwd|auth(?:[_-]?token)?|bearer|client[_-]?secret"
        r"|access[_-]?token|refresh[_-]?token|private[_-]?key|jwt[_-]?secret"
        r"|db[_-]?password|database[_-]?password))"
        r"['\"]?\s*[:=]\s*['\"]?(?P<v>[^\s'\";,]{8,})['\"]?"
    ), lambda m: f"{m.group('k')}=<REDACTED:secret-like>"),
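The left-boundary choice in this pattern can be checked with a cut-down comparison. The sketch below uses a shortened name list and a simplified value group, so `NAMES`, `VALUE`, `strict`, and `permissive` are illustrative names only, not the pattern as merged.

```python
import re

# Shortened alternation; the real pattern lists many more names.
NAMES = r"(?:secret(?:[_-]?(?:access[_-]?)?key)?|password|token)"
VALUE = r"\s*[:=]\s*['\"]?[^\s'\";,]{8,}"

# \b sits between a word and a non-word character, and "_" is a word
# character, so there is no boundary between "AWS_" and "SECRET".
strict = re.compile(r"(?i)\b" + NAMES + VALUE)

# Start of string, or any character that is not a letter or digit.
permissive = re.compile(r"(?i)(?:^|[^A-Za-z0-9])" + NAMES + VALUE)

line = "AWS_SECRET_ACCESS_KEY=abcd1234efgh5678"
print(bool(strict.search(line)))      # False
print(bool(permissive.search(line)))  # True
```

Because the permissive form still requires a non-alphanumeric character before the name, an identifier such as `mytoken` does not match at `token`.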
Comment on lines +69 to +71

    # Long hex digests (32+ char) — common for HMAC keys, session secrets, etc.
    # Entropy heuristic misses these because hex has only 16 symbols (entropy ≤ 4).
    (re.compile(r"\b[a-f0-9]{32,}\b"), "<REDACTED:hex-digest>"),
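The entropy ceiling mentioned in the code comment follows from the alphabet size: with 16 symbols the per-character entropy is at most log2(16) = 4 bits, reached only when every symbol occurs equally often. A small sketch, where `bits_per_char` is a hypothetical helper and the two sample strings are made up for illustration:

```python
import math
import re
from collections import Counter

HEX_DIGEST = re.compile(r"\b[a-f0-9]{32,}\b")  # pattern quoted from the diff


def bits_per_char(s: str) -> float:
    """Empirical Shannon entropy of s, in bits per character."""
    n = len(s)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


uniform = "0123456789abcdef" * 2  # every hex symbol equally frequent
skewed = "deadbeef" * 4           # non-uniform symbol counts

print(bits_per_char(uniform))       # 4.0, the ceiling log2(16)
print(bits_per_char(skewed) < 4.0)  # True
print(HEX_DIGEST.sub("<REDACTED:hex-digest>", "sig=" + skewed))
```

A real digest rarely has perfectly uniform symbol counts, so it usually lands below the 4.0 threshold, which is why a dedicated pattern is needed. Note that the pattern as quoted matches lowercase hex only.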
Comment on lines 276 to +280

    if not file_path:
        return ""
    p = repo_dir / file_path
    p = (repo_dir / file_path).resolve()
    try:
        repo_resolved = repo_dir.resolve()
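The excerpt stops before the containment check itself. One common way to finish it, shown here as a sketch under assumptions rather than the merged code, is to resolve both paths and let `Path.relative_to` raise when the candidate falls outside the repository. The function name `safe_snippet_path` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def safe_snippet_path(repo_dir: Path, file_path: str) -> Optional[Path]:
    """Return the resolved path if it stays inside repo_dir, else None."""
    if not file_path:
        return None
    repo_resolved = repo_dir.resolve()
    # Joining with an absolute file_path discards repo_dir entirely,
    # which is why the check must run on the resolved result.
    candidate = (repo_dir / file_path).resolve()
    try:
        candidate.relative_to(repo_resolved)
    except ValueError:
        return None  # absolute path or ".." escape
    return candidate
```

Resolving before comparing also normalizes symlinks and `..` segments, so a path like `src/../../etc/passwd` is rejected even though it starts inside the repository.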
Comment on lines +232 to +235

    - For Gemma (Ollama) and the Gemma direction of cross-validation, the scanner
      refuses to send anything at all if `base_url` doesn't resolve to loopback or
      RFC1918. Same for `triage.base_url` — if set to a non-local host, triage is
      disabled at construction time.
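A gate of the kind this README text describes might look like the sketch below. It is an assumption-laden illustration, not the PR's `is_local_url`: the function name `is_local_url_sketch` and the hostname allowlist are hypothetical, and it deliberately performs no DNS lookup, so any hostname that is not an IP literal or an allowlisted name is rejected.

```python
import ipaddress
from urllib.parse import urlparse

# Hostnames treated as local without an IP check (assumed allowlist).
_LOCAL_NAMES = {"localhost", "host.docker.internal"}


def is_local_url_sketch(url: str) -> bool:
    """True if the URL's host is a local name or a loopback/private IP literal."""
    # urlparse only fills in hostname when a "//" authority marker is present.
    host = urlparse(url if "://" in url else "//" + url).hostname
    if not host:
        return False
    if host in _LOCAL_NAMES:
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # non-literal hostname; no DNS resolution in this sketch
    return ip.is_loopback or ip.is_private
```

`ipaddress`'s `is_private` covers the RFC 1918 ranges along with some other reserved ranges, so a stricter gate might enumerate the allowed networks explicitly.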
Comment on lines +43 to +47

    if self.enabled and not is_local_url(self.cfg.base_url):
        print(
            f"triage: base_url {self.cfg.base_url!r} is not loopback/private — "
            "triage is disabled to prevent source snippets leaving the host. "
            "Set triage.base_url to a local Ollama (e.g. host.docker.internal:11434).",
Comment on lines +106 to +112

    if not is_local_url(ollama_url):
        print(
            f"cross-validate: gemma validator skipped — ollama_url {ollama_url!r} "
            "is not loopback/private",
            file=sys.stderr,
        )
        gemma_reachable = False
Comment on lines +136 to +141

    f = Finding(
        scanner="semgrep", category="sast", rule_id="hardcoded-token",
        severity="high", file_path="src/a.py", line=1, title="t",
        message="found AKIAIOSFODNN7EXAMPLE in config",
        extra={"snippet": "GITHUB_TOKEN = 'ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'"},
    )
Comment on lines +173 to +178

    (tmp_path / "src").mkdir()
    (tmp_path / "src" / "creds.py").write_text(
        "AWS_KEY = 'AKIAIOSFODNN7EXAMPLE'\n"
        "GH = 'ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'\n"
        "BLOB = 'f4Z7q2pHk8wT3sNcRy9LbVxJgQmDeAo5'\n"
    )
Comment on lines +170 to +173

    f = _f("codex", "auth.foo", severity="high")
    f.message = "exposed token ghp_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa in source"
    f.extra = {"snippet": "secret = 'AKIAIOSFODNN7EXAMPLE'"}
    captured = {}
Closes #3.

Summary
- `security_scan/redact.py` with `redact_text`, `redact_obj`, `is_local_url`.
- Wired into `triage._finding_brief`, `runners/gemma._build_user_prompt`, `cross_validate` (both `_gemma_verdict` and `_codex_verdict`, plus `_finding_summary`).
- `is_local_url`: Triage disables itself, the gemma runner returns failure, and the gemma direction of cross-validate is skipped when `base_url`/`ollama_url` isn't loopback/private/`host.docker.internal`.

Patterns redacted
- AWS access keys, GitHub (`ghp_`/`gho_`/`ghu_`/`ghs_`/`github_pat_`), GitLab (`glpat-`), Stripe (`sk_live_`/`sk_test_`/..., `whsec_`), OpenAI/Anthropic-style `sk-…`, Slack (`xox[abeprs]-`, `xapp-`), Google API (`AIza…`, `ya29.`), SendGrid (`SG.<id>.<secret>`), Age private keys (`AGE-SECRET-KEY-1…`), JWTs, PEM key blocks, database/broker URLs with embedded credentials (postgres/mysql/mongodb+srv/redis/amqp/jdbc/...), Azure storage `AccountKey=`/`SharedAccessKey=`/SAS `sig=`, 32+ char hex digests.
- High-entropy substrings (≥ 20 chars of `[A-Za-z0-9+/=_-]`) with Shannon entropy ≥ 4.0 bits/char.
- `NAME=value` or `NAME: value` where NAME hints at a secret (api_key, secret_access_key, password, token, bearer, jwt_secret, ...). Permissive left-boundary handles prefixed names like `AWS_SECRET_ACCESS_KEY`.

Codex peer-review hardening
Ran the diff through codex for a second opinion. Addressed:
- Path-sanitized `_read_snippet` against `..` / absolute file_path escapes.
- Redacted `title` in `_finding_brief` (was passing through verbatim).

Codex also flagged that `cross_validate._codex_verdict` runs `codex` with `-C repo_dir` and read-only repo access. Codex can inspect raw source via its own tools, so redacting the prompt isn't sufficient to prevent leakage on that direction. Noted as a follow-up; it would require feeding Codex a tempdir-of-just-the-snippet rather than the full clone, which is a meaningful architectural change.

Test plan
- Unit tests for the redactor (`tests/test_redact.py`)
- Wire-up assertions in `test_triage.py`, `test_gemma_runner.py`, `test_cross_validate.py` proving no plaintext secret reaches the network
- Path traversal refusal in `_read_snippet`

🤖 Generated with Claude Code