Defense framework against AI agent content injection, memory poisoning, and RAG attacks. Platform-agnostic with adapters for Claude Code, MCP middleware, and CLI.
AI agents (Claude Code, MCP-connected tools, RAG pipelines) consume content from untrusted sources: web pages, emails, PDFs, knowledge bases, user-uploaded documents. That content can contain prompt injection attacks that hijack the agent's behavior, exfiltrate data, poison memory stores, or escalate privileges.
Most prompt injection defenses live inside the model. Agent Content Shield works outside the model as an independent content firewall. It scans everything flowing into and out of agent tools before the model ever sees it, catching attacks at the infrastructure layer rather than relying on the model to self-police.
This project was born from a real need: protecting a Claude Code setup with 15+ MCP servers, long-term memory systems, and knowledge bases from content injection across those trust boundaries. After building it, a coordinated red team of 6 specialized AI agents stress-tested it with 95+ attack vectors across 5 waves, hardening it to a 98% catch rate.
Content In
|
v
[Layer 1: Regex + Heuristics] <5ms
| 500+ patterns, 61 curated languages, 40+ semantic heuristics
| Catches: direct injection, role hijacking, credential access,
| system boundary faking, SSRF, hidden content, encoding tricks
|
v
[Layer 2: Embedding Similarity] ~50ms
| 77-phrase injection seed bank (Ollama nomic-embed-text)
| Cosine similarity against known attack embeddings
| <0.78 pass | 0.78-0.88 escalate | >0.88 block
|
v
[Layer 3: LLM Classifier] ~500ms
| Intent classification via local LLM (Ollama deepseek-r1:8b)
| Detects: polymorphic paraphrases, narrative embedding,
| metaphorical attacks, fabricated authority
|
v
[Layer 4: NLI Intent Classifier] ~200ms (opt-in)
| Claude Haiku API or Ollama fallback
| Resolves metaphorical/domain-framed attacks
|
v
Result: PASS | WARN | SANITIZE | BLOCK
Only content that passes Layer 1 reaches Layer 2. Only borderline Layer 2 cases reach Layer 3. Most benign content exits at Layer 1 in under 5ms.
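The Layer 2 routing in the diagram can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names `cosineSimilarity`, `maxSeedSimilarity`, and `layer2Decision` are hypothetical, and only the 0.78 and 0.88 thresholds are taken from the diagram above.

```javascript
// Sketch only: function names are hypothetical; thresholds come from the diagram.
const EMBED_ALERT = 0.78; // below this: pass
const EMBED_BLOCK = 0.88; // above this: block

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Highest similarity of the content embedding against any seed embedding.
function maxSeedSimilarity(contentVec, seedVecs) {
  return seedVecs.reduce(
    (max, seed) => Math.max(max, cosineSimilarity(contentVec, seed)),
    -1
  );
}

// Map a similarity score onto the three Layer 2 outcomes.
function layer2Decision(similarity) {
  if (similarity > EMBED_BLOCK) return 'block';
  if (similarity >= EMBED_ALERT) return 'escalate'; // borderline: hand to Layer 3
  return 'pass';
}
```

In this sketch only the `escalate` outcome would incur the cost of the LLM classifier, which is what keeps the common case fast.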
Direct Attacks -- instruction override, role hijacking, system boundary faking ([SYSTEM], [INST], <|system|>), behavioral manipulation, data exfiltration, credential harvesting
Encoding Evasion -- base64, hex, HTML entities, JavaScript unicode escapes, CSS unicode escapes, URL encoding, zero-width character insertion, Unicode NFKC bypass
Homoglyph Attacks -- Cyrillic (a/e/o/p/c/x), Greek, Armenian, Cherokee character substitution
Hidden Content -- CSS display:none, visibility:hidden, opacity:0, @font-face glyph remapping, CSS var() reconstruction, HTML comment injection
Semantic Injection -- passive voice exfiltration, legal/regulatory framing, educational/red-team pretexts, Socratic question framing, metaphorical extraction, completion priming, authority fabrication
Infrastructure -- SSRF (localhost, metadata endpoints, decimal/hex/octal IPs), blocked exfiltration domains (webhook.site, requestbin, ngrok, etc.), DNS rebinding patterns
Multilingual -- curated injection regex for 61 languages: Amharic, Arabic, Basque, Bengali, Bulgarian, Burmese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, Esperanto, Farsi/Persian, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hungarian, Igbo, Indonesian, Japanese, Kannada, Kazakh, Khmer, Korean, Lao, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Pashto, Polish, Portuguese, Punjabi, Romanian, Russian, Sinhala, Slovak, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Tibetan, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Yoruba. Languages without curated regex fall through to the semantic embedding + NLI intent layers, which are vocabulary-independent by design — so an attack in Icelandic, Welsh, or Javanese still trips Layer 2/3/4 via intent rather than keyword.
Memory Poisoning -- behavioral override attempts, internal file reference probing (.env, CLAUDE.md, settings.json)
Document Injection -- PDF JavaScript/URI injection, markdown tracking beacons, HTML <script>/<iframe>/<object> tags
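As an illustration of the Infrastructure category, the sketch below shows one way decimal, hex, and octal IPv4 hosts can be caught. It relies on the WHATWG URL parser built into Node.js, which canonicalizes those forms to dotted-decimal. `isInternalUrl` is a hypothetical helper, not the project's `validateUrl`, and it does not cover blocklists or DNS rebinding.

```javascript
// Sketch only: isInternalUrl is a hypothetical helper, not the project's validateUrl.
function isInternalUrl(rawUrl) {
  let host;
  try {
    // The WHATWG parser rewrites 2130706433, 0x7f.0.0.1, 0177.0.0.1 to 127.0.0.1.
    host = new URL(rawUrl).hostname;
  } catch {
    return true; // unparseable URLs are treated as unsafe
  }
  if (host === 'localhost' || host.endsWith('.localhost') || host === '[::1]') {
    return true;
  }
  const m = host.match(/^(\d+)\.(\d+)\.(\d+)\.(\d+)$/);
  if (!m) return false; // not an IPv4 literal; DNS-based checks are out of scope here
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 127 ||                          // loopback
    a === 10 ||                           // RFC 1918
    a === 0 ||                            // "this network"
    (a === 169 && b === 254) ||           // link-local, cloud metadata endpoints
    (a === 172 && b >= 16 && b <= 31) ||  // RFC 1918
    (a === 192 && b === 168)              // RFC 1918
  );
}
```

Canonicalizing first and matching second avoids maintaining a separate regex for every numeric encoding of the same address.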
Before any detection runs, content goes through:
- Zero-width character stripping (ZWS, ZWNJ, ZWJ, soft hyphens, bidi overrides)
- Unicode NFKC normalization (fullwidth to ASCII)
- HTML entity decoding
- Homoglyph translation (Cyrillic/Greek/Armenian/Cherokee to Latin)
- Diacritical mark stripping
- Base64/hex payload extraction and decode
- Recursive text extraction from nested objects (depth limit 20)
- Integrity verification of signature database (SHA-256)
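A few of the preprocessing steps above can be sketched in plain JavaScript. This is an assumption-laden illustration: `normalizeForScan` is a hypothetical name, the homoglyph table covers only the six Cyrillic letters mentioned earlier, and HTML entity decoding and base64/hex extraction are left out.

```javascript
// Sketch only: the homoglyph table is a tiny illustrative subset, not the project's map.
// Zero-width chars, soft hyphen, word joiner, BOM, and bidi overrides/isolates.
const ZERO_WIDTH = /[\u200B-\u200D\u00AD\u2060\uFEFF\u202A-\u202E\u2066-\u2069]/g;

const HOMOGLYPHS = {
  '\u0430': 'a', // Cyrillic a
  '\u0435': 'e', // Cyrillic ie
  '\u043E': 'o', // Cyrillic o
  '\u0440': 'p', // Cyrillic er
  '\u0441': 'c', // Cyrillic es
  '\u0445': 'x', // Cyrillic ha
};

function normalizeForScan(text) {
  return text
    .replace(ZERO_WIDTH, '')                 // 1. strip invisible characters
    .normalize('NFKC')                       // 2. fold fullwidth and compatibility forms
    .replace(/[\u0430\u0435\u043E\u0440\u0441\u0445]/g, (ch) => HOMOGLYPHS[ch]) // 3. homoglyphs
    .normalize('NFD')                        // 4. split base letters from combining marks
    .replace(/\p{M}/gu, '')                  //    and drop the marks
    .normalize('NFC');
}
```

Running the regex layer on the normalized string rather than the raw input means one pattern such as `ignore` can match its zero-width, fullwidth, and Cyrillic-substituted variants.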
- It's not a WAF. It doesn't sit in front of a web server. It sits between an AI agent and its tool outputs.
- It can't read images. Steganographic injection in images (pixel-level instructions) is flagged with a warning but not analyzed. Image analysis would require a vision model in the pipeline.
- It doesn't replace model-level safety. This is defense-in-depth. The model's own guardrails are still your first line. This catches what gets past them, or what arrives before the model processes it.
- Bash monitoring is new and pattern-based. The pre-bash hook (v0.3.0) scans commands before execution for exfiltration, reverse shells, sensitive file access, and rogue script execution. It's regex-based, so novel obfuscation may evade it.
- It's not production-hardened for high-throughput use. It is designed for developer workstation use (single agent, moderate request volume) and has not been benchmarked for thousands of concurrent scans.
- Cross-temporal analysis is limited. It scans content per-source. Benign fragments that reconstitute as attacks across multiple sources or sessions are not yet detected.
- Python port caveats. v0.4.1 closed the module-level gap: core/post_flight.py, core/escalation_tracker.py, core/canary.py, core/behavioral_engine.py, core/semantic_detector.py, and core/nli_classifier.py are all at behavior parity with their JS counterparts, and both ports share the same lexicons via core/semantic-lexicon.json and core/nli-intents.json. The one remaining gap is the Wave 2-5 Unicode preprocessing in core/detectors.py: the JS detector's homoglyph translation and encoding-decode chains are more complete. This is tracked for a future release.
agent-content-shield/
  core/
    detectors.{js,py}            # Main regex + heuristic engine (Layer 1; Python port is partial)
    semantic-detector.{js,py}    # Embedding + LLM classifier orchestrator (Layers 2-3)
    nli-classifier.{js,py}       # NLI intent classifier (Layer 4, opt-in)
    post-flight.{js,py}          # Post-flight output scanner (v0.4.0)
    escalation-tracker.{js,py}   # Multi-turn escalation tracking (v0.4.0)
    behavioral-engine.{js,py}    # Markov tool-sequence anomaly detector
    canary.{js,py}               # 128-bit canary token tripwire
    signatures.json              # Threat pattern library (500+ patterns)
    semantic-lexicon.json        # Shared: 106 injection seeds + 240 IDF terms
    nli-intents.json             # Shared: 8-intent threat taxonomy + system prompt
  adapters/
    claude-code/
      hooks.js                   # Claude Code PreToolUse/PostToolUse hooks
      post-flight-hook.js        # Claude Code Stop/SubagentStop hook (v0.4.0)
    python_middleware/           # Python decorator + async context manager
    mcp-middleware/              # MCP protocol adapter
    stdin-pipe/
      scan.js                    # Universal pipe adapter
  cli/
    shield.js                    # CLI commands (scan, scan-dir, validate-url)
  config/
    default.yaml                 # Thresholds, trusted/blocked domains, tool mappings
  test/
    detectors.test.js            # Core detector tests (50+ cases)
    semantic.test.js             # Semantic layer tests (needs Ollama)
    post-flight.test.js          # Post-flight scanner tests
    escalation-tracker.test.js   # Escalation tracking tests
    shared-lexicon.test.js       # Drift guard: JS must match shared JSON (v0.4.2)
    test_*.py                    # Python mirrors of all the above (105 tests total)
    run.js                       # JS test orchestrator
  logs/                          # Detection logs (JSONL)
- Node.js >= 18
- No other dependencies for regex-based detection
- Ollama running locally at localhost:11434
- nomic-embed-text model (~270MB) for embedding extraction
- deepseek-r1:8b model (~4.7GB) for LLM classification
- Anthropic API key for Claude Haiku classification (Layer 4 falls back to Ollama if no key is set)
git clone https://github.com/anthonyonazure/agent-content-shield.git
cd agent-content-shield
npm install
# For semantic layers (optional):
ollama pull nomic-embed-text
ollama pull deepseek-r1:8b
Register in your Claude Code settings.json:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "WebFetch",
        "hooks": [{
          "type": "command",
          "command": "node /path/to/agent-content-shield/adapters/claude-code/hooks.js pre-fetch"
        }]
      },
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "node /path/to/agent-content-shield/adapters/claude-code/hooks.js pre-bash"
        }]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [{
          "type": "command",
          "command": "node /path/to/agent-content-shield/adapters/claude-code/hooks.js postContentScanner"
        }]
      }
    ]
  }
}
# Scan a file
npx shield scan document.md
# Scan a directory recursively
npx shield scan-dir ./downloaded-docs
# Validate a URL against SSRF/blocklist
npx shield validate-url https://webhook.site/abc123
curl https://example.com | node adapters/stdin-pipe/scan.js --context web_fetch
cat untrusted-doc.md | node adapters/stdin-pipe/scan.js --sanitize
const { scanContent, validateUrl } = require('./core/detectors');
const result = scanContent(untrustedText, { context: 'web_fetch' });
if (result.maxSeverity >= 8) {
console.log('Blocked:', result.findings);
}
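// Assumption: SIGS is the parsed signature database from core/signatures.json
const SIGS = require('./core/signatures.json');
const urlCheck = validateUrl('http://169.254.169.254/metadata', SIGS);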
if (urlCheck.blocked) {
console.log('SSRF attempt blocked');
}
Edit config/default.yaml:
# Severity threshold for sanitization (strip payloads)
sanitize_threshold: 8
# Risk score threshold for blocking memory writes
block_threshold: 0.4
# Domains that bypass URL validation
trusted_domains:
- github.com
- docs.microsoft.com
- anthropic.com
# ...
# Domains that are always blocked (exfiltration endpoints)
blocked_domains:
- webhook.site
- requestbin.com
- interact.sh
# ...
# Semantic layer (requires Ollama)
semantic:
  enabled: true
  embed_alert_threshold: 0.78
  embed_block_threshold: 0.88
  classifier_threshold: 0.70
# Fast: regex + heuristic tests only
npm test
# Include semantic layer tests (requires Ollama)
npm run test:semantic
# Everything
npm run test:all
The test suite includes 50+ core detector tests plus red team attack vectors from 5 progressive waves covering 21 attack classes.
The framework has been through a coordinated red team exercise using 6 specialized AI agents simulating nation-state threat actors (NSA TAO, Chinese MSS, Russian GRU, North Korean Lazarus). The full report is in RED-TEAM-REPORT.md.
Results: 4 CRITICAL, 6 HIGH, 6 MEDIUM findings across 95+ attack vectors. Post-remediation catch rate: 98%.
Notable attack chains discovered:
- Cyrillic SSRF + Ollama DoS = permanent semantic bypass
- Python Unicode gap + classifier injection = double gate bypass
- Config theater + unmonitored Bash = invisible exfiltration
- Statistical short-circuit + passive voice injection = full pipeline evasion
- Ollama has no authentication -- port squatting risk mitigated by model integrity checks (v0.3.0) but not fully solved
- Ollama mTLS or API key auth to prevent port squatting
- Image analysis pipeline -- vision model to detect steganographic injection
- Long-tail language coverage -- 61 languages have curated injection regex today. Expanding to the remaining top-100 (Icelandic, Welsh, Afrikaans, Kurdish, Javanese, Cebuano, Quechua, and others) is low marginal value because the semantic + NLI layers already catch those attacks intent-first, but it would tighten the offline-only path
- Embedding ensemble -- two embedding models to resist adversarial suffix attacks
- Statistical gate hardening -- remove the score < 0.15 fast-path that allows semantic bypass
- Metrics dashboard -- visualize detection logs, false positive rates, latency percentiles
- Write/Edit tool pre-hooks -- scan content before file writes to config/settings
The shield is designed to degrade gracefully:
| Condition | Behavior |
|---|---|
| Ollama not running | Layers 2-3 skipped, TF-IDF + entropy fallback active, Layer 1 (regex) still active |
| Anthropic API key missing | Layer 4 falls back to Ollama, or skips |
| signatures.json tampered | SHA-256 integrity check against known-good hash fails, scan aborts |
| Ollama model replaced | Canary embedding integrity check detects dimension/magnitude anomalies |
| Scan timeout (>30s) | Fail-open with warning logged |
| Unknown tool name | Classified as general context, full Layer 1 scan applied |
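The tamper check in the table can be sketched with Node's built-in `crypto` module. This is a minimal illustration under stated assumptions: `sha256Hex` and `verifySignatureDb` are hypothetical names, and how the project stores its known-good hash is not shown here.

```javascript
const crypto = require('node:crypto');

// Sketch only: helper names and the pinned-hash argument are hypothetical.
function sha256Hex(buffer) {
  return crypto.createHash('sha256').update(buffer).digest('hex');
}

// Returns true only if the file bytes hash to the expected hex digest.
function verifySignatureDb(buffer, expectedHex) {
  const actual = Buffer.from(sha256Hex(buffer), 'hex');
  const expected = Buffer.from(expectedHex, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return actual.length === expected.length && crypto.timingSafeEqual(actual, expected);
}
```

A caller would read the signature file into a buffer, for example with `fs.readFileSync`, and abort the scan when `verifySignatureDb` returns `false`, matching the behavior in the table.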
MIT