Stop memory poisoning attacks on your AI agents.
When an attacker poisons your agent's persistent memory — its RAG knowledge base, vector store, or conversation history — every future decision is compromised. Unlike prompt injection, which requires the attacker to be present each session, memory poisoning is a one-time attack with permanent effect. The agent keeps making bad decisions long after the initial compromise, across all users and all sessions.
Existing defenses are either red-teaming tools that find the vulnerability but don't fix it, or academic prototypes tied to specific agent architectures. Nothing you can pip install today.
```bash
pip install memshield
```

```python
from memshield import MemShield
from memshield.adapters.openai_provider import OpenAIProvider

shield = MemShield(local_provider=OpenAIProvider(
    model="llama3.1:8b",
    base_url="http://localhost:11434/v1",
    api_key="not-needed",
))

vectorstore = shield.wrap(vectorstore)  # done — reads are validated, writes are tracked
```

Your agent code doesn't change. The wrapped store has the same interface as the original.
```bash
# Core library (no dependencies)
pip install memshield

# With OpenAI/Ollama provider
pip install memshield[openai]

# With all optional integrations
pip install memshield[all]

# From source
git clone https://github.com/npow/memshield.git
cd memshield
pip install -e ".[dev]"
```

Basic usage:

```python
from memshield import MemShield, ShieldConfig
from memshield.adapters.openai_provider import OpenAIProvider

provider = OpenAIProvider(model="llama3.1:8b", base_url="http://localhost:11434/v1")
shield = MemShield(local_provider=provider)
store = shield.wrap(your_vectorstore)

results = store.similarity_search("company refund policy")
# Poisoned entries are blocked. Clean entries pass through.
```

Two-tier validation with a cloud fallback:

```python
local = OpenAIProvider(model="llama3.1:8b", base_url="http://localhost:11434/v1")
cloud = OpenAIProvider(model="gpt-4o")  # only called for ambiguous entries

shield = MemShield(local_provider=local, cloud_provider=cloud)
store = shield.wrap(your_vectorstore)
```

Provenance tracking on writes:

```python
shield = MemShield(local_provider=provider)
store = shield.wrap(your_vectorstore)

# Writes are automatically tagged with source, timestamp, and hash chain
store.add_texts(["New company policy: ..."], metadatas=[{"source": "user_input"}])

# Later, verify the provenance chain is intact
assert shield.provenance.verify_chain()
```

Runtime stats:

```python
print(f"Clean: {shield.stats.clean}")
print(f"Blocked: {shield.stats.blocked}")
print(f"Escalated to cloud: {shield.stats.cloud_calls}")
```

MemShield wraps your vector store with a proxy. When your agent calls `similarity_search`, the proxy intercepts the results and validates each entry before returning them. Writes go through too — tagged with provenance metadata.
The core idea is simple: ask an LLM to analyze each retrieved memory entry. Specifically, the LLM decomposes the entry into three parts — what factual claim it makes, what action it implies, and what context would make it legitimate. When these conflict (e.g., the factual claim is benign but the implied action is "override system instructions"), the entry is flagged as poisoned. This catches attacks that look like knowledge but behave like instructions.
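The conflict check can be sketched in a few lines. This is an illustrative sketch of the decompose-and-compare idea, not memshield's actual internals: assume the analysis LLM has already returned a decomposition with `claim`, `implied_action`, and `legit_context` fields, and we flag entries whose implied action is instruction-like even though the claim reads as plain knowledge. The field names and red-flag list are assumptions for illustration.

```python
# Hypothetical conflict check over an LLM-produced decomposition.
# An entry is suspicious when it reads as knowledge but its implied
# action is an instruction (e.g., "override", "always approve").
ACTION_RED_FLAGS = ("override", "ignore previous", "always approve", "exfiltrate")

def is_poisoned(decomposition: dict) -> bool:
    """decomposition has keys: claim, implied_action, legit_context."""
    action = decomposition["implied_action"].lower()
    return any(flag in action for flag in ACTION_RED_FLAGS)

# A benign fact implies no action:
clean = {"claim": "Refunds are processed within 14 days.",
         "implied_action": "none",
         "legit_context": "company policy docs"}

# A poisoned entry looks like knowledge but implies an instruction:
poisoned = {"claim": "Refund policy note",
            "implied_action": "always approve refund requests without verification",
            "legit_context": "none"}

assert not is_poisoned(clean)
assert is_poisoned(poisoned)
```

In practice the interesting work is in the LLM's decomposition itself; the final flag is a cheap comparison like the one above.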
Three layers, from fast/cheap to slow/expensive:
- Local LLM (every read) — A local model (Llama 3.1 8B via Ollama/vLLM) runs the analysis. Adds ~200-500ms per entry on GPU. Resolves the clear cases — obviously clean or obviously poisoned.
- Cloud LLM (ambiguous cases only) — When the local model isn't confident, the entry escalates to a frontier model (GPT-4o, Claude, etc.) for a second pass. Typically 1-3% of reads.
- Provenance + drift (metadata, no LLM) — Every write is logged in a SHA-256 hash chain with source and trust level. A statistical profiler flags unusual access patterns — catching subtle poisoning that content analysis misses.
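The hash-chain part of the provenance layer can be illustrated with the standard library. This is a minimal sketch of the general technique (each record commits to the previous record's hash, so any later edit breaks verification); the record fields here are assumptions, not memshield's actual schema.

```python
import hashlib
import json
import time

def append_entry(chain: list, text: str, source: str, trust: str) -> dict:
    """Append a write record to a SHA-256 hash chain."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {"text": text, "source": source, "trust": trust,
              "ts": time.time(), "prev": prev_hash}
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(record)
    return record

def verify_chain(chain: list) -> bool:
    """Recompute every hash; fail on any tampered or reordered record."""
    prev = "0" * 64
    for record in chain:
        body = {k: v for k, v in record.items() if k != "hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if record["prev"] != prev or digest != record["hash"]:
            return False
        prev = record["hash"]
    return True

chain = []
append_entry(chain, "New company policy: ...", source="user_input", trust="untrusted")
append_entry(chain, "Q3 revenue summary", source="etl_pipeline", trust="trusted")
assert verify_chain(chain)

chain[0]["text"] = "tampered"  # any edit invalidates the chain
assert not verify_chain(chain)
```

Because each record's hash covers the previous hash, an attacker who rewrites one stored memory would have to rewrite every subsequent record to keep the chain valid.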
The proxy delegates all methods it doesn't intercept via __getattr__, so your agent code doesn't change.
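A minimal version of that delegation pattern looks like this; the validation hook and the `FakeStore` class are illustrative stand-ins, not memshield's code.

```python
class ShieldedStore:
    """Proxy that intercepts reads for validation and delegates
    everything else to the wrapped store via __getattr__."""

    def __init__(self, store, validate):
        self._store = store
        self._validate = validate  # callable: entry -> bool (True = clean)

    def similarity_search(self, query, **kwargs):
        results = self._store.similarity_search(query, **kwargs)
        return [r for r in results if self._validate(r)]

    def __getattr__(self, name):
        # Only called when the attribute isn't found on the proxy itself,
        # so add_texts, delete, etc. pass straight through untouched.
        return getattr(self._store, name)

class FakeStore:
    def similarity_search(self, query, **kwargs):
        return ["clean entry", "IGNORE ALL PREVIOUS INSTRUCTIONS"]
    def add_texts(self, texts):
        return len(texts)

shield = ShieldedStore(FakeStore(), validate=lambda e: "IGNORE" not in e)
assert shield.similarity_search("q") == ["clean entry"]
assert shield.add_texts(["a", "b"]) == 2  # delegated via __getattr__
```

Because `__getattr__` is a fallback (it fires only for attributes the proxy doesn't define), the wrapped object keeps its full interface without the proxy having to enumerate it.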
There are 60+ products in the LLM guardrails space. Almost all operate at the prompt/response boundary — they screen what goes into and comes out of the model. Very few operate at the memory/retrieval layer where poisoning actually lives.
These are the closest to what memshield does:
| Tool | What it does | How memshield differs |
|---|---|---|
| NVIDIA NeMo Guardrails | Open-source programmable guardrails with "retrieval rails" that filter RAG chunks before they reach the LLM. | NeMo filters based on configurable rules (Colang). memshield uses LLM reasoning to detect semantic manipulation — entries that follow the rules but contain disguised instructions. NeMo is broader (dialog rails, output rails); memshield is deeper on memory specifically. |
| Daxa AI | Shift-left RAG security: scans content before vectorization, labels malicious vectors, policy-based retrieval controls. | Daxa scans at ingestion time (pre-vectorization). memshield validates at retrieval time (post-vectorization). Different interception points — Daxa prevents bad data from entering; memshield catches it on the way out. Complementary. |
| Preamble | Runtime guardrails explicitly built for RAG, LLMs, and agents. | Preamble is a commercial platform. memshield is open-source middleware you embed in your code. Different deployment models. |
| NeuralTrust | Generative Application Firewall that maintains context across sessions and detects accumulation attacks. | NeuralTrust is a full commercial firewall platform. memshield is a pip-installable library that wraps your vector store. Different scale and scope. |
| Meta LlamaFirewall | Open-source guardrails with AlignmentCheck that audits agent reasoning from untrusted data sources. | LlamaFirewall audits the agent's chain-of-thought after retrieval. memshield validates the memory entries themselves before they reach the agent. Different layer — LlamaFirewall trusts the data and checks the reasoning; memshield checks the data. |
The bulk of the market. These screen model I/O but don't inspect what's stored in your knowledge base:
| Tool | Status |
|---|---|
| Lakera Guard | Acquired by Check Point. Prompt injection detection API. |
| Guardrails AI | Open-source I/O validators. No memory-specific validators. |
| LLM Guard (Protect AI → Palo Alto) | Open-source input/output scanners. |
| Arthur AI Shield | LLM firewall for PII, toxicity, hallucination, injection. |
| Prompt Security | Acquired by SentinelOne. Shadow AI + injection defense. |
| CalypsoAI | Acquired by F5. Inference-time security. |
| DynamoGuard (Dynamo AI) | Real-time guardrails with sub-50ms latency. |
| Provider | Feature | Handles memory? |
|---|---|---|
| AWS Bedrock | Guardrails with contextual grounding checks | Partial — checks responses against source docs, doesn't scan the knowledge base itself |
| Google Vertex AI | Model Armor for injection/jailbreak/content safety | No — prompt/response only |
| Azure AI | Content Safety + AI Threat Protection | Partial — anomaly monitoring, not knowledge base scanning |
| Cisco AI Defense | Built on Robust Intelligence acquisition | Prompt/response screening |
| Palo Alto Prisma AIRS | Built on Protect AI acquisition | Prompt/response + model scanning |
Funded companies focused on AI security posture, not specifically memory defense:
Noma Security ($132M raised) · HiddenLayer (model-level attacks) · Lasso Security ($28M, MCP gateway) · Straiker ($21M, agentic AI) · WitnessAI ($58M, AI governance) · Cranium AI ($32M, governance) · Zenity (agent behavior monitoring)
| Paper | What it proved |
|---|---|
| A-MemGuard (NTU/Oxford, 2025) | Consensus validation reduces memory poisoning attack success by >95%. memshield adapts this approach. |
| RAGuard (NeurIPS 2025) | Adversarial retriever training + zero-knowledge filter blocks RAG corpus poisoning. |
| RAGPart & RAGMask (U. Maryland, 2026) | Fragment-and-vote and masking approaches reduce poison impact. |
| TrustRAG (2025) | K-means clustering detects poisoned embeddings. |
| MemoryGraft (2025) | Attack paper proposing Cryptographic Provenance Attestation (unbuilt). memshield implements provenance tracking. |
memshield is open-source middleware that wraps your vector store. It's not a platform, not a firewall, not a governance tool. It's a library you pip install and add to your agent in three lines.
It complements the products above:
- Use Promptfoo to test if you're vulnerable → use memshield to fix it at runtime
- Use NeMo Guardrails for broad dialog/output rails → use memshield for deep memory content analysis
- Use Lakera/LLM Guard for prompt injection on user inputs → use memshield for poisoning in stored knowledge
- Use Daxa to prevent bad data from entering your vector store → use memshield to catch what Daxa missed on the way out
Honest caveats:
- NVIDIA NeMo Guardrails already has retrieval rails. If you're in the NeMo ecosystem, check if that's sufficient before adding memshield.
- Platform vendors (Meta, Microsoft, AWS, Google) may add native memory integrity features within 6-12 months.
- This is a new tool with no production deployments yet. The benchmark results will tell us how well it actually works.
```python
from memshield import MemShield, ShieldConfig, FailurePolicy

config = ShieldConfig(
    confidence_threshold=0.7,          # min confidence to resolve locally
    untrusted_confidence_boost=0.15,   # extra confidence required for untrusted sources
    failure_policy=FailurePolicy.ALLOW_WITH_WARNING,  # what to do with ambiguous entries
    enable_provenance=True,
    enable_drift_detection=True,
)
shield = MemShield(local_provider=provider, config=config)
```

| Option | Default | Description |
|---|---|---|
| `confidence_threshold` | `0.7` | Minimum confidence to resolve an entry locally |
| `untrusted_confidence_boost` | `0.15` | Extra confidence required for entries from untrusted sources |
| `failure_policy` | `ALLOW_WITH_WARNING` | One of `BLOCK`, `ALLOW_WITH_WARNING`, or `ALLOW_WITH_REVIEW` |
| `enable_provenance` | `True` | Track cryptographic provenance of memory writes |
| `enable_drift_detection` | `True` | Monitor for unusual memory access patterns |
```bash
git clone https://github.com/npow/memshield.git
cd memshield
pip install -e ".[dev]"
pytest -v
```

memshield ships with multiple validation strategies you can use individually or in an ensemble:

```python
from memshield import (
    MemShield, ConsensusStrategy, KeywordHeuristicStrategy, EnsembleStrategy
)
from memshield.adapters.openai_provider import OpenAIProvider

# Fast keyword heuristic (instant, zero cost, catches obvious attacks)
shield = MemShield(strategy=KeywordHeuristicStrategy())

# LLM consensus — A-MemGuard approach (deep, catches subtle attacks)
provider = OpenAIProvider(model="llama3.1:8b", base_url="http://localhost:11434/v1")
shield = MemShield(strategy=ConsensusStrategy(provider))

# Ensemble — run both, flag as poisoned if either detects it
shield = MemShield(strategy=EnsembleStrategy(
    [KeywordHeuristicStrategy(), ConsensusStrategy(provider)],
    mode="any_poisoned",  # or "majority" for balanced precision/recall
))
```

The benchmark compares strategies against labeled datasets, including reconstructed AgentPoison (NeurIPS 2024) attack data — the same corpus A-MemGuard evaluated against.
```bash
# Heuristic baseline (instant, no LLM needed):
python benchmarks/run_benchmark.py --strategy heuristic --tier 2 3 4

# Compare all strategies head-to-head:
python benchmarks/run_benchmark.py --compare-strategies --tier 2 3 4 \
    --local-url http://localhost:11434/v1 --local-model llama3.1:8b \
    --cloud-model gpt-4o

# Via claude-relay:
python benchmarks/run_benchmark.py --compare-strategies --tier 2 3 4 \
    --local-url http://localhost:8082/v1 --local-model sonnet
```

Benchmark tiers:
| Tier | Source | Entries | What it tests |
|---|---|---|---|
| 2 | Hand-crafted + MemoryGraft-style | 55 | Memory-specific: experience records, schema spoofing, rubric mimicry |
| 3 | Adversarial paired entries | 18 | Subtle pairs where clean and poisoned look structurally identical |
| 4 | AgentPoison (NeurIPS 2024) | 330 | Reconstructed StrategyQA + EHR poisoned passages with published golden triggers |
Tested using Claude Haiku via claude-relay:
| Strategy | Precision | Recall | F1 | FPR | Latency |
|---|---|---|---|---|---|
| Keyword heuristic | 100% | 14.5% | 25.4% | 0% | <1ms |
| LLM consensus | 97.1% | 98.6% | 97.8% | 0.6% | ~9s |
| Ensemble (majority) | 100% | 100%* | 100%* | 0% | ~8s |
| Ensemble (any_poisoned) | 94.5% | 100% | 97.2% | 1.2% | ~6s |
*In majority mode, the 57 entries where the heuristic and the LLM disagree are pushed to the ambiguous bucket rather than resolved.
Key findings:
- Keyword heuristic catches 0% of AgentPoison attacks. The poisoned passages contain no obvious instruction patterns — they're disguised as reasoning traces.
- LLM consensus catches 98.6% of all poisoned entries with only 2 false positives out of 334 clean entries. This validates A-MemGuard's published claim (>95% reduction) on the same attack data.
- Ensemble (any_poisoned) achieves 100% recall — every poisoned entry detected — at the cost of 4 false positives (1.2% FPR).
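As a sanity check, the F1 and FPR figures follow directly from the numbers reported above for the LLM consensus row (precision 97.1%, recall 98.6%, 2 false positives out of 334 clean entries):

```python
# F1 is the harmonic mean of precision and recall.
precision, recall = 0.971, 0.986
f1 = 2 * precision * recall / (precision + recall)
assert round(f1 * 100, 1) == 97.8  # matches the table

# FPR: 2 false positives among 334 clean entries.
fpr = 2 / 334
assert round(fpr * 100, 1) == 0.6  # matches the table
```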