A production-grade persistent memory system for LLMs, inspired by human cognitive memory theory.
| Memory Type | Human Analogy | Implementation | Persistence |
|---|---|---|---|
| Working | Short-term (seconds) | In-context buffer (last N turns) | Session only |
| Episodic | "I remember that conversation…" | ChromaDB vector store | Permanent (with decay) |
| Semantic | Facts you know | SQLite user profile | Permanent |
| Procedural | How you do things | SQLite behavior store | Permanent |
User message
│
▼
Memory Manager
├── Semantic profile (SQLite) → "Name: Ayan, Stack: Python"
├── Procedural rules (SQLite) → "Be concise, use code blocks"
├── Episodic retrieval (ChromaDB) → Top-3 relevant past conversations
└── Working memory (SQLite) → Last 6 turns
│
▼
Assembled prompt → LLM (Claude) → Response
│
▼
Memory Consolidator (every 4 turns)
├── Extract new facts → update semantic memory
├── Detect preferences → update procedural memory
└── Store summary → new episode in ChromaDB
Nightly (or on-demand):
Forgetting worker → delete episodes where retention < 20%
using Ebbinghaus decay: R = e^(-t/S)
cd llm-memory
# Install dependencies
pip install -r requirements.txt
# Set your Gemini API key (free at aistudio.google.com)
export GEMINI_API_KEY=AIza...
# Run smoke test (no server needed)
python test_memory.py
# Start the API server
uvicorn api.main:app --reload --port 8000cd ui
npm install
npm run dev
# Open http://localhost:5173| Method | Endpoint | Description |
|---|---|---|
| POST | /chat |
Send a message, get memory-aware response |
| GET | /memory/{user_id} |
Inspect all memory for a user |
| DELETE | /memory/{user_id}/forget |
Run Ebbinghaus forgetting pass |
| DELETE | /memory/{user_id}/reset |
Wipe all memory for a user |
| POST | /chat/end-session |
Explicitly end session + consolidate |
{
"user_id": "ayan_01",
"message": "How do I add streaming to my FastAPI app?",
"session_id": null
}{
"session_id": "abc-123",
"reply": "Here's how to add streaming to your async FastAPI setup, Ayan…",
"memory_debug": {
"profile_keys": ["name", "tech_stack", "goals"],
"behaviors_count": 1,
"episodes_retrieved": 2,
"working_turns": 4
},
"consolidation": {
"facts_updated": ["tech_stack"],
"episode_summary": "[FastAPI] Discussed adding streaming endpoints",
"episode_id": "a3f9c2b1"
}
}Episodic memories are retrieved by what you were talking about, not by when it happened. Semantic similarity (cosine distance on embeddings) captures this far better than a timestamp index.
Facts like "user's name = Ayan" are structured, deterministic key-value pairs. A relational store makes conflict resolution (newer fact overwrites older) trivial and auditable.
R(t) = e^(-t / S)
t= days since storedS= stability (scales with conversation length + reinforcement count)- Delete when
R < 0.20
This prevents the vector store from bloating with stale, irrelevant memories while keeping frequently-accessed ones alive.
score = 0.7 × similarity + 0.3 × retention
Blends semantic relevance with memory freshness. A very relevant but old memory scores lower than a slightly less relevant but recent one.
Consolidation runs every 4 turns (2 exchanges) — frequent enough to capture facts early, infrequent enough to avoid excessive LLM calls.
"How do you decide what's important enough to store?"
The consolidator uses a secondary LLM call with a strict extraction prompt. Only explicitly stated facts are stored — the prompt explicitly forbids guessing. Conversation length drives stability: longer conversations get higher S values, so they decay slower.
"How do you handle conflicting memories?"
Semantic memory uses an ON CONFLICT DO UPDATE SQL pattern — newer facts silently overwrite older ones. Episodic memories are never overwritten; they just get lower retrieval scores as they age.
"What's your retrieval strategy — recency vs relevance?"
Both, blended. The score = 0.7 × similarity + 0.3 × retention formula means relevance dominates, but freshness breaks ties. You can tune these weights.
"How do you stop the vector store from growing forever?"
The forgetting worker runs R(t) = e^(-t/S) for every stored episode. Anything below 20% retention gets deleted from both ChromaDB and the metadata SQLite table. Reinforcement (re-accessing an episode) boosts its effective stability.
llm-memory/
├── memory/
│ ├── memory_manager.py # Core — all 4 memory types
│ └── consolidator.py # Post-session extraction
├── api/
│ └── main.py # FastAPI endpoints
├── ui/
│ ├── src/App.jsx # React chat + memory inspector
│ └── src/main.jsx
├── test_memory.py # Smoke test
└── requirements.txt