LLM Memory Architecture

A production-grade persistent memory system for LLMs, inspired by human cognitive memory theory.

Memory Types

Memory Type	Human Analogy	Implementation	Persistence
Working	Short-term (seconds)	In-context buffer (last N turns)	Session only
Episodic	"I remember that conversation…"	ChromaDB vector store	Permanent (with decay)
Semantic	Facts you know	SQLite user profile	Permanent
Procedural	How you do things	SQLite behavior store	Permanent

Architecture

User message
     │
     ▼
Memory Manager
  ├── Semantic profile   (SQLite)    → "Name: Ayan, Stack: Python"
  ├── Procedural rules   (SQLite)    → "Be concise, use code blocks"
  ├── Episodic retrieval (ChromaDB)  → Top-3 relevant past conversations
  └── Working memory     (SQLite)    → Last 6 turns
     │
     ▼
Assembled prompt → LLM (Claude) → Response
     │
     ▼
Memory Consolidator (every 4 turns)
  ├── Extract new facts  → update semantic memory
  ├── Detect preferences → update procedural memory
  └── Store summary      → new episode in ChromaDB

Nightly (or on-demand):
  Forgetting worker → delete episodes where retention < 20%
                       using Ebbinghaus decay: R = e^(-t/S)
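
The assembly step in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code in memory_manager.py: the `MemorySnapshot` container, the `assemble_prompt` function, and the section labels are all hypothetical names chosen for the example. Only the four memory types, the top-3 episode limit, and the 6-turn window come from the diagram.

```python
from dataclasses import dataclass, field


@dataclass
class MemorySnapshot:
    """Hypothetical container for what the Memory Manager gathers per message."""
    profile: dict = field(default_factory=dict)        # semantic memory (key -> fact)
    behaviors: list = field(default_factory=list)      # procedural rules
    episodes: list = field(default_factory=list)       # episodic summaries, best first
    recent_turns: list = field(default_factory=list)   # working memory: (role, text)


def assemble_prompt(snapshot, user_message, max_episodes=3, max_turns=6):
    """Join the four memory types into one prompt string (layout is an assumption)."""
    sections = []
    if snapshot.profile:
        facts = "\n".join(f"- {k}: {v}" for k, v in snapshot.profile.items())
        sections.append("Known facts about the user:\n" + facts)
    if snapshot.behaviors:
        rules = "\n".join(f"- {rule}" for rule in snapshot.behaviors)
        sections.append("Behavior rules:\n" + rules)
    if snapshot.episodes:
        # Keep only the top-N retrieved episodes.
        past = "\n".join(f"- {ep}" for ep in snapshot.episodes[:max_episodes])
        sections.append("Relevant past conversations:\n" + past)
    if snapshot.recent_turns:
        # Keep only the most recent turns.
        window = snapshot.recent_turns[-max_turns:]
        turns = "\n".join(f"{role}: {text}" for role, text in window)
        sections.append("Recent turns:\n" + turns)
    sections.append(f"user: {user_message}")
    return "\n\n".join(sections)
```

Empty memory types are skipped, so a brand-new user produces a prompt containing only the current message.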

Quickstart

1. Backend

cd llm-memory

# Install dependencies
pip install -r requirements.txt

# Set your Gemini API key (free at aistudio.google.com)
export GEMINI_API_KEY=AIza...

# Run smoke test (no server needed)
python test_memory.py

# Start the API server
uvicorn api.main:app --reload --port 8000

2. Frontend

cd ui
npm install
npm run dev
# Open http://localhost:5173

API Reference

Method	Endpoint	Description
POST	`/chat`	Send a message, get memory-aware response
GET	`/memory/{user_id}`	Inspect all memory for a user
DELETE	`/memory/{user_id}/forget`	Run Ebbinghaus forgetting pass
DELETE	`/memory/{user_id}/reset`	Wipe all memory for a user
POST	`/chat/end-session`	Explicitly end session + consolidate

Chat request

{
  "user_id": "ayan_01",
  "message": "How do I add streaming to my FastAPI app?",
  "session_id": null
}

Chat response

{
  "session_id": "abc-123",
  "reply": "Here's how to add streaming to your async FastAPI setup, Ayan…",
  "memory_debug": {
    "profile_keys": ["name", "tech_stack", "goals"],
    "behaviors_count": 1,
    "episodes_retrieved": 2,
    "working_turns": 4
  },
  "consolidation": {
    "facts_updated": ["tech_stack"],
    "episode_summary": "[FastAPI] Discussed adding streaming endpoints",
    "episode_id": "a3f9c2b1"
  }
}
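
A client for the `/chat` endpoint might look like the sketch below. It uses only the standard library and assumes the server is running on port 8000 as in the Quickstart. The function names are hypothetical, and treating a null `session_id` as "start a new session" is an inference from the request and response examples above, not documented behavior.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the uvicorn port from the Quickstart


def build_chat_request(user_id, message, session_id=None):
    """Build the POST /chat request using the fields shown in the example above."""
    payload = {"user_id": user_id, "message": message, "session_id": session_id}
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(user_id, message, session_id=None):
    """Send one message and return the decoded JSON response (needs a live server)."""
    request = build_chat_request(user_id, message, session_id)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To continue a conversation, pass the `session_id` from one response into the next `send_chat` call.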

Key Design Decisions

Why ChromaDB for episodic memory?

Episodic memories are retrieved by what you were talking about, not by when it happened. Semantic similarity (cosine distance on embeddings) captures this far better than a timestamp index.
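
To make the idea concrete, here is a pure-Python stand-in for similarity-based retrieval. It does not use ChromaDB, and the toy two-dimensional vectors are placeholders for real embeddings. The function names are hypothetical. It shows only the ranking principle: episodes are ordered by cosine similarity to the query, regardless of when they were stored.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_episodes(query_vec, episodes, k=3):
    """Rank (episode_id, embedding) pairs by similarity to the query; keep the top k."""
    scored = [(eid, cosine_similarity(query_vec, emb)) for eid, emb in episodes]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: in practice the vectors would come from an embedding model.
episodes = [
    ("fastapi-streaming", [1.0, 0.0]),
    ("holiday-plans", [0.0, 1.0]),
    ("fastapi-auth", [1.0, 1.0]),
]
best = top_k_episodes([1.0, 0.0], episodes, k=2)
```

A vector store does the same ranking with an index, so it does not have to compare the query against every stored episode.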

Why a separate SQLite profile for semantic memory?

Facts like "user's name = Ayan" are structured, deterministic key-value pairs. A relational store makes conflict resolution (newer fact overwrites older) trivial and auditable.
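
The overwrite rule can be sketched with SQLite's upsert clause. The `profile` table and its column names here are hypothetical, and the project's actual schema may differ. `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.

```python
import sqlite3

# In-memory database for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE profile (
        user_id    TEXT NOT NULL,
        fact_key   TEXT NOT NULL,
        fact_value TEXT NOT NULL,
        updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, fact_key)
    )
    """
)


def upsert_fact(conn, user_id, fact_key, fact_value):
    """Insert a fact, or overwrite the existing value for the same user and key."""
    conn.execute(
        """
        INSERT INTO profile (user_id, fact_key, fact_value)
        VALUES (?, ?, ?)
        ON CONFLICT(user_id, fact_key) DO UPDATE SET
            fact_value = excluded.fact_value,
            updated_at = CURRENT_TIMESTAMP
        """,
        (user_id, fact_key, fact_value),
    )
    conn.commit()


upsert_fact(conn, "ayan_01", "tech_stack", "Python")
upsert_fact(conn, "ayan_01", "tech_stack", "Python, FastAPI")  # newer fact wins
```

The composite primary key is what makes the conflict detectable: each user can hold exactly one value per key.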

The Ebbinghaus forgetting curve

R(t) = e^(-t / S)

t = days since stored
S = stability (scales with conversation length + reinforcement count)
Delete when R < 0.20

This prevents the vector store from bloating with stale, irrelevant memories while keeping frequently-accessed ones alive.
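
The curve and the deletion rule translate directly into code. The sketch below takes the stability S as an input, because the exact formula that derives S from conversation length and reinforcement count is not given here. The function names are hypothetical.

```python
import math

FORGET_THRESHOLD = 0.20  # delete when R < 0.20


def retention(days_since_stored, stability):
    """R(t) = e^(-t / S), with t in days and S > 0."""
    if stability <= 0:
        raise ValueError("stability must be positive")
    return math.exp(-days_since_stored / stability)


def should_forget(days_since_stored, stability, threshold=FORGET_THRESHOLD):
    """True when retention has dropped below the deletion threshold."""
    return retention(days_since_stored, stability) < threshold


def days_until_forgotten(stability, threshold=FORGET_THRESHOLD):
    """Solve e^(-t / S) = threshold for t, giving t = -S * ln(threshold)."""
    return -stability * math.log(threshold)
```

Since -ln(0.20) ≈ 1.609, an episode that is never reinforced is deleted after roughly 1.6 × S days. With S = 10, for example, that is about 16.1 days.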

Retrieval scoring

score = 0.7 × similarity + 0.3 × retention

The score blends semantic relevance with memory freshness. Relevance carries the larger weight, but a highly relevant, old memory can still score lower than a slightly less relevant, recent one.
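
A short worked example of the formula, using made-up similarity and retention values. The function name and the two sample episodes are hypothetical. Both inputs are assumed to lie in the range 0 to 1.

```python
def retrieval_score(similarity, retention, w_similarity=0.7, w_retention=0.3):
    """score = 0.7 * similarity + 0.3 * retention (weights are tunable)."""
    return w_similarity * similarity + w_retention * retention


# Illustrative values only: an old, highly relevant episode vs. a recent one.
old_relevant = retrieval_score(similarity=0.90, retention=0.25)      # 0.63 + 0.075
recent_less_relevant = retrieval_score(similarity=0.80, retention=0.95)  # 0.56 + 0.285
```

Here the old episode scores about 0.705 and the recent one about 0.845, so the recent episode ranks first despite its lower similarity.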

Consolidation timing

Consolidation runs every 4 turns (2 exchanges) — frequent enough to capture facts early, infrequent enough to avoid excessive LLM calls.
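
One simple way to express this trigger is a turn-count check, sketched below. The function name and the `session_ended` flag are hypothetical. The flag reflects the `/chat/end-session` endpoint, which also consolidates.

```python
def should_consolidate(turn_count, every=4, session_ended=False):
    """Consolidate on every Nth turn, or when the session is explicitly ended."""
    if session_ended:
        return True
    return turn_count > 0 and turn_count % every == 0


# Turns 1..9 of a session: consolidation fires on turns 4 and 8.
trigger_turns = [t for t in range(1, 10) if should_consolidate(t)]
```

Raising `every` reduces the number of secondary LLM calls, but facts stated early in a conversation are captured later.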

Interview Q&A Prep

"How do you decide what's important enough to store?" The consolidator uses a secondary LLM call with a strict extraction prompt. Only facts the user states outright are stored; the prompt forbids guessing. Conversation length drives stability: longer conversations get higher S values, so they decay more slowly.

"How do you handle conflicting memories?" Semantic memory uses an ON CONFLICT DO UPDATE SQL pattern — newer facts silently overwrite older ones. Episodic memories are never overwritten; they just get lower retrieval scores as they age.

"What's your retrieval strategy — recency vs relevance?" Both, blended. The score = 0.7 × similarity + 0.3 × retention formula means relevance dominates, but freshness breaks ties. You can tune these weights.

"How do you stop the vector store from growing forever?" The forgetting worker evaluates R(t) = e^(-t/S) for every stored episode. Any episode below 20% retention is deleted from both ChromaDB and the metadata SQLite table. Reinforcement (re-accessing an episode) boosts its effective stability.

File Structure

llm-memory/
├── memory/
│   ├── memory_manager.py    # Core — all 4 memory types
│   └── consolidator.py      # Post-session extraction
├── api/
│   └── main.py              # FastAPI endpoints
├── ui/
│   ├── src/App.jsx          # React chat + memory inspector
│   └── src/main.jsx
├── test_memory.py           # Smoke test
└── requirements.txt
