Ayan03092005/LLM-Memory-Architecture

LLM Memory Architecture

A production-grade persistent memory system for LLMs, inspired by human cognitive memory theory.

Memory Types

| Memory Type | Human Analogy | Implementation | Persistence |
|---|---|---|---|
| Working | Short-term (seconds) | In-context buffer (last N turns) | Session only |
| Episodic | "I remember that conversation…" | ChromaDB vector store | Permanent (with decay) |
| Semantic | Facts you know | SQLite user profile | Permanent |
| Procedural | How you do things | SQLite behavior store | Permanent |

Architecture

User message
     │
     ▼
Memory Manager
  ├── Semantic profile   (SQLite)    → "Name: Ayan, Stack: Python"
  ├── Procedural rules   (SQLite)    → "Be concise, use code blocks"
  ├── Episodic retrieval (ChromaDB)  → Top-3 relevant past conversations
  └── Working memory     (SQLite)    → Last 6 turns
     │
     ▼
Assembled prompt → LLM (Gemini) → Response
     │
     ▼
Memory Consolidator (every 4 turns)
  ├── Extract new facts  → update semantic memory
  ├── Detect preferences → update procedural memory
  └── Store summary      → new episode in ChromaDB

Nightly (or on-demand):
  Forgetting worker → delete episodes where retention < 20%
                       using Ebbinghaus decay: R = e^(-t/S)
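
The read path in the diagram above (four memory sources merged into one prompt) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `MemoryContext`, `assemble_prompt`, and the section headings inside the prompt are hypothetical names chosen for the example. The only details taken from the diagram are the four sources and the 6-turn working window.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemoryContext:
    """Hypothetical container for the four memory reads in the diagram."""
    profile: Dict[str, str] = field(default_factory=dict)        # semantic
    rules: List[str] = field(default_factory=list)               # procedural
    episodes: List[str] = field(default_factory=list)            # episodic summaries
    turns: List[Tuple[str, str]] = field(default_factory=list)   # working: (role, text)


def assemble_prompt(ctx: MemoryContext, user_message: str, max_turns: int = 6) -> str:
    """Join the memory sections in a fixed order, skipping any that are empty."""
    sections = []
    if ctx.profile:
        facts = "\n".join(f"- {k}: {v}" for k, v in ctx.profile.items())
        sections.append("Known facts about the user:\n" + facts)
    if ctx.rules:
        sections.append("Behavior rules:\n" + "\n".join(f"- {r}" for r in ctx.rules))
    if ctx.episodes:
        sections.append(
            "Relevant past conversations:\n" + "\n".join(f"- {e}" for e in ctx.episodes)
        )
    recent = ctx.turns[-max_turns:]  # working memory keeps only the last N turns
    if recent:
        sections.append(
            "Recent turns:\n" + "\n".join(f"{role}: {text}" for role, text in recent)
        )
    sections.append(f"user: {user_message}")
    return "\n\n".join(sections)
```

Placing the stable sections (profile, rules) first and the new user message last is one plausible ordering; the actual layout used by the Memory Manager may differ.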

Quickstart

1. Backend

cd llm-memory

# Install dependencies
pip install -r requirements.txt

# Set your Gemini API key (free at aistudio.google.com)
export GEMINI_API_KEY=AIza...

# Run smoke test (no server needed)
python test_memory.py

# Start the API server
uvicorn api.main:app --reload --port 8000

2. Frontend

cd ui
npm install
npm run dev
# Open http://localhost:5173

API Reference

| Method | Endpoint | Description |
|---|---|---|
| POST | /chat | Send a message, get memory-aware response |
| GET | /memory/{user_id} | Inspect all memory for a user |
| DELETE | /memory/{user_id}/forget | Run Ebbinghaus forgetting pass |
| DELETE | /memory/{user_id}/reset | Wipe all memory for a user |
| POST | /chat/end-session | Explicitly end session + consolidate |

Chat request

{
  "user_id": "ayan_01",
  "message": "How do I add streaming to my FastAPI app?",
  "session_id": null
}
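
A client for the request above could look something like this sketch, which uses only the Python standard library. The helper names `build_chat_request` and `send_chat` are made up for the example, and the base URL assumes the server was started on port 8000 as in the Quickstart. Passing `session_id=None` mirrors the `null` in the sample request; that this starts a new session is an assumption based on the sample response returning a fresh `session_id`.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    user_id: str,
    message: str,
    session_id: Optional[str] = None,
    base_url: str = "http://localhost:8000",  # assumed from the Quickstart
) -> urllib.request.Request:
    """Build (but do not send) a POST /chat request with a JSON body."""
    payload = {"user_id": user_id, "message": message, "session_id": session_id}
    return urllib.request.Request(
        url=f"{base_url}/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply; needs the API server running."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To continue a conversation, a caller would presumably pass the `session_id` from the previous response back into `build_chat_request`.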

Chat response

{
  "session_id": "abc-123",
  "reply": "Here's how to add streaming to your async FastAPI setup, Ayan…",
  "memory_debug": {
    "profile_keys": ["name", "tech_stack", "goals"],
    "behaviors_count": 1,
    "episodes_retrieved": 2,
    "working_turns": 4
  },
  "consolidation": {
    "facts_updated": ["tech_stack"],
    "episode_summary": "[FastAPI] Discussed adding streaming endpoints",
    "episode_id": "a3f9c2b1"
  }
}

Key Design Decisions

Why ChromaDB for episodic memory?

Episodic memories are retrieved by what you were talking about, not by when it happened. Semantic similarity (cosine distance on embeddings) captures this far better than a timestamp index.
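
To make the similarity lookup concrete, here is a small pure-Python sketch of top-k retrieval by cosine similarity. It stands in for what a vector store does internally and does not use the ChromaDB API; the function names and the toy 2-dimensional vectors are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    episodes: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k episode ids most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), eid) for eid, vec in episodes.items()]
    scored.sort(reverse=True)
    return [(eid, score) for score, eid in scored[:k]]


# Toy embeddings: "a" points the same way as the query, "b" is orthogonal to it.
episodes = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], episodes, k=2)
```

Note that nothing in this lookup refers to when an episode was stored, which is the point of the design decision: ordering comes from the embeddings alone.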

Why a separate SQLite profile for semantic memory?

Facts like "user's name = Ayan" are structured, deterministic key-value pairs. A relational store makes conflict resolution (newer fact overwrites older) trivial and auditable.
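
The "newer fact overwrites older" behavior can be illustrated with SQLite's upsert syntax. The table name, column names, and helper functions below are assumptions for the sketch, not the repository's actual schema.

```python
import sqlite3
from typing import Dict


def open_profile_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a key-value profile table keyed by (user_id, key). Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS profile (
            user_id    TEXT NOT NULL,
            key        TEXT NOT NULL,
            value      TEXT NOT NULL,
            updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (user_id, key)
        )
        """
    )
    return conn


def upsert_fact(conn: sqlite3.Connection, user_id: str, key: str, value: str) -> None:
    """Insert a fact, or overwrite the existing value for the same (user_id, key)."""
    # ON CONFLICT ... DO UPDATE requires SQLite 3.24 or newer.
    conn.execute(
        """
        INSERT INTO profile (user_id, key, value) VALUES (?, ?, ?)
        ON CONFLICT(user_id, key) DO UPDATE SET
            value = excluded.value,
            updated_at = CURRENT_TIMESTAMP
        """,
        (user_id, key, value),
    )
    conn.commit()


def get_profile(conn: sqlite3.Connection, user_id: str) -> Dict[str, str]:
    """Return all stored facts for one user as a plain dict."""
    rows = conn.execute(
        "SELECT key, value FROM profile WHERE user_id = ?", (user_id,)
    ).fetchall()
    return dict(rows)
```

Keeping an `updated_at` column is one way to make overwrites traceable; a fuller audit trail would need a separate history table.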

The Ebbinghaus forgetting curve

R(t) = e^(-t / S)
  • t = days since stored
  • S = stability (scales with conversation length + reinforcement count)
  • Delete when R < 0.20

This prevents the vector store from bloating with stale, irrelevant memories while keeping frequently-accessed ones alive.
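
As a worked example of the curve, setting R(t) = 0.20 and solving for t gives t = S · ln(5) ≈ 1.61 · S, so an episode with stability S = 10 would fall below the threshold after roughly 16 days. The sketch below encodes that arithmetic; the function names are hypothetical, and how the project actually computes S is not shown here.

```python
import math


def retention(days_elapsed: float, stability: float) -> float:
    """Ebbinghaus retention R(t) = exp(-t / S)."""
    if stability <= 0:
        raise ValueError("stability must be positive")
    return math.exp(-days_elapsed / stability)


def days_until_forgotten(stability: float, threshold: float = 0.20) -> float:
    """Solve exp(-t / S) = threshold for t, giving t = -S * ln(threshold)."""
    return -stability * math.log(threshold)


def should_forget(days_elapsed: float, stability: float, threshold: float = 0.20) -> bool:
    """True once retention has dropped below the deletion threshold."""
    return retention(days_elapsed, stability) < threshold
```

Doubling S doubles the time an untouched episode survives, which is why reinforcement and longer conversations keep memories around longer.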

Retrieval scoring

score = 0.7 × similarity + 0.3 × retention

Blends semantic relevance with memory freshness. Relevance carries most of the weight, but a very relevant, heavily decayed memory can still score lower than a slightly less relevant, recent one.
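
A short sketch of the blended score, with two made-up candidates to show how the weights trade off. The 0.7 and 0.3 weights come from the formula above; the function names and the sample similarity and retention values are illustrative only.

```python
from typing import List, Tuple


def blended_score(
    similarity: float,
    retention: float,
    w_similarity: float = 0.7,
    w_retention: float = 0.3,
) -> float:
    """score = 0.7 * similarity + 0.3 * retention (weights are tunable)."""
    return w_similarity * similarity + w_retention * retention


def rank(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Order (episode_id, similarity, retention) tuples by blended score, best first."""
    ordered = sorted(candidates, key=lambda c: blended_score(c[1], c[2]), reverse=True)
    return [episode_id for episode_id, _, _ in ordered]


# Hypothetical candidates: one very relevant but decayed, one slightly less relevant but fresh.
candidates = [
    ("old_but_relevant", 0.90, 0.25),
    ("recent_less_relevant", 0.80, 0.95),
]
order = rank(candidates)
```

Here the old episode scores about 0.705 and the recent one about 0.845, so the recent one ranks first. In general a fresher episode overtakes a more relevant one only when 0.3 × (retention gap) exceeds 0.7 × (similarity gap).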

Consolidation timing

Consolidation runs every 4 turns (2 exchanges) — frequent enough to capture facts early, infrequent enough to avoid excessive LLM calls.
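
One simple way to implement an "every 4 turns" trigger is a buffer that hands back a batch once it fills up. The class below is a hypothetical sketch, not the project's consolidator; it counts each user or assistant message as one turn, matching the "4 turns (2 exchanges)" wording.

```python
from typing import List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text)


class ConsolidationTrigger:
    """Collect turns and release them in batches of `every` for consolidation."""

    def __init__(self, every: int = 4) -> None:
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self._pending: List[Turn] = []

    def record(self, role: str, text: str) -> Optional[List[Turn]]:
        """Add one turn; return the full batch when it is time to consolidate."""
        self._pending.append((role, text))
        if len(self._pending) >= self.every:
            batch, self._pending = self._pending, []
            return batch
        return None

    def flush(self) -> List[Turn]:
        """Return whatever is pending, e.g. when a session ends early."""
        batch, self._pending = self._pending, []
        return batch
```

The caller would pass each returned batch to the extraction step, and call `flush()` from something like the end-session endpoint so a short final exchange is not lost.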

Interview Q&A Prep

"How do you decide what's important enough to store?" The consolidator makes a secondary LLM call with a strict extraction prompt that forbids guessing, so only facts the user states explicitly are stored. Conversation length drives stability: longer conversations get higher S values, so they decay more slowly.

"How do you handle conflicting memories?" Semantic memory uses an ON CONFLICT DO UPDATE SQL pattern — newer facts silently overwrite older ones. Episodic memories are never overwritten; they just get lower retrieval scores as they age.

"What's your retrieval strategy — recency vs relevance?" Both, blended. The score = 0.7 × similarity + 0.3 × retention formula means relevance dominates, while freshness decides between candidates of similar relevance. You can tune these weights.

"How do you stop the vector store from growing forever?" The forgetting worker runs R(t) = e^(-t/S) for every stored episode. Anything below 20% retention gets deleted from both ChromaDB and the metadata SQLite table. Reinforcement (re-accessing an episode) boosts its effective stability.

File Structure

llm-memory/
├── memory/
│   ├── memory_manager.py    # Core — all 4 memory types
│   └── consolidator.py      # Post-session extraction
├── api/
│   └── main.py              # FastAPI endpoints
├── ui/
│   ├── src/App.jsx          # React chat + memory inspector
│   └── src/main.jsx
├── test_memory.py           # Smoke test
└── requirements.txt

About

A persistent, human-inspired memory system for LLMs built with Python, FastAPI & React. Implements all 4 cognitive memory types: working, episodic, semantic, and procedural. Features Ebbinghaus forgetting curve decay, auto memory consolidation using Gemini 2.5 Flash, and a live memory inspector UI.
