AutoRAG: Self-Improving RAG System

A system that optimizes its own RAG pipeline -- not just prompts, but chunking, embeddings, model routing, retrieval parameters, and pipeline topology -- tuned autonomously against Meta's CRAG benchmark across 5 domains.

Applies Karpathy's autoresearch pattern to the entire RAG stack instead of model training.

Results

                       Baseline    Optimized    Change
crag_score:            0.208       0.360        +73%
accuracy:              39.4%       46.0%        +17%
hallucination_rate:    18.6%       10.0%        -46%

18 experiments run autonomously. 5 kept, 13 discarded. No human in the loop.
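
The Change column reports relative change against the baseline, not percentage-point differences. A minimal sketch that reproduces it from the table values (the helper name `relative_change` is illustrative and not part of the repository):

```python
def relative_change(baseline: float, optimized: float) -> float:
    """Return the relative change from baseline to optimized, in percent."""
    return (optimized - baseline) / baseline * 100.0


# Values copied from the results table above.
results = {
    "crag_score": (0.208, 0.360),
    "accuracy": (39.4, 46.0),
    "hallucination_rate": (18.6, 10.0),
}

for name, (baseline, optimized) in results.items():
    print(f"{name}: {relative_change(baseline, optimized):+.0f}%")
```

This prints +73%, +17%, and -46%, matching the table.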

What the Optimizer Learned

Experiment	Change	Score	Verdict
002	Raise confidence threshold 0.70 to 0.85	0.208 -> 0.260	keep
003	Better false premise detection (5 categories + examples)	0.260 -> 0.310	keep
005	Stricter hallucination patterns in validator	0.310 -> 0.320	keep
009	Lower threshold 0.85 to 0.80 (sweet spot)	0.320 -> 0.330	keep
011	Upgrade validator from Haiku to Sonnet	0.330 -> 0.360	keep

Key discoveries (without human guidance):

Hallucination is the bottleneck, not accuracy. Each incorrect answer scores -1.0 while "I don't know" scores 0.0, so a wrong answer is worse than no answer. The optimizer learned to prioritize reducing hallucination over increasing recall.
Confidence threshold has a sweet spot. Too low (0.70) lets hallucinations through. Too high (0.85) blocks correct answers. The optimizer settled on 0.80 in two steps (experiments 002 and 009).
More expensive models aren't always better. Upgrading the classifier to Sonnet added 48% cost with zero score improvement. But upgrading the validator to Sonnet improved score by +0.030.
More context can hurt. Increasing top_k from 5 to 7-8 added noise that increased hallucination.
Smaller chunks hurt. Chunk size 256 produced fragments too small for coherent answers.
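
The first two discoveries follow from the scoring rule. Under the simplifying assumption that an answer is either fully correct (+1.0) or incorrect (-1.0), answering with probability p of being correct has expected score 2p - 1, so abstaining (0.0) wins whenever p is below 0.5. The validator's confidence is unlikely to be a calibrated probability, which may be why the useful threshold sits well above 0.5. The sketch below is illustrative only: `expected_score`, `gate_answer`, and the example answer are hypothetical and not taken from the repository.

```python
IDK = "I don't know"


def expected_score(p_correct: float) -> float:
    """Expected CRAG score of answering, ignoring the 'acceptable' (0.5) grade."""
    return p_correct * 1.0 + (1.0 - p_correct) * -1.0


def gate_answer(candidate: str, confidence: float, threshold: float = 0.80) -> str:
    """Return the candidate only if the confidence clears the threshold."""
    return candidate if confidence >= threshold else IDK


print(expected_score(0.5))                          # break-even point
print(gate_answer("Paris", 0.83))                   # passes at threshold 0.80
print(gate_answer("Paris", 0.83, threshold=0.85))   # blocked at threshold 0.85
```

The last two calls show the trade-off from experiments 002 and 009: the same answer passes at 0.80 but is replaced by "I don't know" at 0.85.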

How It Works

                    INNER PIPELINE (per question)
                    ==============================
User Question
     |
     v
[Query Classifier]  -->  domain + type + false premise?     (Haiku)
     |
     v
[Query Rewriter]    -->  optimized search query              (Haiku)
     |
     v
[LanceDB Retrieval] -->  top 5 relevant chunks              (vector search)
     |
     v
[Answer Generator]  -->  candidate answer                    (Sonnet)
     |
     v
[Answer Validator]  -->  final answer or "I don't know"      (Sonnet)


                    OUTER LOOP (autonomous)
                    ========================
              +---> Read results.tsv (experiment history)
              |     Read config.yaml + agents/skills/*.md
              |
              +---> Make ONE change (prompt, config param, or model route)
              |
              +---> Re-index if chunking/embedding changed
              |
              +---> Run evaluate.py on 500-question CRAG dev set
              |
              +---> crag_score improved?
              |      YES --> git commit (keep)
              |      NO  --> git checkout (discard)
              |
              +---> Log to results.tsv, LOOP BACK
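
The keep/discard decision in the outer loop can be sketched as follows. This is a simplified model, not the repository's implementation: the callbacks stand in for editing files, running evaluate.py, and the git commands, and treating a tie as a discard is an assumption. All function and parameter names are hypothetical.

```python
from typing import Callable, Iterable


def run_outer_loop(
    baseline_score: float,
    experiments: Iterable[str],
    apply_change: Callable[[str], None],
    evaluate: Callable[[], float],
    keep: Callable[[str], None],
    discard: Callable[[str], None],
) -> list[tuple[str, float, str]]:
    """Try one change at a time; keep it only if the score strictly improves."""
    best = baseline_score
    log = []
    for name in experiments:
        apply_change(name)      # edit a prompt, config value, or model route
        score = evaluate()      # e.g. run evaluate.py and read crag_score
        if score > best:
            keep(name)          # e.g. git commit
            best = score
            verdict = "keep"
        else:
            discard(name)       # e.g. git checkout to revert the change
            verdict = "discard"
        log.append((name, score, verdict))  # one row per experiment
    return log


# Dry run with stubbed callbacks and made-up scores (not real results).
scores = iter([0.26, 0.25, 0.31])
log = run_outer_loop(
    baseline_score=0.208,
    experiments=["exp_a", "exp_b", "exp_c"],
    apply_change=lambda name: None,
    evaluate=lambda: next(scores),
    keep=lambda name: None,
    discard=lambda name: None,
)
print(log)
```

Because each experiment is compared against the best score so far rather than the original baseline, a kept change raises the bar for every later experiment.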

The Autoresearch Mapping

Karpathy's autoresearch	Project 1 (Financial)	Project 2 (AutoRAG)
`prepare.py` (fixed eval)	`evaluate.py` (fixed eval)	`evaluate.py` (fixed eval)
`train.py` (agent edits)	`agents/skills/*.md`	`config.yaml` + `agents/skills/*.md`
`val_bpb` (metric)	`composite_score`	`crag_score`
`program.md` (strategy)	`optimizer_program.md`	`optimizer_program.md`
5-min GPU budget	~$1.80 API budget	~$0.78-$3.89 API budget
git keep/discard	git keep/discard	git keep/discard

What Project 2 Adds Over Project 1

Project 1 had 1 optimization dimension (edit skill file text). Project 2 has 7 optimization dimensions:

#	Dimension	Config Location	Re-index?
1	Agent prompts	`agents/skills/*.md`	No
2	Retrieval parameters	`config.yaml` retrieval	No
3	Model routing	`config.yaml` models	No
4	Pipeline topology	`config.yaml` pipeline	No
5	Few-shot examples	`config.yaml` few_shot	No
6	Chunking strategy	`config.yaml` chunking	Yes
7	Embedding model	`config.yaml` embedding	Yes

The optimizer learns that dimensions 1-5 are cheap experiments (~$0.78 per eval) and 6-7 are expensive (~$0.78 + 30 min re-indexing). It naturally front-loads cheap experiments.
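
The cost asymmetry can be made explicit with a small sketch. The optimizer agent arrives at this ordering on its own, so the code below is a hand-written heuristic for illustration and not how the agent is implemented; `Dimension`, `extra_minutes`, and `cheap_first` are hypothetical names.

```python
from dataclasses import dataclass

REINDEX_MINUTES = 30  # extra wall-clock time when the index must be rebuilt


@dataclass(frozen=True)
class Dimension:
    name: str
    needs_reindex: bool


# Rows copied from the table above.
DIMENSIONS = [
    Dimension("Agent prompts", False),
    Dimension("Retrieval parameters", False),
    Dimension("Model routing", False),
    Dimension("Pipeline topology", False),
    Dimension("Few-shot examples", False),
    Dimension("Chunking strategy", True),
    Dimension("Embedding model", True),
]


def extra_minutes(dim: Dimension) -> int:
    """Extra time per experiment on top of the evaluation itself."""
    return REINDEX_MINUTES if dim.needs_reindex else 0


def cheap_first(dims: list[Dimension]) -> list[Dimension]:
    """Order dimensions so that those needing no re-index come first."""
    return sorted(dims, key=extra_minutes)  # sorted() is stable


for dim in cheap_first(list(reversed(DIMENSIONS))):
    print(f"{dim.name}: +{extra_minutes(dim)} min")
```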

Evaluation

CRAG scoring (per-question):

Perfect (1.0): correct answer
Acceptable (0.5): partially correct
Missing (0.0): "I don't know" or no answer
Incorrect (-1.0): wrong or hallucinated

Composite metric: crag_score = accuracy - hallucination_rate
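
A minimal sketch of the composite metric, computed from a list of per-question grades. It assumes that an "acceptable" answer counts half toward accuracy, which the text above does not state explicitly; under that assumption crag_score equals the mean of the per-question scores. The names `GRADE_SCORES` and `crag_metrics` are illustrative, and the grades are toy data.

```python
from collections import Counter

GRADE_SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}


def crag_metrics(grades: list[str]) -> dict[str, float]:
    """Compute accuracy, hallucination rate, and crag_score from judge grades."""
    counts = Counter(grades)
    n = len(grades)
    # Assumption: "acceptable" answers count half toward accuracy.
    accuracy = (counts["perfect"] + 0.5 * counts["acceptable"]) / n
    hallucination_rate = counts["incorrect"] / n
    return {
        "accuracy": accuracy,
        "hallucination_rate": hallucination_rate,
        "crag_score": accuracy - hallucination_rate,
    }


# Toy example: 10 graded questions (not real results).
grades = ["perfect"] * 5 + ["acceptable"] * 2 + ["missing"] * 2 + ["incorrect"]
metrics = crag_metrics(grades)
mean_score = sum(GRADE_SCORES[g] for g in grades) / len(grades)
print(metrics, mean_score)
```

For the toy data, accuracy is 0.6 and the hallucination rate is 0.1, so crag_score is 0.5, the same as the mean per-question score.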

Scoring uses a single LLM judge (Claude Haiku) for consistency across experiments. Official CRAG evaluation uses dual judges (ChatGPT + Llama 3).

Benchmark

Domain	Baseline	Optimized
Finance	0.10	--
Sports	0.19	--
Movie	0.14	--
Music	0.28	--
Open	0.33	--

500 questions, 5 domains, 8 question types. Dev/test split stratified by domain.
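
One way to produce a dev/test split stratified by domain is sketched below. The record field name `domain`, the 50/50 ratio, and the fixed seed are assumptions; the repository's download script may do this differently.

```python
import random
from collections import defaultdict


def stratified_split(questions, key="domain", dev_fraction=0.5, seed=0):
    """Split records into dev/test so each domain keeps the same proportion."""
    by_domain = defaultdict(list)
    for question in questions:
        by_domain[question[key]].append(question)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    dev, test = [], []
    for domain in sorted(by_domain):
        group = by_domain[domain][:]
        rng.shuffle(group)
        cut = round(len(group) * dev_fraction)
        dev.extend(group[:cut])
        test.extend(group[cut:])
    return dev, test


# Toy data: 4 questions in each of the 5 domains.
domains = ["finance", "sports", "movie", "music", "open"]
questions = [{"id": f"{d}-{i}", "domain": d} for d in domains for i in range(4)]
dev, test = stratified_split(questions)
print(len(dev), len(test))
```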

Quick Start

Prerequisites

Python 3.12+
uv package manager
API keys: Anthropic (Claude), OpenAI (embeddings)

Setup

# Install dependencies
uv sync

# Set API keys
cp .env.example .env
# Edit .env with your keys

# Download CRAG dataset (~5.5 GB, one-time)
uv run scripts/download_crag.py

# Build vector index (~30 min, eval-only docs)
uv run scripts/build_index.py --eval-only --force

# Run baseline evaluation (~60 min, ~$3.89)
uv run evaluate.py --split dev

Single Query Test

uv run -m agents.pipeline --query "who won the NFL MVP?" --verbose

Start the Optimizer

# Windows (PowerShell)
.\scripts\run_optimizer.ps1

# Linux/Mac
./scripts/run_optimizer.sh

The optimizer runs continuously with a $15 budget cap, making one focused change per iteration, keeping improvements and discarding regressions. Check results.tsv for progress.
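
The budget cap amounts to a check before each evaluation. The repository's scripts implement this in Bash and PowerShell and their exact logic is not shown here, so the sketch below is only an illustration in Python; `BudgetTracker` is a hypothetical name.

```python
class BudgetTracker:
    """Track cumulative API spend against a hard cap (all amounts in USD)."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def can_afford(self, cost_usd: float) -> bool:
        """True if spending cost_usd would stay within the cap."""
        return self.spent_usd + cost_usd <= self.cap_usd

    def charge(self, cost_usd: float) -> None:
        """Record a spend, refusing anything that would exceed the cap."""
        if not self.can_afford(cost_usd):
            raise RuntimeError("budget cap reached")
        self.spent_usd += cost_usd


tracker = BudgetTracker(cap_usd=15.0)
evals_run = 0
while tracker.can_afford(0.78):  # quick 100-question eval cost from the Cost section
    tracker.charge(0.78)
    evals_run += 1
print(evals_run, round(tracker.spent_usd, 2))
```

At $0.78 per quick eval, the cap allows 19 evaluations ($14.82). Full 500-question evals or re-indexing would use the budget up faster.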

Project Structure

evaluate.py                  # FIXED. CRAG scoring harness (never modified)
config.yaml                  # TUNABLE. All 7 optimization dimensions
optimizer_program.md         # Instructions for the optimizer agent
results.tsv                  # Experiment log (keep/discard history)
pyproject.toml               # Dependencies
.env                         # API keys (ANTHROPIC, OPENAI)

agents/
  pipeline.py                # FIXED. Orchestrates the 5-stage RAG pipeline
  rag.py                     # FIXED. Chunking, embedding, indexing, retrieval
  llm.py                     # FIXED. Anthropic SDK wrapper with cost tracking
  config.py                  # FIXED. Config loader with validation
  skills/
    query_classifier.md      # TUNABLE. Question classification prompt
    query_rewriter.md        # TUNABLE. Query rewriting prompt
    answer_generator.md      # TUNABLE. Answer generation prompt
    answer_validator.md      # TUNABLE. Hallucination check prompt

data/
  crag/
    dev.jsonl                # 500 stratified eval questions
    test.jsonl               # 500 held-out test questions
    documents/               # 18,752 extracted HTML documents
  vectorstore/               # LanceDB index (rebuilt from config)

scripts/
  download_crag.py           # Downloads and prepares CRAG data
  build_index.py             # Builds LanceDB vector index
  run_optimizer.sh           # Bash optimizer loop with budget tracking
  run_optimizer.ps1          # PowerShell optimizer loop with budget tracking

Fixed infrastructure (evaluate.py, pipeline.py, rag.py, llm.py, config.py) is never modified -- just like prepare.py in autoresearch. The optimizer only touches config.yaml and the four skill files in agents/skills/.
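
The README does not say how this restriction is enforced. One possible guard, sketched under that caveat, checks the changed paths against an allow-list before a commit; `TUNABLE_PATTERNS`, `is_tunable`, and `illegal_changes` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Paths the optimizer may edit, relative to the repository root.
TUNABLE_PATTERNS = ("config.yaml", "agents/skills/*.md")


def is_tunable(path: str) -> bool:
    """True if a POSIX-style relative path matches the allow-list."""
    return any(fnmatchcase(path, pattern) for pattern in TUNABLE_PATTERNS)


def illegal_changes(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that touch fixed infrastructure."""
    return [path for path in changed_paths if not is_tunable(path)]


changed = ["config.yaml", "agents/skills/answer_validator.md", "evaluate.py"]
print(illegal_changes(changed))
```

Note that `*` in `fnmatchcase` also matches `/`, so this pattern would accept Markdown files in subdirectories of agents/skills/ as well.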

Tech Stack

Component	Technology
LLM (generation)	Claude Sonnet 4.6
LLM (routing/validation)	Claude Haiku 4.5 (validator upgraded to Sonnet in experiment 011)
Embeddings	OpenAI text-embedding-3-small
Vector Store	LanceDB 0.30.0
Benchmark	Meta CRAG (4,409 QA pairs, 5 domains)
Package Manager	uv
Experiment Tracking	Git + results.tsv

Cost

Item	Cost
Single query (all 5 stages)	~$0.008
Full 500-question eval	~$3.89
Quick 100-question eval	~$0.78
Index build (3,720 docs)	~$0.70 (OpenAI embeddings)
18-experiment optimization run	~$15

License

MIT
