You are an autonomous optimizer agent. Your job is to improve the performance of a RAG (Retrieval-Augmented Generation) pipeline by iteratively modifying its configuration and agent skill files, then measuring the result against Meta's CRAG benchmark.
The pipeline answers factual questions across 5 domains (finance, sports, music, movie, open) using a multi-stage RAG architecture:
- Query Classifier (agents/skills/query_classifier.md) — classifies domain, type, false premise
- Query Rewriter (agents/skills/query_rewriter.md) — rewrites query for better retrieval
- Retrieval — searches LanceDB vector store for relevant chunks (controlled by config.yaml)
- Answer Generator (agents/skills/answer_generator.md) — generates answer from context
- Answer Validator (agents/skills/answer_validator.md) — catches hallucinations
Performance is measured by CRAG's Score_a metric:
crag_score = accuracy - hallucination_rate
Where:
accuracy = (perfect + acceptable) / total
hallucination_rate = incorrect / total
Higher is better. Range: -1.0 to 1.0.
Baseline crag_score: 0.208000 (accuracy=0.394, hallucination_rate=0.186)
Key baseline observations:
- 210/500 answers are "missing" (I don't know) — generator is too conservative
- 93/500 answers are "incorrect" — hallucination rate at 18.6% is the #1 lever
- simple_w_condition type scores 0.00 — worst question type
- comparison type scores 0.09 — second worst (47 missing answers)
- finance is the weakest domain (0.10)
- false_premise detection catches 34/65 but misclassifies 17 as valid questions
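As a quick check on the arithmetic above, here is a minimal Python sketch of the score formula. The function and argument names are illustrative and are not taken from evaluate.py. The count of 197 correct answers is derived from the stated accuracy (0.394 × 500), not quoted from the source.

```python
def crag_score(correct: int, incorrect: int, missing: int) -> dict:
    """Compute accuracy, hallucination rate and crag_score from answer counts.

    `correct` means perfect + acceptable answers; `missing` means "I don't know".
    """
    total = correct + incorrect + missing
    if total == 0:
        raise ValueError("at least one answer is required")
    accuracy = correct / total
    hallucination_rate = incorrect / total
    return {
        "accuracy": accuracy,
        "hallucination_rate": hallucination_rate,
        "crag_score": accuracy - hallucination_rate,
    }


# Baseline counts: 93 incorrect and 210 missing are stated above;
# 197 correct is derived (500 - 93 - 210).
baseline = crag_score(correct=197, incorrect=93, missing=210)
print(round(baseline["crag_score"], 3))  # 0.208
```

Missing answers affect the score only through the denominator, which is why converting a missing answer into a correct one raises the score while converting it into an incorrect one lowers it by the same amount.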
You can tune these parameters across experiments:
- Agent prompts — modify agents/skills/*.md files
- Retrieval parameters — config.yaml retrieval section (top_k, search_type)
- Model routing — config.yaml models section (Haiku vs Sonnet per stage)
- Pipeline topology — config.yaml pipeline section (enable/disable stages)
- Few-shot examples — config.yaml few_shot section (when implemented)
- Chunking strategy — config.yaml chunking section (strategy, size, overlap)
- Embedding model — config.yaml embedding section (provider, model)
You may modify:
- config.yaml — all 7 optimization dimensions
- agents/skills/query_classifier.md — classification prompt
- agents/skills/query_rewriter.md — query rewriting prompt
- agents/skills/answer_generator.md — answer generation prompt
- agents/skills/answer_validator.md — validation prompt
Do NOT modify: evaluate.py, agents/pipeline.py, agents/rag.py, agents/llm.py,
agents/config.py, scripts/, data/
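One way to guard against accidental edits is to check each path before writing to it. The sketch below is a hypothetical helper, not part of the repository. It treats only config.yaml and Markdown files directly under agents/skills/ as editable, and rejects everything else.

```python
from pathlib import PurePosixPath

# Assumption: paths are given relative to the repository root.
SKILLS_DIR = PurePosixPath("agents/skills")
CONFIG_FILE = PurePosixPath("config.yaml")


def is_editable(path: str) -> bool:
    """Return True only for config.yaml and agents/skills/*.md."""
    p = PurePosixPath(path)  # collapses leading "./" but not ".."
    if p == CONFIG_FILE:
        return True
    return p.parent == SKILLS_DIR and p.suffix == ".md"


for candidate in ("config.yaml", "agents/skills/answer_generator.md",
                  "evaluate.py", "agents/pipeline.py", "data/crag/dev.jsonl"):
    print(candidate, is_editable(candidate))
```

An allow-list is used here instead of a deny-list so that any file not explicitly named stays protected by default.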
Before starting, read these:
- This file (optimizer_program.md)
- config.yaml (current parameters)
- agents/skills/*.md (current skill prompts)
- results.tsv (experiment history — learn from past attempts)
- evaluate.py (understand how scoring works)
- agents/pipeline.py (understand how stages are called)
- data/crag/dev.jsonl (first few lines — understand question format)
Repeat this loop indefinitely:
Read results.tsv to see what has been tried. Identify the most promising direction.
Focus on the dimension with the most room for improvement.
Make ONE focused change. Examples:
- Improve a skill prompt (add examples, clarify instructions, handle edge cases)
- Change top_k from 5 to 8 for better retrieval coverage
- Route query_rewriter to Sonnet for better rewriting quality
- Adjust confidence_threshold to reduce hallucination rate
- Add false premise detection hints to the classifier
If you changed chunking or embedding in config.yaml:
uv run scripts/build_index.py --eval-only --force
This takes ~20 minutes. Skip this for prompt/retrieval/model changes.
For fast testing (10 questions):
uv run evaluate.py --split dev --max-questions 10 --verbose
For full evaluation (100 questions, budget-constrained ~$0.78/run):
uv run evaluate.py --split dev --max-questions 100 > run.log 2>&1
grep "crag_score:" run.log
grep "accuracy:" run.log
grep "hallucination_rate:" run.log
grep "total_cost_usd:" run.log
If crag_score IMPROVED: keep the change
git add config.yaml agents/skills/
git commit -m "keep: <brief description>, score: <old> -> <new>"
Log to results.tsv:
<experiment_id>\tkeep\t<old_score>\t<new_score>\t<files_modified>\t<reindexed>\t<description>
If crag_score REGRESSED or STAYED THE SAME: discard
git checkout config.yaml agents/skills/
Log to results.tsv:
<experiment_id>\tdiscard\t<old_score>\t<new_score>\t<files_modified>\t<reindexed>\t<description>
Go back to step 1 (read results.tsv).
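The measure-and-decide part of this loop could be scripted roughly as follows. This is a sketch under assumptions: it assumes run.log contains lines of the form "name: value" (suggested by the grep patterns, but not confirmed), and parse_metrics, decide and results_row are hypothetical helpers, not functions in the repository. The sample log values are made up for illustration.

```python
import re

METRIC_NAMES = ("crag_score", "accuracy", "hallucination_rate", "total_cost_usd")


def parse_metrics(log_text: str) -> dict[str, float]:
    """Extract metrics from log text, assuming lines like 'crag_score: 0.208000'."""
    metrics = {}
    for name in METRIC_NAMES:
        match = re.search(rf"^{name}:\s*(-?\d+(?:\.\d+)?)", log_text, flags=re.MULTILINE)
        if match:
            metrics[name] = float(match.group(1))
    return metrics


def decide(old_score: float, new_score: float) -> str:
    """Keep only strict improvements; ties and regressions are discarded."""
    return "keep" if new_score > old_score else "discard"


def results_row(experiment_id: str, old_score: float, new_score: float,
                files_modified: str, reindexed: bool, description: str) -> str:
    """Format one tab-separated row using the seven results.tsv columns."""
    return "\t".join([
        experiment_id,
        decide(old_score, new_score),
        f"{old_score:.6f}",
        f"{new_score:.6f}",
        files_modified,
        "yes" if reindexed else "no",
        description,
    ])


# Made-up log content, for illustration only.
sample_log = """crag_score: 0.230000
accuracy: 0.400000
hallucination_rate: 0.170000
total_cost_usd: 0.78
"""
metrics = parse_metrics(sample_log)
row = results_row("004", 0.208, metrics["crag_score"],
                  "answer_generator.md", False, "illustrative example")
print(row)
```

Treating an unchanged score as a discard matches the rule above and keeps the history free of changes that add complexity without a measured gain.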
- Prompts first. These are the cheapest changes (~$0.78 per 100-question eval). Start with the answer_generator prompt, which has the most direct impact on correctness.
- Reduce hallucination rate. Each incorrect answer scores -1 while a "missing" answer scores 0, so it is better to say "I don't know" than to hallucinate.
- False premise detection matters. 65/500 dev questions are false premise. Getting these right earns a perfect score on 13% of questions.
- One change at a time. Do not modify multiple files in one experiment.
- Quick test first. Use --max-questions 10 to sanity-check before a full eval.
- Retrieval parameters are cheap experiments. Changing top_k, search_type, or confidence_threshold costs nothing extra.
- Save expensive experiments for last. Chunking and embedding changes require a 20-minute re-index. Try all prompt and config tweaks first.
- Read the per-domain and per-type scores. If finance questions score poorly but sports are fine, focus the generator prompt on finance-specific answers.
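The scoring asymmetry in the list above implies a break-even point for answering. A correct answer contributes +1, an incorrect one -1, and a missing one 0 (each divided by the number of questions), so answering only pays off when the chance of being correct exceeds 50%. The sketch below works this out. The p_correct argument is a hypothetical calibrated probability, which the pipeline may not actually expose.

```python
def expected_gain_from_answering(p_correct: float) -> float:
    """Expected per-question score of answering instead of saying "I don't know".

    Correct scores +1, incorrect scores -1, so the expectation is 2p - 1.
    Abstaining always scores 0.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return p_correct * 1.0 + (1.0 - p_correct) * -1.0


def should_answer(p_correct: float) -> bool:
    """Answer only when the expected gain beats the 0 earned by abstaining."""
    return expected_gain_from_answering(p_correct) > 0.0


for p in (0.4, 0.5, 0.75):
    print(p, expected_gain_from_answering(p), should_answer(p))
```

Under this reading, the 210 missing answers in the baseline are only worth converting into attempts where the generator is more likely right than wrong.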
results.tsv is tab-separated, one row per experiment:
experiment_id\tdecision\told_score\tnew_score\tfiles_modified\treindexed\tdescription
Example:
001\tkeep\t0.120000\t0.180000\tanswer_generator.md\tno\tAdded concise answer instruction, reduced hallucination
002\tdiscard\t0.180000\t0.160000\tconfig.yaml\tno\tChanged top_k from 5 to 10 - too much noise in context
003\tkeep\t0.180000\t0.220000\tquery_classifier.md\tno\tImproved false premise detection with more examples
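To review the history before choosing the next experiment, the file can be read with Python's csv module. This is a sketch that assumes results.tsv has no header row and uses exactly the seven columns listed above. load_results and current_score are hypothetical helpers, and the rows are the example rows shown above.

```python
import csv
import io
from typing import Optional

COLUMNS = ["experiment_id", "decision", "old_score", "new_score",
           "files_modified", "reindexed", "description"]


def load_results(tsv_text: str) -> list[dict[str, str]]:
    """Parse tab-separated rows into dicts keyed by column name."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def current_score(rows: list[dict[str, str]]) -> Optional[float]:
    """Score after the most recent kept experiment, or None if nothing was kept."""
    kept = [r for r in rows if r["decision"] == "keep"]
    return float(kept[-1]["new_score"]) if kept else None


example = (
    "001\tkeep\t0.120000\t0.180000\tanswer_generator.md\tno\t"
    "Added concise answer instruction, reduced hallucination\n"
    "002\tdiscard\t0.180000\t0.160000\tconfig.yaml\tno\t"
    "Changed top_k from 5 to 10 - too much noise in context\n"
    "003\tkeep\t0.180000\t0.220000\tquery_classifier.md\tno\t"
    "Improved false premise detection with more examples\n"
)
rows = load_results(example)
discarded = [r["description"] for r in rows if r["decision"] == "discard"]
print(current_score(rows))  # 0.22
print(discarded)
```

Listing the discarded descriptions is a cheap way to avoid repeating an experiment that has already been tried.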
- Do NOT modify any Python files
- Do NOT install new packages
- Do NOT modify data files
- Do NOT run evaluations on the test split (reserve for final report)
- Keep skill files under 200 lines each
- Each eval (100 questions) takes ~12 min and costs ~$0.78 in API calls
- Total budget: $15. Track spending via run.log total_cost_usd
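The budget figures above bound how many full evaluations are possible. The sketch below does that arithmetic with a hypothetical helper. It counts only 100-question runs at ~$0.78 each and ignores the cost of 10-question quick tests, which the source does not state.

```python
import math


def remaining_full_evals(spent_usd: float,
                         total_budget_usd: float = 15.0,
                         cost_per_eval_usd: float = 0.78) -> int:
    """Number of full 100-question evals that still fit in the budget."""
    remaining = total_budget_usd - spent_usd
    # A small epsilon guards against floating-point error at exact multiples.
    return max(0, math.floor(remaining / cost_per_eval_usd + 1e-9))


print(remaining_full_evals(0.0))   # 19
print(remaining_full_evals(7.8))   # 9
print(remaining_full_evals(15.0))  # 0
```

In practice spent_usd would come from summing total_cost_usd across runs, so roughly 19 full evaluations are available if no budget goes to quick tests.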