Benchmark and code for studying context assembly methods under token budgets for LLM-based code repair.
Given a bug and a token budget, how should you pick which code to show the LLM? This repo contains a benchmark that evaluates different context assembly methods on that question, using controlled single-line mutations across two Python repositories.
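As a rough mental model (not the library's actual API), assembling context under a budget means ranking candidate code snippets and packing the best ones until the token budget runs out. A minimal generic sketch, with a whitespace split standing in for a real tokenizer:

```python
# Generic illustration of budget-constrained context assembly.
# Not cegraph's API: the ranking, snippet format, and token count are stand-ins.
def assemble_context(ranked_snippets: list[str], budget_tokens: int) -> str:
    """Greedily pack the highest-ranked snippets that fit within the budget."""
    context, used = [], 0
    for snippet in ranked_snippets:      # assumed already sorted by relevance
        cost = len(snippet.split())      # crude whitespace "token" count
        if used + cost > budget_tokens:
            continue                     # too big; try the next (smaller) snippet
        context.append(snippet)
        used += cost
    return "\n\n".join(context)

# With an 8-"token" budget the second snippet is skipped, but the third still fits.
print(assemble_context(["def f(x):\n    return x + 1", "x = f(3)", "print(x)"], 8))
```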
Repository layout:

- `src/cegraph/` — library that builds a knowledge graph from Python source and implements the retrieval/assembly methods (the sketch after this list illustrates the general idea)
- `paper/experiments/` — benchmark runner, mutation discovery, figure generation, analysis scripts
- `paper/results/` — raw results from benchmark runs
- `paper/` — paper draft (not yet published)
- `tests/` — unit tests for the library
- `csrc/` — optional C++ acceleration for graph traversal
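cegraph's graph construction is its own code; purely to illustrate the kind of structure such a graph captures, here is an unrelated minimal sketch that pulls function definitions and call edges out of Python source with the standard ast module:

```python
# Minimal sketch of a def/call graph built with the standard ast module.
# This is NOT cegraph's implementation; it only shows the kind of structure
# (functions and the calls between them) a code knowledge graph can capture.
import ast

def call_graph(source: str) -> dict[str, set[str]]:
    tree = ast.parse(source)
    edges: dict[str, set[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            edges[node.name] = {
                call.func.id
                for call in ast.walk(node)
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            }
    return edges

print(call_graph("def helper():\n    pass\n\ndef main():\n    helper()\n"))
# -> {'helper': set(), 'main': {'helper'}}
```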
Install the library with dev extras and matplotlib, then run the unit tests:

```bash
pip install -e ".[dev]"
pip install matplotlib
pytest tests/ -v
```

To run the benchmark against real code, clone and install the two target repositories at the pinned commits:

```bash
# pydantic-ai
git clone https://github.com/pydantic/pydantic-ai.git /path/to/repos/pydantic-ai
cd /path/to/repos/pydantic-ai
git checkout 69a578a1e101
pip install -e pydantic_graph && pip install -e 'pydantic_ai_slim[test]' && pip install inline-snapshot
# httpx
git clone https://github.com/encode/httpx.git /path/to/repos/httpx
cd /path/to/repos/httpx
git checkout ae1b9f66238f
pip install -e '.[brotli,zstd,cli,http2,socks]' && pip install pytest uvicorn trio trustme chardet
```

The mutations are already defined in the task files — no discovery step needed.
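The task file is JSONL (one JSON object per line). To peek at the predefined tasks without assuming anything about their schema, something like this works; only the file path comes from this repo, the rest is plain Python:

```python
# Count the predefined mutation tasks and list the keys of the first record.
# The per-record schema is the repo's own, so no field names are assumed here.
import json

with open("paper/experiments/eval_tasks_full.jsonl") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"{len(tasks)} tasks")
print("keys in first record:", sorted(tasks[0].keys()))
```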
Run the benchmark (requires an OpenAI API key):

```bash
export OPENAI_API_KEY=...
python paper/experiments/benchmark.py \
--tasks-file paper/experiments/eval_tasks_full.jsonl \
--repos-dir /path/to/repos \
--budgets 2000,4000,8000,10000 \
--methods no_retrieval,bm25,vector,keyword_map,bca_d1,bca,bca_d5,bca_no_closure,bca_no_scoring,target_file \
--query-types exact,vague,dev_report \
--output-dir paper/results/my_run
```
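For orientation, the flags above sweep a full grid per task; the snippet below just enumerates that grid using values copied from the command, nothing repo-specific:

```python
# Enumerate the method/budget/query-type grid swept by the command above.
# Values are copied verbatim from the CLI flags; nothing here touches the repo.
from itertools import product

methods = ("no_retrieval,bm25,vector,keyword_map,bca_d1,bca,bca_d5,"
           "bca_no_closure,bca_no_scoring,target_file").split(",")
budgets = [2000, 4000, 8000, 10000]
query_types = ["exact", "vague", "dev_report"]

grid = list(product(methods, budgets, query_types))
print(f"{len(grid)} method/budget/query-type combinations per task")  # 120
```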
Merge the per-task results, then generate figures and run the post-hoc analyses:

```bash
python -m paper.experiments.merge_and_report --run-dir paper/results/my_run
python paper/experiments/generate_figures.py --run-dir paper/results/my_run
python paper/experiments/posthoc_analysis.py --run-dir paper/results/my_run
python paper/experiments/router_ab.py --run-dir paper/results/my_run
```

License: MIT