A benchmark comparing 7 search strategies over the same document corpus, using Claude Code agents and MCP tools.
Each strategy is implemented as a Claude Code sub-agent (Claude Sonnet) with a different search tool. An arbiter (Claude Opus) then evaluates answer quality using a standardized scoring rubric.
The example corpus uses D&D 5th Edition PDFs, but the approach is generic and works with any document collection (PDF, DOCX, PPTX, RTF, ODT, TXT, MD, HTML).
```
┌─────────────────────────────────────────────────────────────┐
│ INDEXING (one-time)                                         │
│                                                             │
│ docs/* ──► extract + chunk ──► BM25 index + Qdrant DB       │
└─────────────────────────────────────────────────────────────┘

              ┌────────────────┐
              │ questions.json │
              └───────┬────────┘
                      │
          ┌───────────▼───────────┐
          │ Agent (Claude Sonnet) │
          │ 1 agent per strategy  │
          └───────────┬───────────┘
                      │
       ┌──────────────┼───────────────┐
       │              │               │
┌──────▼──────┐  ┌────▼────┐  ┌───────▼───────┐
│ BM25 (MCP)  │  │ Qdrant  │  │ Document      │
│ bm25s       │  │ (MCP)   │  │ Corpus docs/* │
└──────┬──────┘  └────┬────┘  └───────────────┘
       │              │
       └──────┬───────┘
              │
     ┌────────▼────────┐
     │ Agent Response  │
     └────────┬────────┘
              │
     ┌────────▼────────┐
     │ Arbiter (Opus)  │
     │ Quality Score   │
     └─────────────────┘
```
| # | Strategy | Engine | Rounds | maxTurns |
|---|---|---|---|---|
| 1 | BM25 simple | bm25s | 1 | 2 |
| 2 | BM25 2-round | bm25s | 2 | 4 |
| 3 | Vector DB naive | Qdrant | 1 | 2 |
| 4 | Vector DB 2-round | Qdrant | 2 | 4 |
| 5 | BM25 free | bm25s | free | default |
| 6 | Vector free | Qdrant | free | default |
| 7 | Hybrid free | bm25s + Qdrant | free | default |
All tools read from a single config file, `search_bench.json` (copy from `search_bench.json.example` and customize):

```json
{
  "docs_dir": "./docs",
  "collection_name": "my_corpus",
  "stemmer_language": "english",
  "qdrant_url": "http://localhost:6333",
  "embedding_model": "jinaai/jina-embeddings-v3",
  "pdf_backend": "marker",
  "poison": false
}
```

| Key | Description |
|---|---|
| `docs_dir` | Path to the document corpus (relative to the config file); subdirectories are recursed into |
| `collection_name` | Used to derive the BM25 index dir (`data/bm25_{name}`) and the Qdrant collection name |
| `stemmer_language` | BM25 stemmer language (e.g. `english`, `french`) |
| `qdrant_url` | Qdrant server URL |
| `embedding_model` | FastEmbed model for vector embeddings |
| `pdf_backend` | `"pymupdf"` (default, fast, CPU) or `"marker"` (OCR-capable, GPU) |
| `poison` | `false` (default) or `true` — enable corpus poisoning |
| `cudnn_path` | `null` (default) — path to the cuDNN DLLs directory, Windows only (e.g. `"C:\\Program Files\\NVIDIA\\CUDNN\\v9.20\\bin\\12.9\\x64"`) |
To switch corpus or collection: edit search_bench.json, then re-index.
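The key handling described above can be sketched as a small loader. This is illustrative only; the actual loader lives in `tools/config.py` and may resolve keys differently:

```python
import json
from pathlib import Path


def load_config(path: str = "search_bench.json") -> dict:
    """Load the shared benchmark config (sketch).

    docs_dir is resolved relative to the config file's location, and the
    BM25 index directory is derived from collection_name, as described
    in the key table above.
    """
    cfg_path = Path(path).resolve()
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))

    # Defaults for optional keys
    cfg.setdefault("pdf_backend", "pymupdf")
    cfg.setdefault("poison", False)
    cfg.setdefault("cudnn_path", None)

    # Paths are relative to the config file, not the working directory
    cfg["docs_dir"] = str((cfg_path.parent / cfg["docs_dir"]).resolve())

    # Derived name: data/bm25_{collection_name}
    cfg["bm25_index_dir"] = str(Path("data") / f"bm25_{cfg['collection_name']}")
    return cfg
```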
Two backends are available, selected via `pdf_backend` in `search_bench.json`:

- `pymupdf` (default): pymupdf4llm → Markdown. Fast, CPU-only. Best for native PDFs with text layers.
- `marker`: Marker (Surya-based) with visual layout detection and auto-OCR. Requires a GPU (~5 GB VRAM). Best for scanned PDFs with complex layouts (multi-column, tables, sidebars).
Both backends produce per-page text that is then chunked into 512-word blocks with 75-word overlap.
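The sliding-window chunking above can be sketched as follows (a minimal illustration; the real chunker in `tools/pdf_utils.py` may differ in details):

```python
def chunk_words(text: str, size: int = 512, overlap: int = 75) -> list[str]:
    """Split text into blocks of `size` words, where each block repeats
    the last `overlap` words of the previous one."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)] if words else []
    step = size - overlap  # advance 437 words per block by default
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last block reached the end
            break
    return chunks
```

The overlap ensures that a fact straddling a block boundary is fully contained in at least one chunk.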
A `clean_text()` function normalizes the text before chunking to avoid token-dense artifacts:

- `<br>` tags (from markdown tables) → newlines
- Long table separators (`----...----`) → `---`
- URLs → `[url deleted]` placeholder
- Long dot sequences (TOC lines) → `...`

This is critical for embedding performance — without it, a single "word" (per `.split()`) could contain 900+ chars of `<br>`-separated HTML, producing hundreds of tokens and slowing embedding from ~1s to 250s per batch.
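The rules above amount to a handful of regex substitutions. A rough sketch (the real `clean_text()` in `tools/pdf_utils.py` may use different patterns):

```python
import re


def clean_text(text: str) -> str:
    """Normalize extraction artifacts before chunking (sketch of the
    rules described above)."""
    text = re.sub(r"<br\s*/?>", "\n", text)                # <br> tags -> newlines
    text = re.sub(r"-{4,}", "---", text)                   # long table separators
    text = re.sub(r"https?://\S+", "[url deleted]", text)  # URLs -> placeholder
    text = re.sub(r"\.{4,}", "...", text)                  # TOC dot leaders
    return text
```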
`jinaai/jina-embeddings-v3` — 570M params, 1024 dims, 8192-token context.

- Multilingual: 30+ languages officially supported, pre-trained on 89 languages
- Task-specific LoRA adapters: `retrieval.query` / `retrieval.passage` (handled automatically by FastEmbed)
- GPU acceleration via `fastembed-gpu` (ONNX Runtime + CUDA)
- License: CC BY-NC 4.0 (non-commercial)
Vector name and dimension are derived dynamically from the model (same convention as mcp-server-qdrant).
Final Score = Quality (80%) + Efficiency (20%), range 0-10.
- Quality (0-10) — judged by Claude Opus:
- Accuracy (0-5): factual correctness vs. expected answer
- Completeness (0-3): all aspects of the question covered
- Faithfulness (0-2): no hallucination, grounded in search results
- Efficiency (0-10) — measured automatically:
- Latency (0-5): normalized against the group
- Token usage (0-5): normalized against the group
See scoring.md for the full rubric.
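The weighting can be expressed directly. The `normalize()` helper below is a hypothetical min-max normalization (lower latency/tokens score higher); the exact group normalization is defined in `scoring.md`:

```python
def quality_score(accuracy: float, completeness: float, faithfulness: float) -> float:
    """Quality (0-10) = accuracy (0-5) + completeness (0-3) + faithfulness (0-2)."""
    return accuracy + completeness + faithfulness


def normalize(value: float, group: list[float]) -> float:
    """Map a latency or token count to 0-5 against the group's min/max.
    Hypothetical formula; see scoring.md for the authoritative one."""
    lo, hi = min(group), max(group)
    if hi == lo:
        return 5.0
    return 5.0 * (hi - value) / (hi - lo)


def final_score(quality: float, efficiency: float) -> float:
    """Final score (0-10) = 80% quality + 20% efficiency."""
    return 0.8 * quality + 0.2 * efficiency
```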
When `"poison": true` is set in `search_bench.json`, the indexer applies targeted text mutations to chunks before indexing (BM25 + Qdrant). This creates controlled discrepancies between the indexed corpus and widely-known facts (e.g. Fireball damage `8d6` becomes `6d8`).
Purpose: Detect when an agent answers from training memory instead of search results. If an agent returns the original (well-known) value instead of the poisoned value, it's hallucinating. The Faithfulness score (0-2) penalizes this.
How it works (`tools/corpus_poison.py`):

- Global rules: regex replacements applied to all chunks
- Contextual rules: replacements applied only when a specific keyword is present in the chunk
- Rules are idempotent (they match only original values, not replacements)
- Gated by the `poison` config key — `false` by default
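This rule structure can be sketched as follows. The rule values here are made up for illustration; the real rules live in `tools/corpus_poison.py`:

```python
import re

# Hypothetical rules -- the actual rule set is defined in tools/corpus_poison.py.
GLOBAL_RULES = [
    (re.compile(r"\b8d6\b"), "6d8"),  # applied to every chunk
]
CONTEXTUAL_RULES = [
    # (gating keyword, pattern, replacement): applied only when the
    # keyword appears in the chunk
    ("Fireball", re.compile(r"\b150 feet\b"), "90 feet"),
]


def poison_chunk(text: str) -> str:
    """Apply global rules, then keyword-gated contextual rules.
    Patterns match only original values, so re-applying is a no-op."""
    for pattern, repl in GLOBAL_RULES:
        text = pattern.sub(repl, text)
    for keyword, pattern, repl in CONTEXTUAL_RULES:
        if keyword in text:
            text = pattern.sub(repl, text)
    return text
```

Idempotence falls out of the patterns themselves: `8d6` never matches the substituted `6d8`, so a chunk can be poisoned twice without drifting further.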
- Python 3.12 + uv
- Node.js + yarn
- Docker (for Qdrant vector DB)
- NVIDIA GPU + CUDA (required for Marker OCR backend and recommended for embedding)
- Claude Code CLI
Note on the corpus: This repository does not include document files. You must provide your own documents in the `docs/` directory.
```shell
git clone https://github.com/<your-username>/search_bench.git
cd search_bench

# Python dependencies (includes fastembed-gpu)
uv sync

# Node dependencies (for TypeScript tooling)
yarn install
```

```shell
cp search_bench.json.example search_bench.json
# Edit search_bench.json: set docs_dir, collection_name, stemmer_language, etc.
```

Place your documents in the `docs/` directory (or wherever `docs_dir` points). Supported formats: PDF, DOCX, PPTX, RTF, ODT, TXT, MD, HTML.
```shell
docker run -d --name qdrant -p 6333:6333 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant
```

```shell
# Build both indexes in one pass (recommended — extracts documents only once)
PYTHONIOENCODING=utf-8 uv run python tools/index_all.py --reset

# Or build individually:
PYTHONIOENCODING=utf-8 uv run python tools/bm25_index.py
PYTHONIOENCODING=utf-8 uv run python tools/qdrant_index.py
```

Use `--reset` to drop and recreate the Qdrant collection. With the Marker backend, always use `index_all.py` to avoid running OCR twice.
To add new documents without re-indexing everything:

```shell
# Add specific documents (Qdrant incremental + BM25 rebuild)
PYTHONIOENCODING=utf-8 uv run python tools/index_add.py docs/my_new_file.pdf

# Rebuild BM25 index only (from cache, no new embedding)
PYTHONIOENCODING=utf-8 uv run python tools/index_add.py --bm25-only
```

The research agents are available as Claude Code sub-agents. Ask Claude to search your corpus:

"Search the corpus for combat rules"

Claude will automatically select and use the appropriate researcher agent.

Available agents: `researcher`, `researcher-2round`, `researcher-vector`, `researcher-vector-2round`, `researcher-bm25-free`, `researcher-vector-free`, `researcher-hybrid-free`.

```shell
# BM25 keyword search
PYTHONIOENCODING=utf-8 uv run python tools/bm25_search.py "fireball damage" --top-k 5
```

Ask Claude to run all agents against the questions in `questions.json`. No separate script is needed — Claude orchestrates the agents and collects results into `results/`.
The agents communicate with the search backends via MCP servers. Claude Code launches them automatically from the agent configs, but you can run them manually:

```shell
uv run python tools/bm25_mcp_server.py --config search_bench.json
```

Exposes a `retrieve(query, k)` tool over stdio.

```shell
uv run python tools/qdrant_mcp_wrapper.py --config search_bench.json
```

Wraps mcp-server-qdrant with config-driven env vars. Exposes a `qdrant-find` tool over stdio.
```
search_bench/
├── .claude/
│   └── agents/               # Claude Code sub-agent definitions (1 per strategy)
├── data/
│   └── bm25_{collection}/    # Serialized BM25 index (generated)
├── docs/                     # Document corpus (not included — bring your own)
├── results/                  # Benchmark run results (generated)
├── test/                     # Unit tests (pytest)
├── tools/
│   ├── config.py             # Shared config loader (reads search_bench.json)
│   ├── pdf_utils.py          # PDF extraction + clean_text() + chunking
│   ├── doc_utils.py          # Non-PDF format extractors (docx, pptx, rtf, odt, txt, html)
│   ├── index_all.py          # Full index builder (BM25 + Qdrant in one pass)
│   ├── index_add.py          # Incremental indexing (add new documents)
│   ├── bm25_index.py         # Build BM25 index from documents
│   ├── bm25_search.py        # CLI search over BM25 index
│   ├── bm25_mcp_server.py    # BM25 MCP server (config-driven)
│   ├── qdrant_index.py       # Build Qdrant vector index from documents
│   ├── qdrant_mcp_wrapper.py # Wrapper for mcp-server-qdrant (config-driven)
│   ├── extract_page.py       # Extract a single page from a document
│   └── corpus_poison.py      # Anti-hallucination: injects modified stats
├── search_bench.json.example # Config template (copy to search_bench.json)
├── questions.json            # Benchmark questions with expected answers
├── scoring.md                # Full scoring rubric
└── JOURNAL.md                # Technical decisions and findings log
```
| Component | Technology |
|---|---|
| BM25 search | bm25s + PyStemmer |
| Vector search | Qdrant + fastembed-gpu (jina-embeddings-v3, 1024 dims) |
| PDF parsing | PyMuPDF + pymupdf4llm / Marker (Surya OCR) |
| LLM agents | Claude Sonnet (search) / Claude Opus (arbitration) |
| MCP servers | bm25s built-in MCP / mcp-server-qdrant |
| Python | 3.12, managed with uv |
| TypeScript | tsx + yarn (tooling) |
On Windows, always prefix Python commands with PYTHONIOENCODING=utf-8 to avoid encoding errors with PDF text output. On Linux/macOS this is typically not needed.
PowerShell syntax:

```powershell
$env:PYTHONIOENCODING="utf-8"; uv run python tools/qdrant_index.py --reset
```

If you need GPU-accelerated embedding, set `cudnn_path` in `search_bench.json` to point to the directory containing the cuDNN DLLs.