A testing harness that measures how well a Retrieval-Augmented Generation pipeline actually works — scoring retrieval accuracy, answer groundedness, citation coverage, latency, and cost across different chunking strategies and retrieval configurations.
Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none of them answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually support the claims? Without quantitative answers, you're guessing.
RAG Evaluation Lab exists to close that gap. It provides a structured evaluation harness with golden question datasets, automated scoring functions, and markdown report generation — making it possible to compare chunking strategies, retrieval methods, and generation configurations with hard numbers instead of vibes.
This project is part of a multi-project AI infrastructure portfolio. It consumes documents from the Document Intelligence Pipeline, and its evaluation patterns feed into the Personal Knowledge Base OS.
- RAG pipeline architecture — ingestion → chunking → embedding → vector storage → retrieval → generation → evaluation, implemented as composable modules
- Evaluation design — golden question sets with known-correct contexts, automated scoring for retrieval hit-rate and answer groundedness
- Vector search — pgvector-backed similarity search with mock engine for offline testing (`MockVectorSearchEngine`)
- Chunking strategy comparison — framework for evaluating different chunk sizes and overlap configurations against the same golden dataset
- Data quality thinking — separation of eval metrics (hit rate, groundedness, citation count) from generation logic
- Cost & latency measurement — infrastructure for tracking per-query cost and response time across evaluation runs
- Structured reporting — `EvaluationReporter` generates markdown tables with per-query metric breakdowns
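
To make the "chunk sizes and overlap configurations" item concrete, here is a minimal sketch of a fixed-size character chunker with overlap. The project does not ship a chunking module yet, so the function name `chunk_fixed` and its parameters are hypothetical and only illustrate the kind of strategy such a comparison would vary.

```python
def chunk_fixed(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that share `overlap` characters.

    Hypothetical helper: not part of the current codebase.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print(chunk_fixed("abcdefghij", chunk_size=4, overlap=2))
```

Running the same golden dataset against several `(chunk_size, overlap)` pairs is one plausible way to compare strategies; sentence-based or recursive splitters would slot in behind the same signature.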
graph TB
subgraph "Ingestion Layer"
DOCS["Document Source"] --> CHUNK["Chunking Module"]
CHUNK --> EMBED["Embedding Generator"]
end
subgraph "Storage Layer"
EMBED --> PGV["pgvector / PostgreSQL"]
PGV --> IDX["Vector Index"]
end
subgraph "Query Pipeline"
Q["User Query"] --> RET["MockVectorSearchEngine"]
IDX --> RET
RET --> GEN["RAGAnswerGenerator"]
GEN --> ANS["Answer + Citations"]
end
subgraph "Evaluation Pipeline"
GOLD["Golden Question Dataset"] --> EVAL["Eval Functions"]
ANS --> EVAL
RET --> EVAL
EVAL --> |"hit_rate, groundedness"| RPT["EvaluationReporter"]
RPT --> MD["Markdown Report"]
end
subgraph "Infrastructure"
REDIS["Redis"] --> CELERY["Celery Worker"]
CELERY --> |"background eval runs"| EVAL
end
| Component | Choice | Rationale |
|---|---|---|
| API Framework | FastAPI + Uvicorn | Async-ready, auto-generated OpenAPI docs, Pydantic validation |
| Vector Database | PostgreSQL 16 + pgvector | Embeddings stored alongside relational metadata in one database |
| Cache / Broker | Redis 7 | Celery task broker + result backend; future caching of embeddings |
| Task Queue | Celery 5.3+ | Background evaluation runs and bulk ingestion jobs |
| Config | pydantic-settings | Type-safe env var loading via BaseAppConfig → AppConfig |
| ORM | SQLAlchemy 2.0+ | Session management via shared_core.database.DatabaseManager |
| Logging | Loguru (via shared-core) | Structured JSON logging with service name tagging |
| Lint / Format | ruff | E, W, F, I, C, B rules; 88-char lines; double quotes |
| Type Checking | pyright | Basic mode, targets Python 3.10 |
| Testing | pytest | FastAPI TestClient for endpoint tests |
# 1. Clone the portfolio (if you haven't already)
cd "Showcase Projects"
# 2. Start infrastructure
cd rag-evaluation-lab
docker compose up -d # PostgreSQL (pgvector) + Redis
# 3. Install dependencies (shared-core must be installed first)
make install # pip install -e ../shared-core && pip install -r requirements.txt
# 4. Configure environment
cp .env.example .env # Edit DATABASE_URL, REDIS_URL, API keys as needed
# 5. Run the API server
make dev # uvicorn on 0.0.0.0:8000 with hot-reload
# 6. Verify it works
curl http://localhost:8000/health

The demo (`examples/run_demo.py`) runs a full evaluation cycle without external dependencies:

make demo

What it does:
- Creates a two-document corpus (Hermes agent framework, ClickHouse telemetry)
- Initializes `MockVectorSearchEngine` with the corpus
- Runs two golden questions through retrieval → generation → evaluation
- Scores each query for `hit_rate` and `groundedness`
- Generates a markdown evaluation report table
Expected output:
--- Executing RAG Eval Lab Golden Tests ---
# RAG Evaluation Run Report
| Query | Hit Rate | Groundedness | Citations |
|---|---|---|---|
| Tell me about Hermes approvals | 1.00 | 0.92 | 1 |
| What database is used for telemetry logs? | 1.00 | 0.88 | 2 |
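
A report table like the one above can be rendered with a few lines of string formatting. The sketch below is an assumption about the shape of such a reporter, not the actual `EvaluationReporter` code; `QueryResult` and `render_report` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Per-query metrics (hypothetical container, not the project's type)."""
    query: str
    hit_rate: float
    groundedness: float
    citations: int


def render_report(results: list[QueryResult]) -> str:
    """Render per-query metrics as a markdown table."""
    lines = [
        "# RAG Evaluation Run Report",
        "| Query | Hit Rate | Groundedness | Citations |",
        "|---|---|---|---|",
    ]
    for r in results:
        # Escape pipes so a query cannot break the table layout.
        safe_query = r.query.replace("|", "\\|")
        lines.append(
            f"| {safe_query} | {r.hit_rate:.2f} | {r.groundedness:.2f} | {r.citations} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_report([QueryResult("Tell me about Hermes approvals", 1.0, 0.92, 1)]))
```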
make test

Current test coverage (`tests/test_core.py`):
- `test_health_endpoint` — Verifies the `/health` endpoint returns 200, the correct service name (`rag-evaluation-lab`), and a `dependencies` object with database/redis status
Planned test expansion:
- Unit tests for `calculate_retrieval_hit_rate` edge cases (empty gold set, partial matches)
- Unit tests for `calculate_answer_groundedness` (short answers, zero-overlap answers)
- Integration tests for `MockVectorSearchEngine.query()` ranking correctness
- End-to-end evaluation pipeline tests using golden datasets
- Report format validation tests
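
As an illustration of the edge cases listed above, here is a sketch of a hit-rate function with plain assertions. It assumes hit rate means "fraction of gold context IDs that appear among the retrieved IDs" and that an empty gold set scores 0.0; the real `calculate_retrieval_hit_rate` may define either differently, and `hit_rate_sketch` is a hypothetical name.

```python
def hit_rate_sketch(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Fraction of gold context IDs found in the retrieved results.

    Assumption: an empty gold set yields 0.0 rather than raising.
    """
    if not gold_ids:
        return 0.0
    retrieved = set(retrieved_ids)
    hits = sum(1 for gold_id in gold_ids if gold_id in retrieved)
    return hits / len(gold_ids)


if __name__ == "__main__":
    # Empty gold set: nothing to hit.
    assert hit_rate_sketch(["doc-1"], set()) == 0.0
    # Partial match: one of two gold contexts retrieved.
    assert hit_rate_sketch(["doc-1", "doc-3"], {"doc-1", "doc-2"}) == 0.5
    # Full match, regardless of extra retrieved documents.
    assert hit_rate_sketch(["doc-2", "doc-1", "doc-9"], {"doc-1", "doc-2"}) == 1.0
```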
`GET /health` returns service health, including database and Redis connectivity.
{
"status": "healthy",
"service": "rag-evaluation-lab",
"dependencies": {
"database": "online",
"redis": "online"
}
}

Status is `"degraded"` when either dependency is offline.
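
The status rule can be sketched as a small pure function. This is an assumption about how the payload might be assembled, not the service's actual handler; the `"offline"` value and the function names are hypothetical.

```python
def overall_status(dependencies: dict[str, str]) -> str:
    """Return "healthy" only when every dependency reports "online"."""
    all_online = all(state == "online" for state in dependencies.values())
    return "healthy" if all_online else "degraded"


def health_payload(database_online: bool, redis_online: bool) -> dict:
    """Build a response shaped like the example above (hypothetical helper)."""
    dependencies = {
        # "offline" is an assumed label for a failed connectivity check.
        "database": "online" if database_online else "offline",
        "redis": "online" if redis_online else "offline",
    }
    return {
        "status": overall_status(dependencies),
        "service": "rag-evaluation-lab",
        "dependencies": dependencies,
    }


if __name__ == "__main__":
    print(health_payload(database_online=True, redis_online=False))
```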
Planned endpoints:
| Method | Path | Description |
|---|---|---|
| `POST` | `/ingest` | Ingest and chunk a document into the vector store |
| `POST` | `/query` | Run a retrieval + generation query |
| `POST` | `/eval/run` | Execute an evaluation run against a golden dataset |
| `GET` | `/eval/results/{run_id}` | Retrieve evaluation results for a specific run |
| `GET` | `/eval/report/{run_id}` | Get a markdown-formatted evaluation report |
| `GET` | `/eval/compare` | Compare metrics across multiple eval runs |

Key environment variables (from .env.example):
| Variable | Default | Description |
|---|---|---|
| `APP_NAME` | `rag-evaluation-lab` | Service identifier in logs and health checks |
| `ENV` | `development` | Environment mode (development, staging, production) |
| `DEBUG` | `true` | Enables Uvicorn hot-reload and verbose logging |
| `LOG_LEVEL` | `INFO` | Loguru log level |
| `DATABASE_URL` | `postgresql+psycopg://...` | PostgreSQL connection string (pgvector-enabled) |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection for Celery broker + cache |
| `OPENAI_API_KEY` | — | Required for embedding generation and LLM-based eval scoring |
| `ANTHROPIC_API_KEY` | — | Optional alternate LLM provider for generation |

- Mock retrieval only — `MockVectorSearchEngine` uses naive keyword matching (word overlap scoring), not actual vector similarity. Real pgvector queries are not implemented yet.
- No real embeddings — The current pipeline does not call any embedding API. Documents are matched by string containment, not semantic similarity.
- Simplistic groundedness — `calculate_answer_groundedness` counts word overlaps between the answer and retrieved contexts. Words shorter than 4 characters are automatically counted as matches, which inflates the score.
- No chunking module — Document chunking (fixed-size, sentence-based, recursive) is planned but not yet implemented. The demo uses pre-chunked documents.
- No persistence of eval runs — Evaluation results are generated in-memory and printed. No database storage or run-over-run comparison exists yet.
- Template worker — The Celery worker (`worker.py`) contains only a placeholder `sample_background_task`. Evaluation runs are not yet dispatched as background tasks.
- Single health endpoint — The API exposes only `/health`; no ingestion, query, or evaluation endpoints exist yet.
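
To show why the first and third limitations matter, here is a minimal sketch of word-overlap retrieval and word-overlap groundedness as described above. It is a reconstruction from that description, not the project's code: the function names, the tokenizer, the `top_k` parameter, and the toy corpus are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(query: str, corpus: dict[str, str], top_k: int = 2) -> list[tuple[str, int]]:
    """Rank documents by the number of distinct query words they contain."""
    query_words = set(tokenize(query))
    scored = [
        (doc_id, len(query_words & set(tokenize(text))))
        for doc_id, text in corpus.items()
    ]
    # Drop documents with no overlap; sort by score, then ID for stable ties.
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


def groundedness_sketch(answer: str, contexts: list[str], min_len: int = 4) -> float:
    """Share of answer words that are found in the contexts.

    Words shorter than `min_len` count as matches automatically, which is
    the score-inflating behavior noted in the limitations.
    """
    words = tokenize(answer)
    if not words:
        return 0.0
    context_words: set[str] = set()
    for context in contexts:
        context_words.update(tokenize(context))
    matched = sum(1 for w in words if len(w) < min_len or w in context_words)
    return matched / len(words)


if __name__ == "__main__":
    # Toy corpus invented for this example.
    corpus = {
        "hermes": "Hermes agents require approvals before running tools",
        "clickhouse": "ClickHouse stores telemetry logs",
    }
    print(keyword_search("Tell me about Hermes approvals", corpus))
    # Every word is under 4 characters, so an unrelated answer scores 1.0.
    print(groundedness_sketch("It is a red car", ["blue truck"]))
```

The second call illustrates the inflation: an answer with no support in the context still scores perfectly because all of its words are short. Raising the bar to exact matches for every word, or moving to an LLM-as-judge score as listed in the roadmap, would remove that artifact.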
See docs/roadmap.md for the phased build plan. In summary:
- MVP (Current) — Skeleton with mock retrieval, basic eval functions, demo script, health endpoint
- Phase 1: Display-Ready — Real pgvector search, chunking strategies, golden dataset YAML format, full eval API endpoints, markdown report generation
- Phase 2: Showcase — Multi-strategy comparison, hybrid retrieval, CI eval regression gates, evaluation dashboard
- Future — LLM-as-judge scoring, RAGAS integration, cost tracking per provider, export to document-intelligence-pipeline
MIT — see LICENSE.