RAG Evaluation Lab

A testing harness that measures how well a Retrieval-Augmented Generation pipeline actually works — scoring retrieval accuracy, answer groundedness, citation coverage, latency, and cost across different chunking strategies and retrieval configurations.

Why This Exists

Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none of them answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually support the claims? Without quantitative answers, you're guessing.

RAG Evaluation Lab exists to close that gap. It provides a structured evaluation harness with golden question datasets, automated scoring functions, and markdown report generation — making it possible to compare chunking strategies, retrieval methods, and generation configurations with hard numbers instead of vibes.

This is part of a multi-project AI infrastructure portfolio. It consumes documents from the Document Intelligence Pipeline, and its evaluation patterns feed into the Personal Knowledge Base OS.

What It Demonstrates

RAG pipeline architecture — ingestion → chunking → embedding → vector storage → retrieval → generation → evaluation, implemented as composable modules
Evaluation design — golden question sets with known-correct contexts, automated scoring for retrieval hit-rate and answer groundedness
Vector search — pgvector-backed similarity search (planned; real pgvector queries are not implemented yet) with a mock engine, MockVectorSearchEngine, for offline testing
Chunking strategy comparison — framework for evaluating different chunk sizes and overlap configurations against the same golden dataset
Data quality thinking — separation of eval metrics (hit rate, groundedness, citation count) from generation logic
Cost & latency measurement — infrastructure for tracking per-query cost and response time across evaluation runs
Structured reporting — EvaluationReporter generates markdown tables with per-query metric breakdowns
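
To make the chunk-size and overlap comparison concrete, here is a minimal sketch of a fixed-size chunker with overlap. The chunking module is not implemented in this repository yet, so `chunk_fixed_size`, its parameters, and the choice to split on whitespace-separated words are all assumptions made for illustration.

```python
def chunk_fixed_size(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `chunk_size` words; neighbours share `overlap` words.

    Hypothetical helper: word-based splitting is an assumption, not the
    project's chunking strategy.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g"
    # Compare two configurations on the same text.
    for size, overlap in [(3, 0), (3, 1)]:
        print(size, overlap, chunk_fixed_size(sample, size, overlap))
```

Running every configuration over the same text and the same golden questions is what keeps the comparison fair: only the chunking parameters change between runs.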

Architecture

graph TB
    subgraph "Ingestion Layer"
        DOCS["Document Source"] --> CHUNK["Chunking Module"]
        CHUNK --> EMBED["Embedding Generator"]
    end

    subgraph "Storage Layer"
        EMBED --> PGV["pgvector / PostgreSQL"]
        PGV --> IDX["Vector Index"]
    end

    subgraph "Query Pipeline"
        Q["User Query"] --> RET["MockVectorSearchEngine"]
        IDX --> RET
        RET --> GEN["RAGAnswerGenerator"]
        GEN --> ANS["Answer + Citations"]
    end

    subgraph "Evaluation Pipeline"
        GOLD["Golden Question Dataset"] --> EVAL["Eval Functions"]
        ANS --> EVAL
        RET --> EVAL
        EVAL --> |"hit_rate, groundedness"| RPT["EvaluationReporter"]
        RPT --> MD["Markdown Report"]
    end

    subgraph "Infrastructure"
        REDIS["Redis"] --> CELERY["Celery Worker"]
        CELERY --> |"background eval runs"| EVAL
    end

Tech Stack

| Component | Choice | Rationale |
|---|---|---|
| API Framework | FastAPI + Uvicorn | Async-ready, auto-generated OpenAPI docs, Pydantic validation |
| Vector Database | PostgreSQL 16 + pgvector | Embeddings stored alongside relational metadata in one database |
| Cache / Broker | Redis 7 | Celery task broker + result backend; future caching of embeddings |
| Task Queue | Celery 5.3+ | Background evaluation runs and bulk ingestion jobs |
| Config | pydantic-settings | Type-safe env var loading via `BaseAppConfig` → `AppConfig` |
| ORM | SQLAlchemy 2.0+ | Session management via `shared_core.database.DatabaseManager` |
| Logging | Loguru (via shared-core) | Structured JSON logging with service name tagging |
| Lint / Format | ruff | E, W, F, I, C, B rules; 88-char lines; double quotes |
| Type Checking | pyright | Basic mode, targets Python 3.10 |
| Testing | pytest | FastAPI TestClient for endpoint tests |

Local Setup

# 1. Enter the portfolio directory (clone the portfolio first if you haven't already)
cd "Showcase Projects"

# 2. Start infrastructure
cd rag-evaluation-lab
docker compose up -d          # PostgreSQL (pgvector) + Redis

# 3. Install dependencies (shared-core must be installed first)
make install                   # pip install -e ../shared-core && pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env           # Edit DATABASE_URL, REDIS_URL, API keys as needed

# 5. Run the API server
make dev                       # uvicorn on 0.0.0.0:8000 with hot-reload

# 6. Verify it works
curl http://localhost:8000/health

Demo

The demo (examples/run_demo.py) runs a full evaluation cycle without external dependencies:

make demo

What it does:

Creates a two-document corpus (Hermes agent framework, ClickHouse telemetry)
Initializes MockVectorSearchEngine with the corpus
Runs two golden questions through retrieval → generation → evaluation
Scores each query for hit_rate and groundedness
Generates a markdown evaluation report table

Expected output:

--- Executing RAG Eval Lab Golden Tests ---
# RAG Evaluation Run Report

| Query | Hit Rate | Groundedness | Citations |
|---|---|---|---|
| Tell me about Hermes approvals | 1.00 | 0.92 | 1 |
| What database is used for telemetry logs? | 1.00 | 0.88 | 2 |
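
The retrieval and reporting steps of the demo can be sketched roughly as follows. This is not the code in examples/run_demo.py: `Doc`, `keyword_search`, `render_report`, and the two document texts are invented for illustration, assuming the mock engine ranks documents by word overlap with the query and the reporter emits one table row per query.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str


# Placeholder corpus: these sentences are made up, not the demo's actual documents.
SAMPLE_CORPUS = [
    Doc("hermes", "The Hermes agent framework requires human approvals before running tools."),
    Doc("clickhouse", "ClickHouse is the database used for telemetry logs."),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, corpus: list[Doc], top_k: int = 1) -> list[Doc]:
    """Rank documents by shared-word count with the query; drop zero-overlap docs."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc.text)), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep corpus order
    return [doc for _, doc in scored[:top_k]]


def render_report(rows: list[tuple[str, float, float, int]]) -> str:
    """Render (query, hit_rate, groundedness, citations) rows as a markdown table."""
    lines = ["| Query | Hit Rate | Groundedness | Citations |", "|---|---|---|---|"]
    for query, hit_rate, groundedness, citations in rows:
        lines.append(f"| {query} | {hit_rate:.2f} | {groundedness:.2f} | {citations} |")
    return "\n".join(lines)


if __name__ == "__main__":
    for question in ["Tell me about Hermes approvals",
                     "What database is used for telemetry logs?"]:
        print(question, "->", [d.doc_id for d in keyword_search(question, SAMPLE_CORPUS)])
```

Because the ranking is pure word overlap, a paraphrased question that shares no words with the right document retrieves nothing, which is the gap real embeddings are meant to close.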

Tests

make test

Current test coverage (tests/test_core.py):

test_health_endpoint — Verifies the /health endpoint returns 200, correct service name (rag-evaluation-lab), and a dependencies object with database/redis status

Planned test expansion:

Unit tests for calculate_retrieval_hit_rate edge cases (empty gold set, partial matches)
Unit tests for calculate_answer_groundedness (short answers, zero-overlap answers)
Integration tests for MockVectorSearchEngine.query() ranking correctness
End-to-end evaluation pipeline tests using golden datasets
Report format validation tests
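
As a rough idea of what the planned hit-rate edge-case tests might look like, the sketch below defines a stand-in scoring function and a few pytest-style tests against it. The real `calculate_retrieval_hit_rate` signature is not shown in this README, so `retrieval_hit_rate`, its arguments, and the choice to score an empty gold set as 0.0 are assumptions.

```python
def retrieval_hit_rate(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Fraction of gold chunk ids that appear among the retrieved ids.

    Hypothetical stand-in for calculate_retrieval_hit_rate.
    """
    if not gold_ids:
        return 0.0  # assumption: an empty gold set scores 0.0 instead of raising
    hits = gold_ids & set(retrieved_ids)
    return len(hits) / len(gold_ids)


def test_empty_gold_set_scores_zero():
    assert retrieval_hit_rate(["doc-1"], set()) == 0.0


def test_partial_match_is_fractional():
    # One of the two gold ids was retrieved.
    assert retrieval_hit_rate(["doc-1", "doc-9"], {"doc-1", "doc-2"}) == 0.5


def test_full_match_ignores_extra_results():
    assert retrieval_hit_rate(["doc-2", "doc-1", "doc-3"], {"doc-1", "doc-2"}) == 1.0


def test_no_retrieval_scores_zero():
    assert retrieval_hit_rate([], {"doc-1"}) == 0.0
```

Whether an empty gold set should score 0.0, score 1.0, or raise is a design decision worth pinning down in a test, since it changes how averages over a dataset behave.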

API Reference

`GET /health`

Returns service health including database and Redis connectivity.

{
  "status": "healthy",
  "service": "rag-evaluation-lab",
  "dependencies": {
    "database": "online",
    "redis": "online"
  }
}

Status is "degraded" when either dependency is offline.
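
The rule above can be expressed as a small pure function. This is a sketch rather than the service's actual handler: `build_health_payload` is a hypothetical name, and the "offline" value for a failed dependency is inferred from the sentence above, not taken from the code.

```python
def build_health_payload(
    database_ok: bool,
    redis_ok: bool,
    service: str = "rag-evaluation-lab",
) -> dict:
    """Build a /health-style response body from two dependency checks."""
    dependencies = {
        "database": "online" if database_ok else "offline",
        "redis": "online" if redis_ok else "offline",
    }
    # Healthy only when every dependency is online; otherwise degraded.
    status = "healthy" if database_ok and redis_ok else "degraded"
    return {"status": status, "service": service, "dependencies": dependencies}


if __name__ == "__main__":
    print(build_health_payload(database_ok=True, redis_ok=False))
```

Keeping the aggregation separate from the connectivity checks makes it testable without a running PostgreSQL or Redis instance.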

Planned endpoints:

| Method | Path | Description |
|---|---|---|
| `POST` | `/ingest` | Ingest and chunk a document into the vector store |
| `POST` | `/query` | Run a retrieval + generation query |
| `POST` | `/eval/run` | Execute an evaluation run against a golden dataset |
| `GET` | `/eval/results/{run_id}` | Retrieve evaluation results for a specific run |
| `GET` | `/eval/report/{run_id}` | Get a markdown-formatted evaluation report |
| `GET` | `/eval/compare` | Compare metrics across multiple eval runs |

Configuration

Key environment variables (from .env.example):

| Variable | Default | Description |
|---|---|---|
| `APP_NAME` | `rag-evaluation-lab` | Service identifier in logs and health checks |
| `ENV` | `development` | Environment mode (`development`, `staging`, `production`) |
| `DEBUG` | `true` | Enables Uvicorn hot-reload and verbose logging |
| `LOG_LEVEL` | `INFO` | Loguru log level |
| `DATABASE_URL` | `postgresql+psycopg://...` | PostgreSQL connection string (pgvector-enabled) |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection for Celery broker + cache |
| `OPENAI_API_KEY` | (none) | Required for embedding generation and LLM-based eval scoring (the current pipeline does not yet call any embedding API) |
| `ANTHROPIC_API_KEY` | (none) | Optional alternate LLM provider for generation |

Known Limitations

Mock retrieval only — MockVectorSearchEngine uses naive keyword matching (word overlap scoring), not actual vector similarity. Real pgvector queries are not implemented yet.
No real embeddings — The current pipeline does not call any embedding API. Documents are matched by string containment, not semantic similarity.
Simplistic groundedness — calculate_answer_groundedness counts word overlaps between the answer and retrieved contexts. Words shorter than 4 characters are automatically counted as matches, which inflates the score.
No chunking module — Document chunking (fixed-size, sentence-based, recursive) is planned but not yet implemented. The demo uses pre-chunked documents.
No persistence of eval runs — Evaluation results are generated in-memory and printed. No database storage or run-over-run comparison exists yet.
Template worker — The Celery worker (worker.py) contains only a placeholder sample_background_task. Evaluation runs are not yet dispatched as background tasks.
Single health endpoint — The API exposes only /health; no ingestion, query, or evaluation endpoints exist yet.
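
The groundedness inflation described above is easy to reproduce. The sketch below is an approximation written from that description, not the project's `calculate_answer_groundedness`: the function name, the tokenizer, and the `short_words_match` switch are assumptions added to contrast the current behaviour with a stricter variant.

```python
import re


def _words(text: str) -> list[str]:
    """Lowercase the text and return its alphanumeric words in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap_groundedness(
    answer: str,
    contexts: list[str],
    min_len: int = 4,
    short_words_match: bool = True,
) -> float:
    """Share of answer words that are found in the retrieved contexts.

    With short_words_match=True, words shorter than `min_len` count as
    matches automatically (the behaviour described in Known Limitations).
    With False, short words are ignored entirely (a hypothetical stricter mode).
    """
    answer_words = _words(answer)
    context_vocab = {w for ctx in contexts for w in _words(ctx)}
    if not short_words_match:
        answer_words = [w for w in answer_words if len(w) >= min_len]
    if not answer_words:
        return 0.0  # nothing left to score
    matched = sum(1 for w in answer_words if len(w) < min_len or w in context_vocab)
    return matched / len(answer_words)


if __name__ == "__main__":
    contexts = ["ClickHouse stores telemetry logs"]
    unrelated = "the cat sat on a mat"  # every word has fewer than 4 characters
    print(word_overlap_groundedness(unrelated, contexts))
    print(word_overlap_groundedness(unrelated, contexts, short_words_match=False))
```

An answer made entirely of short words gets a perfect score under the current rule even though it shares nothing with the context, which is why the metric should be read as a rough signal rather than a measure of factual support.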

Roadmap

See docs/roadmap.md for the phased build plan. In summary:

MVP (Current) — Skeleton with mock retrieval, basic eval functions, demo script, health endpoint
Phase 1: Display-Ready — Real pgvector search, chunking strategies, golden dataset YAML format, full eval API endpoints, markdown report generation
Phase 2: Showcase — Multi-strategy comparison, hybrid retrieval, CI eval regression gates, evaluation dashboard
Future — LLM-as-judge scoring, RAGAS integration, cost tracking per provider, export to document-intelligence-pipeline

License

MIT — see LICENSE.
