FishRaposo/rag-evaluation-lab

RAG Evaluation Lab

Python 3.10+ · FastAPI · PostgreSQL · pgvector · Redis · Celery · License: MIT

A testing harness that measures how well a Retrieval-Augmented Generation pipeline actually works — scoring retrieval accuracy, answer groundedness, citation coverage, latency, and cost across different chunking strategies and retrieval configurations.

Why This Exists

Every RAG tutorial shows how to stuff documents into a vector store and get answers out. Almost none of them answer the harder questions: Is the retriever returning the right chunks? Is the generated answer grounded in those chunks or hallucinated? Do the citations actually support the claims? Without quantitative answers, you're guessing.

RAG Evaluation Lab exists to close that gap. It provides a structured evaluation harness with golden question datasets, automated scoring functions, and markdown report generation — making it possible to compare chunking strategies, retrieval methods, and generation configurations with hard numbers instead of vibes.

This is part of a multi-project AI infrastructure portfolio. It consumes documents from the Document Intelligence Pipeline, and its evaluation patterns feed into the Personal Knowledge Base OS.

What It Demonstrates

  • RAG pipeline architecture — ingestion → chunking → embedding → vector storage → retrieval → generation → evaluation, implemented as composable modules
  • Evaluation design — golden question sets with known-correct contexts, automated scoring for retrieval hit-rate and answer groundedness
  • Vector search — pgvector-backed similarity search is the target design; the current build ships a mock engine (MockVectorSearchEngine) for offline testing
  • Chunking strategy comparison — framework for evaluating different chunk sizes and overlap configurations against the same golden dataset
  • Data quality thinking — separation of eval metrics (hit rate, groundedness, citation count) from generation logic
  • Cost & latency measurement — infrastructure for tracking per-query cost and response time across evaluation runs
  • Structured reporting — EvaluationReporter generates markdown tables with per-query metric breakdowns
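
To compare chunking strategies against one golden dataset, the same text has to be split several different ways. A minimal sketch of one such strategy, fixed-size character windows with overlap, follows. The name `chunk_fixed_size` and its parameters are hypothetical; the repository does not ship a chunking module yet (see Known Limitations).

```python
def chunk_fixed_size(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of `chunk_size`.

    Each window shares `overlap` characters with the previous one.
    Hypothetical helper, not part of the repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so the
        # tail is not emitted again as a shorter duplicate.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij"
    # Same text, two configurations: this is the unit a comparison run varies.
    for size, overlap in [(4, 0), (4, 2)]:
        print(size, overlap, chunk_fixed_size(sample, size, overlap))
```

Each configuration would then be indexed and scored against the same golden questions, so that differences in the metrics can be attributed to the chunking alone.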

Architecture

graph TB
    subgraph "Ingestion Layer"
        DOCS["Document Source"] --> CHUNK["Chunking Module"]
        CHUNK --> EMBED["Embedding Generator"]
    end

    subgraph "Storage Layer"
        EMBED --> PGV["pgvector / PostgreSQL"]
        PGV --> IDX["Vector Index"]
    end

    subgraph "Query Pipeline"
        Q["User Query"] --> RET["MockVectorSearchEngine"]
        IDX --> RET
        RET --> GEN["RAGAnswerGenerator"]
        GEN --> ANS["Answer + Citations"]
    end

    subgraph "Evaluation Pipeline"
        GOLD["Golden Question Dataset"] --> EVAL["Eval Functions"]
        ANS --> EVAL
        RET --> EVAL
        EVAL --> |"hit_rate, groundedness"| RPT["EvaluationReporter"]
        RPT --> MD["Markdown Report"]
    end

    subgraph "Infrastructure"
        REDIS["Redis"] --> CELERY["Celery Worker"]
        CELERY --> |"background eval runs"| EVAL
    end
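
In the query pipeline above, the retriever is a mock that, per Known Limitations, ranks by word overlap instead of vector similarity. The sketch below shows roughly what that kind of retriever does. `Document`, `WordOverlapSearchEngine`, and the `query()` signature are assumptions made for illustration, not the repository's `MockVectorSearchEngine` interface, and the corpus text is invented placeholder content.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


def _tokens(text: str) -> set[str]:
    """Lowercased set of alphanumeric words; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class WordOverlapSearchEngine:
    """Hypothetical mock retriever: scores documents by shared-word count."""

    def __init__(self, corpus: list[Document]) -> None:
        self._corpus = corpus

    def query(self, question: str, top_k: int = 3) -> list[tuple[Document, int]]:
        question_words = _tokens(question)
        scored = [
            (doc, len(question_words & _tokens(doc.text))) for doc in self._corpus
        ]
        # Drop documents with no shared words, then rank highest score first.
        hits = [pair for pair in scored if pair[1] > 0]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:top_k]


if __name__ == "__main__":
    # Placeholder corpus text, written for this example only.
    corpus = [
        Document("hermes", "Hermes is an agent framework that requires approvals."),
        Document("clickhouse", "Telemetry logs are stored in the ClickHouse database."),
    ]
    engine = WordOverlapSearchEngine(corpus)
    for doc, score in engine.query("What database is used for telemetry logs?"):
        print(doc.doc_id, score)
```

Note that common words such as "is" also count as overlap here, which is one reason a mock like this is useful for wiring tests but not for judging retrieval quality.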

Tech Stack

| Component | Choice | Rationale |
|---|---|---|
| API Framework | FastAPI + Uvicorn | Async-ready, auto-generated OpenAPI docs, Pydantic validation |
| Vector Database | PostgreSQL 16 + pgvector | Embeddings stored alongside relational metadata in one database |
| Cache / Broker | Redis 7 | Celery task broker + result backend; future caching of embeddings |
| Task Queue | Celery 5.3+ | Background evaluation runs and bulk ingestion jobs |
| Config | pydantic-settings | Type-safe env var loading via BaseAppConfig → AppConfig |
| ORM | SQLAlchemy 2.0+ | Session management via shared_core.database.DatabaseManager |
| Logging | Loguru (via shared-core) | Structured JSON logging with service name tagging |
| Lint / Format | ruff | E, W, F, I, C, B rules; 88-char lines; double quotes |
| Type Checking | pyright | Basic mode, targets Python 3.10 |
| Testing | pytest | FastAPI TestClient for endpoint tests |

Local Setup

# 1. Enter the portfolio checkout (clone it first if you haven't already)
cd "Showcase Projects"

# 2. Start infrastructure
cd rag-evaluation-lab
docker compose up -d          # PostgreSQL (pgvector) + Redis

# 3. Install dependencies (shared-core must be installed first)
make install                   # pip install -e ../shared-core && pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env           # Edit DATABASE_URL, REDIS_URL, API keys as needed

# 5. Run the API server
make dev                       # uvicorn on 0.0.0.0:8000 with hot-reload

# 6. Verify it works
curl http://localhost:8000/health

Demo

The demo (examples/run_demo.py) runs a full evaluation cycle without external dependencies:

make demo

What it does:

  1. Creates a two-document corpus (Hermes agent framework, ClickHouse telemetry)
  2. Initializes MockVectorSearchEngine with the corpus
  3. Runs two golden questions through retrieval → generation → evaluation
  4. Scores each query for hit_rate and groundedness
  5. Generates a markdown evaluation report table

Expected output:

--- Executing RAG Eval Lab Golden Tests ---
# RAG Evaluation Run Report

| Query | Hit Rate | Groundedness | Citations |
|---|---|---|---|
| Tell me about Hermes approvals | 1.00 | 0.92 | 1 |
| What database is used for telemetry logs? | 1.00 | 0.88 | 2 |
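
A table in this shape could be produced by a formatter along the following lines. This is a sketch only: `QueryResult` and `render_report` are hypothetical names, and the README does not show the actual `EvaluationReporter` interface. The sample row reuses the values from the expected output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryResult:
    """Per-query metrics; field names are assumptions for this sketch."""

    query: str
    hit_rate: float
    groundedness: float
    citations: int


def render_report(results: list[QueryResult]) -> str:
    """Render per-query metrics as a markdown table."""
    lines = [
        "# RAG Evaluation Run Report",
        "",
        "| Query | Hit Rate | Groundedness | Citations |",
        "|---|---|---|---|",
    ]
    for r in results:
        # Escape pipes so a query containing "|" cannot break the row.
        query = r.query.replace("|", "\\|")
        lines.append(
            f"| {query} | {r.hit_rate:.2f} | {r.groundedness:.2f} | {r.citations} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_report([QueryResult("Tell me about Hermes approvals", 1.0, 0.92, 1)]))
```

Keeping the metrics in a plain data object and formatting them in a separate step is what lets the same results feed a markdown report now and a stored run or dashboard later.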

Tests

make test

Current test coverage (tests/test_core.py):

  • test_health_endpoint — Verifies the /health endpoint returns 200, correct service name (rag-evaluation-lab), and a dependencies object with database/redis status

Planned test expansion:

  • Unit tests for calculate_retrieval_hit_rate edge cases (empty gold set, partial matches)
  • Unit tests for calculate_answer_groundedness (short answers, zero-overlap answers)
  • Integration tests for MockVectorSearchEngine.query() ranking correctness
  • End-to-end evaluation pipeline tests using golden datasets
  • Report format validation tests

API Reference

GET /health

Returns service health including database and Redis connectivity.

{
  "status": "healthy",
  "service": "rag-evaluation-lab",
  "dependencies": {
    "database": "online",
    "redis": "online"
  }
}

Status is "degraded" when either dependency is offline.
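
The aggregation rule can be sketched as a small pure function. `build_health` is a hypothetical helper, not the repository's handler, and the `"offline"` value string is an assumption; only `"online"`, `"healthy"`, and `"degraded"` appear in this README.

```python
def build_health(service: str, dependencies: dict[str, bool]) -> dict:
    """Build a health payload from per-dependency reachability flags."""
    # Any unreachable dependency downgrades the overall status.
    status = "healthy" if all(dependencies.values()) else "degraded"
    return {
        "status": status,
        "service": service,
        "dependencies": {
            name: "online" if reachable else "offline"
            for name, reachable in dependencies.items()
        },
    }


if __name__ == "__main__":
    print(build_health("rag-evaluation-lab", {"database": True, "redis": False}))
```

Keeping the rule separate from the connectivity checks makes it testable without a running PostgreSQL or Redis instance.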

Planned endpoints:

| Method | Path | Description |
|---|---|---|
| POST | /ingest | Ingest and chunk a document into the vector store |
| POST | /query | Run a retrieval + generation query |
| POST | /eval/run | Execute an evaluation run against a golden dataset |
| GET | /eval/results/{run_id} | Retrieve evaluation results for a specific run |
| GET | /eval/report/{run_id} | Get a markdown-formatted evaluation report |
| GET | /eval/compare | Compare metrics across multiple eval runs |

Configuration

Key environment variables (from .env.example):

| Variable | Default | Description |
|---|---|---|
| APP_NAME | rag-evaluation-lab | Service identifier in logs and health checks |
| ENV | development | Environment mode (development, staging, production) |
| DEBUG | true | Enables Uvicorn hot-reload and verbose logging |
| LOG_LEVEL | INFO | Loguru log level |
| DATABASE_URL | postgresql+psycopg://... | PostgreSQL connection string (pgvector-enabled) |
| REDIS_URL | redis://localhost:6379/0 | Redis connection for Celery broker + cache |
| OPENAI_API_KEY | | Required for embedding generation and LLM-based eval scoring |
| ANTHROPIC_API_KEY | | Optional alternate LLM provider for generation |
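
To show how these variables and defaults fit together, here is a dependency-free sketch of a loader. The project itself loads configuration through pydantic-settings; `Settings` and `load_settings` below are hypothetical names that use only the standard library, and treating DATABASE_URL as mandatory is an assumption made because its default is elided in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_ENVS = {"development", "staging", "production"}


@dataclass(frozen=True)
class Settings:
    app_name: str
    env: str
    debug: bool
    log_level: str
    database_url: str
    redis_url: str
    openai_api_key: Optional[str]
    anthropic_api_key: Optional[str]


def load_settings(environ: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, applying the documented defaults."""
    env = environ.get("ENV", "development")
    if env not in VALID_ENVS:
        raise ValueError(f"ENV must be one of {sorted(VALID_ENVS)}, got {env!r}")
    # Assumption: the DATABASE_URL default is elided above, so require it here.
    if "DATABASE_URL" not in environ:
        raise KeyError("DATABASE_URL is not set")
    return Settings(
        app_name=environ.get("APP_NAME", "rag-evaluation-lab"),
        env=env,
        debug=environ.get("DEBUG", "true").strip().lower() in {"1", "true", "yes"},
        log_level=environ.get("LOG_LEVEL", "INFO"),
        database_url=environ["DATABASE_URL"],
        redis_url=environ.get("REDIS_URL", "redis://localhost:6379/0"),
        # API keys stay optional in this sketch; an empty value is treated as unset.
        openai_api_key=environ.get("OPENAI_API_KEY") or None,
        anthropic_api_key=environ.get("ANTHROPIC_API_KEY") or None,
    )
```

Passing the environment in as a mapping, instead of reading `os.environ` inside the function body, keeps the loader easy to test with a plain dict.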

Known Limitations

  1. Mock retrieval only — MockVectorSearchEngine uses naive keyword matching (word overlap scoring), not actual vector similarity. Real pgvector queries are not implemented yet.
  2. No real embeddings — The current pipeline does not call any embedding API. Documents are matched by string containment, not semantic similarity.
  3. Simplistic groundedness — calculate_answer_groundedness counts word overlaps between the answer and retrieved contexts. Words shorter than 4 characters are automatically counted as matches, which inflates the score.
  4. No chunking module — Document chunking (fixed-size, sentence-based, recursive) is planned but not yet implemented. The demo uses pre-chunked documents.
  5. No persistence of eval runs — Evaluation results are generated in-memory and printed. No database storage or run-over-run comparison exists yet.
  6. Template worker — The Celery worker (worker.py) contains only a placeholder sample_background_task. Evaluation runs are not yet dispatched as background tasks.
  7. Single health endpoint — The API exposes only /health; no ingestion, query, or evaluation endpoints exist yet.
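
The inflation described in limitation 3 is easy to reproduce. The sketch below is a reconstruction from that description, not the repository's `calculate_answer_groundedness`: the function name, signature, and the `ignore_short_words` variant (one possible mitigation that drops short words from both numerator and denominator) are assumptions.

```python
def groundedness(
    answer: str, contexts: list[str], ignore_short_words: bool = False
) -> float:
    """Share of answer words that also appear in the retrieved contexts.

    Tokenization is plain whitespace splitting, so punctuation stays
    attached to words. Illustrative reconstruction only.
    """
    context_words = set(" ".join(contexts).lower().split())
    words = answer.lower().split()
    if ignore_short_words:
        # Mitigation: short words neither help nor hurt the score.
        words = [w for w in words if len(w) >= 4]
    if not words:
        return 0.0
    # Described behaviour: words under 4 characters count as matches unconditionally.
    matched = sum(1 for w in words if len(w) < 4 or w in context_words)
    return matched / len(words)


if __name__ == "__main__":
    answer = "the cat is on a mat"
    contexts = ["dogs bark loudly"]
    # Every answer word is short, so the naive score is perfect despite zero overlap.
    print(groundedness(answer, contexts))
    print(groundedness(answer, contexts, ignore_short_words=True))
```

An answer made mostly of short function words can therefore score as fully grounded while sharing nothing with the retrieved context, which is why the roadmap's LLM-as-judge scoring is a natural replacement for this metric.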

Roadmap

See docs/roadmap.md for the phased build plan. In summary:

  • MVP (Current) — Skeleton with mock retrieval, basic eval functions, demo script, health endpoint
  • Phase 1: Display-Ready — Real pgvector search, chunking strategies, golden dataset YAML format, full eval API endpoints, markdown report generation
  • Phase 2: Showcase — Multi-strategy comparison, hybrid retrieval, CI eval regression gates, evaluation dashboard
  • Future — LLM-as-judge scoring, RAGAS integration, cost tracking per provider, export to document-intelligence-pipeline

License

MIT — see LICENSE.

About

RAG evaluation framework: hit-rate, MRR, faithfulness scoring, and async batch evaluation with golden question datasets
