HeinyJR/CT-Forge-March-2026

Setup

This is the README for the RAG system you are building. For instructions to the consultant leading the project, see the readme_setup.

Wikipedia RAG System

A Retrieval-Augmented Generation (RAG) system using LlamaIndex with Wikipedia articles, built with FastAPI and Azure OpenAI models.

Overview

This project implements a complete RAG pipeline that:

  • Loads Wikipedia passages from HuggingFace datasets
  • Generates embeddings using bge-small-en-v1.5 (local HuggingFace model)
  • Stores vectors in a local LlamaIndex vector database
  • Retrieves relevant documents based on semantic similarity
  • Generates answers using Azure OpenAI's gpt-4o model

Features

  • 🔒 Secure Model Access: All Azure OpenAI models accessed through controlled authentication
  • 🚀 FastAPI Backend: RESTful API for all RAG operations
  • 📊 Full Observability: Jupyter notebooks for inspecting every stage of the pipeline
  • 🎯 Explainable: Clear separation of retrieval and generation steps
  • 🧪 Evaluation Ready: Built-in support for test questions and metrics

Architecture

User Query → Embedding → Vector Search → Retrieve Top-K Docs → Augment Prompt → GPT-4o → Answer

See docs/architecture.md for detailed architecture diagrams.
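The flow above can be sketched as a chain of pluggable stages that records every intermediate result, which is one way to keep each step inspectable. This is a minimal sketch, not the project's code: `RagTrace`, `run_pipeline`, and the stand-in lambdas are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class RagTrace:
    """Intermediate results of one query, kept for inspection."""
    query: str
    query_vector: Vector = ()
    passages: List[str] = field(default_factory=list)
    prompt: str = ""
    answer: str = ""


def run_pipeline(
    query: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[str]],
    build_prompt: Callable[[str, List[str]], str],
    llm: Callable[[str], str],
    k: int = 5,
) -> RagTrace:
    """Run the stages in order and record what each one produced."""
    trace = RagTrace(query=query)
    trace.query_vector = embed(query)                    # Embedding
    trace.passages = search(trace.query_vector, k)       # Vector search, top-K
    trace.prompt = build_prompt(query, trace.passages)   # Augment prompt
    trace.answer = llm(trace.prompt)                     # Generation
    return trace


# Stand-in stages so the sketch runs without models or an index.
trace = run_pipeline(
    "Who was Abraham Lincoln?",
    embed=lambda text: [float(len(text))],
    search=lambda vec, k: ["Lincoln was the 16th U.S. president."][:k],
    build_prompt=lambda q, docs: "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {q}",
    llm=lambda prompt: "stub answer",
    k=1,
)
print(trace.prompt)
print(trace.answer)
```

Because every stage is passed in as a plain callable, a real embedding model, index, or LLM client could replace a stand-in without touching the others.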

Prerequisites

  1. Python 3.13+
  2. Azure Access: Access to Azure OpenAI services with AI Lab models
  3. Azure CLI: For authentication
  4. UV Package Manager: Recommended for dependency management

Installation

1. Clone and Setup

cd /path/to/your_project

# Set up virtual environment and install dependencies using uv
uv venv
uv sync

2. Authenticate with Azure

# Login with AI Lab scope
azd auth login --scope api://ailab/Model.Access

This authentication is required for accessing the Azure OpenAI GPT-4o model used for answer generation.

See docs/authentication.md for more details on authentication setup.

Usage

Starting the API Server

uv run uvicorn api.main:app --reload --app-dir src

The server starts on http://localhost:8000. If a persisted index already exists in data/index/, it is loaded automatically at startup.
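The startup behaviour described above could look roughly like the following. This is a sketch under assumptions, not the contents of api/main.py: `load_index_if_present` is a hypothetical helper, and it assumes the index was persisted with LlamaIndex's default storage layout.

```python
from pathlib import Path

INDEX_DIR = Path("data/index")  # persist location named in this README


def load_index_if_present(persist_dir: Path = INDEX_DIR):
    """Return the persisted index, or None when nothing has been ingested yet."""
    if not persist_dir.is_dir() or not any(persist_dir.iterdir()):
        return None

    # Imported lazily so the "no index yet" path works without LlamaIndex.
    from llama_index.core import StorageContext, load_index_from_storage

    # Assumption: the embedding model has already been configured
    # (for example via llama_index.core.Settings.embed_model) before this call.
    storage = StorageContext.from_defaults(persist_dir=str(persist_dir))
    return load_index_from_storage(storage)
```

A server could call a helper like this once during startup and keep the result in shared state; a `None` result corresponds to `"index_loaded": false` in the health check.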

API Documentation

Interactive Swagger UI is available at http://localhost:8000/docs once the server is running.

Key Endpoints

Health Check

curl http://localhost:8000/health
# {"status": "ok", "index_loaded": true}
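The same check can be scripted from Python, for example before running a notebook. The sketch below uses only the standard library; `index_is_loaded` and `fetch_health` are hypothetical helper names, and the response fields are taken from the sample output above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the Usage section


def index_is_loaded(health_body: str) -> bool:
    """Return True when a /health response reports a loaded index."""
    payload = json.loads(health_body)
    return payload.get("status") == "ok" and bool(payload.get("index_loaded"))


def fetch_health(base_url: str = BASE_URL, timeout: float = 5.0) -> str:
    """GET /health and return the raw JSON body (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Parsing can be exercised without a server, using the sample response above.
print(index_is_loaded('{"status": "ok", "index_loaded": true}'))  # True
```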

Ingest Data

# Trigger full ingestion: load HuggingFace passages → embed → persist index
# Takes ~5–10 minutes on first run (embeds ~1,000 passages)
curl -X POST http://localhost:8000/ingest

# Check ingestion status
curl http://localhost:8000/ingest/status

# Preview raw passages from the dataset
curl "http://localhost:8000/ingest/sample?n=3"

Query Documents

# Retrieve top-K passages for a query
curl -X POST http://localhost:8000/query/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?", "k": 5}'

# Inspect the embedding vector for a query
curl -X POST http://localhost:8000/query/embed \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?"}'

# Browse nodes stored in the index
curl "http://localhost:8000/query/documents?limit=10"
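For scripted use, the retrieve call can also be made from Python. This is a minimal standard-library sketch; `build_retrieve_request` and `retrieve` are hypothetical helpers, and the shape of the JSON response is not specified in this README, so the sketch returns it undecoded beyond JSON parsing.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_retrieve_request(
    query: str, k: int = 5, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST /query/retrieve request without sending it."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query/retrieve",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retrieve(query: str, k: int = 5) -> object:
    """Send the request and parse the JSON reply (requires a running server)."""
    with urllib.request.urlopen(build_retrieve_request(query, k), timeout=30) as resp:
        return json.load(resp)
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without a live server.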

Generate Answer (RAG)

# Full pipeline: retrieve context → augment prompt → GPT-4o → answer
curl -X POST http://localhost:8000/rag/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?", "k": 5}'

# Preview the augmented prompt without calling the LLM
curl "http://localhost:8000/rag/prompt-preview?query=Who+was+Abraham+Lincoln%3F&k=3"

# Evaluate against the HuggingFace test questions dataset
curl -X POST http://localhost:8000/rag/evaluate \
  -H "Content-Type: application/json" \
  -d '{"n": 10, "k": 5}'

Observability with Jupyter Notebooks

The system includes three Jupyter notebooks for inspecting every stage of the RAG pipeline. All notebooks call the running FastAPI server — start the server and run POST /ingest before opening them.

| Notebook | Purpose |
| --- | --- |
| notebooks/01_ingestion.ipynb | Inspect passage count, sample texts, dataset structure, index stats |
| notebooks/02_retrieval.ipynb | Visualise embedding vectors, compare cosine similarity, see top-K results |
| notebooks/03_rag.ipynb | Preview augmented prompts, full Q&A pipeline, evaluation against test dataset |

Running Notebooks

# In one terminal: start the API server
uv run uvicorn api.main:app --reload --app-dir src

# In another terminal: start JupyterLab
uv run jupyter lab

Then open any notebook from the notebooks/ directory in the JupyterLab interface.

Project Structure

.
├── src/
│   ├── llamaindex_models.py      # Controlled model access (embedding + LLM)
│   ├── ailab/
│   │   └── utils/azure.py        # Azure auth token provider
│   ├── rag/
│   │   ├── ingestion.py          # Load passages, build/persist vector index
│   │   ├── retrieval.py          # Embed query, retrieve top-K passages
│   │   └── generation.py         # Augment prompt, generate answer
│   └── api/
│       ├── main.py               # FastAPI app, lifespan index loading
│       ├── state.py              # Shared index/model singletons
│       └── routes/
│           ├── ingest.py         # POST /ingest, GET /ingest/status|sample
│           ├── query.py          # POST /query/embed|retrieve, GET /query/documents
│           └── rag.py            # POST /rag/answer|evaluate, GET /rag/prompt-preview
├── tests/
│   ├── test_00_smoke.py          # Import and filesystem checks
│   ├── test_10_ingestion.py      # Ingestion pipeline unit + API tests
│   ├── test_20_retrieval.py      # Retrieval unit + API tests
│   ├── test_30_generation.py     # Generation unit + API tests
│   └── test_50_integration.py    # End-to-end tests (require Azure auth)
├── notebooks/
│   ├── 01_ingestion.ipynb        # Ingestion observability
│   ├── 02_retrieval.ipynb        # Retrieval observability
│   └── 03_rag.ipynb              # Full RAG pipeline + evaluation
├── docs/
│   ├── architecture.md
│   ├── authentication.md
│   ├── model_isolation.md
│   ├── testing.md
│   └── llamaindex_examples/      # Standalone example scripts
└── data/
    └── index/                    # Persisted vector index (gitignored)

Data Sources

The system uses the rag-mini-wikipedia dataset from HuggingFace:

  • Passages: hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet
  • Test Questions: hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet
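Both files can be read directly with pandas, as the ingestion code is described as doing. The sketch below is illustrative only: `load_parquet` is a hypothetical helper, and it makes no assumption about the column names in either file.

```python
from typing import Optional

PASSAGES_URL = "hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet"
TEST_URL = "hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet"


def load_parquet(url: str, limit: Optional[int] = None):
    """Read one of the dataset files into a pandas DataFrame.

    Needs network access plus pandas and huggingface-hub installed.
    """
    import pandas as pd  # imported lazily so the constants are usable without pandas

    frame = pd.read_parquet(url)
    return frame if limit is None else frame.head(limit)
```

Calling `load_parquet(PASSAGES_URL, limit=3)` and printing `.columns` on the result is a quick way to check the dataset structure before running a full ingestion.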

Model Configuration

This project uses one local embedding model and one Azure OpenAI chat model:

  • Embedding Model: bge-small-en-v1.5 (local HuggingFace) — switched from text-embedding-3-large due to Azure rate limiting
  • Chat Model: gpt-4o (2024-10-01-preview)

All model access goes through the llamaindex_models.py isolation layer. See docs/model_isolation.md for details.

Development Workflow

1. First Time Setup

# Authenticate
azd auth login --scope api://ailab/Model.Access

# Install dependencies
uv venv
uv sync

# Start API server
uv run uvicorn api.main:app --reload --app-dir src

2. Ingest Data

# Trigger the ETL pipeline (takes several minutes the first time)
curl -X POST http://localhost:8000/ingest

# Verify the index is loaded
curl http://localhost:8000/ingest/status

The index is persisted to data/index/ and reloaded automatically on subsequent server starts.

3. Explore with Notebooks

# Start JupyterLab (with the server still running in another terminal)
uv run jupyter lab

Open notebooks in order: 01_ingestion → 02_retrieval → 03_rag.

4. Iterate and Observe

  • Use GET /rag/prompt-preview to inspect what context the system retrieved before spending an LLM call
  • Use POST /rag/evaluate to run batches of the labeled test questions and compare generated vs. expected answers
  • The test.parquet dataset has 918 labeled Q&A pairs for evaluation
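When comparing generated and expected answers, a simple automatic score can help spot regressions between runs. This README does not state which metric, if any, the evaluate endpoint computes, so the following is only one possible approach: normalized exact match and token-level F1, with hypothetical function names.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(generated: str, expected: str) -> bool:
    """True when both answers contain the same tokens in the same order."""
    return normalize(generated) == normalize(expected)


def token_f1(generated: str, expected: str) -> float:
    """Harmonic mean of token precision and recall between the two answers."""
    gen, exp = normalize(generated), normalize(expected)
    if not gen or not exp:
        return float(gen == exp)
    overlap = sum((Counter(gen) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Yes.", "yes"))
print(round(token_f1("the 16th president", "16th president"), 2))
```

Token F1 gives partial credit when the model's answer is longer than the short labeled answer, which exact match alone would score as a miss.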

Troubleshooting

Authentication Issues

# Log in again to refresh credentials
azd auth login --scope api://ailab/Model.Access

# Check when the token expires, then print the token itself
azd auth token --output json | jq -r '.expiresOn'
azd auth token --output json | jq -r '.token'

# Remove the cached token
azd auth logout
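The expiry check can also be automated. The sketch below parses the JSON printed by the token command and reports whether the token is about to expire. It assumes `expiresOn` is an ISO 8601 timestamp; `token_expires_within` is a hypothetical helper, and the token value in the example is a placeholder.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def token_expires_within(
    token_json: str, margin: timedelta, now: Optional[datetime] = None
) -> bool:
    """Return True if the token expires at or before now + margin."""
    expires_raw = json.loads(token_json)["expiresOn"]
    # Assumption: expiresOn is ISO 8601; a trailing "Z" means UTC.
    expires = datetime.fromisoformat(expires_raw.replace("Z", "+00:00"))
    if expires.tzinfo is None:
        expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return expires <= now + margin


sample = '{"token": "<redacted>", "expiresOn": "2026-03-01T12:00:00Z"}'
at_1130 = datetime(2026, 3, 1, 11, 30, tzinfo=timezone.utc)
print(token_expires_within(sample, timedelta(minutes=45), now=at_1130))
```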

Import Errors

# Reinstall dependencies
uv sync --reinstall

Index Not Found

If you get "Index not loaded. Run POST /ingest first." errors:

# Build the index
curl -X POST http://localhost:8000/ingest

# Or check that the server loaded an existing index at startup
curl http://localhost:8000/ingest/status
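For context, an error like this typically comes from a guard that routes call before touching the index. The sketch below shows the idea with hypothetical names (`AppState`, `IndexNotLoadedError`); it is not the contents of api/state.py, and a real route would usually translate the exception into an HTTP error response.

```python
from dataclasses import dataclass
from typing import Any, Optional


class IndexNotLoadedError(RuntimeError):
    """Raised when a route needs the index before ingestion has run."""


@dataclass
class AppState:
    """Shared holder for the index, filled at startup or by POST /ingest."""
    index: Optional[Any] = None

    def require_index(self) -> Any:
        if self.index is None:
            raise IndexNotLoadedError("Index not loaded. Run POST /ingest first.")
        return self.index


state = AppState()
try:
    state.require_index()
except IndexNotLoadedError as err:
    print(err)

state.index = object()  # stand-in for a real index
assert state.require_index() is state.index
```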

Dataset Loading Issues

The HuggingFace dataset is fetched over the network via pandas.read_parquet("hf://..."). Ensure you have network access and huggingface-hub installed (uv sync handles this).

Testing

The system includes 36 tests across 5 files (31 unit tests and 5 integration tests). Unit tests run without Azure auth; integration tests require a live connection.

Quick Start

# All unit tests (no Azure auth needed) — runs in ~2 seconds
uv run pytest tests/ -k "not integration"

# Integration tests (require Azure auth + network)
uv run pytest tests/ -m integration -v

# Full suite
uv run pytest tests/ -v

See docs/testing.md for the complete testing guide.

Test Organization

  • test_00_smoke — Import and filesystem checks (8 tests)
  • test_10_ingestion — Ingestion pipeline unit tests + API endpoint tests (10 tests)
  • test_20_retrieval — Retrieval unit tests + API endpoint tests (6 tests)
  • test_30_generation — Generation unit tests + API endpoint tests (7 tests)
  • test_50_integration — End-to-end tests with real models (5 tests, marked integration)

Design Principles

This project follows two core principles:

  1. Simple (a la Rich Hickey): Independent, unentangled components in a clear data pipeline
  2. Explainable (a la "Rewilding Software Engineering"): Observable, inspectable steps from query to answer

See instructions.md for the complete project brief.

Examples

See the docs/llamaindex_examples/example_*.py files for standalone demonstrations:

  • example_model_isolation.py - Model access patterns
  • example_chat_usage.py - LLM completions
  • example_vector_search.py - Vector similarity search

Documentation

Comprehensive documentation is available in the docs/ directory:

  • Architecture: System design and data flow
  • Authentication: Azure setup and model access
  • Model Isolation: Security and controlled access patterns
  • Examples: Usage patterns and best practices

Requirements Coverage

Traceability from the original project spec to implementation.

Prerequisites

| Requirement | Implementation |
| --- | --- |
| FastAPI endpoints for each task | 9 endpoints across 3 routers (/ingest, /query, /rag), plus /health |
| Observability notebooks for each phase | 01_ingestion.ipynb, 02_retrieval.ipynb, 03_rag.ipynb |
| Test-driven: tests pass before proceeding | 31 unit tests passing in ~3s; integration tests marked separately |

Ingestion

| Requirement | Implementation |
| --- | --- |
| Load from passages.parquet on HuggingFace | load_wikipedia_passages() via pd.read_parquet("hf://...") |
| Create embeddings for source data | Embedded at ingest time via HuggingFaceEmbedding (bge-small-en-v1.5) |
| Load into local-only LlamaIndex vector DB | VectorStoreIndex, persisted to data/index/, reloaded on server startup |
| Observability notebook | 01_ingestion.ipynb: sample passages, ingest trigger, status, node browser |

Note on embedding model: The spec references text-embedding-3-large (Azure). We switched to the local bge-small-en-v1.5 model due to Azure rate limiting issues encountered during development. The local model runs entirely offline and requires no Azure auth for embeddings.

User Query and Content Retrieval

| Requirement | Implementation |
| --- | --- |
| Generate embeddings for a query | embed_query() → POST /query/embed |
| Retrieve top-K matches from vector DB | retrieve_top_k() → POST /query/retrieve |
| Observability notebook | 02_retrieval.ipynb: vector inspection, cosine similarity comparison (similar vs. unrelated queries), top-K results |
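The cosine similarity comparison mentioned for the retrieval notebook can be illustrated in a few lines of plain Python. This is a toy sketch, not the project's retrieve_top_k(): the helper names and the three-dimensional vectors are made up for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: List[float], passage_vecs: Dict[str, List[float]], k: int = 5
) -> List[Tuple[str, float]]:
    """Score every passage against the query and keep the k best."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in passage_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy 3-dimensional vectors; real bge-small-en-v1.5 embeddings have 384 dimensions.
vectors = {"lincoln": [0.9, 0.1, 0.0], "photosynthesis": [0.0, 0.2, 0.9]}
print(top_k([1.0, 0.0, 0.0], vectors, k=1))
```

A query vector pointing in nearly the same direction as a passage vector scores close to 1, while an unrelated passage scores near 0, which is the contrast the notebook visualises.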

Augment Prompt & Generate Response

| Requirement | Implementation |
| --- | --- |
| Construct augmented prompt from query + retrieved docs | build_augmented_prompt() → GET /rag/prompt-preview |
| Send to LLM and synthesize answer | generate_answer() → POST /rag/answer (GPT-4o) |
| Observability notebook | 03_rag.ipynb: prompt inspection, full Q&A, pipeline trace, evaluation |
| Use test.parquet Q&A pairs for evaluation | POST /rag/evaluate loads from test.parquet, returns generated vs. expected for N questions |
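Prompt augmentation amounts to placing the retrieved passages ahead of the question in a fixed template. The sketch below reuses the function name from the table, but the template wording and passage numbering are assumptions made for illustration; the project's actual prompt may differ.

```python
from typing import List


def build_augmented_prompt(query: str, passages: List[str]) -> str:
    """Combine retrieved passages and the user question into one prompt."""
    if passages:
        # Number the passages so an answer can be traced back to its source.
        context = "\n\n".join(
            f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
        )
    else:
        context = "(no passages retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


print(build_augmented_prompt(
    "Who was Abraham Lincoln?",
    ["Abraham Lincoln was the 16th president of the United States."],
))
```

Keeping this step as a pure function of the query and passages is what makes a prompt preview possible without an LLM call.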

About

Repo to house the CapTech Forge Capstone Project of March 2026
