Setup

This is the README for the RAG system you are building. For instructions to the consultant leading the project, see readme_setup.md.

Wikipedia RAG System

A Retrieval-Augmented Generation (RAG) system using LlamaIndex with Wikipedia articles, built with FastAPI and Azure OpenAI models.

Overview

This project implements a complete RAG pipeline that:

Loads Wikipedia passages from HuggingFace datasets
Generates embeddings using bge-small-en-v1.5 (local HuggingFace model)
Stores vectors in a local LlamaIndex vector database
Retrieves relevant documents based on semantic similarity
Generates answers using Azure OpenAI's gpt-4o model

Features

🔒 Secure Model Access: All Azure OpenAI models accessed through controlled authentication
🚀 FastAPI Backend: RESTful API for all RAG operations
📊 Full Observability: Jupyter notebooks for inspecting every stage of the pipeline
🎯 Explainable: Clear separation of retrieval and generation steps
🧪 Evaluation Ready: Built-in support for test questions and metrics

Architecture

User Query → Embedding → Vector Search → Retrieve Top-K Docs → Augment Prompt → GPT-4o → Answer

See docs/architecture.md for detailed architecture diagrams.
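The flow above can be sketched in a few lines of plain Python. This is an illustration only, not the project's code: `toy_embed` is a bag-of-words stand-in for bge-small-en-v1.5, the function names and prompt wording are hypothetical, and the final GPT-4o call is left out.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, passages: list[str], k: int) -> list[tuple[float, str]]:
    """Score every passage against the query and keep the k best."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(p)), p) for p in passages]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def augment(query: str, hits: list[tuple[float, str]]) -> str:
    """Build the prompt that would be sent to the LLM (wording is hypothetical)."""
    context = "\n".join(f"[{i}] {p}" for i, (_, p) in enumerate(hits, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


passages = [
    "Abraham Lincoln was the 16th president of the United States.",
    "The beetle is an insect with hardened wing cases.",
    "Lincoln led the United States through the Civil War.",
]
query = "Who was Abraham Lincoln?"
hits = top_k(query, passages, k=2)
prompt = augment(query, hits)
```

Keeping retrieval (`top_k`) and prompt construction (`augment`) as separate functions mirrors the separation of retrieval and generation steps listed under Features.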

Prerequisites

Python 3.13+
Azure Access: Access to Azure OpenAI services with AI Lab models
Azure Developer CLI (azd): For authentication (the commands below use azd)
UV Package Manager: Recommended for dependency management

Installation

1. Clone and Setup

cd /path/to/your_project

# Set up virtual environment and install dependencies using uv
uv venv
uv sync

2. Authenticate with Azure

# Login with AI Lab scope
azd auth login --scope api://ailab/Model.Access

This authentication is required for accessing the Azure OpenAI GPT-4o model used for answer generation.

See docs/authentication.md for more details on authentication setup.

Usage

Starting the API Server

uv run uvicorn api.main:app --reload --app-dir src

The server starts on http://localhost:8000. If a persisted index already exists in data/index/, it is loaded automatically at startup.

API Documentation

Interactive Swagger UI is available at http://localhost:8000/docs once the server is running.

Key Endpoints

Health Check

curl http://localhost:8000/health
# {"status": "ok", "index_loaded": true}

Ingest Data

# Trigger full ingestion: load HuggingFace passages → embed → persist index
# Takes ~5–10 minutes on first run (embeds ~1,000 passages)
curl -X POST http://localhost:8000/ingest

# Check ingestion status
curl http://localhost:8000/ingest/status

# Preview raw passages from the dataset
curl "http://localhost:8000/ingest/sample?n=3"
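Since ingestion can take several minutes, a client script may want to wait until the index is available before querying. The sketch below polls for the `index_loaded` flag shown in the /health response above. The function name, retry count, and delay are hypothetical, and whether POST /ingest blocks until completion is not stated here, so treat this as one possible approach.

```python
import json
import time
from typing import Callable


def wait_for_index(
    fetch_health: Callable[[], str],
    attempts: int = 60,
    delay_s: float = 10.0,
) -> bool:
    """Poll a health endpoint until it reports index_loaded, or give up.

    fetch_health returns the raw JSON body of GET /health, for example:
        lambda: urllib.request.urlopen("http://localhost:8000/health").read().decode()
    """
    for attempt in range(attempts):
        health = json.loads(fetch_health())
        if health.get("index_loaded"):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # wait before the next poll
    return False
```

Passing the fetch function in as an argument keeps the polling logic testable without a running server.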

Query Documents

# Retrieve top-K passages for a query
curl -X POST http://localhost:8000/query/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?", "k": 5}'

# Inspect the embedding vector for a query
curl -X POST http://localhost:8000/query/embed \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?"}'

# Browse nodes stored in the index
curl "http://localhost:8000/query/documents?limit=10"

Generate Answer (RAG)

# Full pipeline: retrieve context → augment prompt → GPT-4o → answer
curl -X POST http://localhost:8000/rag/answer \
  -H "Content-Type: application/json" \
  -d '{"query": "Who was Abraham Lincoln?", "k": 5}'

# Preview the augmented prompt without calling the LLM
curl "http://localhost:8000/rag/prompt-preview?query=Who+was+Abraham+Lincoln%3F&k=3"

# Evaluate against the HuggingFace test questions dataset
curl -X POST http://localhost:8000/rag/evaluate \
  -H "Content-Type: application/json" \
  -d '{"n": 10, "k": 5}'
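The same calls can be made from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, and the response bodies are returned as parsed JSON without assuming any particular schema.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"


def json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a POST request with a JSON body, like the curl -d examples."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_url(path: str, **params) -> str:
    """Build a GET URL with properly encoded query parameters."""
    return f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"


def send(request_or_url) -> dict:
    """Send the request to the running server and parse the JSON response."""
    with urllib.request.urlopen(request_or_url) as resp:
        return json.load(resp)


answer_request = json_post("/rag/answer", {"query": "Who was Abraham Lincoln?", "k": 5})
preview_url = get_url("/rag/prompt-preview", query="Who was Abraham Lincoln?", k=3)
```

With the server running, `send(answer_request)` and `send(preview_url)` perform the calls. `urlencode` produces the same `Who+was+Abraham+Lincoln%3F` encoding used in the curl example, so the question mark in the query does not need to be escaped by hand.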

Observability with Jupyter Notebooks

The system includes three Jupyter notebooks for inspecting every stage of the RAG pipeline. All notebooks call the running FastAPI server — start the server and run POST /ingest before opening them.

| Notebook | Purpose |
| --- | --- |
| `notebooks/01_ingestion.ipynb` | Inspect passage count, sample texts, dataset structure, index stats |
| `notebooks/02_retrieval.ipynb` | Visualise embedding vectors, compare cosine similarity, see top-K results |
| `notebooks/03_rag.ipynb` | Preview augmented prompts, full Q&A pipeline, evaluation against test dataset |

Running Notebooks

# In one terminal: start the API server
uv run uvicorn api.main:app --reload --app-dir src

# In another terminal: start JupyterLab
uv run jupyter lab

Then open any notebook from the notebooks/ directory in the JupyterLab interface.

Project Structure

.
├── src/
│   ├── llamaindex_models.py      # Controlled model access (embedding + LLM)
│   ├── ailab/
│   │   └── utils/azure.py        # Azure auth token provider
│   ├── rag/
│   │   ├── ingestion.py          # Load passages, build/persist vector index
│   │   ├── retrieval.py          # Embed query, retrieve top-K passages
│   │   └── generation.py         # Augment prompt, generate answer
│   └── api/
│       ├── main.py               # FastAPI app, lifespan index loading
│       ├── state.py              # Shared index/model singletons
│       └── routes/
│           ├── ingest.py         # POST /ingest, GET /ingest/status|sample
│           ├── query.py          # POST /query/embed|retrieve, GET /query/documents
│           └── rag.py            # POST /rag/answer|evaluate, GET /rag/prompt-preview
├── tests/
│   ├── test_00_smoke.py          # Import and filesystem checks
│   ├── test_10_ingestion.py      # Ingestion pipeline unit + API tests
│   ├── test_20_retrieval.py      # Retrieval unit + API tests
│   ├── test_30_generation.py     # Generation unit + API tests
│   └── test_50_integration.py    # End-to-end tests (require Azure auth)
├── notebooks/
│   ├── 01_ingestion.ipynb        # Ingestion observability
│   ├── 02_retrieval.ipynb        # Retrieval observability
│   └── 03_rag.ipynb              # Full RAG pipeline + evaluation
├── docs/
│   ├── architecture.md
│   ├── authentication.md
│   ├── model_isolation.md
│   ├── testing.md
│   └── llamaindex_examples/      # Standalone example scripts
└── data/
    └── index/                    # Persisted vector index (gitignored)

Data Sources

The system uses the rag-mini-wikipedia dataset from HuggingFace:

Passages: hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet
Test Questions: hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet

Model Configuration

This project uses controlled access to its models. Only the chat model is hosted on Azure OpenAI; embeddings run locally:

Embedding Model: bge-small-en-v1.5 (local HuggingFace) — switched from text-embedding-3-large due to Azure rate limiting
Chat Model: gpt-4o (2024-10-01-preview)

All model access goes through the llamaindex_models.py isolation layer. See docs/model_isolation.md for details.

Development Workflow

1. First Time Setup

# Authenticate
azd auth login --scope api://ailab/Model.Access

# Install dependencies
uv venv
uv sync

# Start API server
uv run uvicorn api.main:app --reload --app-dir src

2. Ingest Data

# Trigger the ETL pipeline (takes several minutes the first time)
curl -X POST http://localhost:8000/ingest

# Verify the index is loaded
curl http://localhost:8000/ingest/status

The index is persisted to data/index/ and reloaded automatically on subsequent server starts.
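The load-at-startup behaviour could look roughly like the sketch below. It is not the code in api/main.py: the function names are hypothetical, and the check for JSON files assumes LlamaIndex's default file-based persistence. The LlamaIndex import is deferred so the existence check works on its own.

```python
from pathlib import Path


def has_persisted_index(index_dir: Path) -> bool:
    """True if the directory exists and holds at least one JSON store file."""
    return index_dir.is_dir() and any(index_dir.glob("*.json"))


def load_if_present(index_dir: Path):
    """Return the persisted index, or None if nothing has been ingested yet."""
    if not has_persisted_index(index_dir):
        return None
    # Imported lazily so the check above works without llama-index installed.
    from llama_index.core import StorageContext, load_index_from_storage

    storage = StorageContext.from_defaults(persist_dir=str(index_dir))
    return load_index_from_storage(storage)
```

A server built this way would call `load_if_present(Path("data/index"))` during startup and report the index as not loaded when it returns None. The embedding model would need to be configured the same way as at ingest time before loading, since queries must be embedded in the same vector space as the stored passages.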

3. Explore with Notebooks

# Start JupyterLab (with the server still running in another terminal)
uv run jupyter lab

Open notebooks in order: 01_ingestion → 02_retrieval → 03_rag.

4. Iterate and Observe

Use GET /rag/prompt-preview to inspect what context the system retrieved before spending an LLM call
Use POST /rag/evaluate to run batches of the labeled test questions and compare generated vs. expected answers
The test.parquet dataset has 918 labeled Q&A pairs for evaluation
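When comparing generated and expected answers by hand becomes tedious, a simple automatic score can help with triage. The sketch below uses exact match and token-level F1, which are common choices for short-answer QA. These metrics are an assumption for illustration, not something POST /rag/evaluate is documented to compute, and the functions take plain strings rather than assuming a response schema.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(generated: str, expected: str) -> bool:
    """True if the answers are identical after normalization."""
    return normalize(generated) == normalize(expected)


def token_f1(generated: str, expected: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    gen, exp = normalize(generated), normalize(expected)
    if not gen or not exp:
        return float(gen == exp)
    overlap = sum((Counter(gen) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)
```

Token F1 gives partial credit when a generated sentence contains the short expected answer, which exact match would score as a miss. Neither metric judges whether a differently worded answer is actually correct.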

Troubleshooting

Authentication Issues

# Log in again if the session may have expired
azd auth login --scope api://ailab/Model.Access

# Check the token's expiry time, then print the token itself
azd auth token --output json | jq -r '.expiresOn'
azd auth token --output json | jq -r '.token'

# Log out (clears cached credentials)
azd auth logout
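If jq is not available, the expiry check can be done in Python. This sketch assumes the JSON printed by `azd auth token --output json` has an `expiresOn` field holding an ISO 8601 timestamp with a timezone, as the jq commands above imply. The function name is hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_expiry(token_json: str, now: Optional[datetime] = None) -> float:
    """Return how many seconds remain before the token expires (negative if expired)."""
    expires_on = json.loads(token_json)["expiresOn"]
    # Older Python versions do not accept a trailing "Z", so spell out the UTC offset.
    expires_at = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
    current = now or datetime.now(timezone.utc)
    return (expires_at - current).total_seconds()
```

A negative result means the token has expired and `azd auth login` needs to be run again.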

Import Errors

# Reinstall dependencies
uv sync --reinstall

Index Not Found

If you get "Index not loaded. Run POST /ingest first." errors:

# Build the index
curl -X POST http://localhost:8000/ingest

# Or check that the server loaded an existing index at startup
curl http://localhost:8000/ingest/status

Dataset Loading Issues

The HuggingFace dataset is fetched over the network via pandas.read_parquet("hf://..."). Ensure you have network access and huggingface-hub installed (uv sync handles this).
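A quick way to rule out a missing dependency before suspecting the network is to check that the relevant packages are importable. The sketch below is a hypothetical helper; the package list is an assumption (pyarrow is included as a typical parquet engine for pandas, but the project may use a different one).

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed requirements for pandas.read_parquet("hf://...").
REQUIRED = ["pandas", "pyarrow", "huggingface_hub"]
```

Running `uv run python -c "..."` with `missing_packages(REQUIRED)` should print an empty list in a healthy environment; anything listed points to `uv sync` not having completed.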

Testing

The system includes 36 tests across 5 files: 31 unit tests and 5 integration tests. Unit tests run without Azure auth; integration tests require a live connection.

Quick Start

# All unit tests (no Azure auth needed), runs in a few seconds
uv run pytest tests/ -k "not integration"

# Integration tests (require Azure auth + network)
uv run pytest tests/ -m integration -v

# Full suite
uv run pytest tests/ -v

See docs/testing.md for the complete testing guide.

Test Organization

test_00_smoke — Import and filesystem checks (8 tests)
test_10_ingestion — Ingestion pipeline unit tests + API endpoint tests (10 tests)
test_20_retrieval — Retrieval unit tests + API endpoint tests (6 tests)
test_30_generation — Generation unit tests + API endpoint tests (7 tests)
test_50_integration — End-to-end tests with real models (5 tests, marked integration)

Design Principles

This project follows two core principles:

Simple (a la Rich Hickey): Independent, unentangled components in a clear data pipeline
Explainable (a la "Rewilding Software Engineering"): Observable, inspectable steps from query to answer

See instructions.md for the complete project brief.

Examples

See the docs/llamaindex_examples/example_*.py files for standalone demonstrations:

example_model_isolation.py - Model access patterns
example_chat_usage.py - LLM completions
example_vector_search.py - Vector similarity search

Documentation

Comprehensive documentation is available in the docs/ directory:

Architecture: System design and data flow
Authentication: Azure setup and model access
Model Isolation: Security and controlled access patterns
Examples: Usage patterns and best practices

Requirements Coverage

Traceability from the original project spec to implementation.

Prerequisites

| Requirement | Implementation |
| --- | --- |
| FastAPI endpoints for each task | 9 endpoints across 3 routers (`/ingest`, `/query`, `/rag`), plus `/health` |
| Observability notebooks for each phase | `01_ingestion.ipynb`, `02_retrieval.ipynb`, `03_rag.ipynb` |
| Test-driven: tests pass before proceeding | 31 unit tests passing in ~3s; integration tests marked separately |

Ingestion

| Requirement | Implementation |
| --- | --- |
| Load from `passages.parquet` on HuggingFace | `load_wikipedia_passages()` via `pd.read_parquet("hf://...")` |
| Create embeddings for source data | Embedded at ingest time via `HuggingFaceEmbedding` (`bge-small-en-v1.5`) |
| Load into local-only LlamaIndex vector DB | `VectorStoreIndex`, persisted to `data/index/`, reloaded on server startup |
| Observability notebook | `01_ingestion.ipynb`: sample passages, ingest trigger, status, node browser |

Note on embedding model: The spec references text-embedding-3-large (Azure). We switched to the local bge-small-en-v1.5 model due to Azure rate limiting issues encountered during development. The local model runs entirely offline and requires no Azure auth for embeddings.

User Query and Content Retrieval

| Requirement | Implementation |
| --- | --- |
| Generate embeddings for a query | `embed_query()` → `POST /query/embed` |
| Retrieve top-K matches from vector DB | `retrieve_top_k()` → `POST /query/retrieve` |
| Observability notebook | `02_retrieval.ipynb`: vector inspection, cosine similarity comparison (similar vs. unrelated queries), top-K results |

Augment Prompt & Generate Response

| Requirement | Implementation |
| --- | --- |
| Construct augmented prompt from query + retrieved docs | `build_augmented_prompt()` → `GET /rag/prompt-preview` |
| Send to LLM and synthesize answer | `generate_answer()` → `POST /rag/answer` (GPT-4o) |
| Observability notebook | `03_rag.ipynb`: prompt inspection, full Q&A, pipeline trace, evaluation |
| Use `test.parquet` Q&A pairs for evaluation | `POST /rag/evaluate` loads from `test.parquet`, returns generated vs. expected for N questions |
