This is the README for the RAG system you are building. For instructions to the consultant leading the project, see the readme_setup.
A Retrieval-Augmented Generation (RAG) system using LlamaIndex with Wikipedia articles, built with FastAPI and Azure OpenAI models.
This project implements a complete RAG pipeline that:
- Loads Wikipedia passages from HuggingFace datasets
- Generates embeddings using `bge-small-en-v1.5` (local HuggingFace model)
- Stores vectors in a local LlamaIndex vector database
- Retrieves relevant documents based on semantic similarity
- Generates answers using Azure OpenAI's `gpt-4o` model
- 🔒 Secure Model Access: All Azure OpenAI models accessed through controlled authentication
- 🚀 FastAPI Backend: RESTful API for all RAG operations
- 📊 Full Observability: Jupyter notebooks for inspecting every stage of the pipeline
- 🎯 Explainable: Clear separation of retrieval and generation steps
- 🧪 Evaluation Ready: Built-in support for test questions and metrics
User Query → Embedding → Vector Search → Retrieve Top-K Docs → Augment Prompt → GPT-4o → Answer
See docs/architecture.md for detailed architecture diagrams.
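The flow above keeps retrieval and generation as separate steps. The following is a minimal sketch of that composition, not the project's actual code: `Passage`, `augment`, `answer`, and the two stub functions are hypothetical names, and the stubs stand in for the vector search and the GPT-4o call so the example runs without any services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    score: float


# Hypothetical stage signatures; the project's real functions may differ.
Retriever = Callable[[str, int], List[Passage]]
Generator = Callable[[str], str]


def augment(query: str, passages: List[Passage]) -> str:
    """Combine the retrieved passages and the user query into one prompt."""
    context = "\n".join(f"[{i + 1}] {p.text}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query: str, k: int, retrieve: Retriever, generate: Generator) -> dict:
    """Run retrieval and generation as separate, inspectable steps."""
    passages = retrieve(query, k)
    prompt = augment(query, passages)
    return {"query": query, "passages": passages, "prompt": prompt,
            "answer": generate(prompt)}


def fake_retrieve(query: str, k: int) -> List[Passage]:
    """Stand-in for the vector search: returns fixed passages."""
    corpus = [
        Passage("Abraham Lincoln was the 16th US president.", 0.91),
        Passage("Lincoln was born in Kentucky in 1809.", 0.84),
    ]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    """Stand-in for the LLM call: reports how many passages it saw."""
    return f"(stub answer based on {prompt.count('[')} passages)"


result = answer("Who was Abraham Lincoln?", 2, fake_retrieve, fake_generate)
print(result["prompt"])
print(result["answer"])
```

Because every intermediate value (passages, prompt, answer) is returned, each stage can be inspected on its own, which is the property the notebooks rely on.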
- Python 3.13+
- Azure Access: Access to Azure OpenAI services with AI Lab models
- Azure Developer CLI (`azd`): For authentication
- UV Package Manager: Recommended for dependency management
cd /path/to/your_project
# Set up virtual environment and install dependencies using uv
uv venv
uv sync

# Login with AI Lab scope
azd auth login --scope api://ailab/Model.Access

This authentication is required for accessing the Azure OpenAI GPT-4o model used for answer generation.
See docs/authentication.md for more details on authentication setup.
uv run uvicorn api.main:app --reload --app-dir src

The server starts on http://localhost:8000. If a persisted index already exists in data/index/, it is loaded automatically at startup.
Interactive Swagger UI is available at http://localhost:8000/docs once the server is running.
curl http://localhost:8000/health
# {"status": "ok", "index_loaded": true}

# Trigger full ingestion: load HuggingFace passages → embed → persist index
# Takes ~5–10 minutes on first run (embeds ~1,000 passages)
curl -X POST http://localhost:8000/ingest
# Check ingestion status
curl http://localhost:8000/ingest/status
# Preview raw passages from the dataset
curl "http://localhost:8000/ingest/sample?n=3"

# Retrieve top-K passages for a query
curl -X POST http://localhost:8000/query/retrieve \
-H "Content-Type: application/json" \
-d '{"query": "Who was Abraham Lincoln?", "k": 5}'
# Inspect the embedding vector for a query
curl -X POST http://localhost:8000/query/embed \
-H "Content-Type: application/json" \
-d '{"query": "Who was Abraham Lincoln?"}'
# Browse nodes stored in the index
curl "http://localhost:8000/query/documents?limit=10"

# Full pipeline: retrieve context → augment prompt → GPT-4o → answer
curl -X POST http://localhost:8000/rag/answer \
-H "Content-Type: application/json" \
-d '{"query": "Who was Abraham Lincoln?", "k": 5}'
# Preview the augmented prompt without calling the LLM
curl "http://localhost:8000/rag/prompt-preview?query=Who+was+Abraham+Lincoln%3F&k=3"
# Evaluate against the HuggingFace test questions dataset
curl -X POST http://localhost:8000/rag/evaluate \
-H "Content-Type: application/json" \
-d '{"n": 10, "k": 5}'

The system includes three Jupyter notebooks for inspecting every stage of the RAG pipeline. All notebooks call the running FastAPI server, so start the server and run POST /ingest before opening them.
| Notebook | Purpose |
|---|---|
| `notebooks/01_ingestion.ipynb` | Inspect passage count, sample texts, dataset structure, index stats |
| `notebooks/02_retrieval.ipynb` | Visualise embedding vectors, compare cosine similarity, see top-K results |
| `notebooks/03_rag.ipynb` | Preview augmented prompts, full Q&A pipeline, evaluation against test dataset |
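Since the notebooks talk to the running server over HTTP, a notebook cell might look roughly like the sketch below. It uses only the standard library; the helper names (`build_answer_request`, `index_is_loaded`, `ask`) are hypothetical, and the health payload shape is taken from the `/health` example in this README. The shape of the `/rag/answer` response is not documented here, so `ask` simply returns the parsed JSON.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # default address from this README


def build_answer_request(query: str, k: int = 5) -> request.Request:
    """Build (but do not send) a POST /rag/answer request."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/rag/answer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_is_loaded(health_body: str) -> bool:
    """Interpret the JSON body returned by GET /health."""
    payload = json.loads(health_body)
    return payload.get("status") == "ok" and bool(payload.get("index_loaded"))


def ask(query: str, k: int = 5, timeout: float = 60.0) -> dict:
    """Send the request; this needs the server to be running."""
    with request.urlopen(build_answer_request(query, k), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Offline demonstration: inspect the request and parse a sample health body.
req = build_answer_request("Who was Abraham Lincoln?", k=3)
print(req.get_method(), req.full_url)
print(index_is_loaded('{"status": "ok", "index_loaded": true}'))
```

Checking `index_is_loaded` before calling `ask` avoids the "Index not loaded" error when ingestion has not been run yet.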
# In one terminal: start the API server
uv run uvicorn api.main:app --reload --app-dir src
# In another terminal: start JupyterLab
uv run jupyter lab

Then open any notebook from the notebooks/ directory in the JupyterLab interface.
.
├── src/
│ ├── llamaindex_models.py # Controlled model access (embedding + LLM)
│ ├── ailab/
│ │ └── utils/azure.py # Azure auth token provider
│ ├── rag/
│ │ ├── ingestion.py # Load passages, build/persist vector index
│ │ ├── retrieval.py # Embed query, retrieve top-K passages
│ │ └── generation.py # Augment prompt, generate answer
│ └── api/
│ ├── main.py # FastAPI app, lifespan index loading
│ ├── state.py # Shared index/model singletons
│ └── routes/
│ ├── ingest.py # POST /ingest, GET /ingest/status|sample
│ ├── query.py # POST /query/embed|retrieve, GET /query/documents
│ └── rag.py # POST /rag/answer|evaluate, GET /rag/prompt-preview
├── tests/
│ ├── test_00_smoke.py # Import and filesystem checks
│ ├── test_10_ingestion.py # Ingestion pipeline unit + API tests
│ ├── test_20_retrieval.py # Retrieval unit + API tests
│ ├── test_30_generation.py # Generation unit + API tests
│ └── test_50_integration.py # End-to-end tests (require Azure auth)
├── notebooks/
│ ├── 01_ingestion.ipynb # Ingestion observability
│ ├── 02_retrieval.ipynb # Retrieval observability
│ └── 03_rag.ipynb # Full RAG pipeline + evaluation
├── docs/
│ ├── architecture.md
│ ├── authentication.md
│ ├── model_isolation.md
│ ├── testing.md
│ └── llamaindex_examples/ # Standalone example scripts
└── data/
└── index/ # Persisted vector index (gitignored)
The system uses the rag-mini-wikipedia dataset from HuggingFace:
- Passages: `hf://datasets/rag-datasets/rag-mini-wikipedia/data/passages.parquet/part.0.parquet`
- Test Questions: `hf://datasets/rag-datasets/rag-mini-wikipedia/data/test.parquet/part.0.parquet`
This project uses controlled access to Azure OpenAI models:
- Embedding Model: `bge-small-en-v1.5` (local HuggingFace), switched from `text-embedding-3-large` due to Azure rate limiting
- Chat Model: `gpt-4o` (2024-10-01-preview)
All model access goes through the llamaindex_models.py isolation layer. See docs/model_isolation.md for details.
# Authenticate
azd auth login --scope api://ailab/Model.Access
# Install dependencies
uv venv
uv sync
# Start API server
uv run uvicorn api.main:app --reload --app-dir src

# Trigger the ETL pipeline (takes several minutes the first time)
curl -X POST http://localhost:8000/ingest
# Verify the index is loaded
curl http://localhost:8000/ingest/status

The index is persisted to data/index/ and reloaded automatically on subsequent server starts.
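The reload-on-startup behaviour could be implemented along the lines of the sketch below. This is an illustration under assumptions, not the code in `api/main.py`: `load_index_if_present` and `fake_loader` are hypothetical names, and the loader is injected so the example runs without LlamaIndex. In a real setup the loader would presumably wrap LlamaIndex's `load_index_from_storage` with a `StorageContext` pointing at the persist directory.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_index_if_present(index_dir: Path,
                          loader: Callable[[Path], object]) -> Optional[object]:
    """Return a loaded index when index_dir holds persisted files, else None."""
    if not index_dir.is_dir() or not any(index_dir.iterdir()):
        return None  # nothing persisted yet; POST /ingest has to build it
    return loader(index_dir)


def fake_loader(path: Path) -> str:
    """Stand-in for the real index loader (assumption for this demo)."""
    return f"loaded:{path.name}"


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "index"
    before = load_index_if_present(index_dir, fake_loader)  # directory missing
    index_dir.mkdir()
    (index_dir / "placeholder.json").write_text("{}")  # simulate a persisted file
    after = load_index_if_present(index_dir, fake_loader)

print(before, after)
```

Returning `None` rather than raising lets the server start without an index and report `index_loaded: false` until ingestion has run.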
# Start JupyterLab (with the server still running in another terminal)
uv run jupyter lab

Open notebooks in order: 01_ingestion → 02_retrieval → 03_rag.
- Use `GET /rag/prompt-preview` to inspect what context the system retrieved before spending an LLM call
- Use `POST /rag/evaluate` to run batches of the labeled test questions and compare generated vs. expected answers
- The `test.parquet` dataset has 918 labeled Q&A pairs for evaluation
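This README does not specify how generated and expected answers are compared, so the sketch below shows one common option, token-level F1, purely as an assumption. The function names and the `generated`/`expected` keys in the sample rows are hypothetical and may not match what `POST /rag/evaluate` returns.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(generated: str, expected: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    gen, exp = normalize(generated), normalize(expected)
    if not gen or not exp:
        return float(gen == exp)  # 1.0 only when both are empty
    overlap = sum((Counter(gen) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows in the spirit of "generated vs. expected".
rows = [
    {"generated": "Yes.", "expected": "yes"},
    {"generated": "He was born in 1809", "expected": "1809"},
    {"generated": "Kentucky", "expected": "Illinois"},
]
scores = [token_f1(r["generated"], r["expected"]) for r in rows]
print([round(s, 3) for s in scores])
print("mean F1:", round(sum(scores) / len(scores), 3))
```

Token F1 gives partial credit to verbose but correct answers (the second row), where exact match would score them as wrong.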
# Verify Azure login
azd auth login --scope api://ailab/Model.Access
# Check token
azd auth token --output json | jq -r '.expiresOn'
azd auth token --output json | jq -r '.token'
# Remove token
azd auth logout

# Reinstall dependencies
uv sync --reinstall

If you get "Index not loaded. Run POST /ingest first." errors:
# Build the index
curl -X POST http://localhost:8000/ingest
# Or check that the server loaded an existing index at startup
curl http://localhost:8000/ingest/status

The HuggingFace dataset is fetched over the network via pandas.read_parquet("hf://..."). Ensure you have network access and huggingface-hub installed (uv sync handles this).
The system includes 36 tests across 5 files (31 unit tests and 5 integration tests). Unit tests run without Azure auth; integration tests require a live connection.
# All unit tests (no Azure auth needed) — runs in ~2 seconds
uv run pytest tests/ -k "not integration"
# Integration tests (require Azure auth + network)
uv run pytest tests/ -m integration -v
# Full suite
uv run pytest tests/ -v

See docs/testing.md for the complete testing guide.
- `test_00_smoke`: Import and filesystem checks (8 tests)
- `test_10_ingestion`: Ingestion pipeline unit tests + API endpoint tests (10 tests)
- `test_20_retrieval`: Retrieval unit tests + API endpoint tests (6 tests)
- `test_30_generation`: Generation unit tests + API endpoint tests (7 tests)
- `test_50_integration`: End-to-end tests with real models (5 tests, marked `integration`)
This project follows two core principles:
- Simple (a la Rich Hickey): Independent, unentangled components in a clear data pipeline
- Explainable (a la "Rewilding Software Engineering"): Observable, inspectable steps from query to answer
See instructions.md for the complete project brief.
See the docs/llamaindex_examples/example_*.py files for standalone demonstrations:
- `example_model_isolation.py` - Model access patterns
- `example_chat_usage.py` - LLM completions
- `example_vector_search.py` - Vector similarity search
Comprehensive documentation is available in the docs/ directory:
- Architecture: System design and data flow
- Authentication: Azure setup and model access
- Model Isolation: Security and controlled access patterns
- Examples: Usage patterns and best practices
Traceability from the original project spec to implementation.
| Requirement | Implementation |
|---|---|
| FastAPI endpoints for each task | 9 endpoints across 3 routers: /ingest, /query, /rag |
| Observability notebooks for each phase | 01_ingestion.ipynb, 02_retrieval.ipynb, 03_rag.ipynb |
| Test-driven: tests pass before proceeding | 31 unit tests passing in ~3s; integration tests marked separately |
| Requirement | Implementation |
|---|---|
| Load from `passages.parquet` on HuggingFace | `load_wikipedia_passages()` via `pd.read_parquet("hf://...")` |
| Create embeddings for source data | Embedded at ingest time via HuggingFaceEmbedding (bge-small-en-v1.5) |
| Load into local-only LlamaIndex vector DB | VectorStoreIndex, persisted to data/index/, reloaded on server startup |
| Observability notebook | 01_ingestion.ipynb — sample passages, ingest trigger, status, node browser |
Note on embedding model: The spec references `text-embedding-3-large` (Azure). We switched to the local `bge-small-en-v1.5` model due to Azure rate limiting issues encountered during development. The local model runs entirely offline and requires no Azure auth for embeddings.
| Requirement | Implementation |
|---|---|
| Generate embeddings for a query | embed_query() → POST /query/embed |
| Retrieve top-K matches from vector DB | retrieve_top_k() → POST /query/retrieve |
| Observability notebook | 02_retrieval.ipynb — vector inspection, cosine similarity comparison (similar vs. unrelated queries), top-K results |
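The cosine similarity comparison and top-K ranking mentioned in the table can be illustrated with plain Python. This is a sketch with made-up 3-dimensional vectors, not output from `bge-small-en-v1.5` (whose vectors are much longer), and `cosine_similarity` and `top_k` are hypothetical helpers rather than the project's `retrieve_top_k()`.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors: doc 0 points almost the same way as the query, doc 1 is orthogonal.
query = [1.0, 0.0, 0.0]
docs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
ranked = top_k(query, docs, k=2)
for index, score in ranked:
    print(index, round(score, 3))
```

A similar query should score close to 1.0 against relevant passages, while an unrelated one should score near 0, which is the contrast the retrieval notebook visualises.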
| Requirement | Implementation |
|---|---|
| Construct augmented prompt from query + retrieved docs | build_augmented_prompt() → GET /rag/prompt-preview |
| Send to LLM and synthesize answer | generate_answer() → POST /rag/answer (GPT-4o) |
| Observability notebook | 03_rag.ipynb — prompt inspection, full Q&A, pipeline trace, evaluation |
| Use `test.parquet` Q&A pairs for evaluation | `POST /rag/evaluate` loads from `test.parquet`, returns generated vs. expected for N questions |