This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Two independent workstreams evaluating LLM long-context capabilities:
| Workstream | Entry Point | Purpose |
|---|---|---|
| BABILong Benchmark | `resources/notebooks/benchmark.py` | Evaluate LLM accuracy across context lengths (0k–128k tokens) using bAbI reasoning tasks |
| VoyageAI RAG Pipeline | `main.py` | Semantic retrieval demo: VoyageAI embedding → K-NN → reranking → OpenAI generation |
```bash
# Install dependencies
pip install -r requirements.txt

# Run VoyageAI RAG pipeline
python main.py

# Run BABILong benchmark
cd resources/notebooks && python benchmark.py

# Generate benchmark heatmaps
cd resources/notebooks && python visualize_results.py
```

Single `.env` file at repo root with all keys:

```
OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=pa-...
```
Both workstreams load from this file. `VOYAGE_API_KEY` is only needed for the RAG pipeline (`main.py`).
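The key-loading step can be sketched with the standard library alone (the scripts themselves may use a package such as `python-dotenv`; this is an illustrative equivalent, not the repo's code):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines from a .env file and export them
    as environment variables, skipping blanks and # comments."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault so real environment variables take precedence
        os.environ.setdefault(key.strip(), value.strip())
```

Both entry points would then read `os.environ["OPENAI_API_KEY"]` (and, for the RAG pipeline, `os.environ["VOYAGE_API_KEY"]`) after calling this once at startup.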
```
User Query → VoyageAI embed (voyage-3.5) → Cosine similarity K-NN (k=2, sklearn)
           → VoyageAI rerank (rerank-2.5, top_k=3) → OpenAI GPT-4o response
```
Hardcoded knowledge base of customer support documents. No persistent vector store yet — MongoDB Atlas Vector Search is planned but not implemented.
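The K-NN step in the pipeline above can be illustrated without any API calls: given a query embedding and document embeddings, cosine similarity ranks the documents and the top k=2 are kept. A minimal pure-Python sketch (`main.py` itself uses scikit-learn on real voyage-3.5 vectors; the 3-d toy vectors here are invented for the example):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-d embeddings standing in for voyage-3.5 output
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(knn(query, docs, k=2))  # → [0, 1]
```

The retrieved candidates would then be passed to the reranker, which re-scores them with a cross-encoder before generation.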
- Loads bAbI tasks (qa1–qa20) from HuggingFace (`RMT-team/babilong`) with noise text from PG-19
- Interactive model selection via the OpenAI `/models` API
- Supports dual-model side-by-side comparison
- Uses LangChain SQLite cache to avoid redundant API calls
- Outputs: CSV results, JSON exports, PNG heatmaps
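The caching behavior noted above memoizes prompt → completion pairs so repeated benchmark runs skip identical API calls. The idea can be illustrated with stdlib `sqlite3` (class and function names here are hypothetical, not LangChain's actual API):

```python
import sqlite3

class PromptCache:
    """Tiny prompt -> response cache in the spirit of an SQLite LLM cache."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def get(self, prompt):
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, response):
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response))
        self.db.commit()

def ask(cache, prompt, call_llm):
    """Consult the cache before making an (expensive) LLM call."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response
```

Because bAbI prompts repeat across runs and across the dual-model comparison, this kind of cache can cut API cost substantially on re-runs.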
- `prompts.py` — Task prompt templates for all 20 bAbI tasks
- `metrics.py` — Answer comparison with task-specific expected labels
- `babilong_utils.py` — Dataset loading and formatting utilities
- `collect_results.py` — Results aggregation
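The answer-comparison step can be sketched as: normalize both strings, then check whether the expected label appears in the model's answer. This normalization is a plausible approach for matching free-form LLM output against bAbI labels, not the actual code in `metrics.py`:

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'The Kitchen.' matches 'kitchen'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def is_correct(model_answer: str, expected_label: str) -> bool:
    """True if the expected label occurs in the normalized model answer."""
    return normalize(expected_label) in normalize(model_answer)
```

Substring matching is forgiving of verbose answers ("The answer is: Kitchen."), which is why task-specific expected labels matter: each task constrains the label vocabulary enough for this check to be meaningful.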
- VoyageAI (`voyageai`) — embeddings and reranking
- OpenAI (`openai`) — LLM inference
- LangChain (`langchain`, `langchain-openai`) — caching and LLM orchestration in benchmarks
- scikit-learn — cosine similarity for retrieval
- HuggingFace Datasets — BABILong dataset loading