This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Two independent workstreams evaluating LLM long-context capabilities:
| Workstream | Entry Point | Purpose |
|---|---|---|
| BABILong Benchmark | `resources/notebooks/benchmark.py` | Evaluate LLM accuracy across context lengths (0k–128k tokens) using bAbI reasoning tasks |
| VoyageAI RAG Pipeline | `main.py` | Semantic retrieval demo: VoyageAI embedding → K-NN → reranking → OpenAI generation |
```bash
# Install dependencies
pip install -r requirements.txt

# Run VoyageAI RAG pipeline
python main.py

# Run BABILong benchmark
cd resources/notebooks && python benchmark.py

# Generate benchmark heatmaps
cd resources/notebooks && python visualize_results.py
```

Single `.env` file at repo root with all keys:

```
OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=pa-...
```
Both workstreams load from this file. `VOYAGE_API_KEY` is only needed for the RAG pipeline (`main.py`).
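The key-loading step can be sketched with the standard library alone (the scripts themselves may use a package such as `python-dotenv`; this is an illustrative equivalent, not the repo's code):

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Parse simple KEY=VALUE lines from a .env file and export them
    as environment variables, skipping blanks and # comments."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault so real environment variables take precedence
        os.environ.setdefault(key.strip(), value.strip())
```

Both entry points would then read `os.environ["OPENAI_API_KEY"]` (and, for the RAG pipeline, `os.environ["VOYAGE_API_KEY"]`) after calling this once at startup.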
```
User Query → VoyageAI embed (voyage-3.5) → Cosine similarity K-NN (k=2, sklearn)
           → VoyageAI rerank (rerank-2.5, top_k=3) → OpenAI GPT-4o response
```
Hardcoded knowledge base of customer support documents. No persistent vector store yet — MongoDB Atlas Vector Search is planned but not implemented.
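The K-NN step in the pipeline above can be illustrated without any API calls: given a query embedding and document embeddings, cosine similarity ranks the documents and the top k=2 are kept. A minimal pure-Python sketch (`main.py` itself uses scikit-learn on real voyage-3.5 vectors; the 3-d toy vectors here are invented for the example):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-d embeddings standing in for voyage-3.5 output
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
print(knn(query, docs, k=2))  # → [0, 1]
```

The retrieved candidates would then be passed to the reranker, which re-scores them with a cross-encoder before generation.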
- Loads bAbI tasks (qa1–qa20) from HuggingFace (`RMT-team/babilong`) with noise text from PG-19
- Interactive model selection via the OpenAI `/models` API
- Supports dual-model side-by-side comparison
- Uses LangChain SQLite cache to avoid redundant API calls
- Outputs: CSV results, JSON exports, PNG heatmaps
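The caching behavior noted above memoizes prompt → completion pairs so repeated benchmark runs skip identical API calls. The idea can be illustrated with stdlib `sqlite3` (class and function names here are hypothetical, not LangChain's actual API):

```python
import sqlite3

class PromptCache:
    """Tiny prompt -> response cache in the spirit of an SQLite LLM cache."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def get(self, prompt):
        row = self.db.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, response):
        self.db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response))
        self.db.commit()

def ask(cache, prompt, call_llm):
    """Consult the cache before making an (expensive) LLM call."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.put(prompt, response)
    return response
```

Because bAbI prompts repeat across runs and across the dual-model comparison, this kind of cache can cut API cost substantially on re-runs.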
- `prompts.py` — Task prompt templates for all 20 bAbI tasks
- `metrics.py` — Answer comparison with task-specific expected labels
- `babilong_utils.py` — Dataset loading and formatting utilities
- `collect_results.py` — Results aggregation
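The answer-comparison step can be sketched as: normalize both strings, then check whether the expected label appears in the model's answer. This normalization is a plausible approach for matching free-form LLM output against bAbI labels, not the actual code in `metrics.py`:

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'The Kitchen.' matches 'kitchen'."""
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()

def is_correct(model_answer: str, expected_label: str) -> bool:
    """True if the expected label occurs in the normalized model answer."""
    return normalize(expected_label) in normalize(model_answer)
```

Substring matching is forgiving of verbose answers ("The answer is: Kitchen."), which is why task-specific expected labels matter: each task constrains the label vocabulary enough for this check to be meaningful.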
- VoyageAI (`voyageai`) — embeddings and reranking
- OpenAI (`openai`) — LLM inference
- LangChain (`langchain`, `langchain-openai`) — caching and LLM orchestration in benchmarks
- scikit-learn — cosine similarity for retrieval
- HuggingFace Datasets — BABILong dataset loading