CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Two independent workstreams evaluating LLM long-context capabilities:

| Workstream | Entry Point | Purpose |
|---|---|---|
| BABILong Benchmark | resources/notebooks/benchmark.py | Evaluate LLM accuracy across context lengths (0k–128k tokens) using bAbI reasoning tasks |
| VoyageAI RAG Pipeline | main.py | Semantic retrieval demo: VoyageAI embedding → K-NN → reranking → OpenAI generation |

Commands

# Install dependencies
pip install -r requirements.txt

# Run VoyageAI RAG pipeline
python main.py

# Run BABILong benchmark
cd resources/notebooks && python benchmark.py

# Generate benchmark heatmaps
cd resources/notebooks && python visualize_results.py

Environment Configuration

A single .env file at the repo root holds all API keys:

OPENAI_API_KEY=sk-...
VOYAGE_API_KEY=pa-...

Both workstreams load from this file. VOYAGE_API_KEY is only needed for the RAG pipeline (main.py).
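
As a preflight check, the key requirements above can be expressed in a few lines. This is a minimal sketch, not code from the repository: `REQUIRED_KEYS`, the workstream labels, and `missing_keys` are hypothetical names, and it assumes the .env values have already been loaded into the process environment.

```python
import os

# Hypothetical mapping of workstream -> keys it needs, based on the notes above.
REQUIRED_KEYS = {
    "benchmark": ["OPENAI_API_KEY"],
    "rag": ["OPENAI_API_KEY", "VOYAGE_API_KEY"],
}


def missing_keys(workstream, env=None):
    """Return the required keys that are unset or empty for a workstream."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS[workstream] if not env.get(key)]


if __name__ == "__main__":
    for name in REQUIRED_KEYS:
        missing = missing_keys(name)
        print(f"{name}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

Passing a plain dict as `env` makes the check easy to exercise without touching real credentials.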

Architecture

VoyageAI RAG Pipeline (main.py)

User Query → VoyageAI embed (voyage-3.5) → Cosine similarity K-NN (k=2, sklearn)
           → VoyageAI rerank (rerank-2.5, top_k=3) → OpenAI GPT-4o response

The knowledge base is a hardcoded set of customer support documents. There is no persistent vector store yet; MongoDB Atlas Vector Search is planned but not implemented.
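
The retrieval step in the flow above (cosine similarity K-NN with k=2) can be sketched without any external services. The example below is a dependency-free stand-in, not the code in main.py, which uses scikit-learn for cosine similarity. The function names and the toy three-dimensional vectors are assumptions made for illustration; real embedding vectors have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k documents most similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy "embeddings" standing in for embedded support documents.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

print(top_k(query, docs, k=2))  # -> [0, 1]
```

In the pipeline described above, the documents selected here would then go to the reranker before generation.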

BABILong Benchmark (resources/notebooks/benchmark.py)

  • Loads bAbI tasks (qa1–qa20) from HuggingFace (RMT-team/babilong) with noise text from PG-19
  • Interactive model selection via OpenAI /models API
  • Supports dual-model side-by-side comparison
  • Uses LangChain SQLite cache to avoid redundant API calls
  • Outputs: CSV results, JSON exports, PNG heatmaps
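
To illustrate why a SQLite cache avoids redundant API calls, here is a dependency-free sketch of the idea. It is not LangChain's implementation and not code from benchmark.py: `PromptCache`, `get_or_call`, and `fake_llm` are hypothetical names. The cache key includes the model name, so a side-by-side comparison of two models still queries each model once.

```python
import sqlite3


class PromptCache:
    """Tiny SQLite-backed cache keyed by (model, prompt)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "model TEXT, prompt TEXT, response TEXT, "
            "PRIMARY KEY (model, prompt))"
        )

    def get_or_call(self, model, prompt, call):
        """Return the cached response, calling `call` only on a miss."""
        row = self.conn.execute(
            "SELECT response FROM cache WHERE model = ? AND prompt = ?",
            (model, prompt),
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no API call is made
        response = call(model, prompt)
        self.conn.execute(
            "INSERT INTO cache VALUES (?, ?, ?)", (model, prompt, response)
        )
        self.conn.commit()
        return response


calls = []


def fake_llm(model, prompt):
    """Stand-in for a real API call; records each invocation."""
    calls.append((model, prompt))
    return f"answer from {model}"


cache = PromptCache()
first = cache.get_or_call("model-a", "Where is Mary?", fake_llm)
second = cache.get_or_call("model-a", "Where is Mary?", fake_llm)  # served from cache
other = cache.get_or_call("model-b", "Where is Mary?", fake_llm)   # different model, new call

print(len(calls))  # -> 2
```

Pointing `path` at a file instead of `:memory:` would let the cache survive across runs, which is what makes re-running a benchmark cheap.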

Shared Module (resources/babilong/)

  • prompts.py — Task prompt templates for all 20 bAbI tasks
  • metrics.py — Answer comparison with task-specific expected labels
  • babilong_utils.py — Dataset loading and formatting utilities
  • collect_results.py — Results aggregation
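
As an illustration of what answer comparison with task-specific labels can look like, the sketch below accepts an output only if it names the expected label and no other label from the task's set. This is a simplified, hypothetical rule: `is_correct` and the `LABELS` list are assumptions for the example, and the logic in metrics.py may differ.

```python
def is_correct(target, output, task_labels):
    """Check a model output against the expected answer for one task.

    The output counts as correct only if it mentions the target label
    and no other label from the task's (lowercase) label set.
    """
    text = output.lower()
    # Keep only the first sentence so trailing explanations are ignored.
    text = text.split(".")[0]
    mentioned = {label for label in task_labels if label in text}
    return mentioned == {target.lower()}


# Hypothetical label set for a location-style task.
LABELS = ["bathroom", "bedroom", "garden", "hallway", "kitchen", "office"]

print(is_correct("kitchen", "The kitchen. Mary went there last.", LABELS))  # -> True
print(is_correct("kitchen", "Either the kitchen or the garden", LABELS))    # -> False
```

Restricting the check to a known label set keeps a model from scoring by listing every plausible answer.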

Key Dependencies

  • VoyageAI (voyageai) — embeddings and reranking
  • OpenAI (openai) — LLM inference
  • LangChain (langchain, langchain-openai) — caching and LLM orchestration in benchmarks
  • scikit-learn — cosine similarity for retrieval
  • HuggingFace Datasets — BABILong dataset loading