A retrieval-augmented generation pipeline in Python with a rigorous offline evaluation harness. Chunks and embeds documents, retrieves by vector similarity, and generates grounded answers — with pluggable LLM providers (including a deterministic local fake for tests) and metrics for retrieval quality and answer faithfulness. No API key required.
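As a rough illustration of the chunk, embed, and retrieve steps this description names, here is a minimal standard-library sketch. It is not taken from the repository: `fake_embed`, `chunk`, `retrieve`, and `recall_at_k` are hypothetical names, and the hashed bag-of-words embedding is only one assumed way to build a deterministic local fake that needs no API key.

```python
import hashlib
import math


def fake_embed(text: str, dim: int = 64) -> list[float]:
    """Deterministic stand-in for an embedding model (assumption, not the repo's code).

    Each lowercase token is hashed into one of `dim` buckets, then the
    vector is L2-normalized so a dot product equals cosine similarity.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks with the highest cosine similarity to the query."""
    q = fake_embed(query)
    scored = [(sum(a * b for a, b in zip(q, fake_embed(c))), c) for c in chunks]
    # Python's sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def recall_at_k(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(relevant.intersection(retrieved)) / len(relevant)
```

Because the fake embedding depends only on the input text, the same query and corpus always produce the same ranking, which is what makes retrieval metrics such as recall@k reproducible in offline tests.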
Why a passing benchmark isn't safe to ship: a free 2-stage (benchmark + replay) validation recipe for LLM model swaps & prompt changes, run on flat-rate coding-agent subagents — no eval API bill.
Catch LLM quality regressions in CI: a golden-set regression gate with calibrated graders (grounding, hallucination, tool-call, LLM-judge) that fails the PR when answer quality drops.
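The idea of a golden-set gate can be sketched in a few lines. This is an assumed, simplified shape rather than the project's implementation: `GoldenCase`, `grounding_score`, `mean_score`, and `gate` are hypothetical names, the substring check stands in for a calibrated grader, and the `baseline` and `tolerance` values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class GoldenCase:
    """One golden-set entry: a question and facts a good answer must contain."""
    question: str
    required_facts: list[str]


def grounding_score(answer: str, case: GoldenCase) -> float:
    """Fraction of required facts found in the answer (case-insensitive substring match)."""
    if not case.required_facts:
        return 1.0
    lowered = answer.lower()
    hits = sum(1 for fact in case.required_facts if fact.lower() in lowered)
    return hits / len(case.required_facts)


def mean_score(answers: dict[str, str], golden: list[GoldenCase]) -> float:
    """Average grounding score over the golden set; a missing answer scores 0."""
    scores = [grounding_score(answers.get(case.question, ""), case) for case in golden]
    return sum(scores) / len(scores)


def gate(answers: dict[str, str], golden: list[GoldenCase],
         baseline: float, tolerance: float = 0.02) -> int:
    """Return a process exit code: 0 if quality held, 1 if it regressed."""
    score = mean_score(answers, golden)
    passed = score >= baseline - tolerance
    print(f"score={score:.3f} baseline={baseline:.3f} passed={passed}")
    return 0 if passed else 1
```

In a CI job, a wrapper script would typically call `sys.exit(gate(...))` so that a nonzero exit code fails the pull request check.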
Agentic diagnostic assistant for distributed-system incidents: multi-turn RAG, hypothesis updates, evidence packing, golden evals, and failure-attributed run reports.
Production evaluation harness for AI-driven interactive entertainment. 7 deterministic evaluators, LLM-as-Judge, golden-set regression, red-team probes. Extracted from Dead Signal, a live game running Gemini dialogue generation.