Evaluation Infrastructure for AI Agents
Updated Feb 25, 2026 - TypeScript
🔍 Benchmark LLM jailbreak resilience with JailBench to gain clear insights and strengthen model defenses.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
EvalLoop is a self-improving agent that iterates on its own outputs using evals and automatic policy patches. It runs a task, scores the result against a rubric, updates its rules, and re-runs until it hits a target score. A UI shows attempts, score trends, violations, and policy diffs.
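The run, score, patch, re-run cycle in the EvalLoop description can be sketched roughly as follows. This is a minimal illustration, not EvalLoop's actual code: the names `Policy`, `Rubric`, `scoreOutput`, `evalLoop`, and `toyTask` are all hypothetical, the "policy patch" is assumed to simply append each violated rubric rule to the policy, and the toy task stands in for a real agent call.

```typescript
// Hypothetical types for illustration; none of these come from EvalLoop itself.
type Policy = string[];
type RunTask = (policy: Policy) => string;
type Rubric = { rule: string; check: (output: string) => boolean }[];

interface Attempt {
  iteration: number;
  score: number;
  violations: string[];
  policy: Policy; // snapshot of the policy used for this attempt
}

// Score = fraction of rubric rules the output satisfies.
function scoreOutput(output: string, rubric: Rubric): { score: number; violations: string[] } {
  const violations = rubric.filter((r) => !r.check(output)).map((r) => r.rule);
  const score = rubric.length === 0 ? 1 : (rubric.length - violations.length) / rubric.length;
  return { score, violations };
}

// Run the task, score it, patch the policy, and repeat until the target is met.
function evalLoop(runTask: RunTask, rubric: Rubric, targetScore: number, maxIterations: number): Attempt[] {
  let policy: Policy = [];
  const attempts: Attempt[] = [];
  for (let i = 1; i <= maxIterations; i++) {
    const output = runTask(policy);
    const { score, violations } = scoreOutput(output, rubric);
    attempts.push({ iteration: i, score, violations, policy: [...policy] });
    if (score >= targetScore) break;
    // Assumed patch strategy: add every violated rule not already in the policy.
    const patch = violations.filter((v) => policy.indexOf(v) === -1);
    if (patch.length === 0) break; // nothing new to try, so stop early
    policy = [...policy, ...patch];
  }
  return attempts;
}

// Toy usage: the "agent" just echoes its policy, so following a rule satisfies it.
const rubric: Rubric = [
  { rule: "mention sources", check: (o) => o.indexOf("sources") !== -1 },
  { rule: "be concise", check: (o) => o.indexOf("concise") !== -1 },
];
const toyTask: RunTask = (policy) => `draft following: ${policy.join(", ")}`;

const attempts = evalLoop(toyTask, rubric, 1.0, 5);
console.log(attempts.map((a) => a.score)); // scores per attempt: 0, then 1
```

The returned `attempts` array holds the data a UI like the one described would need: per-attempt scores for a trend line, the violations found, and successive policy snapshots that can be diffed.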