AgentEval — Product Brief

Author: Paige (Product Strategist)
Date: 2026-02-12
Status: Draft

Problem

Agent developers have no reliable way to test their agents. The current state:

Manual vibes-checking — run the agent, eyeball the output, ship it
Prompt evals don't cover agents — existing eval tools test prompt→response. Agents do multi-step workflows with tool calls, branching logic, and non-deterministic behavior. Prompt evals miss all of this.
No regression detection — you change a system prompt or swap models, and you have no idea what broke until users complain
Cost blindness — agents are expensive. Nobody tracks cost-per-eval to know if a "better" config is 3x more expensive.

This is the gap between "I built an agent" and "I ship an agent to production with confidence."

Who Is This For?

Primary: Solo devs and small teams building AI agents in Python who want to test before shipping.

Not for: Enterprise ML teams with custom infra, academic researchers benchmarking foundation models, prompt engineers who aren't building agents.

Market Landscape — Honest Assessment

Tool	What It Does	Agent Eval Support	OSS?
Braintrust	Full eval platform, logging, datasets	Good — supports multi-step, tool use	No (proprietary SaaS)
LangSmith Evals	LangChain's eval suite	Decent — tied to LangChain ecosystem	No
Promptfoo	CLI-based prompt/model eval	Prompt-focused, limited agent support	Yes (MIT)
DeepEval	Python eval framework	Growing agent support, pytest-style	Yes (Apache 2.0)
Ragas	RAG-focused evals	RAG only, not general agents	Yes
Inspect AI	UK AISI's agent eval framework	Strong agent support, tasks-based	Yes

Honest take: This market is getting crowded. Braintrust and DeepEval are strong. Inspect AI is well-designed for agent evals specifically.

What's NOT covered well:

Importing real production sessions as test cases (everyone starts from synthetic data)
Local-first with zero cloud dependency
Framework-agnostic agent interface (most tools assume LangChain or their own framework)
Statistical run comparison with cost tracking

Unique Angle

Real sessions → test cases. AgentLens captures production sessions. AgentEval imports them as eval cases. Your tests come from reality, not imagination.
Local-first, zero cloud. pip install agenteval, SQLite, done. No accounts, no API keys (except for your agent's own LLM).
Agent-native, not prompt-native. First-class support for tool call evaluation, trajectory comparison, multi-step grading.
AgentKit ecosystem. Works standalone, but integrates naturally with Lore, AgentLens, AgentGate.

Business Model

Open-core (MIT license):

Free: CLI, all graders, local runs, SQLite storage
Cloud (later, maybe): Team dashboards, shared eval results, CI/CD integration, hosted runs

Reality check: The cloud tier is speculative. Don't build for it. Build a great CLI tool first.

Market Timing: 5/10

Why not higher:

Competitors exist and are well-funded (Braintrust has $36M+)
DeepEval is OSS and has momentum
The "agent eval" category is still forming — unclear if it becomes its own thing or gets absorbed into existing tools

Why not lower:

Agent adoption is exploding, testing is genuinely painful
The AgentLens integration angle is unique and real
Local-first OSS still has a lane (see: Lore's traction)

KILLED From Scope

❌ Hosted eval runners
❌ Dataset marketplace
❌ Model fine-tuning
❌ UI dashboard (CLI only for MVP)
❌ TypeScript SDK (Python first)
❌ Benchmarking against public datasets
❌ Custom model hosting
❌ Multi-tenant anything

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AgentEval — Product Brief

Problem

Who Is This For?

Market Landscape — Honest Assessment

Unique Angle

Business Model

Market Timing: 5/10

KILLED From Scope

FilesExpand file tree

product-brief.md

Latest commit

History

product-brief.md

File metadata and controls

AgentEval — Product Brief

Problem

Who Is This For?

Market Landscape — Honest Assessment

Unique Angle

Business Model

Market Timing: 5/10

KILLED From Scope