Lighthouse for agents - score an agent run, then tell it how to get faster and cheaper.
The ecosystem is full of tracers (Langfuse, LangSmith, Opik) that show you
what happened. agent-house is a profiler: it ingests an agent run
trace and produces a scored report (cost, latency, reliability, context-hygiene)
with concrete, ranked fixes - redundant tool calls, missed parallelization,
oversized context, uncached retrievals, model-tier mismatches - each with an
estimated $ / ms saved and a code-level hint.
It's a build-time tool: a CLI + library + static HTML report. Zero required backend, zero runtime dependencies.
agent-house - before / after
naive ReAct agent score 72/100 cost $0.245 7.0s
after top-3 fixes score 100/100 cost $0.024 2.0s
▲ +28 points · saved $0.221 and 5.0s per run
npm install agent-house # library
npx agent-house run trace.json # or just run the CLI
Requires Node 18+. Ships as ESM with TypeScript types.
# Score a trace and print a report (+ optional HTML/JSON output)
npx agent-house run ./trace.json --out report.html --json report.json
# Fail CI when the score drops below a threshold
npx agent-house assert ./trace.json --min-score 80
# Compare against a saved baseline report and gate on regressions
npx agent-house ci ./trace.json --baseline ./baseline.json --min-score 80
import { audit } from 'agent-house'
const report = await audit(trace)
// {
// score: 72,
// categories: [{ category: 'cost', score: 77, ... }, ...],
// audits: [{ id, title, category, score, savings: { usd, ms }, findings, hint }, ...],
// savings: { usd: 0.235, ms: 6400 },
// run: { source, durationMs, spans, llmCalls, toolCalls, tokens, totalCostUsd },
// }
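Because each audit carries its own savings estimate, a report can be turned into a ranked to-do list with a few lines of code. The sketch below assumes only the `id`, `score`, and `savings` fields shown in the report shape above; `AuditResult`, `rankBySavings`, and the sample numbers are hypothetical and not part of the library.

```typescript
// Hypothetical, trimmed-down view of one entry in `report.audits`.
interface AuditResult {
  id: string;
  score: number; // 0–1, as described for audits
  savings: { usd: number; ms: number };
}

// Keep audits that estimate a non-zero saving, largest dollar saving first
// (ties broken by milliseconds saved).
function rankBySavings(audits: AuditResult[]): AuditResult[] {
  return audits
    .filter((a) => a.savings.usd > 0 || a.savings.ms > 0)
    .sort((a, b) => b.savings.usd - a.savings.usd || b.savings.ms - a.savings.ms);
}

// Illustrative numbers only; not output from a real run.
const sampleAudits: AuditResult[] = [
  { id: 'duplicate-tool-calls', score: 0.5, savings: { usd: 0.01, ms: 1800 } },
  { id: 'model-tier-mismatch', score: 0.2, savings: { usd: 0.15, ms: 0 } },
  { id: 'errors', score: 1, savings: { usd: 0, ms: 0 } },
];

const ranked = rankBySavings(sampleAudits);
console.log(ranked.map((a) => a.id));
// → [ 'model-tier-mismatch', 'duplicate-tool-calls' ]
```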
import { renderHtml, parseTrace } from 'agent-house'
const run = parseTrace(trace)
const html = renderHtml(await audit(run), run) // self-contained, Lighthouse-style
audit() accepts a raw trace (auto-detected) or an already-normalized run.
Ingest is built on the OpenTelemetry GenAI semantic conventions, with thin adapters for popular agent frameworks. Format is auto-detected.
| Framework | Format | Detected as |
|---|---|---|
| OpenTelemetry GenAI | OTLP/JSON with gen_ai.* attributes | otel |
| Vercel AI SDK | OTLP/JSON with ai.* attributes | vercel |
| LangGraph / LangSmith | LangSmith run tree (nested or flat) | langgraph |
| Google ADK | OTLP/JSON with gcp.vertex.agent.* | adk |
Read from a JSON file (CLI) or pass parsed JSON (library). See
examples/ for a runnable example per framework.
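To make the detection table concrete, here is a minimal sketch of how attribute-prefix detection could work. It is not the library's implementation: `detectFormat` and `otlpAttributeKeys` are hypothetical names, the precedence order (most specific prefix first) is an assumption, and so is the use of a `run_type` field to recognize a LangSmith run.

```typescript
type TraceFormat = 'otel' | 'vercel' | 'langgraph' | 'adk';

// Collect every span attribute key from an OTLP/JSON payload
// (resourceSpans → scopeSpans → spans → attributes).
function otlpAttributeKeys(trace: any): string[] {
  const keys: string[] = [];
  for (const rs of trace?.resourceSpans ?? []) {
    for (const ss of rs?.scopeSpans ?? []) {
      for (const span of ss?.spans ?? []) {
        for (const attr of span?.attributes ?? []) {
          if (typeof attr?.key === 'string') keys.push(attr.key);
        }
      }
    }
  }
  return keys;
}

function detectFormat(trace: unknown): TraceFormat | undefined {
  const keys = otlpAttributeKeys(trace);
  const has = (prefix: string) => keys.some((k) => k.startsWith(prefix));

  // Assumed precedence: framework-specific prefixes win over generic gen_ai.*,
  // since a framework trace may carry both.
  if (has('gcp.vertex.agent.')) return 'adk';
  if (has('ai.')) return 'vercel';
  if (has('gen_ai.')) return 'otel';

  // Assumption: a LangSmith run (nested root or flat array) exposes `run_type`.
  const first: any = Array.isArray(trace) ? trace[0] : trace;
  if (first && typeof first === 'object' && typeof first.run_type === 'string') {
    return 'langgraph';
  }
  return undefined; // unknown format
}

// Tiny hand-written payloads for illustration.
const otelSample = {
  resourceSpans: [{ scopeSpans: [{ spans: [{ attributes: [{ key: 'gen_ai.request.model' }] }] }] }],
};
const langsmithSample = [{ run_type: 'llm', name: 'chat' }];

console.log(detectFormat(otelSample), detectFormat(langsmithSample));
// → otel langgraph
```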
Each run gets four category scores (0–100) and a weighted overall score:
| Category | Default weight | What it measures |
|---|---|---|
| Cost | 0.30 | wasted spend - wrong model tier, uncached prompts, bloat |
| Latency | 0.30 | wall-clock waste - duplicate and serial-when-parallelizable calls |
| Reliability | 0.20 | errors, retries, and loops |
| Context | 0.20 | oversized prompts relative to a healthy budget |
Weights are configurable (audit(trace, { weights })).
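As a rough illustration of how a weighted overall score can be derived from the four category scores, the sketch below uses a normalized weighted mean. The exact formula agent-house uses is not stated here, so treat `overallScore`, the normalization step, and the rounding as assumptions; the category scores are made up.

```typescript
type Category = 'cost' | 'latency' | 'reliability' | 'context';
type Weights = Record<Category, number>;

// Default weights as listed in the table above.
const DEFAULT_WEIGHTS: Weights = { cost: 0.3, latency: 0.3, reliability: 0.2, context: 0.2 };

// Weighted mean of 0–100 category scores, normalized by the weight total
// so custom weights do not have to sum to exactly 1 (an assumption).
function overallScore(scores: Record<Category, number>, weights: Weights = DEFAULT_WEIGHTS): number {
  const categories = Object.keys(weights) as Category[];
  const totalWeight = categories.reduce((sum, c) => sum + weights[c], 0);
  if (totalWeight <= 0) throw new Error('weights must sum to a positive number');
  const weighted = categories.reduce((sum, c) => sum + scores[c] * weights[c], 0);
  return Math.round(weighted / totalWeight);
}

// Illustrative category scores, not taken from a real report.
const categoryScores = { cost: 90, latency: 70, reliability: 100, context: 60 };

console.log(overallScore(categoryScores)); // → 80
console.log(overallScore(categoryScores, { cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1 })); // → 83
```

Shifting weight from context to cost raises the score in this example because the cost category scores higher than the context category.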
| Audit | Category | Flags |
|---|---|---|
| duplicate-tool-calls | latency | identical tool calls repeated in a run |
| parallelizable-tool-calls | latency | independent calls run serially that could be Promise.all |
| model-tier-mismatch | cost | a frontier model used for a trivial step |
| uncacheable-prompts | cost | large prompts re-sent across calls without caching |
| errors | reliability | spans that ended in an error |
| retry-loops | reliability | immediate retries and repeated-call loops |
| context-bloat | context | LLM calls sending oversized context |
Every audit returns a 0–1 score, the specific spans it flagged, an estimated saving, and a code-level hint for the fix.
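To show what an audit of this kind might look like in practice, here is a minimal sketch of duplicate tool call detection: group tool spans by name plus arguments and treat every repeat after the first as waste. The `ToolSpan` shape, `findDuplicateToolCalls`, and the "all repeats are wasted" rule are assumptions for illustration, not the library's actual logic.

```typescript
// Hypothetical minimal tool span; real normalized spans carry more fields.
interface ToolSpan {
  id: string;
  tool: string;
  args: unknown;
  durationMs: number;
}

// Serialize arguments with sorted object keys so {a, b} and {b, a} compare equal.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return String(JSON.stringify(value));
}

// Group spans by tool + arguments; any group larger than one is a finding.
function findDuplicateToolCalls(spans: ToolSpan[]) {
  const groups = new Map<string, ToolSpan[]>();
  for (const span of spans) {
    const key = `${span.tool}:${stableStringify(span.args)}`;
    const group = groups.get(key) ?? [];
    group.push(span);
    groups.set(key, group);
  }
  return Array.from(groups.values())
    .filter((g) => g.length > 1)
    .map((g) => ({
      tool: g[0]!.tool,
      spanIds: g.map((s) => s.id),
      // Everything after the first call is counted as avoidable latency.
      wastedMs: g.slice(1).reduce((sum, s) => sum + s.durationMs, 0),
    }));
}

// Illustrative spans only.
const toolSpans: ToolSpan[] = [
  { id: 's1', tool: 'search', args: { q: 'weather', n: 1 }, durationMs: 100 },
  { id: 's2', tool: 'search', args: { n: 1, q: 'weather' }, durationMs: 120 },
  { id: 's3', tool: 'search', args: { q: 'news', n: 1 }, durationMs: 90 },
];

const duplicates = findDuplicateToolCalls(toolSpans);
console.log(duplicates);
// → [ { tool: 'search', spanIds: [ 's1', 's2' ], wastedMs: 120 } ]
```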
- JSON - the full structured report (the value audit() returns).
- Static HTML - a self-contained, Lighthouse-style page (inline CSS, no external assets, no backend) with score gauges and an SVG span waterfall. The report JSON is embedded in the page, so it doubles as a portable artifact.
audit(trace, {
format: 'auto', // 'otel' | 'vercel' | 'langgraph' | 'adk'
models: [/* ModelPrice */],// override the price table with your negotiated rates
thresholds: { contextBloatTokens: 12000 },
weights: { cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1 },
})
The built-in price table covers Anthropic, OpenAI, Google Gemini, and xAI Grok (standard-tier list prices, as of 2026-06; older model ids still resolve to the nearest current entry). The $/ms estimates are only as credible as this table, so it's fully overridable.
The savings are estimates and may overlap across audits (a duplicate call is also "extra latency"). They're designed to be directionally credible and to rank fixes by impact - not to be exact invoices. The per-run cost and duration totals come straight from the trace and are exact.
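The dependence of cost estimates on the price table can be illustrated with a small sketch: per-span cost is token counts multiplied by per-million-token prices, and an override simply replaces the entry for a model. The `SketchPrice` shape, the helper names, and the prices are placeholders invented for this example; they are not the library's `ModelPrice` type or real list prices.

```typescript
// Hypothetical price entry; the library's ModelPrice fields may differ.
interface SketchPrice {
  model: string;
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

interface SpanUsage {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Placeholder prices for made-up model ids.
const PLACEHOLDER_PRICES: SketchPrice[] = [
  { model: 'example-large', inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
  { model: 'example-small', inputUsdPerMTok: 0.25, outputUsdPerMTok: 1.25 },
];

// Overrides replace the base entry with the same model id.
function withOverrides(base: SketchPrice[], overrides: SketchPrice[]): SketchPrice[] {
  const byModel = new Map<string, SketchPrice>();
  for (const p of base) byModel.set(p.model, p);
  for (const p of overrides) byModel.set(p.model, p);
  return Array.from(byModel.values());
}

// Cost of one LLM span, or undefined when the model is not in the table.
function spanCostUsd(usage: SpanUsage, prices: SketchPrice[]): number | undefined {
  const price = prices.find((p) => p.model === usage.model);
  if (!price) return undefined;
  return (usage.inputTokens * price.inputUsdPerMTok + usage.outputTokens * price.outputUsdPerMTok) / 1e6;
}

const usage: SpanUsage = { model: 'example-large', inputTokens: 10000, outputTokens: 2000 };
const negotiated = withOverrides(PLACEHOLDER_PRICES, [
  { model: 'example-large', inputUsdPerMTok: 2, outputUsdPerMTok: 10 },
]);

console.log(spanCostUsd(usage, PLACEHOLDER_PRICES)?.toFixed(4)); // → 0.0600
console.log(spanCostUsd(usage, negotiated)?.toFixed(4)); // → 0.0400
```

The same trace yields a different dollar figure under each table, which is why overriding the defaults with negotiated rates matters for the savings estimates.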
No hosted dashboard, no continuous monitoring, no auto-fix. It's a build-time tool: a CLI + library + static report, not a SaaS.
trace (OTLP / Vercel / LangGraph / ADK)
│ ingest adapters → normalize
▼
AgentRun (spans: kind, timing, model, usage, …)
│
├─ pricing engine → cost per span
├─ audits → findings + savings + hints
└─ scoring → category + overall scores
▼
Report → JSON + static HTML (gauges + waterfall)
Adding a framework is one adapter that normalizes to AgentRun; nothing
downstream changes.
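A rough sketch of what "one adapter that normalizes" could look like: an object that recognizes its input format and maps it to a common run shape. Everything here is hypothetical, including the `Adapter` interface, the `NormalizedRun` fields, and the toy trace format; consult the source for the actual AgentRun type.

```typescript
type SpanKind = 'llm' | 'tool';

// Hypothetical minimal stand-ins for the normalized run and its spans.
interface NormalizedSpan {
  id: string;
  kind: SpanKind;
  name: string;
  startMs: number;
  endMs: number;
  model?: string;
}
interface NormalizedRun {
  source: string;
  spans: NormalizedSpan[];
}

// Hypothetical adapter contract: recognize a format, then normalize it.
interface Adapter<T> {
  source: string;
  matches(input: unknown): input is T;
  normalize(input: T): NormalizedRun;
}

// A made-up trace format used only for this example.
interface ToyTrace {
  toy: true;
  events: { name: string; type: 'model' | 'function'; t0: number; t1: number; model?: string }[];
}

const toyAdapter: Adapter<ToyTrace> = {
  source: 'toy',
  matches(input: unknown): input is ToyTrace {
    return typeof input === 'object' && input !== null && (input as { toy?: unknown }).toy === true;
  },
  normalize(input: ToyTrace): NormalizedRun {
    return {
      source: 'toy',
      spans: input.events.map((e, i) => ({
        id: `span-${i}`,
        kind: e.type === 'model' ? ('llm' as const) : ('tool' as const),
        name: e.name,
        startMs: e.t0,
        endMs: e.t1,
        // Only attach `model` when the event has one.
        ...(e.model ? { model: e.model } : {}),
      })),
    };
  },
};

const toyTrace: unknown = {
  toy: true,
  events: [
    { name: 'plan', type: 'model', t0: 0, t1: 400, model: 'example-large' },
    { name: 'search', type: 'function', t0: 400, t1: 650 },
  ],
};

const normalized = toyAdapter.matches(toyTrace) ? toyAdapter.normalize(toyTrace) : undefined;
console.log(normalized?.spans.map((s) => `${s.kind}:${s.name}`));
// → [ 'llm:plan', 'tool:search' ]
```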
See CONTRIBUTING.md. The project is test-driven - every
feature lands with Vitest tests. Run npm test.
MIT © Addy Osmani