addyosmani/agent-house

🏠 agent-house

Lighthouse for agents - score an agent run, then tell it how to get faster and cheaper.

The ecosystem is full of tracers (Langfuse, LangSmith, Opik) that show you what happened. agent-house is a profiler: it ingests an agent run trace and produces a scored report (cost, latency, reliability, context-hygiene) with concrete, ranked fixes - redundant tool calls, missed parallelization, oversized context, uncached retrievals, model-tier mismatches - each with an estimated $ / ms saved and a code-level hint.

It's a build-time tool: a CLI + library + static HTML report. Zero required backend, zero runtime dependencies.

  agent-house - before / after

  naive ReAct agent     score 72/100   cost $0.245   7.0s
  after top-3 fixes     score 100/100   cost $0.024   2.0s

  ▲ +28 points · saved $0.221 and 5.0s per run

Install

npm install agent-house        # library
npx agent-house run trace.json # or just run the CLI

Requires Node 18+. Ships as ESM with TypeScript types.

Quick start

CLI

# Score a trace and print a report (+ optional HTML/JSON output)
npx agent-house run ./trace.json --out report.html --json report.json

# Fail CI when the score drops below a threshold
npx agent-house assert ./trace.json --min-score 80

# Compare against a saved baseline report and gate on regressions
npx agent-house ci ./trace.json --baseline ./baseline.json --min-score 80
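
The gating idea behind assert and ci can be sketched in a few lines. This is a minimal illustration, not the CLI's implementation: the gate function, the GateResult shape, and the tolerance parameter are hypothetical, and only the score field comes from the report shown in this README. How agent-house itself defines a regression is not specified here.

```typescript
// Hypothetical report shape: only `score` is taken from this README's report example.
interface ScoredReport {
  score: number; // overall score, 0-100
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

// Fail when the score is below the floor or lower than the baseline.
// `tolerance` is an assumed knob for ignoring small score noise; it is not a documented flag.
function gate(
  current: ScoredReport,
  baseline: ScoredReport,
  minScore: number,
  tolerance = 0,
): GateResult {
  const reasons: string[] = [];
  if (current.score < minScore) {
    reasons.push(`score ${current.score} is below the minimum ${minScore}`);
  }
  if (current.score < baseline.score - tolerance) {
    reasons.push(`score dropped from ${baseline.score} to ${current.score}`);
  }
  return { pass: reasons.length === 0, reasons };
}

// A run scoring 72 against a baseline of 85 and a floor of 80 fails on both checks.
const gateResult = gate({ score: 72 }, { score: 85 }, 80);
```

In a CI script, a failing result would typically map to a non-zero exit code so the pipeline stops.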

Library

import { audit } from 'agent-house'

const report = await audit(trace)
// {
//   score: 72,
//   categories: [{ category: 'cost', score: 77, ... }, ...],
//   audits: [{ id, title, category, score, savings: { usd, ms }, findings, hint }, ...],
//   savings: { usd: 0.235, ms: 6400 },
//   run: { source, durationMs, spans, llmCalls, toolCalls, tokens, totalCostUsd },
// }

import { renderHtml, parseTrace } from 'agent-house'
const run = parseTrace(trace)
const html = renderHtml(await audit(run), run) // self-contained, Lighthouse-style

audit() accepts a raw trace (auto-detected) or an already-normalized run.

Supported trace formats

Ingest is built on the OpenTelemetry GenAI semantic conventions, with thin adapters for popular agent frameworks. Format is auto-detected.

| Framework | Format | Detected as |
| --- | --- | --- |
| OpenTelemetry GenAI | OTLP/JSON with gen_ai.* attributes | otel |
| Vercel AI SDK | OTLP/JSON with ai.* attributes | vercel |
| LangGraph / LangSmith | LangSmith run tree (nested or flat) | langgraph |
| Google ADK | OTLP/JSON with gcp.vertex.agent.* | adk |

Read from a JSON file (CLI) or pass parsed JSON (library). See examples/ for a runnable example per framework.
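
To make the auto-detection idea concrete, here is a rough sketch of prefix-based detection over an OTLP/JSON payload. It is an assumption about how such detection could work, not agent-house's actual logic: detectFormat, otlpAttributeKeys, the check order, and the run_type heuristic for LangSmith run trees are all illustrative.

```typescript
type TraceFormat = 'otel' | 'vercel' | 'langgraph' | 'adk' | 'unknown';

// Collect every span attribute key from an OTLP/JSON payload
// (resourceSpans -> scopeSpans -> spans -> attributes[].key).
function otlpAttributeKeys(trace: unknown): string[] {
  const keys: string[] = [];
  const resourceSpans = (trace as any)?.resourceSpans;
  if (!Array.isArray(resourceSpans)) return keys;
  for (const rs of resourceSpans) {
    for (const ss of rs?.scopeSpans ?? []) {
      for (const span of ss?.spans ?? []) {
        for (const attr of span?.attributes ?? []) {
          if (typeof attr?.key === 'string') keys.push(attr.key);
        }
      }
    }
  }
  return keys;
}

function detectFormat(trace: unknown): TraceFormat {
  const keys = otlpAttributeKeys(trace);
  const has = (prefix: string): boolean => keys.some((k) => k.startsWith(prefix));
  // Assumed order: framework-specific prefixes first, because such traces
  // may also carry generic gen_ai.* attributes.
  if (has('gcp.vertex.agent.')) return 'adk';
  if (has('ai.')) return 'vercel';
  if (has('gen_ai.')) return 'otel';
  // Assumed heuristic: a LangSmith run (or array of runs) carries a run_type field.
  const first = Array.isArray(trace) ? trace[0] : trace;
  if (typeof (first as any)?.run_type === 'string') return 'langgraph';
  return 'unknown';
}

// Small helper to build a one-span OTLP/JSON payload for the example.
function otlpWithKeys(keys: string[]) {
  const attributes = keys.map((key) => ({ key, value: { stringValue: 'x' } }));
  return { resourceSpans: [{ scopeSpans: [{ spans: [{ attributes }] }] }] };
}

const detected = detectFormat(otlpWithKeys(['gen_ai.request.model']));
```

Passing format explicitly in the audit() options skips detection, which is the safer choice when a trace mixes attribute families.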

Scores

Each run gets four category scores (0–100) and a weighted overall score:

| Category | Default weight | What it measures |
| --- | --- | --- |
| Cost | 0.30 | wasted spend - wrong model tier, uncached prompts, bloat |
| Latency | 0.30 | wall-clock waste - duplicate and serial-when-parallelizable calls |
| Reliability | 0.20 | errors, retries, and loops |
| Context | 0.20 | oversized prompts relative to a healthy budget |

Weights are configurable (audit(trace, { weights })).
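
As a worked illustration of a weighted overall score, the sketch below computes a weighted mean of the four category scores. The exact formula agent-house uses is not stated in this README, so treat overallScore and the normalization by total weight as assumptions. The example category scores are made up, apart from the cost score of 77 shown in the library example.

```typescript
type Category = 'cost' | 'latency' | 'reliability' | 'context';
type Weights = Record<Category, number>;

// Default weights as listed in the table above.
const DEFAULT_WEIGHTS: Weights = { cost: 0.3, latency: 0.3, reliability: 0.2, context: 0.2 };

// Weighted mean of category scores (0-100). Dividing by the total weight
// keeps the result on the 0-100 scale even if the weights do not sum to 1.
function overallScore(
  scores: Record<Category, number>,
  weights: Weights = DEFAULT_WEIGHTS,
): number {
  const categories = Object.keys(weights) as Category[];
  const totalWeight = categories.reduce((sum, c) => sum + weights[c], 0);
  if (totalWeight <= 0) throw new Error('weights must sum to a positive number');
  const weighted = categories.reduce((sum, c) => sum + scores[c] * weights[c], 0);
  return weighted / totalWeight;
}

// Illustrative scores only.
const exampleScores = { cost: 77, latency: 60, reliability: 100, context: 50 };
const withDefaults = overallScore(exampleScores); // about 71.1
const costHeavy = overallScore(exampleScores, {
  cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1,
}); // about 73.8
```

Shifting weight toward a category only raises the overall score when the run already does well in that category, which is why the cost-heavy weighting moves this example up.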

Audits

| Audit | Category | Flags |
| --- | --- | --- |
| duplicate-tool-calls | latency | identical tool calls repeated in a run |
| parallelizable-tool-calls | latency | independent calls run serially that could run concurrently with Promise.all |
| model-tier-mismatch | cost | a frontier model used for a trivial step |
| uncacheable-prompts | cost | large prompts re-sent across calls without caching |
| errors | reliability | spans that ended in an error |
| retry-loops | reliability | immediate retries and repeated-call loops |
| context-bloat | context | LLM calls sending oversized context |

Every audit returns a 0–1 score, the specific spans it flagged, an estimated saving, and a code-level hint for the fix.
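
For a sense of what an audit like duplicate-tool-calls has to do, here is a minimal sketch that groups tool calls by tool name and arguments and reports the repeats. It is not the library's implementation: ToolCallSpan, DuplicateFinding, and the idea of counting every repeat's duration as waste are assumptions made for illustration.

```typescript
// Hypothetical normalized tool-call span; field names are illustrative only.
interface ToolCallSpan {
  id: string;
  tool: string;
  args: unknown;
  durationMs: number;
}

interface DuplicateFinding {
  key: string;
  spanIds: string[]; // the repeated calls (the first call is not flagged)
  wastedMs: number;
}

// Serialize with sorted object keys so {a, b} and {b, a} compare equal.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(',')}]`;
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

function findDuplicateToolCalls(spans: ToolCallSpan[]): DuplicateFinding[] {
  const groups = new Map<string, ToolCallSpan[]>();
  for (const span of spans) {
    const key = `${span.tool}:${stableStringify(span.args)}`;
    const group = groups.get(key) ?? [];
    group.push(span);
    groups.set(key, group);
  }
  const findings: DuplicateFinding[] = [];
  for (const [key, group] of groups) {
    if (group.length < 2) continue;
    const repeats = group.slice(1);
    findings.push({
      key,
      spanIds: repeats.map((s) => s.id),
      wastedMs: repeats.reduce((sum, s) => sum + s.durationMs, 0),
    });
  }
  return findings;
}

// Made-up spans: t1 and t2 are the same call with the argument keys in a different order.
const calls: ToolCallSpan[] = [
  { id: 't1', tool: 'search', args: { q: 'weather', city: 'Paris' }, durationMs: 400 },
  { id: 't2', tool: 'search', args: { city: 'Paris', q: 'weather' }, durationMs: 380 },
  { id: 't3', tool: 'calc', args: { expr: '2+2' }, durationMs: 20 },
];
const duplicates = findDuplicateToolCalls(calls);
```

The usual fix for a finding like this is to memoize the tool result for the duration of the run, keyed on the same normalized arguments.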

The report

  • JSON - the full structured report (the value audit() returns).
  • Static HTML - a self-contained, Lighthouse-style page (inline CSS, no external assets, no backend) with score gauges and an SVG span waterfall. The report JSON is embedded in the page, so it doubles as a portable artifact.

Configuration

audit(trace, {
  format: 'auto',             // or 'otel' | 'vercel' | 'langgraph' | 'adk'
  models: [/* ModelPrice */], // override the price table with your negotiated rates
  thresholds: { contextBloatTokens: 12000 },
  weights: { cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1 },
})

The built-in price table covers Anthropic, OpenAI, Google Gemini, and xAI Grok (standard-tier list prices, as of 2026-06; older model ids still resolve to the nearest current entry). The $/ms estimates are only as credible as this table, so it's fully overridable.
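
To show why the price table drives every dollar estimate, here is a small sketch of per-call cost from token usage. The ExamplePrice shape, its field names, the model id, and the rates are placeholders invented for this example. The real ModelPrice type may look different, so check the package's TypeScript types before writing an override.

```typescript
// Hypothetical price entry; the real ModelPrice type may use different fields.
interface ExamplePrice {
  model: string;
  inputUsdPerMTok: number;  // USD per million input tokens
  outputUsdPerMTok: number; // USD per million output tokens
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one LLM call: tokens times the per-million-token rate for its model.
function spanCostUsd(model: string, usage: TokenUsage, prices: ExamplePrice[]): number {
  const price = prices.find((p) => p.model === model);
  if (!price) throw new Error(`no price entry for model "${model}"`);
  return (
    (usage.inputTokens * price.inputUsdPerMTok +
      usage.outputTokens * price.outputUsdPerMTok) /
    1_000_000
  );
}

// Placeholder rates, not real list prices.
const negotiated: ExamplePrice[] = [
  { model: 'example-model', inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
];
const exampleCost = spanCostUsd(
  'example-model',
  { inputTokens: 10_000, outputTokens: 2_000 },
  negotiated,
); // about $0.06 at these placeholder rates
```

Because every estimate multiplies through these rates, a stale or mismatched entry skews all cost findings by the same factor.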

How accurate are the savings?

The savings are estimates and may overlap across audits (a duplicate call is also "extra latency"). They're designed to be directionally credible and to rank fixes by impact - not to be exact invoices. The per-run cost and duration totals come straight from the trace and are exact.
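
The overlap is easy to see with numbers. In this sketch, two audits flag the same span, so adding up per-audit savings counts it twice. The SavingFinding shape and the dedupe-by-span approach are illustrative assumptions. This README does not say whether agent-house de-duplicates its totals.

```typescript
// Hypothetical finding shape for illustration.
interface SavingFinding {
  audit: string;
  spanId: string;
  savedMs: number;
}

// Naive total: add up every finding, even when two audits flag the same span.
function naiveTotalMs(findings: SavingFinding[]): number {
  return findings.reduce((sum, f) => sum + f.savedMs, 0);
}

// Count each span once, keeping the largest saving any audit claimed for it.
function dedupedTotalMs(findings: SavingFinding[]): number {
  const bestPerSpan = new Map<string, number>();
  for (const f of findings) {
    bestPerSpan.set(f.spanId, Math.max(bestPerSpan.get(f.spanId) ?? 0, f.savedMs));
  }
  let total = 0;
  for (const ms of bestPerSpan.values()) total += ms;
  return total;
}

// Made-up findings: span t2 is flagged by two different audits.
const overlapping: SavingFinding[] = [
  { audit: 'duplicate-tool-calls', spanId: 't2', savedMs: 380 },
  { audit: 'retry-loops', spanId: 't2', savedMs: 380 },
  { audit: 'retry-loops', spanId: 't3', savedMs: 100 },
];
const naiveMs = naiveTotalMs(overlapping);     // 860
const dedupedMs = dedupedTotalMs(overlapping); // 480
```

The gap between the two totals is why the savings are best used to rank fixes rather than to forecast a bill.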

Non-goals (MVP)

No hosted dashboard, no continuous monitoring, no auto-fix. It's a build-time tool: a CLI + library + static report, not a SaaS.

Architecture

trace (OTLP / Vercel / LangGraph / ADK)
        │  ingest adapters → normalize
        ▼
   AgentRun (spans: kind, timing, model, usage, …)
        │
        ├─ pricing engine  → cost per span
        ├─ audits          → findings + savings + hints
        └─ scoring         → category + overall scores
        ▼
   Report  →  JSON  +  static HTML (gauges + waterfall)

Adding a framework is one adapter that normalizes to AgentRun; nothing downstream changes.
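
As a sketch of what such an adapter involves, the example below flattens a nested LangSmith-style run tree into a flat list of spans. NormalizedSpan and flattenRunTree are hypothetical names. The real AgentRun type carries more fields (model, usage, and so on) and may be shaped differently. The input fields (id, name, run_type, start_time, end_time, child_runs) follow the LangSmith run format.

```typescript
// Subset of a LangSmith run used by this sketch.
interface RunNode {
  id: string;
  name: string;
  run_type: string; // e.g. 'llm', 'tool', 'chain'
  start_time: string; // ISO 8601 timestamp
  end_time: string;
  child_runs?: RunNode[];
}

// Hypothetical normalized span; not the library's AgentRun span type.
interface NormalizedSpan {
  id: string;
  parentId: string | null;
  kind: 'llm' | 'tool' | 'other';
  label: string;
  startMs: number; // epoch milliseconds
  durationMs: number;
}

function toKind(runType: string): NormalizedSpan['kind'] {
  if (runType === 'llm') return 'llm';
  if (runType === 'tool') return 'tool';
  return 'other';
}

// Depth-first walk: emit the run itself, then its children with parentId set.
function flattenRunTree(root: RunNode, parentId: string | null = null): NormalizedSpan[] {
  const startMs = Date.parse(root.start_time);
  const span: NormalizedSpan = {
    id: root.id,
    parentId,
    kind: toKind(root.run_type),
    label: root.name,
    startMs,
    durationMs: Date.parse(root.end_time) - startMs,
  };
  const children = (root.child_runs ?? []).flatMap((child) => flattenRunTree(child, root.id));
  return [span, ...children];
}

// Made-up run tree: one chain with an LLM call followed by a tool call.
const rootRun: RunNode = {
  id: 'r1', name: 'agent', run_type: 'chain',
  start_time: '2025-01-01T00:00:00.000Z', end_time: '2025-01-01T00:00:02.000Z',
  child_runs: [
    { id: 'r2', name: 'plan', run_type: 'llm',
      start_time: '2025-01-01T00:00:00.000Z', end_time: '2025-01-01T00:00:01.200Z' },
    { id: 'r3', name: 'search', run_type: 'tool',
      start_time: '2025-01-01T00:00:01.200Z', end_time: '2025-01-01T00:00:01.900Z' },
  ],
};
const flatSpans = flattenRunTree(rootRun);
```

Once every framework reduces to the same span list, pricing, audits, and scoring can operate on one shape, which is the point of the adapter boundary.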

Contributing

See CONTRIBUTING.md. The project is test-driven - every feature lands with Vitest tests. Run npm test.

License

MIT © Addy Osmani
