🏠 agent-house

Lighthouse for agents - score an agent run, then tell it how to get faster and cheaper.

The ecosystem is full of tracers (Langfuse, LangSmith, Opik) that show you what happened. agent-house is a profiler: it ingests an agent run trace and produces a scored report (cost, latency, reliability, context hygiene) with concrete, ranked fixes for redundant tool calls, missed parallelization, oversized context, uncached retrievals, and model-tier mismatches. Each fix carries an estimated $ / ms saving and a code-level hint.

It's a build-time tool: a CLI + library + static HTML report. Zero required backend, zero runtime dependencies.

  agent-house - before / after

  naive ReAct agent     score 72/100   cost $0.245   7.0s
  after top-3 fixes     score 100/100   cost $0.024   2.0s

  ▲ +28 points · saved $0.221 and 5.0s per run

Install

npm install agent-house        # library
npx agent-house run trace.json # or just run the CLI

Requires Node 18+. Ships as ESM with TypeScript types.

Quick start

CLI

# Score a trace and print a report (+ optional HTML/JSON output)
npx agent-house run ./trace.json --out report.html --json report.json

# Fail CI when the score drops below a threshold
npx agent-house assert ./trace.json --min-score 80

# Compare against a saved baseline report and gate on regressions
npx agent-house ci ./trace.json --baseline ./baseline.json --min-score 80
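
The same kind of gate can be approximated in a custom script. The following is a minimal sketch that assumes only that a report carries a numeric overall `score`, as in the library example; `gateOnScore` and its `maxDrop` tolerance are hypothetical names for illustration, not part of agent-house, and the CLI's actual regression rule may differ.

```typescript
// Hypothetical helper: not part of the agent-house API.
// Assumes only that a report exposes a numeric overall `score` (0-100).
interface ScoreOnly {
  score: number
}

interface GateResult {
  ok: boolean
  reasons: string[]
}

function gateOnScore(
  current: ScoreOnly,
  baseline: ScoreOnly | undefined,
  minScore: number,
  maxDrop = 0, // assumed tolerance: points that may be lost against the baseline
): GateResult {
  const reasons: string[] = []
  if (current.score < minScore) {
    reasons.push(`score ${current.score} is below the minimum ${minScore}`)
  }
  if (baseline !== undefined && baseline.score - current.score > maxDrop) {
    reasons.push(`score regressed from ${baseline.score} to ${current.score}`)
  }
  return { ok: reasons.length === 0, reasons }
}

const gateResult = gateOnScore({ score: 72 }, { score: 85 }, 80)
console.log(gateResult.ok ? 'pass' : `fail: ${gateResult.reasons.join('; ')}`)
```

In a CI job, a script like this would typically exit with a non-zero status when `ok` is false.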

Library

import { audit } from 'agent-house'

const report = await audit(trace)
// {
//   score: 72,
//   categories: [{ category: 'cost', score: 77, ... }, ...],
//   audits: [{ id, title, category, score, savings: { usd, ms }, findings, hint }, ...],
//   savings: { usd: 0.235, ms: 6400 },
//   run: { source, durationMs, spans, llmCalls, toolCalls, tokens, totalCostUsd },
// }

// Render the static HTML report from a normalized run
import { audit, renderHtml, parseTrace } from 'agent-house'
const run = parseTrace(trace)
const html = renderHtml(await audit(run), run) // self-contained, Lighthouse-style

audit() accepts a raw trace (auto-detected) or an already-normalized run.

Supported trace formats

Ingest is built on the OpenTelemetry GenAI semantic conventions, with thin adapters for popular agent frameworks. Format is auto-detected.

Framework	Format	Detected as
OpenTelemetry GenAI	OTLP/JSON with `gen_ai.*` attributes	`otel`
Vercel AI SDK	OTLP/JSON with `ai.*` attributes	`vercel`
LangGraph / LangSmith	LangSmith run tree (nested or flat)	`langgraph`
Google ADK	OTLP/JSON with `gcp.vertex.agent.*`	`adk`

Read from a JSON file (CLI) or pass parsed JSON (library). See examples/ for a runnable example per framework.
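
To illustrate how auto-detection of this kind can work, here is a rough sketch that classifies a payload by the attribute prefixes listed in the table. It is an assumption about the approach, not agent-house's actual logic: `detectFormat` and `collectAttributeKeys` are hypothetical names, and the sketch assumes OTLP/JSON attributes appear as `{ key, value }` entries and that LangSmith runs carry a `run_type` field.

```typescript
// Hypothetical sketch of prefix-based detection; the real detector may differ.
type TraceFormat = 'otel' | 'vercel' | 'langgraph' | 'adk'

// Collect every attribute key found anywhere in an OTLP/JSON-like payload.
function collectAttributeKeys(node: unknown, keys: Set<string> = new Set()): Set<string> {
  if (Array.isArray(node)) {
    for (const item of node) collectAttributeKeys(item, keys)
  } else if (node !== null && typeof node === 'object') {
    const record = node as Record<string, unknown>
    const attrs = record['attributes']
    if (Array.isArray(attrs)) {
      for (const attr of attrs) {
        const key = (attr as { key?: unknown } | null)?.key
        if (typeof key === 'string') keys.add(key)
      }
    }
    for (const value of Object.values(record)) collectAttributeKeys(value, keys)
  }
  return keys
}

function detectFormat(trace: unknown): TraceFormat | undefined {
  // Assumption: a LangSmith run tree is a run object (or array of runs) with `run_type`.
  const first = Array.isArray(trace) ? trace[0] : trace
  if (first !== null && typeof first === 'object' && 'run_type' in first) return 'langgraph'

  const keys = Array.from(collectAttributeKeys(trace))
  const has = (prefix: string) => keys.some((k) => k.startsWith(prefix))
  // Most specific prefix first: framework traces may also carry gen_ai.* keys.
  if (has('gcp.vertex.agent.')) return 'adk'
  if (has('ai.')) return 'vercel'
  if (has('gen_ai.')) return 'otel'
  return undefined
}

const sampleTrace = {
  resourceSpans: [{
    scopeSpans: [{
      spans: [{
        name: 'chat',
        attributes: [{ key: 'gen_ai.request.model', value: { stringValue: 'example-model' } }],
      }],
    }],
  }],
}
console.log(detectFormat(sampleTrace)) // otel
```

Checking the framework-specific prefixes before `gen_ai.` matters because a framework adapter's trace can include the generic attributes as well.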

Scores

Each run gets four category scores (0–100) and a weighted overall score:

Category	Default weight	What it measures
Cost	0.30	wasted spend - wrong model tier, uncached prompts, bloat
Latency	0.30	wall-clock waste - duplicate and serial-when-parallelizable calls
Reliability	0.20	errors, retries, and loops
Context	0.20	oversized prompts relative to a healthy budget

Weights are configurable (audit(trace, { weights })).
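
As a worked illustration of how category scores and weights could combine, the sketch below computes a weighted mean, normalized by the sum of the weights and rounded. The README does not spell out the exact formula, so treat this as an assumption; `overallScore` and the example category scores are hypothetical.

```typescript
// Hypothetical sketch: assumes the overall score is a normalized weighted mean.
type Category = 'cost' | 'latency' | 'reliability' | 'context'
type Weights = Record<Category, number>

const defaultWeights: Weights = { cost: 0.3, latency: 0.3, reliability: 0.2, context: 0.2 }

function overallScore(scores: Record<Category, number>, weights: Weights = defaultWeights): number {
  const categories = Object.keys(weights) as Category[]
  const totalWeight = categories.reduce((sum, c) => sum + weights[c], 0)
  if (totalWeight <= 0) throw new Error('weights must sum to a positive number')
  const weighted = categories.reduce((sum, c) => sum + weights[c] * scores[c], 0)
  // Dividing by the total keeps the result on the 0-100 scale even if weights do not sum to 1.
  return Math.round(weighted / totalWeight)
}

// Illustrative category scores, not taken from a real report.
const exampleScores: Record<Category, number> = { cost: 77, latency: 60, reliability: 90, context: 70 }
console.log(overallScore(exampleScores)) // 73
console.log(overallScore(exampleScores, { cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1 })) // 74
```

Shifting weight from context to cost moves the overall score toward the cost category, which is the effect a custom `weights` option is meant to give.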

Audits

Audit	Category	Flags
`duplicate-tool-calls`	latency	identical tool calls repeated in a run
`parallelizable-tool-calls`	latency	independent calls run serially that could be `Promise.all`
`model-tier-mismatch`	cost	a frontier model used for a trivial step
`uncacheable-prompts`	cost	large prompts re-sent across calls without caching
`errors`	reliability	spans that ended in an error
`retry-loops`	reliability	immediate retries and repeated-call loops
`context-bloat`	context	LLM calls sending oversized context

Every audit returns a 0–1 score, the specific spans it flagged, an estimated saving, and a code-level hint for the fix.
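
To make the shape of an audit concrete, here is a minimal sketch of a duplicate-call check in the spirit of `duplicate-tool-calls`. It is not the package's implementation: the `ToolSpan` and `DuplicateFinding` shapes and the `findDuplicateToolCalls` function are hypothetical, and the sketch assumes two calls are identical when the tool name and the serialized arguments match.

```typescript
// Hypothetical span shape for illustration; agent-house's normalized spans may differ.
interface ToolSpan {
  id: string
  tool: string
  args: unknown
  durationMs: number
}

interface DuplicateFinding {
  tool: string
  spanIds: string[] // the repeated calls, i.e. every call after the first
  wastedMs: number  // time spent on those repeats
}

function findDuplicateToolCalls(spans: ToolSpan[]): DuplicateFinding[] {
  const firstSeen = new Set<string>()
  const findings = new Map<string, DuplicateFinding>()
  for (const span of spans) {
    // Assumes identical calls serialize identically (same key order in args).
    const signature = `${span.tool}:${JSON.stringify(span.args)}`
    if (!firstSeen.has(signature)) {
      firstSeen.add(signature)
      continue
    }
    const finding: DuplicateFinding =
      findings.get(signature) ?? { tool: span.tool, spanIds: [], wastedMs: 0 }
    finding.spanIds.push(span.id)
    finding.wastedMs += span.durationMs
    findings.set(signature, finding)
  }
  return Array.from(findings.values())
}

const flagged = findDuplicateToolCalls([
  { id: 's1', tool: 'search', args: { q: 'weather paris' }, durationMs: 800 },
  { id: 's2', tool: 'fetch', args: { url: 'https://example.com' }, durationMs: 300 },
  { id: 's3', tool: 'search', args: { q: 'weather paris' }, durationMs: 750 },
])
console.log(flagged) // one finding: tool 'search', spanIds ['s3'], wastedMs 750
```

The usual fix for a finding like this is to memoize the tool result within the run, so the second call returns the cached value.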

The report

JSON - the full structured report (the value audit() returns).
Static HTML - a self-contained, Lighthouse-style page (inline CSS, no external assets, no backend) with score gauges and an SVG span waterfall. The report JSON is embedded in the page, so it doubles as a portable artifact.
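
For readers wondering how a report can travel inside a static page, the sketch below shows one common technique: a JSON `<script>` tag that a script or another tool can read back. This is an assumption about the general approach, not a description of agent-house's renderer; `embedReportJson`, `extractReportJson`, and the `report-data` id are hypothetical.

```typescript
// Hypothetical helpers showing one common embedding technique; not agent-house's renderer.
function embedReportJson(html: string, report: unknown): string {
  // Escape "<" so a string such as "</script>" inside the report cannot close the tag early.
  const json = JSON.stringify(report).replace(/</g, '\\u003c')
  const tag = `<script type="application/json" id="report-data">${json}</script>`
  // A replacer function avoids "$" patterns in the JSON being treated as replacement tokens.
  return html.replace('</body>', () => `${tag}</body>`)
}

function extractReportJson(html: string): unknown {
  const match = html.match(/<script type="application\/json" id="report-data">([\s\S]*?)<\/script>/)
  if (!match) throw new Error('no embedded report found')
  return JSON.parse(match[1] ?? '')
}

const demoPage = embedReportJson('<html><body><h1>Report</h1></body></html>', { score: 72 })
console.log(extractReportJson(demoPage)) // { score: 72 }
```

Because the data rides along in the file, a single HTML artifact can be archived from CI and parsed later without the original trace.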

Configuration

audit(trace, {
  format: 'auto',             // or 'otel' | 'vercel' | 'langgraph' | 'adk'
  models: [/* ModelPrice */], // override the price table with your negotiated rates
  thresholds: { contextBloatTokens: 12000 },
  weights: { cost: 0.4, latency: 0.3, reliability: 0.2, context: 0.1 },
})

The built-in price table covers Anthropic, OpenAI, Google Gemini, and xAI Grok (standard-tier list prices, as of 2026-06; older model ids still resolve to the nearest current entry). The $ estimates are only as credible as this table, so it's fully overridable.
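
To show why the price table drives the dollar figures, here is a small sketch of per-span cost from token usage. The `ModelPrice` fields below are an assumed shape (per-million-token rates); check the package's exported types for the real one. `spanCostUsd` and the model id are hypothetical, and the rates are placeholders rather than real list prices.

```typescript
// Hypothetical ModelPrice shape; the real exported type may use different fields.
interface ModelPrice {
  id: string
  inputPerMTok: number  // USD per million input tokens
  outputPerMTok: number // USD per million output tokens
}

interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

function spanCostUsd(model: string, usage: TokenUsage, prices: ModelPrice[]): number | undefined {
  const price = prices.find((p) => p.id === model)
  if (!price) return undefined // unknown model: report no cost rather than guess
  const micros = usage.inputTokens * price.inputPerMTok + usage.outputTokens * price.outputPerMTok
  return micros / 1_000_000
}

// Placeholder model id and rates, not real list prices.
const negotiatedPrices: ModelPrice[] = [{ id: 'example-model', inputPerMTok: 2, outputPerMTok: 8 }]
console.log(spanCostUsd('example-model', { inputTokens: 10_000, outputTokens: 500 }, negotiatedPrices)) // 0.024
```

With these placeholder rates, 10,000 input tokens cost $0.020 and 500 output tokens cost $0.004, so any change to the table shifts every cost estimate proportionally.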

How accurate are the savings?

The savings are estimates and may overlap across audits (a duplicate call is also "extra latency"). They're designed to be directionally credible and to rank fixes by impact - not to be exact invoices. The per-run cost and duration totals come straight from the trace and are exact.

Non-goals (MVP)

No hosted dashboard, no continuous monitoring, no auto-fix. It's a build-time tool: a CLI + library + static report, not a SaaS.

Architecture

trace (OTLP / Vercel / LangGraph / ADK)
        │  ingest adapters → normalize
        ▼
   AgentRun (spans: kind, timing, model, usage, …)
        │
        ├─ pricing engine  → cost per span
        ├─ audits          → findings + savings + hints
        └─ scoring         → category + overall scores
        ▼
   Report  →  JSON  +  static HTML (gauges + waterfall)

Adding a framework is one adapter that normalizes to AgentRun; nothing downstream changes.
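
The adapter boundary can be pictured with the sketch below. Every type here is hypothetical: `AgentRunSketch`, `NormalizedSpan`, `TraceAdapter`, and the made-up `ToyTrace` format only illustrate the idea of normalizing one framework's trace into a shared run shape, and the real `AgentRun` type carries more fields (usage, cost, and so on).

```typescript
// Hypothetical types: the real AgentRun and adapter interfaces may look different.
type SpanKind = 'llm' | 'tool' | 'other'

interface NormalizedSpan {
  id: string
  kind: SpanKind
  name: string
  startMs: number
  endMs: number
  model?: string
}

interface AgentRunSketch {
  source: string
  durationMs: number
  spans: NormalizedSpan[]
}

interface TraceAdapter<Raw> {
  source: string
  matches(trace: unknown): trace is Raw
  normalize(trace: Raw): AgentRunSketch
}

// A made-up framework format, used only to show the adapter boundary.
interface ToyStep {
  id: string
  type: 'model' | 'function'
  label: string
  start: number
  end: number
  model?: string
}

interface ToyTrace {
  toyVersion: number
  steps: ToyStep[]
}

const toyAdapter: TraceAdapter<ToyTrace> = {
  source: 'toy',
  matches(trace: unknown): trace is ToyTrace {
    return typeof trace === 'object' && trace !== null && 'toyVersion' in trace
  },
  normalize(trace: ToyTrace): AgentRunSketch {
    const spans = trace.steps.map((step): NormalizedSpan => ({
      id: step.id,
      kind: step.type === 'model' ? 'llm' : 'tool',
      name: step.label,
      startMs: step.start,
      endMs: step.end,
      ...(step.model ? { model: step.model } : {}),
    }))
    const start = Math.min(...spans.map((s) => s.startMs))
    const end = Math.max(...spans.map((s) => s.endMs))
    return { source: 'toy', durationMs: spans.length > 0 ? end - start : 0, spans }
  },
}

const toyRun = toyAdapter.normalize({
  toyVersion: 1,
  steps: [
    { id: 'a', type: 'model', label: 'plan', start: 0, end: 1200, model: 'example-model' },
    { id: 'b', type: 'function', label: 'search', start: 1200, end: 2000 },
  ],
})
console.log(toyRun.durationMs, toyRun.spans.map((s) => s.kind)) // 2000 [ 'llm', 'tool' ]
```

Pricing, audits, and scoring would only ever see the normalized run, which is why a new framework needs no downstream changes.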

Contributing

See CONTRIBUTING.md. The project is test-driven - every feature lands with Vitest tests. Run npm test.
