add benchmark mode to compare workflow across N provider configs #124

@cchinchilla-dev

Description

Comparing the same workflow across N provider configurations is a recurring need that AgentLoom does not support today. Use cases:

  • "How does my classifier perform on GPT-4o vs Claude Sonnet vs Gemini 2.5 Flash?"
  • "Which provider is fastest for this prompt at p50 / p95?"
  • "What's the cost difference between OpenAI and Anthropic for this workflow at scale?"
  • The PhD's H4 (calibrated combination approximates human judgment) explicitly compares LLM-as-judge across providers; different judges have different biases.

Today this requires copying the YAML N times, editing the model field in each, running them sequentially, parsing the results manually, and building a comparison table by hand. Painful and error-prone, and the results are never aggregated into a single report.

Proposal

Add a benchmark mode that runs one workflow against N configurations and produces a unified comparison report.

1. CLI:

agentloom bench workflow.yaml --configs configs.yaml

Where configs.yaml declares the variants:

# configs.yaml
name: "judge-comparison-2026-04"
runs: 5                            # repetitions per config (for variance)
configs:
  - name: gpt-4o
    workflow_overrides:
      config:
        provider: openai
        model: gpt-4o
  - name: claude-sonnet-4-5
    workflow_overrides:
      config:
        provider: anthropic
        model: claude-sonnet-4-5
  - name: gemini-2.5-flash
    workflow_overrides:
      config:
        provider: google
        model: gemini-2.5-flash
state_variants:                    # optional — different inputs per benchmark dimension
  - name: short-input
    state: { user_input: "Hello" }
  - name: long-input
    state: { user_input: "..." }

This executes len(configs) × len(state_variants) × runs workflow runs in total: 3 × 2 × 5 = 30.
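
For reference, a minimal sketch of how a workflow_overrides block could be merged into the loaded workflow dict. The recursive-merge semantics (nested dicts merged, everything else replaced) are an assumption for illustration, not settled API:

from copy import deepcopy

def apply_overrides(workflow: dict, overrides: dict) -> dict:
    """Return a copy of `workflow` with `overrides` merged in.

    Nested dicts merge recursively; scalars and lists in `overrides`
    replace the original value wholesale.
    """
    merged = deepcopy(workflow)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

# e.g. the claude-sonnet-4-5 variant from configs.yaml above:
base = {"config": {"provider": "openai", "model": "gpt-4o", "temperature": 0.0}}
variant = apply_overrides(
    base, {"config": {"provider": "anthropic", "model": "claude-sonnet-4-5"}}
)
# variant["config"] keeps temperature but swaps provider/model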

2. Aggregation:

Each run produces a WorkflowResult. The bench command aggregates:

  • Per config × variant: cost mean/median/p95, latency mean/median/p95, output (collected for diff), success rate.
  • Cross-config differences: cost ratios, latency ratios, output similarity (semantic; requires the cross-provider embeddings API from #118).
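
A rough sketch of the per-cell statistics using only the stdlib; the cost_usd / latency_s / ok field names on WorkflowResult are assumptions for illustration:

from dataclasses import dataclass
from statistics import mean, median, quantiles

@dataclass
class CellStats:
    cost_mean: float
    cost_median: float
    cost_p95: float
    latency_p50: float
    latency_p95: float
    success_rate: float

def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(values, n=20)[18] if len(values) > 1 else values[0]

def aggregate_cell(results: list) -> CellStats:
    # One cell = all repetitions of a single (config, variant) pair.
    costs = [r.cost_usd for r in results if r.ok]
    lats = [r.latency_s for r in results if r.ok]
    return CellStats(
        cost_mean=mean(costs), cost_median=median(costs), cost_p95=p95(costs),
        latency_p50=median(lats), latency_p95=p95(lats),
        success_rate=sum(r.ok for r in results) / len(results),
    )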

3. Output formats:

agentloom bench workflow.yaml --configs configs.yaml --report html --out bench-report.html
agentloom bench workflow.yaml --configs configs.yaml --report json --out bench-report.json
agentloom bench workflow.yaml --configs configs.yaml --report markdown    # stdout

Markdown report (stdout, default):

Benchmark: judge-comparison-2026-04
Workflow: workflow.yaml | Runs per config: 5 | Total: 30

| Config             | Variant     | Cost (mean) | Latency p50 | Latency p95 | Success |
|--------------------|-------------|-------------|-------------|-------------|---------|
| gpt-4o             | short-input | $0.0023     | 1.2s        | 1.8s        | 5/5     |
| gpt-4o             | long-input  | $0.0089     | 2.4s        | 3.1s        | 5/5     |
| claude-sonnet-4-5  | short-input | $0.0015     | 0.9s        | 1.4s        | 5/5     |
| claude-sonnet-4-5  | long-input  | $0.0067     | 1.8s        | 2.5s        | 5/5     |
| gemini-2.5-flash   | short-input | $0.0008     | 0.6s        | 1.0s        | 5/5     |
| gemini-2.5-flash   | long-input  | $0.0021     | 1.1s        | 1.6s        | 5/5     |

Cheapest: gemini-2.5-flash ($0.0145 total)
Fastest p50: gemini-2.5-flash (0.85s avg)
Most expensive: gpt-4o ($0.0560 total, 3.9x cheapest)

Output similarity (cosine, requires --embed):
| Pair                                  | Mean similarity |
|---------------------------------------|-----------------|
| gpt-4o vs claude-sonnet-4-5           | 0.91            |
| gpt-4o vs gemini-2.5-flash            | 0.84            |
| claude-sonnet-4-5 vs gemini-2.5-flash | 0.86            |
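
The similarity column is plain cosine over output embeddings; a sketch where embed() stands in for whatever the #118 cross-provider API ends up exposing:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def mean_output_similarity(outputs_a: list[str], outputs_b: list[str],
                           embed) -> float:
    # Pair repetition i of config A with repetition i of config B.
    sims = [cosine(embed(x), embed(y)) for x, y in zip(outputs_a, outputs_b)]
    return sum(sims) / len(sims)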

4. Programmatic API:

from agentloom import bench

results = await bench.run(
    workflow=workflow,
    configs=configs,
    state_variants=variants,
    runs=5,
)

results.summary()           # comparison table as DataFrame-like structure
results.cheapest()          # config name
results.fastest(metric="p95")
results.diff(config_a="gpt-4o", config_b="claude-sonnet-4-5")

5. Parallelism:

By default, run all variants × configs × repetitions in parallel up to max_concurrent_runs (default 5). Respects each provider's rate limiter — bench runs share the same gateway/limiter.
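
Concretely, the runner can fan out with an asyncio semaphore; a sketch where run_one is whatever single-run coroutine the runner already has:

import asyncio
from typing import Awaitable, Callable

async def run_bench(cells: list,
                    run_one: Callable[[object], Awaitable[object]],
                    max_concurrent_runs: int = 5) -> list:
    # One permit per in-flight run; provider-level rate limiting still
    # happens inside the shared gateway, this only caps bench fan-out.
    sem = asyncio.Semaphore(max_concurrent_runs)

    async def guarded(cell):
        async with sem:
            return await run_one(cell)

    # return_exceptions=True: a single failed run becomes a result row
    # instead of aborting the whole bench (see the failure-handling test).
    return await asyncio.gather(*(guarded(c) for c in cells),
                                return_exceptions=True)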

6. Reproducibility:

Each run gets a deterministic run_id derived from (config_name, variant_name, repetition_index, bench_id) so traces in Jaeger are correlatable to bench cells.
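
For example (the exact digest scheme is an assumption; any stable hash over the tuple works):

import hashlib

def make_run_id(bench_id: str, config_name: str, variant_name: str,
                repetition_index: int) -> str:
    # Same bench cell -> same id on every rerun, so Jaeger traces from
    # two invocations of the same bench file line up one-to-one.
    key = f"{bench_id}/{config_name}/{variant_name}/{repetition_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

make_run_id("judge-comparison-2026-04", "gpt-4o", "short-input", 0)
# always returns the same 16-hex-char id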

7. Observability:

A bench:<name> parent span wraps all child workflow spans, with attributes: bench.name, bench.total_runs, bench.duration_ms, bench.cheapest_config, etc.
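
With the standard OpenTelemetry API this is roughly the following (attribute names as listed above; the tracer name is illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("agentloom.bench")

with tracer.start_as_current_span("bench:judge-comparison-2026-04") as span:
    span.set_attribute("bench.name", "judge-comparison-2026-04")
    span.set_attribute("bench.total_runs", 30)
    # ... each workflow run opens its child spans under this one ...
    # summary attributes recorded just before the span closes;
    # bench.duration_ms can simply mirror the span's own duration:
    span.set_attribute("bench.cheapest_config", "gemini-2.5-flash")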

Scope

  • src/agentloom/bench/__init__.py — BenchConfig, BenchResult, bench.run().
  • src/agentloom/bench/runner.py — orchestrates parallel execution.
  • src/agentloom/bench/aggregator.py — statistics computation.
  • src/agentloom/bench/reporter.py — HTML / JSON / Markdown report generation.
  • src/agentloom/cli/bench.py — agentloom bench command.
  • examples/bench/ — sample bench config + workflow.

Regression tests

  • test_bench_runs_all_combinations
  • test_bench_respects_runs_per_config
  • test_bench_aggregates_cost_correctly
  • test_bench_aggregates_latency_percentiles
  • test_bench_html_report_renders_table
  • test_bench_json_report_machine_readable
  • test_bench_markdown_stdout_default
  • test_bench_handles_run_failures_gracefully (don't abort the whole bench on one failure)
  • test_bench_run_id_deterministic
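
For instance, the determinism test reduces to a couple of asserts against the make_run_id helper sketched under point 6 (a hypothetical name, not settled API):

def test_bench_run_id_deterministic():
    # Same (bench, config, variant, repetition) tuple -> same run_id;
    # changing the repetition index must change the id.
    a = make_run_id("bench-1", "gpt-4o", "short-input", 0)
    b = make_run_id("bench-1", "gpt-4o", "short-input", 0)
    c = make_run_id("bench-1", "gpt-4o", "short-input", 1)
    assert a == b
    assert a != c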

Notes

  • Output similarity comparison requires the cross-provider embeddings API (#118).
  • For agent benchmarks (compare different agent definitions, not just models), the workflow_overrides mechanism extends to step-level overrides — out of scope for the first version.
  • Pairs naturally with the AgentTest platform: bench reports are exactly what the Reporter module would generate, but at the workflow level rather than the agent-evaluation level.
  • The PhD's H4 (human-judgment) calibration evaluation is essentially "run the same eval workflow against N judges, compare to human gold"; this primitive is the runtime for that experiment.
