add benchmark mode to compare workflow across N provider configs #124

@cchinchilla-dev

Description

Comparing the same workflow across N provider configurations is a recurring need that AgentLoom does not support today. Use cases:

  • "How does my classifier perform on GPT-4o vs Claude Sonnet vs Gemini 2.5 Flash?"
  • "Which provider is fastest for this prompt at p50 / p95?"
  • "What's the cost difference between OpenAI and Anthropic for this workflow at scale?"
  • The PhD's H4 (calibrated combination approximates human judgment) explicitly compares LLM-as-judge across providers; different judges have different biases.

Today this requires copying the YAML N times, editing the model field in each, running them sequentially, parsing the results manually, and building a comparison table by hand. Painful and error-prone, and the results are never aggregated into a single report.

Proposal

Add a benchmark mode that runs one workflow against N configurations and produces a unified comparison report.

1. CLI:

agentloom bench workflow.yaml --configs configs.yaml

Where configs.yaml declares the variants:

# configs.yaml
name: "judge-comparison-2026-04"
runs: 5                            # repetitions per config (for variance)
configs:
  - name: gpt-4o
    workflow_overrides:
      config:
        provider: openai
        model: gpt-4o
  - name: claude-sonnet-4-5
    workflow_overrides:
      config:
        provider: anthropic
        model: claude-sonnet-4-5
  - name: gemini-2.5-flash
    workflow_overrides:
      config:
        provider: google
        model: gemini-2.5-flash
state_variants:                    # optional — different inputs per benchmark dimension
  - name: short-input
    state: { user_input: "Hello" }
  - name: long-input
    state: { user_input: "..." }

This executes len(configs) × len(state_variants) × runs workflow runs in total: 3 × 2 × 5 = 30.
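
For reference, a minimal sketch of how a workflow_overrides block could be merged into the loaded workflow dict. The recursive-merge semantics (nested dicts merged, everything else replaced) are an assumption for illustration, not settled API:

from copy import deepcopy

def apply_overrides(workflow: dict, overrides: dict) -> dict:
    """Return a copy of `workflow` with `overrides` merged in.

    Nested dicts merge recursively; scalars and lists in `overrides`
    replace the original value wholesale.
    """
    merged = deepcopy(workflow)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

# e.g. the claude-sonnet-4-5 variant from configs.yaml above:
base = {"config": {"provider": "openai", "model": "gpt-4o", "temperature": 0.0}}
variant = apply_overrides(
    base, {"config": {"provider": "anthropic", "model": "claude-sonnet-4-5"}}
)
# variant["config"] keeps temperature but swaps provider/model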

2. Aggregation:

Each run produces a WorkflowResult. The bench command aggregates:

  • Per config × variant: cost mean/median/p95, latency mean/median/p95, output (collected for diff), success rate.
  • Cross-config differences: cost ratios, latency ratios, output similarity (semantic; requires the cross-provider embeddings API from #118).
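
A rough sketch of the per-cell statistics using only the stdlib; the cost_usd / latency_s / ok field names on WorkflowResult are assumptions for illustration:

from dataclasses import dataclass
from statistics import mean, median, quantiles

@dataclass
class CellStats:
    cost_mean: float
    cost_median: float
    cost_p95: float
    latency_p50: float
    latency_p95: float
    success_rate: float

def p95(values: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(values, n=20)[18] if len(values) > 1 else values[0]

def aggregate_cell(results: list) -> CellStats:
    # One cell = all repetitions of a single (config, variant) pair.
    costs = [r.cost_usd for r in results if r.ok]
    lats = [r.latency_s for r in results if r.ok]
    return CellStats(
        cost_mean=mean(costs), cost_median=median(costs), cost_p95=p95(costs),
        latency_p50=median(lats), latency_p95=p95(lats),
        success_rate=sum(r.ok for r in results) / len(results),
    )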

3. Output formats:

agentloom bench workflow.yaml --configs configs.yaml --report html --out bench-report.html
agentloom bench workflow.yaml --configs configs.yaml --report json --out bench-report.json
agentloom bench workflow.yaml --configs configs.yaml --report markdown    # stdout

Markdown report (stdout, default):

Benchmark: judge-comparison-2026-04
Workflow: workflow.yaml | Runs per config: 5 | Total: 30

| Config             | Variant     | Cost (mean) | Latency p50 | Latency p95 | Success |
|--------------------|-------------|-------------|-------------|-------------|---------|
| gpt-4o             | short-input | $0.0023     | 1.2s        | 1.8s        | 5/5     |
| gpt-4o             | long-input  | $0.0089     | 2.4s        | 3.1s        | 5/5     |
| claude-sonnet-4-5  | short-input | $0.0015     | 0.9s        | 1.4s        | 5/5     |
| claude-sonnet-4-5  | long-input  | $0.0067     | 1.8s        | 2.5s        | 5/5     |
| gemini-2.5-flash   | short-input | $0.0008     | 0.6s        | 1.0s        | 5/5     |
| gemini-2.5-flash   | long-input  | $0.0021     | 1.1s        | 1.6s        | 5/5     |

Cheapest: gemini-2.5-flash ($0.0145 total)
Fastest p50: gemini-2.5-flash (0.85s avg)
Most expensive: gpt-4o ($0.0560 total, 3.9x cheapest)

Output similarity (cosine, requires --embed):
| Pair                                  | Mean similarity |
|---------------------------------------|-----------------|
| gpt-4o vs claude-sonnet-4-5           | 0.91            |
| gpt-4o vs gemini-2.5-flash            | 0.84            |
| claude-sonnet-4-5 vs gemini-2.5-flash | 0.86            |
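
The similarity column is plain cosine over output embeddings; a sketch where embed() stands in for whatever the #118 cross-provider API ends up exposing:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def mean_output_similarity(outputs_a: list[str], outputs_b: list[str],
                           embed) -> float:
    # Pair repetition i of config A with repetition i of config B.
    sims = [cosine(embed(x), embed(y)) for x, y in zip(outputs_a, outputs_b)]
    return sum(sims) / len(sims)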

4. Programmatic API:

from agentloom import bench

results = await bench.run(
    workflow=workflow,
    configs=configs,
    state_variants=variants,
    runs=5,
)

results.summary()           # comparison table as DataFrame-like structure
results.cheapest()          # config name
results.fastest(metric="p95")
results.diff(config_a="gpt-4o", config_b="claude-sonnet-4-5")

5. Parallelism:

By default, run all variants × configs × repetitions in parallel up to max_concurrent_runs (default 5). Respects each provider's rate limiter — bench runs share the same gateway/limiter.
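
Concretely, the runner can fan out with an asyncio semaphore; a sketch where run_one is whatever single-run coroutine the runner already has:

import asyncio
from typing import Awaitable, Callable

async def run_bench(cells: list,
                    run_one: Callable[[object], Awaitable[object]],
                    max_concurrent_runs: int = 5) -> list:
    # One permit per in-flight run; provider-level rate limiting still
    # happens inside the shared gateway, this only caps bench fan-out.
    sem = asyncio.Semaphore(max_concurrent_runs)

    async def guarded(cell):
        async with sem:
            return await run_one(cell)

    # return_exceptions=True: a single failed run becomes a result row
    # instead of aborting the whole bench (see the failure-handling test).
    return await asyncio.gather(*(guarded(c) for c in cells),
                                return_exceptions=True)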

6. Reproducibility:

Each run gets a deterministic run_id derived from (config_name, variant_name, repetition_index, bench_id) so traces in Jaeger are correlatable to bench cells.
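
For example (the exact digest scheme is an assumption; any stable hash over the tuple works):

import hashlib

def make_run_id(bench_id: str, config_name: str, variant_name: str,
                repetition_index: int) -> str:
    # Same bench cell -> same id on every rerun, so Jaeger traces from
    # two invocations of the same bench file line up one-to-one.
    key = f"{bench_id}/{config_name}/{variant_name}/{repetition_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

make_run_id("judge-comparison-2026-04", "gpt-4o", "short-input", 0)
# always returns the same 16-hex-char id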

7. Observability:

A bench:<name> parent span wraps all child workflow spans, with attributes: bench.name, bench.total_runs, bench.duration_ms, bench.cheapest_config, etc.
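
With the standard OpenTelemetry API this is roughly the following (attribute names as listed above; the tracer name is illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("agentloom.bench")

with tracer.start_as_current_span("bench:judge-comparison-2026-04") as span:
    span.set_attribute("bench.name", "judge-comparison-2026-04")
    span.set_attribute("bench.total_runs", 30)
    # ... each workflow run opens its child spans under this one ...
    # summary attributes recorded just before the span closes;
    # bench.duration_ms can simply mirror the span's own duration:
    span.set_attribute("bench.cheapest_config", "gemini-2.5-flash")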

Scope

  • src/agentloom/bench/__init__.py — BenchConfig, BenchResult, bench.run().
  • src/agentloom/bench/runner.py — orchestrates parallel execution.
  • src/agentloom/bench/aggregator.py — statistics computation.
  • src/agentloom/bench/reporter.py — HTML / JSON / Markdown report generation.
  • src/agentloom/cli/bench.py — agentloom bench command.
  • examples/bench/ — sample bench config + workflow.

Regression tests

  • test_bench_runs_all_combinations
  • test_bench_respects_runs_per_config
  • test_bench_aggregates_cost_correctly
  • test_bench_aggregates_latency_percentiles
  • test_bench_html_report_renders_table
  • test_bench_json_report_machine_readable
  • test_bench_markdown_stdout_default
  • test_bench_handles_run_failures_gracefully (don't abort the whole bench on one failure)
  • test_bench_run_id_deterministic
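
For instance, the determinism test reduces to a couple of asserts against the make_run_id helper sketched under point 6 (a hypothetical name, not settled API):

def test_bench_run_id_deterministic():
    # Same (bench, config, variant, repetition) tuple -> same run_id;
    # changing the repetition index must change the id.
    a = make_run_id("bench-1", "gpt-4o", "short-input", 0)
    b = make_run_id("bench-1", "gpt-4o", "short-input", 0)
    c = make_run_id("bench-1", "gpt-4o", "short-input", 1)
    assert a == b
    assert a != c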

Notes

  • Output similarity comparison requires the cross-provider embeddings API (#118).
  • For agent benchmarks (compare different agent definitions, not just models), the workflow_overrides mechanism extends to step-level overrides — out of scope for the first version.
  • Pairs naturally with the AgentTest platform: bench reports are exactly what the Reporter module would generate, but at the workflow level rather than the agent-evaluation level.
  • The PhD's H4 (human-judgment) calibration evaluation is essentially "run the same eval workflow against N judges, compare to human gold"; this primitive is the runtime for that experiment.
