Comparing the same workflow across N provider configurations is a recurring need that AgentLoom does not support today. Use cases:

- "How does my classifier perform on GPT-4o vs Claude Sonnet vs Gemini 2.5 Flash?"
- "Which provider is fastest for this prompt at p50 / p95?"
- "What's the cost difference between OpenAI and Anthropic for this workflow at scale?"
- The PhD's H4 (calibrated combination approximates human judgment) explicitly compares LLM-as-judge across providers — different judges have different biases.
Today this requires: copy the YAML N times, edit the model field in each, run them sequentially, parse the results manually, build a comparison table by hand. Painful and error-prone — and the results are not aggregated into a single report.
Proposal
Add a benchmark mode that runs one workflow against N configurations and produces a unified comparison report.
1. CLI:
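The command shape below is a sketch: `agentloom bench` and `--configs` appear elsewhere in this issue, while `--runs` is an assumed flag matching the 5 repetitions in the arithmetic below.

```bash
agentloom bench workflow.yaml --configs configs.yaml --runs 5
```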
Where `configs.yaml` declares the variants:
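A sketch of what `configs.yaml` could contain, sized to match the 3 × 2 × 5 arithmetic below; the key names (`configs`, `state_variants`, `model`) are assumptions, and the model names echo the use cases above.

```yaml
configs:
  - name: gpt-4o
    model: openai/gpt-4o
  - name: claude-sonnet
    model: anthropic/claude-sonnet
  - name: gemini-2.5-flash
    model: google/gemini-2.5-flash

state_variants:
  - name: short_input
    state:
      text: "one short test document"
  - name: long_input
    state:
      text: "one much longer test document"
```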
This runs `len(configs) * len(state_variants) * runs` total executions: 3 × 2 × 5 = 30 runs.
2. Aggregation:

Each run produces a `WorkflowResult`. The bench command aggregates, at minimum (per the regression tests below): total cost, latency percentiles (p50 / p95), and success / failure counts per (config, variant) cell.
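A minimal sketch of the per-cell aggregation; the `WorkflowResult` fields used here (`ok`, `cost`, `latency_ms`) are assumptions, only the class name comes from this issue.

```python
import statistics

def aggregate_cell(results: list) -> dict:
    """Collapse the repeated runs of one (config, variant) cell into stats."""
    latencies = sorted(r.latency_ms for r in results if r.ok)
    return {
        "runs": len(results),
        "ok": sum(1 for r in results if r.ok),
        "total_cost": sum(r.cost for r in results if r.ok),
        "p50_latency_ms": statistics.median(latencies) if latencies else None,
        # nearest-rank p95; adequate for small run counts
        "p95_latency_ms": (latencies[min(len(latencies) - 1,
                                         int(0.95 * len(latencies)))]
                           if latencies else None),
    }
```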
3. Output formats:

```bash
agentloom bench workflow.yaml --configs configs.yaml --report html --out bench-report.html
agentloom bench workflow.yaml --configs configs.yaml --report json --out bench-report.json
agentloom bench workflow.yaml --configs configs.yaml --report markdown   # stdout
```

Markdown report (stdout, default):
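A sketch of the Markdown output; the column set follows the aggregation above and every number is a placeholder.

```markdown
# bench: workflow.yaml (3 configs × 2 variants × 5 runs = 30)

| config           | ok    | total cost | p50 latency | p95 latency |
|------------------|-------|-----------:|------------:|------------:|
| gpt-4o           | 10/10 |      $0.42 |       1.9 s |       3.4 s |
| claude-sonnet    | 10/10 |      $0.37 |       2.1 s |       3.8 s |
| gemini-2.5-flash |  9/10 |      $0.06 |       1.2 s |       2.3 s |
```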
4. Programmatic API:
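A sketch of the Python entry point; `BenchConfig`, `BenchResult`, and `bench.run()` come from the Scope section below, while the field and attribute names are assumptions.

```python
from agentloom import bench
from agentloom.bench import BenchConfig

# Field names are illustrative; only BenchConfig itself is fixed by the Scope.
config = BenchConfig(
    workflow="workflow.yaml",
    configs="configs.yaml",
    runs=5,                  # repetitions per (config, variant) cell
    max_concurrent_runs=5,   # see item 5 (Parallelism)
)

result = bench.run(config)   # -> BenchResult

# Assumed shape: one aggregated row per (config, variant) cell.
for cell in result.cells:
    print(cell.config_name, cell.total_cost, cell.p95_latency_ms)
```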
5. Parallelism:
By default, run all variants × configs × repetitions in parallel up to `max_concurrent_runs` (default 5). Respects each provider's rate limiter — bench runs share the same gateway/limiter.
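A sketch of the concurrency cap with `asyncio`; `execute_workflow` is a hypothetical hook standing in for the real runner, which would route every call through the shared gateway/limiter.

```python
import asyncio

async def run_all(cells, max_concurrent_runs: int = 5):
    # Cap in-flight runs; provider rate limits are still enforced downstream
    # because every run shares the same gateway/limiter.
    sem = asyncio.Semaphore(max_concurrent_runs)

    async def run_one(cell):
        async with sem:
            return await execute_workflow(cell)  # hypothetical runner hook

    return await asyncio.gather(*(run_one(c) for c in cells))
```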
6. Reproducibility:

Each run gets a deterministic `run_id` derived from `(config_name, variant_name, repetition_index, bench_id)` so traces in Jaeger are correlatable to bench cells.
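One possible derivation, as a sketch; the issue fixes only the input tuple, so the hash construction and id length are assumptions.

```python
import hashlib

def derive_run_id(config_name: str, variant_name: str,
                  repetition_index: int, bench_id: str) -> str:
    """Same inputs always yield the same id, so a Jaeger trace can be
    joined back to its bench cell."""
    key = f"{bench_id}:{config_name}:{variant_name}:{repetition_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```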
7. Observability:

A `bench:<name>` parent span wraps all child workflow spans, with attributes: `bench.name`, `bench.total_runs`, `bench.duration_ms`, `bench.cheapest_config`, etc.
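The Jaeger mention suggests OpenTelemetry; a sketch of the parent span, with summary attributes set once the runs finish:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agentloom.bench")

def run_bench_with_span(bench_name: str, cells: list) -> None:
    with tracer.start_as_current_span(f"bench:{bench_name}") as span:
        span.set_attribute("bench.name", bench_name)
        span.set_attribute("bench.total_runs", len(cells))
        # ... execute the runs here; their workflow spans nest under this one,
        # then summary attributes such as bench.duration_ms and
        # bench.cheapest_config are set from the aggregated results.
```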
Scope

- `src/agentloom/bench/__init__.py` — `BenchConfig`, `BenchResult`, `bench.run()`.
- `src/agentloom/bench/runner.py` — orchestrates parallel execution.
- `src/agentloom/bench/aggregator.py` — statistics computation.
- `src/agentloom/bench/reporter.py` — HTML / JSON / Markdown report generation.
- `src/agentloom/cli/bench.py` — `agentloom bench` command.
- `examples/bench/` — sample bench config + workflow.
Regression tests

- `test_bench_runs_all_combinations`
- `test_bench_respects_runs_per_config`
- `test_bench_aggregates_cost_correctly`
- `test_bench_aggregates_latency_percentiles`
- `test_bench_html_report_renders_table`
- `test_bench_json_report_machine_readable`
- `test_bench_markdown_stdout_default`
- `test_bench_handles_run_failures_gracefully` (don't abort the whole bench on one failure)
- `test_bench_run_id_deterministic`
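As an illustration, the determinism test could be as small as this, assuming the `derive_run_id` sketch from item 6:

```python
def test_bench_run_id_deterministic():
    a = derive_run_id("gpt-4o", "long_input", 0, "bench-001")
    b = derive_run_id("gpt-4o", "long_input", 0, "bench-001")
    c = derive_run_id("gpt-4o", "long_input", 1, "bench-001")
    assert a == b   # identical cell and repetition -> identical id
    assert a != c   # different repetition -> different id
```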
Notes

- For agent benchmarks (compare different agent definitions, not just models), the `workflow_overrides` mechanism extends to step-level overrides — out of scope for the first version.
- Pairs naturally with the AgentTest platform — bench reports are exactly what the Reporter module would generate, but at the workflow level rather than the agent-evaluation level.
- The PhD's H4 (human judgment) calibration evaluation is essentially "run the same eval workflow against N judges, compare to human gold" — this primitive is the runtime for that experiment.