This repo is the experiment harness behind PAPER.md. It runs single-agent systems (SAS) and multi-agent systems (MAS) on a shared benchmark suite, records structured execution traces, and converts repeated runs into a trace-derived descriptor that supports:
- quality vs cost analysis
- MAS vs SAS gain/cost comparison
- coordination diagnostics
- topology-level summary and Pareto analysis
The core goal is not only to ask whether MAS can help, but to measure when collaboration improves task outcomes enough to justify its execution and coordination cost.
Each task is executed one or more times under a fixed system configuration. For every run, the repo stores:
- benchmark-native evaluation output
- structured trace events
- run-level trace metrics
- a task-level descriptor aggregated over repeated runs
The descriptor follows the paper’s Q / C / D / R / P split while also exposing paper-facing aliases used directly in the draft:
- `Q`: outcome quality
- `C`: direct execution cost
- `D`: coordination diagnostics
- `R`: run-to-run reliability
- `P`: process structure
The paper defines higher-level economic quantities such as utility U = Q - C, collaboration gain G, and coordination cost K. This repo produces the trace-derived ingredients needed for those analyses.
The MAS runtime follows a supervisor/subagent design while keeping the current custom topology engine and provider-native OpenAI-compatible tool loop.
- Structural workflow roles remain authoritative: planner/orchestrator/worker/critic/aggregator roles determine routing, visibility, and output contract.
- Dynamic personas specialize the agent within that structural role. They do not override stage rules or tool requirements.
- Tool-enabled answer-producing stages are expected to call tools when evidence is missing or weak. The runtime does not fabricate tool calls or synthetic retrieval after the fact.
- Final judges and deterministic fallbacks prefer direct, evidence-backed answers over blocked-status, planning-only, or "no evidence" outputs.
- Context sharing is explicit. Agents see only task messages, selected relay packets, their prior artifact, and the tool outputs they actually received.
This design aligns with current primary-source guidance on agent systems:
- OpenAI, A practical guide to building agents: start with clear instructions, explicit tool loops, and manager-pattern orchestration when specialization is useful.
- LangChain, Subagents: the main agent should see concise subagent outputs and treat tool/subagent descriptions as routing levers.
- LangChain, Handoffs: explicit context engineering matters; malformed or overly broad context degrades multi-agent behavior.
- LangChain, Deep Agents overview: keep the main context clean and isolate specialized work into bounded subagent contexts.
References:
- https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
- https://docs.langchain.com/oss/python/langchain/multi-agent/subagents
- https://docs.langchain.com/oss/python/langchain/multi-agent/handoffs
- https://docs.langchain.com/oss/python/deepagents/index
The metric contract is intentionally strict and reproducible.
For one run r:
- `success_r = 1` iff `benchmark.evaluate(...).success` is `True`
- `completion_r = 1` iff the run produced a final artifact / final answer and did not terminate with an explicit runtime failure signal
Important:
- `success` is benchmark correctness
- `completion` is execution completion
- `completion` does not imply correctness

So a wrong answer can still have `completion = 1` and `success = 0`.
At the benchmark level for a fixed MAS:
- `success_rate`: among all benchmark sample runs, what fraction were solved correctly
- `completion_rate`: among all benchmark sample runs, what fraction finished execution and produced a final answer/artifact without an explicit runtime failure
Equivalently:
- `success_rate`: "How many samples did this MAS actually solve?"
- `completion_rate`: "How many sample runs did this MAS complete successfully as executions?"
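A minimal sketch of this contract, assuming hypothetical run-record fields (`eval_success`, `runtime_failed`, `final_answer`) rather than the repo's actual API:

```python
# Hedged sketch: field names (eval_success, runtime_failed, final_answer) are
# illustrative, not the repo's actual run-record schema.
from statistics import mean

def run_flags(run: dict) -> tuple[int, int]:
    """Return (success_r, completion_r) for a single run record."""
    success_r = 1 if run.get("eval_success") is True else 0               # benchmark correctness
    completed = run.get("final_answer") and not run.get("runtime_failed")
    completion_r = 1 if completed else 0                                  # execution completion only
    return success_r, completion_r

def benchmark_rates(runs: list[dict]) -> tuple[float, float]:
    """Fraction of sample runs solved correctly vs. finished as executions."""
    flags = [run_flags(r) for r in runs]
    return mean(s for s, _ in flags), mean(c for _, c in flags)
```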
For one run r, the trace code computes:
- `latency_total_r = sum(event.latency_ms)`
- `tokens_total_r = sum(event.token_in + event.token_out)`
- `cost_total_r = sum(event.cost_usd)`
- `tool_calls_total_r = count(event_type == "tool_call")`
- `tool_fail_total_r = count(tool failures)`
- `steps_total_r` = number of trace events
- `backtrack_rate_r = (#revise events + payload.redo) / steps_total_r`
- `loop_score_r` = repeated-state or repeated-pattern ratio from the trace
- `verification_density_r = #verify / steps_total_r`
- `communication_count_r` = directed relay/message edges from all inter-agent sends, including system-mediated sends
- `communication_agent_to_agent_count_r` = directed send edges whose sender is a non-system agent
- `communication_system_mediated_count_r` = directed send edges whose sender is system / mediator
- `handoff_count_r` = actor switches across consecutive non-system events
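For illustration, these totals could be recomputed from a run's trace events roughly as follows; the payload conventions for tool failures and redo counts are assumptions, not the repo's exact logic:

```python
# Sketch: recompute run-level totals from a list of trace-event dicts.
def run_trace_metrics(events: list[dict]) -> dict:
    steps = len(events) or 1                                   # guard against empty traces
    tool_calls = [e for e in events if e["event_type"] == "tool_call"]
    tool_fails = [e for e in tool_calls if e.get("payload", {}).get("error")]  # assumed convention
    revises = sum(1 for e in events if e["event_type"] == "revise")
    redos = sum(e.get("payload", {}).get("redo", 0) for e in events)           # assumed convention
    verifies = sum(1 for e in events if e["event_type"] == "verify")
    return {
        "latency_total": sum(e.get("latency_ms", 0) for e in events),
        "tokens_total": sum(e.get("token_in", 0) + e.get("token_out", 0) for e in events),
        "cost_total": sum(e.get("cost_usd", 0.0) for e in events),
        "tool_calls_total": len(tool_calls),
        "tool_fail_total": len(tool_fails),
        "steps_total": len(events),
        "backtrack_rate": (revises + redos) / steps,
        "verification_density": verifies / steps,
    }
```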
Given N repeated runs for the same task and system:
Quality
- `Q1_success_rate = mean_r(success_r)`
- `Q2_completion_rate = mean_r(completion_r)`
Execution cost
- `C1_latency_p95 = p95_r(latency_total_r)`
- `C2_tokens_total = mean_r(tokens_total_r)`
- `C3_cost_total = mean_r(cost_total_r)`
- `C4_tool_calls_total = mean_r(tool_calls_total_r)`
Coordination diagnostics
- `D1_tool_error_rate = sum_r(tool_fail_total_r) / sum_r(tool_calls_total_r)`
- `D2_communication_count = mean_r(communication_count_r)`
- `D2_agent_to_agent_communication_count = mean_r(communication_agent_to_agent_count_r)`
- `D2_system_mediated_communication_count = mean_r(communication_system_mediated_count_r)`
- `D3_handoff_count = mean_r(handoff_count_r)`
These D* metrics are logged as coordination diagnostics. They are not part of the paper’s direct execution-cost definition C.
Reliability
- `R1_success_var = Var_r(success_r)`
- `R2_latency_var = Var_r(latency_total_r)`
- `R3_tokens_var = Var_r(tokens_total_r)`
Process
- `P1_steps_total = mean_r(steps_total_r)`
- `P2_backtrack_rate = mean_r(backtrack_rate_r)`
- `P3_loop_score = mean_r(loop_score_r)`
- `P4_verification_density = mean_r(verification_density_r)`
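A hedged sketch of the aggregation step, assuming each run contributes a dict of the run-level values above; the nearest-rank p95 and the use of population variance are assumptions:

```python
# Sketch: aggregate run-level values into a few of the descriptor fields above.
import statistics

def p95(values: list[float]) -> float:
    ordered = sorted(values)
    idx = max(0, round(0.95 * (len(ordered) - 1)))   # simple nearest-rank p95 (assumed)
    return ordered[idx]

def aggregate(runs: list[dict]) -> dict:
    col = lambda key: [r[key] for r in runs]
    var = lambda xs: statistics.pvariance(xs) if len(xs) > 1 else 0.0  # population variance assumed
    return {
        "Q1_success_rate": statistics.mean(col("success")),
        "Q2_completion_rate": statistics.mean(col("completion")),
        "C1_latency_p95": p95(col("latency_total")),
        "C2_tokens_total": statistics.mean(col("tokens_total")),
        # guard against zero tool calls; the repo may instead leave this blank
        "D1_tool_error_rate": sum(col("tool_fail_total")) / max(1, sum(col("tool_calls_total"))),
        "R1_success_var": var(col("success")),
        "P1_steps_total": statistics.mean(col("steps_total")),
    }
```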
The task descriptor also writes paper-facing fields directly so downstream scripts do not need to reconstruct them:
- `success_rate = Q1_success_rate`
- `pass_at_1`, `pass_at_3`, `pass_at_5`, `pass_at_8` using the paper's pass@k estimator over repeated runs
- `stability = clip(1 - R1_success_var / 0.25, 0, 1)` when `N >= 2`, otherwise blank
- `eval_avg_score = mean_r(score_r)`
- `tokens_total = C2_tokens_total`
- `cost_per_success = tokens_total / success_rate` when `success_rate > 0`, otherwise blank
- `tokens_cv = std_r(tokens_total_r) / mean_r(tokens_total_r)` when `N >= 2` and mean tokens are positive, otherwise blank
- `tool_calls_total = C4_tool_calls_total`
- diagnostic aliases: `tool_error_rate`, `communication_count`, `handoff_count`
Interpretation notes:
- `stability` and `tokens_cv` require repeated runs and are blank for single-run tasks
- `pass_at_k` is blank when fewer than `k` repeated runs are available
- `cost_per_success` is blank when the system never succeeds on that task
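For reference, a sketch of the pass@k and stability computations; the unbiased pass@k form `1 - C(n-c, k) / C(n, k)` is assumed to match the paper's estimator:

```python
# Sketch: pass@k and stability over N repeated runs of one task.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float | None:
    """n = repeated runs, c = successful runs, k = samples drawn."""
    if n < k:
        return None                      # blank when fewer than k runs exist
    if n - c < k:
        return 1.0                       # every size-k draw contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def stability(success_var: float, n: int) -> float | None:
    if n < 2:
        return None                      # blank for single-run tasks
    return min(1.0, max(0.0, 1.0 - success_var / 0.25))
```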
Per task and system, summary.csv includes:
- `eval_avg_score`: benchmark-native mean score across runs
- `eval_success_rate`: benchmark-native mean boolean success across runs
- `eval_completion_rate`: runtime completion rate across runs
- paper-facing descriptor fields such as `success_rate`, `pass_at_3`, `stability`, `tokens_total`, `cost_per_success`, `tokens_cv`
- compatibility fields such as `Q1_success_rate`, `C2_tokens_total`, `D2_communication_count`, `P3_loop_score`, etc.
By design:
- `Q1_success_rate` should match `eval_success_rate`
- `Q2_completion_rate` should match `eval_completion_rate`
If those pairs disagree, that indicates a bug in the artifact pipeline.
Interpretation by level:
- per run: `success` and `completion` are binary `0/1`
- per task with repeated runs: `Q1_success_rate` and `Q2_completion_rate` are proportions over that task's repeated runs
- per benchmark for one MAS: average those task-level values across all samples in the benchmark to get the benchmark-level success/completion rates
Looped MAS stages such as debate, representative exchange, and orchestrator cycles do not stop implicitly. A controller node calls `_termination_decision(...)` in `MAS/langgraph_engine.py` and computes explicit stop statistics from the current stage artifacts.
For one controller decision:
- `candidate_artifacts`: the current artifacts that would be revised if the loop continues
- `previous_candidate_artifacts`: the previous-step artifacts for the same agents, used to measure change
- `consensus_artifacts`: the artifacts whose answers are compared for agreement
- `expected_count`: how many active branches or agents were expected to produce an artifact
The code first counts:
`valid_artifact_count = count(non-empty branch artifacts available at the current controller step)`

If `valid_artifact_count < ceil(expected_count / 2)`, the stage stops with `invalid_or_failed_branch`.
Interpretation:
- this is a branch-survival check
- if fewer than half of the expected branches produced any usable artifact at all, the collaboration stage is considered too broken to continue
- blocked or planning artifacts do not count as good final answers, but they no longer trigger branch-collapse handling by themselves
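A minimal sketch of the branch-survival check; the artifact shape and the "non-empty" test are assumptions:

```python
# Sketch of the branch-survival check described above.
from math import ceil

def branch_survival_stop(candidate_artifacts: list[dict], expected_count: int) -> bool:
    # "Non-empty" is approximated here as having any answer text at all.
    valid_artifact_count = sum(1 for a in candidate_artifacts if a and a.get("answer"))
    # Stop with reason "invalid_or_failed_branch" when fewer than half the
    # expected branches produced any usable artifact.
    return valid_artifact_count < ceil(expected_count / 2)
```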
By default, the repo computes termination consensus with an LLM judge:
- `mas.termination_consensus_mode = "llm_judge"` by default
- the judge uses the system model route `models.judge` if provided, otherwise `models.default`
- the controller sends the current task prompt plus the candidate answers to the judge
- the judge returns JSON groups of semantically equivalent answers
The JSON schema is:
- `groups`: lists of artifact indices that express the same final answer
- `invalid_indices`: indices the judge considers unusable or non-answers
- `is_substantive`: whether the largest agreement group is an actual task answer
- `progress_status`: `improving | stalled | unclear`
- `expected_improvement`: `high | medium | low`
- `should_stop_for_no_progress`: whether another round is unlikely to materially improve correctness
- `explanation`: short rationale
The controller then computes:
- `winner_count` = size of the largest judged equivalence group
- `valid_count` = number of valid answers after removing `invalid_indices`
- `consensus_ratio = winner_count / valid_count`
The stage stops with `consensus_reached` when:
- `valid_count > 1`
- `consensus_ratio >= 0.75`
- the semantic judge marks the majority answer as substantive
Interpretation:
- consensus here is semantic agreement as judged by the termination judge, not exact string identity
- if the judge clusters 3 of 4 valid answers together, `consensus_ratio = 0.75`
- this consensus check is still a workflow-control heuristic, not the benchmark evaluator and not the final correctness decision
Fallback behavior:
- if `mas.termination_consensus_mode = "lexical"`, the repo uses deterministic normalized-string voting
- if `mas.termination_consensus_mode = "llm_judge"` but the judge is unavailable, running in mock mode, or returns unusable JSON, the controller falls back to lexical consensus
The lexical fallback canonicalizes each answer by lowercasing, removing non-alphanumeric characters, and collapsing whitespace, then computes the same winner_count / valid_count ratio over exact normalized matches.
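A sketch of this lexical fallback; the exact canonicalization regex and the reuse of the 0.75 threshold from the judge path are assumptions:

```python
# Sketch of deterministic normalized-string voting for the lexical fallback.
import re
from collections import Counter

def canonical(answer: str) -> str:
    text = answer.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # drop non-alphanumeric characters
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

def lexical_consensus(answers: list[str]) -> tuple[float, bool]:
    valid = [canonical(a) for a in answers if a and canonical(a)]
    if len(valid) <= 1:
        return 0.0, False
    winner_count = Counter(valid).most_common(1)[0][1]   # largest exact-match group
    consensus_ratio = winner_count / len(valid)
    return consensus_ratio, consensus_ratio >= 0.75      # same threshold assumed as llm_judge mode
```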
Final answer aggregation is separate from this stop-condition ratio. Final answer selection is configurable and can fall back to deterministic vote_artifacts(...) after the loop ends.
You can also configure final answer selection separately:
- `mas.final_vote_mode = "llm_judge"` by default
- the final judge sees the task prompt plus the candidate answers and returns JSON with semantic groups, a `winner_index`, optional `invalid_indices`, and a short explanation
- if the final judge is unavailable, running in mock mode, or returns unusable JSON, the repo falls back to deterministic `vote_artifacts(...)`
Each artifact carries a confidence field produced by the agent JSON output schema. During artifact construction:
- the parsed value is converted to `float`
- it is clipped into `[0, 1]`
- if missing or unparsable, it defaults to `0.5`
Then:
`average_confidence = mean(artifact.confidence)` across the current `candidate_artifacts` (or `consensus_artifacts` if needed).
Interpretation:
- this is self-reported model confidence averaged over the active artifacts
- it is logged as a diagnostic only and does not directly terminate a run
- the prompt now defines confidence as confidence in the current `answer_artifact`, not general optimism
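A sketch of this confidence handling during artifact construction; the artifact field access is illustrative:

```python
# Sketch: parse, clip, and average self-reported confidence values.
def parse_confidence(raw) -> float:
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return 0.5                          # default when missing or unparsable
    return min(1.0, max(0.0, value))        # clip into [0, 1]

def average_confidence(artifacts: list[dict]) -> float:
    values = [parse_confidence(a.get("confidence")) for a in artifacts]
    return sum(values) / len(values) if values else 0.5
```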
In `llm_judge` mode, `no_meaningful_change` is semantic. The termination judge sees the current candidate artifacts plus each agent's previous answer when available and decides whether another round is likely to materially improve correctness.
The stage stops with `no_meaningful_change` when:
- previous comparable artifacts exist
- the semantic judge returns `should_stop_for_no_progress = true`
Fallback behavior:
- if the termination judge is unavailable, mocked, or unparsable, the repo falls back to lexical change detection
- lexical fallback computes `mean_delta` with `difflib.SequenceMatcher`
- lexical fallback stops when `mean_delta <= 0.05`
`mean_delta` is still logged for compatibility, but in successful `llm_judge` mode it is diagnostic rather than the stop criterion.
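A sketch of the lexical change check, assuming answers are paired by agent position and `mean_delta` is one minus the average `SequenceMatcher` ratio:

```python
# Sketch: lexical no-meaningful-change detection over paired answers.
from difflib import SequenceMatcher

def lexical_no_meaningful_change(current: list[str], previous: list[str]) -> bool:
    pairs = list(zip(current, previous))    # pairing by position is an assumption
    if not pairs:
        return False                        # nothing comparable yet
    deltas = [1.0 - SequenceMatcher(None, cur, prev).ratio() for cur, prev in pairs]
    mean_delta = sum(deltas) / len(deltas)
    return mean_delta <= 0.05               # stop: answers barely changed since the last round
```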
The stage stops with `max_rounds_reached` when the topology-specific configured round or discussion limit has been exhausted.
Important:
- `mas.minimum_discussion_rounds` applies only to discussion/debate controllers
- outer collaboration cycles are controlled by `rounds`
- `rounds = 1` means one outer cycle; it does not force a second pass
The checks are applied in this order:
1. `invalid_or_failed_branch`
2. `consensus_reached`
3. `no_meaningful_change`
4. `max_rounds_reached`
So if multiple conditions are true, the first one in this list is the recorded stop reason.
Each termination decision logs:
- `reason`
- `reason_detail`
- `consensus_mode`
- `consensus_source`
- `consensus_ratio`
- `consensus_groups`
- `consensus_explanation`
- `progress_source`
- `progress_status`
- `expected_improvement`
- `progress_explanation`
- `average_confidence`
- `mean_delta`
- `valid_artifact_count`
- control-step `token_in`, `token_out`, `latency_ms`, `cost_usd` when an LLM judge call is used
These values are workflow-control diagnostics. They determine whether a collaboration loop continues, but they are not benchmark quality metrics like success_rate.
Each run writes a JSONL trace. A trace event contains:
- `timestamp_start`, `timestamp_end`
- `actor`
- `event_type`
- `payload`
- `token_in`, `token_out`
- `latency_ms`, `cost_usd`
- optional `state_id`
Supported event types:
`plan`, `act`, `tool_call`, `tool_result`, `verify`, `revise`, `finalize`, `error`
The schema is designed so all run-level trace metrics are recomputable from logs.
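For illustration only, a single event might look like the following; the values are made up and only the field names come from the schema above:

```python
# Hypothetical trace event (illustrative values, real schema field names).
event = {
    "timestamp_start": "2025-01-01T12:00:00.000Z",
    "timestamp_end": "2025-01-01T12:00:02.450Z",
    "actor": "worker_1",
    "event_type": "tool_call",
    "payload": {"tool": "web_search", "query": "..."},
    "token_in": 512,
    "token_out": 96,
    "latency_ms": 2450,
    "cost_usd": 0.0011,
    "state_id": "stage-2/step-7",   # optional
}
```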
For each run:
- `run_<n>.trace.jsonl`: raw trace events
- `run_<n>.answer.txt`: final answer text
- `run_<n>.metadata.json`: runtime metadata from the MAS execution
- `run_<n>.eval.json`: benchmark-native score and correctness
- `run_<n>.trace_metrics.json`: run-level outcome + trace totals + stage metrics
- `run_<n>.result.json`: compact run summary
- `run_<n>.trajectory.json` / `.md`: communication trajectory export
For each task:
- `descriptor.json`: aggregated task descriptor
- `descriptor.csv`: flat CSV version of the descriptor
- `analysis.json`: evaluation summary, descriptor, stage bottleneck hints
- `task_summary.json`: task-level summary across runs
For each system:
- `mas_graph.png` / `.mmd`: agent-topology graph
- `workflow_graph.png` / `.mmd`: workflow/control-flow graph
- `summary.json`: task summaries for the system
- `summary.csv`: one row per task for the system
For a hierarchical batch experiment:
- `artifacts/full_experiment/<experiment-id>/<benchmark>/<system>/<task_id>/...`
- benchmark/system rollups under the same root
- `experiment_summary.json` and `experiment_summary.csv` at the experiment root
- `benchmark/`: benchmark adapters and evaluation logic
- `benchmarks/`: benchmark overview docs
- `MAS/`: SAS/MAS runtimes and LangGraph topologies
- `descriptor/`: trace schema, run metrics, task descriptor aggregation, topology analysis
- `scripts/`: experiment and analysis helpers
- `config/`: experiment configs
- `main.py`: CLI entrypoint
```bash
uv sync
```

or

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

```bash
cp config/experiment.example.toml config/experiment.toml
```

OpenRouter credentials can be set in `config/experiment.toml` or through `OPENROUTER_API_KEY`.
```bash
python main.py list-benchmarks
```

```bash
python main.py benchmark-info --benchmark browsecomp --config config/experiment.toml
```

```bash
python main.py run \
    --config config/experiment.toml \
    --benchmark browsecomp \
    --task-limit 1 \
    --runs-per-task 1
```

```bash
python main.py summarize-experiment --experiment-root artifacts/full_experiment/<experiment-id>
```

The main batch wrapper is:
```bash
bash scripts/full_experiment.sh
```

Useful environment variables:

- `TASK_LIMIT`
- `RUNS_PER_TASK`
- `BENCHMARKS` (optional; when unset the wrapper runs all discovered benchmark configs)
- `EXPERIMENT_ID`
- `OUTPUT_ROOT`
Useful CLI patterns:
```bash
bash scripts/full_experiment.sh --list-benchmarks
bash scripts/full_experiment.sh --benchmark workbench --benchmark scicode
bash scripts/full_experiment.sh --benchmarks browsecomp,workbench
RUNS_PER_TASK=8 bash scripts/full_experiment.sh --benchmarks browsecomp,workbench
```

Hierarchical outputs are written under `artifacts/full_experiment/<experiment-id>/<benchmark>/<system>/<task_id>/`.
`descriptor/topology_analysis.py` provides:
- descriptor scaling
- Mahalanobis distance
- Pareto frontier extraction
- PCA / optional UMAP embeddings
This is useful when comparing SAS and multiple MAS topologies over the same task set.
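A minimal sketch of Pareto-frontier extraction over per-system (quality, cost) points, e.g. `success_rate` vs. `tokens_total`; this is illustrative and not the `topology_analysis.py` implementation:

```python
# Sketch: keep the systems that are not dominated on (quality up, cost down).
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """points = (system_name, quality, cost); higher quality and lower cost dominate."""
    frontier = []
    for name, q, c in points:
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for _, q2, c2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier
```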
See benchmarks/README.md for the paper-aligned benchmark map, benchmark-specific success definitions, and setup notes.
- Canonical benchmark package: `benchmark/`
- `benchmarks/` is a documentation / compatibility shim