Tracking issue — deliberately NOT closed by the current fix PR (each item is a standalone refactor):
benchmark.py:_dispatch duplicates dispatch.py:_dispatch_one (primary→fallback pipeline). The two have already drifted twice (exception isolation went to dispatch only; scoring semantics to benchmark only). → shared dispatch_with_fallback() in _common.py.
- Cost model exists 3×:
estimate_cost.py, benchmark.py inline gate (which additionally honors ARGUS_YES_COST and skips warn exit semantics), bench_cost.py. → estimate_roster_cost() in _common.py.
- Five near-identical stdin-CLI adapters (gemini/codex/claude/opencode/copilot) — the ARG_MAX fix needed two separate commits to port between them. → one factory
make_stdin_cli_adapter(route_key, extra_env, template_vars).
- Failed-call zero-scoring lives in 3 hand-synced copies (benchmark wall-cap stub, benchmark exit-code branch, aggregate_bench._rescore_run). → shared
score_run().
- Leaderboard markdown rendered by both
benchmark.py:_write_outputs and aggregate_bench.py:_leaderboard_md with already-divergent columns. → shared renderer.
Also: per-reviewer benchmark calls are fully serialized (single-reviewer bench wastes the max_parallel budget; interacts with --max-wall-sec semantics) and detect_host signals could move to config beside host_rules.
From /code-review rounds 1+2 (reuse + altitude angles, all verified against HEAD).
Tracking issue — deliberately NOT closed by the current fix PR (each item is a standalone refactor):
benchmark.py:_dispatchduplicatesdispatch.py:_dispatch_one(primary→fallback pipeline). The two have already drifted twice (exception isolation went to dispatch only; scoring semantics to benchmark only). → shareddispatch_with_fallback()in _common.py.estimate_cost.py,benchmark.pyinline gate (which additionally honors ARGUS_YES_COST and skips warn exit semantics),bench_cost.py. →estimate_roster_cost()in _common.py.make_stdin_cli_adapter(route_key, extra_env, template_vars).score_run().benchmark.py:_write_outputsandaggregate_bench.py:_leaderboard_mdwith already-divergent columns. → shared renderer.Also: per-reviewer benchmark calls are fully serialized (single-reviewer bench wastes the max_parallel budget; interacts with --max-wall-sec semantics) and detect_host signals could move to config beside host_rules.
From /code-review rounds 1+2 (reuse + altitude angles, all verified against HEAD).