
fix(tau2): zero-fill missing/empty sessions in aggregate score #224

Open
elronbandel wants to merge 2 commits into main from fix/tau2-aggregate-zero-fill-missing

@elronbandel

Summary

  • Closes #223 (Layer 2 — the tau2-specific hidden denominator).
  • Tau2's aggregate_sessions was silently dropping sessions that had no simulation file, while total_tasks kept reporting the planned count. As a result, the headline benchmark_score was averaged over a smaller denominator than every other field implied (e.g. Kimi+claude_code+tau2/retail returned 1.0 over 3/100 sessions).

What changed

  • Iterate over sessions directly in TAU2Evaluator.aggregate_sessions, counting aggregated_sessions as the actual contributor count.
  • Rescale avg_reward from compute_metrics to the planned denominator (total_sessions) — missing/empty sessions contribute reward=0, matching the per-session score() contract introduced in #152.
  • Expose aggregated_sessions in the metrics dict and logger.warning(...) when it diverges from total_tasks, so downstream readers can detect partial coverage without grepping logs.
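
The rescale described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `rescale_to_planned` and its parameter names are hypothetical, and it assumes `avg_reward` is a plain mean over the sessions that actually contributed.

```python
def rescale_to_planned(avg_reward: float, contributing: int, planned: int) -> float:
    """Convert a mean over contributing sessions into a mean over planned sessions.

    Hypothetical helper: sessions that are missing or empty are treated as
    reward 0, so only the denominator changes, not the reward sum.
    """
    if planned <= 0:
        # Nothing was planned, so there is nothing to average over.
        return 0.0
    if contributing > planned:
        raise ValueError("more contributing sessions than planned sessions")
    # avg_reward * contributing recovers the reward sum; divide by the planned count.
    return avg_reward * contributing / planned


# The case cited in the summary: 1.0 averaged over 3 of 100 planned sessions.
inflated_case = rescale_to_planned(1.0, contributing=3, planned=100)
```

Under these assumptions the 1.0-over-3/100 run rescales to 0.03, which is the value a zero-filled mean over all 100 planned sessions would give.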

Why this shape

Every other benchmark adapter (swebench, bfcl, browsecompplus, gsm8k, hotpotqa, appworld) raises on a missing planned-session file; tau2 was the only fail-quiet aggregator after #152. #152 correctly made the per-session score() return 0 for the "agent returned text instead of tool calls" case, but it also switched the aggregator from a hard error to a silent skip, so the two paths disagreed. This PR makes them agree without re-introducing the original #152 crash.

Layer 1 (core_aggregate pre-filtering to COMPLETED) is left for a follow-up — once tau2 is honest, the residual gap is just non-COMPLETED sessions, which successful_sessions / total_sessions already exposes uniformly.

Test plan

  • uv run pytest tests/benchmarks/test_tau2_aggregate.py -v — 5/5 pass, including a new test_missing_file_counts_as_zero_reward and the updated test_empty_sessions_count_as_zero_reward (now asserting score=0.4 over 2 sessions instead of 0.8 over 1).
  • uv run ruff check and ruff format --check clean.
  • Re-aggregate an existing affected run (exgentic evaluate --aggregate-only ...) and confirm benchmark_score drops from the inflated value to the planned-denominator value.
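
For reference, the arithmetic behind the updated assertion (0.4 over 2 sessions rather than 0.8 over 1) can be reproduced with a small zero-fill sketch. The helper name, session IDs, and dict layout below are hypothetical and are not taken from tests/benchmarks/test_tau2_aggregate.py.

```python
from typing import Mapping, Sequence


def zero_filled_mean(planned_ids: Sequence[str], rewards: Mapping[str, float]) -> float:
    """Mean reward over every planned session, counting absent sessions as 0."""
    if not planned_ids:
        return 0.0
    # rewards.get(..., 0.0) is the zero-fill: a planned session with no
    # recorded reward still occupies a slot in the denominator.
    return sum(rewards.get(sid, 0.0) for sid in planned_ids) / len(planned_ids)


# Two planned sessions, only one of which produced a reward (illustrative data).
planned = ["session-a", "session-b"]
rewards = {"session-a": 0.8}

score = zero_filled_mean(planned, rewards)                 # mean over 2 planned sessions
skipping_score = sum(rewards.values()) / len(rewards)      # mean over 1 contributor only
```

Here `score` is the planned-denominator value and `skipping_score` is what a silent-skip aggregator would report for the same data.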

Closes #223 (Layer 2). The aggregator silently dropped sessions
without a simulation file, while `total_tasks` kept reporting the
planned count. The headline `benchmark_score` ended up averaged
over a smaller denominator than every other field implied — e.g.
Kimi+claude_code+tau2/retail returned 1.0 over 3/100 sessions.

PR #152 introduced the divergence: it correctly made the per-session
`score()` return 0 when no simulations were produced (legitimate
"agent returned text instead of tool calls" outcome), but in the
aggregator it switched from a hard error to a silent skip. Every
other benchmark adapter (swebench, bfcl, browsecompplus, gsm8k,
hotpotqa, appworld) raises on a missing planned-session file —
tau2 was the only fail-quiet aggregator.

Fix: rescale the aggregator's avg_reward to the planned denominator
so missing/empty sessions contribute reward=0, matching the
per-session contract. Also expose `aggregated_sessions` in metrics
and warn when it diverges from `total_tasks`, so downstream readers
can detect partial coverage without grepping logs.

Layer 1 (`core_aggregate` filtering to COMPLETED) is left for a
follow-up — once tau2 is honest, the residual gap is just non-
COMPLETED sessions, which `successful_sessions / total_sessions`
already exposes.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

Drop the aggregated_sessions counter and the partial-coverage
warning — neither is load-bearing for the score fix, and the
warning duplicates orchestrator-level logging.

Collapses the two BenchmarkResults returns into one, leaving just
the rescale that fixes the hidden denominator.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

Development

Successfully merging this pull request may close these issues.

benchmark_score has hidden denominator: orchestrator pre-filters to COMPLETED before aggregating