fix(tau2): zero-fill missing/empty sessions in aggregate score by elronbandel · Pull Request #224 · Exgentic/exgentic

elronbandel · 2026-05-04T19:48:55Z

Summary

Closes #223 (Layer 2 — the tau2-specific hidden denominator).
Tau2's aggregate_sessions was silently dropping sessions without a simulation file, while total_tasks kept reporting the planned count. Headline benchmark_score was averaged over a smaller denominator than every other field implied (e.g. Kimi+claude_code+tau2/retail returned 1.0 over 3/100 sessions).

What changed

Iterate over sessions directly in TAU2Evaluator.aggregate_sessions, counting aggregated_sessions as the actual contributor count.
Rescale avg_reward from compute_metrics to the planned denominator (total_sessions) — missing/empty sessions contribute reward=0, matching the per-session score() contract introduced in #152.
Expose aggregated_sessions in the metrics dict and logger.warning(...) when it diverges from total_tasks, so downstream readers can detect partial coverage without grepping logs.
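
The rescale described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `rescale_to_planned` and its parameters are hypothetical, and it assumes `compute_metrics` yields an `avg_reward` averaged only over the sessions that actually contributed a simulation.

```python
def rescale_to_planned(avg_reward: float, contributing: int, planned: int) -> float:
    """Re-express a mean over `contributing` sessions as a mean over `planned` sessions.

    Hypothetical helper: sessions that are missing or empty are treated as
    reward=0, so the reward sum is unchanged and only the denominator grows.
    """
    if planned <= 0:
        # No planned sessions: nothing to average over.
        return 0.0
    if not 0 <= contributing <= planned:
        raise ValueError("contributing must be between 0 and planned")
    # avg_reward * contributing recovers the reward sum over contributing sessions.
    return avg_reward * contributing / planned


# The example from the summary: 3 of 100 planned sessions, all with reward 1.0.
print(rescale_to_planned(1.0, 3, 100))  # 0.03
```

With this shape, a run where every planned session contributed is left unchanged, because `contributing == planned` makes the factor 1.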

Why this shape

Every other benchmark adapter (swebench, bfcl, browsecompplus, gsm8k, hotpotqa, appworld) raises on a missing planned-session file — tau2 was the only fail-quiet aggregator after #152. #152 correctly made the per-session score() return 0 for "agent returned text instead of tool calls", but in the aggregator it switched from a hard error to a silent skip — those two paths disagreed. This PR makes them agree without re-introducing the original #152 crash.

Layer 1 (core_aggregate pre-filtering to COMPLETED) is left for a follow-up — once tau2 is honest, the residual gap is just non-COMPLETED sessions, which successful_sessions / total_sessions already exposes uniformly.

Test plan

uv run pytest tests/benchmarks/test_tau2_aggregate.py -v — 5/5 pass, including a new test_missing_file_counts_as_zero_reward and the updated test_empty_sessions_count_as_zero_reward (now asserting score=0.4 over 2 sessions instead of 0.8 over 1).
uv run ruff check and ruff format --check clean.
Re-aggregate an existing affected run (exgentic evaluate --aggregate-only ...) and confirm benchmark_score drops from the inflated value to the planned-denominator value.
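
For readers who want to see what the zero-reward assertions amount to, here is a self-contained sketch. It does not reproduce the tests in `tests/benchmarks/test_tau2_aggregate.py`: the function `zero_filled_mean`, the test names, and the use of `None` to mark a missing or empty session are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def zero_filled_mean(rewards: Sequence[Optional[float]]) -> float:
    """Mean reward over all planned sessions (hypothetical stand-in).

    `None` marks a planned session with no simulation file or no simulations;
    it counts as reward 0 instead of being dropped from the denominator.
    """
    if not rewards:
        return 0.0
    total = sum(r if r is not None else 0.0 for r in rewards)
    return total / len(rewards)


def test_sketch_empty_session_counts_as_zero_reward():
    # One session scored 0.8, one planned session produced nothing:
    # the score is 0.4 over 2 sessions, not 0.8 over 1.
    assert math.isclose(zero_filled_mean([0.8, None]), 0.4)


def test_sketch_full_coverage_is_unchanged():
    # Illustrative values: with no gaps, zero-filling changes nothing.
    assert math.isclose(zero_filled_mean([1.0, 0.0, 0.5]), 0.5)


if __name__ == "__main__":
    test_sketch_empty_session_counts_as_zero_reward()
    test_sketch_full_coverage_is_unchanged()
```

Averaging over the zero-filled list gives the same number as rescaling the partial mean by contributing/planned, which is why the fix can be expressed either way.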

Closes #223 (Layer 2).

The aggregator silently dropped sessions without a simulation file, while `total_tasks` kept reporting the planned count. The headline `benchmark_score` ended up averaged over a smaller denominator than every other field implied, e.g. Kimi+claude_code+tau2/retail returned 1.0 over 3/100 sessions.

PR #152 introduced the divergence: it correctly made the per-session `score()` return 0 when no simulations were produced (legitimate "agent returned text instead of tool calls" outcome), but in the aggregator it switched from a hard error to a silent skip. Every other benchmark adapter (swebench, bfcl, browsecompplus, gsm8k, hotpotqa, appworld) raises on a missing planned-session file; tau2 was the only fail-quiet aggregator.

Fix: rescale the aggregator's avg_reward to the planned denominator so missing/empty sessions contribute reward=0, matching the per-session contract. Also expose `aggregated_sessions` in metrics and warn when it diverges from `total_tasks`, so downstream readers can detect partial coverage without grepping logs.

Layer 1 (`core_aggregate` filtering to COMPLETED) is left for a follow-up. Once tau2 is honest, the residual gap is just non-COMPLETED sessions, which `successful_sessions / total_sessions` already exposes.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

Drop the aggregated_sessions counter and the partial-coverage warning: neither is load-bearing for the score fix, and the warning duplicates orchestrator-level logging. Collapses the two BenchmarkResults returns into one, leaving just the rescale that fixes the hidden denominator.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

elronbandel added 2 commits May 4, 2026 22:48


elronbandel wants to merge 2 commits into main from fix/tau2-aggregate-zero-fill-missing
