fix(tau2): zero-fill missing/empty sessions in aggregate score#224
Open
elronbandel wants to merge 2 commits into
Open
fix(tau2): zero-fill missing/empty sessions in aggregate score#224elronbandel wants to merge 2 commits into
elronbandel wants to merge 2 commits into
Conversation
Closes #223 (Layer 2). The aggregator silently dropped sessions without a simulation file, while `total_tasks` kept reporting the planned count. The headline `benchmark_score` ended up averaged over a smaller denominator than every other field implied — e.g. Kimi+claude_code+tau2/retail returned 1.0 over 3/100 sessions. PR #152 introduced the divergence: it correctly made the per-session `score()` return 0 when no simulations were produced (legitimate "agent returned text instead of tool calls" outcome), but in the aggregator it switched from a hard error to a silent skip. Every other benchmark adapter (swebench, bfcl, browsecompplus, gsm8k, hotpotqa, appworld) raises on a missing planned-session file — tau2 was the only fail-quiet aggregator. Fix: rescale the aggregator's avg_reward to the planned denominator so missing/empty sessions contribute reward=0, matching the per-session contract. Also expose `aggregated_sessions` in metrics and warn when it diverges from `total_tasks`, so downstream readers can detect partial coverage without grepping logs. Layer 1 (`core_aggregate` filtering to COMPLETED) is left for a follow-up — once tau2 is honest, the residual gap is just non- COMPLETED sessions, which `successful_sessions / total_sessions` already exposes. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Drop the aggregated_sessions counter and the partial-coverage warning — neither is load-bearing for the score fix, and the warning duplicates orchestrator-level logging. Collapses the two BenchmarkResults returns into one, leaving just the rescale that fixes the hidden denominator. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
aggregate_sessionswas silently dropping sessions without a simulation file, whiletotal_taskskept reporting the planned count. Headlinebenchmark_scorewas averaged over a smaller denominator than every other field implied (e.g.Kimi+claude_code+tau2/retailreturned1.0over 3/100 sessions).What changed
sessionsdirectly inTAU2Evaluator.aggregate_sessions, countingaggregated_sessionsas the actual contributor count.avg_rewardfromcompute_metricsto the planned denominator (total_sessions) — missing/empty sessions contribute reward=0, matching the per-sessionscore()contract introduced in #152.aggregated_sessionsin the metrics dict andlogger.warning(...)when it diverges fromtotal_tasks, so downstream readers can detect partial coverage without grepping logs.Why this shape
Every other benchmark adapter (
swebench,bfcl,browsecompplus,gsm8k,hotpotqa,appworld) raises on a missing planned-session file — tau2 was the only fail-quiet aggregator after #152. #152 correctly made the per-sessionscore()return 0 for "agent returned text instead of tool calls", but in the aggregator it switched from a hard error to a silent skip — those two paths disagreed. This PR makes them agree without re-introducing the original #152 crash.Layer 1 (
core_aggregatepre-filtering toCOMPLETED) is left for a follow-up — once tau2 is honest, the residual gap is just non-COMPLETED sessions, whichsuccessful_sessions / total_sessionsalready exposes uniformly.Test plan
uv run pytest tests/benchmarks/test_tau2_aggregate.py -v— 5/5 pass, including a newtest_missing_file_counts_as_zero_rewardand the updatedtest_empty_sessions_count_as_zero_reward(now asserting score=0.4 over 2 sessions instead of 0.8 over 1).uv run ruff checkandruff format --checkclean.exgentic evaluate --aggregate-only ...) and confirmbenchmark_scoredrops from the inflated value to the planned-denominator value.