fix(session): write results.json when scorer crashes#216
Open
elronbandel wants to merge 2 commits into
Open
Conversation
When a benchmark's `session.score()` raises (e.g. tau2 rejecting a malformed AssistantMessage), the exception propagated out of `run_session`, the worker died, and the session dir was left half-written: agent/, benchmark/, otel.log, otel_spans.jsonl, session.json, trajectory.jsonl — but no results.json. On the next pass, `SessionStatus.from_config` then sees a session_dir with no results.json and classifies it INCOMPLETE, which puts it back into `to_run`. The next run hits the same deterministic crash. The session loops forever, burning compute and writing nothing. Wrap the `session.score()` call in the AgentTermination / BenchmarkTermination handler with try/except. On exception, route to `tracker.on_session_error` (wrapping the underlying exc as a `BenchmarkError`) so observers — including the results writer — record the outcome. The session ends in ERROR state on disk and the orchestrator can move on. Tests: - New `test_session_error_when_score_raises` in tests/core/test_session_observer_lifecycle.py — uses a session whose score() raises, asserts on_session_scoring → on_session_error is the observed lifecycle, and on_session_success is NOT called. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Per review: - Replace the early `return` with an `else` clause for cleaner control flow — the success path is now explicit. - Switch from `_close_session_agent(session, agent_instance)` to just `agent_instance.close()` to mirror the existing `except BenchmarkError` handler. The benchmark already raised in scoring; `session.close()` may not be safe. Same behavior, smaller surface, fits the existing handler pattern. Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
When a benchmark's `session.score()` raises (we hit this with tau2 rejecting a malformed AssistantMessage), the exception propagates out of `run_session`, the worker dies, and the session dir is left half-written:
```
session_dir/
├── agent/
├── benchmark/
├── config.json
├── otel.log
├── otel_spans.jsonl
├── session.json
├── trajectory.jsonl
└── (no results.json)
```
On the next pass, `SessionStatus.from_config` sees a session_dir with no `results.json` and classifies it `INCOMPLETE` → goes to `to_run` → re-run → same deterministic crash. The session loops forever, burning compute and writing nothing useful.
Fix
Wrap `session.score()` in the `AgentTermination/BenchmarkTermination` handler with try/except. On exception, route to `tracker.on_session_error` (wrapping the underlying exception as a `BenchmarkError`). Observers — including the results writer — record the outcome, the session ends in `ERROR` state on disk, and the orchestrator can move on.
```python
try:
score = session.score()
except Exception as exc:
tracker.on_session_error(session, BenchmarkError(exc))
_close_session_agent(session, agent_instance)
return
```
Tests
Caveat
This catches generic `Exception` from the scorer. That's deliberate: a benchmark's scoring code is third-party and can fail in many ways; we don't want to enumerate them. The trade-off is masking a buggy `tracker.on_session_error` if it itself raises — but that's a problem regardless.