Skip to content

fix(session): write results.json when scorer crashes#216

Open
elronbandel wants to merge 2 commits into
mainfrom
scoring-error-results
Open

fix(session): write results.json when scorer crashes#216
elronbandel wants to merge 2 commits into
mainfrom
scoring-error-results

Conversation

@elronbandel

Copy link
Copy Markdown
Contributor

Why

When a benchmark's `session.score()` raises (we hit this with tau2 rejecting a malformed AssistantMessage), the exception propagates out of `run_session`, the worker dies, and the session dir is left half-written:

```
session_dir/
├── agent/
├── benchmark/
├── config.json
├── otel.log
├── otel_spans.jsonl
├── session.json
├── trajectory.jsonl
└── (no results.json)
```

On the next pass, `SessionStatus.from_config` sees a session_dir with no `results.json` and classifies it `INCOMPLETE` → goes to `to_run` → re-run → same deterministic crash. The session loops forever, burning compute and writing nothing useful.

Fix

Wrap `session.score()` in the `AgentTermination/BenchmarkTermination` handler with try/except. On exception, route to `tracker.on_session_error` (wrapping the underlying exception as a `BenchmarkError`). Observers — including the results writer — record the outcome, the session ends in `ERROR` state on disk, and the orchestrator can move on.

```python
try:
score = session.score()
except Exception as exc:
tracker.on_session_error(session, BenchmarkError(exc))
_close_session_agent(session, agent_instance)
return
```

Tests

  • `test_session_error_when_score_raises`: a session whose `score()` raises must produce `on_session_scoring` → `on_session_error` (not `on_session_success`). All 3 existing tests in the file still pass.

Caveat

This catches generic `Exception` from the scorer. That's deliberate: a benchmark's scoring code is third-party and can fail in many ways; we don't want to enumerate them. The trade-off is masking a buggy `tracker.on_session_error` if it itself raises — but that's a problem regardless.

When a benchmark's `session.score()` raises (e.g. tau2 rejecting a
malformed AssistantMessage), the exception propagated out of
`run_session`, the worker died, and the session dir was left
half-written: agent/, benchmark/, otel.log, otel_spans.jsonl,
session.json, trajectory.jsonl — but no results.json.

On the next pass, `SessionStatus.from_config` then sees a session_dir
with no results.json and classifies it INCOMPLETE, which puts it back
into `to_run`. The next run hits the same deterministic crash. The
session loops forever, burning compute and writing nothing.

Wrap the `session.score()` call in the AgentTermination /
BenchmarkTermination handler with try/except. On exception, route
to `tracker.on_session_error` (wrapping the underlying exc as a
`BenchmarkError`) so observers — including the results writer —
record the outcome. The session ends in ERROR state on disk and
the orchestrator can move on.

Tests:
- New `test_session_error_when_score_raises` in
  tests/core/test_session_observer_lifecycle.py — uses a session
  whose score() raises, asserts on_session_scoring → on_session_error
  is the observed lifecycle, and on_session_success is NOT called.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Per review:
- Replace the early `return` with an `else` clause for cleaner
  control flow — the success path is now explicit.
- Switch from `_close_session_agent(session, agent_instance)` to
  just `agent_instance.close()` to mirror the existing
  `except BenchmarkError` handler. The benchmark already raised in
  scoring; `session.close()` may not be safe.

Same behavior, smaller surface, fits the existing handler pattern.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant