fix(session): write results.json when scorer crashes by elronbandel · Pull Request #216 · Exgentic/exgentic

elronbandel · 2026-05-03T13:50:50Z

Why

When a benchmark's `session.score()` raises (we hit this with tau2 rejecting a malformed AssistantMessage), the exception propagates out of `run_session`, the worker dies, and the session dir is left half-written:

```
session_dir/
├── agent/
├── benchmark/
├── config.json
├── otel.log
├── otel_spans.jsonl
├── session.json
├── trajectory.jsonl
└── (no results.json)
```

On the next pass, `SessionStatus.from_config` sees a session_dir with no `results.json` and classifies it `INCOMPLETE` → goes to `to_run` → re-run → same deterministic crash. The session loops forever, burning compute and writing nothing useful.
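The loop can be illustrated with a minimal sketch. `SketchStatus`, `classify`, and `sessions_to_run` are hypothetical stand-ins for `SessionStatus.from_config` and the orchestrator's `to_run` selection, and the content written to `results.json` is made up; only the "no `results.json` means `INCOMPLETE`" rule comes from the description above.

```python
import tempfile
from enum import Enum
from pathlib import Path


class SketchStatus(Enum):
    # Hypothetical states for illustration only; the real enum may differ.
    INCOMPLETE = "incomplete"
    HAS_RESULTS = "has_results"


def classify(session_dir: Path) -> SketchStatus:
    """Stand-in for the results.json check described above."""
    if (session_dir / "results.json").exists():
        return SketchStatus.HAS_RESULTS
    return SketchStatus.INCOMPLETE


def sessions_to_run(session_dirs):
    """Every directory without results.json is scheduled again."""
    return [d for d in session_dirs if classify(d) is SketchStatus.INCOMPLETE]


with tempfile.TemporaryDirectory() as tmp:
    crashed = Path(tmp) / "crashed_session"
    crashed.mkdir()
    (crashed / "session.json").write_text("{}")  # partial output, no results.json

    # Without an error result on disk, the crashed session is re-queued on every pass.
    requeued_before = sessions_to_run([crashed])

    # Once some outcome is written (payload here is invented), the pass skips it.
    (crashed / "results.json").write_text('{"status": "error"}')
    requeued_after = sessions_to_run([crashed])
```

Since the crash is deterministic, nothing between passes changes the first outcome, which is why the session never leaves `to_run` unless a result is written.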

Fix

Wrap `session.score()` in the `AgentTermination/BenchmarkTermination` handler with try/except. On exception, route to `tracker.on_session_error` (wrapping the underlying exception as a `BenchmarkError`). Observers — including the results writer — record the outcome, the session ends in `ERROR` state on disk, and the orchestrator can move on.

```python
try:
    score = session.score()
except Exception as exc:
    tracker.on_session_error(session, BenchmarkError(exc))
    _close_session_agent(session, agent_instance)
    return
```

Tests

`test_session_error_when_score_raises`: a session whose `score()` raises must produce `on_session_scoring` → `on_session_error` (not `on_session_success`). All 3 existing tests in the file still pass.
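A rough sketch of what such a test can look like, using `unittest.mock` rather than the project's real fixtures. `finish_session` is a hypothetical reduction of the scoring step, the local `BenchmarkError` is a stand-in class, and the observer method signatures are assumptions; the actual test in the PR may be structured differently.

```python
from unittest.mock import MagicMock


class BenchmarkError(Exception):
    """Stand-in for the project's BenchmarkError."""


def finish_session(session, tracker):
    # Hypothetical reduction of the scoring step; signatures are assumed.
    tracker.on_session_scoring(session)
    try:
        score = session.score()
    except Exception as exc:
        tracker.on_session_error(session, BenchmarkError(exc))
    else:
        tracker.on_session_success(session, score)


def test_session_error_when_score_raises():
    session = MagicMock()
    session.score.side_effect = ValueError("malformed AssistantMessage")
    tracker = MagicMock()

    finish_session(session, tracker)

    # method_calls records (name, args, kwargs) in call order.
    observed = [call[0] for call in tracker.method_calls]
    assert observed == ["on_session_scoring", "on_session_error"]
    tracker.on_session_success.assert_not_called()

    # The original exception should arrive wrapped, not raw.
    wrapped = tracker.on_session_error.call_args.args[1]
    assert isinstance(wrapped, BenchmarkError)
```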

Caveat

This catches generic `Exception` from the scorer. That's deliberate: a benchmark's scoring code is third-party and can fail in many ways; we don't want to enumerate them. The trade-off is masking a buggy `tracker.on_session_error` if it itself raises — but that's a problem regardless.
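One property of this choice worth keeping in mind: in Python, `except Exception` does not catch `KeyboardInterrupt` or `SystemExit`, since both derive from `BaseException`. A small sketch (the `guarded` helper and the two scorer functions are hypothetical, not project code) shows that ordinary scorer failures are recorded while an interrupt still propagates:

```python
def guarded(fn):
    """Run fn and report how a failure was handled."""
    try:
        fn()
    except Exception as exc:
        # Ordinary errors from third-party scoring code land here.
        return f"recorded:{type(exc).__name__}"
    return "ok"


def bad_scorer():
    raise KeyError("missing field")


def interrupted_scorer():
    raise KeyboardInterrupt


result = guarded(bad_scorer)

try:
    guarded(interrupted_scorer)
    interrupt_escaped = False
except KeyboardInterrupt:
    # Not an Exception subclass, so the handler above did not swallow it.
    interrupt_escaped = True
```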

When a benchmark's `session.score()` raises (e.g. tau2 rejecting a malformed AssistantMessage), the exception propagated out of `run_session`, the worker died, and the session dir was left half-written: agent/, benchmark/, otel.log, otel_spans.jsonl, session.json, trajectory.jsonl, but no results.json.

On the next pass, `SessionStatus.from_config` then sees a session_dir with no results.json and classifies it INCOMPLETE, which puts it back into `to_run`. The next run hits the same deterministic crash. The session loops forever, burning compute and writing nothing.

Wrap the `session.score()` call in the AgentTermination / BenchmarkTermination handler with try/except. On exception, route to `tracker.on_session_error` (wrapping the underlying exc as a `BenchmarkError`) so observers, including the results writer, record the outcome. The session ends in ERROR state on disk and the orchestrator can move on.

Tests:
- New `test_session_error_when_score_raises` in tests/core/test_session_observer_lifecycle.py: uses a session whose score() raises, asserts on_session_scoring → on_session_error is the observed lifecycle, and on_session_success is NOT called.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

Per review:
- Replace the early `return` with an `else` clause for cleaner control flow; the success path is now explicit.
- Switch from `_close_session_agent(session, agent_instance)` to just `agent_instance.close()` to mirror the existing `except BenchmarkError` handler. The benchmark already raised in scoring; `session.close()` may not be safe.

Same behavior, smaller surface, fits the existing handler pattern.

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
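The reviewed shape, as far as it can be inferred from this commit message, looks roughly like the sketch below. Everything around the `try`/`except`/`else` is a hypothetical stand-in (`score_and_report`, the fake tracker, agent, and sessions); what the real success path does after scoring is not shown in the PR text, so it is reduced here to a single `on_session_success` call.

```python
class BenchmarkError(Exception):
    """Stand-in for the project's BenchmarkError."""


class RecordingTracker:
    """Hypothetical tracker that just records which hook fired."""

    def __init__(self):
        self.events = []

    def on_session_error(self, session, error):
        self.events.append(("error", type(error).__name__))

    def on_session_success(self, session, score):
        self.events.append(("success", score))


class FakeAgent:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class CrashingSession:
    def score(self):
        raise ValueError("malformed AssistantMessage")


class GoodSession:
    def score(self):
        return 1.0


def score_and_report(session, agent_instance, tracker):
    try:
        score = session.score()
    except Exception as exc:
        tracker.on_session_error(session, BenchmarkError(exc))
        # Close only the agent; session.close() may not be safe after a scoring crash.
        agent_instance.close()
    else:
        # Success path is explicit; its real cleanup is omitted (not shown in the source).
        tracker.on_session_success(session, score)


crash_tracker, crash_agent = RecordingTracker(), FakeAgent()
score_and_report(CrashingSession(), crash_agent, crash_tracker)

ok_tracker, ok_agent = RecordingTracker(), FakeAgent()
score_and_report(GoodSession(), ok_agent, ok_tracker)
```

With `else`, the success hook can only run when `score()` returned normally, so there is no need for an early `return` in the error branch.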

elronbandel added 2 commits May 3, 2026 16:50

elronbandel wants to merge 2 commits into main from scoring-error-results
