Problem
During a long continuous supervision session, firstmate (Opus 4.8 — claude-opus-4-8[1m], Claude Code 2.1.193) emitted malformed tool calls: the tool-call decoder flipped to legacy Anthropic XML syntax (<invoke name="Read"><parameter name="file_path">…</parameter></invoke>, prefixed by a stray junk token) instead of the harness's expected format. This happened twice in one session.
- Once the raw
<invoke>…</invoke> text was rendered verbatim into the captain-visible transcript, the turn ended with no tool executed and no retry — to the captain it looked like firstmate had hung ("I'm not receiving anything back").
- Once the harness caught it ("Your tool call was malformed and could not be parsed. Please retry.") and recovery was clean.
Root cause (upstream)
This is the decoder-level model/harness regression tracked in anthropics/claude-code#49747 — "Opus 4.7 mixes legacy XML tool-use format into JSON tool calls on longer payloads." The upstream reporter confirmed it is not fixable from the prompt layer. Our session corroborates and broadens it:
- Persists on Opus 4.8 (not just 4.7).
- Hits built-in tools (
Read), whole call as XML — not MCP-specific.
- Correlates with total context length (~500k tokens accumulated), even when the tool-call arguments are trivial (a single file path). So total context — not per-call payload size — is at least one independent trigger.
Why firstmate is unusually exposed
firstmate's core loop is long-lived by design: continuous supervision, heartbeats, and many small repeated tool calls (peek / wake-drain / status reads / watcher re-arm) that accumulate context indefinitely within a single session. That is exactly the long-context regime where the decoder-flip rate climbs, so firstmate trips this more often than a typical short session would.
Proposed mitigation: context-threshold session breaking
firstmate already guarantees that "a restart must be a non-event" (AGENTS.md §5): all truth lives in tmux, state/, data/backlog.md, data/secondmates.md, persistent secondmate homes, and treehouse — conversation memory is only a cache. We can exploit that invariant:
- Add a context-size guard: when firstmate's own context crosses a threshold (token count / context-% — observed trouble near ~500k tokens), proactively break the session and let the existing bootstrap + recovery protocol rebuild state in a fresh, smaller context.
- Minimum viable version: a guard / heartbeat line that warns the captain ("context is large; recommend a recovery-safe reset") and relies on the captain (or the
/afk daemon) to /clear, after which recovery reconstructs everything.
- Stretch: an automatic recovery-safe reset where the harness permits an agent to trigger its own
/clear/compaction.
- Net effect: caps the context regime that triggers the upstream glitch, and also lowers cost/latency on marathon sessions.
Open questions
- Threshold value, and how firstmate reads its own context size portably across harnesses (claude/codex/opencode/pi).
- Can an agent trigger its own recovery-safe
/clear, or must this be captain-/daemon-initiated?
- Interaction with the
/afk daemon (it owns the watcher while state/.afk exists) and with in-flight crewmates mid-session.
Evidence
Problem
During a long continuous supervision session, firstmate (Opus 4.8 —
claude-opus-4-8[1m], Claude Code 2.1.193) emitted malformed tool calls: the tool-call decoder flipped to legacy Anthropic XML syntax (<invoke name="Read"><parameter name="file_path">…</parameter></invoke>, prefixed by a stray junk token) instead of the harness's expected format. This happened twice in one session.<invoke>…</invoke>text was rendered verbatim into the captain-visible transcript, the turn ended with no tool executed and no retry — to the captain it looked like firstmate had hung ("I'm not receiving anything back").Root cause (upstream)
This is the decoder-level model/harness regression tracked in anthropics/claude-code#49747 — "Opus 4.7 mixes legacy XML tool-use format into JSON tool calls on longer payloads." The upstream reporter confirmed it is not fixable from the prompt layer. Our session corroborates and broadens it:
Read), whole call as XML — not MCP-specific.Why firstmate is unusually exposed
firstmate's core loop is long-lived by design: continuous supervision, heartbeats, and many small repeated tool calls (peek / wake-drain / status reads / watcher re-arm) that accumulate context indefinitely within a single session. That is exactly the long-context regime where the decoder-flip rate climbs, so firstmate trips this more often than a typical short session would.
Proposed mitigation: context-threshold session breaking
firstmate already guarantees that "a restart must be a non-event" (AGENTS.md §5): all truth lives in tmux,
state/,data/backlog.md,data/secondmates.md, persistent secondmate homes, and treehouse — conversation memory is only a cache. We can exploit that invariant:/afkdaemon) to/clear, after which recovery reconstructs everything./clear/compaction.Open questions
/clear, or must this be captain-/daemon-initiated?/afkdaemon (it owns the watcher whilestate/.afkexists) and with in-flight crewmates mid-session.Evidence
claude-opus-4-8[1m], Claude Code 2.1.193; malformed<invoke name="Read">leaked to the captain twice at ~500k tokens of context.