Skip to content

Long-session context growth triggers Opus tool-call decoder-flip (malformed calls leak to captain); add context-threshold session breaking #95

Description

@wenindoubt

Problem

During a long continuous supervision session, firstmate (Opus 4.8 — claude-opus-4-8[1m], Claude Code 2.1.193) emitted malformed tool calls: the tool-call decoder flipped to legacy Anthropic XML syntax (<invoke name="Read"><parameter name="file_path">…</parameter></invoke>, prefixed by a stray junk token) instead of the harness's expected format. This happened twice in one session.

  • Once the raw <invoke>…</invoke> text was rendered verbatim into the captain-visible transcript, the turn ended with no tool executed and no retry — to the captain it looked like firstmate had hung ("I'm not receiving anything back").
  • Once the harness caught it ("Your tool call was malformed and could not be parsed. Please retry.") and recovery was clean.

Root cause (upstream)

This is the decoder-level model/harness regression tracked in anthropics/claude-code#49747 — "Opus 4.7 mixes legacy XML tool-use format into JSON tool calls on longer payloads." The upstream reporter confirmed it is not fixable from the prompt layer. Our session corroborates and broadens it:

  • Persists on Opus 4.8 (not just 4.7).
  • Hits built-in tools (Read), whole call as XML — not MCP-specific.
  • Correlates with total context length (~500k tokens accumulated), even when the tool-call arguments are trivial (a single file path). So total context — not per-call payload size — is at least one independent trigger.

Why firstmate is unusually exposed

firstmate's core loop is long-lived by design: continuous supervision, heartbeats, and many small repeated tool calls (peek / wake-drain / status reads / watcher re-arm) that accumulate context indefinitely within a single session. That is exactly the long-context regime where the decoder-flip rate climbs, so firstmate trips this more often than a typical short session would.

Proposed mitigation: context-threshold session breaking

firstmate already guarantees that "a restart must be a non-event" (AGENTS.md §5): all truth lives in tmux, state/, data/backlog.md, data/secondmates.md, persistent secondmate homes, and treehouse — conversation memory is only a cache. We can exploit that invariant:

  • Add a context-size guard: when firstmate's own context crosses a threshold (token count / context-% — observed trouble near ~500k tokens), proactively break the session and let the existing bootstrap + recovery protocol rebuild state in a fresh, smaller context.
  • Minimum viable version: a guard / heartbeat line that warns the captain ("context is large; recommend a recovery-safe reset") and relies on the captain (or the /afk daemon) to /clear, after which recovery reconstructs everything.
  • Stretch: an automatic recovery-safe reset where the harness permits an agent to trigger its own /clear/compaction.
  • Net effect: caps the context regime that triggers the upstream glitch, and also lowers cost/latency on marathon sessions.

Open questions

  • Threshold value, and how firstmate reads its own context size portably across harnesses (claude/codex/opencode/pi).
  • Can an agent trigger its own recovery-safe /clear, or must this be captain-/daemon-initiated?
  • Interaction with the /afk daemon (it owns the watcher while state/.afk exists) and with in-flight crewmates mid-session.

Evidence

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions