Auto history compaction does not trigger before context-full LLM errors #69

## Problem

Auto history compaction appears to have regressed in long TUI chat sessions.

Observed behavior from `yarn chat:dev`:

- The footer can show the request context as full, for example:

```text
estimated input 1,405,693 / 400,000 tokens (100%)
```

- Heddle continues running instead of compacting first.
- The next model call often fails with an LLM/provider context-size error.
- The user then has to manually tell the agent to continue/go on, which breaks the flow and wastes turns.

This is a regression from the intended behavior: long sessions should compact before an oversized request reaches the provider.

## Expected behavior

When the estimated request is near or over the active model context window, Heddle should automatically compact before making another LLM call.

The user should see compaction status, then the run should continue with reduced history. Users should not have to recover by manually typing “continue” after provider errors caused by full context.

## Screenshot evidence

The screenshot shows a running TUI session with:

```text
model=gpt-5.4
reasoning=high
auth=openai-oauth:07db77d8
estimated input 1,405,693 / 400,000 tokens (100%)
```

Despite that, recent activity continues and the session does not appear to enter the `compacting` state before the next request.

## Trace evidence

Local `.heddle` state has a saved trace for the affected workflow:

- Session: `session-1778303465676`
- Trace: `.heddle/traces/trace-1778306292090.json`
- Trace event: final `run.finished` summary at JSON line 878

Exact persisted LLM error:

```text
LLM error: An error occurred while processing your request. You can retry your request, or contact us through our help center at help.openai.com if the error persists. Please include the request ID 3c45a07f-eec8-4fc8-b00d-7e93b0577ba3 in your message.
```

The nearby session record also shows the user had to send a follow-up `go on` after the failed turn, matching the UX symptom.

Important detail: the saved trace/session data does not preserve a more specific provider context-length message for this failure. The UI footer showed `estimated input 1,405,693 / 400,000 tokens (100%)`, but the trace only captured the generic OpenAI error above. The fix should therefore also improve failure capture/classification so context-full/provider-overflow failures become actionable local diagnostics instead of only generic provider errors.
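One way to make such failures actionable is to attach the local estimate to the error and tag it when the estimate was already at or over the window. The sketch below is illustrative only: `classifyLlmFailure`, `FailureContext`, the classification names, and the overflow-hint regex are all hypothetical. They are not existing Heddle code, and the regex is not a list of documented provider error strings.

```typescript
// Hypothetical sketch: names, types, and the hint regex are assumptions,
// not Heddle's real API or documented provider messages.
type FailureClass =
  | "context_overflow"            // provider message itself mentions context size
  | "suspected_context_overflow"  // generic message, but local estimate was >= window
  | "provider_error";             // nothing points at context size

interface FailureContext {
  message: string;                // raw provider/LLM error text
  estimatedRequestTokens: number; // local estimate at the time of the call
  contextWindow: number;          // active model window
}

interface ClassifiedFailure {
  classification: FailureClass;
  message: string;
  diagnostic: string;
}

// Assumed wording hints; a real list would need to be checked against providers.
const OVERFLOW_HINTS = /context[ _-]?(length|window)|maximum context|too many tokens/i;

function classifyLlmFailure(ctx: FailureContext): ClassifiedFailure {
  const ratio = ctx.estimatedRequestTokens / ctx.contextWindow;
  const diagnostic =
    `estimated input ${ctx.estimatedRequestTokens} / ${ctx.contextWindow} tokens ` +
    `(${Math.round(ratio * 100)}% of window)`;

  let classification: FailureClass = "provider_error";
  if (OVERFLOW_HINTS.test(ctx.message)) {
    classification = "context_overflow";
  } else if (ratio >= 1) {
    classification = "suspected_context_overflow";
  }
  return { classification, message: ctx.message, diagnostic };
}

// Example using the numbers from the footer and the generic error above.
const failure = classifyLlmFailure({
  message: "LLM error: An error occurred while processing your request.",
  estimatedRequestTokens: 1405693,
  contextWindow: 400000,
});
console.log(failure.classification, "-", failure.diagnostic);
```

With a record like this persisted in the trace, the generic provider text is kept verbatim while the local estimate explains why the failure was likely context-related.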

## Likely implementation area

Relevant paths:

- `src/core/chat/engine/history/compaction.ts`
  - `compactChatHistoryWithArchive(...)`
  - `estimateChatHistoryTokens(...)`
  - internal `estimateRequestTokens(...)`
- `src/core/chat/engine/turns/preflight.ts`
  - `prepareChatSessionTurn(...)` runs preflight compaction before a session turn
- `src/cli/chat/App.tsx`
  - currently triggers background compaction when switching to a smaller context-window model
- `src/cli/chat/hooks/useChatStatusSummary.ts`
  - footer displays `lastRunInputTokens ?? estimatedRequestTokens`
- existing tests:
  - `src/__tests__/integration/tui/ask-cli.test.ts` has ask-mode preflight compaction coverage
  - add TUI/chat-session regression coverage for the full-request threshold case

## Suspected cause

A code read suggests the preflight compaction trigger is history-only:

```ts
const needsCompaction =
  estimateChatHistoryTokens(options.history) > maxHistoryTokens
  || (Boolean(options.force) && countNonCompactedMessages(options.history) > 0);
```

where `maxHistoryTokens` is currently computed as a fixed history ratio of the model's context window, so only history tokens count toward the threshold.

However, the footer and real provider request pressure are about the full request estimate, which includes at least:

- chat history
- system context
- tool names/tool descriptors or tool-related overhead
- current goal/prompt
- possibly large tool outputs preserved in history
- provider-specific payload overhead

`estimateRequestTokens(...)` exists and is used to populate context stats, but it does not appear to drive the compaction decision itself. As a result, a session can show `estimated input ... (100%)` while `needsCompaction` still evaluates to `false`.

There may also be a mismatch between TUI-visible estimates and preflight estimates if the footer is showing the last failed/last run input token count while the preflight code only sees history tokens.
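The gap can be shown with a small, self-contained sketch. This is not Heddle's code: the `RequestParts` shape, the helper names, the 0.6 history ratio, and the token counts are assumptions chosen only to show how a history-only check can pass while the full request is over the window.

```typescript
// Hypothetical sketch: every name and number here is illustrative.
interface RequestParts {
  historyTokens: number; // chat history
  systemTokens: number;  // system context
  toolTokens: number;    // tool descriptors / tool-related overhead
  promptTokens: number;  // current goal/prompt
}

const ASSUMED_HISTORY_RATIO = 0.6;

// Mirrors the shape of a history-only trigger.
function historyOnlyNeedsCompaction(parts: RequestParts, contextWindow: number): boolean {
  return parts.historyTokens > contextWindow * ASSUMED_HISTORY_RATIO;
}

// Sums everything that would actually be sent to the provider.
function fullRequestTokens(parts: RequestParts): number {
  return parts.historyTokens + parts.systemTokens + parts.toolTokens + parts.promptTokens;
}

const contextWindow = 400000;
const parts: RequestParts = {
  historyTokens: 220000, // below 60% of the window (240000)
  systemTokens: 90000,
  toolTokens: 60000,
  promptTokens: 40000,
};

const historyTrigger = historyOnlyNeedsCompaction(parts, contextWindow);
const overWindow = fullRequestTokens(parts) > contextWindow;

// History alone does not trigger compaction, yet the full request (410000) exceeds the window.
console.log({ historyTrigger, overWindow });
```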

## Proposed direction

Use full request pressure, not only history pressure, when deciding whether preflight compaction is required.

High-level direction:

- Compute estimated request tokens before each session turn using the same inputs used for footer/context stats: history, system context, tool names, current prompt/goal, and active model window.
- Trigger compaction when the full request estimate exceeds a safe threshold, for example 70-85% of the active model window, not only when history exceeds 60%.
- After compaction, recompute estimated request tokens and verify that the result is below the safe threshold before calling the LLM.
- If compaction cannot reduce enough, fail early with a clear local message suggesting `/compact`, `/clear`, or `/session new`, instead of sending an oversized request to the provider.
- Make sure OpenAI OAuth/Codex sessions use a compatible summarizer model and do not silently skip compaction due to auth/model mismatch.
- Preserve and classify provider failures that look like context-full/oversized-request failures, even when the provider only returns a generic request-processing error.
- Consider surfacing a warning in the TUI footer before failure when context is above threshold and compaction is pending/failed.
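The first four bullets above could look roughly like the following sketch. It is a simplified illustration under stated assumptions, not a proposed patch: `preflight`, `PreflightOptions`, and `PreflightResult` are hypothetical names, the estimator and compaction step are injected stand-ins, and compaction is modeled as synchronous for brevity even though the real step is presumably asynchronous.

```typescript
// Hypothetical sketch of a full-request preflight check; none of these names are Heddle's real API.
type PreflightResult =
  | { kind: "proceed"; history: string[]; estimatedTokens: number; compacted: boolean }
  | { kind: "blocked"; estimatedTokens: number; message: string };

interface PreflightOptions {
  history: string[];
  contextWindow: number;
  safeRatio: number;                              // e.g. 0.8, inside the 70-85% range suggested above
  estimateRequest: (history: string[]) => number; // stand-in for a full-request estimator
  compact: (history: string[]) => string[];       // stand-in for compaction (sync here for brevity)
}

function preflight(options: PreflightOptions): PreflightResult {
  const limit = Math.floor(options.contextWindow * options.safeRatio);

  // 1. Estimate the full request, not only history.
  const before = options.estimateRequest(options.history);
  if (before <= limit) {
    return { kind: "proceed", history: options.history, estimatedTokens: before, compacted: false };
  }

  // 2. Compact, then recompute the estimate on the compacted history.
  const compactedHistory = options.compact(options.history);
  const after = options.estimateRequest(compactedHistory);
  if (after <= limit) {
    return { kind: "proceed", history: compactedHistory, estimatedTokens: after, compacted: true };
  }

  // 3. Still too large: stop locally instead of calling the provider.
  return {
    kind: "blocked",
    estimatedTokens: after,
    message:
      `Estimated request (${after} tokens) still exceeds the safe limit (${limit}) ` +
      `after compaction. Try /compact, /clear, or /session new.`,
  };
}

// Toy usage: 100 tokens of fixed overhead plus 50 per message.
const demoHistory = Array.from({ length: 20 }, (_, i) => `message ${i}`);
const outcome = preflight({
  history: demoHistory,
  contextWindow: 1000,
  safeRatio: 0.8,
  estimateRequest: (h) => 100 + h.length * 50,
  compact: (h) => ["summary of earlier turns", ...h.slice(-4)],
});
console.log(outcome.kind, outcome.estimatedTokens);
```

Returning a tagged result rather than throwing keeps the "blocked" case available to the TUI, which could use it to drive the footer warning mentioned in the last bullet.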

## Acceptance criteria

- A TUI/session-backed run automatically starts preflight compaction when estimated full request tokens are near or over the active model context window.
- The compaction trigger considers system context, current prompt, tool metadata/overhead where practical, and history, not only raw history tokens.
- After compaction, the session context estimate is recomputed and the run proceeds with the compacted history.
- If compaction fails or cannot reduce enough, Heddle stops before the provider call with an actionable local error.
- Context-full or suspected oversized-request failures are preserved in trace/session output with enough diagnostic detail to avoid a generic-only `LLM error`.
- The footer/status reflects compaction running/failed state clearly.
- Regression tests cover a session where `estimateChatHistoryTokens(history)` alone is below the trigger threshold but `estimatedRequestTokens` exceeds the model window.
- Regression tests cover OpenAI OAuth mode so compaction uses a Codex-compatible summarizer and does not regress into provider 400/context errors.
- Regression tests cover failure formatting/classification for generic provider errors that occur when Heddle already estimated the request at or over the context window.

## UX rationale

Long Heddle sessions are expected to be usable without the user manually managing context. When the UI already knows the context is full, Heddle should compact proactively instead of letting the next LLM request fail and forcing the user to type “go on.”

