suitenumerique · camilleAND · Apr 23, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,7 @@ and this project adheres to
 - 🐛(fix) add prevent_url_hallucination instruction to ConversationAgent
 - ✨(projects) handle project files for RAG search
 - ✨(banner) configurable banner with level, title, content and start/end
+- ✨(conversation) summarize messages
 
 ### Changed
 

diff --git a/docs/attachments.md b/docs/attachments.md
@@ -425,11 +425,14 @@ Notes:
 
 The decision of which documents are inlined as `full-context` vs left as `tool_call_only` is made by `chat/document_context_builder.py:build_document_context_instruction` on each turn:
 
-1. Compute the `document_budget` in tokens:
+1. Compute budgets in tokens (`chat/clients/pydantic_ai.py` subtracts the security buffer once, then splits the remainder):
    ```text
-   document_budget = max(int(model.max_token_context * DOCUMENT_CONTEXT_BUDGET_RATIO)
-                         - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS, 0)
+   usable_context = max(model.max_token_context - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS, 0)
+   document_budget = max(int(usable_context * DOCUMENT_CONTEXT_BUDGET_RATIO), 0)
    ```
+   The conversation history budget (summarization trigger) uses the other share:
+   `message_token_budget = max(int(usable_context * (1 - DOCUMENT_CONTEXT_BUDGET_RATIO)), 0)`.
+   `build_document_context_instruction` receives `usable_context` as its `max_token_context` argument (buffer already applied).
 2. Iterate documents oldest-first. For each document:
    - If its token count exceeds the whole budget alone → keep `tool_call_only`.
    - Otherwise, while adding it would overflow the budget, **evict the oldest currently-inlined document** (FIFO): demote it to `tool_call_only`, free its tokens.
@@ -444,6 +447,17 @@ Token estimation uses `tiktoken` with the `cl100k_base` encoding (GPT-4 tokenize
 The assembled instruction is **cached** per turn keyed on:
 `conversation_id`, `user_id`, `model_hrid`, `model.max_token_context`, `DOCUMENT_CONTEXT_BUDGET_RATIO`, `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`, and a fingerprint of `(attachment.id, attachment.updated_at)` for every text attachment - **conversation and project text attachments both contribute to the fingerprint**. Any attachment add / remove / edit (including project files), or any settings change, invalidates the cache. TTL is 30 minutes (`CACHE_TIMEOUT`).
 
+#### Conversation history summarization
+
+When `message_token_budget` is exceeded, `chat/agents/history_processors.py` calls a separate summarization model (`LLM_SUMMARIZATION_MODEL_HRID`) and stores the result on the conversation (`history_summary`, `history_summary_checkpoint`). This runs at the **start of a new user turn**, before `agent.iter`, against **stored** `pydantic_messages` from previous turns only (the current user prompt is extracted separately and is not in that list yet). That stored history usually ends on an assistant `ModelResponse`.
+
+1. **Trigger**: estimated tokens in the active history slice exceed `message_token_budget` (see formulas above) and there are new messages after the last checkpoint.
+2. **After a summary**: the model receives the stored summary text (dynamic instruction) plus the last `CONVERSATION_SUMMARY_CONTEXT_MESSAGES` `ModelMessage` entries before the checkpoint. Use an **even** value so the retained window starts on a user `ModelRequest` in a plain user/assistant alternation (tool messages can break parity).
+3. **Summary length**: capped by `CONVERSATION_SUMMARY_MAX_TOKENS` on the summarization LLM call.
+4. **Disable summarization** without changing document budgets: remove `max_token_context` from the chat model in `LLM_CONFIGURATIONS`, or set `message_token_budget` to zero (`DOCUMENT_CONTEXT_BUDGET_RATIO=1` also zeroes it but reallocates all `usable_context` to documents).
+
+The security buffer is **not** a dedicated reserve for system prompts, tool schemas, or completion tokens; those are added on top of the planned document/history split. Size the buffer (and/or plan `max_token_context` below the model nominal window) to leave headroom for that overhead.
+
 #### Targeted document operations (`document_id`)
 
 Three tools accept an optional `document_id` argument, each with its own IDOR boundary:
@@ -522,8 +536,11 @@ A `READY` attachment whose `rag_document_id` is null (e.g. parse succeeded but t
 | `RAG_DOCUMENT_SEARCH_BACKEND`                | `AlbertRagBackend` | Import path of the vector-search backend used for indexing and search (Albert or Find) |
 | `PROJECT_FILES_MAX_COUNT`                    | `10`           | Max non-image attachments per project (excludes hidden markdown companions). Enforced at upload-time in `ChatProjectAttachmentViewSet`. Bounds per-turn system-prompt token cost (every entry contributes to `project_documents` on every conversation turn). |
 | `PROJECT_IMAGES_MAX_COUNT`                   | `3`            | Max image attachments per project. Enforced at upload-time. Bounds per-turn vision token cost - every project image is pinned to every turn alongside conversation-message images, and provider request-level image caps (Anthropic ~20/request) clip the trailing entries first. |
-| `DOCUMENT_CONTEXT_BUDGET_RATIO`              | `0.5`          | Fraction of `model.max_token_context` reserved for inlined documents (0 disables full-context inlining; everything stays `tool_call_only`) |
-| `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`    | `1000`         | Tokens subtracted from the inlining budget to absorb tokenizer drift on non-OpenAI models |
+| `DOCUMENT_CONTEXT_BUDGET_RATIO`              | `0.5`          | Fraction of `usable_context` (after the security buffer) reserved for inlined documents. `(1 - ratio)` goes to the conversation history budget (summarization trigger). `0` disables full-context inlining (everything stays `tool_call_only`). |
+| `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`    | `1000`         | Tokens subtracted once from `model.max_token_context` before the document/history split, to absorb tokenizer drift on non-OpenAI models and leave headroom beyond the planned split |
+| `CONVERSATION_SUMMARY_CONTEXT_MESSAGES`      | `10`           | Number of pydantic-ai `ModelMessage` entries kept in runtime history after a conversation summary (in addition to the stored summary text). Use an even value (see [Conversation history summarization](#conversation-history-summarization)) |
+| `CONVERSATION_SUMMARY_MAX_TOKENS`            | `2048`         | `max_tokens` for the summarization LLM call when generating or updating `history_summary` |
+| `LLM_SUMMARIZATION_MODEL_HRID`               | `default-summarization-model` | HRID in `LLM_CONFIGURATIONS` for the conversation summarization agent (see [LLM Configuration](llm-configuration.md)) |
 
 #### RAG_FILES_ACCEPTED_FORMATS
 
@@ -571,7 +588,7 @@ RAG_FILES_ACCEPTED_FORMATS = [
 
 ### Per-model setting: `max_token_context`
 
-Each entry in `LLM_CONFIGURATIONS` accepts a `max_token_context` integer field declaring the model's context window size. When set, it drives the inlining budget for the documents listing (`document_budget = max_token_context * DOCUMENT_CONTEXT_BUDGET_RATIO - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`).
+Each entry in `LLM_CONFIGURATIONS` accepts a `max_token_context` integer field declaring the model's context window size. When set, it drives document inlining and conversation summarization budgets (`usable_context = max_token_context - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`; `document_budget = usable_context * DOCUMENT_CONTEXT_BUDGET_RATIO`).
 
 If a model has no `max_token_context`, all of its documents are kept `tool_call_only` regardless of size and a warning is logged on every chat turn. Setting the field accurately matters: too low and small documents get pushed to RAG-only when they could be inlined; too high and the LLM may exceed its real window.
 

diff --git a/docs/env.md b/docs/env.md
@@ -80,6 +80,10 @@ These are the environment variables you can set for the `conversations-backend`
 | LLM_CONFIGURATION_FILE_PATH                     | Path to the LLM configuration JSON file. See [LLM Configuration](llm-configuration.md) for details                                | <BASE_DIR>/conversations/configuration/llm/default.json |
 | LLM_DEFAULT_MODEL_HRID                          | HRID of the model used for conversations                                                                                          | default-model                                           |
 | LLM_SUMMARIZATION_MODEL_HRID                    | HRID of the model used for summarization                                                                                          | default-summarization-model                             |
+| DOCUMENT_CONTEXT_BUDGET_RATIO                   | Fraction of `usable_context` (after security buffer) for inlined conversation documents; `(1 - ratio)` is the conversation history token budget. See [attachments.md](attachments.md) | `0.5`                                                   |
+| DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS         | Tokens subtracted once from `max_token_context` before the document/history split                                                              | `1000`                                                  |
+| CONVERSATION_SUMMARY_CONTEXT_MESSAGES           | `ModelMessage` count kept after a conversation summary (use an even value). See [attachments.md](attachments.md)                               | `10`                                                    |
+| CONVERSATION_SUMMARY_MAX_TOKENS                 | Max tokens for the conversation summarization LLM output                                                                                        | `2048`                                                  |
 | AI_API_KEY                                      | AI API key to be used for the default provider (used in default LLM configuration, not for production use)                        |                                                         |
 | AI_BASE_URL                                     | OpenAI compatible AI base URL (used in default LLM configuration, not for production use)                                         |                                                         |
 | AI_MODEL                                        | AI Model name to use (used in default LLM configuration, not for production use)                                                  |                                                         |