Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ and this project adheres to
- 🐛(fix) add prevent_url_hallucination instruction to ConversationAgent
- ✨(projects) handle project files for RAG search
- ✨(banner) configurable banner with level, title, content and start/end
- ✨(conversation) summarize messages

### Changed

Expand Down
29 changes: 23 additions & 6 deletions docs/attachments.md
Original file line number Diff line number Diff line change
Expand Up @@ -425,11 +425,14 @@ Notes:

The decision of which documents are inlined as `full-context` vs left as `tool_call_only` is made by `chat/document_context_builder.py:build_document_context_instruction` on each turn:

1. Compute the `document_budget` in tokens:
1. Compute budgets in tokens (`chat/clients/pydantic_ai.py` subtracts the security buffer once, then splits the remainder):
```text
document_budget = max(int(model.max_token_context * DOCUMENT_CONTEXT_BUDGET_RATIO)
- DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS, 0)
usable_context = max(model.max_token_context - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS, 0)
document_budget = max(int(usable_context * DOCUMENT_CONTEXT_BUDGET_RATIO), 0)
```
The conversation history budget (summarization trigger) uses the other share:
`message_token_budget = max(int(usable_context * (1 - DOCUMENT_CONTEXT_BUDGET_RATIO)), 0)`.
`build_document_context_instruction` receives `usable_context` as its `max_token_context` argument (buffer already applied).
2. Iterate documents oldest-first. For each document:
- If its token count exceeds the whole budget alone → keep `tool_call_only`.
- Otherwise, while adding it would overflow the budget, **evict the oldest currently-inlined document** (FIFO): demote it to `tool_call_only`, free its tokens.
Expand All @@ -444,6 +447,17 @@ Token estimation uses `tiktoken` with the `cl100k_base` encoding (GPT-4 tokenize
The assembled instruction is **cached** per turn keyed on:
`conversation_id`, `user_id`, `model_hrid`, `model.max_token_context`, `DOCUMENT_CONTEXT_BUDGET_RATIO`, `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`, and a fingerprint of `(attachment.id, attachment.updated_at)` for every text attachment - **conversation and project text attachments both contribute to the fingerprint**. Any attachment add / remove / edit (including project files), or any settings change, invalidates the cache. TTL is 30 minutes (`CACHE_TIMEOUT`).

#### Conversation history summarization

When `message_token_budget` is exceeded, `chat/agents/history_processors.py` calls a separate summarization model (`LLM_SUMMARIZATION_MODEL_HRID`) and stores the result on the conversation (`history_summary`, `history_summary_checkpoint`). This runs at the **start of a new user turn**, before `agent.iter`, against **stored** `pydantic_messages` from previous turns only (the current user prompt is extracted separately and is not in that list yet). That stored history usually ends on an assistant `ModelResponse`.

1. **Trigger**: estimated tokens in the active history slice exceed `message_token_budget` (see formulas above) and there are new messages after the last checkpoint.
2. **After a summary**: the model receives the stored summary text (dynamic instruction) plus the last `CONVERSATION_SUMMARY_CONTEXT_MESSAGES` `ModelMessage` entries before the checkpoint. Use an **even** value so the retained window starts on a user `ModelRequest` in a plain user/assistant alternation (tool messages can break parity).
3. **Summary length**: capped by `CONVERSATION_SUMMARY_MAX_TOKENS` on the summarization LLM call.
4. **Disable summarization** without changing document budgets: remove `max_token_context` from the chat model in `LLM_CONFIGURATIONS`, or set `message_token_budget` to zero (`DOCUMENT_CONTEXT_BUDGET_RATIO=1` also zeroes it but reallocates all `usable_context` to documents).

The security buffer is **not** a dedicated reserve for system prompts, tool schemas, or completion tokens; those are added on top of the planned document/history split. Size the buffer (and/or plan `max_token_context` below the model nominal window) to leave headroom for that overhead.

#### Targeted document operations (`document_id`)

Three tools accept an optional `document_id` argument, each with its own IDOR boundary:
Expand Down Expand Up @@ -522,8 +536,11 @@ A `READY` attachment whose `rag_document_id` is null (e.g. parse succeeded but t
| `RAG_DOCUMENT_SEARCH_BACKEND` | `AlbertRagBackend` | Import path of the vector-search backend used for indexing and search (Albert or Find) |
| `PROJECT_FILES_MAX_COUNT` | `10` | Max non-image attachments per project (excludes hidden markdown companions). Enforced at upload-time in `ChatProjectAttachmentViewSet`. Bounds per-turn system-prompt token cost (every entry contributes to `project_documents` on every conversation turn). |
| `PROJECT_IMAGES_MAX_COUNT` | `3` | Max image attachments per project. Enforced at upload-time. Bounds per-turn vision token cost - every project image is pinned to every turn alongside conversation-message images, and provider request-level image caps (Anthropic ~20/request) clip the trailing entries first. |
| `DOCUMENT_CONTEXT_BUDGET_RATIO` | `0.5` | Fraction of `model.max_token_context` reserved for inlined documents (0 disables full-context inlining; everything stays `tool_call_only`) |
| `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS` | `1000` | Tokens subtracted from the inlining budget to absorb tokenizer drift on non-OpenAI models |
| `DOCUMENT_CONTEXT_BUDGET_RATIO` | `0.5` | Fraction of `usable_context` (after the security buffer) reserved for inlined documents. `(1 - ratio)` goes to the conversation history budget (summarization trigger). `0` disables full-context inlining (everything stays `tool_call_only`). |
| `DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS` | `1000` | Tokens subtracted once from `model.max_token_context` before the document/history split, to absorb tokenizer drift on non-OpenAI models and leave headroom beyond the planned split |
| `CONVERSATION_SUMMARY_CONTEXT_MESSAGES` | `10` | Number of pydantic-ai `ModelMessage` entries kept in runtime history after a conversation summary (in addition to the stored summary text). Use an even value (see [Conversation history summarization](#conversation-history-summarization)) |
| `CONVERSATION_SUMMARY_MAX_TOKENS` | `2048` | `max_tokens` for the summarization LLM call when generating or updating `history_summary` |
| `LLM_SUMMARIZATION_MODEL_HRID` | `default-summarization-model` | HRID in `LLM_CONFIGURATIONS` for the conversation summarization agent (see [LLM Configuration](llm-configuration.md)) |

#### RAG_FILES_ACCEPTED_FORMATS

Expand Down Expand Up @@ -571,7 +588,7 @@ RAG_FILES_ACCEPTED_FORMATS = [

### Per-model setting: `max_token_context`

Each entry in `LLM_CONFIGURATIONS` accepts a `max_token_context` integer field declaring the model's context window size. When set, it drives the inlining budget for the documents listing (`document_budget = max_token_context * DOCUMENT_CONTEXT_BUDGET_RATIO - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`).
Each entry in `LLM_CONFIGURATIONS` accepts a `max_token_context` integer field declaring the model's context window size. When set, it drives document inlining and conversation summarization budgets (`usable_context = max_token_context - DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS`; `document_budget = usable_context * DOCUMENT_CONTEXT_BUDGET_RATIO`).

If a model has no `max_token_context`, all of its documents are kept `tool_call_only` regardless of size and a warning is logged on every chat turn. Setting the field accurately matters: too low and small documents get pushed to RAG-only when they could be inlined; too high and the LLM may exceed its real window.

Expand Down
4 changes: 4 additions & 0 deletions docs/env.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,10 @@ These are the environment variables you can set for the `conversations-backend`
| LLM_CONFIGURATION_FILE_PATH | Path to the LLM configuration JSON file. See [LLM Configuration](llm-configuration.md) for details | <BASE_DIR>/conversations/configuration/llm/default.json |
| LLM_DEFAULT_MODEL_HRID | HRID of the model used for conversations | default-model |
| LLM_SUMMARIZATION_MODEL_HRID | HRID of the model used for summarization | default-summarization-model |
| DOCUMENT_CONTEXT_BUDGET_RATIO | Fraction of `usable_context` (after security buffer) for inlined conversation documents; `(1 - ratio)` is the conversation history token budget. See [attachments.md](attachments.md) | `0.5` |
| DOCUMENT_CONTEXT_SECURITY_BUFFER_TOKENS | Tokens subtracted once from `max_token_context` before the document/history split | `1000` |
| CONVERSATION_SUMMARY_CONTEXT_MESSAGES | `ModelMessage` count kept after a conversation summary (use an even value). See [attachments.md](attachments.md) | `10` |
| CONVERSATION_SUMMARY_MAX_TOKENS | Max tokens for the conversation summarization LLM output | `2048` |
| AI_API_KEY | AI API key to be used for the default provider (used in default LLM configuration, not for production use) | |
| AI_BASE_URL | OpenAI compatible AI base URL (used in default LLM configuration, not for production use) | |
| AI_MODEL | AI Model name to use (used in default LLM configuration, not for production use) | |
Expand Down
Loading
Loading