✨(back) limit output token per message#458

Open
maxenceh wants to merge 1 commit into main from maxenceh/limit-output-tokens
@maxenceh maxenceh commented May 7, 2026

Purpose

Adds a configurable output token limit per AI message to control LLM costs on very long responses. When the limit is reached, the backend stops generation and the frontend displays an error message to the user.

Proposal

Backend

  • Adds a new setting to configure the token limit
  • Enforces the output token limit during AI response generation

Frontend

  • New component that renders the "response was truncated" message shown to the user
  • Integrates TruncatedResponseMessage into the message rendering

Demo

token_limit.mov

Summary by CodeRabbit

Release Notes

  • New Features

    • Implemented output token limit per message; responses exceeding the limit are automatically truncated with user notification
    • Added a new UI notification displayed when a response has been truncated due to reaching the maximum token limit
  • Documentation

    • Added environment variable documentation for configuring the maximum output token limit

coderabbitai Bot commented May 7, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1d7f7329-d065-48d3-99a5-b5e92a254e4b

📥 Commits

Reviewing files that changed from the base of the PR and between fbabc9b and b16bf5c.

📒 Files selected for processing (9)
  • CHANGELOG.md
  • docs/env.md
  • src/backend/chat/clients/pydantic_ai.py
  • src/backend/chat/tests/clients/pydantic_ai/test_output_token_limit.py
  • src/backend/chat/tests/test_ai_agent_service_co2.py
  • src/backend/conversations/settings.py
  • src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx
  • src/frontend/apps/conversations/src/features/chat/components/TruncatedResponseMessage.tsx
  • src/frontend/apps/conversations/src/features/chat/components/__tests__/MessageItem.test.tsx

Walkthrough

This PR implements output token limits for LLM-generated chat messages. The backend now enforces a configurable maximum tokens per response via Pydantic-AI ModelSettings, tracks truncation via finish reason hooks, and propagates truncation state through message annotations. The frontend detects these annotations and renders a localized UI notice when responses are truncated. Changes include configuration, backend implementation with comprehensive tests, frontend UI component with tests, and documentation updates.
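
The flow described in this walkthrough can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's code: `FinishReasonTracker`, `stream_with_truncation_notice`, and the event dictionaries are hypothetical names, and the real implementation relies on Pydantic-AI hooks and `ModelSettings` rather than these stand-ins. Only the `finish_reason="length"` check and the `truncated: true` annotation come from the PR description.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FinishReasonTracker:
    """Hypothetical stand-in for the state updated by the after-model-request hook."""

    last_finish_reason: Optional[str] = None

    def reset(self) -> None:
        # Clear state before each response so an earlier truncation does not leak.
        self.last_finish_reason = None

    def record(self, finish_reason: Optional[str]) -> None:
        self.last_finish_reason = finish_reason

    @property
    def truncated(self) -> bool:
        # "length" is the finish reason named in the walkthrough.
        return self.last_finish_reason == "length"


def stream_with_truncation_notice(
    chunks: Iterable[str],
    finish_reason: Optional[str],
    tracker: FinishReasonTracker,
) -> Iterator[dict]:
    """Yield text events, then a truncation annotation if the output limit was hit."""
    tracker.reset()
    for chunk in chunks:
        yield {"type": "text", "value": chunk}
    tracker.record(finish_reason)
    if tracker.truncated:
        # Assumed event shape; the frontend only needs to see truncated: true.
        yield {"type": "annotation", "value": {"truncated": True}}
```

Resetting the tracker at the start of each response matters because the same tracker instance may be reused across messages; without it, a truncated reply could mark the next, complete reply as truncated too.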

Changes

Output Token Limit Feature

| Layer | File(s) | Summary |
|---|---|---|
| Configuration and Documentation | src/backend/conversations/settings.py, docs/env.md | Introduces LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE setting (default 8192) wired to environment variable and documents the setting's purpose and default value. |
| Backend Truncation Tracking and Token Limit Enforcement | src/backend/chat/clients/pydantic_ai.py | Adds Pydantic-AI Hooks and ModelSettings imports, registers after_model_request hook to track finish reasons, passes max_tokens setting to agent iterations, and conditionally emits/appends truncation annotations when model returns finish_reason="length". |
| Backend Truncation Tests | src/backend/chat/tests/clients/pydantic_ai/test_output_token_limit.py, src/backend/chat/tests/test_ai_agent_service_co2.py | Validates setting existence and defaults, hook initialization and state reset, finish reason tracking for different scenarios, and streaming annotation emission on truncation. |
| Frontend Truncation UI Component and Integration | src/frontend/apps/conversations/src/features/chat/components/TruncatedResponseMessage.tsx, src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx | Adds new TruncatedResponseMessage component with localized feedback, integrates detection and rendering of truncated messages in MessageItem, and updates memoization logic to re-render on annotation changes. |
| Frontend Truncation UI Tests | src/frontend/apps/conversations/src/features/chat/components/__tests__/MessageItem.test.tsx | Tests conditional rendering of truncation UI based on presence of truncated: true in message annotations. |
| Changelog Entry | CHANGELOG.md | Documents new output token limit feature in unreleased additions. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

frontend, documentation

Suggested reviewers

  • qbey
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 79.17% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: implementing an output token limit per message, which is the core objective of this pull request affecting both backend (settings, AI agent service) and frontend (truncation UI component). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@maxenceh maxenceh force-pushed the maxenceh/limit-output-tokens branch 2 times, most recently from d8924a2 to fbabc9b, May 13, 2026 12:22
@maxenceh maxenceh marked this pull request as ready for review May 13, 2026 12:25
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/backend/chat/tests/clients/pydantic_ai/test_output_token_limit.py`:
- Around line 20-29: The fixture base_settings is forcing
LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE to 8192 for every test, making the "default is
8192" assertions tautological; remove or stop setting
settings.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE in the base_settings fixture and
instead set that value only in tests that need it (or create a separate fixture
for tests that require a non-default), and update the tests that assert the
default (the "default is 8192" test and the other assertions flagged) to rely on
the real unset default value so they can detect regressions; locate references
to settings.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE and the pytest fixture
base_settings to implement this change.

In `@src/backend/conversations/settings.py`:
- Around line 681-685: ConversationsSettings defines
LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE but doesn't validate it; add a fail-fast check
in the class's post_setup method that reads
self.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE and raises a ValueError (or logger +
sys.exit) if the value is <= 0, with a clear message identifying
LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE so startup fails early and prevents invalid
token limits from propagating into generation settings.
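
To illustrate why the first finding (the `base_settings` fixture) makes the default assertion tautological, here is a small sketch. It uses a hypothetical `Settings` class in place of the project's real settings object, and `apply_base_settings`, `default_is_8192`, and `RegressedSettings` are invented for the illustration; only the setting name and the 8192 default come from the PR.

```python
class Settings:
    """Hypothetical stand-in for the project's settings; 8192 is the documented default."""

    LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE = 8192


class RegressedSettings(Settings):
    # Simulates a regression where the shipped default changed by mistake.
    LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE = 1024


def apply_base_settings(settings: Settings) -> Settings:
    # Mirrors the flagged fixture: it pins the value for every test.
    settings.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE = 8192
    return settings


def default_is_8192(settings: Settings) -> bool:
    return settings.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE == 8192


# With the fixture override, the check passes even though the default regressed.
assert default_is_8192(apply_base_settings(RegressedSettings()))
# Without the override, the same check catches the regression.
assert not default_is_8192(RegressedSettings())
```

The point is only that an assertion can detect a changed default when nothing in the test setup has already written the expected value into the settings object.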

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a3b26a1d-96d2-470e-9953-ddf14d477e3f

📥 Commits

Reviewing files that changed from the base of the PR and between 5ca4ef9 and fbabc9b.


Comment on lines +681 to +685
    LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE = values.IntegerValue(
        8192,
        environ_name="LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE",
        environ_prefix=None,
    )
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate token limit as strictly positive at startup.

A non-positive value here can flow into generation settings and break response generation at runtime. Add a fail-fast validation in post_setup.

Suggested fix
 class Base(BraveSettings, Configuration):
@@
     @classmethod
     def post_setup(cls):
@@
+        if cls.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE <= 0:
+            raise ValueError(
+                "LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE must be > 0, "
+                f"got {cls.LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE}."
+            )
+
         # Document context budget ratio must be a fraction (0 disables full inlining,
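
Outside the settings class, the same fail-fast idea can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the project's code: `load_max_output_tokens` and `DEFAULT_MAX_OUTPUT_TOKENS` are hypothetical names, and the real setting is declared with `values.IntegerValue` as shown in the diff. The environment variable name, the 8192 default, and the error message wording are taken from the PR.

```python
import os
from typing import Mapping

DEFAULT_MAX_OUTPUT_TOKENS = 8192  # default stated in the PR


def load_max_output_tokens(environ: Mapping[str, str] = os.environ) -> int:
    """Read LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE and reject non-positive values."""
    raw = environ.get("LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE")
    # Fall back to the default when the variable is unset; int() raises
    # ValueError on non-numeric input, which also fails fast.
    value = DEFAULT_MAX_OUTPUT_TOKENS if raw is None else int(raw)
    if value <= 0:
        raise ValueError(
            f"LLM_MAX_OUTPUT_TOKENS_PER_MESSAGE must be > 0, got {value}."
        )
    return value
```

Calling this once at startup means a misconfigured deployment stops immediately with a message naming the variable, instead of failing later during response generation.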

If the limit is reached, display a message to the user.
It is meant to limit the cost of very long messages, so the
limit must be high.
@maxenceh maxenceh force-pushed the maxenceh/limit-output-tokens branch from fbabc9b to b16bf5c, May 13, 2026 13:25