Skip to content

✨(tools) add basic document translate tool#302

Open
natoromano wants to merge 4 commits into
mainfrom
nr/trad
Open

✨(tools) add basic document translate tool#302
natoromano wants to merge 4 commits into
mainfrom
nr/trad

Conversation

@natoromano
Copy link
Copy Markdown
Collaborator

@natoromano natoromano commented Feb 20, 2026

Purpose

First fix to #301 (for short documents at least)
This tool adds a simple document translate tool, because right now document translation is done through summary.

Proposal

  • If the doc is short (env variable to be set, but let's try to have 50% of context size), just translate it
  • Otherwise inform the user

Summary by CodeRabbit

  • New Features

    • Document translation: translate document content into a chosen target language while preserving formatting.
    • Real-time progress indicators in chat showing when translation operations are active.
    • Multi-language UI: localized text for translation workflows.
  • Tests

    • Added comprehensive tests covering translation flows, errors, and edge cases.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 20, 2026

Walkthrough

Adds a translation feature: a TranslationAgent, a document_translate tool, pydantic AI client wiring, tests and fixtures, a TRANSLATION_MAX_CHARS setting, utility to read documents, and frontend UI/i18n updates for translation progress.

Changes

Cohort / File(s) Summary
Backend Agent
src/backend/chat/agents/translate.py
New TranslationAgent dataclass (init disabled) that initializes BaseAgent with settings.LLM_DEFAULT_MODEL_HRID and output_type=str.
Document Translate Tool
src/backend/chat/tools/document_translate.py, src/backend/chat/tools/utils.py
New async document_translate tool (validation, reads latest text attachment via read_document_content, enforces TRANSLATION_MAX_CHARS, constructs prompt, calls TranslationAgent, returns ToolReturn, maps failures to ModelRetry/ModelCannotRetry). read_document_content moved/added to utils.py.
AI Client Integration
src/backend/chat/clients/pydantic_ai.py
Registers translate tool and a translation_system_prompt; adds translate tool wrapper and AIAgentService.translate wiring to integrate translation into tool pipeline.
Settings
src/backend/conversations/settings.py
Adds TRANSLATION_MAX_CHARS setting (PositiveIntegerValue, default 100_000).
Tests & Fixtures
src/backend/chat/tests/conftest.py, src/backend/chat/tests/tools/test_document_translate.py
Adds mock_translation_agent fixture and extensive async tests for document_translate covering many success/error cases and edge conditions.
Frontend UI & i18n
src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx, src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx, src/frontend/apps/conversations/src/i18n/translations.json
Displays localized "Translation in progress..." via memoized active tool label; singular/plural extraction label handling; adds translation keys in multiple locales.
Changelog
CHANGELOG.md
Adds Unreleased entry: “✨(tools) add basic translate tool”.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Frontend/User
    participant Service as AIAgentService
    participant Tool as document_translate
    participant Agent as TranslationAgent
    participant Storage as Document Storage

    Client->>Service: Request translation (via tool invocation)
    Service->>Tool: invoke translate(target_language, instructions)
    Tool->>Storage: Get latest text attachment metadata
    Storage-->>Tool: Attachment metadata
    Tool->>Storage: Read document content
    Storage-->>Tool: Document text
    Tool->>Agent: Execute translation(prompt with content)
    Agent-->>Tool: Translated text
    Tool->>Service: Return ToolReturn(translated_text, metadata)
    Service-->>Client: Display translation result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • PR #233: Touches same frontend files (ToolInvocationItem.tsx, i18n/translations.json) and relates to document extraction UI/i18n changes.
  • PR #151: Related changes in pydantic_ai.py around system prompts and tool-wrapper patterns used for translation wiring.
  • PR #78: Related to document reading logic and the move/introduction of read_document_content used by document_translate.

Suggested labels

enhancement, backend

Suggested reviewers

  • qbey
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the primary change: adding a basic document translate tool. It is specific, concise, and directly reflects the main objective of the pull request.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch nr/trad

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds a dedicated document translation tool to fix issue #301, which previously caused translation requests to be handled by the summary tool, resulting in translated summaries rather than full translations.

Changes:

  • Implements a new translate tool that translates the last uploaded text document into a specified target language, with a configurable size limit (100k characters by default)
  • Adds frontend support for displaying translation progress with appropriate loading messages
  • Extends test coverage with comprehensive test cases for the new translation functionality

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/backend/chat/tools/document_translate.py New tool implementation that translates the last uploaded text document with size validation
src/backend/chat/agents/translate.py New TranslationAgent class following the BaseAgent pattern used by SummarizationAgent
src/backend/conversations/settings.py Adds TRANSLATION_MAX_CHARS configuration setting (default 100k characters)
src/backend/chat/clients/pydantic_ai.py Registers the translate tool and adds system prompt for translation behavior
src/backend/chat/tests/tools/test_document_translate.py Comprehensive test suite covering success cases, error handling, and edge cases
src/backend/chat/tests/conftest.py Adds mock_translation_agent fixture following the pattern of mock_summarization_agent
src/frontend/apps/conversations/src/i18n/translations.json Adds translation strings for "Extracting document" (singular) and "Translation in progress" in all supported languages
src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx Updates to show singular/plural document extraction messages
src/frontend/apps/conversations/src/features/chat/components/MessageItem.tsx Adds display logic for translation progress indicator
CHANGELOG.md Documents the new feature in the Unreleased section

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/backend/chat/tools/document_translate.py
Comment thread src/backend/chat/tools/document_translate.py
Comment thread src/backend/chat/tools/document_translate.py
Comment thread src/backend/chat/tools/document_translate.py Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (5)
src/backend/chat/agents/translate.py (2)

19-20: TranslationAgent silently reuses LLM_SUMMARIZATION_MODEL_HRID.

The translation agent borrows the summarization model HRID with no comment explaining the intentional coupling. This prevents independent per-role model configuration and makes the setting name misleading when operators try to tune translation separately.

Consider introducing a dedicated LLM_TRANSLATION_MODEL_HRID setting (falling back to the summarization HRID as a default), or at least add an inline comment documenting the deliberate reuse.

♻️ Suggested approach (new setting in settings.py + usage here)

In src/backend/conversations/settings.py:

LLM_TRANSLATION_MODEL_HRID = values.Value(
    None,  # falls back to LLM_SUMMARIZATION_MODEL_HRID when unset
    environ_name="LLM_TRANSLATION_MODEL_HRID",
    environ_prefix=None,
)

In translate.py:

-            model_hrid=settings.LLM_SUMMARIZATION_MODEL_HRID,
+            model_hrid=settings.LLM_TRANSLATION_MODEL_HRID or settings.LLM_SUMMARIZATION_MODEL_HRID,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agents/translate.py` around lines 19 - 20, TranslationAgent
is currently hard-coded to reuse settings.LLM_SUMMARIZATION_MODEL_HRID which
prevents separate configuration; update TranslationAgent to read a dedicated
setting (settings.LLM_TRANSLATION_MODEL_HRID) falling back to
settings.LLM_SUMMARIZATION_MODEL_HRID when unset, or at minimum add an inline
comment above the super().__init__ call explaining the intentional reuse;
reference the TranslationAgent class and the model_hrid parameter in the
super().__init__ call to implement the fallback or add the explanatory comment.

4-10: Unused logger instance.

logging is imported and a logger instance is created at line 10, but no log statements exist anywhere in the file. Remove both or add at minimum a debug log in __init__.

🧹 Proposed fix
-import logging
-
 from django.conf import settings

 from .base import BaseAgent
-
-logger = logging.getLogger(__name__)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agents/translate.py` around lines 4 - 10, The module defines
an unused logger (import logging and logger = logging.getLogger(__name__)) in
translate.py; either remove the logging import and the module-level logger
variable, or keep them and add a minimal log statement (e.g., logger.debug)
inside the agent initializer (the Translate agent class that subclasses
BaseAgent) to justify the logger's presence; update imports and the class
__init__ accordingly so there are no unused symbols.
src/backend/chat/clients/pydantic_ai.py (1)

658-679: translation_system_prompt and summarization_system_prompt share near-identical wording.

Both instruct the model to return results verbatim without modification. The duplication is harmless but could be consolidated into a single reusable instruction if more tool-specific verbatim prompts are added in the future.

♻️ Optional consolidation
-        `@self.conversation_agent.instructions`
-        def summarization_system_prompt() -> str:
-            return (
-                "When you receive a result from the summarization tool, you MUST return it "
-                "directly to the user without any modification, paraphrasing, or additional "
-                "summarization."
-                "The tool already produces optimized summaries that should be presented "
-                "verbatim."
-                "You may translate the summary if required, but you MUST preserve all the "
-                "information from the original summary."
-                "You may add a follow-up question after the summary if needed."
-            )
-
-        `@self.conversation_agent.instructions`
-        def translation_system_prompt() -> str:
-            return (
-                "When you receive a result from the translation tool, you MUST return it "
-                "directly to the user without any modification, paraphrasing, or additional "
-                "summarization. "
-                "The tool already produces complete translations that should be presented "
-                "verbatim."
-            )
+        `@self.conversation_agent.instructions`
+        def verbatim_tool_results_prompt() -> str:
+            return (
+                "When you receive a result from the summarization or translation tool, "
+                "you MUST return it directly to the user without any modification, "
+                "paraphrasing, or additional summarization. "
+                "The tool already produces optimized output that should be presented verbatim. "
+                "For summaries, you may translate if required but MUST preserve all information. "
+                "You may add a follow-up question after the result if needed."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/clients/pydantic_ai.py` around lines 658 - 679, Both
summarization_system_prompt and translation_system_prompt register
near-identical instructions via self.conversation_agent.instructions;
consolidate into a single reusable function (e.g., verbatim_tool_prompt or
create_tool_prompt) that returns the shared verbatim-return policy and call that
from both translation_system_prompt and summarization_system_prompt (or replace
both with direct references to the unified instruction) so future tool-specific
tweaks are easier to manage while preserving current behavior.
src/backend/chat/tests/tools/test_document_translate.py (1)

51-59: Consider adding content_type filter verification to the mock queryset.

_mock_attachments_queryset stubs .filter().order_by().afirst() but doesn't verify that .filter() was called with content_type__startswith="text/". Adding an assertion (like mock_conversation.attachments.filter.assert_called_once_with(content_type__startswith="text/")) in at least one test would guard against filter regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tests/tools/test_document_translate.py` around lines 51 -
59, Update the tests that use _mock_attachments_queryset to assert the .filter
call includes the content_type prefix: when you create the mock via
_mock_attachments_queryset(attachment), ensure the test asserts
mock_conversation.attachments.filter.assert_called_once_with(content_type__startswith="text/")
(or equivalent) after the code under test runs; this will validate that the
mocked .filter was invoked with content_type__startswith="text/" and prevent
regressions in how attachments are filtered (reference
_mock_attachments_queryset, mock_attachments.filter, and
mock_conversation.attachments.filter in the assertion).
src/backend/chat/tools/document_translate.py (1)

104-111: Unnecessary f-string prefixes on lines without interpolation.

Lines 105, 107, and 110 use f"..." but contain no {…} placeholders. This is a minor readability nit — plain strings suffice.

Proposed fix
         translate_prompt = (
-            f"You are an agent specializing in text translation. "
+            "You are an agent specializing in text translation. "
             f"Translate the following document to {target_language}. "
-            f"Preserve all markdown formatting exactly as-is. "
+            "Preserve all markdown formatting exactly as-is. "
             f"{instructions_hint}\n\n"
             f"'''\n{content}\n'''\n\n"
-            f"Respond directly with the translated text only, no commentary."
+            "Respond directly with the translated text only, no commentary."
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` around lines 104 - 111, The
translate_prompt construction uses unnecessary f-strings on literal lines;
update the translate_prompt assignment in document_translate.py (the
translate_prompt variable creation inside the document_translate function) by
removing the f-prefix from string fragments that have no interpolation (the
lines that currently read f"Translate the following document to
{target_language}." may keep the f when needed, but the literal fragments like
f"Preserve all markdown formatting exactly as-is. ", f"{instructions_hint}\n\n"
if instructions_hint is already interpolated keep its f, and
f"'''\n{content}\n'''\n\n" should remain f only where {content} is used) — in
short, change any f"..." fragments with no {…} placeholders to plain "..." to
improve readability while keeping f-strings where placeholders like
{target_language}, {instructions_hint}, or {content} are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/tests/tools/test_document_translate.py`:
- Around line 67-96: The test fails because document_translate awaits the
synchronous helper read_document_content; update document_translate to call
read_document_content without await (or alternatively make read_document_content
async and update its callers), specifically fix the await usage around
read_document_content in the document_translate flow so the call is synchronous
(remove the await and handle its return directly) or convert
read_document_content to an async def and await it consistently across
document_translate and any callers.

In `@src/backend/chat/tools/document_translate.py`:
- Line 9: Update the ToolReturn import to use the top-level package: replace the
import statement that references ToolReturn from pydantic_ai.messages with an
import from pydantic_ai (i.e., change "from pydantic_ai.messages import
ToolReturn" to "from pydantic_ai import ToolReturn"). Apply the same change in
the other affected modules—document_search_rag.py, document_summarize.py,
document_generic_search_rag.py, and web_search_brave.py—so any references to
ToolReturn come from pydantic_ai instead of pydantic_ai.messages.

In
`@src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx`:
- Around line 51-57: The current render in ToolInvocationItem.tsx uses
documentIdentifiers.length === 1 to pick singular vs plural but doesn't guard
length === 0, producing "Extracting documents: ..." for an empty array; update
the JSX to first check if documentIdentifiers.length === 0 and render a clear
fallback (e.g., t('No documents to extract') or another appropriate translation
key), otherwise keep the existing singular (documentIdentifiers[0]) and plural
(documentIdentifiers.join(', ')) branches; ensure you reference the
documentIdentifiers variable and update the localized strings accordingly.

---

Nitpick comments:
In `@src/backend/chat/agents/translate.py`:
- Around line 19-20: TranslationAgent is currently hard-coded to reuse
settings.LLM_SUMMARIZATION_MODEL_HRID which prevents separate configuration;
update TranslationAgent to read a dedicated setting
(settings.LLM_TRANSLATION_MODEL_HRID) falling back to
settings.LLM_SUMMARIZATION_MODEL_HRID when unset, or at minimum add an inline
comment above the super().__init__ call explaining the intentional reuse;
reference the TranslationAgent class and the model_hrid parameter in the
super().__init__ call to implement the fallback or add the explanatory comment.
- Around line 4-10: The module defines an unused logger (import logging and
logger = logging.getLogger(__name__)) in translate.py; either remove the logging
import and the module-level logger variable, or keep them and add a minimal log
statement (e.g., logger.debug) inside the agent initializer (the Translate agent
class that subclasses BaseAgent) to justify the logger's presence; update
imports and the class __init__ accordingly so there are no unused symbols.

In `@src/backend/chat/clients/pydantic_ai.py`:
- Around line 658-679: Both summarization_system_prompt and
translation_system_prompt register near-identical instructions via
self.conversation_agent.instructions; consolidate into a single reusable
function (e.g., verbatim_tool_prompt or create_tool_prompt) that returns the
shared verbatim-return policy and call that from both translation_system_prompt
and summarization_system_prompt (or replace both with direct references to the
unified instruction) so future tool-specific tweaks are easier to manage while
preserving current behavior.

In `@src/backend/chat/tests/tools/test_document_translate.py`:
- Around line 51-59: Update the tests that use _mock_attachments_queryset to
assert the .filter call includes the content_type prefix: when you create the
mock via _mock_attachments_queryset(attachment), ensure the test asserts
mock_conversation.attachments.filter.assert_called_once_with(content_type__startswith="text/")
(or equivalent) after the code under test runs; this will validate that the
mocked .filter was invoked with content_type__startswith="text/" and prevent
regressions in how attachments are filtered (reference
_mock_attachments_queryset, mock_attachments.filter, and
mock_conversation.attachments.filter in the assertion).

In `@src/backend/chat/tools/document_translate.py`:
- Around line 104-111: The translate_prompt construction uses unnecessary
f-strings on literal lines; update the translate_prompt assignment in
document_translate.py (the translate_prompt variable creation inside the
document_translate function) by removing the f-prefix from string fragments that
have no interpolation (the lines that currently read f"Translate the following
document to {target_language}." may keep the f when needed, but the literal
fragments like f"Preserve all markdown formatting exactly as-is. ",
f"{instructions_hint}\n\n" if instructions_hint is already interpolated keep its
f, and f"'''\n{content}\n'''\n\n" should remain f only where {content} is used)
— in short, change any f"..." fragments with no {…} placeholders to plain "..."
to improve readability while keeping f-strings where placeholders like
{target_language}, {instructions_hint}, or {content} are used.

Comment thread src/backend/chat/tests/tools/test_document_translate.py
Comment thread src/backend/chat/tools/document_translate.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/backend/chat/tools/document_translate.py (1)

9-9: ⚠️ Potential issue | 🔴 Critical

ToolReturn must be imported from pydantic_ai, not pydantic_ai.messages.

from pydantic_ai import ToolReturn is the correct import path per the pydantic-ai documentation.

🐛 Proposed fix
-from pydantic_ai.messages import ToolReturn
+from pydantic_ai import ToolReturn
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` at line 9, The file imports
ToolReturn from the wrong module; change the import to use the top-level package
by replacing "from pydantic_ai.messages import ToolReturn" with "from
pydantic_ai import ToolReturn" so the symbol ToolReturn is resolved correctly
(look for the import line near the top of
src/backend/chat/tools/document_translate.py and update the import statement
that currently references pydantic_ai.messages).
🧹 Nitpick comments (1)
src/backend/chat/tools/document_translate.py (1)

12-12: read_document_content imported from a peer tool module — move it to a shared utility.

read_document_content is a generic document-reading helper, not summarization logic. Importing it from chat.tools.document_summarize creates a coupling between two independent tool modules. Moving it to chat.tools.utils (or a dedicated chat.tools.document_utils) would give both tools a clean, stable import.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` at line 12, The import of
read_document_content from chat.tools.document_summarize creates an unnecessary
coupling; extract the read_document_content function into a shared utility
module (e.g., chat.tools.utils or chat.tools.document_utils), update the
implementation there, then change imports in chat.tools.document_translate (and
keep chat.tools.document_summarize) to import read_document_content from the new
module; ensure any internal references to read_document_content are updated and
run tests/linters to verify no remaining imports from
chat.tools.document_summarize.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/tests/tools/test_document_translate.py`:
- Around line 18-38: The fixture fixture_translation_agent_config sets
settings.LLM_CONFIGURATIONS keyed to settings.LLM_DEFAULT_MODEL_HRID but
TranslationAgent.__init__ passes
model_hrid=settings.LLM_SUMMARIZATION_MODEL_HRID and BaseAgent.__init__
immediately looks up settings.LLM_CONFIGURATIONS[model_hrid], causing
ImproperlyConfigured; to fix, change the fixture to register the LLModel under
settings.LLM_SUMMARIZATION_MODEL_HRID (or add an entry for that HRID) so that
settings.LLM_CONFIGURATIONS contains a mapping for the key TranslationAgent
expects before instantiation.

In `@src/backend/chat/tools/document_translate.py`:
- Around line 79-91: The early byte-size check using last_attachment.size
against settings.TRANSLATION_MAX_CHARS is producing false positives for
multi-byte UTF-8 content; update the logic in document_translate.py by either
removing this early check entirely (rely on the existing content-length check
later) or change the threshold to a conservative bytes threshold of
settings.TRANSLATION_MAX_CHARS * 4 before raising ModelCannotRetry so you only
reject files that cannot possibly be within the char limit; ensure the
ModelCannotRetry message remains unchanged.

In `@src/backend/conversations/configuration/llm/default.json`:
- Line 40: The change of default provider "kind" from "openai" to "mistral" in
src/backend/conversations/configuration/llm/default.json breaks deployments;
either revert "kind" back to "openai" so default-provider, default-model and
default-summarization-model continue to work with OpenAI-compatible backends, or
if the switch to "mistral" is intentional update the schema default (schema
Literal default for kind), add a CHANGELOG entry and migration guidance
explaining required env changes (e.g., setting AI_BASE_URL to a Mistral
endpoint) and update any docs referencing
default-provider/default-model/default-summarization-model so users are aware
and deployments won’t fail at runtime.

---

Duplicate comments:
In `@src/backend/chat/tools/document_translate.py`:
- Line 9: The file imports ToolReturn from the wrong module; change the import
to use the top-level package by replacing "from pydantic_ai.messages import
ToolReturn" with "from pydantic_ai import ToolReturn" so the symbol ToolReturn
is resolved correctly (look for the import line near the top of
src/backend/chat/tools/document_translate.py and update the import statement
that currently references pydantic_ai.messages).

---

Nitpick comments:
In `@src/backend/chat/tools/document_translate.py`:
- Line 12: The import of read_document_content from
chat.tools.document_summarize creates an unnecessary coupling; extract the
read_document_content function into a shared utility module (e.g.,
chat.tools.utils or chat.tools.document_utils), update the implementation there,
then change imports in chat.tools.document_translate (and keep
chat.tools.document_summarize) to import read_document_content from the new
module; ensure any internal references to read_document_content are updated and
run tests/linters to verify no remaining imports from
chat.tools.document_summarize.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd4b312 and 34a619a.

📒 Files selected for processing (4)
  • src/backend/chat/tests/tools/test_document_translate.py
  • src/backend/chat/tools/document_translate.py
  • src/backend/conversations/configuration/llm/default.json
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/frontend/apps/conversations/src/features/chat/components/ToolInvocationItem.tsx

Comment thread src/backend/chat/tests/tools/test_document_translate.py
Comment thread src/backend/chat/tools/document_translate.py Outdated
Comment thread src/backend/conversations/configuration/llm/default.json Outdated
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/backend/chat/tools/document_translate.py (1)

9-9: ⚠️ Potential issue | 🔴 Critical

ToolReturn must be imported from pydantic_ai, not pydantic_ai.messages.

ToolReturn is a top-level export — the pydantic-ai docs consistently show from pydantic_ai import ToolReturn. pydantic_ai.messages does not export ToolReturn; this will raise an ImportError at startup.

🐛 Proposed fix
-from pydantic_ai.messages import ToolReturn
+from pydantic_ai import ToolReturn
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` at line 9, The import for
ToolReturn in document_translate.py is wrong; replace the line importing
ToolReturn from pydantic_ai.messages with an import from the top-level
pydantic_ai package (i.e., import ToolReturn from pydantic_ai) so that the
symbol ToolReturn is resolved at startup; update the import statement that
references ToolReturn to use pydantic_ai instead of pydantic_ai.messages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/tools/document_translate.py`:
- Around line 59-61: The code builds instructions_hint using
instructions.strip() but tests truthiness on the original instructions, which
makes whitespace-only strings produce a dangling "Follow these instructions: "
hint; change the guard to use a stripped value (e.g., stripped =
instructions.strip() and then instructions_hint = f"Follow these instructions:
{stripped}" if stripped else "") so only non-empty, non-whitespace instructions
populate instructions_hint; update uses of instructions_hint accordingly.
- Around line 102-109: The prompt in document_translate.py uses a fixed
triple-quote delimiter (translate_prompt) which is vulnerable if the uploaded
document contains the same sequence; replace the fixed delimiter with a strong,
unpredictable boundary (e.g., generate a UUID or random token in the translation
function), insert that token into the prompt template and wrap the content with
the same start/end boundary strings so the LLM can never confuse document
content with instructions; update references to translate_prompt to include the
generated boundary variable and ensure you validate/escape nothing else (or
assert the token does not appear in content), and include the boundary value in
the prompt metadata (e.g., "BOUNDARY: <token>") so both sides use the same
unique delimiter.
- Line 79: The call site is awaiting read_document_content but that function is
synchronous (read_document_content), causing a TypeError; fix by making
read_document_content an async function or by removing await and wrapping the
synchronous call with sync_to_async at the call sites (document_translate.py and
document_summarize.py). Specifically, either convert the definition of
read_document_content to async def (and use an async file reader like aiofiles
or run the blocking read in an executor) so await read_document_content(...) is
valid, or keep it synchronous and replace await read_document_content(...) with
await sync_to_async(read_document_content)(...) or simply call
read_document_content(...) without await and adjust callers accordingly (update
both the call in document_translate.py and the other call in
document_summarize.py).

---

Duplicate comments:
In `@src/backend/chat/tools/document_translate.py`:
- Line 9: The import for ToolReturn in document_translate.py is wrong; replace
the line importing ToolReturn from pydantic_ai.messages with an import from the
top-level pydantic_ai package (i.e., import ToolReturn from pydantic_ai) so that
the symbol ToolReturn is resolved at startup; update the import statement that
references ToolReturn to use pydantic_ai instead of pydantic_ai.messages.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34a619a and 00f28f4.

📒 Files selected for processing (2)
  • src/backend/chat/agents/translate.py
  • src/backend/chat/tools/document_translate.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/backend/chat/agents/translate.py

Comment thread src/backend/chat/tools/document_translate.py
"You must explain this to the user and ask them to provide documents."
)

doc_name, content = await read_document_content(last_attachment)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all definitions of read_document_content
echo "=== Definitions of read_document_content ==="
rg -n "def read_document_content|async def read_document_content" --type=py -A 3

echo -e "\n=== All usages with await ==="
rg -n "await read_document_content" --type=py

echo -e "\n=== File location check ==="
fd read_document_content

Repository: suitenumerique/conversations

Length of output: 799


await-ing a synchronous function will raise TypeError at runtime.

The function read_document_content (line 23 in src/backend/chat/tools/document_summarize.py) is defined as a plain def, not async def, yet its docstring claims it reads content "asynchronously." The function returns a tuple directly: (doc.file_name, f.read().decode("utf-8")).

Awaiting a non-coroutine raises TypeError: object tuple can't be used in 'await' expression. This occurs at line 79 here and also at line 94 in document_summarize.py.

Convert read_document_content to async def, or if it must remain synchronous, wrap the calls with sync_to_async() and remove the await keyword.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` at line 79, The call site is
awaiting read_document_content but that function is synchronous
(read_document_content), causing a TypeError; fix by making
read_document_content an async function or by removing await and wrapping the
synchronous call with sync_to_async at the call sites (document_translate.py and
document_summarize.py). Specifically, either convert the definition of
read_document_content to async def (and use an async file reader like aiofiles
or run the blocking read in an executor) so await read_document_content(...) is
valid, or keep it synchronous and replace await read_document_content(...) with
await sync_to_async(read_document_content)(...) or simply call
read_document_content(...) without await and adjust callers accordingly (update
both the call in document_translate.py and the other call in
document_summarize.py).

Comment thread src/backend/chat/tools/document_translate.py
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
src/backend/chat/tools/document_translate.py (4)

9-9: ToolReturn is imported from the wrong module.

from pydantic_ai.messages import ToolReturn should be from pydantic_ai import ToolReturn.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` at line 9, The import for
ToolReturn is from the wrong module; replace the current import statement "from
pydantic_ai.messages import ToolReturn" with "from pydantic_ai import
ToolReturn" so the ToolReturn symbol is imported from the top-level pydantic_ai
package rather than pydantic_ai.messages; update the import in
document_translate.py where ToolReturn is referenced.

58-60: Whitespace-only instructions injects a dangling hint into the prompt.

if instructions is truthy for " ", but instructions.strip() is "", producing "Follow these instructions: " with no trailing content.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` around lines 58 - 60, The
prompt builds instructions_hint even when instructions contains only whitespace;
update the logic around the instructions and instructions_hint variables so you
check the stripped value instead of the raw string (e.g., compute
stripped_instructions = instructions.strip() and only set instructions_hint when
stripped_instructions is non-empty), and use stripped_instructions in the
formatted hint to avoid producing a dangling "Follow these instructions: "
fragment.

101-108: Triple-quote delimiter is breakable by document content (prompt injection).

If the uploaded document contains the literal sequence \n'''\n, the LLM cannot reliably distinguish content from post-document instructions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` around lines 101 - 108, The
current translate_prompt in document_translate.py uses the fixed triple-quote
delimiter (''' ) which can be injected if the uploaded content contains that
sequence; change the boundary strategy so the LLM can never confuse content with
prompt: generate a unique, hard-to-guess delimiter (e.g., include a UUID or
random nonce) and use that delimiter both around the {content} and in the
surrounding instructions, or alternatively prefix the document with its exact
byte/character length and instruct the model to read exactly that many
characters; update the translate_prompt construction (referencing
translate_prompt, content, and instructions_hint) to use the chosen safe
delimiter or length-prefix method consistently.

82-90: ⚠️ Potential issue | 🟡 Minor

Contradictory: error message passes exact counts to the LLM then tells it not to share them.

Lines 83-84 embed {len(content):,} and {max_chars:,} in the ModelCannotRetry message, immediately followed by "without providing numerical details". This is self-contradictory; the LLM may still echo the numbers to the user, undermining the intent.

If the goal is to hide implementation details, simply omit the numbers from the message:

🛡️ Proposed fix
-            raise ModelCannotRetry(
-                f"The document is too large to translate ({len(content):,} characters, "
-                f"limit is {max_chars:,}). "
-                "You must explain this to the user, without providing numerical details. "
-                "Suggest them to reduce the document size by summarizing it or "
-                "by splitting it into smaller parts. "
-                "Also offer them to summarize the document in the target language instead, "
-                "which can be a good alternative to translation for large documents."
-            )
+            raise ModelCannotRetry(
+                "The document is too large to translate. "
+                "You must explain this to the user. "
+                "Suggest them to reduce the document size by summarizing it or "
+                "by splitting it into smaller parts. "
+                "Also offer them to summarize the document in the target language instead, "
+                "which can be a good alternative to translation for large documents."
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/document_translate.py` around lines 82 - 90, The
ModelCannotRetry exception message currently embeds exact character counts using
{len(content):,} and {max_chars:,} but then instructs the LLM not to share
numerical details; remove the interpolated numbers from the raised message in
document_translate.py (the raise of ModelCannotRetry referencing content and
max_chars) and replace the text with a version that omits numeric values and
simply tells the user the document is too large, advises summarizing or
splitting it, and offers summarizing in the target language as an alternative.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/tools/utils.py`:
- Around line 56-60: The read_document_content function currently decodes bytes
with decode("utf-8") which will raise UnicodeDecodeError for non-UTF-8 text
files; update read_document_content to catch UnicodeDecodeError from
default_storage.open(...).read(), and instead of letting it bubble up, raise
ModelCannotRetry with a clear user-facing message like "Document encoding is not
supported — please re-save as UTF-8"; also change the `@sync_to_async` decorator
on read_document_content to `@sync_to_async`(thread_sensitive=False) since file IO
here does not require thread-sensitive execution to improve throughput.

---

Duplicate comments:
In `@src/backend/chat/tools/document_translate.py`:
- Line 9: The import for ToolReturn is from the wrong module; replace the
current import statement "from pydantic_ai.messages import ToolReturn" with
"from pydantic_ai import ToolReturn" so the ToolReturn symbol is imported from
the top-level pydantic_ai package rather than pydantic_ai.messages; update the
import in document_translate.py where ToolReturn is referenced.
- Around line 58-60: The prompt builds instructions_hint even when instructions
contains only whitespace; update the logic around the instructions and
instructions_hint variables so you check the stripped value instead of the raw
string (e.g., compute stripped_instructions = instructions.strip() and only set
instructions_hint when stripped_instructions is non-empty), and use
stripped_instructions in the formatted hint to avoid producing a dangling
"Follow these instructions: " fragment.
- Around line 101-108: The current translate_prompt in document_translate.py
uses the fixed triple-quote delimiter (''' ) which can be injected if the
uploaded content contains that sequence; change the boundary strategy so the LLM
can never confuse content with prompt: generate a unique, hard-to-guess
delimiter (e.g., include a UUID or random nonce) and use that delimiter both
around the {content} and in the surrounding instructions, or alternatively
prefix the document with its exact byte/character length and instruct the model
to read exactly that many characters; update the translate_prompt construction
(referencing translate_prompt, content, and instructions_hint) to use the chosen
safe delimiter or length-prefix method consistently.
- Around line 82-90: The ModelCannotRetry exception message currently embeds
exact character counts using {len(content):,} and {max_chars:,} but then
instructs the LLM not to share numerical details; remove the interpolated
numbers from the raised message in document_translate.py (the raise of
ModelCannotRetry referencing content and max_chars) and replace the text with a
version that omits numeric values and simply tells the user the document is too
large, advises summarizing or splitting it, and offers summarizing in the target
language as an alternative.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00f28f4 and 88be0c9.

📒 Files selected for processing (4)
  • src/backend/chat/agents/translate.py
  • src/backend/chat/tools/document_summarize.py
  • src/backend/chat/tools/document_translate.py
  • src/backend/chat/tools/utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/backend/chat/agents/translate.py

Comment on lines +56 to +60
@sync_to_async
def read_document_content(doc):
"""Read document content asynchronously."""
with default_storage.open(doc.key) as f:
return doc.file_name, f.read().decode("utf-8")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

asgiref sync_to_async thread_sensitive parameter default documentation

💡 Result:

asgiref.sync.sync_to_async() has the signature sync_to_async(func, thread_sensitive=True), so the default for thread_sensitive is True. Django’s async docs also note that asgiref 3.3.0 changed the default to True (it used to be False). [1]

Primary references:

  • Django async docs (“sync_to_async(sync_function, thread_sensitive=True)”, and the 3.3.0 default-change warning) [1]
  • asgiref source showing thread_sensitive: bool = True in SyncToAsync.__init__ [2]
  • asgiref README noting the default behavior (“By default, sync_to_async will run all synchronous code in the program in the same thread…”) [3]

Sources: [1] (docs.djangoproject.com) [2] (docs.djangoproject.com) [3] (github.com)

Citations:


🏁 Script executed:

# Search for content_type filter and how read_document_content is used
rg "content_type.*startswith.*text" --type py -B 2 -A 2

Repository: suitenumerique/conversations

Length of output: 1317


🏁 Script executed:

# Find where read_document_content is called and how errors are handled
rg "read_document_content" --type py -B 2 -A 2

Repository: suitenumerique/conversations

Length of output: 2013


🏁 Script executed:

# Check the full context of the utils.py file to understand error handling
cat -n src/backend/chat/tools/utils.py | head -80

Repository: suitenumerique/conversations

Length of output: 2540


🏁 Script executed:

# Check error handling in document_translate.py around read_document_content call
cat -n src/backend/chat/tools/document_translate.py | head -150

Repository: suitenumerique/conversations

Length of output: 6615


🏁 Script executed:

# Check error handling in document_summarize.py around read_document_content call
cat -n src/backend/chat/tools/document_summarize.py | head -150

Repository: suitenumerique/conversations

Length of output: 7002


🏁 Script executed:

# Search for exception handlers that might catch these errors
rg "except.*Exception" src/backend/chat/tools/document_translate.py src/backend/chat/tools/document_summarize.py -B 3 -A 3

Repository: suitenumerique/conversations

Length of output: 3853


decode("utf-8") will raise UnicodeDecodeError for non-UTF-8 text attachments.

Since the tool only filters by content_type__startswith="text/", a user could upload a text/plain file encoded in Windows-1252 or Latin-1. The error propagates through the caller's outer exception handler and surfaces as "An unexpected error occurred during document translation: UnicodeDecodeError", which is opaque to the user.

🛡️ Proposed fix
-        return doc.file_name, f.read().decode("utf-8")
+        return doc.file_name, f.read().decode("utf-8", errors="replace")

Alternatively, catch UnicodeDecodeError explicitly and raise a ModelCannotRetry with a clear message (e.g., "Document encoding is not supported — please re-save as UTF-8").

Also, @sync_to_async defaults to thread_sensitive=True, which routes all invocations through the main async thread. Since default_storage.open / file reads have no thread-local state requirements, thread_sensitive=False is better for throughput.

⚡ Optional optimisation
-@sync_to_async
+@sync_to_async(thread_sensitive=False)
 def read_document_content(doc):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@sync_to_async
def read_document_content(doc):
"""Read document content asynchronously."""
with default_storage.open(doc.key) as f:
return doc.file_name, f.read().decode("utf-8")
`@sync_to_async`(thread_sensitive=False)
def read_document_content(doc):
"""Read document content asynchronously."""
with default_storage.open(doc.key) as f:
return doc.file_name, f.read().decode("utf-8", errors="replace")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/tools/utils.py` around lines 56 - 60, The
read_document_content function currently decodes bytes with decode("utf-8")
which will raise UnicodeDecodeError for non-UTF-8 text files; update
read_document_content to catch UnicodeDecodeError from
default_storage.open(...).read(), and instead of letting it bubble up, raise
ModelCannotRetry with a clear user-facing message like "Document encoding is not
supported — please re-save as UTF-8"; also change the `@sync_to_async` decorator
on read_document_content to `@sync_to_async`(thread_sensitive=False) since file IO
here does not require thread-sensitive execution to improve throughput.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants