
fix(gateway): strip leaked special tokens and clean up JSON response #1063

Open
ConnorLi96 wants to merge 6 commits into main from connorli/fix-leaked-token-cleanup

Conversation

Collaborator

@ConnorLi96 ConnorLi96 commented Apr 8, 2026

Problem

Two response quality issues when serving models with chat-template special tokens:

  1. Models with configured parsers (e.g. Kimi K2) sometimes leak ChatML tokens (<|im_end|>, <|im_start|>, etc.) into API responses, where they appear as garbage text to end users.
  2. For json_object/json_schema response formats, models sometimes wrap output in markdown fences or append trailing text after the JSON object, breaking downstream JSON parsing.

Solution

  • Conditionally strip special tokens only when a model-specific tool/reasoning parser is configured, using the tokenizer's own special token list instead of a hardcoded set.
  • For JSON response formats: strip markdown fences, then truncate at the first valid JSON object boundary using a brace-depth-aware state machine.
  • Extract the cleanup logic into a testable response_cleanup module with unit tests.
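The brace-depth-aware truncation described above can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code; the function name `truncate_to_json_boundary` is taken from the review thread and its exact signature is assumed.

```rust
// Illustrative reconstruction (not the PR's actual code) of a brace-depth-
// aware scanner: walk the bytes, track string/escape state, and cut the
// text at the end of the first balanced top-level JSON object or array.
fn truncate_to_json_boundary(text: &str) -> &str {
    let bytes = text.as_bytes();
    match bytes.first() {
        Some(b'{') | Some(b'[') => {}
        _ => return text, // only object/array starts are handled
    }
    let mut depth: usize = 0;
    let mut in_string = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false; // this byte is the escaped char, e.g. the n of \n
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' | b'[' => depth += 1,
            b'}' | b']' => {
                depth -= 1;
                if depth == 0 {
                    return &text[..=i]; // first balanced boundary: drop the rest
                }
            }
            _ => {}
        }
    }
    text // never balanced: leave untouched
}
```

For brevity this sketch treats `}` and `]` interchangeably for depth accounting; a stricter implementation would match bracket kinds.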

Changes

  • model_gateway/src/routers/grpc/utils/response_cleanup.rs (new): strip_leaked_special_tokens, strip_leaked_special_tokens_from_delta, clean_json_response + unit tests
  • model_gateway/src/routers/grpc/utils/mod.rs: module re-export
  • model_gateway/src/routers/grpc/regular/processor.rs: integrate cleanup in Chat Completions + Messages API paths
  • model_gateway/src/routers/grpc/regular/streaming.rs: integrate streaming delta cleanup

Test Plan

  • Verified responses no longer contain leaked <|im_end|> / <|im_start|> tokens for Kimi K2
  • Verified models without configured parsers are unaffected (no token stripping applied)
  • Verified json_object responses with markdown fences are correctly unwrapped
  • Verified JSON responses with trailing text are truncated at the correct boundary
  • Unit tests cover fence stripping, JSON truncation, escape handling, and streaming cleanup
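The fence-stripping case in the test plan can be approximated like this; `strip_markdown_json_fence` here is an illustrative stand-in with an assumed signature, not the module's real helper.

```rust
// Illustrative stand-in for the fence stripping under test. Accepts any
// triple-backtick opener (```, ```json, ```JSON, ...) after optional
// leading whitespace, and returns the inner content.
fn strip_markdown_json_fence(text: &str) -> &str {
    let t = text.trim_start();
    if !t.starts_with("```") {
        return text; // no fence: pass through unchanged
    }
    // Drop the opening fence line, whatever its info string is.
    let body = match t.find('\n') {
        Some(i) => &t[i + 1..],
        None => return text,
    };
    // Cut at the closing fence if present.
    match body.rfind("```") {
        Some(j) => body[..j].trim_end(),
        None => body.trim_end(),
    }
}
```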

Checklist:

  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes

Closes #1111

Summary by CodeRabbit

  • Bug Fixes
    • Improved response quality by removing leaked special tokens from model outputs in both streaming and non-streaming responses.
    • Enhanced JSON response handling to strip unnecessary Markdown formatting and ensure valid JSON structure in model-generated responses.


coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Adds a response-cleanup module and integrates conditional post-processing into gRPC regular and streaming processors to strip leaked special tokens and to normalize/truncate Markdown-wrapped JSON outputs for JSON response formats (applied after tool/reasoning parsing).

Changes

Cohort / File(s) Summary
Processor (non-streaming)
model_gateway/src/routers/grpc/regular/processor.rs
After tool/reasoning parsing, conditionally strip leaked special tokens (when a tool/reasoning parser is configured) and, for JsonObject/JsonSchema responses, run clean_json_response. Invalidate choice logprobs if post-processing changed text.
Streaming processor
model_gateway/src/routers/grpc/regular/streaming.rs
Thread Tokenizer into streaming paths; strip leaked special tokens from deltas, reasoning chunks, and flushed text when parsers are configured. If stripping shortens a delta, suppress choice logprobs.
Response cleanup utilities
model_gateway/src/routers/grpc/utils/mod.rs, model_gateway/src/routers/grpc/utils/response_cleanup.rs
Add response_cleanup module with strip_leaked_special_tokens, strip_leaked_special_tokens_from_delta, and clean_json_response (Markdown fence stripping + truncation to first balanced top-level JSON value). Re-exported via utils; includes unit tests.
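A minimal sketch of the final-text stripping summarized above, assuming a plain slice of special-token strings in place of the real Tokenizer trait (the signature is a simplification, not the module's actual API):

```rust
// Hypothetical simplification: the real function takes `&dyn Tokenizer` and
// reads `get_special_tokens().additional_special_tokens`; here the token
// list is passed directly. Removes every occurrence of each special token
// without trimming surrounding whitespace.
fn strip_leaked_special_tokens(text: &mut String, special_tokens: &[&str]) {
    for token in special_tokens {
        if text.contains(token) {
            *text = text.replace(token, "");
        }
    }
}
```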

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Gateway as ModelGateway::Processor
  participant Model
  participant Tokenizer
  participant Utils as ResponseCleanup

  Client->>Gateway: request (messages, response_format, parsers configured)
  Gateway->>Model: forward request / start stream
  Model-->>Gateway: stream/generate raw text deltas
  Gateway->>Tokenizer: provide delta/tokenizer for stripping (if parser configured)
  Tokenizer-->>Gateway: cleaned delta
  Gateway->>Utils: clean_json_response (if JSON format)
  Utils-->>Gateway: cleaned/trimmed text
  Gateway-->>Client: emit cleaned streaming chunks / final response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237

Poem

🐰 I hopped through streams and fenced-off code,
Snipped stray tokens from the road,
JSON tidy, deltas neat,
Responses now are clean and sweet,
🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title directly and concisely summarizes the two main fixes, stripping leaked special tokens and cleaning up JSON responses, matching the core changes.
  • Linked Issues check: ✅ Passed. The PR fully addresses issue #1111 by implementing token stripping across streaming and non-streaming paths, JSON cleanup for markdown-fenced responses, and conditional logic based on configured parsers.
  • Out of Scope Changes check: ✅ Passed. All changes are tightly scoped to the response cleanup utilities and their integration into the two primary response-processing paths, with no unrelated modifications detected.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms



@ConnorLi96 ConnorLi96 changed the title fix(gateway): strip leaked special tokens and clean up JSON response … fix(gateway): strip leaked special tokens and clean up JSON response Apr 8, 2026
@github-actions github-actions bot added the grpc (gRPC client and router changes) and model-gateway (Model gateway crate changes) labels Apr 8, 2026
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated

@claude claude bot left a comment


Looks good — sensible defense-in-depth cleanup for leaked special tokens and malformed JSON responses. Posted 4 nits (all 🟡), no blocking issues.

Summary: 🟡 Nit × 4, 🔴 Important × 0

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces logic to strip leaked ChatML and reasoning tokens from model responses and adds a mechanism to clean and extract JSON objects from formatted responses. However, the review identifies several issues: the JSON cleaning logic is fragile with respect to whitespace and markdown formatting, and the token stripping in streaming mode is unreliable because tokens can be split across chunks. There is also inconsistency and code duplication across paths, such as the missing `</think>` handling in the non-streaming and incremental parsing logic, which should be addressed by unifying the implementation.

Comment thread model_gateway/src/routers/grpc/regular/processor.rs
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/processor.rs
Comment thread model_gateway/src/routers/grpc/regular/processor.rs
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs Outdated
@ConnorLi96 ConnorLi96 changed the title fix(gateway): strip leaked special tokens and clean up JSON response WIP fix(gateway): strip leaked special tokens and clean up JSON response Apr 8, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b112baecf7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/streaming.rs (1)

423-445: ⚠️ Potential issue | 🟠 Major

Centralize streamed-text sanitization before emission.

The new cleanup only covers the main content delta and parse_incremental()'s normal_text. process_specific_function_stream() still forwards raw delta into function.arguments, the decoder.flush() path below still emits raw held suffixes, and this second copy also omits </think>, so the same leaked markers can still reach clients through those branches. Line 321 also computes choice_logprobs before this rewrite, so sanitized chunks can carry token logprobs for text the client never sees.

Also applies to: 1257-1267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 423 - 445,
Sanitize streamed text in one place before any emission paths: move the
token-stripping logic (respecting configured_tool_parser and
configured_reasoning_parser) to a centralized helper and call it on all outputs
including the main delta, function.arguments in
process_specific_function_stream, any decoder.flush() held suffixes, and
parse_incremental()'s normal_text; ensure the strip set includes "</think>" and
all chatml markers and compute choice_logprobs after sanitization so logprobs
correspond to emitted text (update usages around configured_tool_parser,
configured_reasoning_parser, process_specific_function_stream, decoder.flush(),
parse_incremental(), and choice_logprobs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 201-214: The current unwrapping only matches processed_text that
starts exactly with "```json" or "```JSON"; update the logic in the
is_json_response branch that manipulates processed_text to first trim leading
whitespace/newlines (use processed_text.trim_start()), then detect a leading
triple-backtick fence in a case-insensitive and tolerant way (accept "```",
"```json", "```jsonld", etc., ignoring case or extra spaces) rather than exact
matches; when a fence is found, locate the first newline after the opening fence
and then rfind the closing "```" to extract the inner JSON content (assign back
to processed_text) so responses with leading newlines or plain fences are
correctly unwrapped.
- Around line 186-199: The ChatML cleanup currently removes leaked markers then
unconditionally calls processed_text = processed_text.trim().to_string(), which
strips legitimate leading/trailing whitespace and makes non-streaming behavior
diverge from streaming; remove that unconditional trim so the code only replaces
the marker tokens (leave leading/trailing whitespace intact) in the block that
checks self.configured_tool_parser / self.configured_reasoning_parser and do the
same change for the duplicate cleanup at the other location (lines around the
701-714 snippet) referencing processed_text and the token-replacement loop.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cba22969-fc49-46fb-a354-7133191e099e

📥 Commits

Reviewing files that changed from the base of the PR and between d3fc32a and b112bae.

📒 Files selected for processing (2)
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs

Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated
Comment thread model_gateway/src/routers/grpc/regular/processor.rs Outdated
Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs
Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs
Comment thread model_gateway/src/routers/grpc/regular/processor.rs

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c8c101354a


Comment thread model_gateway/src/routers/grpc/regular/processor.rs
Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs Outdated
Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs Outdated

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
model_gateway/src/routers/grpc/utils/response_cleanup.rs (2)

16-35: ⚠️ Potential issue | 🟠 Major

Don’t trim user content after removing leaked tokens.

Once any special token is removed, the trim() block also deletes legitimate leading/trailing whitespace from otherwise valid responses. That changes user-visible content and makes non-streaming behavior diverge from strip_leaked_special_tokens_from_delta(), which only removes the leaked marker.

Suggested fix
 pub(crate) fn strip_leaked_special_tokens(text: &mut String, tokenizer: &dyn Tokenizer) {
     let special = tokenizer.get_special_tokens();
     if special.additional_special_tokens.is_empty() {
         return;
     }
 
-    let mut changed = false;
     for token in &special.additional_special_tokens {
         if text.contains(token.as_str()) {
             *text = text.replace(token.as_str(), "");
-            changed = true;
         }
     }
-    if changed {
-        let trimmed = text.trim();
-        if trimmed.len() != text.len() {
-            *text = trimmed.to_string();
-        }
-    }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 16 -
35, The function strip_leaked_special_tokens currently trims leading/trailing
whitespace after removing additional_special_tokens, which alters user-visible
content; update strip_leaked_special_tokens (and keep behavior consistent with
strip_leaked_special_tokens_from_delta) to only remove occurrences of tokens
from text using tokenizer.get_special_tokens().additional_special_tokens and do
not call trim() or change surrounding whitespace—i.e., eliminate the
trimmed/changed block so the function only replaces token.as_str() with "" and
leaves whitespace intact.

63-79: ⚠️ Potential issue | 🟠 Major

Trim leading whitespace and accept generic code fences before JSON cleanup.

clean_json_response() currently only unwraps text that starts exactly with ```json / ```JSON, and truncate_to_json_boundary() only runs when the first character is { or [. Responses like "\n```json\n{...}\n```", plain triple-backtick fences, or "\n{...} trailing" skip both helpers and still break downstream JSON parsing.

Also applies to: 86-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 63 -
79, The cleanup helpers miss cases with leading whitespace and generic code
fences; update clean_json_response to trim leading whitespace (e.g.,
text.trim_start()) before calling strip_markdown_json_fence and
truncate_to_json_boundary, modify strip_markdown_json_fence to accept plain
triple-backtick fences and to detect fences case-insensitively (treat "```",
"```json", "```JSON", etc.), and ensure truncate_to_json_boundary is invoked on
the trimmed/stripped text regardless of the original first character so JSON
like "\n```json\n{...}\n```" or "\n{...} trailing" are handled correctly;
reference functions: clean_json_response, strip_markdown_json_fence,
truncate_to_json_boundary.
model_gateway/src/routers/grpc/regular/processor.rs (1)

181-196: ⚠️ Potential issue | 🟠 Major

Drop logprobs whenever this cleanup rewrites processed_text.

logprobs are derived from the raw proto output before the new special-token/JSON cleanup runs. If this block strips leaked markers, unwraps fences, or truncates trailing text, message.content no longer matches the returned token logprobs. If recomputation is not possible, None is the safer fallback whenever processed_text changes here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 181 - 196,
The computed `logprobs` come from the raw proto before special-token/JSON
cleanup, so if cleanup mutates `processed_text` we must drop `logprobs`; capture
the original `processed_text` (e.g., let original_processed =
processed_text.clone()) before calling utils::strip_leaked_special_tokens and
utils::clean_json_response, run the cleanups (those calls in the block checking
self.configured_tool_parser / is_json_response), and if processed_text !=
original_processed set logprobs = None (or replace the Option with None) so the
returned token logprobs are not inconsistent with the modified message content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 424-430: process_messages_streaming_chunks() currently emits
parser fields `text` / `normal_text` without sanitizing leaked special tokens;
update it to call utils::strip_leaked_special_tokens_from_delta(...) (passing
the same `tokenizer.as_ref()`) wherever `text` or `normal_text` are produced —
both on the regular-text path and the incremental tool-parser path — using the
same condition used elsewhere (i.e., when `self.configured_tool_parser.is_some()
|| self.configured_reasoning_parser.is_some()`) so streamed Messages get the
same `<|im_*|>` / `</think>` stripping as non-streaming code; apply the same
change to the other occurrence of the same pattern (the second block analogous
to 1253-1262).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6f76d5d1-1de3-4a2c-9375-cd7943422d2a

📥 Commits

Reviewing files that changed from the base of the PR and between b112bae and c8c1013.

📒 Files selected for processing (4)
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/mod.rs
  • model_gateway/src/routers/grpc/utils/response_cleanup.rs

Comment thread model_gateway/src/routers/grpc/regular/streaming.rs
@ConnorLi96 ConnorLi96 force-pushed the connorli/fix-leaked-token-cleanup branch from c8c1013 to 6a6b248 on April 9, 2026 at 12:04
@ConnorLi96 ConnorLi96 changed the title WIP fix(gateway): strip leaked special tokens and clean up JSON response fix(gateway): strip leaked special tokens and clean up JSON response Apr 9, 2026
ConnorLi96 and others added 2 commits April 9, 2026 05:05
…format

- Conditionally strip chatml tokens (<|im_end|>, etc.) only when a model-specific parser is configured
- Strip markdown JSON fences for json_object/json_schema response formats
- Truncate trailing content after first valid JSON object boundary

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…coded chatml tokens

Replace 4 duplicated hardcoded CHATML_LEAK_TOKENS lists with dynamic
stripping via tokenizer.get_special_tokens().additional_special_tokens.
This adapts automatically to any model family (ChatML, Llama, Phi, etc.)
without maintaining a static token list.

Extract JSON response cleanup (markdown fence removal + brace-depth
truncation) into a shared utility with unit tests. Also adds support
for JSON array top-level values.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Signed-off-by: ConnorLi96 <connorli@together.ai>
Made-with: Cursor
Comment thread model_gateway/src/routers/grpc/regular/streaming.rs

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6a6b248915


Comment thread model_gateway/src/routers/grpc/regular/streaming.rs

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0555c02197


Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs Outdated
Comment thread model_gateway/src/routers/grpc/utils/response_cleanup.rs Outdated
- Add CHATML_FALLBACK_TOKENS fallback for tiktoken tokenizers where
  additional_special_tokens is empty (fixes Kimi K2 regression)
- Extend token stripping to Messages API streaming (all 3 emission
  points + flush path) — previously only Chat Completions was covered
- Strip leaked tokens from Chat streaming flush (decoder.flush) path
- Drop logprobs when text was mutated by stripping or JSON cleanup
  (non-streaming: compare pre/post length; streaming: compare delta len)
- Remove trim() after stripping to preserve legitimate whitespace
- Accept any triple-backtick fence (```, ```json, ```JSON, etc.) in
  JSON cleanup, and trim leading whitespace before fence detection

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
@ConnorLi96 ConnorLi96 force-pushed the connorli/fix-leaked-token-cleanup branch from 0555c02 to 21edef5 on April 14, 2026 at 18:25
Reasoning text emitted via `process_reasoning_stream` (Chat) and
`process_messages_reasoning` (Messages) was not sanitized. This caused
ChatML tokens like `<|im_end|>` to leak into `reasoning_content` SSE
deltas even though `content` was clean.

- Pass tokenizer into `process_reasoning_stream` and strip reasoning
  text via `strip_leaked_special_tokens_from_delta` before building
  the SSE chunk
- Strip `reasoning_chunk_text` in the Messages streaming path alongside
  `normal_text` in a single conditional block

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
Comment thread model_gateway/src/routers/grpc/regular/processor.rs

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d968642b28


Comment on lines +62 to +66
if special.additional_special_tokens.is_empty() {
for token in CHATML_FALLBACK_TOKENS {
if result.contains(token) {
result = result.replace(token, "");
}


P2: Preserve token stripping state across streamed deltas

strip_leaked_special_tokens_from_delta only performs contains/replace on the current chunk and keeps no carry-over state, so leaked markers that arrive split across chunks are not removed. This is especially likely in the fallback path (CHATML_FALLBACK_TOKENS) where strings like <|im_end|> may be emitted over multiple deltas; e.g. <|im_ then end|> passes through unchanged and is reassembled by clients as leaked special-token text. The streaming cleaner needs a per-stream tail buffer (or equivalent boundary-aware logic) to catch cross-chunk matches.
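One way to implement the tail buffer this comment asks for: after removing complete tokens from the accumulated text, hold back any suffix that is a proper prefix of a known special token and only emit the rest. This is a sketch of the reviewer's proposal, not code from the PR; the type and method names are invented, and it assumes ASCII special tokens.

```rust
// Hypothetical sketch of the per-stream tail buffer (names invented; assumes
// ASCII special tokens). `push` removes complete tokens from the accumulated
// text, emits everything that cannot be the start of a split token, and
// holds the remainder back for the next delta.
struct DeltaCleaner {
    tokens: Vec<String>,
    held: String, // carry-over that might be a token prefix
}

impl DeltaCleaner {
    fn new(tokens: &[&str]) -> Self {
        Self {
            tokens: tokens.iter().map(|s| s.to_string()).collect(),
            held: String::new(),
        }
    }

    /// Feed one streamed delta; returns the text that is safe to emit now.
    fn push(&mut self, delta: &str) -> String {
        self.held.push_str(delta);
        for t in &self.tokens {
            self.held = self.held.replace(t.as_str(), "");
        }
        // Hold back the longest suffix that is a proper prefix of any token.
        let mut keep = 0;
        for t in &self.tokens {
            for plen in 1..t.len() {
                if self.held.ends_with(&t[..plen]) {
                    keep = keep.max(plen);
                }
            }
        }
        let emit_len = self.held.len() - keep;
        let out = self.held[..emit_len].to_string();
        self.held.drain(..emit_len);
        out
    }

    /// End of stream: emit whatever is still held (it never became a token).
    fn flush(&mut self) -> String {
        std::mem::take(&mut self.held)
    }
}
```

The held suffix must be flushed when the stream ends, otherwise legitimate text that merely looks like a token prefix (e.g. a trailing `<`) would be lost.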



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 186-205: The cleanup only operates on processed_text; extend the
same sanitization to reasoning_text and each tool_calls[*].function.arguments:
call utils::strip_leaked_special_tokens(&mut reasoning_text, tokenizer.as_ref())
and for every tool_call do utils::strip_leaked_special_tokens(&mut
tool_call.function.arguments, tokenizer.as_ref()); also apply
utils::clean_json_response when is_json_response is true to both reasoning_text
and tool_call.function.arguments so leaked markers/JSON-breaking tokens are
removed, and if any of those lengths change invalidate logprobs (same check as
processed_text). Apply the same changes at the analogous location around the
code handling non-streaming reasoning (the other referenced block).
- Around line 182-205: Record the processed_text length before any
tool/reasoning parsing runs (e.g. introduce pre_parsing_len =
processed_text.len() before the code paths that run configured_tool_parser or
configured_reasoning_parser), then after parsing and the existing cleanup steps
check against that saved length and set logprobs = None if processed_text.len()
!= pre_parsing_len; keep the existing post-cleanup length check as well so any
cleanup edits still invalidate logprobs, and reference the existing variables
logprobs, processed_text, configured_tool_parser, and
configured_reasoning_parser when making the change.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 1286-1295: The streaming path currently strips leaked special
tokens only from normal_text via utils::strip_leaked_special_tokens_from_delta
when either self.configured_tool_parser or self.configured_reasoning_parser is
set, but it leaves parser-emitted tool_call_item.parameters untouched; update
the code to call utils::strip_leaked_special_tokens_from_delta on
tool_call_item.parameters (and any other incremental JSON fragments used for
Chat tool-call deltas / Messages InputJsonDelta) under the same conditional so
parameters are sanitized too, and apply the same change in the other affected
block around the 1763–1779 region; reference the symbols configured_tool_parser,
configured_reasoning_parser, utils::strip_leaked_special_tokens_from_delta,
normal_text, and tool_call_item.parameters when locating and changing the code.
- Around line 425-437: The current logic only nulls choice_logprobs when
special-token stripping shortens delta, but if process_reasoning_stream()
already split off part of the chunk for a reasoning event the assistant delta no
longer aligns with chunk.output_logprobs() and we must also invalidate logprobs;
after calling process_reasoning_stream() (and after applying
utils::strip_leaked_special_tokens_from_delta), compare the length/structure of
the resulting delta against the original chunk logprobs (e.g.
chunk.output_logprobs() or its length) and if they don't match treat
choice_logprobs as None; update the block handling delta, choice_logprobs, and
any logic around configured_tool_parser/configured_reasoning_parser to set
choice_logprobs = None when a reasoning split was performed or delta no longer
corresponds to chunk.output_logprobs().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 660709fd-7a95-45cf-8b46-f2b5c1288472

📥 Commits

Reviewing files that changed from the base of the PR and between 6a6b248 and 21edef5.

📒 Files selected for processing (3)
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/response_cleanup.rs

Comment on lines +182 to +205
let mut logprobs = complete.output_logprobs().map(|ref proto_logprobs| {
    utils::convert_proto_to_openai_logprobs(proto_logprobs, tokenizer)
});

// Step 5: Post-processing cleanup.
// Track whether text is mutated so we can invalidate logprobs.
let pre_cleanup_len = processed_text.len();

if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
}

let is_json_response = matches!(
    &original_request.response_format,
    Some(openai_protocol::common::ResponseFormat::JsonObject)
        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
);
if is_json_response {
    utils::clean_json_response(&mut processed_text);
}

if processed_text.len() != pre_cleanup_len {
    logprobs = None;
}

⚠️ Potential issue | 🟠 Major

Drop logprobs when parsing already changed the emitted text.

logprobs come from the raw proto output, but processed_text may already have been rewritten by reasoning/tool parsing before this cleanup block runs. Line 203 only catches cleanup-length changes, so tool-call or separate_reasoning responses can still return logprobs for text that never appears in message.content.
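
Under the review's assumption that a length check is an acceptable change signal, the suggested fix can be sketched as follows (the closures stand in for the gateway's parser and cleanup passes, which are not reproduced here):

```rust
// Capture the length before parsing runs, not just before cleanup, so any
// rewrite of the text invalidates the raw-proto logprobs.
fn finalize(
    mut processed_text: String,
    mut logprobs: Option<Vec<f32>>,
    run_parsers: impl FnOnce(&mut String), // tool / reasoning parsing
    run_cleanup: impl FnOnce(&mut String), // token stripping, JSON cleanup
) -> (String, Option<Vec<f32>>) {
    let pre_parsing_len = processed_text.len();
    run_parsers(&mut processed_text);
    run_cleanup(&mut processed_text);
    if processed_text.len() != pre_parsing_len {
        logprobs = None; // text no longer matches what logprobs were computed for
    }
    (processed_text, logprobs)
}
```

The length comparison is a heuristic: a parser that rewrites text without changing its length would still slip through, so a dirty flag returned by the parsers would be stricter.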

Comment on lines +186 to +205
// Step 5: Post-processing cleanup.
// Track whether text is mutated so we can invalidate logprobs.
let pre_cleanup_len = processed_text.len();

if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
}

let is_json_response = matches!(
    &original_request.response_format,
    Some(openai_protocol::common::ResponseFormat::JsonObject)
        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
);
if is_json_response {
    utils::clean_json_response(&mut processed_text);
}

if processed_text.len() != pre_cleanup_len {
    logprobs = None;
}

⚠️ Potential issue | 🟠 Major

Sanitize parser-derived reasoning and tool outputs too.

This only scrubs processed_text. By the time this runs, reasoning_text and tool_calls[*].function.arguments are already detached, so non-streaming reasoning_content / Messages Thinking can still leak markers, and leaked argument text can trip the Messages serde_json::from_str(args) fallback to {}.

Also applies to: 671-673
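
The extension can be sketched like this (illustrative struct and field names, not the gateway's real types): one stripping pass over content, reasoning text, and every tool call's arguments, reporting whether anything changed so the caller can drop logprobs.

```rust
struct ToolCall {
    arguments: String, // serialized JSON arguments emitted by the parser
}

/// Remove every marker from `s`; returns true if the text shrank.
fn strip_markers(s: &mut String, markers: &[&str]) -> bool {
    let before = s.len();
    for m in markers {
        if s.contains(m) {
            *s = s.replace(m, "");
        }
    }
    s.len() != before
}

/// Sanitize all user-visible fields, not just the main content.
fn sanitize_all(
    content: &mut String,
    reasoning: &mut Option<String>,
    tool_calls: &mut [ToolCall],
    markers: &[&str],
) -> bool {
    let mut changed = strip_markers(content, markers);
    if let Some(r) = reasoning.as_mut() {
        changed |= strip_markers(r, markers);
    }
    for tc in tool_calls.iter_mut() {
        changed |= strip_markers(&mut tc.arguments, markers);
    }
    changed
}
```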

Comment on lines +425 to +437
let original_len = delta.len();
let delta = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(delta, tokenizer.as_ref())
} else {
    delta
};
let choice_logprobs = if delta.len() < original_len {
    None
} else {
    choice_logprobs
};

⚠️ Potential issue | 🟠 Major

Reasoning-split deltas still carry stale streaming logprobs.

This only invalidates logprobs when special-token stripping shortens delta. If process_reasoning_stream() already moved part of the chunk into a reasoning event, the assistant content delta no longer matches chunk.output_logprobs(), but choice_logprobs still survives.
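
One way to express that invariant (hypothetical helper; `logprob_tokens` stands for the decoded tokens behind chunk.output_logprobs()): keep the streamed logprobs only when the emitted delta is still exactly the text those logprobs were computed for.

```rust
/// Drop logprobs whenever the delta no longer matches the raw chunk text,
/// whether because of a reasoning split or special-token stripping.
fn aligned_logprobs(
    delta: &str,
    logprob_tokens: &[&str],
    logprobs: Option<Vec<f32>>,
) -> Option<Vec<f32>> {
    let raw: String = logprob_tokens.concat();
    if delta == raw { logprobs } else { None }
}
```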

Comment on lines +1286 to +1295
let normal_text = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(
        normal_text,
        tokenizer.as_ref(),
    )
} else {
    normal_text
};

⚠️ Potential issue | 🟠 Major

Sanitize streamed tool arguments, not just normal_text.

These paths clean the surrounding text, but the parser-emitted tool_call_item.parameters fragments are still forwarded unchanged later in Chat tool-call deltas and Messages InputJsonDelta. Leaked <|im_*|> / </think> markers in arguments will still corrupt the incremental JSON clients consume.

Also applies to: 1763-1779


Labels

grpc (gRPC client and router changes), model-gateway (Model gateway crate changes)

Development

Successfully merging this pull request may close these issues.

[Bug]: Internal special tokens leak into final API responses for Kimi K2 outputs

1 participant