fix(gateway): strip leaked special tokens and clean up JSON response #1063
ConnorLi96 wants to merge 6 commits into main from
Conversation
📝 Walkthrough

Adds a response-cleanup module and integrates conditional post-processing into the gRPC regular and streaming processors to strip leaked special tokens and to normalize/truncate Markdown-wrapped JSON outputs for JSON response formats (applied after tool/reasoning parsing).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as ModelGateway::Processor
    participant Model
    participant Tokenizer
    participant Utils as ResponseCleanup

    Client->>Gateway: request (messages, response_format, parsers configured)
    Gateway->>Model: forward request / start stream
    Model-->>Gateway: stream/generate raw text deltas
    Gateway->>Tokenizer: provide delta/tokenizer for stripping (if parser configured)
    Tokenizer-->>Gateway: cleaned delta
    Gateway->>Utils: clean_json_response (if JSON format)
    Utils-->>Gateway: cleaned/trimmed text
    Gateway-->>Client: emit cleaned streaming chunks / final response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Warning: review ran into problems: timed out fetching pipeline failures after 30000ms.
Code Review
This pull request introduces logic to strip leaked ChatML and reasoning tokens from model responses and adds a mechanism to clean and extract JSON objects from formatted responses. However, the review identifies several issues: the JSON cleaning logic is fragile with respect to whitespace and markdown formatting, and the token stripping in streaming mode is unreliable because tokens can be split across chunks. Additionally, there are inconsistencies and code duplication across different paths, such as the missing `</think>` tag in the non-streaming and incremental parsing logic, which should be addressed by unifying the implementation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b112baecf7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/streaming.rs (1)
423-445: ⚠️ Potential issue | 🟠 Major: Centralize streamed-text sanitization before emission.

The new cleanup only covers the main content delta and `parse_incremental()`'s `normal_text`. `process_specific_function_stream()` still forwards the raw `delta` into `function.arguments`, the `decoder.flush()` path below still emits raw held suffixes, and this second copy also omits `</think>`, so the same leaked markers can still reach clients through those branches. Line 321 also computes `choice_logprobs` before this rewrite, so sanitized chunks can carry token logprobs for text the client never sees.

Also applies to: 1257-1267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 423 - 445, Sanitize streamed text in one place before any emission paths: move the token-stripping logic (respecting configured_tool_parser and configured_reasoning_parser) to a centralized helper and call it on all outputs including the main delta, function.arguments in process_specific_function_stream, any decoder.flush() held suffixes, and parse_incremental()'s normal_text; ensure the strip set includes "</think>" and all chatml markers and compute choice_logprobs after sanitization so logprobs correspond to emitted text (update usages around configured_tool_parser, configured_reasoning_parser, process_specific_function_stream, decoder.flush(), parse_incremental(), and choice_logprobs).
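To illustrate the centralization the reviewer asks for, here is a minimal sketch; the helper name `sanitize_stream_text` and the marker list are assumptions for illustration, not code from this PR:

```rust
/// Hypothetical single choke point for streamed-text sanitization: every
/// emission path (content deltas, function arguments, flush suffixes,
/// parse_incremental normal_text) would call this instead of duplicating
/// the strip loop at each call site.
fn sanitize_stream_text(text: &str, markers: &[&str]) -> String {
    let mut out = text.to_string();
    for m in markers {
        if out.contains(m) {
            out = out.replace(m, "");
        }
    }
    out
}

fn main() {
    // Strip set includes "</think>", which the review notes one copy omitted.
    let markers = ["<|im_start|>", "<|im_end|>", "</think>"];
    assert_eq!(sanitize_stream_text("done</think><|im_end|>", &markers), "done");
    println!("ok");
}
```

Because all branches share one function, adding a marker (or fixing a bug) happens in exactly one place.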
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 201-214: The current unwrapping only matches processed_text that
starts exactly with "```json" or "```JSON"; update the logic in the
is_json_response branch that manipulates processed_text to first trim leading
whitespace/newlines (use processed_text.trim_start()), then detect a leading
triple-backtick fence in a case-insensitive and tolerant way (accept "```",
"```json", "```jsonld", etc., ignoring case or extra spaces) rather than exact
matches; when a fence is found, locate the first newline after the opening fence
and then rfind the closing "```" to extract the inner JSON content (assign back
to processed_text) so responses with leading newlines or plain fences are
correctly unwrapped.
- Around line 186-199: The ChatML cleanup currently removes leaked markers then
unconditionally calls processed_text = processed_text.trim().to_string(), which
strips legitimate leading/trailing whitespace and makes non-streaming behavior
diverge from streaming; remove that unconditional trim so the code only replaces
the marker tokens (leave leading/trailing whitespace intact) in the block that
checks self.configured_tool_parser / self.configured_reasoning_parser and do the
same change for the duplicate cleanup at the other location (lines around the
701-714 snippet) referencing processed_text and the token-replacement loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cba22969-fc49-46fb-a354-7133191e099e
📒 Files selected for processing (2)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8c101354a
Actionable comments posted: 1
♻️ Duplicate comments (3)
model_gateway/src/routers/grpc/utils/response_cleanup.rs (2)
16-35: ⚠️ Potential issue | 🟠 Major: Don't trim user content after removing leaked tokens.

Once any special token is removed, the `trim()` block also deletes legitimate leading/trailing whitespace from otherwise valid responses. That changes user-visible content and makes non-streaming behavior diverge from `strip_leaked_special_tokens_from_delta()`, which only removes the leaked marker.

Suggested fix:

```diff
 pub(crate) fn strip_leaked_special_tokens(text: &mut String, tokenizer: &dyn Tokenizer) {
     let special = tokenizer.get_special_tokens();
     if special.additional_special_tokens.is_empty() {
         return;
     }
-    let mut changed = false;
     for token in &special.additional_special_tokens {
         if text.contains(token.as_str()) {
             *text = text.replace(token.as_str(), "");
-            changed = true;
         }
     }
-    if changed {
-        let trimmed = text.trim();
-        if trimmed.len() != text.len() {
-            *text = trimmed.to_string();
-        }
-    }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 16 - 35, The function strip_leaked_special_tokens currently trims leading/trailing whitespace after removing additional_special_tokens, which alters user-visible content; update strip_leaked_special_tokens (and keep behavior consistent with strip_leaked_special_tokens_from_delta) to only remove occurrences of tokens from text using tokenizer.get_special_tokens().additional_special_tokens and do not call trim() or change surrounding whitespace—i.e., eliminate the trimmed/changed block so the function only replaces token.as_str() with "" and leaves whitespace intact.
63-79: ⚠️ Potential issue | 🟠 Major: Trim leading whitespace and accept generic code fences before JSON cleanup.

`clean_json_response()` currently only unwraps text that starts exactly with `` ```json ``/`` ```JSON ``, and `truncate_to_json_boundary()` only runs when the first character is `{` or `[`. Responses like `"\n```json\n{...}\n```"`, plain triple-backtick fences, or `"\n{...} trailing"` skip both helpers and still break downstream JSON parsing.

Also applies to: 86-89
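A fence handler tolerant in the way this comment suggests could look roughly like the sketch below; this is illustrative code under the reviewer's assumptions, not the PR's actual implementation:

```rust
/// Sketch: unwrap a Markdown code fence around a JSON payload, tolerating
/// leading whitespace and any fence info string ("```", "```json", "```JSON").
fn strip_markdown_json_fence(text: &str) -> &str {
    let t = text.trim_start();
    if !t.starts_with("```") {
        return t;
    }
    // Skip the opening fence line, whatever its info string is.
    let body = match t.find('\n') {
        Some(i) => &t[i + 1..],
        None => return t,
    };
    // Drop the closing fence if one exists.
    match body.rfind("```") {
        Some(j) => body[..j].trim_end(),
        None => body.trim_end(),
    }
}

fn main() {
    assert_eq!(strip_markdown_json_fence("\n```json\n{\"a\":1}\n```"), "{\"a\":1}");
    assert_eq!(strip_markdown_json_fence("```\n[1,2]\n```"), "[1,2]");
    println!("ok");
}
```

Returning a `&str` slice avoids allocation; the caller can `to_string()` only when the text actually changed.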
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 63 - 79, The cleanup helpers miss cases with leading whitespace and generic code fences; update clean_json_response to trim leading whitespace (e.g., text.trim_start()) before calling strip_markdown_json_fence and truncate_to_json_boundary, modify strip_markdown_json_fence to accept plain triple-backtick fences and to detect fences case-insensitively (treat "```", "```json", "```JSON", etc.), and ensure truncate_to_json_boundary is invoked on the trimmed/stripped text regardless of the original first character so JSON like "\n```json\n{...}\n```" or "\n{...} trailing" are handled correctly; reference functions: clean_json_response, strip_markdown_json_fence, truncate_to_json_boundary.

model_gateway/src/routers/grpc/regular/processor.rs (1)
181-196: ⚠️ Potential issue | 🟠 Major: Drop `logprobs` whenever this cleanup rewrites `processed_text`.

`logprobs` are derived from the raw proto output before the new special-token/JSON cleanup runs. If this block strips leaked markers, unwraps fences, or truncates trailing text, `message.content` no longer matches the returned token logprobs. If recomputation is not possible, `None` is the safer fallback whenever `processed_text` changes here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 181 - 196, The computed `logprobs` come from the raw proto before special-token/JSON cleanup, so if cleanup mutates `processed_text` we must drop `logprobs`; capture the original `processed_text` (e.g., let original_processed = processed_text.clone()) before calling utils::strip_leaked_special_tokens and utils::clean_json_response, run the cleanups (those calls in the block checking self.configured_tool_parser / is_json_response), and if processed_text != original_processed set logprobs = None (or replace the Option with None) so the returned token logprobs are not inconsistent with the modified message content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 424-430: process_messages_streaming_chunks() currently emits
parser fields `text` / `normal_text` without sanitizing leaked special tokens;
update it to call utils::strip_leaked_special_tokens_from_delta(...) (passing
the same `tokenizer.as_ref()`) wherever `text` or `normal_text` are produced —
both on the regular-text path and the incremental tool-parser path — using the
same condition used elsewhere (i.e., when `self.configured_tool_parser.is_some()
|| self.configured_reasoning_parser.is_some()`) so streamed Messages get the
same `<|im_*|>` / `</think>` stripping as non-streaming code; apply the same
change to the other occurrence of the same pattern (the second block analogous
to 1253-1262).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6f76d5d1-1de3-4a2c-9375-cd7943422d2a
📒 Files selected for processing (4)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/mod.rs
model_gateway/src/routers/grpc/utils/response_cleanup.rs
Force-pushed: c8c1013 to 6a6b248 (Compare)
…format

- Conditionally strip chatml tokens (`<|im_end|>`, etc.) only when a model-specific parser is configured
- Strip markdown JSON fences for `json_object`/`json_schema` response formats
- Truncate trailing content after first valid JSON object boundary

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…coded chatml tokens

Replace 4 duplicated hardcoded CHATML_LEAK_TOKENS lists with dynamic stripping via `tokenizer.get_special_tokens().additional_special_tokens`. This adapts automatically to any model family (ChatML, Llama, Phi, etc.) without maintaining a static token list.

Extract JSON response cleanup (markdown fence removal + brace-depth truncation) into a shared utility with unit tests. Also adds support for JSON array top-level values.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Signed-off-by: ConnorLi96 <connorli@together.ai>
Made-with: Cursor
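The brace-depth truncation this commit describes can be sketched as follows; this is illustrative only, and the real `truncate_to_json_boundary` in the PR may differ in detail:

```rust
/// Sketch: keep text up to the point where the first top-level JSON object
/// or array closes, dropping any trailing prose the model appended.
fn truncate_to_json_boundary(text: &str) -> &str {
    let bytes = text.as_bytes();
    let (open, close) = match bytes.first() {
        Some(b'{') => (b'{', b'}'),
        Some(b'[') => (b'[', b']'), // array top-level values, as the commit adds
        _ => return text,
    };
    let mut depth = 0usize;
    let mut in_str = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate() {
        if in_str {
            // Brackets inside JSON strings must not affect depth.
            if escaped { escaped = false; }
            else if b == b'\\' { escaped = true; }
            else if b == b'"' { in_str = false; }
        } else if b == b'"' {
            in_str = true;
        } else if b == open {
            depth += 1;
        } else if b == close {
            depth -= 1;
            if depth == 0 {
                return &text[..=i]; // first top-level value just closed
            }
        }
    }
    text // unbalanced: leave as-is rather than guess
}

fn main() {
    assert_eq!(truncate_to_json_boundary("{\"a\":1} trailing"), "{\"a\":1}");
    assert_eq!(truncate_to_json_boundary("[1, \"]\"] extra"), "[1, \"]\"]");
    println!("ok");
}
```

Tracking string state is what makes the depth counter safe: a `}` inside a string literal must not terminate the scan early.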
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a6b248915
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0555c02197
- Add CHATML_FALLBACK_TOKENS fallback for tiktoken tokenizers where additional_special_tokens is empty (fixes Kimi K2 regression)
- Extend token stripping to Messages API streaming (all 3 emission points + flush path); previously only Chat Completions was covered
- Strip leaked tokens from Chat streaming flush (decoder.flush) path
- Drop logprobs when text was mutated by stripping or JSON cleanup (non-streaming: compare pre/post length; streaming: compare delta len)
- Remove trim() after stripping to preserve legitimate whitespace
- Accept any triple-backtick fence (```, ```json, ```JSON, etc.) in JSON cleanup, and trim leading whitespace before fence detection

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
Force-pushed: 0555c02 to 21edef5 (Compare)
Reasoning text emitted via `process_reasoning_stream` (Chat) and `process_messages_reasoning` (Messages) was not sanitized. This caused ChatML tokens like `<|im_end|>` to leak into `reasoning_content` SSE deltas even though `content` was clean.

- Pass tokenizer into `process_reasoning_stream` and strip reasoning text via `strip_leaked_special_tokens_from_delta` before building the SSE chunk
- Strip `reasoning_chunk_text` in the Messages streaming path alongside `normal_text` in a single conditional block

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d968642b28
```rust
if special.additional_special_tokens.is_empty() {
    for token in CHATML_FALLBACK_TOKENS {
        if result.contains(token) {
            result = result.replace(token, "");
        }
```
Preserve token stripping state across streamed deltas
`strip_leaked_special_tokens_from_delta` only performs contains/replace on the current chunk and keeps no carry-over state, so leaked markers that arrive split across chunks are not removed. This is especially likely in the fallback path (`CHATML_FALLBACK_TOKENS`), where strings like `<|im_end|>` may be emitted over multiple deltas; e.g. `<|im_` then `end|>` passes through unchanged and is reassembled by clients as leaked special-token text. The streaming cleaner needs a per-stream tail buffer (or equivalent boundary-aware logic) to catch cross-chunk matches.
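One way to implement the tail buffer this comment calls for is sketched below; `DeltaStripper` is a hypothetical name and a simplified design, not code from this PR:

```rust
/// Boundary-aware streaming stripper: removes full special tokens from each
/// delta, and holds back any trailing suffix that could be the start of a
/// token split across chunks.
pub struct DeltaStripper {
    tokens: Vec<String>,
    max_len: usize,
    tail: String, // carry-over suffix from the previous delta
}

impl DeltaStripper {
    pub fn new(tokens: Vec<String>) -> Self {
        let max_len = tokens.iter().map(|t| t.len()).max().unwrap_or(0);
        Self { tokens, max_len, tail: String::new() }
    }

    /// Returns the sanitized text that is safe to emit for this delta.
    pub fn push(&mut self, delta: &str) -> String {
        let mut buf = std::mem::take(&mut self.tail);
        buf.push_str(delta);
        for t in &self.tokens {
            buf = buf.replace(t.as_str(), "");
        }
        // Hold back the longest suffix that is a prefix of any special token.
        let hold = (1..self.max_len.min(buf.len()) + 1)
            .rev()
            .find(|&n| {
                buf.is_char_boundary(buf.len() - n)
                    && self.tokens.iter().any(|t| t.starts_with(&buf[buf.len() - n..]))
            })
            .unwrap_or(0);
        self.tail = buf.split_off(buf.len() - hold);
        buf
    }

    /// At end of stream, emit any held suffix: it was not a special token.
    pub fn flush(&mut self) -> String {
        std::mem::take(&mut self.tail)
    }
}

fn main() {
    let mut s = DeltaStripper::new(vec!["<|im_end|>".to_string()]);
    let mut out = String::new();
    out.push_str(&s.push("hello <|im_")); // "<|im_" is held back
    out.push_str(&s.push("end|>"));       // token completes and is stripped
    out.push_str(&s.flush());
    assert_eq!(out, "hello ");
    println!("ok");
}
```

The trade-off is latency: up to `max_len - 1` bytes are withheld until the next delta (or `flush()`) disambiguates them, which is why streaming sanitizers need an explicit end-of-stream flush path.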
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 660709fd-7a95-45cf-8b46-f2b5c1288472
📒 Files selected for processing (3)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/response_cleanup.rs
```diff
 let mut logprobs = complete.output_logprobs().map(|ref proto_logprobs| {
     utils::convert_proto_to_openai_logprobs(proto_logprobs, tokenizer)
 });

-// Step 5: Build ChatCompletionMessage (proper response message type)
+// Step 5: Post-processing cleanup.
+// Track whether text is mutated so we can invalidate logprobs.
+let pre_cleanup_len = processed_text.len();
+
+if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
+    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
+}
+
+let is_json_response = matches!(
+    &original_request.response_format,
+    Some(openai_protocol::common::ResponseFormat::JsonObject)
+        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
+);
+if is_json_response {
+    utils::clean_json_response(&mut processed_text);
+}
+
+if processed_text.len() != pre_cleanup_len {
+    logprobs = None;
+}
```
Drop logprobs when parsing already changed the emitted text.
`logprobs` come from the raw proto output, but `processed_text` may already have been rewritten by reasoning/tool parsing before this cleanup block runs. Line 203 only catches cleanup-length changes, so tool-call or `separate_reasoning` responses can still return logprobs for text that never appears in `message.content`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 182 - 205,
Record the processed_text length before any tool/reasoning parsing runs (e.g.
introduce pre_parsing_len = processed_text.len() before the code paths that run
configured_tool_parser or configured_reasoning_parser), then after parsing and
the existing cleanup steps check against that saved length and set logprobs =
None if processed_text.len() != pre_parsing_len; keep the existing post-cleanup
length check as well so any cleanup edits still invalidate logprobs, and
reference the existing variables logprobs, processed_text,
configured_tool_parser, and configured_reasoning_parser when making the change.
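The invariant the prompt describes, that any mutation between computing logprobs and emitting text must invalidate them, reduces to a length comparison against the pre-parsing snapshot. A simplified standalone sketch (names are illustrative, not the PR's code):

```rust
/// Model of the suggested ordering: logprobs are valid only if the emitted
/// text is byte-for-byte the text they were computed over.
fn reconcile_logprobs(
    original: &str,          // text at the time logprobs were computed
    emitted: &str,           // text after parsing + cleanup
    logprobs: Option<Vec<f32>>,
) -> Option<Vec<f32>> {
    if emitted != original {
        None // parsing or cleanup rewrote the text; logprobs are stale
    } else {
        logprobs
    }
}

fn main() {
    let lp = Some(vec![-0.1, -0.2]);
    assert_eq!(reconcile_logprobs("abc", "abc", lp.clone()), lp);
    assert_eq!(reconcile_logprobs("abc<|im_end|>", "abc", Some(vec![-0.1])), None);
    println!("ok");
}
```

Comparing full text (rather than only length) also catches same-length rewrites, at the cost of keeping a clone of the pre-parsing string.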
```rust
// Step 5: Post-processing cleanup.
// Track whether text is mutated so we can invalidate logprobs.
let pre_cleanup_len = processed_text.len();

if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
}

let is_json_response = matches!(
    &original_request.response_format,
    Some(openai_protocol::common::ResponseFormat::JsonObject)
        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
);
if is_json_response {
    utils::clean_json_response(&mut processed_text);
}

if processed_text.len() != pre_cleanup_len {
    logprobs = None;
}
```
Sanitize parser-derived reasoning and tool outputs too.
This only scrubs `processed_text`. By the time this runs, `reasoning_text` and `tool_calls[*].function.arguments` are already detached, so non-streaming `reasoning_content` / Messages Thinking can still leak markers, and leaked argument text can trip the Messages `serde_json::from_str(args)` fallback to `{}`.
Also applies to: 671-673
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 186 - 205,
The cleanup only operates on processed_text; extend the same sanitization to
reasoning_text and each tool_calls[*].function.arguments: call
utils::strip_leaked_special_tokens(&mut reasoning_text, tokenizer.as_ref()) and
for every tool_call do utils::strip_leaked_special_tokens(&mut
tool_call.function.arguments, tokenizer.as_ref()); also apply
utils::clean_json_response when is_json_response is true to both reasoning_text
and tool_call.function.arguments so leaked markers/JSON-breaking tokens are
removed, and if any of those lengths change invalidate logprobs (same check as
processed_text). Apply the same changes at the analogous location around the
code handling non-streaming reasoning (the other referenced block).
```rust
let original_len = delta.len();
let delta = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(delta, tokenizer.as_ref())
} else {
    delta
};
let choice_logprobs = if delta.len() < original_len {
    None
} else {
    choice_logprobs
};
```
Reasoning-split deltas still carry stale streaming logprobs.
This only invalidates logprobs when special-token stripping shortens `delta`. If `process_reasoning_stream()` already moved part of the chunk into a reasoning event, the assistant content delta no longer matches `chunk.output_logprobs()`, but `choice_logprobs` still survives.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 425 - 437,
The current logic only nulls choice_logprobs when special-token stripping
shortens delta, but if process_reasoning_stream() already split off part of the
chunk for a reasoning event the assistant delta no longer aligns with
chunk.output_logprobs() and we must also invalidate logprobs; after calling
process_reasoning_stream() (and after applying
utils::strip_leaked_special_tokens_from_delta), compare the length/structure of
the resulting delta against the original chunk logprobs (e.g.
chunk.output_logprobs() or its length) and if they don't match treat
choice_logprobs as None; update the block handling delta, choice_logprobs, and
any logic around configured_tool_parser/configured_reasoning_parser to set
choice_logprobs = None when a reasoning split was performed or delta no longer
corresponds to chunk.output_logprobs().
```rust
let normal_text = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(
        normal_text,
        tokenizer.as_ref(),
    )
} else {
    normal_text
};
```
Sanitize streamed tool arguments, not just `normal_text`.
These paths clean the surrounding text, but the parser-emitted `tool_call_item.parameters` fragments are still forwarded unchanged later in Chat tool-call deltas and Messages InputJsonDelta. Leaked `<|im_*|>` / `</think>` markers in arguments will still corrupt the incremental JSON clients consume.
Also applies to: 1763-1779
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1286 -
1295, The streaming path currently strips leaked special tokens only from
normal_text via utils::strip_leaked_special_tokens_from_delta when either
self.configured_tool_parser or self.configured_reasoning_parser is set, but it
leaves parser-emitted tool_call_item.parameters untouched; update the code to
call utils::strip_leaked_special_tokens_from_delta on tool_call_item.parameters
(and any other incremental JSON fragments used for Chat tool-call deltas /
Messages InputJsonDelta) under the same conditional so parameters are sanitized
too, and apply the same change in the other affected block around the 1763–1779
region; reference the symbols configured_tool_parser,
configured_reasoning_parser, utils::strip_leaked_special_tokens_from_delta,
normal_text, and tool_call_item.parameters when locating and changing the code.
Problem

Two response quality issues when serving models with chat-template special tokens:

- Special tokens (`<|im_end|>`, `<|im_start|>`, etc.) leak into API responses, appearing as garbage text to end users.
- With `json_object`/`json_schema` response formats, models sometimes wrap output in markdown fences or append trailing text after the JSON object, breaking downstream JSON parsing.

Solution

Shared `response_cleanup` module with unit tests.

Changes

- `model_gateway/src/routers/grpc/utils/response_cleanup.rs` (new): `strip_leaked_special_tokens`, `strip_leaked_special_tokens_from_delta`, `clean_json_response` + unit tests
- `model_gateway/src/routers/grpc/utils/mod.rs`: module re-export
- `model_gateway/src/routers/grpc/regular/processor.rs`: integrate cleanup in Chat Completions + Messages API paths
- `model_gateway/src/routers/grpc/regular/streaming.rs`: integrate streaming delta cleanup

Test Plan

- `<|im_end|>`/`<|im_start|>` tokens for Kimi K2 are stripped
- `json_object` responses with markdown fences are correctly unwrapped

Checklist:

- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Closes #1111