fix(gateway): strip leaked special tokens and clean up JSON response #1063
ConnorLi96 wants to merge 6 commits into main from
Conversation
📝 Walkthrough

Adds a response-cleanup module and integrates conditional post-processing into the gRPC regular and streaming processors to strip leaked special tokens and to normalize/truncate Markdown-wrapped JSON outputs for JSON response formats (applied after tool/reasoning parsing).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as ModelGateway::Processor
    participant Model
    participant Tokenizer
    participant Utils as ResponseCleanup

    Client->>Gateway: request (messages, response_format, parsers configured)
    Gateway->>Model: forward request / start stream
    Model-->>Gateway: stream/generate raw text deltas
    Gateway->>Tokenizer: provide delta/tokenizer for stripping (if parser configured)
    Tokenizer-->>Gateway: cleaned delta
    Gateway->>Utils: clean_json_response (if JSON format)
    Utils-->>Gateway: cleaned/trimmed text
    Gateway-->>Client: emit cleaned streaming chunks / final response
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Warning: review ran into problems: timed out fetching pipeline failures after 30000ms.
Code Review
This pull request introduces logic to strip leaked ChatML and reasoning tokens from model responses and adds a mechanism to clean and extract JSON objects from formatted responses. However, the review identifies several issues: the JSON cleaning logic is fragile with respect to whitespace and markdown formatting, and the token stripping in streaming mode is unreliable because tokens can be split across chunks. Additionally, there are inconsistencies and code duplication across different paths, such as the missing `</think>` tag in the non-streaming and incremental parsing logic, which should be addressed by unifying the implementation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b112baecf7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/routers/grpc/regular/streaming.rs (1)
423-445: ⚠️ Potential issue | 🟠 Major: Centralize streamed-text sanitization before emission.

The new cleanup only covers the main content delta and `parse_incremental()`'s `normal_text`. `process_specific_function_stream()` still forwards the raw `delta` into `function.arguments`, the `decoder.flush()` path below still emits raw held suffixes, and this second copy also omits `</think>`, so the same leaked markers can still reach clients through those branches. Line 321 also computes `choice_logprobs` before this rewrite, so sanitized chunks can carry token logprobs for text the client never sees.

Also applies to: 1257-1267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 423 - 445, Sanitize streamed text in one place before any emission paths: move the token-stripping logic (respecting configured_tool_parser and configured_reasoning_parser) to a centralized helper and call it on all outputs including the main delta, function.arguments in process_specific_function_stream, any decoder.flush() held suffixes, and parse_incremental()'s normal_text; ensure the strip set includes "</think>" and all chatml markers and compute choice_logprobs after sanitization so logprobs correspond to emitted text (update usages around configured_tool_parser, configured_reasoning_parser, process_specific_function_stream, decoder.flush(), parse_incremental(), and choice_logprobs).
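To illustrate the centralization the reviewer asks for, here is a minimal sketch; the helper name `sanitize_stream_text` and the marker list are assumptions for illustration, not code from this PR:

```rust
/// Hypothetical single choke point for streamed-text sanitization: every
/// emission path (content deltas, function arguments, flush suffixes,
/// parse_incremental normal_text) would call this instead of duplicating
/// the strip loop at each call site.
fn sanitize_stream_text(text: &str, markers: &[&str]) -> String {
    let mut out = text.to_string();
    for m in markers {
        if out.contains(m) {
            out = out.replace(m, "");
        }
    }
    out
}

fn main() {
    // Strip set includes "</think>", which the review notes one copy omitted.
    let markers = ["<|im_start|>", "<|im_end|>", "</think>"];
    assert_eq!(sanitize_stream_text("done</think><|im_end|>", &markers), "done");
    println!("ok");
}
```

Because all branches share one function, adding a marker (or fixing a bug) happens in exactly one place.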
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 201-214: The current unwrapping only matches processed_text that
starts exactly with "```json" or "```JSON"; update the logic in the
is_json_response branch that manipulates processed_text to first trim leading
whitespace/newlines (use processed_text.trim_start()), then detect a leading
triple-backtick fence in a case-insensitive and tolerant way (accept "```",
"```json", "```jsonld", etc., ignoring case or extra spaces) rather than exact
matches; when a fence is found, locate the first newline after the opening fence
and then rfind the closing "```" to extract the inner JSON content (assign back
to processed_text) so responses with leading newlines or plain fences are
correctly unwrapped.
- Around line 186-199: The ChatML cleanup currently removes leaked markers then
unconditionally calls processed_text = processed_text.trim().to_string(), which
strips legitimate leading/trailing whitespace and makes non-streaming behavior
diverge from streaming; remove that unconditional trim so the code only replaces
the marker tokens (leave leading/trailing whitespace intact) in the block that
checks self.configured_tool_parser / self.configured_reasoning_parser and do the
same change for the duplicate cleanup at the other location (lines around the
701-714 snippet) referencing processed_text and the token-replacement loop.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cba22969-fc49-46fb-a354-7133191e099e
📒 Files selected for processing (2)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c8c101354a
Actionable comments posted: 1
♻️ Duplicate comments (3)
model_gateway/src/routers/grpc/utils/response_cleanup.rs (2)
16-35: ⚠️ Potential issue | 🟠 Major: Don't trim user content after removing leaked tokens.

Once any special token is removed, the `trim()` block also deletes legitimate leading/trailing whitespace from otherwise valid responses. That changes user-visible content and makes non-streaming behavior diverge from `strip_leaked_special_tokens_from_delta()`, which only removes the leaked marker.

Suggested fix:

```diff
 pub(crate) fn strip_leaked_special_tokens(text: &mut String, tokenizer: &dyn Tokenizer) {
     let special = tokenizer.get_special_tokens();
     if special.additional_special_tokens.is_empty() {
         return;
     }
-    let mut changed = false;
     for token in &special.additional_special_tokens {
         if text.contains(token.as_str()) {
             *text = text.replace(token.as_str(), "");
-            changed = true;
         }
     }
-    if changed {
-        let trimmed = text.trim();
-        if trimmed.len() != text.len() {
-            *text = trimmed.to_string();
-        }
-    }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 16 - 35, The function strip_leaked_special_tokens currently trims leading/trailing whitespace after removing additional_special_tokens, which alters user-visible content; update strip_leaked_special_tokens (and keep behavior consistent with strip_leaked_special_tokens_from_delta) to only remove occurrences of tokens from text using tokenizer.get_special_tokens().additional_special_tokens and do not call trim() or change surrounding whitespace—i.e., eliminate the trimmed/changed block so the function only replaces token.as_str() with "" and leaves whitespace intact.
63-79: ⚠️ Potential issue | 🟠 Major: Trim leading whitespace and accept generic code fences before JSON cleanup.

`clean_json_response()` currently only unwraps text that starts exactly with `` ```json ``/`` ```JSON ``, and `truncate_to_json_boundary()` only runs when the first character is `{` or `[`. Responses like `"\n```json\n{...}\n```"`, plain triple-backtick fences, or `"\n{...} trailing"` skip both helpers and still break downstream JSON parsing.

Also applies to: 86-89
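A fence handler tolerant in the way this comment suggests could look roughly like the sketch below; this is illustrative code under the reviewer's assumptions, not the PR's actual implementation:

```rust
/// Sketch: unwrap a Markdown code fence around a JSON payload, tolerating
/// leading whitespace and any fence info string ("```", "```json", "```JSON").
fn strip_markdown_json_fence(text: &str) -> &str {
    let t = text.trim_start();
    if !t.starts_with("```") {
        return t;
    }
    // Skip the opening fence line, whatever its info string is.
    let body = match t.find('\n') {
        Some(i) => &t[i + 1..],
        None => return t,
    };
    // Drop the closing fence if one exists.
    match body.rfind("```") {
        Some(j) => body[..j].trim_end(),
        None => body.trim_end(),
    }
}

fn main() {
    assert_eq!(strip_markdown_json_fence("\n```json\n{\"a\":1}\n```"), "{\"a\":1}");
    assert_eq!(strip_markdown_json_fence("```\n[1,2]\n```"), "[1,2]");
    println!("ok");
}
```

Returning a `&str` slice avoids allocation; the caller can `to_string()` only when the text actually changed.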
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/utils/response_cleanup.rs` around lines 63 - 79, The cleanup helpers miss cases with leading whitespace and generic code fences; update clean_json_response to trim leading whitespace (e.g., text.trim_start()) before calling strip_markdown_json_fence and truncate_to_json_boundary, modify strip_markdown_json_fence to accept plain triple-backtick fences and to detect fences case-insensitively (treat "```", "```json", "```JSON", etc.), and ensure truncate_to_json_boundary is invoked on the trimmed/stripped text regardless of the original first character so JSON like "\n```json\n{...}\n```" or "\n{...} trailing" are handled correctly; reference functions: clean_json_response, strip_markdown_json_fence, truncate_to_json_boundary.

model_gateway/src/routers/grpc/regular/processor.rs (1)
181-196: ⚠️ Potential issue | 🟠 Major: Drop `logprobs` whenever this cleanup rewrites `processed_text`.

`logprobs` are derived from the raw proto output before the new special-token/JSON cleanup runs. If this block strips leaked markers, unwraps fences, or truncates trailing text, `message.content` no longer matches the returned token logprobs. If recomputation is not possible, `None` is the safer fallback whenever `processed_text` changes here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 181 - 196, The computed `logprobs` come from the raw proto before special-token/JSON cleanup, so if cleanup mutates `processed_text` we must drop `logprobs`; capture the original `processed_text` (e.g., let original_processed = processed_text.clone()) before calling utils::strip_leaked_special_tokens and utils::clean_json_response, run the cleanups (those calls in the block checking self.configured_tool_parser / is_json_response), and if processed_text != original_processed set logprobs = None (or replace the Option with None) so the returned token logprobs are not inconsistent with the modified message content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 424-430: process_messages_streaming_chunks() currently emits
parser fields `text` / `normal_text` without sanitizing leaked special tokens;
update it to call utils::strip_leaked_special_tokens_from_delta(...) (passing
the same `tokenizer.as_ref()`) wherever `text` or `normal_text` are produced —
both on the regular-text path and the incremental tool-parser path — using the
same condition used elsewhere (i.e., when `self.configured_tool_parser.is_some()
|| self.configured_reasoning_parser.is_some()`) so streamed Messages get the
same `<|im_*|>` / `</think>` stripping as non-streaming code; apply the same
change to the other occurrence of the same pattern (the second block analogous
to 1253-1262).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6f76d5d1-1de3-4a2c-9375-cd7943422d2a
📒 Files selected for processing (4)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/mod.rs
model_gateway/src/routers/grpc/utils/response_cleanup.rs
Force-pushed: c8c1013 to 6a6b248 (Compare)
…format

- Conditionally strip chatml tokens (`<|im_end|>`, etc.) only when a model-specific parser is configured
- Strip markdown JSON fences for `json_object`/`json_schema` response formats
- Truncate trailing content after first valid JSON object boundary

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…coded chatml tokens

Replace 4 duplicated hardcoded CHATML_LEAK_TOKENS lists with dynamic stripping via `tokenizer.get_special_tokens().additional_special_tokens`. This adapts automatically to any model family (ChatML, Llama, Phi, etc.) without maintaining a static token list.

Extract JSON response cleanup (markdown fence removal + brace-depth truncation) into a shared utility with unit tests. Also adds support for JSON array top-level values.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Signed-off-by: ConnorLi96 <connorli@together.ai>
Made-with: Cursor
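The brace-depth truncation this commit describes can be sketched as follows; this is illustrative only, and the real `truncate_to_json_boundary` in the PR may differ in detail:

```rust
/// Sketch: keep text up to the point where the first top-level JSON object
/// or array closes, dropping any trailing prose the model appended.
fn truncate_to_json_boundary(text: &str) -> &str {
    let bytes = text.as_bytes();
    let (open, close) = match bytes.first() {
        Some(b'{') => (b'{', b'}'),
        Some(b'[') => (b'[', b']'), // array top-level values, as the commit adds
        _ => return text,
    };
    let mut depth = 0usize;
    let mut in_str = false;
    let mut escaped = false;
    for (i, &b) in bytes.iter().enumerate() {
        if in_str {
            // Brackets inside JSON strings must not affect depth.
            if escaped { escaped = false; }
            else if b == b'\\' { escaped = true; }
            else if b == b'"' { in_str = false; }
        } else if b == b'"' {
            in_str = true;
        } else if b == open {
            depth += 1;
        } else if b == close {
            depth -= 1;
            if depth == 0 {
                return &text[..=i]; // first top-level value just closed
            }
        }
    }
    text // unbalanced: leave as-is rather than guess
}

fn main() {
    assert_eq!(truncate_to_json_boundary("{\"a\":1} trailing"), "{\"a\":1}");
    assert_eq!(truncate_to_json_boundary("[1, \"]\"] extra"), "[1, \"]\"]");
    println!("ok");
}
```

Tracking string state is what makes the depth counter safe: a `}` inside a string literal must not terminate the scan early.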
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a6b248915
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0555c02197
- Add CHATML_FALLBACK_TOKENS fallback for tiktoken tokenizers where additional_special_tokens is empty (fixes Kimi K2 regression)
- Extend token stripping to Messages API streaming (all 3 emission points + flush path); previously only Chat Completions was covered
- Strip leaked tokens from Chat streaming flush (decoder.flush) path
- Drop logprobs when text was mutated by stripping or JSON cleanup (non-streaming: compare pre/post length; streaming: compare delta len)
- Remove trim() after stripping to preserve legitimate whitespace
- Accept any triple-backtick fence (```, ```json, ```JSON, etc.) in JSON cleanup, and trim leading whitespace before fence detection

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
Force-pushed: 0555c02 to 21edef5 (Compare)
Reasoning text emitted via `process_reasoning_stream` (Chat) and `process_messages_reasoning` (Messages) was not sanitized. This caused ChatML tokens like `<|im_end|>` to leak into `reasoning_content` SSE deltas even though `content` was clean.

- Pass tokenizer into `process_reasoning_stream` and strip reasoning text via `strip_leaked_special_tokens_from_delta` before building the SSE chunk
- Strip `reasoning_chunk_text` in the Messages streaming path alongside `normal_text` in a single conditional block

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d968642b28
```rust
if special.additional_special_tokens.is_empty() {
    for token in CHATML_FALLBACK_TOKENS {
        if result.contains(token) {
            result = result.replace(token, "");
        }
```
Preserve token stripping state across streamed deltas
`strip_leaked_special_tokens_from_delta` only performs contains/replace on the current chunk and keeps no carry-over state, so leaked markers that arrive split across chunks are not removed. This is especially likely in the fallback path (`CHATML_FALLBACK_TOKENS`), where strings like `<|im_end|>` may be emitted over multiple deltas; e.g. `<|im_` then `end|>` passes through unchanged and is reassembled by clients as leaked special-token text. The streaming cleaner needs a per-stream tail buffer (or equivalent boundary-aware logic) to catch cross-chunk matches.
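One way to implement the tail buffer this comment calls for is sketched below; `DeltaStripper` is a hypothetical name and a simplified design, not code from this PR:

```rust
/// Boundary-aware streaming stripper: removes full special tokens from each
/// delta, and holds back any trailing suffix that could be the start of a
/// token split across chunks.
pub struct DeltaStripper {
    tokens: Vec<String>,
    max_len: usize,
    tail: String, // carry-over suffix from the previous delta
}

impl DeltaStripper {
    pub fn new(tokens: Vec<String>) -> Self {
        let max_len = tokens.iter().map(|t| t.len()).max().unwrap_or(0);
        Self { tokens, max_len, tail: String::new() }
    }

    /// Returns the sanitized text that is safe to emit for this delta.
    pub fn push(&mut self, delta: &str) -> String {
        let mut buf = std::mem::take(&mut self.tail);
        buf.push_str(delta);
        for t in &self.tokens {
            buf = buf.replace(t.as_str(), "");
        }
        // Hold back the longest suffix that is a prefix of any special token.
        let hold = (1..self.max_len.min(buf.len()) + 1)
            .rev()
            .find(|&n| {
                buf.is_char_boundary(buf.len() - n)
                    && self.tokens.iter().any(|t| t.starts_with(&buf[buf.len() - n..]))
            })
            .unwrap_or(0);
        self.tail = buf.split_off(buf.len() - hold);
        buf
    }

    /// At end of stream, emit any held suffix: it was not a special token.
    pub fn flush(&mut self) -> String {
        std::mem::take(&mut self.tail)
    }
}

fn main() {
    let mut s = DeltaStripper::new(vec!["<|im_end|>".to_string()]);
    let mut out = String::new();
    out.push_str(&s.push("hello <|im_")); // "<|im_" is held back
    out.push_str(&s.push("end|>"));       // token completes and is stripped
    out.push_str(&s.flush());
    assert_eq!(out, "hello ");
    println!("ok");
}
```

The trade-off is latency: up to `max_len - 1` bytes are withheld until the next delta (or `flush()`) disambiguates them, which is why streaming sanitizers need an explicit end-of-stream flush path.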
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 660709fd-7a95-45cf-8b46-f2b5c1288472
📒 Files selected for processing (3)
model_gateway/src/routers/grpc/regular/processor.rs
model_gateway/src/routers/grpc/regular/streaming.rs
model_gateway/src/routers/grpc/utils/response_cleanup.rs
```diff
 let mut logprobs = complete.output_logprobs().map(|ref proto_logprobs| {
     utils::convert_proto_to_openai_logprobs(proto_logprobs, tokenizer)
 });

-// Step 5: Build ChatCompletionMessage (proper response message type)
+// Step 5: Post-processing cleanup.
+// Track whether text is mutated so we can invalidate logprobs.
+let pre_cleanup_len = processed_text.len();
+
+if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
+    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
+}
+
+let is_json_response = matches!(
+    &original_request.response_format,
+    Some(openai_protocol::common::ResponseFormat::JsonObject)
+        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
+);
+if is_json_response {
+    utils::clean_json_response(&mut processed_text);
+}
+
+if processed_text.len() != pre_cleanup_len {
+    logprobs = None;
+}
```
Drop logprobs when parsing already changed the emitted text.
`logprobs` come from the raw proto output, but `processed_text` may already have been rewritten by reasoning/tool parsing before this cleanup block runs. Line 203 only catches cleanup-length changes, so tool-call or `separate_reasoning` responses can still return logprobs for text that never appears in `message.content`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 182 - 205,
Record the processed_text length before any tool/reasoning parsing runs (e.g.
introduce pre_parsing_len = processed_text.len() before the code paths that run
configured_tool_parser or configured_reasoning_parser), then after parsing and
the existing cleanup steps check against that saved length and set logprobs =
None if processed_text.len() != pre_parsing_len; keep the existing post-cleanup
length check as well so any cleanup edits still invalidate logprobs, and
reference the existing variables logprobs, processed_text,
configured_tool_parser, and configured_reasoning_parser when making the change.
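The invariant the prompt describes, that any mutation between computing logprobs and emitting text must invalidate them, reduces to a length comparison against the pre-parsing snapshot. A simplified standalone sketch (names are illustrative, not the PR's code):

```rust
/// Model of the suggested ordering: logprobs are valid only if the emitted
/// text is byte-for-byte the text they were computed over.
fn reconcile_logprobs(
    original: &str,          // text at the time logprobs were computed
    emitted: &str,           // text after parsing + cleanup
    logprobs: Option<Vec<f32>>,
) -> Option<Vec<f32>> {
    if emitted != original {
        None // parsing or cleanup rewrote the text; logprobs are stale
    } else {
        logprobs
    }
}

fn main() {
    let lp = Some(vec![-0.1, -0.2]);
    assert_eq!(reconcile_logprobs("abc", "abc", lp.clone()), lp);
    assert_eq!(reconcile_logprobs("abc<|im_end|>", "abc", Some(vec![-0.1])), None);
    println!("ok");
}
```

Comparing full text (rather than only length) also catches same-length rewrites, at the cost of keeping a clone of the pre-parsing string.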
```rust
// Step 5: Post-processing cleanup.
// Track whether text is mutated so we can invalidate logprobs.
let pre_cleanup_len = processed_text.len();

if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    utils::strip_leaked_special_tokens(&mut processed_text, tokenizer.as_ref());
}

let is_json_response = matches!(
    &original_request.response_format,
    Some(openai_protocol::common::ResponseFormat::JsonObject)
        | Some(openai_protocol::common::ResponseFormat::JsonSchema { .. })
);
if is_json_response {
    utils::clean_json_response(&mut processed_text);
}

if processed_text.len() != pre_cleanup_len {
    logprobs = None;
}
```
Sanitize parser-derived reasoning and tool outputs too.
This only scrubs `processed_text`. By the time this runs, `reasoning_text` and `tool_calls[*].function.arguments` are already detached, so non-streaming `reasoning_content` / Messages Thinking can still leak markers, and leaked argument text can trip the Messages `serde_json::from_str(args)` fallback to `{}`.
Also applies to: 671-673
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 186 - 205,
The cleanup only operates on processed_text; extend the same sanitization to
reasoning_text and each tool_calls[*].function.arguments: call
utils::strip_leaked_special_tokens(&mut reasoning_text, tokenizer.as_ref()) and
for every tool_call do utils::strip_leaked_special_tokens(&mut
tool_call.function.arguments, tokenizer.as_ref()); also apply
utils::clean_json_response when is_json_response is true to both reasoning_text
and tool_call.function.arguments so leaked markers/JSON-breaking tokens are
removed, and if any of those lengths change invalidate logprobs (same check as
processed_text). Apply the same changes at the analogous location around the
code handling non-streaming reasoning (the other referenced block).
```rust
let original_len = delta.len();
let delta = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(delta, tokenizer.as_ref())
} else {
    delta
};
let choice_logprobs = if delta.len() < original_len {
    None
} else {
    choice_logprobs
};
```
Reasoning-split deltas still carry stale streaming logprobs.
This only invalidates logprobs when special-token stripping shortens `delta`. If `process_reasoning_stream()` already moved part of the chunk into a reasoning event, the assistant content delta no longer matches `chunk.output_logprobs()`, but `choice_logprobs` still survives.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 425 - 437,
The current logic only nulls choice_logprobs when special-token stripping
shortens delta, but if process_reasoning_stream() already split off part of the
chunk for a reasoning event the assistant delta no longer aligns with
chunk.output_logprobs() and we must also invalidate logprobs; after calling
process_reasoning_stream() (and after applying
utils::strip_leaked_special_tokens_from_delta), compare the length/structure of
the resulting delta against the original chunk logprobs (e.g.
chunk.output_logprobs() or its length) and if they don't match treat
choice_logprobs as None; update the block handling delta, choice_logprobs, and
any logic around configured_tool_parser/configured_reasoning_parser to set
choice_logprobs = None when a reasoning split was performed or delta no longer
corresponds to chunk.output_logprobs().
```rust
let normal_text = if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    utils::strip_leaked_special_tokens_from_delta(
        normal_text,
        tokenizer.as_ref(),
    )
} else {
    normal_text
};
```
Sanitize streamed tool arguments, not just `normal_text`.
These paths clean the surrounding text, but the parser-emitted `tool_call_item.parameters` fragments are still forwarded unchanged later in Chat tool-call deltas and Messages InputJsonDelta. Leaked `<|im_*|>` / `</think>` markers in arguments will still corrupt the incremental JSON clients consume.
Also applies to: 1763-1779
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1286 -
1295, The streaming path currently strips leaked special tokens only from
normal_text via utils::strip_leaked_special_tokens_from_delta when either
self.configured_tool_parser or self.configured_reasoning_parser is set, but it
leaves parser-emitted tool_call_item.parameters untouched; update the code to
call utils::strip_leaked_special_tokens_from_delta on tool_call_item.parameters
(and any other incremental JSON fragments used for Chat tool-call deltas /
Messages InputJsonDelta) under the same conditional so parameters are sanitized
too, and apply the same change in the other affected block around the 1763–1779
region; reference the symbols configured_tool_parser,
configured_reasoning_parser, utils::strip_leaked_special_tokens_from_delta,
normal_text, and tool_call_item.parameters when locating and changing the code.
Problem

Two response quality issues when serving models with chat-template special tokens:

- Special tokens (`<|im_end|>`, `<|im_start|>`, etc.) leak into API responses, appearing as garbage text to end users.
- With `json_object`/`json_schema` response formats, models sometimes wrap output in markdown fences or append trailing text after the JSON object, breaking downstream JSON parsing.

Solution

Shared `response_cleanup` module with unit tests.

Changes

- `model_gateway/src/routers/grpc/utils/response_cleanup.rs` (new): `strip_leaked_special_tokens`, `strip_leaked_special_tokens_from_delta`, `clean_json_response` + unit tests
- `model_gateway/src/routers/grpc/utils/mod.rs`: module re-export
- `model_gateway/src/routers/grpc/regular/processor.rs`: integrate cleanup in Chat Completions + Messages API paths
- `model_gateway/src/routers/grpc/regular/streaming.rs`: integrate streaming delta cleanup

Test Plan

- `<|im_end|>`/`<|im_start|>` tokens for Kimi K2 are stripped
- `json_object` responses with markdown fences are correctly unwrapped

Checklist:

- `cargo +nightly fmt` passes
- `cargo clippy --all-targets --all-features -- -D warnings` passes

Closes #1111