[WIP] Connorli/fix func call for KimiK2.5 #1070

Open
ConnorLi96 wants to merge 7 commits into main from connorli/fix-func-call-all

Conversation

@ConnorLi96
Collaborator

@ConnorLi96 ConnorLi96 commented Apr 8, 2026

Description

Note: This PR is for record-keeping purposes. The changes will be systematically split into separate PRs for review and merging.

Problem

Function calling with Kimi-K2.5 (NVFP4) is severely broken on SMG — fc-dash scores 44/60 (SMG+SGLang) and 18/60 (SMG+TRT-LLM) vs 58/60 on direct SGLang. Multiple stages of the SMG inference pipeline have issues that compound into wrong or missing tool calls, empty structured outputs, and leaked native tokens.

Solution

Fix function calling across the full SMG inference pipeline, organized by stage:

1. Input Tokenization — Fix Rust tiktoken encoder to use encode_with_special_tokens for chat template output, ensuring special tokens like <|im_system|>, <|im_end|> are encoded as single token IDs instead of being split into character-level BPE tokens (tiktoken.rs, preparation.rs)
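
The core of this fix can be sketched as a prefix-matching pass that emits one ID per special token before falling back to ordinary BPE. Everything below is illustrative: the token IDs and the byte-level fallback are made up, and the real implementation is tiktoken's `encode_with_special_tokens`.

```rust
use std::collections::HashMap;

/// Encode `text`, emitting a single ID for any registered special token
/// instead of splitting it into per-character/byte tokens.
fn encode_with_specials(text: &str, specials: &HashMap<&str, u32>) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = text;
    'outer: while !rest.is_empty() {
        // If a special token starts here, emit its single ID.
        for (tok, id) in specials {
            if rest.starts_with(*tok) {
                out.push(*id);
                rest = &rest[tok.len()..];
                continue 'outer;
            }
        }
        // Fallback: pretend each byte is its own BPE token (stand-in for
        // the real byte-pair encoder).
        let ch_len = rest.chars().next().unwrap().len_utf8();
        for b in rest[..ch_len].bytes() {
            out.push(b as u32);
        }
        rest = &rest[ch_len..];
    }
    out
}

fn main() {
    let mut specials = HashMap::new();
    specials.insert("<|im_end|>", 163586u32);
    println!("{:?}", encode_with_specials("ok<|im_end|>", &specials));
}
```

Without the special-token pass, `<|im_end|>` would decay into ten byte-level tokens and the engine would never see the stop ID.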

2. Chat Template Rendering — Add name: Option<String> to ChatMessage::Tool so the Jinja2 template resolves message.get('name') correctly (e.g. tool role renders as function name instead of literal "tool"). Keep tool call arguments as strings instead of parsing to serde_json::Value to preserve original formatting like {"query": "weather"} vs {"query":"weather"} (chat.rs, chat_utils.rs)
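
The formatting-loss problem is easy to demonstrate: round-tripping argument text through a JSON value type (e.g. `serde_json::Value`) re-serializes it in compact form. The helper below is a hand-rolled stand-in for that round trip, stripping whitespace outside strings the way a compact serializer would.

```rust
/// Remove insignificant whitespace outside of strings — mimicking what a
/// parse-to-Value / serialize round trip does to the original argument text.
fn compact_json(s: &str) -> String {
    let mut out = String::new();
    let (mut in_string, mut escape) = (false, false);
    for ch in s.chars() {
        if in_string {
            out.push(ch);
            if escape {
                escape = false;
            } else if ch == '\\' {
                escape = true;
            } else if ch == '"' {
                in_string = false;
            }
        } else if ch == '"' {
            in_string = true;
            out.push(ch);
        } else if !ch.is_whitespace() {
            out.push(ch);
        }
    }
    out
}

fn main() {
    let original = r#"{"query": "weather"}"#;
    // Round-tripping normalizes the spacing; keeping the raw String does not.
    assert_ne!(compact_json(original), original);
    println!("kept as string: {original}");
}
```

Keeping the arguments as a `String` sidesteps this entirely — the bytes the model produced are the bytes the client receives.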

3. Reasoning Parser — Skip reasoning parsing when output is constrained (specific tool_choice or response_format: json_object/json_schema), preventing the parser from swallowing constrained JSON output as reasoning content. Add <|tool_calls_section_begin|> detection as a split point when </think> is absent (processor.rs, streaming.rs, base.rs)
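
The gating condition described above can be sketched as a single predicate. The type names are illustrative, not the gateway's actual types, and including `Required` alongside a specific function choice is an assumption based on the review discussion.

```rust
#[allow(dead_code)]
enum ToolChoice { Auto, None, Required, Function(String) }
#[allow(dead_code)]
enum ResponseFormat { Text, JsonObject, JsonSchema }

/// True when the engine is forced to emit JSON, so the reasoning parser
/// must not run and swallow the constrained output as reasoning content.
fn output_is_constrained(tool_choice: &ToolChoice, format: &ResponseFormat) -> bool {
    let constrained_tools =
        matches!(tool_choice, ToolChoice::Function(_) | ToolChoice::Required);
    let constrained_format =
        matches!(format, ResponseFormat::JsonObject | ResponseFormat::JsonSchema);
    constrained_tools || constrained_format
}

fn main() {
    let c = output_is_constrained(
        &ToolChoice::Function("get_weather".into()),
        &ResponseFormat::Text,
    );
    println!("constrained: {c}");
}
```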

4. Tool Call Parser — Support hyphens in KimiK2 tool call IDs (e.g. functions.execute-tool:0) and alternative delimiters <|func_start|>/<|func_end|>. Add fallback from configured parser to JSON schema parser for tool_choice=specific_function when native tokens are absent (kimik2.rs, processor.rs)
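
A minimal sketch of the hyphen-tolerant ID parsing, assuming the `functions.<name>:<index>` shape stated above; the real parser lives in kimik2.rs and this helper name is hypothetical.

```rust
/// Parse a KimiK2-style tool call ID like "functions.execute-tool:0" into
/// (function_name, call_index), allowing hyphens in the function name.
fn parse_tool_call_id(id: &str) -> Option<(&str, u32)> {
    let rest = id.strip_prefix("functions.")?;
    let (name, idx) = rest.rsplit_once(':')?;
    if name.is_empty()
        || !name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_' || c == '.')
    {
        return None;
    }
    Some((name, idx.parse().ok()?))
}

fn main() {
    println!("{:?}", parse_tool_call_id("functions.execute-tool:0"));
}
```

A character class that omits `-` would reject `execute-tool` outright, which is exactly the failure mode this fix targets.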

5. Stop Token / EOS Handling — Load merged EOS token IDs from both config.json ([EOS]=163585) and generation_config.json (<|im_end|>=163586), matching SGLang's _get_hf_eos_token_id() merge logic. Expose eos_token_ids() on the Tokenizer trait and pass them as stop_token_ids in TRT-LLM gRPC requests. This fixes TRT-LLM generating past turn boundaries (18/60 → 55/60) (tiktoken.rs, traits.rs, trtllm_service.rs)

6. Output Cleanup — Strip leaked chatml tokens and [EOS]/[BOS] text from both streaming and non-streaming output paths, conditioned on parser configuration (processor.rs, streaming.rs, chat_utils.rs)

Other (separate concern): Graceful metrics server bind failure — log error instead of panicking when port is in use (metrics_server.rs). This should be a separate PR.

Summary by CodeRabbit

  • New Features

    • Enhanced EOS token handling to improve generation stopping across backends
    • Support for additional tool-call delimiters and richer tool-message metadata
  • Bug Fixes

    • Metrics server no longer crashes on bind failures
    • Reasoning parser no longer interferes with constrained JSON outputs
    • Tokenization now recognizes special tokens correctly
  • Improvements

    • Better gRPC metrics collection and safer metrics aggregation
    • Cleaner JSON and token cleanup in streaming responses and parsing

scottjlee and others added 6 commits April 3, 2026 22:17
Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>
…ol-call tokens

- Prioritize explicitly configured tool parser over JSON schema parsing
- Support alternative delimiters (<|func_start|>/<|func_end|>) in KimiK2 parser
- Prevent reasoning parser from consuming tool call markers
- Strip leaked chatml tokens only when a parser is explicitly configured
- Truncate trailing content after first valid JSON object for JSON response formats

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
- Fix tiktoken encoder to use encode_with_special_tokens for chat template output
- Prioritize configured tool parser over JSON schema with fallthrough
- Skip reasoning parsing when output is constrained (specific tool_choice / JSON response_format)
- Support alternative delimiters and hyphens in KimiK2 tool call parser
- Prevent reasoning parser from consuming tool call markers
- Keep tool call arguments as strings to preserve formatting
- Generate TypeScript-style tool declarations for Kimi K2.5
- Strip leaked chatml tokens conditionally and clean up JSON response format
- Add name field to Tool ChatMessage variant
- Graceful metrics server bind failure

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
TRT-LLM's gRPC path does not reliably apply generation_config.eos_token_id
via SamplingParams._setup(), causing chatml models (e.g. Kimi-K2.5) to
generate past turn boundaries (18/60 -> 52/60 on fc-dash).

This is a workaround: the proper fix is for TRT-LLM to honor
generation_config.eos_token_id in its gRPC executor path.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…on_config.json

Models like Kimi-K2.5 define different EOS tokens in config.json ([EOS]=163585)
and generation_config.json (<|im_end|>=163586). The engine must stop at both.

SGLang handles this internally by merging both sources, but TRT-LLM's gRPC path
does not. This change:

- Adds load_eos_token_ids() to read and merge EOS IDs from both config files
- Exposes eos_token_ids() on the Tokenizer trait
- Passes merged IDs as stop_token_ids in TRT-LLM gRPC requests
- Removes the hardcoded <|im_end|> stop string workaround (bfab5ad)
- Strips [EOS]/[BOS] text from constrained decoding output

Result: SMG+TRT-LLM fc-dash 55/60 (was 52/60 with hack, 18/60 without)

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
@github-actions bot added labels: tokenizer (Tokenizer related changes), python-bindings (Python bindings changes), grpc (gRPC client and router changes), tests (Test changes), tool-parser (Tool/function call parser changes), reasoning-parser (Reasoning parser changes), protocols (Protocols crate changes), model-gateway (Model gateway crate changes) — Apr 8, 2026
@coderabbitai

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

Adds merged EOS token ID support across tokenizers and threads EOS IDs through gRPC request construction for TRT-LLM; enables special-token-aware tokenization; extends tool-message schema with an optional name; tightens reasoning/tool parsing around constrained JSON outputs; adds Prometheus multiprocess metrics collection and safer metrics server startup.

Changes

Cohort / File(s) Summary
Tokenizer trait & implementations
crates/tokenizer/src/traits.rs, crates/tokenizer/src/tiktoken.rs, crates/tokenizer/src/huggingface.rs, crates/tokenizer/src/mock.rs
Added eos_token_ids field/method to SpecialTokens/Tokenizer trait; implemented EOS loading/merge from config.json + generation_config.json; tokenizers expose merged EOS IDs.
gRPC request wiring (EOS)
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs, model_gateway/src/routers/grpc/client.rs, crates/grpc_client/src/trtllm_service.rs
Tokenizers' eos_token_ids passed into chat request builders; TrtllmServiceClient::build_generate_request_from_chat now accepts eos_token_ids and populates proto stop_token_ids.
Tokenization special-token handling
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs, model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
Switched tokenizer calls to encode(..., true) to enable special-token recognition so chat-template markers and <think> are treated as single tokens.
Reasoning & tool parsing behavior
crates/reasoning_parser/src/parsers/base.rs, model_gateway/src/routers/grpc/regular/processor.rs, model_gateway/src/routers/grpc/regular/streaming.rs
Detects output constraints (tool_choice + JSON formats) to skip reasoning/tool parsing when constrained; splits reasoning at tool-call section marker; adds streaming sanitization to avoid consuming constrained JSON.
Tool parser improvements
crates/tool_parser/src/parsers/kimik2.rs
Added alternate delimiters `<|func_start|>`/`<|func_end|>` and hyphen support in KimiK2 tool call IDs.
ChatML/token scrubbing & utils
model_gateway/src/routers/grpc/utils/chat_utils.rs, model_gateway/src/routers/grpc/regular/responses/conversions.rs
Added ChatML/special-token stripping for parsing; added tools_to_typescript generator; ensure ChatMessage::Tool conversions include name: None.
ChatMessage schema & tests
crates/protocols/src/chat.rs, model_gateway/src/routers/grpc/harmony/builder.rs, model_gateway/tests/spec/chat_message.rs
ChatMessage::Tool gains name: Option<String>; pattern matches updated to ignore new field; tests updated accordingly.
Prometheus multiprocess metrics & metrics server
bindings/python/src/smg/serve.py, model_gateway/src/core/worker_manager.rs, model_gateway/src/observability/metrics_server.rs
For gRPC workers create/manage a temp PROMETHEUS_MULTIPROC_DIR and clean it on shutdown; added collect_prometheus_multiproc_metrics() calling a Python helper; metrics server startup no longer panics on bind failure.
Streaming/message preparation adjustments
model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs, model_gateway/src/routers/grpc/regular/streaming.rs
Tokenizer special-token handling enabled for Messages; streaming deltas sanitized of ChatML/think markers when configured tool/reasoning parsers are used.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Tokenizer
    participant RequestBuilder
    participant GrpcClient
    participant TRT_LLM

    Client->>Tokenizer: load tokenizer (config + generation_config)
    Tokenizer->>Tokenizer: merge eos_token_ids
    Tokenizer-->>RequestBuilder: eos_token_ids &[u32]

    Client->>RequestBuilder: prepare chat request
    RequestBuilder->>GrpcClient: build_chat_request(..., eos_token_ids)
    GrpcClient->>GrpcClient: build_generate_request_from_chat(eos_token_ids)
    GrpcClient->>TRT_LLM: GenerateRequest { stop_token_ids: eos_token_ids }
    TRT_LLM-->>GrpcClient: Response
sequenceDiagram
    actor Client
    participant Processor
    participant ConstraintDetector
    participant ReasoningParser
    participant ToolParser
    participant Formatter

    Client->>Processor: chat response (with response_format/tool_choice)
    Processor->>ConstraintDetector: detect output_is_constrained
    ConstraintDetector-->>Processor: true/false

    alt constrained
        Processor->>Formatter: skip ReasoningParser & ToolParser
        Processor->>Formatter: extract/truncate JSON from fenced blocks
    else unconstrained
        Processor->>ReasoningParser: parse reasoning blocks
        ReasoningParser-->>Processor: reasoning_text
        Processor->>ToolParser: parse tool calls
        ToolParser-->>Processor: tool_calls
    end

    Processor->>Formatter: final formatting
    Formatter-->>Client: response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is vague and does not clearly describe the main changes; '[WIP]' indicates work-in-progress status, and 'fix func call for KimiK2.5' is overly broad given the extensive multi-stage pipeline modifications. Revise to a more descriptive title that captures the primary change(s), such as 'Fix function calling for KimiK2.5 via tokenizer and tool-call parser improvements' or split into separate, more specific PR titles as mentioned in the PR description.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@ConnorLi96 ConnorLi96 changed the title [WIP] Connorli/fix func call all [WIP] Connorli/fix func call for KimiK2.5 Apr 8, 2026
Comment on lines +222 to +234
if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
    ] {
        processed_text = processed_text.replace(token, "");
    }
    processed_text = processed_text.trim().to_string();
}

🟡 Nit: The chatml token stripping list here (non-streaming chat path) is missing </think>, [EOS], and [BOS] that the streaming path at streaming.rs:443-446 does strip. Same discrepancy exists in the Messages non-streaming path at line 747-758 and in process_tool_calls_stream at streaming.rs:1288-1291.

This means non-streaming responses could leak [EOS]/[BOS] tokens that streaming responses correctly filter out. Consider extracting a shared constant (e.g. const CHATML_LEAK_TOKENS: &[&str] = &[...]) to keep all 6 stripping sites consistent.
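
The extraction the comment asks for is straightforward; a sketch with the union token set from the streaming path (the constant and helper names are the reviewer's suggestion, not existing code):

```rust
/// One shared list for all six stripping sites. The set is the union used
/// by the streaming path (chatml markers plus </think>, [EOS], [BOS]).
const CHATML_LEAK_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn strip_leaked_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for tok in CHATML_LEAK_TOKENS {
        // Whole-string replacement: only exact marker matches are removed.
        out = out.replace(tok, "");
    }
    out.trim().to_string()
}

fn main() {
    println!("{}", strip_leaked_tokens("<|im_end|>hello [EOS]"));
}
```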

Comment on lines +710 to +719
if self.configured_tool_parser.is_some() && tool_parser_available {
(tool_calls, processed_text) = self
.parse_tool_calls(
&processed_text,
&messages_request.model,
utils::message_utils::get_history_tool_calls_count_messages(
&messages_request,
),
)
.await;

🟡 Nit: The chat non-streaming path (line ~167) gates the configured tool parser with && !output_is_constrained, but this Messages non-streaming path does not. If a Messages API request uses constrained decoding (e.g. specific tool choice) with a configured tool parser, the parser will run on raw JSON and may misparse it. Consider adding the same output_is_constrained guard here for consistency, or document why the Messages path intentionally differs.

out
}

fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {

🟡 Nit: The required parameter is unused — object_to_typescript reads the required list from the schema value directly. This is dead code; consider removing the parameter.

Suggested change
fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {
fn schema_to_typescript(schema: &Value, indent: &str) -> String {

Comment on lines +251 to +269
if processed_text.starts_with('{') {
    let mut depth = 0i32;
    let mut in_string = false;
    let mut escape = false;
    let mut json_end = None;
    for (i, ch) in processed_text.char_indices() {
        if escape { escape = false; continue; }
        if ch == '\\' && in_string { escape = true; continue; }
        if ch == '"' { in_string = !in_string; continue; }
        if in_string { continue; }
        if ch == '{' { depth += 1; } else if ch == '}' {
            depth -= 1;
            if depth == 0 { json_end = Some(i + 1); break; }
        }
    }
    if let Some(end) = json_end {
        processed_text = processed_text[..end].to_string();
    }
}

🟡 Nit: This hand-rolled JSON brace-matching truncator only handles {…} top-level objects. If response_format is json_schema and the schema's root type is an array ([…]), the JSON response will start with [ and this code won't trim trailing tokens from it. Consider also handling the array case, or using serde_json::from_str to find the valid JSON prefix (attempt parse, on failure progressively trim trailing characters).
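
A std-only sketch of the suggested extension: track both `{}` and `[]` depth so array-rooted responses are trimmed too. The helper name is hypothetical; the hunk under review modifies `processed_text` in place instead.

```rust
/// Byte length of the first complete JSON object or array at the start of
/// `text`, or None if there is no complete value. Depth counting is shared
/// between braces and brackets, so `[...]` roots are handled as well.
fn first_json_value_len(text: &str) -> Option<usize> {
    let first = text.chars().next()?;
    if first != '{' && first != '[' {
        return None;
    }
    let (mut depth, mut in_string, mut escape) = (0i32, false, false);
    for (i, ch) in text.char_indices() {
        if escape { escape = false; continue; }
        if in_string {
            match ch {
                '\\' => escape = true,
                '"' => in_string = false,
                _ => {}
            }
            continue;
        }
        match ch {
            '"' => in_string = true,
            '{' | '[' => depth += 1,
            '}' | ']' => {
                depth -= 1;
                if depth == 0 { return Some(i + 1); }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    println!("{:?}", first_json_value_len("[1, 2]<|im_end|>"));
}
```

This is still a sketch, not a validator — mismatched closers like `{...]` pass through — but it trims trailing leakage for both root types.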

Comment on lines +87 to +115

/// Collect gRPC worker metrics by aggregating `PROMETHEUS_MULTIPROC_DIR` via a python3 subprocess.
async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
"PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
})?;

let output = tokio::process::Command::new("python3")
.args([
"-c",
"import sys\n\
from prometheus_client import CollectorRegistry, generate_latest\n\
from prometheus_client.multiprocess import MultiProcessCollector\n\
registry = CollectorRegistry()\n\
MultiProcessCollector(registry)\n\
sys.stdout.buffer.write(generate_latest(registry))\n",
])
.env("PROMETHEUS_MULTIPROC_DIR", &dir)
.output()
.await
.map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
return Err(format!("python3 prometheus collector failed: {stderr}"));
}

String::from_utf8(output.stdout)
.map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))

🟡 Nit: collect_prometheus_multiproc_metrics spawns a new python3 process on every /metrics scrape. If Prometheus scrapes at 15s intervals, that's ~5760 Python startups/day. This works but could add latency to the metrics endpoint under load. Consider documenting this trade-off or, as a follow-up, caching the Python process / using a longer-lived collector.


@claude claude bot left a comment


Comprehensive set of fixes for func call handling, EOS token merging, chatml token stripping, and gRPC metrics. The changes are well-structured and the logic is sound.

Summary: 0 🔴 Important · 5 🟡 Nit · 0 🟣 Pre-existing

Nits flagged:

  1. Inconsistent chatml token strip lists — 6 stripping sites use different token sets; non-streaming paths miss </think>, [EOS], [BOS]. Extract a shared constant.
  2. Messages non-streaming path missing output_is_constrained guard — Chat path gates configured tool parser on constrained output, Messages path doesn't.
  3. Unused required parameter in schema_to_typescript — dead code.
  4. JSON truncator only handles {…} — Array-rooted JSON schemas ([…]) won't be trimmed.
  5. Python subprocess per metrics scrape — works but spawns a process on every /metrics call; consider documenting the trade-off.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46b284fe0e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +710 to +720
if self.configured_tool_parser.is_some() && tool_parser_available {
    (tool_calls, processed_text) = self
        .parse_tool_calls(
            &processed_text,
            &messages_request.model,
            utils::message_utils::get_history_tool_calls_count_messages(
                &messages_request,
            ),
        )
        .await;
} else if used_json_schema {

P1 Badge Restore JSON-schema fallback after configured tool parsing

When messages_request.tool_choice is Tool/Any (JSON-schema constrained output) and a configured tool parser is present, this branch always runs parse_tool_calls first and skips the JSON-schema parser due to the else if chain. In constrained mode the model often returns plain JSON without native tool-call markers, so parse_tool_calls returns no calls and the response loses tool invocations instead of building ToolUse blocks. This is a regression from the previous logic that prioritized parse_json_schema_response for constrained tool choices.
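
The ordering the reviewer asks for can be sketched as "native parser first, JSON-schema fallback when constrained and nothing was found". The signatures are illustrative — the real code threads tool-call structs, not strings.

```rust
/// Run the configured native parser; if it yields no calls under a
/// constrained tool choice, fall back to the JSON-schema parser, since
/// constrained decoding often emits bare JSON without native markers.
fn parse_with_fallback(
    text: &str,
    native: impl Fn(&str) -> Vec<String>,
    json_schema: impl Fn(&str) -> Vec<String>,
    constrained: bool,
) -> Vec<String> {
    let calls = native(text);
    if calls.is_empty() && constrained {
        return json_schema(text);
    }
    calls
}

fn main() {
    let native = |_: &str| Vec::<String>::new();
    let schema = |_: &str| vec!["get_weather".to_string()];
    println!("{:?}", parse_with_fallback("{\"city\":\"SF\"}", native, schema, true));
}
```

Unlike a plain `else if` chain, this keeps the configured parser's priority while restoring the constrained-mode fallback.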


// Use add_special_tokens=true so the tokenizer recognizes special tokens
// (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
// output as single token IDs rather than splitting them into characters.
let encoding = match tokenizer.encode(&processed_messages.text, true) {

P1 Badge Avoid enabling generic special-token injection for chat prompts

This now tokenizes already-rendered chat-template text with add_special_tokens=true. For HuggingFaceTokenizer, encode(..., true) is forwarded directly to the HF tokenizer post-processor, which can prepend/append BOS/EOS on top of template-provided markers. That changes prompt token IDs and can break stop boundaries or generation behavior for non-tiktoken models. The special-token recognition fix is tiktoken-specific, but this call affects all tokenizer backends (and is duplicated in the Messages preparation path).



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves support for Kimi K2.5 models by merging EOS token IDs from configuration files, implementing TypeScript-style tool declarations, and refining reasoning and tool call parsers. It also adds gRPC worker metrics collection and logic to strip leaked ChatML tokens. Review feedback identifies opportunities to reduce code duplication in token parsing and stripping, and suggests using more robust JSON deserialization to improve performance and reliability.

Comment on lines +51 to +97
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    // config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    // generation_config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    ids.into_iter().collect()
}

medium

The logic for parsing eos_token_id from config.json and generation_config.json is duplicated. To improve maintainability and reduce code duplication, extract this logic into a helper closure. Additionally, refactor the match statement to return the differing values and perform the insertion logic once to avoid redundancy.

pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    let mut extract_eos_ids = |content: &str| {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(content) {
            let extracted: Vec<u64> = match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => n.as_u64().into_iter().collect(),
                Some(serde_json::Value::Array(arr)) => arr.iter().filter_map(|v| v.as_u64()).collect(),
                _ => vec![],
            };
            for id in extracted {
                ids.insert(id as TokenIdType);
            }
        }
    };

    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        extract_eos_ids(&content);
    }

    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        extract_eos_ids(&content);
    }

    ids.into_iter().collect()
}
References
  1. Extract duplicated logic into a shared helper function to improve maintainability and reduce redundancy.
  2. Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.

Comment on lines +223 to +230
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
] {

medium

This list of tokens to strip is duplicated in other parts of the codebase. To avoid duplication and improve maintainability, consider defining this array as a constant at the module level. Furthermore, when stripping these markers from data that might contain structured formats like JSON, ensure you use whole-string matching instead of character-set matching to prevent data corruption.

References
  1. Extract duplicated logic into a shared helper function or constant to improve maintainability.
  2. When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.

Comment on lines +251 to +269
if processed_text.starts_with('{') {
    let mut depth = 0i32;
    let mut in_string = false;
    let mut escape = false;
    let mut json_end = None;
    for (i, ch) in processed_text.char_indices() {
        if escape { escape = false; continue; }
        if ch == '\\' && in_string { escape = true; continue; }
        if ch == '"' { in_string = !in_string; continue; }
        if in_string { continue; }
        if ch == '{' { depth += 1; } else if ch == '}' {
            depth -= 1;
            if depth == 0 { json_end = Some(i + 1); break; }
        }
    }
    if let Some(end) = json_end {
        processed_text = processed_text[..end].to_string();
    }
}

medium

This manual JSON parsing logic to find the end of the first object can be brittle. For a more robust solution, consider using serde_json::Deserializer to stream-parse the first valid JSON object. Additionally, avoid reallocating memory with .to_string() within processing loops; use .truncate() to modify the buffer in-place.

            if processed_text.starts_with('{') {
                let mut stream = serde_json::Deserializer::from_str(&processed_text).into_iter::<serde_json::Value>();
                if stream.next().is_some() {
                    let end = stream.byte_offset();
                    processed_text.truncate(end);
                }
            }
References
  1. When processing streaming data, avoid inefficient string manipulations like buffer = buffer[..].to_string() as it reallocates memory; use performant methods like .drain() or .truncate() instead.

Comment on lines +443 to +449
for token in [
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
] {
    delta = delta.replace(token, "");
}

medium

This list of tokens to strip is duplicated across multiple stream processing functions. To improve maintainability and consistency, define this array as a constant at the module level. Ensure that any stripping logic uses whole-string matching rather than character-set matching to avoid corrupting JSON data.

References
  1. Extract duplicated logic into a shared helper function or constant to improve maintainability.
  2. When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
bindings/python/src/smg/serve.py (1)

632-671: ⚠️ Potential issue | 🟡 Minor

Always run Prometheus cleanup, even when no worker was appended.

_launch_workers() creates self._prometheus_dir before the first self.workers.append(...), but Line 634 returns early when self.workers is empty. If the first launch fails, the temp dir is leaked and PROMETHEUS_MULTIPROC_DIR stays stale in-process.

Suggested fix
     def _cleanup_workers(self) -> None:
         """SIGTERM all worker process groups, wait, then SIGKILL stragglers."""
-        if not self.workers:
-            return
-
-        # Send SIGTERM to each process group
-        for proc, port in self.workers:
-            try:
-                os.killpg(proc.pid, signal.SIGTERM)
-            except (ProcessLookupError, OSError):
-                pass
-
-        # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit
-        deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT
-        for proc, port in self.workers:
-            remaining = max(0, deadline - time.monotonic())
-            try:
-                proc.wait(timeout=remaining)
-            except subprocess.TimeoutExpired:
-                try:
-                    os.killpg(proc.pid, signal.SIGKILL)
-                except (ProcessLookupError, OSError):
-                    pass
-
-        self._cleanup_prometheus_dir()
+        try:
+            if self.workers:
+                # Send SIGTERM to each process group
+                for proc, port in self.workers:
+                    try:
+                        os.killpg(proc.pid, signal.SIGTERM)
+                    except (ProcessLookupError, OSError):
+                        pass
+
+                # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit
+                deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT
+                for proc, port in self.workers:
+                    remaining = max(0, deadline - time.monotonic())
+                    try:
+                        proc.wait(timeout=remaining)
+                    except subprocess.TimeoutExpired:
+                        try:
+                            os.killpg(proc.pid, signal.SIGKILL)
+                        except (ProcessLookupError, OSError):
+                            pass
+        finally:
+            self._cleanup_prometheus_dir()
 
     def _cleanup_prometheus_dir(self) -> None:
         """Remove the temporary prometheus multiprocess directory and its .db files."""
         if self._prometheus_dir is None:
             return
         try:
             shutil.rmtree(self._prometheus_dir)
             logger.info("Cleaned up PROMETHEUS_MULTIPROC_DIR=%s", self._prometheus_dir)
         except OSError as e:
             logger.warning(
                 "Failed to clean up PROMETHEUS_MULTIPROC_DIR=%s: %s",
                 self._prometheus_dir,
                 e,
             )
+        os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None)
         self._prometheus_dir = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindings/python/src/smg/serve.py` around lines 632 - 671, The early return in
_cleanup_workers when self.workers is empty skips calling
_cleanup_prometheus_dir and can leak the prometheus temp dir; ensure prometheus
cleanup always runs by removing or moving the early return and invoking
self._cleanup_prometheus_dir() (e.g., call self._cleanup_prometheus_dir() before
returning or eliminate the return and only skip process-group shutdown when
self.workers is empty) so that _cleanup_prometheus_dir() runs even if no workers
were appended.
model_gateway/src/routers/grpc/harmony/builder.rs (1)

876-885: ⚠️ Potential issue | 🟠 Major

Prefer ChatMessage::Tool.name before falling back to tool_call_id.

This path still reconstructs the function name solely from the assistant tool-call map. If a request contains a tool message with name populated but no matching prior assistant tool call, Harmony will emit functions.{tool_call_id} instead of the actual tool name, which defeats the purpose of the new field.

Suggested fix
-                ChatMessage::Tool {
-                    content,
-                    tool_call_id,
-                    ..
-                } => {
+                ChatMessage::Tool {
+                    content,
+                    tool_call_id,
+                    name,
+                } => {
                     // Look up the function name from the tool_call_id
-                    let function_name = tool_call_map
-                        .get(tool_call_id)
-                        .cloned()
-                        .unwrap_or_else(|| tool_call_id.clone());
+                    let function_name = name
+                        .clone()
+                        .or_else(|| tool_call_map.get(tool_call_id).cloned())
+                        .unwrap_or_else(|| tool_call_id.clone());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/builder.rs` around lines 876 - 885,
Update the ChatMessage::Tool match arm to prefer the message's name field before
falling back to the assistant tool-call map or the tool_call_id: destructure the
tool message to include name (e.g. ChatMessage::Tool { content, tool_call_id,
name, .. }) and compute function_name by using name.clone() when Some, otherwise
lookup tool_call_map.get(tool_call_id).cloned() and finally fall back to
tool_call_id.clone().
model_gateway/src/routers/grpc/regular/responses/conversions.rs (1)

86-117: ⚠️ Potential issue | 🟠 Major

Don't drop tool names during Responses → Chat conversion.

Both branches hardcode name: None, so the new ChatMessage::Tool.name field never reaches chat templates on this path. FunctionToolCall already has the tool name, and FunctionCallOutput can recover it with a small call_id -> name map while iterating items. As written, converted histories still lose the tool identity this PR is trying to preserve.

Also applies to: 137-147
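The two-pass mapping described above can be sketched as follows. This is a minimal sketch with simplified stand-in types: Item and resolve_tool_names are illustrative names, and the real ResponseInputOutputItem variants carry more fields than shown here.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Responses item variants.
enum Item {
    FunctionToolCall { call_id: String, name: String },
    FunctionCallOutput { call_id: String, output: String },
}

// First pass records call_id -> name from tool calls; second pass resolves
// each FunctionCallOutput's tool name, falling back to None when unknown.
fn resolve_tool_names(items: &[Item]) -> Vec<(String, Option<String>)> {
    let mut names: HashMap<&str, &str> = HashMap::new();
    for item in items {
        if let Item::FunctionToolCall { call_id, name } = item {
            names.insert(call_id.as_str(), name.as_str());
        }
    }
    items
        .iter()
        .filter_map(|item| match item {
            Item::FunctionCallOutput { call_id, output } => Some((
                output.clone(),
                names.get(call_id.as_str()).map(|n| n.to_string()),
            )),
            _ => None,
        })
        .collect()
}

fn main() {
    let items = vec![
        Item::FunctionToolCall { call_id: "c1".into(), name: "get-weather".into() },
        Item::FunctionCallOutput { call_id: "c1".into(), output: "sunny".into() },
        Item::FunctionCallOutput { call_id: "c9".into(), output: "?".into() },
    ];
    let resolved = resolve_tool_names(&items);
    // Matched output carries the tool name; orphaned output falls back to None.
    assert_eq!(resolved[0], ("sunny".to_string(), Some("get-weather".to_string())));
    assert_eq!(resolved[1].1, None);
}
```

The same map can be built once while iterating items, so the FunctionToolCall branch keeps using its local name directly and only FunctionCallOutput needs the lookup.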

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs` around lines
86 - 117, The conversion currently drops tool names by setting
ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall
use the existing local variable name to populate ChatMessage::Tool.name (e.g.,
name: Some(name.clone())) when pushing the Tool message, and for the
FunctionCallOutput branch build a temporary call_id -> name map while iterating
ResponseInputOutputItem items (recording id -> name from FunctionToolCall
entries) so that when you encounter FunctionCallOutput you can look up the tool
name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to
None if not found); update both places where ChatMessage::Tool.name is currently
hardcoded to None.
model_gateway/src/routers/grpc/harmony/stages/request_building.rs (1)

177-207: ⚠️ Potential issue | 🟡 Minor

Forward model EOS token IDs to TRT-LLM in Harmony /responses path.

The chat branch passes &eos_ids to build_generate_request_from_chat(), where they populate stop_token_ids with merged EOS from model config. The responses branch calls build_generate_request_from_responses() without eos_ids, leaving stop_token_ids empty (line 451 in trtllm_service.rs).

The later Harmony injection at line 253 only adds assistant-action stop tokens (harmony_stops), not the model's EOS tokens. Since build_generate_request_from_responses() does not inject EOS internally, the /responses path will miss the merged EOS handling applied in chat.

Pass &eos_ids to build_generate_request_from_responses() or apply the same EOS-merge logic used in the chat builder (lines 291–294 in trtllm_service.rs) to the responses builder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs` around
lines 177 - 207, The /responses path is missing model EOS token IDs because the
code builds eos_ids earlier but only passes them into
trtllm_client.build_generate_request_from_chat; update the responses branch
inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into
trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge
logic used by build_generate_request_from_chat) so stop_token_ids are populated
the same way for both RequestType::Chat and RequestType::Responses; locate the
eos_ids variable and the trtllm_client.build_generate_request_from_responses
call and add the eos_ids argument (or the equivalent merge) to match
build_generate_request_from_chat.
model_gateway/src/routers/grpc/client.rs (1)

271-280: ⚠️ Potential issue | 🟠 Major

EOS stop IDs are only wired through the chat builder.

This new parameter fixes TensorRT-LLM chat requests, but the sibling Messages path still has no way to forward merged EOS IDs and ultimately keeps stop_token_ids empty in TrtllmServiceClient::build_generate_request_from_messages. Anthropic Messages requests will therefore keep the old over-generation behavior. Please thread the same slice through build_messages_request as well.

Also applies to: 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/client.rs` around lines 271 - 280, The EOS
stop IDs are only passed into build_chat_request, so the messages path still
omits merged EOS IDs; update TrtllmServiceClient::build_messages_request (and
any callers like build_generate_request_from_messages) to accept the
eos_token_ids slice parameter (same type as in build_chat_request) and thread it
through to where stop_token_ids are set so stop_token_ids is populated for
Messages requests just like for Chat requests; adjust function signatures
(build_messages_request, build_generate_request_from_messages) and internal
calls to pass the eos_token_ids slice through unchanged.
crates/grpc_client/src/trtllm_service.rs (1)

291-308: ⚠️ Potential issue | 🟠 Major

Merge request stop-token IDs instead of replacing them.

stop_token_ids now comes only from tokenizer EOS IDs, so explicit ChatCompletionRequest.stop_token_ids are still ignored on the TRT-LLM chat path. That means backend-side stopping remains incomplete and extra tokens can be generated until router trimming kicks in.

Proposed fix
-        let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();
+        let mut stop_token_ids = body.stop_token_ids.clone().unwrap_or_default();
+        stop_token_ids.extend_from_slice(eos_token_ids);
+        stop_token_ids.sort_unstable();
+        stop_token_ids.dedup();
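The proposed merge behaves as below. merge_stop_ids is a hypothetical free function standing in for the inline logic, with the request's stop IDs modeled as a plain Option<Vec<u32>>:

```rust
// Merge request-provided stop token IDs with tokenizer EOS IDs,
// sorted and deduplicated so neither source is dropped.
fn merge_stop_ids(request_ids: Option<Vec<u32>>, eos_ids: &[u32]) -> Vec<u32> {
    let mut stop_token_ids = request_ids.unwrap_or_default();
    stop_token_ids.extend_from_slice(eos_ids);
    stop_token_ids.sort_unstable();
    stop_token_ids.dedup();
    stop_token_ids
}

fn main() {
    // Request asks to stop at 42; Kimi-K2.5-style EOS IDs overlap with it.
    let merged = merge_stop_ids(Some(vec![42, 163586]), &[163585, 163586]);
    assert_eq!(merged, vec![42, 163585, 163586]);
    // No explicit request stops: only the tokenizer EOS IDs remain.
    assert_eq!(merge_stop_ids(None, &[7]), vec![7]);
}
```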
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/grpc_client/src/trtllm_service.rs` around lines 291 - 308, The code
currently sets stop_token_ids = eos_token_ids.to_vec() which ignores any
stop_token_ids passed in the ChatCompletionRequest for the TRT-LLM path; change
the construction of stop_token_ids (used in proto::GenerateRequest) to merge
tokenizer EOS IDs with the request's explicit stop token IDs (e.g.,
body.stop_token_ids or ChatCompletionRequest.stop_token_ids), convert/normalize
them to u32, remove duplicates (preserve a sensible order such as
request-provided then EOS or vice-versa), and use that merged Vec<u32> in the
grpc_request so backend-side stopping honors both sources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 45-97: load_eos_token_ids currently builds an empty set when
neither config.json nor generation_config.json declare numeric eos_token_id,
causing callers (e.g., assignments to special_tokens.eos_token_ids in load code
and in crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS
IDs resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).

In `@model_gateway/src/core/worker_manager.rs`:
- Around line 89-116: collect_prometheus_multiproc_metrics currently invokes a
Python subprocess that imports prometheus_client but that package isn't declared
in the smg/python bindings dependencies and the subprocess has no timeout; add
"prometheus_client" to the core Python deps in bindings/python/pyproject.toml
(so the ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.

In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 221-234: The non-streaming cleanup in processor.rs currently only
strips <|im_*|> tokens inside the block guarded by
self.configured_tool_parser/is_some but misses other leaked tokens (</think>,
[EOS], [BOS]) that the streaming path removes; create a single sanitizer
function (e.g., strip_leaked_tokens or sanitize_leaked_tokens) on the impl (or a
private helper) that accepts &str and returns a trimmed String, move the full
token list and replacement logic into it, and call this helper from the existing
processed_text cleanup block and from the Messages block referenced in the
review so both streaming and non-streaming paths reuse the same sanitizer (refer
to self.configured_tool_parser, self.configured_reasoning_parser,
processed_text, and the Messages handling code).
- Around line 710-720: The configured parser branch currently runs before the
constrained-output (used_json_schema) handling, causing configured parsers to
bypass the JSON-schema -> ToolUse fallback; update the logic in processor.rs
around configured_tool_parser and parse_tool_calls so it either (a) skip
invoking the configured parser when used_json_schema is true (matching
process_single_choice behavior), or (b) after parse_tool_calls returns no
tool_calls, check if used_json_schema is true and then run the JSON-schema
wrapper / create a ToolUse block from the arguments-only JSON; ensure you
reference and adjust the conditionals around self.configured_tool_parser,
tool_parser_available, parse_tool_calls, used_json_schema, and the code path
that constructs ToolUse so the constrained-output fallback always runs when
appropriate.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 130-134: The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 438-450: The scrubber that removes ChatML/think/EOS tokens from a
chunked delta is only applied inside the per-chunk path; extract that logic into
a shared helper (e.g., fn scrub_delta(input: &str, configured_tool_parser:
Option<...>, configured_reasoning_parser: Option<...>) -> String) and call it
both where `delta` is currently sanitized and on the final
`stop_decoder.flush()` output before emitting the last SSE delta; reference the
existing symbols `configured_tool_parser`, `configured_reasoning_parser`,
`delta`, and `stop_decoder.flush()` so the new helper matches the current
conditional behavior and is reused in the flush path.
- Around line 1805-1807: The condition allows ToolChoice::Tool to fall into the
generic parser branch even when streaming_tool_parser (the JSON/constrained
parser) was selected, causing parameters with name: None to emit InputJsonDelta
before ContentBlockStart; change the conditional logic around ToolChoice::Tool
so it always follows the arguments-only branch (i.e., keep ToolChoice::Tool out
of the generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 547-562: The current loop in chat_utils.rs replaces control tokens
globally in processed_text (using clean.replace(...)), which can corrupt valid
JSON string fields; instead trim only leaked prefix/suffix markers from the
start and end of processed_text. Locate the block using the variables
processed_text and clean and replace the global replace() calls with logic that
repeatedly strip known tokens from the start (strip_prefix) and end
(strip_suffix) or use anchored regexes so only boundary occurrences like
prefix/suffix markers (e.g., "<|im_start|>", "<|im_end|>", "</think>", "[EOS]",
etc.) are removed; preserve any occurrences of those tokens inside the JSON
payload. Ensure the final clean is trimmed of whitespace as before.
- Around line 411-420: The generated TypeScript snippet inserted as
"tools_ts_str" via tools_to_typescript currently emits raw OpenAI tool names
(from request.tools) which may contain hyphens and produce invalid TypeScript
identifiers; update tools_to_typescript (and any generation logic that builds
tools_ts_str used when inserting into combined_template_kwargs) to either
generate quoted/string-literal type names (e.g., "'get-weather'" or
`"\"get-weather\""` in the TS output) or sanitize names into valid
BindingIdentifiers (e.g., replace/normalize hyphens to underscores) and
consistently use that sanitized form in all places it’s referenced; ensure the
change is applied across the related generator code paths (including the
tools_to_typescript function and where tools_ts_str is inserted) so templates
referencing tool names remain syntactically valid.

In `@model_gateway/tests/spec/chat_message.rs`:
- Around line 66-70: The test's pattern match on ChatMessage::Tool was loosened
but doesn't assert the new tool.name field; update the test in
model_gateway/tests/spec/chat_message.rs to deserialize a payload that includes
"name", assert that ChatMessage::Tool contains tool.name equal to the provided
value, and also serialize it back (round-trip) to ensure the wire contract;
locate the match arm for ChatMessage::Tool and add assertions against the
tool.name field and a serialization check to confirm round-trip fidelity.

---

Outside diff comments:
In `@bindings/python/src/smg/serve.py`:
- Around line 632-671: The early return in _cleanup_workers when self.workers is
empty skips calling _cleanup_prometheus_dir and can leak the prometheus temp
dir; ensure prometheus cleanup always runs by removing or moving the early
return and invoking self._cleanup_prometheus_dir() (e.g., call
self._cleanup_prometheus_dir() before returning or eliminate the return and only
skip process-group shutdown when self.workers is empty) so that
_cleanup_prometheus_dir() runs even if no workers were appended.

In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 291-308: The code currently sets stop_token_ids =
eos_token_ids.to_vec() which ignores any stop_token_ids passed in the
ChatCompletionRequest for the TRT-LLM path; change the construction of
stop_token_ids (used in proto::GenerateRequest) to merge tokenizer EOS IDs with
the request's explicit stop token IDs (e.g., body.stop_token_ids or
ChatCompletionRequest.stop_token_ids), convert/normalize them to u32, remove
duplicates (preserve a sensible order such as request-provided then EOS or
vice-versa), and use that merged Vec<u32> in the grpc_request so backend-side
stopping honors both sources.

In `@model_gateway/src/routers/grpc/client.rs`:
- Around line 271-280: The EOS stop IDs are only passed into build_chat_request,
so the messages path still omits merged EOS IDs; update
TrtllmServiceClient::build_messages_request (and any callers like
build_generate_request_from_messages) to accept the eos_token_ids slice
parameter (same type as in build_chat_request) and thread it through to where
stop_token_ids are set so stop_token_ids is populated for Messages requests just
like for Chat requests; adjust function signatures (build_messages_request,
build_generate_request_from_messages) and internal calls to pass the
eos_token_ids slice through unchanged.

In `@model_gateway/src/routers/grpc/harmony/builder.rs`:
- Around line 876-885: Update the ChatMessage::Tool match arm to prefer the
message's name field before falling back to the assistant tool-call map or the
tool_call_id: destructure the tool message to include name (e.g.
ChatMessage::Tool { content, tool_call_id, name, .. }) and compute function_name
by using name.clone() when Some, otherwise lookup
tool_call_map.get(tool_call_id).cloned() and finally fall back to
tool_call_id.clone().

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs`:
- Around line 177-207: The /responses path is missing model EOS token IDs
because the code builds eos_ids earlier but only passes them into
trtllm_client.build_generate_request_from_chat; update the responses branch
inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into
trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge
logic used by build_generate_request_from_chat) so stop_token_ids are populated
the same way for both RequestType::Chat and RequestType::Responses; locate the
eos_ids variable and the trtllm_client.build_generate_request_from_responses
call and add the eos_ids argument (or the equivalent merge) to match
build_generate_request_from_chat.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs`:
- Around line 86-117: The conversion currently drops tool names by setting
ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall
use the existing local variable name to populate ChatMessage::Tool.name (e.g.,
name: Some(name.clone())) when pushing the Tool message, and for the
FunctionCallOutput branch build a temporary call_id -> name map while iterating
ResponseInputOutputItem items (recording id -> name from FunctionToolCall
entries) so that when you encounter FunctionCallOutput you can look up the tool
name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to
None if not found); update both places where ChatMessage::Tool.name is currently
hardcoded to None.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d25c587a-cd42-47e9-8f11-7f2ef4b5d3d1

📥 Commits

Reviewing files that changed from the base of the PR and between 9516b44 and 46b284f.

📒 Files selected for processing (22)
  • bindings/python/src/smg/serve.py
  • crates/grpc_client/src/trtllm_service.rs
  • crates/protocols/src/chat.rs
  • crates/reasoning_parser/src/parsers/base.rs
  • crates/tokenizer/src/huggingface.rs
  • crates/tokenizer/src/mock.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/src/traits.rs
  • crates/tool_parser/src/parsers/kimik2.rs
  • model_gateway/src/core/worker_manager.rs
  • model_gateway/src/observability/metrics_server.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/builder.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/responses/conversions.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/tests/spec/chat_message.rs

Comment on lines +45 to +97
/// Load merged EOS token IDs from config.json and generation_config.json.
///
/// Models may define different EOS tokens in each file (e.g. Kimi-K2.5 has
/// `[EOS]` (163585) in config.json and `<|im_end|>` (163586) in
/// generation_config.json). We merge both into a deduplicated list so the
/// engine can stop at any of them — matching SGLang's behaviour.
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    // config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    // generation_config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    ids.into_iter().collect()
}

⚠️ Potential issue | 🟠 Major

Preserve tokenizer-derived EOS IDs when the JSON files don't declare them.

load_eos_token_ids() returns an empty vector if neither file has a numeric eos_token_id. Both this file and crates/tokenizer/src/huggingface.rs then assign that result straight into special_tokens.eos_token_ids, which wipes out EOS information even when special_tokens.eos_token was already resolved from tokenizer_config.json or vocab matching. For those models, TRT-LLM will receive no backend stop-token IDs at all. Seed the set from the resolved EOS token ID and then merge file-based IDs, rather than replacing everything with the file-only list.
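One way to implement the seeding, sketched as a standalone function; resolved_eos is a hypothetical parameter standing in for the EOS ID already resolved from tokenizer_config.json or vocab matching, and file_ids stands in for the merged config.json/generation_config.json result:

```rust
use std::collections::BTreeSet;

type TokenIdType = u32;

// Seed the set with the already-resolved EOS ID, then merge file-derived
// IDs on top, so the JSON files augment rather than replace it.
fn merge_eos_ids(resolved_eos: Option<TokenIdType>, file_ids: &[TokenIdType]) -> Vec<TokenIdType> {
    let mut ids = BTreeSet::new();
    if let Some(id) = resolved_eos {
        ids.insert(id);
    }
    ids.extend(file_ids.iter().copied());
    ids.into_iter().collect()
}

fn main() {
    // Files declare nothing: the tokenizer-derived EOS survives.
    assert_eq!(merge_eos_ids(Some(163586), &[]), vec![163586]);
    // Files add an extra EOS: both are kept, deduplicated and sorted.
    assert_eq!(merge_eos_ids(Some(163586), &[163585, 163586]), vec![163585, 163586]);
}
```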

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokenizer/src/tiktoken.rs` around lines 45 - 97, load_eos_token_ids
currently builds an empty set when neither config.json nor
generation_config.json declare numeric eos_token_id, causing callers (e.g.,
assignments to special_tokens.eos_token_ids in load code and in
crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS IDs
resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).

Comment on lines +89 to +116
async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
    let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
        "PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
    })?;

    let output = tokio::process::Command::new("python3")
        .args([
            "-c",
            "import sys\n\
             from prometheus_client import CollectorRegistry, generate_latest\n\
             from prometheus_client.multiprocess import MultiProcessCollector\n\
             registry = CollectorRegistry()\n\
             MultiProcessCollector(registry)\n\
             sys.stdout.buffer.write(generate_latest(registry))\n",
        ])
        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
        .output()
        .await
        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(format!("python3 prometheus collector failed: {stderr}"));
    }

    String::from_utf8(output.stdout)
        .map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Call sites =="
rg -n -C2 'collect_prometheus_multiproc_metrics|PROMETHEUS_MULTIPROC_DIR|prometheus_client|python3' .

echo
echo "== Candidate runtime / packaging files =="
fd -HI 'Dockerfile|dockerfile|requirements.*|pyproject.toml|uv.lock|poetry.lock|Pipfile|setup.py|Cargo.toml|compose.*' . \
  | while read -r f; do
      echo
      echo "--- $f"
      rg -n -C2 'python3|prometheus_client|prometheus-client|PROMETHEUS_MULTIPROC_DIR' "$f" || true
    done

Repository: lightseekorg/smg

Length of output: 29259


🏁 Script executed:

# Check the full Dockerfile to see base image and any pip installs
cat -n ./docker/Dockerfile

# Also check if there are any hidden requirements files or setup files referenced
fd -HI 'requirements|setup.py|pyproject.toml' ./docker/

Repository: lightseekorg/smg

Length of output: 3219


🏁 Script executed:

# Check the Python bindings setup/dependency files
cat -n ./bindings/python/pyproject.toml
echo "---"
cat -n ./bindings/python/setup.py

Repository: lightseekorg/smg

Length of output: 2793


🏁 Script executed:

# Search for any other references to prometheus_client installation or imports
rg -i 'prometheus.client|prometheus-client' \
  ./bindings/python/ ./docker/ \
  --type py --type txt --type toml --type yaml

# Check if it's supposed to come from worker installations
rg -i 'prometheus.client' ./e2e_test/ ./scripts/ \
  --type py --type sh --type yaml

# Check model_gateway Cargo.toml for any Python-related deps
rg -A5 -B5 'prometheus|python' ./model_gateway/Cargo.toml

Repository: lightseekorg/smg

Length of output: 602


🏁 Script executed:

# Check if prometheus_client might come from worker dependencies or is conditionally installed
rg -i 'prometheus.client|prometheus-client' . --type py --type toml --type txt \
  | grep -v node_modules | grep -v '.git' | head -20

# Also check if there's any grpc worker setup that might provide it
rg -B10 -A10 'PROMETHEUS_MULTIPROC_DIR' ./bindings/python/src/smg/serve.py

Repository: lightseekorg/smg

Length of output: 1906


Add prometheus_client to gateway dependencies and wrap the gRPC metrics subprocess with a timeout.

The gateway attempts to call python3 -c "from prometheus_client import ..." but prometheus_client is not listed in the smg package dependencies (bindings/python/pyproject.toml:29-34). Every gRPC metrics scrape will fail with a ModuleNotFoundError. Additionally, the subprocess call has no timeout, creating a hang risk. Add prometheus_client as a core dependency and wrap the .output().await with tokio::time::timeout() before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 89 - 116,
collect_prometheus_multiproc_metrics currently invokes a Python subprocess that
imports prometheus_client but that package isn't declared in the smg/python
bindings dependencies and the subprocess has no timeout; add "prometheus_client"
to the core Python deps in bindings/python/pyproject.toml (so the
ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.

Comment on lines +221 to +234
        // Strip leaked chatml tokens only when a model-specific parser is configured
        if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
            for token in [
                "<|im_end|>",
                "<|im_start|>",
                "<|im_user|>",
                "<|im_assistant|>",
                "<|im_system|>",
                "<|im_middle|>",
            ] {
                processed_text = processed_text.replace(token, "");
            }
            processed_text = processed_text.trim().to_string();
        }

⚠️ Potential issue | 🟠 Major

Non-streaming cleanup still misses part of the leaked-token set.

The streaming path now strips </think>, [EOS], and [BOS], but this non-streaming sanitizer only removes <|im_*|>. That means the final chat response can still return the exact markers this PR is trying to scrub. Please centralize the sanitizer and reuse it here and in the Messages block below.
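A sketch of such a helper as a free function; the real version would be a method on the processor impl, and the token list below simply combines the two existing paths:

```rust
// Single sanitizer covering both the <|im_*|> markers stripped here and
// the extra tokens (</think>, [EOS], [BOS]) the streaming path removes.
fn strip_leaked_tokens(text: &str) -> String {
    const LEAKED: [&str; 9] = [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
        "</think>",
        "[EOS]",
        "[BOS]",
    ];
    let mut clean = text.to_string();
    for token in LEAKED {
        clean = clean.replace(token, "");
    }
    clean.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("hello<|im_end|>[EOS]"), "hello");
    assert_eq!(strip_leaked_tokens("</think> answer "), "answer");
}
```

Note this replaces globally; if the payload may legitimately contain these markers inside JSON strings, the boundary-anchored strip_prefix/strip_suffix approach suggested for chat_utils.rs elsewhere in this review is the safer variant.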

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 221 - 234,
The non-streaming cleanup in processor.rs currently only strips <|im_*|> tokens
inside the block guarded by self.configured_tool_parser/is_some but misses other
leaked tokens (</think>, [EOS], [BOS]) that the streaming path removes; create a
single sanitizer function (e.g., strip_leaked_tokens or sanitize_leaked_tokens)
on the impl (or a private helper) that accepts &str and returns a trimmed
String, move the full token list and replacement logic into it, and call this
helper from the existing processed_text cleanup block and from the Messages
block referenced in the review so both streaming and non-streaming paths reuse
the same sanitizer (refer to self.configured_tool_parser,
self.configured_reasoning_parser, processed_text, and the Messages handling
code).
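The centralized sanitizer suggested above could look like this sketch (the name `strip_leaked_tokens` comes from the prompt's suggestion, not from the actual code; in the real crate it would be a private method on the processor impl so both the streaming and non-streaming paths call it):

```rust
// Single source of truth for the leaked-token set, covering both the
// <|im_*|> markers and the </think>/[EOS]/[BOS] tokens the streaming
// path already strips.
const LEAKED_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn strip_leaked_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for token in LEAKED_TOKENS {
        out = out.replace(token, "");
    }
    out.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("answer</think><|im_end|>"), "answer");
    // Already-clean text passes through unchanged.
    assert_eq!(strip_leaked_tokens("plain text"), "plain text");
    println!("sanitizer sketch ok");
}
```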

Comment on lines +710 to +720
if self.configured_tool_parser.is_some() && tool_parser_available {
    (tool_calls, processed_text) = self
        .parse_tool_calls(
            &processed_text,
            &messages_request.model,
            utils::message_utils::get_history_tool_calls_count_messages(
                &messages_request,
            ),
        )
        .await;
} else if used_json_schema {

⚠️ Potential issue | 🟠 Major

Configured parsers currently bypass the constrained-output fallback in Messages.

This if / else if ordering differs from process_single_choice. For messages::ToolChoice::Tool / Any, used_json_schema is already true, but a configured parser takes this branch first. If it returns no native tool calls, the JSON-schema wrapper path is skipped entirely and the arguments-only JSON is left as plain text instead of becoming a ToolUse block. Make this match the chat path by skipping the configured parser for constrained output, or by using an independent if tool_calls.is_none() && used_json_schema fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 710 - 720,
The configured parser branch currently runs before the constrained-output
(used_json_schema) handling, causing configured parsers to bypass the
JSON-schema -> ToolUse fallback; update the logic in processor.rs around
configured_tool_parser and parse_tool_calls so it either (a) skip invoking the
configured parser when used_json_schema is true (matching process_single_choice
behavior), or (b) after parse_tool_calls returns no tool_calls, check if
used_json_schema is true and then run the JSON-schema wrapper / create a ToolUse
block from the arguments-only JSON; ensure you reference and adjust the
conditionals around self.configured_tool_parser, tool_parser_available,
parse_tool_calls, used_json_schema, and the code path that constructs ToolUse so
the constrained-output fallback always runs when appropriate.
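Option (b) from the prompt above, as an illustrative-only sketch (all names and the string payloads are hypothetical; the real code works with parser structs, not strings): decouple the JSON-schema fallback from the parser branch so constrained output is always wrapped when the configured parser finds nothing.

```rust
// Hypothetical decision logic: the configured parser may run first, but the
// JSON-schema wrapper is an independent fallback, not an `else if` arm.
fn resolve_tool_calls(
    parser_result: Option<Vec<String>>,
    used_json_schema: bool,
    raw_text: &str,
) -> Option<Vec<String>> {
    let mut tool_calls = parser_result;
    // Independent fallback: runs even when a configured parser was tried first.
    if tool_calls.is_none() && used_json_schema {
        tool_calls = Some(vec![format!("wrapped:{raw_text}")]);
    }
    tool_calls
}

fn main() {
    // Configured parser found no native calls, but output was constrained:
    // the arguments-only JSON still becomes a tool-call payload.
    assert_eq!(
        resolve_tool_calls(None, true, r#"{"query":"weather"}"#),
        Some(vec![r#"wrapped:{"query":"weather"}"#.to_string()])
    );
    // Parser already produced calls: the fallback is a no-op.
    assert_eq!(
        resolve_tool_calls(Some(vec!["native".into()]), true, "ignored"),
        Some(vec!["native".to_string()])
    );
    println!("fallback ordering sketch ok");
}
```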

Comment on lines +130 to +134
// Step 3: Tokenize the processed text.
// Use add_special_tokens=true so the tokenizer recognizes special tokens
// (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
// output as single token IDs rather than splitting them into characters.
let encoding = match tokenizer.encode(&processed_messages.text, true) {

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: lightseekorg/smg; per-script outputs collapsed):

# First, let's explore the file structure and understand the context
find . -type f -name "preparation.rs" | head -20

# Check the full content of the preparation.rs file to understand the context
wc -l model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

# Read the relevant section and surrounding code for context
sed -n '115,150p' model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

# Find the process_chat_messages function to understand what processed_messages.text contains
rg "fn process_chat_messages|process_chat_messages\(" -A 10 -B 2

# Look for the Messages API path mentioned in the review to understand the parallel issue
rg "Messages API|messages.*special.*token" -i

# Read the process_chat_messages function implementation
sed -n '1,100p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Continue reading chat_utils.rs
wc -l model_gateway/src/routers/grpc/utils/chat_utils.rs
sed -n '1,200p' model_gateway/src/routers/grpc/utils/chat_utils.rs | tail -100

# Check the Messages API preparation file to see how it handles tokenization
sed -n '1,200p' model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

# Look for where add_special_tokens is used in the Messages API path
rg "add_special_tokens|encode.*true|encode.*false" -B 3 -A 3

# Find and read the complete process_chat_messages function
sed -n '275,450p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Look for template handling and escaping
rg "apply_chat_template|render|escape" -A 5 -B 2 | head -200

# Check what the ProcessedMessages struct contains and how the text is built
rg "struct ProcessedMessages|ProcessedMessages {" -A 10

# Search for ProcessedMessages struct definition
rg "struct ProcessedMessages" -A 15

# Find where apply_chat_template is called - the key part
rg "apply_chat_template" -B 5 -A 10 | head -100

# Check if there's any escaping of user content before the template
sed -n '400,550p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Check what the comment in messages/preparation.rs says about skip_special_tokens
grep -B 5 -A 5 "skip_special_tokens" model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

# Check tokenizer.rs to understand apply_chat_template behavior
find . -name "*.rs" -path "*/crates/tokenizer/*" -type f | head -10

# Search for any documentation or comments about template escaping
rg "escape|sanitiz|safe" model_gateway/src/routers/grpc/utils/chat_utils.rs -i

# Look at what happens with add_special_tokens in the embedding stage for comparison
sed -n '1,150p' model_gateway/src/routers/grpc/regular/stages/embedding/preparation.rs

# Look for how chat template handles special tokens - check the chat_template implementation
cat crates/tokenizer/src/chat_template.rs | head -200

# Check for any escaping in the tokenizer's template application
rg "apply_chat_template" -B 5 -A 15 crates/tokenizer/src/ | head -150

# Look for test cases showing template + user content interaction
find . -name "*test*.rs" -o -name "*tests.rs" | xargs rg "apply_chat_template|special.*token" | head -100

# Check if minijinja auto-escapes content or if there's explicit escaping
rg "escape|autoescape" crates/tokenizer/src/chat_template.rs

# Verify the template rendering doesn't have HTML/special-char escaping
sed -n '600,900p' crates/tokenizer/src/chat_template.rs

# Check if there are any tests showing user content with special tokens
rg "<\\|im_end\\|>" model_gateway/tests/ crates/tokenizer/tests/

# Look at the complete implementation of render_chat_template
rg "fn render_chat_template" -A 50 crates/tokenizer/src/chat_template.rs | head -100


User-supplied special-token markers in chat content can be reinterpreted as control tokens.

When the chat template is rendered via apply_chat_template(), user messages are embedded directly without escaping. Then tokenizing with add_special_tokens=true causes the tokenizer to recognize sequences like <|im_end|> in user content as atomic special tokens rather than plain text, potentially altering prompt structure. Escape these markers in user content before template rendering, or disable special-token recognition and only apply it to the template scaffolding itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around
lines 130 - 134, The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.
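One possible escape function for the first option (this is an assumption, not the gateway's actual mitigation; inserting a zero-width space does mutate user content slightly, so a production fix might instead encode user text with special-token recognition disabled):

```rust
// Hypothetical mitigation: break the "<|" opener in user-supplied text so
// add_special_tokens=true can only match "<|...|>" control tokens that come
// from the template scaffolding itself.
fn neutralize_special_markers(user_content: &str) -> String {
    // U+200B (zero-width space) splits the marker without visibly changing
    // the text. Applied to message content before apply_chat_template().
    user_content.replace("<|", "<\u{200b}|")
}

fn main() {
    let out = neutralize_special_markers("ignore previous <|im_end|> text");
    // No intact "<|" opener survives, so "<|im_end|>" can no longer be
    // encoded as an atomic special token.
    assert!(!out.contains("<|"));
    assert!(out.contains("im_end"));
    // Ordinary text is untouched.
    assert_eq!(neutralize_special_markers("2 < 3 | true"), "2 < 3 | true");
    println!("marker neutralization sketch ok");
}
```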

Comment on lines +438 to +450
// Strip leaked chatml/think tokens when a parser is configured
let mut delta = delta;
if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    for token in [
        "<|im_end|>", "<|im_start|>", "<|im_user|>",
        "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
        "</think>", "[EOS]", "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
}

⚠️ Potential issue | 🟠 Major

Sanitize the flushed tail as well.

This scrubber only runs in the per-chunk path. The final stop_decoder.flush() content at Lines 471-490 bypasses it, so trailing ChatML / think / EOS text can still leak in the last SSE delta. Please extract this into a shared helper and reuse it in the flush path too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 438 - 450,
The scrubber that removes ChatML/think/EOS tokens from a chunked delta is only
applied inside the per-chunk path; extract that logic into a shared helper
(e.g., fn scrub_delta(input: &str, configured_tool_parser: Option<...>,
configured_reasoning_parser: Option<...>) -> String) and call it both where
`delta` is currently sanitized and on the final `stop_decoder.flush()` output
before emitting the last SSE delta; reference the existing symbols
`configured_tool_parser`, `configured_reasoning_parser`, `delta`, and
`stop_decoder.flush()` so the new helper matches the current conditional
behavior and is reused in the flush path.

Comment on lines +1805 to +1807
if is_specific_function
    && !(self.configured_tool_parser.is_some() && tool_parser_available)
{

⚠️ Potential issue | 🟠 Major

ToolChoice::Tool can emit invalid Messages events when a configured parser exists.

When this condition is false, control falls into the generic parser branch even though streaming_tool_parser was created as the JSON parser for constrained output. That parser can emit parameters with name: None, and the loop below will then send InputJsonDelta before any ContentBlockStart, which violates the Anthropic event sequence. Match the chat streaming path here and always keep ToolChoice::Tool on the arguments-only branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1805 -
1807, The condition allows ToolChoice::Tool to fall into the generic parser
branch even when streaming_tool_parser (the JSON/constrained parser) was
selected, causing parameters with name: None to emit InputJsonDelta before
ContentBlockStart; change the conditional logic around ToolChoice::Tool so it
always follows the arguments-only branch (i.e., keep ToolChoice::Tool out of the
generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.

Comment on lines +411 to +420
// Generate TypeScript-style tool declarations for models that use it
// (e.g., Kimi K2.5 chat template checks for tools_ts_str)
if let Some(ref tools) = request.tools {
    if !tools.is_empty() {
        let ts_str = tools_to_typescript(tools);
        if !ts_str.is_empty() {
            combined_template_kwargs.insert(
                "tools_ts_str".to_string(),
                Value::String(ts_str),
            );

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Do TypeScript type aliases and object property declarations allow unquoted identifiers containing hyphens or spaces, and what characters are allowed in OpenAI function/tool names?

💡 Result:

TypeScript: type aliases + object/type property names

  • Type alias names must be a BindingIdentifier (type BindingIdentifier ... = ...), so they cannot contain spaces or hyphens unquoted. (sources.debian.org)

    • Example (invalid): type user-id = string / type "user id" = string
  • Object/type property names (in object literals, interfaces, and type literals) can be:

    • an IdentifierName (unquoted), or
    • a string literal (quoted), or
    • a numeric literal, or
    • a computed property name. (sources.debian.org)

    Because hyphens and spaces aren’t valid in ECMAScript IdentifierName, you must quote such property keys. (sources.debian.org)

    • Valid:
      • type T = { "user-id": string; "user id": string }
      • const o = { "user-id": 1, "user id": 2 }
    • Invalid (unquoted):
      • type T = { user-id: string } (parsed like user - id)
      • type T = { user id: string }

OpenAI function/tool names: allowed characters

OpenAI function tool name must be max 64 chars and contain only:

  • A–Z, a–z, 0–9, _ (underscore), - (dash)
    Spaces (and characters like .) are not allowed. (platform.openai.com)


Quote or sanitize function names in tools_ts_str.

This generator emits raw OpenAI tool names directly into TypeScript type declarations (e.g., type {func.name} = ...). OpenAI tool names can include hyphens (e.g., get-weather), but TypeScript type aliases require valid BindingIdentifiers and reject unquoted hyphens as syntax errors. The generated tools_ts_str will be syntactically invalid for any tool with a hyphenated name, breaking templates that rely on it.

Also applies to: 943–1121

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 411 - 420,
The generated TypeScript snippet inserted as "tools_ts_str" via
tools_to_typescript currently emits raw OpenAI tool names (from request.tools)
which may contain hyphens and produce invalid TypeScript identifiers; update
tools_to_typescript (and any generation logic that builds tools_ts_str used when
inserting into combined_template_kwargs) to either generate
quoted/string-literal type names (e.g., "'get-weather'" or `"\"get-weather\""`
in the TS output) or sanitize names into valid BindingIdentifiers (e.g.,
replace/normalize hyphens to underscores) and consistently use that sanitized
form in all places it’s referenced; ensure the change is applied across the
related generator code paths (including the tools_to_typescript function and
where tools_ts_str is inserted) so templates referencing tool names remain
syntactically valid.
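The two options from the review, sketched as hypothetical helpers (names are illustrative, not the actual `tools_to_typescript` internals): quote names emitted as TS property keys, and sanitize names used in type-alias position, where quoting is not legal TypeScript.

```rust
// Quote a tool name for use as a TS object/property key; string-literal
// keys like { "get-weather": ... } are valid TypeScript.
fn ts_quoted_key(name: &str) -> String {
    let is_ident = !name.is_empty()
        && !name.chars().next().unwrap().is_ascii_digit()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$');
    if is_ident {
        name.to_string()
    } else {
        format!("\"{}\"", name.replace('\\', "\\\\").replace('"', "\\\""))
    }
}

// Type aliases need a BindingIdentifier, so map invalid characters to '_'.
// (OpenAI names are limited to [A-Za-z0-9_-], so '-' is the main offender;
// a leading digit would still need extra handling.)
fn ts_binding_ident(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '$' { c } else { '_' })
        .collect()
}

fn main() {
    assert_eq!(ts_quoted_key("get_weather"), "get_weather");
    assert_eq!(ts_quoted_key("get-weather"), "\"get-weather\"");
    assert_eq!(ts_binding_ident("get-weather"), "get_weather");
    println!("tool-name sanitization sketch ok");
}
```

Whichever form is chosen must be used consistently everywhere the template references the tool name, or the declaration and the call-site lookup will disagree.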

Comment on lines +547 to +562
// Strip chatml / special tokens that may trail the constrained JSON output
let mut clean = processed_text.to_string();
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
    "</think>",
    "[EOS]",
    "[BOS]",
] {
    clean = clean.replace(token, "");
}
let clean = clean.trim();

⚠️ Potential issue | 🟠 Major

Don't replace() control tokens inside JSON payloads.

This removes every occurrence of markers like </think> and [EOS], not just leaked prefix/suffix tokens. If a valid JSON string field contains one of those literals, the payload is mutated before parsing and the tool arguments/content change silently. Please trim these markers only at the boundaries of processed_text.

Also applies to: 567-631

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 547 - 562,
The current loop in chat_utils.rs replaces control tokens globally in
processed_text (using clean.replace(...)), which can corrupt valid JSON string
fields; instead trim only leaked prefix/suffix markers from the start and end of
processed_text. Locate the block using the variables processed_text and clean
and replace the global replace() calls with logic that repeatedly strip known
tokens from the start (strip_prefix) and end (strip_suffix) or use anchored
regexes so only boundary occurrences like prefix/suffix markers (e.g.,
"<|im_start|>", "<|im_end|>", "</think>", "[EOS]", etc.) are removed; preserve
any occurrences of those tokens inside the JSON payload. Ensure the final clean
is trimmed of whitespace as before.
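A boundary-aware sketch of that fix (the helper name is illustrative): repeatedly strip known markers and whitespace from the ends only, so an occurrence inside a JSON string value survives.

```rust
const BOUNDARY_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

// Strip leaked markers only at the boundaries of the text; never mutate
// the interior, which may be a valid JSON payload.
fn trim_boundary_tokens(text: &str) -> &str {
    let mut s = text.trim();
    loop {
        let before = s;
        for token in BOUNDARY_TOKENS {
            s = s.strip_prefix(token).unwrap_or(s).trim_start();
            s = s.strip_suffix(token).unwrap_or(s).trim_end();
        }
        // Fixed point: no marker was removed this pass.
        if s == before {
            return s;
        }
    }
}

fn main() {
    // Trailing marker is removed...
    assert_eq!(
        trim_boundary_tokens(r#"{"query":"weather"}<|im_end|>"#),
        r#"{"query":"weather"}"#
    );
    // ...but a literal inside a JSON string survives untouched.
    assert_eq!(
        trim_boundary_tokens(r#"{"note":"ends with [EOS] token"}"#),
        r#"{"note":"ends with [EOS] token"}"#
    );
    println!("boundary trim sketch ok");
}
```

The loop handles stacked trailers like `</think><|im_end|>[EOS]` that a single pass over the token list could miss, depending on ordering.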

Comment on lines 66 to 70
ChatMessage::Tool {
    content,
    tool_call_id,
    ..
} => {

⚠️ Potential issue | 🟡 Minor

Add an assertion for the new tool.name field.

This only loosens the pattern match. It would be better to also deserialize a payload with "name" and assert it round-trips, otherwise the new wire contract stays untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/spec/chat_message.rs` around lines 66 - 70, The test's
pattern match on ChatMessage::Tool was loosened but doesn't assert the new
tool.name field; update the test in model_gateway/tests/spec/chat_message.rs to
deserialize a payload that includes "name", assert that ChatMessage::Tool
contains tool.name equal to the provided value, and also serialize it back
(round-trip) to ensure the wire contract; locate the match arm for
ChatMessage::Tool and add assertions against the tool.name field and a
serialization check to confirm round-trip fidelity.

When the gRPC worker hasn't written any .db files yet (startup phase or
no requests processed), generate_latest() returns empty bytes.  Passing
this empty string to parse_prometheus() triggers a parse error WARN log
on every scrape interval.

Add guards at both the collector function (return Err for empty output)
and the call site (skip empty text silently) to prevent log spam.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
metrics_text: text,
});
}
Ok(_) => {

🟡 Nit: This Ok(_) arm is unreachable. collect_prometheus_multiproc_metrics() (line 116-118) already returns Err when text.trim().is_empty(), so the Ok variant can never contain an empty string. The guard on line 339 is redundant too — it will always be true.

Consider simplifying to just Ok(text) => { ... } to avoid confusing future readers into thinking empty-Ok is a real case.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
model_gateway/src/core/worker_manager.rs (1)

94-107: ⚠️ Potential issue | 🟠 Major

Add a timeout around the Python metrics collector subprocess.

.output().await can block this handler indefinitely if python3 startup or the prometheus_client import stalls, so /metrics turns into a hung request instead of a degraded scrape. Please bound this with tokio::time::timeout(...) and verify that prometheus_client is actually packaged anywhere the gateway runs.

🛠️ Suggested fix
-    let output = tokio::process::Command::new("python3")
-        .args([
-            "-c",
-            "import sys\n\
-             from prometheus_client import CollectorRegistry, generate_latest\n\
-             from prometheus_client.multiprocess import MultiProcessCollector\n\
-             registry = CollectorRegistry()\n\
-             MultiProcessCollector(registry)\n\
-             sys.stdout.buffer.write(generate_latest(registry))\n",
-        ])
-        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
-        .output()
-        .await
-        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
+    let output = tokio::time::timeout(
+        Duration::from_secs(3),
+        tokio::process::Command::new("python3")
+            .args([
+                "-c",
+                "import sys\n\
+                 from prometheus_client import CollectorRegistry, generate_latest\n\
+                 from prometheus_client.multiprocess import MultiProcessCollector\n\
+                 registry = CollectorRegistry()\n\
+                 MultiProcessCollector(registry)\n\
+                 sys.stdout.buffer.write(generate_latest(registry))\n",
+            ])
+            .env("PROMETHEUS_MULTIPROC_DIR", &dir)
+            .output(),
+    )
+    .await
+    .map_err(|_| "collect_prometheus_multiproc_metrics timed out".to_string())?
+    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
#!/bin/bash
set -euo pipefail

echo "== collector helper =="
sed -n '88,120p' model_gateway/src/core/worker_manager.rs

echo
echo "== packaging/runtime declarations for prometheus_client =="
fd -HI 'pyproject.toml|setup.py|requirements*.txt|Dockerfile|dockerfile|Pipfile|uv.lock|poetry.lock' . \
  | while read -r f; do
      echo "--- $f"
      rg -n -C2 'prometheus_client|python3' "$f" || true
    done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 94 - 107, Wrap the
tokio process call that creates the python3 prometheus collector (the Command
building code that currently ends with .output().await and assigns to output)
with tokio::time::timeout(...) to bound how long we wait for the subprocess to
start and return; on timeout return a mapped error like "prometheus collector
timed out" (same error type as the existing map_err) so the handler returns
promptly, and ensure you still propagate other I/O errors from .output(); keep
the existing env("PROMETHEUS_MULTIPROC_DIR", &dir) and variable name output
unchanged so the rest of the function can use it unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 116-118: The check that treats empty multiprocess output as an
error (the branch with text.trim().is_empty()) should instead return a soft-skip
sentinel so warmup/idle periods don't produce warnings; replace the Err(...)
with Ok(String::new()) (or otherwise return an empty String) in the function
containing the text.trim().is_empty() check, and update the later handling
around the warn! branch (the logic at the Ok(_) / warn! region) to special-case
this empty-string sentinel before logging a warning so the skip is reachable and
no spurious warnings occur.

---

Duplicate comments:
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 94-107: Wrap the tokio process call that creates the python3
prometheus collector (the Command building code that currently ends with
.output().await and assigns to output) with tokio::time::timeout(...) to bound
how long we wait for the subprocess to start and return; on timeout return a
mapped error like "prometheus collector timed out" (same error type as the
existing map_err) so the handler returns promptly, and ensure you still
propagate other I/O errors from .output(); keep the existing
env("PROMETHEUS_MULTIPROC_DIR", &dir) and variable name output unchanged so the
rest of the function can use it unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 49179c74-ff91-475b-89f4-334604303a24

📥 Commits

Reviewing files that changed from the base of the PR and between 46b284f and 98bd903.

📒 Files selected for processing (1)
  • model_gateway/src/core/worker_manager.rs

Comment on lines +116 to +118
if text.trim().is_empty() {
return Err("no metrics available from gRPC workers yet".to_string());
}

⚠️ Potential issue | 🟡 Minor

Treat empty multiprocess output as a soft skip, not an error.

Line 116 converts the “no .db files yet” case into Err(...), but Line 345 only skips silently on Ok(_). That makes the skip branch unreachable and will warn on every scrape during warmup/idle periods. Return Ok(String::new()) here, or special-case this sentinel before warn!.

🧹 Suggested fix
-    if text.trim().is_empty() {
-        return Err("no metrics available from gRPC workers yet".to_string());
-    }
+    if text.trim().is_empty() {
+        return Ok(String::new());
+    }

Also applies to: 345-350

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 116 - 118, The check
that treats empty multiprocess output as an error (the branch with
text.trim().is_empty()) should instead return a soft-skip sentinel so
warmup/idle periods don't produce warnings; replace the Err(...) with
Ok(String::new()) (or otherwise return an empty String) in the function
containing the text.trim().is_empty() check, and update the later handling
around the warn! branch (the logic at the Ok(_) / warn! region) to special-case
this empty-string sentinel before logging a warning so the skip is reachable and
no spurious warnings occur.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98bd903b4c


// Pass merged EOS token IDs from config.json + generation_config.json.
// TRT-LLM's gRPC path does not reliably merge these internally,
// so we provide them explicitly via the standard stop_token_ids field.
let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();

P2 Badge Respect ignore_eos when injecting EOS stop token IDs

This unconditionally copies merged EOS IDs into stop_token_ids, but stop_token_ids are hard stop conditions while ignore_eos is meant to continue generation past EOS. As written, requests with ignore_eos=true will still terminate when an EOS token appears, which breaks the documented ignore_eos behavior for TRT-LLM chat/harmony requests.

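The suggested behavior, as an illustrative sketch (the function and field names are hypothetical, not the TRT-LLM path's actual code): only inject the merged EOS IDs as hard stops when the request does not set ignore_eos.

```rust
// Hypothetical helper: EOS IDs become hard stop_token_ids only when the
// request honors EOS; ignore_eos=true leaves the stop set empty so
// generation can continue past EOS.
fn build_stop_token_ids(eos_token_ids: &[u32], ignore_eos: bool) -> Vec<u32> {
    if ignore_eos {
        Vec::new()
    } else {
        eos_token_ids.to_vec()
    }
}

fn main() {
    // Normal request: merged config.json + generation_config.json EOS IDs
    // are forwarded as stops (example IDs only).
    assert_eq!(build_stop_token_ids(&[2, 163_586], false), vec![2, 163_586]);
    // ignore_eos request: no EOS-derived stops are injected.
    assert!(build_stop_token_ids(&[2, 163_586], true).is_empty());
    println!("ignore_eos sketch ok");
}
```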

"[EOS]",
"[BOS]",
] {
clean = clean.replace(token, "");

P2 Badge Avoid mutating JSON payloads when stripping special tokens

This removes token substrings with global replace before parsing constrained JSON, so valid argument values containing literals like [EOS], </think>, or <|im_end|> get silently altered (for example a tool argument string can lose content). That corrupts structured outputs/tool inputs rather than just trimming leaked trailer tokens, so cleanup should be boundary-aware instead of replacing everywhere in the JSON text.


@mergify
Contributor

mergify bot commented Apr 9, 2026

Hi @ConnorLi96, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 9, 2026

Labels

grpc: gRPC client and router changes
model-gateway: Model gateway crate changes
needs-rebase: PR has merge conflicts that need to be resolved
protocols: Protocols crate changes
python-bindings: Python bindings changes
reasoning-parser: Reasoning parser changes
tests: Test changes
tokenizer: Tokenizer related changes
tool-parser: Tool/function call parser changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants