[WIP] Connorli/fix func call for KimiK2.5 #1070

Open
ConnorLi96 wants to merge 7 commits into main from connorli/fix-func-call-all

Conversation

@ConnorLi96
Collaborator

@ConnorLi96 ConnorLi96 commented Apr 8, 2026

Description

Note: This PR is for record-keeping purposes. The changes will be systematically split into separate PRs for review and merging.

Problem

Function calling with Kimi-K2.5 (NVFP4) is severely broken on SMG — fc-dash scores 44/60 (SMG+SGLang) and 18/60 (SMG+TRT-LLM) vs 58/60 on direct SGLang. Multiple stages of the SMG inference pipeline have issues that compound into wrong or missing tool calls, empty structured outputs, and leaked native tokens.

Solution

Fix function calling across the full SMG inference pipeline, organized by stage:

1. Input Tokenization — Fix Rust tiktoken encoder to use encode_with_special_tokens for chat template output, ensuring special tokens like <|im_system|>, <|im_end|> are encoded as single token IDs instead of being split into character-level BPE tokens (tiktoken.rs, preparation.rs)
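
The core of this fix can be sketched as a prefix-matching pass that emits one ID per special token before falling back to ordinary BPE. Everything below is illustrative: the token IDs and the byte-level fallback are made up, and the real implementation is tiktoken's `encode_with_special_tokens`.

```rust
use std::collections::HashMap;

/// Encode `text`, emitting a single ID for any registered special token
/// instead of splitting it into per-character/byte tokens.
fn encode_with_specials(text: &str, specials: &HashMap<&str, u32>) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = text;
    'outer: while !rest.is_empty() {
        // If a special token starts here, emit its single ID.
        for (tok, id) in specials {
            if rest.starts_with(*tok) {
                out.push(*id);
                rest = &rest[tok.len()..];
                continue 'outer;
            }
        }
        // Fallback: pretend each byte is its own BPE token (stand-in for
        // the real byte-pair encoder).
        let ch_len = rest.chars().next().unwrap().len_utf8();
        for b in rest[..ch_len].bytes() {
            out.push(b as u32);
        }
        rest = &rest[ch_len..];
    }
    out
}

fn main() {
    let mut specials = HashMap::new();
    specials.insert("<|im_end|>", 163586u32);
    println!("{:?}", encode_with_specials("ok<|im_end|>", &specials));
}
```

Without the special-token pass, `<|im_end|>` would decay into ten byte-level tokens and the engine would never see the stop ID.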

2. Chat Template Rendering — Add name: Option<String> to ChatMessage::Tool so the Jinja2 template resolves message.get('name') correctly (e.g. tool role renders as function name instead of literal "tool"). Keep tool call arguments as strings instead of parsing to serde_json::Value to preserve original formatting like {"query": "weather"} vs {"query":"weather"} (chat.rs, chat_utils.rs)
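
The formatting-loss problem is easy to demonstrate: round-tripping argument text through a JSON value type (e.g. `serde_json::Value`) re-serializes it in compact form. The helper below is a hand-rolled stand-in for that round trip, stripping whitespace outside strings the way a compact serializer would.

```rust
/// Remove insignificant whitespace outside of strings — mimicking what a
/// parse-to-Value / serialize round trip does to the original argument text.
fn compact_json(s: &str) -> String {
    let mut out = String::new();
    let (mut in_string, mut escape) = (false, false);
    for ch in s.chars() {
        if in_string {
            out.push(ch);
            if escape {
                escape = false;
            } else if ch == '\\' {
                escape = true;
            } else if ch == '"' {
                in_string = false;
            }
        } else if ch == '"' {
            in_string = true;
            out.push(ch);
        } else if !ch.is_whitespace() {
            out.push(ch);
        }
    }
    out
}

fn main() {
    let original = r#"{"query": "weather"}"#;
    // Round-tripping normalizes the spacing; keeping the raw String does not.
    assert_ne!(compact_json(original), original);
    println!("kept as string: {original}");
}
```

Keeping the arguments as a `String` sidesteps this entirely — the bytes the model produced are the bytes the client receives.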

3. Reasoning Parser — Skip reasoning parsing when output is constrained (specific tool_choice or response_format: json_object/json_schema), preventing the parser from swallowing constrained JSON output as reasoning content. Add <|tool_calls_section_begin|> detection as a split point when </think> is absent (processor.rs, streaming.rs, base.rs)
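
The gating condition described above can be sketched as a single predicate. The type names are illustrative, not the gateway's actual types, and including `Required` alongside a specific function choice is an assumption based on the review discussion.

```rust
#[allow(dead_code)]
enum ToolChoice { Auto, None, Required, Function(String) }
#[allow(dead_code)]
enum ResponseFormat { Text, JsonObject, JsonSchema }

/// True when the engine is forced to emit JSON, so the reasoning parser
/// must not run and swallow the constrained output as reasoning content.
fn output_is_constrained(tool_choice: &ToolChoice, format: &ResponseFormat) -> bool {
    let constrained_tools =
        matches!(tool_choice, ToolChoice::Function(_) | ToolChoice::Required);
    let constrained_format =
        matches!(format, ResponseFormat::JsonObject | ResponseFormat::JsonSchema);
    constrained_tools || constrained_format
}

fn main() {
    let c = output_is_constrained(
        &ToolChoice::Function("get_weather".into()),
        &ResponseFormat::Text,
    );
    println!("constrained: {c}");
}
```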

4. Tool Call Parser — Support hyphens in KimiK2 tool call IDs (e.g. functions.execute-tool:0) and alternative delimiters <|func_start|>/<|func_end|>. Add fallback from configured parser to JSON schema parser for tool_choice=specific_function when native tokens are absent (kimik2.rs, processor.rs)
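
A minimal sketch of the hyphen-tolerant ID parsing, assuming the `functions.<name>:<index>` shape stated above; the real parser lives in kimik2.rs and this helper name is hypothetical.

```rust
/// Parse a KimiK2-style tool call ID like "functions.execute-tool:0" into
/// (function_name, call_index), allowing hyphens in the function name.
fn parse_tool_call_id(id: &str) -> Option<(&str, u32)> {
    let rest = id.strip_prefix("functions.")?;
    let (name, idx) = rest.rsplit_once(':')?;
    if name.is_empty()
        || !name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_' || c == '.')
    {
        return None;
    }
    Some((name, idx.parse().ok()?))
}

fn main() {
    println!("{:?}", parse_tool_call_id("functions.execute-tool:0"));
}
```

A character class that omits `-` would reject `execute-tool` outright, which is exactly the failure mode this fix targets.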

5. Stop Token / EOS Handling — Load merged EOS token IDs from both config.json ([EOS]=163585) and generation_config.json (<|im_end|>=163586), matching SGLang's _get_hf_eos_token_id() merge logic. Expose eos_token_ids() on the Tokenizer trait and pass them as stop_token_ids in TRT-LLM gRPC requests. This fixes TRT-LLM generating past turn boundaries (18/60 → 55/60) (tiktoken.rs, traits.rs, trtllm_service.rs)

6. Output Cleanup — Strip leaked chatml tokens and [EOS]/[BOS] text from both streaming and non-streaming output paths, conditioned on parser configuration (processor.rs, streaming.rs, chat_utils.rs)

Other (separate concern): Graceful metrics server bind failure — log error instead of panicking when port is in use (metrics_server.rs). This should be a separate PR.

Summary by CodeRabbit

  • New Features

    • Enhanced EOS token handling to improve generation stopping across backends
    • Support for additional tool-call delimiters and richer tool-message metadata
  • Bug Fixes

    • Metrics server no longer crashes on bind failures
    • Reasoning parser no longer interferes with constrained JSON outputs
    • Tokenization now recognizes special tokens correctly
  • Improvements

    • Better gRPC metrics collection and safer metrics aggregation
    • Cleaner JSON and token cleanup in streaming responses and parsing

scottjlee and others added 6 commits April 3, 2026 22:17
Signed-off-by: Scott Lee <scott@together.ai>
Signed-off-by: Scott Lee <scott@together.ai>
…ol-call tokens

- Prioritize explicitly configured tool parser over JSON schema parsing
- Support alternative delimiters (<|func_start|>/<|func_end|>) in KimiK2 parser
- Prevent reasoning parser from consuming tool call markers
- Strip leaked chatml tokens only when a parser is explicitly configured
- Truncate trailing content after first valid JSON object for JSON response formats

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
- Fix tiktoken encoder to use encode_with_special_tokens for chat template output
- Prioritize configured tool parser over JSON schema with fallthrough
- Skip reasoning parsing when output is constrained (specific tool_choice / JSON response_format)
- Support alternative delimiters and hyphens in KimiK2 tool call parser
- Prevent reasoning parser from consuming tool call markers
- Keep tool call arguments as strings to preserve formatting
- Generate TypeScript-style tool declarations for Kimi K2.5
- Strip leaked chatml tokens conditionally and clean up JSON response format
- Add name field to Tool ChatMessage variant
- Graceful metrics server bind failure

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
TRT-LLM's gRPC path does not reliably apply generation_config.eos_token_id
via SamplingParams._setup(), causing chatml models (e.g. Kimi-K2.5) to
generate past turn boundaries (18/60 -> 52/60 on fc-dash).

This is a workaround: the proper fix is for TRT-LLM to honor
generation_config.eos_token_id in its gRPC executor path.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…on_config.json

Models like Kimi-K2.5 define different EOS tokens in config.json ([EOS]=163585)
and generation_config.json (<|im_end|>=163586). The engine must stop at both.

SGLang handles this internally by merging both sources, but TRT-LLM's gRPC path
does not. This change:

- Adds load_eos_token_ids() to read and merge EOS IDs from both config files
- Exposes eos_token_ids() on the Tokenizer trait
- Passes merged IDs as stop_token_ids in TRT-LLM gRPC requests
- Removes the hardcoded <|im_end|> stop string workaround (bfab5ad)
- Strips [EOS]/[BOS] text from constrained decoding output

Result: SMG+TRT-LLM fc-dash 55/60 (was 52/60 with hack, 18/60 without)

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
@github-actions bot added labels: tokenizer (Tokenizer related changes), python-bindings (Python bindings changes), grpc (gRPC client and router changes), tests (Test changes), tool-parser (Tool/function call parser changes), reasoning-parser (Reasoning parser changes), protocols (Protocols crate changes), model-gateway (Model gateway crate changes) — Apr 8, 2026
@coderabbitai

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

Adds merged EOS token ID support across tokenizers and threads EOS IDs through gRPC request construction for TRT-LLM; enables special-token-aware tokenization; extends tool-message schema with an optional name; tightens reasoning/tool parsing around constrained JSON outputs; adds Prometheus multiprocess metrics collection and safer metrics server startup.

Changes

Cohort / File(s) Summary
Tokenizer trait & implementations
crates/tokenizer/src/traits.rs, crates/tokenizer/src/tiktoken.rs, crates/tokenizer/src/huggingface.rs, crates/tokenizer/src/mock.rs
Added eos_token_ids field/method to SpecialTokens/Tokenizer trait; implemented EOS loading/merge from config.json + generation_config.json; tokenizers expose merged EOS IDs.
gRPC request wiring (EOS)
model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs, model_gateway/src/routers/grpc/harmony/stages/request_building.rs, model_gateway/src/routers/grpc/client.rs, crates/grpc_client/src/trtllm_service.rs
Tokenizers' eos_token_ids passed into chat request builders; TrtllmServiceClient::build_generate_request_from_chat now accepts eos_token_ids and populates proto stop_token_ids.
Tokenization special-token handling
model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs, model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
Switched tokenizer calls to encode(..., true) to enable special-token recognition so chat-template markers and <think> are treated as single tokens.
Reasoning & tool parsing behavior
crates/reasoning_parser/src/parsers/base.rs, model_gateway/src/routers/grpc/regular/processor.rs, model_gateway/src/routers/grpc/regular/streaming.rs
Detects output constraints (tool_choice + JSON formats) to skip reasoning/tool parsing when constrained; splits reasoning at tool-call section marker; adds streaming sanitization to avoid consuming constrained JSON.
Tool parser improvements
crates/tool_parser/src/parsers/kimik2.rs
Added alternate delimiters `<|func_start|>`/`<|func_end|>` and hyphen support in KimiK2 tool call IDs.
ChatML/token scrubbing & utils
model_gateway/src/routers/grpc/utils/chat_utils.rs, model_gateway/src/routers/grpc/regular/responses/conversions.rs
Added ChatML/special-token stripping for parsing; added tools_to_typescript generator; ensure ChatMessage::Tool conversions include name: None.
ChatMessage schema & tests
crates/protocols/src/chat.rs, model_gateway/src/routers/grpc/harmony/builder.rs, model_gateway/tests/spec/chat_message.rs
ChatMessage::Tool gains name: Option<String>; pattern matches updated to ignore new field; tests updated accordingly.
Prometheus multiprocess metrics & metrics server
bindings/python/src/smg/serve.py, model_gateway/src/core/worker_manager.rs, model_gateway/src/observability/metrics_server.rs
For gRPC workers create/manage a temp PROMETHEUS_MULTIPROC_DIR and clean it on shutdown; added collect_prometheus_multiproc_metrics() calling a Python helper; metrics server startup no longer panics on bind failure.
Streaming/message preparation adjustments
model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs, model_gateway/src/routers/grpc/regular/streaming.rs
Tokenizer special-token handling enabled for Messages; streaming deltas sanitized of ChatML/think markers when configured tool/reasoning parsers are used.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Tokenizer
    participant RequestBuilder
    participant GrpcClient
    participant TRT_LLM

    Client->>Tokenizer: load tokenizer (config + generation_config)
    Tokenizer->>Tokenizer: merge eos_token_ids
    Tokenizer-->>RequestBuilder: eos_token_ids &[u32]

    Client->>RequestBuilder: prepare chat request
    RequestBuilder->>GrpcClient: build_chat_request(..., eos_token_ids)
    GrpcClient->>GrpcClient: build_generate_request_from_chat(eos_token_ids)
    GrpcClient->>TRT_LLM: GenerateRequest { stop_token_ids: eos_token_ids }
    TRT_LLM-->>GrpcClient: Response
sequenceDiagram
    actor Client
    participant Processor
    participant ConstraintDetector
    participant ReasoningParser
    participant ToolParser
    participant Formatter

    Client->>Processor: chat response (with response_format/tool_choice)
    Processor->>ConstraintDetector: detect output_is_constrained
    ConstraintDetector-->>Processor: true/false

    alt constrained
        Processor->>Formatter: skip ReasoningParser & ToolParser
        Processor->>Formatter: extract/truncate JSON from fenced blocks
    else unconstrained
        Processor->>ReasoningParser: parse reasoning blocks
        ReasoningParser-->>Processor: reasoning_text
        Processor->>ToolParser: parse tool calls
        ToolParser-->>Processor: tool_calls
    end

    Processor->>Formatter: final formatting
    Formatter-->>Client: response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237


🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is vague and does not clearly describe the main changes; '[WIP]' indicates work-in-progress status, and 'fix func call for KimiK2.5' is overly broad given the extensive multi-stage pipeline modifications. Revise to a more descriptive title that captures the primary change(s), such as 'Fix function calling for KimiK2.5 via tokenizer and tool-call parser improvements' or split into separate, more specific PR titles as mentioned in the PR description.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@ConnorLi96 ConnorLi96 changed the title [WIP] Connorli/fix func call all [WIP] Connorli/fix func call for KimiK2.5 Apr 8, 2026
Comment on lines +222 to +234
if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
    ] {
        processed_text = processed_text.replace(token, "");
    }
    processed_text = processed_text.trim().to_string();
}

🟡 Nit: The chatml token stripping list here (non-streaming chat path) is missing </think>, [EOS], and [BOS] that the streaming path at streaming.rs:443-446 does strip. Same discrepancy exists in the Messages non-streaming path at line 747-758 and in process_tool_calls_stream at streaming.rs:1288-1291.

This means non-streaming responses could leak [EOS]/[BOS] tokens that streaming responses correctly filter out. Consider extracting a shared constant (e.g. const CHATML_LEAK_TOKENS: &[&str] = &[...]) to keep all 6 stripping sites consistent.
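
The extraction the comment asks for is straightforward; a sketch with the union token set from the streaming path (the constant and helper names are the reviewer's suggestion, not existing code):

```rust
/// One shared list for all six stripping sites. The set is the union used
/// by the streaming path (chatml markers plus </think>, [EOS], [BOS]).
const CHATML_LEAK_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn strip_leaked_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for tok in CHATML_LEAK_TOKENS {
        // Whole-string replacement: only exact marker matches are removed.
        out = out.replace(tok, "");
    }
    out.trim().to_string()
}

fn main() {
    println!("{}", strip_leaked_tokens("<|im_end|>hello [EOS]"));
}
```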

Comment on lines +710 to +719
if self.configured_tool_parser.is_some() && tool_parser_available {
(tool_calls, processed_text) = self
.parse_tool_calls(
&processed_text,
&messages_request.model,
utils::message_utils::get_history_tool_calls_count_messages(
&messages_request,
),
)
.await;

🟡 Nit: The chat non-streaming path (line ~167) gates the configured tool parser with && !output_is_constrained, but this Messages non-streaming path does not. If a Messages API request uses constrained decoding (e.g. specific tool choice) with a configured tool parser, the parser will run on raw JSON and may misparse it. Consider adding the same output_is_constrained guard here for consistency, or document why the Messages path intentionally differs.

out
}

fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {

🟡 Nit: The required parameter is unused — object_to_typescript reads the required list from the schema value directly. This is dead code; consider removing the parameter.

Suggested change
fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {
fn schema_to_typescript(schema: &Value, indent: &str) -> String {

Comment on lines +251 to +269
if processed_text.starts_with('{') {
    let mut depth = 0i32;
    let mut in_string = false;
    let mut escape = false;
    let mut json_end = None;
    for (i, ch) in processed_text.char_indices() {
        if escape { escape = false; continue; }
        if ch == '\\' && in_string { escape = true; continue; }
        if ch == '"' { in_string = !in_string; continue; }
        if in_string { continue; }
        if ch == '{' { depth += 1; } else if ch == '}' {
            depth -= 1;
            if depth == 0 { json_end = Some(i + 1); break; }
        }
    }
    if let Some(end) = json_end {
        processed_text = processed_text[..end].to_string();
    }
}

🟡 Nit: This hand-rolled JSON brace-matching truncator only handles {…} top-level objects. If response_format is json_schema and the schema's root type is an array ([…]), the JSON response will start with [ and this code won't trim trailing tokens from it. Consider also handling the array case, or using serde_json::from_str to find the valid JSON prefix (attempt parse, on failure progressively trim trailing characters).
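
A std-only sketch of the suggested extension: track both `{}` and `[]` depth so array-rooted responses are trimmed too. The helper name is hypothetical; the hunk under review modifies `processed_text` in place instead.

```rust
/// Byte length of the first complete JSON object or array at the start of
/// `text`, or None if there is no complete value. Depth counting is shared
/// between braces and brackets, so `[...]` roots are handled as well.
fn first_json_value_len(text: &str) -> Option<usize> {
    let first = text.chars().next()?;
    if first != '{' && first != '[' {
        return None;
    }
    let (mut depth, mut in_string, mut escape) = (0i32, false, false);
    for (i, ch) in text.char_indices() {
        if escape { escape = false; continue; }
        if in_string {
            match ch {
                '\\' => escape = true,
                '"' => in_string = false,
                _ => {}
            }
            continue;
        }
        match ch {
            '"' => in_string = true,
            '{' | '[' => depth += 1,
            '}' | ']' => {
                depth -= 1;
                if depth == 0 { return Some(i + 1); }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    println!("{:?}", first_json_value_len("[1, 2]<|im_end|>"));
}
```

This is still a sketch, not a validator — mismatched closers like `{...]` pass through — but it trims trailing leakage for both root types.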

Comment on lines +87 to +115

/// Collect gRPC worker metrics by aggregating `PROMETHEUS_MULTIPROC_DIR` via a python3 subprocess.
async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
"PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
})?;

let output = tokio::process::Command::new("python3")
.args([
"-c",
"import sys\n\
from prometheus_client import CollectorRegistry, generate_latest\n\
from prometheus_client.multiprocess import MultiProcessCollector\n\
registry = CollectorRegistry()\n\
MultiProcessCollector(registry)\n\
sys.stdout.buffer.write(generate_latest(registry))\n",
])
.env("PROMETHEUS_MULTIPROC_DIR", &dir)
.output()
.await
.map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

if !output.status.success() {
let stderr = String::from_utf8_lossy(&output.stderr);
return Err(format!("python3 prometheus collector failed: {stderr}"));
}

String::from_utf8(output.stdout)
.map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))

🟡 Nit: collect_prometheus_multiproc_metrics spawns a new python3 process on every /metrics scrape. If Prometheus scrapes at 15s intervals, that's ~5760 Python startups/day. This works but could add latency to the metrics endpoint under load. Consider documenting this trade-off or, as a follow-up, caching the Python process / using a longer-lived collector.


@claude claude bot left a comment


Comprehensive set of fixes for func call handling, EOS token merging, chatml token stripping, and gRPC metrics. The changes are well-structured and the logic is sound.

Summary: 0 🔴 Important · 5 🟡 Nit · 0 🟣 Pre-existing

Nits flagged:

  1. Inconsistent chatml token strip lists — 6 stripping sites use different token sets; non-streaming paths miss </think>, [EOS], [BOS]. Extract a shared constant.
  2. Messages non-streaming path missing output_is_constrained guard — Chat path gates configured tool parser on constrained output, Messages path doesn't.
  3. Unused required parameter in schema_to_typescript — dead code.
  4. JSON truncator only handles {…} — Array-rooted JSON schemas ([…]) won't be trimmed.
  5. Python subprocess per metrics scrape — works but spawns a process on every /metrics call; consider documenting the trade-off.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46b284fe0e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +710 to +720
if self.configured_tool_parser.is_some() && tool_parser_available {
    (tool_calls, processed_text) = self
        .parse_tool_calls(
            &processed_text,
            &messages_request.model,
            utils::message_utils::get_history_tool_calls_count_messages(
                &messages_request,
            ),
        )
        .await;
} else if used_json_schema {

P1 Badge Restore JSON-schema fallback after configured tool parsing

When messages_request.tool_choice is Tool/Any (JSON-schema constrained output) and a configured tool parser is present, this branch always runs parse_tool_calls first and skips the JSON-schema parser due to the else if chain. In constrained mode the model often returns plain JSON without native tool-call markers, so parse_tool_calls returns no calls and the response loses tool invocations instead of building ToolUse blocks. This is a regression from the previous logic that prioritized parse_json_schema_response for constrained tool choices.
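
The ordering the reviewer asks for can be sketched as "native parser first, JSON-schema fallback when constrained and nothing was found". The signatures are illustrative — the real code threads tool-call structs, not strings.

```rust
/// Run the configured native parser; if it yields no calls under a
/// constrained tool choice, fall back to the JSON-schema parser, since
/// constrained decoding often emits bare JSON without native markers.
fn parse_with_fallback(
    text: &str,
    native: impl Fn(&str) -> Vec<String>,
    json_schema: impl Fn(&str) -> Vec<String>,
    constrained: bool,
) -> Vec<String> {
    let calls = native(text);
    if calls.is_empty() && constrained {
        return json_schema(text);
    }
    calls
}

fn main() {
    let native = |_: &str| Vec::<String>::new();
    let schema = |_: &str| vec!["get_weather".to_string()];
    println!("{:?}", parse_with_fallback("{\"city\":\"SF\"}", native, schema, true));
}
```

Unlike a plain `else if` chain, this keeps the configured parser's priority while restoring the constrained-mode fallback.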


// Use add_special_tokens=true so the tokenizer recognizes special tokens
// (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
// output as single token IDs rather than splitting them into characters.
let encoding = match tokenizer.encode(&processed_messages.text, true) {

P1 Badge Avoid enabling generic special-token injection for chat prompts

This now tokenizes already-rendered chat-template text with add_special_tokens=true. For HuggingFaceTokenizer, encode(..., true) is forwarded directly to the HF tokenizer post-processor, which can prepend/append BOS/EOS on top of template-provided markers. That changes prompt token IDs and can break stop boundaries or generation behavior for non-tiktoken models. The special-token recognition fix is tiktoken-specific, but this call affects all tokenizer backends (and is duplicated in the Messages preparation path).



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request improves support for Kimi K2.5 models by merging EOS token IDs from configuration files, implementing TypeScript-style tool declarations, and refining reasoning and tool call parsers. It also adds gRPC worker metrics collection and logic to strip leaked ChatML tokens. Review feedback identifies opportunities to reduce code duplication in token parsing and stripping, and suggests using more robust JSON deserialization to improve performance and reliability.

Comment on lines +51 to +97
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    // config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    // generation_config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    ids.into_iter().collect()
}

medium

The logic for parsing eos_token_id from config.json and generation_config.json is duplicated. To improve maintainability and reduce code duplication, extract this logic into a helper closure. Additionally, refactor the match statement to return the differing values and perform the insertion logic once to avoid redundancy.

pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    let mut extract_eos_ids = |content: &str| {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(content) {
            let extracted: Vec<u64> = match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => n.as_u64().into_iter().collect(),
                Some(serde_json::Value::Array(arr)) => arr.iter().filter_map(|v| v.as_u64()).collect(),
                _ => vec![],
            };
            for id in extracted {
                ids.insert(id as TokenIdType);
            }
        }
    };

    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        extract_eos_ids(&content);
    }

    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        extract_eos_ids(&content);
    }

    ids.into_iter().collect()
}
References
  1. Extract duplicated logic into a shared helper function to improve maintainability and reduce redundancy.
  2. Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.

Comment on lines +223 to +230
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
] {

medium

This list of tokens to strip is duplicated in other parts of the codebase. To avoid duplication and improve maintainability, consider defining this array as a constant at the module level. Furthermore, when stripping these markers from data that might contain structured formats like JSON, ensure you use whole-string matching instead of character-set matching to prevent data corruption.

References
  1. Extract duplicated logic into a shared helper function or constant to improve maintainability.
  2. When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.

Comment on lines +251 to +269
if processed_text.starts_with('{') {
    let mut depth = 0i32;
    let mut in_string = false;
    let mut escape = false;
    let mut json_end = None;
    for (i, ch) in processed_text.char_indices() {
        if escape { escape = false; continue; }
        if ch == '\\' && in_string { escape = true; continue; }
        if ch == '"' { in_string = !in_string; continue; }
        if in_string { continue; }
        if ch == '{' { depth += 1; } else if ch == '}' {
            depth -= 1;
            if depth == 0 { json_end = Some(i + 1); break; }
        }
    }
    if let Some(end) = json_end {
        processed_text = processed_text[..end].to_string();
    }
}

medium

This manual JSON parsing logic to find the end of the first object can be brittle. For a more robust solution, consider using serde_json::Deserializer to stream-parse the first valid JSON object. Additionally, avoid reallocating memory with .to_string() within processing loops; use .truncate() to modify the buffer in-place.

            if processed_text.starts_with('{') {
                let mut stream = serde_json::Deserializer::from_str(&processed_text).into_iter::<serde_json::Value>();
                if stream.next().is_some() {
                    let end = stream.byte_offset();
                    processed_text.truncate(end);
                }
            }
References
  1. When processing streaming data, avoid inefficient string manipulations like buffer = buffer[..].to_string() as it reallocates memory; use performant methods like .drain() or .truncate() instead.

Comment on lines +443 to +449
for token in [
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
] {
    delta = delta.replace(token, "");
}

medium

This list of tokens to strip is duplicated across multiple stream processing functions. To improve maintainability and consistency, define this array as a constant at the module level. Ensure that any stripping logic uses whole-string matching rather than character-set matching to avoid corrupting JSON data.

References
  1. Extract duplicated logic into a shared helper function or constant to improve maintainability.
  2. When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
bindings/python/src/smg/serve.py (1)

632-671: ⚠️ Potential issue | 🟡 Minor

Always run Prometheus cleanup, even when no worker was appended.

_launch_workers() creates self._prometheus_dir before the first self.workers.append(...), but Line 634 returns early when self.workers is empty. If the first launch fails, the temp dir is leaked and PROMETHEUS_MULTIPROC_DIR stays stale in-process.

Suggested fix
     def _cleanup_workers(self) -> None:
         """SIGTERM all worker process groups, wait, then SIGKILL stragglers."""
-        if not self.workers:
-            return
-
-        # Send SIGTERM to each process group
-        for proc, port in self.workers:
-            try:
-                os.killpg(proc.pid, signal.SIGTERM)
-            except (ProcessLookupError, OSError):
-                pass
-
-        # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit
-        deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT
-        for proc, port in self.workers:
-            remaining = max(0, deadline - time.monotonic())
-            try:
-                proc.wait(timeout=remaining)
-            except subprocess.TimeoutExpired:
-                try:
-                    os.killpg(proc.pid, signal.SIGKILL)
-                except (ProcessLookupError, OSError):
-                    pass
-
-        self._cleanup_prometheus_dir()
+        try:
+            if self.workers:
+                # Send SIGTERM to each process group
+                for proc, port in self.workers:
+                    try:
+                        os.killpg(proc.pid, signal.SIGTERM)
+                    except (ProcessLookupError, OSError):
+                        pass
+
+                # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit
+                deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT
+                for proc, port in self.workers:
+                    remaining = max(0, deadline - time.monotonic())
+                    try:
+                        proc.wait(timeout=remaining)
+                    except subprocess.TimeoutExpired:
+                        try:
+                            os.killpg(proc.pid, signal.SIGKILL)
+                        except (ProcessLookupError, OSError):
+                            pass
+        finally:
+            self._cleanup_prometheus_dir()
 
     def _cleanup_prometheus_dir(self) -> None:
         """Remove the temporary prometheus multiprocess directory and its .db files."""
         if self._prometheus_dir is None:
             return
         try:
             shutil.rmtree(self._prometheus_dir)
             logger.info("Cleaned up PROMETHEUS_MULTIPROC_DIR=%s", self._prometheus_dir)
         except OSError as e:
             logger.warning(
                 "Failed to clean up PROMETHEUS_MULTIPROC_DIR=%s: %s",
                 self._prometheus_dir,
                 e,
             )
+        os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None)
         self._prometheus_dir = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bindings/python/src/smg/serve.py` around lines 632 - 671, The early return in
_cleanup_workers when self.workers is empty skips calling
_cleanup_prometheus_dir and can leak the prometheus temp dir; ensure prometheus
cleanup always runs by removing or moving the early return and invoking
self._cleanup_prometheus_dir() (e.g., call self._cleanup_prometheus_dir() before
returning or eliminate the return and only skip process-group shutdown when
self.workers is empty) so that _cleanup_prometheus_dir() runs even if no workers
were appended.
model_gateway/src/routers/grpc/harmony/builder.rs (1)

876-885: ⚠️ Potential issue | 🟠 Major

Prefer ChatMessage::Tool.name before falling back to tool_call_id.

This path still reconstructs the function name solely from the assistant tool-call map. If a request contains a tool message with name populated but no matching prior assistant tool call, Harmony will emit functions.{tool_call_id} instead of the actual tool name, which defeats the purpose of the new field.

Suggested fix
-                ChatMessage::Tool {
-                    content,
-                    tool_call_id,
-                    ..
-                } => {
+                ChatMessage::Tool {
+                    content,
+                    tool_call_id,
+                    name,
+                } => {
                     // Look up the function name from the tool_call_id
-                    let function_name = tool_call_map
-                        .get(tool_call_id)
-                        .cloned()
-                        .unwrap_or_else(|| tool_call_id.clone());
+                    let function_name = name
+                        .clone()
+                        .or_else(|| tool_call_map.get(tool_call_id).cloned())
+                        .unwrap_or_else(|| tool_call_id.clone());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/builder.rs` around lines 876 - 885,
Update the ChatMessage::Tool match arm to prefer the message's name field before
falling back to the assistant tool-call map or the tool_call_id: destructure the
tool message to include name (e.g. ChatMessage::Tool { content, tool_call_id,
name, .. }) and compute function_name by using name.clone() when Some, otherwise
lookup tool_call_map.get(tool_call_id).cloned() and finally fall back to
tool_call_id.clone().
model_gateway/src/routers/grpc/regular/responses/conversions.rs (1)

86-117: ⚠️ Potential issue | 🟠 Major

Don't drop tool names during Responses → Chat conversion.

Both branches hardcode name: None, so the new ChatMessage::Tool.name field never reaches chat templates on this path. FunctionToolCall already has the tool name, and FunctionCallOutput can recover it with a small call_id -> name map while iterating items. As written, converted histories still lose the tool identity this PR is trying to preserve.

Also applies to: 137-147
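The two-pass mapping described above can be sketched as follows. This is a minimal sketch with simplified stand-in types: Item and resolve_tool_names are illustrative names, and the real ResponseInputOutputItem variants carry more fields than shown here.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Responses item variants.
enum Item {
    FunctionToolCall { call_id: String, name: String },
    FunctionCallOutput { call_id: String, output: String },
}

// First pass records call_id -> name from tool calls; second pass resolves
// each FunctionCallOutput's tool name, falling back to None when unknown.
fn resolve_tool_names(items: &[Item]) -> Vec<(String, Option<String>)> {
    let mut names: HashMap<&str, &str> = HashMap::new();
    for item in items {
        if let Item::FunctionToolCall { call_id, name } = item {
            names.insert(call_id.as_str(), name.as_str());
        }
    }
    items
        .iter()
        .filter_map(|item| match item {
            Item::FunctionCallOutput { call_id, output } => Some((
                output.clone(),
                names.get(call_id.as_str()).map(|n| n.to_string()),
            )),
            _ => None,
        })
        .collect()
}

fn main() {
    let items = vec![
        Item::FunctionToolCall { call_id: "c1".into(), name: "get-weather".into() },
        Item::FunctionCallOutput { call_id: "c1".into(), output: "sunny".into() },
        Item::FunctionCallOutput { call_id: "c9".into(), output: "?".into() },
    ];
    let resolved = resolve_tool_names(&items);
    // Matched output carries the tool name; orphaned output falls back to None.
    assert_eq!(resolved[0], ("sunny".to_string(), Some("get-weather".to_string())));
    assert_eq!(resolved[1].1, None);
}
```

The same map can be built once while iterating items, so the FunctionToolCall branch keeps using its local name directly and only FunctionCallOutput needs the lookup.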

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs` around lines
86 - 117, The conversion currently drops tool names by setting
ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall
use the existing local variable name to populate ChatMessage::Tool.name (e.g.,
name: Some(name.clone())) when pushing the Tool message, and for the
FunctionCallOutput branch build a temporary call_id -> name map while iterating
ResponseInputOutputItem items (recording id -> name from FunctionToolCall
entries) so that when you encounter FunctionCallOutput you can look up the tool
name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to
None if not found); update both places where ChatMessage::Tool.name is currently
hardcoded to None.
model_gateway/src/routers/grpc/harmony/stages/request_building.rs (1)

177-207: ⚠️ Potential issue | 🟡 Minor

Forward model EOS token IDs to TRT-LLM in Harmony /responses path.

The chat branch passes &eos_ids to build_generate_request_from_chat(), where they populate stop_token_ids with merged EOS from model config. The responses branch calls build_generate_request_from_responses() without eos_ids, leaving stop_token_ids empty (line 451 in trtllm_service.rs).

The later Harmony injection at line 253 only adds assistant-action stop tokens (harmony_stops), not the model's EOS tokens. Since build_generate_request_from_responses() does not inject EOS internally, the /responses path will miss the merged EOS handling applied in chat.

Pass &eos_ids to build_generate_request_from_responses() or apply the same EOS-merge logic used in the chat builder (lines 291–294 in trtllm_service.rs) to the responses builder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs` around
lines 177 - 207, The /responses path is missing model EOS token IDs because the
code builds eos_ids earlier but only passes them into
trtllm_client.build_generate_request_from_chat; update the responses branch
inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into
trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge
logic used by build_generate_request_from_chat) so stop_token_ids are populated
the same way for both RequestType::Chat and RequestType::Responses; locate the
eos_ids variable and the trtllm_client.build_generate_request_from_responses
call and add the eos_ids argument (or the equivalent merge) to match
build_generate_request_from_chat.
model_gateway/src/routers/grpc/client.rs (1)

271-280: ⚠️ Potential issue | 🟠 Major

EOS stop IDs are only wired through the chat builder.

This new parameter fixes TensorRT-LLM chat requests, but the sibling Messages path still has no way to forward merged EOS IDs and ultimately keeps stop_token_ids empty in TrtllmServiceClient::build_generate_request_from_messages. Anthropic Messages requests will therefore keep the old over-generation behavior. Please thread the same slice through build_messages_request as well.

Also applies to: 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/client.rs` around lines 271 - 280, The EOS
stop IDs are only passed into build_chat_request, so the messages path still
omits merged EOS IDs; update TrtllmServiceClient::build_messages_request (and
any callers like build_generate_request_from_messages) to accept the
eos_token_ids slice parameter (same type as in build_chat_request) and thread it
through to where stop_token_ids are set so stop_token_ids is populated for
Messages requests just like for Chat requests; adjust function signatures
(build_messages_request, build_generate_request_from_messages) and internal
calls to pass the eos_token_ids slice through unchanged.
crates/grpc_client/src/trtllm_service.rs (1)

291-308: ⚠️ Potential issue | 🟠 Major

Merge request stop-token IDs instead of replacing them.

stop_token_ids now comes only from tokenizer EOS IDs, so explicit ChatCompletionRequest.stop_token_ids are still ignored on the TRT-LLM chat path. That means backend-side stopping remains incomplete and extra tokens can be generated until router trimming kicks in.

Proposed fix
-        let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();
+        let mut stop_token_ids = body.stop_token_ids.clone().unwrap_or_default();
+        stop_token_ids.extend_from_slice(eos_token_ids);
+        stop_token_ids.sort_unstable();
+        stop_token_ids.dedup();
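The proposed merge behaves as below. merge_stop_ids is a hypothetical free function standing in for the inline logic, with the request's stop IDs modeled as a plain Option<Vec<u32>>:

```rust
// Merge request-provided stop token IDs with tokenizer EOS IDs,
// sorted and deduplicated so neither source is dropped.
fn merge_stop_ids(request_ids: Option<Vec<u32>>, eos_ids: &[u32]) -> Vec<u32> {
    let mut stop_token_ids = request_ids.unwrap_or_default();
    stop_token_ids.extend_from_slice(eos_ids);
    stop_token_ids.sort_unstable();
    stop_token_ids.dedup();
    stop_token_ids
}

fn main() {
    // Request asks to stop at 42; Kimi-K2.5-style EOS IDs overlap with it.
    let merged = merge_stop_ids(Some(vec![42, 163586]), &[163585, 163586]);
    assert_eq!(merged, vec![42, 163585, 163586]);
    // No explicit request stops: only the tokenizer EOS IDs remain.
    assert_eq!(merge_stop_ids(None, &[7]), vec![7]);
}
```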
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/grpc_client/src/trtllm_service.rs` around lines 291 - 308, The code
currently sets stop_token_ids = eos_token_ids.to_vec() which ignores any
stop_token_ids passed in the ChatCompletionRequest for the TRT-LLM path; change
the construction of stop_token_ids (used in proto::GenerateRequest) to merge
tokenizer EOS IDs with the request's explicit stop token IDs (e.g.,
body.stop_token_ids or ChatCompletionRequest.stop_token_ids), convert/normalize
them to u32, remove duplicates (preserve a sensible order such as
request-provided then EOS or vice-versa), and use that merged Vec<u32> in the
grpc_request so backend-side stopping honors both sources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 45-97: load_eos_token_ids currently builds an empty set when
neither config.json nor generation_config.json declare numeric eos_token_id,
causing callers (e.g., assignments to special_tokens.eos_token_ids in load code
and in crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS
IDs resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).

In `@model_gateway/src/core/worker_manager.rs`:
- Around line 89-116: collect_prometheus_multiproc_metrics currently invokes a
Python subprocess that imports prometheus_client but that package isn't declared
in the smg/python bindings dependencies and the subprocess has no timeout; add
"prometheus_client" to the core Python deps in bindings/python/pyproject.toml
(so the ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.

In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 221-234: The non-streaming cleanup in processor.rs currently only
strips <|im_*|> tokens inside the block guarded by
self.configured_tool_parser/is_some but misses other leaked tokens (</think>,
[EOS], [BOS]) that the streaming path removes; create a single sanitizer
function (e.g., strip_leaked_tokens or sanitize_leaked_tokens) on the impl (or a
private helper) that accepts &str and returns a trimmed String, move the full
token list and replacement logic into it, and call this helper from the existing
processed_text cleanup block and from the Messages block referenced in the
review so both streaming and non-streaming paths reuse the same sanitizer (refer
to self.configured_tool_parser, self.configured_reasoning_parser,
processed_text, and the Messages handling code).
- Around line 710-720: The configured parser branch currently runs before the
constrained-output (used_json_schema) handling, causing configured parsers to
bypass the JSON-schema -> ToolUse fallback; update the logic in processor.rs
around configured_tool_parser and parse_tool_calls so it either (a) skip
invoking the configured parser when used_json_schema is true (matching
process_single_choice behavior), or (b) after parse_tool_calls returns no
tool_calls, check if used_json_schema is true and then run the JSON-schema
wrapper / create a ToolUse block from the arguments-only JSON; ensure you
reference and adjust the conditionals around self.configured_tool_parser,
tool_parser_available, parse_tool_calls, used_json_schema, and the code path
that constructs ToolUse so the constrained-output fallback always runs when
appropriate.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 130-134: The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.

In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 438-450: The scrubber that removes ChatML/think/EOS tokens from a
chunked delta is only applied inside the per-chunk path; extract that logic into
a shared helper (e.g., fn scrub_delta(input: &str, configured_tool_parser:
Option<...>, configured_reasoning_parser: Option<...>) -> String) and call it
both where `delta` is currently sanitized and on the final
`stop_decoder.flush()` output before emitting the last SSE delta; reference the
existing symbols `configured_tool_parser`, `configured_reasoning_parser`,
`delta`, and `stop_decoder.flush()` so the new helper matches the current
conditional behavior and is reused in the flush path.
- Around line 1805-1807: The condition allows ToolChoice::Tool to fall into the
generic parser branch even when streaming_tool_parser (the JSON/constrained
parser) was selected, causing parameters with name: None to emit InputJsonDelta
before ContentBlockStart; change the conditional logic around ToolChoice::Tool
so it always follows the arguments-only branch (i.e., keep ToolChoice::Tool out
of the generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 547-562: The current loop in chat_utils.rs replaces control tokens
globally in processed_text (using clean.replace(...)), which can corrupt valid
JSON string fields; instead trim only leaked prefix/suffix markers from the
start and end of processed_text. Locate the block using the variables
processed_text and clean and replace the global replace() calls with logic that
repeatedly strip known tokens from the start (strip_prefix) and end
(strip_suffix) or use anchored regexes so only boundary occurrences like
prefix/suffix markers (e.g., "<|im_start|>", "<|im_end|>", "</think>", "[EOS]",
etc.) are removed; preserve any occurrences of those tokens inside the JSON
payload. Ensure the final clean is trimmed of whitespace as before.
- Around line 411-420: The generated TypeScript snippet inserted as
"tools_ts_str" via tools_to_typescript currently emits raw OpenAI tool names
(from request.tools) which may contain hyphens and produce invalid TypeScript
identifiers; update tools_to_typescript (and any generation logic that builds
tools_ts_str used when inserting into combined_template_kwargs) to either
generate quoted/string-literal type names (e.g., "'get-weather'" or
`"\"get-weather\""` in the TS output) or sanitize names into valid
BindingIdentifiers (e.g., replace/normalize hyphens to underscores) and
consistently use that sanitized form in all places it’s referenced; ensure the
change is applied across the related generator code paths (including the
tools_to_typescript function and where tools_ts_str is inserted) so templates
referencing tool names remain syntactically valid.

In `@model_gateway/tests/spec/chat_message.rs`:
- Around line 66-70: The test's pattern match on ChatMessage::Tool was loosened
but doesn't assert the new tool.name field; update the test in
model_gateway/tests/spec/chat_message.rs to deserialize a payload that includes
"name", assert that ChatMessage::Tool contains tool.name equal to the provided
value, and also serialize it back (round-trip) to ensure the wire contract;
locate the match arm for ChatMessage::Tool and add assertions against the
tool.name field and a serialization check to confirm round-trip fidelity.

---

Outside diff comments:
In `@bindings/python/src/smg/serve.py`:
- Around line 632-671: The early return in _cleanup_workers when self.workers is
empty skips calling _cleanup_prometheus_dir and can leak the prometheus temp
dir; ensure prometheus cleanup always runs by removing or moving the early
return and invoking self._cleanup_prometheus_dir() (e.g., call
self._cleanup_prometheus_dir() before returning or eliminate the return and only
skip process-group shutdown when self.workers is empty) so that
_cleanup_prometheus_dir() runs even if no workers were appended.

In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 291-308: The code currently sets stop_token_ids =
eos_token_ids.to_vec() which ignores any stop_token_ids passed in the
ChatCompletionRequest for the TRT-LLM path; change the construction of
stop_token_ids (used in proto::GenerateRequest) to merge tokenizer EOS IDs with
the request's explicit stop token IDs (e.g., body.stop_token_ids or
ChatCompletionRequest.stop_token_ids), convert/normalize them to u32, remove
duplicates (preserve a sensible order such as request-provided then EOS or
vice-versa), and use that merged Vec<u32> in the grpc_request so backend-side
stopping honors both sources.

In `@model_gateway/src/routers/grpc/client.rs`:
- Around line 271-280: The EOS stop IDs are only passed into build_chat_request,
so the messages path still omits merged EOS IDs; update
TrtllmServiceClient::build_messages_request (and any callers like
build_generate_request_from_messages) to accept the eos_token_ids slice
parameter (same type as in build_chat_request) and thread it through to where
stop_token_ids are set so stop_token_ids is populated for Messages requests just
like for Chat requests; adjust function signatures (build_messages_request,
build_generate_request_from_messages) and internal calls to pass the
eos_token_ids slice through unchanged.

In `@model_gateway/src/routers/grpc/harmony/builder.rs`:
- Around line 876-885: Update the ChatMessage::Tool match arm to prefer the
message's name field before falling back to the assistant tool-call map or the
tool_call_id: destructure the tool message to include name (e.g.
ChatMessage::Tool { content, tool_call_id, name, .. }) and compute function_name
by using name.clone() when Some, otherwise lookup
tool_call_map.get(tool_call_id).cloned() and finally fall back to
tool_call_id.clone().

In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs`:
- Around line 177-207: The /responses path is missing model EOS token IDs
because the code builds eos_ids earlier but only passes them into
trtllm_client.build_generate_request_from_chat; update the responses branch
inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into
trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge
logic used by build_generate_request_from_chat) so stop_token_ids are populated
the same way for both RequestType::Chat and RequestType::Responses; locate the
eos_ids variable and the trtllm_client.build_generate_request_from_responses
call and add the eos_ids argument (or the equivalent merge) to match
build_generate_request_from_chat.

In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs`:
- Around line 86-117: The conversion currently drops tool names by setting
ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall
use the existing local variable name to populate ChatMessage::Tool.name (e.g.,
name: Some(name.clone())) when pushing the Tool message, and for the
FunctionCallOutput branch build a temporary call_id -> name map while iterating
ResponseInputOutputItem items (recording id -> name from FunctionToolCall
entries) so that when you encounter FunctionCallOutput you can look up the tool
name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to
None if not found); update both places where ChatMessage::Tool.name is currently
hardcoded to None.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d25c587a-cd42-47e9-8f11-7f2ef4b5d3d1

📥 Commits

Reviewing files that changed from the base of the PR and between 9516b44 and 46b284f.

📒 Files selected for processing (22)
  • bindings/python/src/smg/serve.py
  • crates/grpc_client/src/trtllm_service.rs
  • crates/protocols/src/chat.rs
  • crates/reasoning_parser/src/parsers/base.rs
  • crates/tokenizer/src/huggingface.rs
  • crates/tokenizer/src/mock.rs
  • crates/tokenizer/src/tiktoken.rs
  • crates/tokenizer/src/traits.rs
  • crates/tool_parser/src/parsers/kimik2.rs
  • model_gateway/src/core/worker_manager.rs
  • model_gateway/src/observability/metrics_server.rs
  • model_gateway/src/routers/grpc/client.rs
  • model_gateway/src/routers/grpc/harmony/builder.rs
  • model_gateway/src/routers/grpc/harmony/stages/request_building.rs
  • model_gateway/src/routers/grpc/regular/processor.rs
  • model_gateway/src/routers/grpc/regular/responses/conversions.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
  • model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
  • model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
  • model_gateway/src/routers/grpc/regular/streaming.rs
  • model_gateway/src/routers/grpc/utils/chat_utils.rs
  • model_gateway/tests/spec/chat_message.rs

Comment on lines +45 to +97
/// Load merged EOS token IDs from config.json and generation_config.json.
///
/// Models may define different EOS tokens in each file (e.g. Kimi-K2.5 has
/// `[EOS]` (163585) in config.json and `<|im_end|>` (163586) in
/// generation_config.json). We merge both into a deduplicated list so the
/// engine can stop at any of them — matching SGLang's behaviour.
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    // config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    // generation_config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    ids.into_iter().collect()
}

⚠️ Potential issue | 🟠 Major

Preserve tokenizer-derived EOS IDs when the JSON files don't declare them.

load_eos_token_ids() returns an empty vector if neither file has a numeric eos_token_id. Both this file and crates/tokenizer/src/huggingface.rs then assign that result straight into special_tokens.eos_token_ids, which wipes out EOS information even when special_tokens.eos_token was already resolved from tokenizer_config.json or vocab matching. For those models, TRT-LLM will receive no backend stop-token IDs at all. Seed the set from the resolved EOS token ID and then merge file-based IDs, rather than replacing everything with the file-only list.
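One way to implement the seeding, sketched as a standalone function; resolved_eos is a hypothetical parameter standing in for the EOS ID already resolved from tokenizer_config.json or vocab matching, and file_ids stands in for the merged config.json/generation_config.json result:

```rust
use std::collections::BTreeSet;

type TokenIdType = u32;

// Seed the set with the already-resolved EOS ID, then merge file-derived
// IDs on top, so the JSON files augment rather than replace it.
fn merge_eos_ids(resolved_eos: Option<TokenIdType>, file_ids: &[TokenIdType]) -> Vec<TokenIdType> {
    let mut ids = BTreeSet::new();
    if let Some(id) = resolved_eos {
        ids.insert(id);
    }
    ids.extend(file_ids.iter().copied());
    ids.into_iter().collect()
}

fn main() {
    // Files declare nothing: the tokenizer-derived EOS survives.
    assert_eq!(merge_eos_ids(Some(163586), &[]), vec![163586]);
    // Files add an extra EOS: both are kept, deduplicated and sorted.
    assert_eq!(merge_eos_ids(Some(163586), &[163585, 163586]), vec![163585, 163586]);
}
```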

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokenizer/src/tiktoken.rs` around lines 45 - 97, load_eos_token_ids
currently builds an empty set when neither config.json nor
generation_config.json declare numeric eos_token_id, causing callers (e.g.,
assignments to special_tokens.eos_token_ids in load code and in
crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS IDs
resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).

Comment on lines +89 to +116
async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
    let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
        "PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
    })?;

    let output = tokio::process::Command::new("python3")
        .args([
            "-c",
            "import sys\n\
             from prometheus_client import CollectorRegistry, generate_latest\n\
             from prometheus_client.multiprocess import MultiProcessCollector\n\
             registry = CollectorRegistry()\n\
             MultiProcessCollector(registry)\n\
             sys.stdout.buffer.write(generate_latest(registry))\n",
        ])
        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
        .output()
        .await
        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(format!("python3 prometheus collector failed: {stderr}"));
    }

    String::from_utf8(output.stdout)
        .map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Call sites =="
rg -n -C2 'collect_prometheus_multiproc_metrics|PROMETHEUS_MULTIPROC_DIR|prometheus_client|python3' .

echo
echo "== Candidate runtime / packaging files =="
fd -HI 'Dockerfile|dockerfile|requirements.*|pyproject.toml|uv.lock|poetry.lock|Pipfile|setup.py|Cargo.toml|compose.*' . \
  | while read -r f; do
      echo
      echo "--- $f"
      rg -n -C2 'python3|prometheus_client|prometheus-client|PROMETHEUS_MULTIPROC_DIR' "$f" || true
    done

Repository: lightseekorg/smg

Length of output: 29259


🏁 Script executed:

# Check the full Dockerfile to see base image and any pip installs
cat -n ./docker/Dockerfile

# Also check if there are any hidden requirements files or setup files referenced
fd -HI 'requirements|setup.py|pyproject.toml' ./docker/

Repository: lightseekorg/smg

Length of output: 3219


🏁 Script executed:

# Check the Python bindings setup/dependency files
cat -n ./bindings/python/pyproject.toml
echo "---"
cat -n ./bindings/python/setup.py

Repository: lightseekorg/smg

Length of output: 2793


🏁 Script executed:

# Search for any other references to prometheus_client installation or imports
rg -i 'prometheus.client|prometheus-client' \
  ./bindings/python/ ./docker/ \
  --type py --type txt --type toml --type yaml

# Check if it's supposed to come from worker installations
rg -i 'prometheus.client' ./e2e_test/ ./scripts/ \
  --type py --type sh --type yaml

# Check model_gateway Cargo.toml for any Python-related deps
rg -A5 -B5 'prometheus|python' ./model_gateway/Cargo.toml

Repository: lightseekorg/smg

Length of output: 602


🏁 Script executed:

# Check if prometheus_client might come from worker dependencies or is conditionally installed
rg -i 'prometheus.client|prometheus-client' . --type py --type toml --type txt \
  | grep -v node_modules | grep -v '.git' | head -20

# Also check if there's any grpc worker setup that might provide it
rg -B10 -A10 'PROMETHEUS_MULTIPROC_DIR' ./bindings/python/src/smg/serve.py

Repository: lightseekorg/smg

Length of output: 1906


Add prometheus_client to gateway dependencies and wrap the gRPC metrics subprocess with a timeout.

The gateway attempts to call python3 -c "from prometheus_client import ..." but prometheus_client is not listed in the smg package dependencies (bindings/python/pyproject.toml:29-34). Every gRPC metrics scrape will fail with a ModuleNotFoundError. Additionally, the subprocess call has no timeout, creating a hang risk. Add prometheus_client as a core dependency and wrap the .output().await with tokio::time::timeout() before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 89 - 116,
collect_prometheus_multiproc_metrics currently invokes a Python subprocess that
imports prometheus_client but that package isn't declared in the smg/python
bindings dependencies and the subprocess has no timeout; add "prometheus_client"
to the core Python deps in bindings/python/pyproject.toml (so the
ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.

Comment on lines +221 to +234
        // Strip leaked chatml tokens only when a model-specific parser is configured
        if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
            for token in [
                "<|im_end|>",
                "<|im_start|>",
                "<|im_user|>",
                "<|im_assistant|>",
                "<|im_system|>",
                "<|im_middle|>",
            ] {
                processed_text = processed_text.replace(token, "");
            }
            processed_text = processed_text.trim().to_string();
        }

⚠️ Potential issue | 🟠 Major

Non-streaming cleanup still misses part of the leaked-token set.

The streaming path now strips </think>, [EOS], and [BOS], but this non-streaming sanitizer only removes <|im_*|>. That means the final chat response can still return the exact markers this PR is trying to scrub. Please centralize the sanitizer and reuse it here and in the Messages block below.
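A sketch of such a helper as a free function; the real version would be a method on the processor impl, and the token list below simply combines the two existing paths:

```rust
// Single sanitizer covering both the <|im_*|> markers stripped here and
// the extra tokens (</think>, [EOS], [BOS]) the streaming path removes.
fn strip_leaked_tokens(text: &str) -> String {
    const LEAKED: [&str; 9] = [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
        "</think>",
        "[EOS]",
        "[BOS]",
    ];
    let mut clean = text.to_string();
    for token in LEAKED {
        clean = clean.replace(token, "");
    }
    clean.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("hello<|im_end|>[EOS]"), "hello");
    assert_eq!(strip_leaked_tokens("</think> answer "), "answer");
}
```

Note this replaces globally; if the payload may legitimately contain these markers inside JSON strings, the boundary-anchored strip_prefix/strip_suffix approach suggested for chat_utils.rs elsewhere in this review is the safer variant.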

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 221 - 234,
The non-streaming cleanup in processor.rs currently only strips <|im_*|> tokens
inside the block guarded by self.configured_tool_parser/is_some but misses other
leaked tokens (</think>, [EOS], [BOS]) that the streaming path removes; create a
single sanitizer function (e.g., strip_leaked_tokens or sanitize_leaked_tokens)
on the impl (or a private helper) that accepts &str and returns a trimmed
String, move the full token list and replacement logic into it, and call this
helper from the existing processed_text cleanup block and from the Messages
block referenced in the review so both streaming and non-streaming paths reuse
the same sanitizer (refer to self.configured_tool_parser,
self.configured_reasoning_parser, processed_text, and the Messages handling
code).
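The centralized sanitizer suggested above could look like this sketch (the name `strip_leaked_tokens` comes from the prompt's suggestion, not from the actual code; in the real crate it would be a private method on the processor impl so both the streaming and non-streaming paths call it):

```rust
// Single source of truth for the leaked-token set, covering both the
// <|im_*|> markers and the </think>/[EOS]/[BOS] tokens the streaming
// path already strips.
const LEAKED_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn strip_leaked_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for token in LEAKED_TOKENS {
        out = out.replace(token, "");
    }
    out.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("answer</think><|im_end|>"), "answer");
    // Already-clean text passes through unchanged.
    assert_eq!(strip_leaked_tokens("plain text"), "plain text");
    println!("sanitizer sketch ok");
}
```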

Comment on lines +710 to +720
if self.configured_tool_parser.is_some() && tool_parser_available {
    (tool_calls, processed_text) = self
        .parse_tool_calls(
            &processed_text,
            &messages_request.model,
            utils::message_utils::get_history_tool_calls_count_messages(
                &messages_request,
            ),
        )
        .await;
} else if used_json_schema {

⚠️ Potential issue | 🟠 Major

Configured parsers currently bypass the constrained-output fallback in Messages.

This if / else if ordering differs from process_single_choice. For messages::ToolChoice::Tool / Any, used_json_schema is already true, but a configured parser takes this branch first. If it returns no native tool calls, the JSON-schema wrapper path is skipped entirely and the arguments-only JSON is left as plain text instead of becoming a ToolUse block. Make this match the chat path by skipping the configured parser for constrained output, or by using an independent if tool_calls.is_none() && used_json_schema fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 710 - 720,
The configured parser branch currently runs before the constrained-output
(used_json_schema) handling, causing configured parsers to bypass the
JSON-schema -> ToolUse fallback; update the logic in processor.rs around
configured_tool_parser and parse_tool_calls so it either (a) skip invoking the
configured parser when used_json_schema is true (matching process_single_choice
behavior), or (b) after parse_tool_calls returns no tool_calls, check if
used_json_schema is true and then run the JSON-schema wrapper / create a ToolUse
block from the arguments-only JSON; ensure you reference and adjust the
conditionals around self.configured_tool_parser, tool_parser_available,
parse_tool_calls, used_json_schema, and the code path that constructs ToolUse so
the constrained-output fallback always runs when appropriate.
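Option (b) from the prompt above, as an illustrative-only sketch (all names and the string payloads are hypothetical; the real code works with parser structs, not strings): decouple the JSON-schema fallback from the parser branch so constrained output is always wrapped when the configured parser finds nothing.

```rust
// Hypothetical decision logic: the configured parser may run first, but the
// JSON-schema wrapper is an independent fallback, not an `else if` arm.
fn resolve_tool_calls(
    parser_result: Option<Vec<String>>,
    used_json_schema: bool,
    raw_text: &str,
) -> Option<Vec<String>> {
    let mut tool_calls = parser_result;
    // Independent fallback: runs even when a configured parser was tried first.
    if tool_calls.is_none() && used_json_schema {
        tool_calls = Some(vec![format!("wrapped:{raw_text}")]);
    }
    tool_calls
}

fn main() {
    // Configured parser found no native calls, but output was constrained:
    // the arguments-only JSON still becomes a tool-call payload.
    assert_eq!(
        resolve_tool_calls(None, true, r#"{"query":"weather"}"#),
        Some(vec![r#"wrapped:{"query":"weather"}"#.to_string()])
    );
    // Parser already produced calls: the fallback is a no-op.
    assert_eq!(
        resolve_tool_calls(Some(vec!["native".into()]), true, "ignored"),
        Some(vec!["native".to_string()])
    );
    println!("fallback ordering sketch ok");
}
```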

Comment on lines +130 to +134
// Step 3: Tokenize the processed text.
// Use add_special_tokens=true so the tokenizer recognizes special tokens
// (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
// output as single token IDs rather than splitting them into characters.
let encoding = match tokenizer.encode(&processed_messages.text, true) {

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed (repository: lightseekorg/smg; per-script outputs collapsed):

# First, let's explore the file structure and understand the context
find . -type f -name "preparation.rs" | head -20

# Check the full content of the preparation.rs file to understand the context
wc -l model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

# Read the relevant section and surrounding code for context
sed -n '115,150p' model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

# Find the process_chat_messages function to understand what processed_messages.text contains
rg "fn process_chat_messages|process_chat_messages\(" -A 10 -B 2

# Look for the Messages API path mentioned in the review to understand the parallel issue
rg "Messages API|messages.*special.*token" -i

# Read the process_chat_messages function implementation
sed -n '1,100p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Continue reading chat_utils.rs
wc -l model_gateway/src/routers/grpc/utils/chat_utils.rs
sed -n '1,200p' model_gateway/src/routers/grpc/utils/chat_utils.rs | tail -100

# Check the Messages API preparation file to see how it handles tokenization
sed -n '1,200p' model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

# Look for where add_special_tokens is used in the Messages API path
rg "add_special_tokens|encode.*true|encode.*false" -B 3 -A 3

# Find and read the complete process_chat_messages function
sed -n '275,450p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Look for template handling and escaping
rg "apply_chat_template|render|escape" -A 5 -B 2 | head -200

# Check what the ProcessedMessages struct contains and how the text is built
rg "struct ProcessedMessages|ProcessedMessages {" -A 10

# Search for ProcessedMessages struct definition
rg "struct ProcessedMessages" -A 15

# Find where apply_chat_template is called - the key part
rg "apply_chat_template" -B 5 -A 10 | head -100

# Check if there's any escaping of user content before the template
sed -n '400,550p' model_gateway/src/routers/grpc/utils/chat_utils.rs

# Check what the comment in messages/preparation.rs says about skip_special_tokens
grep -B 5 -A 5 "skip_special_tokens" model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

# Check tokenizer.rs to understand apply_chat_template behavior
find . -name "*.rs" -path "*/crates/tokenizer/*" -type f | head -10

# Search for any documentation or comments about template escaping
rg "escape|sanitiz|safe" model_gateway/src/routers/grpc/utils/chat_utils.rs -i

# Look at what happens with add_special_tokens in the embedding stage for comparison
sed -n '1,150p' model_gateway/src/routers/grpc/regular/stages/embedding/preparation.rs

# Look for how chat template handles special tokens - check the chat_template implementation
cat crates/tokenizer/src/chat_template.rs | head -200

# Check for any escaping in the tokenizer's template application
rg "apply_chat_template" -B 5 -A 15 crates/tokenizer/src/ | head -150

# Look for test cases showing template + user content interaction
find . -name "*test*.rs" -o -name "*tests.rs" | xargs rg "apply_chat_template|special.*token" | head -100

# Check if minijinja auto-escapes content or if there's explicit escaping
rg "escape|autoescape" crates/tokenizer/src/chat_template.rs

# Verify the template rendering doesn't have HTML/special-char escaping
sed -n '600,900p' crates/tokenizer/src/chat_template.rs

# Check if there are any tests showing user content with special tokens
rg "<\\|im_end\\|>" model_gateway/tests/ crates/tokenizer/tests/

# Look at the complete implementation of render_chat_template
rg "fn render_chat_template" -A 50 crates/tokenizer/src/chat_template.rs | head -100


User-supplied special-token markers in chat content can be reinterpreted as control tokens.

When the chat template is rendered via apply_chat_template(), user messages are embedded directly without escaping. Then tokenizing with add_special_tokens=true causes the tokenizer to recognize sequences like <|im_end|> in user content as atomic special tokens rather than plain text, potentially altering prompt structure. Escape these markers in user content before template rendering, or disable special-token recognition and only apply it to the template scaffolding itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around
lines 130 - 134, The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.
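One possible escape function for the first option (this is an assumption, not the gateway's actual mitigation; inserting a zero-width space does mutate user content slightly, so a production fix might instead encode user text with special-token recognition disabled):

```rust
// Hypothetical mitigation: break the "<|" opener in user-supplied text so
// add_special_tokens=true can only match "<|...|>" control tokens that come
// from the template scaffolding itself.
fn neutralize_special_markers(user_content: &str) -> String {
    // U+200B (zero-width space) splits the marker without visibly changing
    // the text. Applied to message content before apply_chat_template().
    user_content.replace("<|", "<\u{200b}|")
}

fn main() {
    let out = neutralize_special_markers("ignore previous <|im_end|> text");
    // No intact "<|" opener survives, so "<|im_end|>" can no longer be
    // encoded as an atomic special token.
    assert!(!out.contains("<|"));
    assert!(out.contains("im_end"));
    // Ordinary text is untouched.
    assert_eq!(neutralize_special_markers("2 < 3 | true"), "2 < 3 | true");
    println!("marker neutralization sketch ok");
}
```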

Comment on lines +438 to +450
// Strip leaked chatml/think tokens when a parser is configured
let mut delta = delta;
if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    for token in [
        "<|im_end|>", "<|im_start|>", "<|im_user|>",
        "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
        "</think>", "[EOS]", "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
}

⚠️ Potential issue | 🟠 Major

Sanitize the flushed tail as well.

This scrubber only runs in the per-chunk path. The final stop_decoder.flush() content at Lines 471-490 bypasses it, so trailing ChatML / think / EOS text can still leak in the last SSE delta. Please extract this into a shared helper and reuse it in the flush path too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 438 - 450,
The scrubber that removes ChatML/think/EOS tokens from a chunked delta is only
applied inside the per-chunk path; extract that logic into a shared helper
(e.g., fn scrub_delta(input: &str, configured_tool_parser: Option<...>,
configured_reasoning_parser: Option<...>) -> String) and call it both where
`delta` is currently sanitized and on the final `stop_decoder.flush()` output
before emitting the last SSE delta; reference the existing symbols
`configured_tool_parser`, `configured_reasoning_parser`, `delta`, and
`stop_decoder.flush()` so the new helper matches the current conditional
behavior and is reused in the flush path.

Comment on lines +1805 to +1807
if is_specific_function
    && !(self.configured_tool_parser.is_some() && tool_parser_available)
{

⚠️ Potential issue | 🟠 Major

ToolChoice::Tool can emit invalid Messages events when a configured parser exists.

When this condition is false, control falls into the generic parser branch even though streaming_tool_parser was created as the JSON parser for constrained output. That parser can emit parameters with name: None, and the loop below will then send InputJsonDelta before any ContentBlockStart, which violates the Anthropic event sequence. Match the chat streaming path here and always keep ToolChoice::Tool on the arguments-only branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1805 -
1807, The condition allows ToolChoice::Tool to fall into the generic parser
branch even when streaming_tool_parser (the JSON/constrained parser) was
selected, causing parameters with name: None to emit InputJsonDelta before
ContentBlockStart; change the conditional logic around ToolChoice::Tool so it
always follows the arguments-only branch (i.e., keep ToolChoice::Tool out of the
generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.

Comment on lines +411 to +420
// Generate TypeScript-style tool declarations for models that use it
// (e.g., Kimi K2.5 chat template checks for tools_ts_str)
if let Some(ref tools) = request.tools {
    if !tools.is_empty() {
        let ts_str = tools_to_typescript(tools);
        if !ts_str.is_empty() {
            combined_template_kwargs.insert(
                "tools_ts_str".to_string(),
                Value::String(ts_str),
            );

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Do TypeScript type aliases and object property declarations allow unquoted identifiers containing hyphens or spaces, and what characters are allowed in OpenAI function/tool names?

💡 Result:

TypeScript: type aliases + object/type property names

  • Type alias names must be a BindingIdentifier (type BindingIdentifier ... = ...), so they cannot contain spaces or hyphens unquoted. (sources.debian.org)

    • Example (invalid): type user-id = string / type "user id" = string
  • Object/type property names (in object literals, interfaces, and type literals) can be:

    • an IdentifierName (unquoted), or
    • a string literal (quoted), or
    • a numeric literal, or
    • a computed property name. (sources.debian.org)

    Because hyphens and spaces aren’t valid in ECMAScript IdentifierName, you must quote such property keys. (sources.debian.org)

    • Valid:
      • type T = { "user-id": string; "user id": string }
      • const o = { "user-id": 1, "user id": 2 }
    • Invalid (unquoted):
      • type T = { user-id: string } (parsed like user - id)
      • type T = { user id: string }

OpenAI function/tool names: allowed characters

OpenAI function tool name must be max 64 chars and contain only:

  • A–Z, a–z, 0–9, _ (underscore), - (dash)
    Spaces (and characters like .) are not allowed. (platform.openai.com)


Quote or sanitize function names in tools_ts_str.

This generator emits raw OpenAI tool names directly into TypeScript type declarations (e.g., type {func.name} = ...). OpenAI tool names can include hyphens (e.g., get-weather), but TypeScript type aliases require valid BindingIdentifiers and reject unquoted hyphens as syntax errors. The generated tools_ts_str will be syntactically invalid for any tool with a hyphenated name, breaking templates that rely on it.

Also applies to: 943–1121

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 411 - 420,
The generated TypeScript snippet inserted as "tools_ts_str" via
tools_to_typescript currently emits raw OpenAI tool names (from request.tools)
which may contain hyphens and produce invalid TypeScript identifiers; update
tools_to_typescript (and any generation logic that builds tools_ts_str used when
inserting into combined_template_kwargs) to either generate
quoted/string-literal type names (e.g., "'get-weather'" or `"\"get-weather\""`
in the TS output) or sanitize names into valid BindingIdentifiers (e.g.,
replace/normalize hyphens to underscores) and consistently use that sanitized
form in all places it’s referenced; ensure the change is applied across the
related generator code paths (including the tools_to_typescript function and
where tools_ts_str is inserted) so templates referencing tool names remain
syntactically valid.
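The two options from the review, sketched as hypothetical helpers (names are illustrative, not the actual `tools_to_typescript` internals): quote names emitted as TS property keys, and sanitize names used in type-alias position, where quoting is not legal TypeScript.

```rust
// Quote a tool name for use as a TS object/property key; string-literal
// keys like { "get-weather": ... } are valid TypeScript.
fn ts_quoted_key(name: &str) -> String {
    let is_ident = !name.is_empty()
        && !name.chars().next().unwrap().is_ascii_digit()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$');
    if is_ident {
        name.to_string()
    } else {
        format!("\"{}\"", name.replace('\\', "\\\\").replace('"', "\\\""))
    }
}

// Type aliases need a BindingIdentifier, so map invalid characters to '_'.
// (OpenAI names are limited to [A-Za-z0-9_-], so '-' is the main offender;
// a leading digit would still need extra handling.)
fn ts_binding_ident(name: &str) -> String {
    name.chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' || c == '$' { c } else { '_' })
        .collect()
}

fn main() {
    assert_eq!(ts_quoted_key("get_weather"), "get_weather");
    assert_eq!(ts_quoted_key("get-weather"), "\"get-weather\"");
    assert_eq!(ts_binding_ident("get-weather"), "get_weather");
    println!("tool-name sanitization sketch ok");
}
```

Whichever form is chosen must be used consistently everywhere the template references the tool name, or the declaration and the call-site lookup will disagree.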

Comment on lines +547 to +562
// Strip chatml / special tokens that may trail the constrained JSON output
let mut clean = processed_text.to_string();
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
    "</think>",
    "[EOS]",
    "[BOS]",
] {
    clean = clean.replace(token, "");
}
let clean = clean.trim();

⚠️ Potential issue | 🟠 Major

Don't replace() control tokens inside JSON payloads.

This removes every occurrence of markers like </think> and [EOS], not just leaked prefix/suffix tokens. If a valid JSON string field contains one of those literals, the payload is mutated before parsing and the tool arguments/content change silently. Please trim these markers only at the boundaries of processed_text.

Also applies to: 567-631

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 547 - 562,
The current loop in chat_utils.rs replaces control tokens globally in
processed_text (using clean.replace(...)), which can corrupt valid JSON string
fields; instead trim only leaked prefix/suffix markers from the start and end of
processed_text. Locate the block using the variables processed_text and clean
and replace the global replace() calls with logic that repeatedly strip known
tokens from the start (strip_prefix) and end (strip_suffix) or use anchored
regexes so only boundary occurrences like prefix/suffix markers (e.g.,
"<|im_start|>", "<|im_end|>", "</think>", "[EOS]", etc.) are removed; preserve
any occurrences of those tokens inside the JSON payload. Ensure the final clean
is trimmed of whitespace as before.
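A boundary-aware sketch of that fix (the helper name is illustrative): repeatedly strip known markers and whitespace from the ends only, so an occurrence inside a JSON string value survives.

```rust
const BOUNDARY_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

// Strip leaked markers only at the boundaries of the text; never mutate
// the interior, which may be a valid JSON payload.
fn trim_boundary_tokens(text: &str) -> &str {
    let mut s = text.trim();
    loop {
        let before = s;
        for token in BOUNDARY_TOKENS {
            s = s.strip_prefix(token).unwrap_or(s).trim_start();
            s = s.strip_suffix(token).unwrap_or(s).trim_end();
        }
        // Fixed point: no marker was removed this pass.
        if s == before {
            return s;
        }
    }
}

fn main() {
    // Trailing marker is removed...
    assert_eq!(
        trim_boundary_tokens(r#"{"query":"weather"}<|im_end|>"#),
        r#"{"query":"weather"}"#
    );
    // ...but a literal inside a JSON string survives untouched.
    assert_eq!(
        trim_boundary_tokens(r#"{"note":"ends with [EOS] token"}"#),
        r#"{"note":"ends with [EOS] token"}"#
    );
    println!("boundary trim sketch ok");
}
```

The loop handles stacked trailers like `</think><|im_end|>[EOS]` that a single pass over the token list could miss, depending on ordering.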

Comment on lines 66 to 70
ChatMessage::Tool {
    content,
    tool_call_id,
    ..
} => {

⚠️ Potential issue | 🟡 Minor

Add an assertion for the new tool.name field.

This only loosens the pattern match. It would be better to also deserialize a payload with "name" and assert it round-trips, otherwise the new wire contract stays untested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/spec/chat_message.rs` around lines 66 - 70, The test's
pattern match on ChatMessage::Tool was loosened but doesn't assert the new
tool.name field; update the test in model_gateway/tests/spec/chat_message.rs to
deserialize a payload that includes "name", assert that ChatMessage::Tool
contains tool.name equal to the provided value, and also serialize it back
(round-trip) to ensure the wire contract; locate the match arm for
ChatMessage::Tool and add assertions against the tool.name field and a
serialization check to confirm round-trip fidelity.

When the gRPC worker hasn't written any .db files yet (startup phase or
no requests processed), generate_latest() returns empty bytes.  Passing
this empty string to parse_prometheus() triggers a parse error WARN log
on every scrape interval.

Add guards at both the collector function (return Err for empty output)
and the call site (skip empty text silently) to prevent log spam.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
metrics_text: text,
});
}
Ok(_) => {

🟡 Nit: This Ok(_) arm is unreachable. collect_prometheus_multiproc_metrics() (line 116-118) already returns Err when text.trim().is_empty(), so the Ok variant can never contain an empty string. The guard on line 339 is redundant too — it will always be true.

Consider simplifying to just Ok(text) => { ... } to avoid confusing future readers into thinking empty-Ok is a real case.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
model_gateway/src/core/worker_manager.rs (1)

94-107: ⚠️ Potential issue | 🟠 Major

Add a timeout around the Python metrics collector subprocess.

.output().await can block this handler indefinitely if python3 startup or the prometheus_client import stalls, so /metrics turns into a hung request instead of a degraded scrape. Please bound this with tokio::time::timeout(...) and verify that prometheus_client is actually packaged anywhere the gateway runs.

🛠️ Suggested fix
-    let output = tokio::process::Command::new("python3")
-        .args([
-            "-c",
-            "import sys\n\
-             from prometheus_client import CollectorRegistry, generate_latest\n\
-             from prometheus_client.multiprocess import MultiProcessCollector\n\
-             registry = CollectorRegistry()\n\
-             MultiProcessCollector(registry)\n\
-             sys.stdout.buffer.write(generate_latest(registry))\n",
-        ])
-        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
-        .output()
-        .await
-        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
+    let output = tokio::time::timeout(
+        Duration::from_secs(3),
+        tokio::process::Command::new("python3")
+            .args([
+                "-c",
+                "import sys\n\
+                 from prometheus_client import CollectorRegistry, generate_latest\n\
+                 from prometheus_client.multiprocess import MultiProcessCollector\n\
+                 registry = CollectorRegistry()\n\
+                 MultiProcessCollector(registry)\n\
+                 sys.stdout.buffer.write(generate_latest(registry))\n",
+            ])
+            .env("PROMETHEUS_MULTIPROC_DIR", &dir)
+            .output(),
+    )
+    .await
+    .map_err(|_| "collect_prometheus_multiproc_metrics timed out".to_string())?
+    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
#!/bin/bash
set -euo pipefail

echo "== collector helper =="
sed -n '88,120p' model_gateway/src/core/worker_manager.rs

echo
echo "== packaging/runtime declarations for prometheus_client =="
fd -HI 'pyproject.toml|setup.py|requirements*.txt|Dockerfile|dockerfile|Pipfile|uv.lock|poetry.lock' . \
  | while read -r f; do
      echo "--- $f"
      rg -n -C2 'prometheus_client|python3' "$f" || true
    done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 94 - 107, Wrap the
tokio process call that creates the python3 prometheus collector (the Command
building code that currently ends with .output().await and assigns to output)
with tokio::time::timeout(...) to bound how long we wait for the subprocess to
start and return; on timeout return a mapped error like "prometheus collector
timed out" (same error type as the existing map_err) so the handler returns
promptly, and ensure you still propagate other I/O errors from .output(); keep
the existing env("PROMETHEUS_MULTIPROC_DIR", &dir) and variable name output
unchanged so the rest of the function can use it unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 116-118: The check that treats empty multiprocess output as an
error (the branch with text.trim().is_empty()) should instead return a soft-skip
sentinel so warmup/idle periods don't produce warnings; replace the Err(...)
with Ok(String::new()) (or otherwise return an empty String) in the function
containing the text.trim().is_empty() check, and update the later handling
around the warn! branch (the logic at the Ok(_) / warn! region) to special-case
this empty-string sentinel before logging a warning so the skip is reachable and
no spurious warnings occur.

---

Duplicate comments:
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 94-107: Wrap the tokio process call that creates the python3
prometheus collector (the Command building code that currently ends with
.output().await and assigns to output) with tokio::time::timeout(...) to bound
how long we wait for the subprocess to start and return; on timeout return a
mapped error like "prometheus collector timed out" (same error type as the
existing map_err) so the handler returns promptly, and ensure you still
propagate other I/O errors from .output(); keep the existing
env("PROMETHEUS_MULTIPROC_DIR", &dir) and variable name output unchanged so the
rest of the function can use it unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 49179c74-ff91-475b-89f4-334604303a24

📥 Commits

Reviewing files that changed from the base of the PR and between 46b284f and 98bd903.

📒 Files selected for processing (1)
  • model_gateway/src/core/worker_manager.rs

Comment on lines +116 to +118
if text.trim().is_empty() {
return Err("no metrics available from gRPC workers yet".to_string());
}

⚠️ Potential issue | 🟡 Minor

Treat empty multiprocess output as a soft skip, not an error.

Line 116 converts the “no .db files yet” case into Err(...), but Line 345 only skips silently on Ok(_). That makes the skip branch unreachable and will warn on every scrape during warmup/idle periods. Return Ok(String::new()) here, or special-case this sentinel before warn!.

🧹 Suggested fix
-    if text.trim().is_empty() {
-        return Err("no metrics available from gRPC workers yet".to_string());
-    }
+    if text.trim().is_empty() {
+        return Ok(String::new());
+    }

Also applies to: 345-350

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/core/worker_manager.rs` around lines 116 - 118, The check
that treats empty multiprocess output as an error (the branch with
text.trim().is_empty()) should instead return a soft-skip sentinel so
warmup/idle periods don't produce warnings; replace the Err(...) with
Ok(String::new()) (or otherwise return an empty String) in the function
containing the text.trim().is_empty() check, and update the later handling
around the warn! branch (the logic at the Ok(_) / warn! region) to special-case
this empty-string sentinel before logging a warning so the skip is reachable and
no spurious warnings occur.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 98bd903b4c


// Pass merged EOS token IDs from config.json + generation_config.json.
// TRT-LLM's gRPC path does not reliably merge these internally,
// so we provide them explicitly via the standard stop_token_ids field.
let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();

P2 Badge Respect ignore_eos when injecting EOS stop token IDs

This unconditionally copies merged EOS IDs into stop_token_ids, but stop_token_ids are hard stop conditions while ignore_eos is meant to continue generation past EOS. As written, requests with ignore_eos=true will still terminate when an EOS token appears, which breaks the documented ignore_eos behavior for TRT-LLM chat/harmony requests.

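The suggested behavior, as an illustrative sketch (the function and field names are hypothetical, not the TRT-LLM path's actual code): only inject the merged EOS IDs as hard stops when the request does not set ignore_eos.

```rust
// Hypothetical helper: EOS IDs become hard stop_token_ids only when the
// request honors EOS; ignore_eos=true leaves the stop set empty so
// generation can continue past EOS.
fn build_stop_token_ids(eos_token_ids: &[u32], ignore_eos: bool) -> Vec<u32> {
    if ignore_eos {
        Vec::new()
    } else {
        eos_token_ids.to_vec()
    }
}

fn main() {
    // Normal request: merged config.json + generation_config.json EOS IDs
    // are forwarded as stops (example IDs only).
    assert_eq!(build_stop_token_ids(&[2, 163_586], false), vec![2, 163_586]);
    // ignore_eos request: no EOS-derived stops are injected.
    assert!(build_stop_token_ids(&[2, 163_586], true).is_empty());
    println!("ignore_eos sketch ok");
}
```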

"[EOS]",
"[BOS]",
] {
clean = clean.replace(token, "");

P2 Badge Avoid mutating JSON payloads when stripping special tokens

This removes token substrings with global replace before parsing constrained JSON, so valid argument values containing literals like [EOS], </think>, or <|im_end|> get silently altered (for example a tool argument string can lose content). That corrupts structured outputs/tool inputs rather than just trimming leaked trailer tokens, so cleanup should be boundary-aware instead of replacing everywhere in the JSON text.


@mergify
Contributor

mergify bot commented Apr 9, 2026

Hi @ConnorLi96, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 9, 2026

Labels

grpc: gRPC client and router changes
model-gateway: Model gateway crate changes
needs-rebase: PR has merge conflicts that need to be resolved
protocols: Protocols crate changes
python-bindings: Python bindings changes
reasoning-parser: Reasoning parser changes
tests: Test changes
tokenizer: Tokenizer related changes
tool-parser: Tool/function call parser changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants