Conversation
Signed-off-by: Scott Lee <scott@together.ai>
…ol-call tokens
- Prioritize explicitly configured tool parser over JSON schema parsing
- Support alternative delimiters (<|func_start|>/<|func_end|>) in KimiK2 parser
- Prevent reasoning parser from consuming tool call markers
- Strip leaked chatml tokens only when a parser is explicitly configured
- Truncate trailing content after first valid JSON object for JSON response formats

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
- Fix tiktoken encoder to use encode_with_special_tokens for chat template output
- Prioritize configured tool parser over JSON schema with fallthrough
- Skip reasoning parsing when output is constrained (specific tool_choice / JSON response_format)
- Support alternative delimiters and hyphens in KimiK2 tool call parser
- Prevent reasoning parser from consuming tool call markers
- Keep tool call arguments as strings to preserve formatting
- Generate TypeScript-style tool declarations for Kimi K2.5
- Strip leaked chatml tokens conditionally and clean up JSON response format
- Add name field to Tool ChatMessage variant
- Graceful metrics server bind failure

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
TRT-LLM's gRPC path does not reliably apply generation_config.eos_token_id via SamplingParams._setup(), causing chatml models (e.g. Kimi-K2.5) to generate past turn boundaries (18/60 -> 52/60 on fc-dash). This is a workaround: the proper fix is for TRT-LLM to honor generation_config.eos_token_id in its gRPC executor path.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
…on_config.json

Models like Kimi-K2.5 define different EOS tokens in config.json ([EOS]=163585) and generation_config.json (<|im_end|>=163586). The engine must stop at both. SGLang handles this internally by merging both sources, but TRT-LLM's gRPC path does not. This change:
- Adds load_eos_token_ids() to read and merge EOS IDs from both config files
- Exposes eos_token_ids() on the Tokenizer trait
- Passes merged IDs as stop_token_ids in TRT-LLM gRPC requests
- Removes the hardcoded <|im_end|> stop string workaround (bfab5ad)
- Strips [EOS]/[BOS] text from constrained decoding output

Result: SMG+TRT-LLM fc-dash 55/60 (was 52/60 with hack, 18/60 without)

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Made-with: Cursor
📝 Walkthrough

Adds merged EOS token ID support across tokenizers and threads EOS IDs through gRPC request construction for TRT-LLM; enables special-token-aware tokenization; extends the tool-message schema with an optional name field.

Changes
Sequence Diagram(s)

sequenceDiagram
actor Client
participant Tokenizer
participant RequestBuilder
participant GrpcClient
participant TRT_LLM
Client->>Tokenizer: load tokenizer (config + generation_config)
Tokenizer->>Tokenizer: merge eos_token_ids
Tokenizer-->>RequestBuilder: eos_token_ids &[u32]
Client->>RequestBuilder: prepare chat request
RequestBuilder->>GrpcClient: build_chat_request(..., eos_token_ids)
GrpcClient->>GrpcClient: build_generate_request_from_chat(eos_token_ids)
GrpcClient->>TRT_LLM: GenerateRequest { stop_token_ids: eos_token_ids }
TRT_LLM-->>GrpcClient: Response
sequenceDiagram
actor Client
participant Processor
participant ConstraintDetector
participant ReasoningParser
participant ToolParser
participant Formatter
Client->>Processor: chat response (with response_format/tool_choice)
Processor->>ConstraintDetector: detect output_is_constrained
ConstraintDetector-->>Processor: true/false
alt constrained
Processor->>Formatter: skip ReasoningParser & ToolParser
Processor->>Formatter: extract/truncate JSON from fenced blocks
else unconstrained
Processor->>ReasoningParser: parse reasoning blocks
ReasoningParser-->>Processor: reasoning_text
Processor->>ToolParser: parse tool calls
ToolParser-->>Processor: tool_calls
end
Processor->>Formatter: final formatting
Formatter-->>Client: response
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
    if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
        for token in [
            "<|im_end|>",
            "<|im_start|>",
            "<|im_user|>",
            "<|im_assistant|>",
            "<|im_system|>",
            "<|im_middle|>",
        ] {
            processed_text = processed_text.replace(token, "");
        }
        processed_text = processed_text.trim().to_string();
    }
🟡 Nit: The chatml token stripping list here (non-streaming chat path) is missing </think>, [EOS], and [BOS] that the streaming path at streaming.rs:443-446 does strip. Same discrepancy exists in the Messages non-streaming path at line 747-758 and in process_tool_calls_stream at streaming.rs:1288-1291.
This means non-streaming responses could leak [EOS]/[BOS] tokens that streaming responses correctly filter out. Consider extracting a shared constant (e.g. const CHATML_LEAK_TOKENS: &[&str] = &[...]) to keep all 6 stripping sites consistent.
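A minimal sketch of the shared-constant refactor this nit suggests (the constant and helper names are hypothetical, not the project's actual identifiers):

```rust
// Union of the token lists used by the streaming and non-streaming paths,
// so all six stripping sites stay consistent.
const CHATML_LEAK_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

/// Remove any leaked chatml/control markers and trim the result.
fn strip_leaked_tokens(text: &str) -> String {
    let mut out = text.to_string();
    for token in CHATML_LEAK_TOKENS {
        out = out.replace(token, "");
    }
    out.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("done<|im_end|>[EOS]"), "done");
    assert_eq!(strip_leaked_tokens("  a</think>b  "), "ab");
}
```

Each call site would then iterate the one constant instead of carrying its own token list.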
    if self.configured_tool_parser.is_some() && tool_parser_available {
        (tool_calls, processed_text) = self
            .parse_tool_calls(
                &processed_text,
                &messages_request.model,
                utils::message_utils::get_history_tool_calls_count_messages(
                    &messages_request,
                ),
            )
            .await;
🟡 Nit: The chat non-streaming path (line ~167) gates the configured tool parser with && !output_is_constrained, but this Messages non-streaming path does not. If a Messages API request uses constrained decoding (e.g. specific tool choice) with a configured tool parser, the parser will run on raw JSON and may misparse it. Consider adding the same output_is_constrained guard here for consistency, or document why the Messages path intentionally differs.
    out
}

fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {
🟡 Nit: The required parameter is unused — object_to_typescript reads the required list from the schema value directly. This is dead code; consider removing the parameter.
- fn schema_to_typescript(schema: &Value, indent: &str, required: &[&str]) -> String {
+ fn schema_to_typescript(schema: &Value, indent: &str) -> String {
    if processed_text.starts_with('{') {
        let mut depth = 0i32;
        let mut in_string = false;
        let mut escape = false;
        let mut json_end = None;
        for (i, ch) in processed_text.char_indices() {
            if escape { escape = false; continue; }
            if ch == '\\' && in_string { escape = true; continue; }
            if ch == '"' { in_string = !in_string; continue; }
            if in_string { continue; }
            if ch == '{' { depth += 1; } else if ch == '}' {
                depth -= 1;
                if depth == 0 { json_end = Some(i + 1); break; }
            }
        }
        if let Some(end) = json_end {
            processed_text = processed_text[..end].to_string();
        }
    }
🟡 Nit: This hand-rolled JSON brace-matching truncator only handles {…} top-level objects. If response_format is json_schema and the schema's root type is an array ([…]), the JSON response will start with [ and this code won't trim trailing tokens from it. Consider also handling the array case, or using serde_json::from_str to find the valid JSON prefix (attempt parse, on failure progressively trim trailing characters).
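A std-only sketch of the array-aware variant (the function name is hypothetical); it generalizes the brace matcher above to whichever bracket the text opens with, still tracking strings and escapes:

```rust
/// Byte index just past the first complete top-level JSON object or array,
/// or None if the text does not start with '{' or '[' (or never closes).
fn first_json_value_end(text: &str) -> Option<usize> {
    let (open, close) = match text.chars().next()? {
        '{' => ('{', '}'),
        '[' => ('[', ']'),
        _ => return None,
    };
    let (mut depth, mut in_string, mut escape) = (0i32, false, false);
    for (i, ch) in text.char_indices() {
        if escape {
            escape = false;
            continue;
        }
        if in_string {
            match ch {
                '\\' => escape = true,
                '"' => in_string = false,
                _ => {}
            }
            continue;
        }
        match ch {
            '"' => in_string = true,
            c if c == open => depth += 1,
            c if c == close => {
                depth -= 1;
                if depth == 0 {
                    return Some(i + 1); // close brackets are 1 byte wide
                }
            }
            _ => {}
        }
    }
    None
}

fn main() {
    // Array-rooted JSON with trailing leaked tokens now gets trimmed too.
    assert_eq!(first_json_value_end("[{\"x\":1}]<|im_end|>"), Some(9));
    assert_eq!(first_json_value_end("{\"a\":\"}\"}tail"), Some(9));
    assert_eq!(first_json_value_end("plain"), None);
}
```

Counting only the root bracket type is sufficient because nested brackets of the other type cannot change when the root closes, and brackets inside strings are skipped.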
    /// Collect gRPC worker metrics by aggregating `PROMETHEUS_MULTIPROC_DIR` via a python3 subprocess.
    async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
        let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
            "PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
        })?;

        let output = tokio::process::Command::new("python3")
            .args([
                "-c",
                "import sys\n\
                 from prometheus_client import CollectorRegistry, generate_latest\n\
                 from prometheus_client.multiprocess import MultiProcessCollector\n\
                 registry = CollectorRegistry()\n\
                 MultiProcessCollector(registry)\n\
                 sys.stdout.buffer.write(generate_latest(registry))\n",
            ])
            .env("PROMETHEUS_MULTIPROC_DIR", &dir)
            .output()
            .await
            .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

        if !output.status.success() {
            let stderr = String::from_utf8_lossy(&output.stderr);
            return Err(format!("python3 prometheus collector failed: {stderr}"));
        }

        String::from_utf8(output.stdout)
            .map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))
🟡 Nit: collect_prometheus_multiproc_metrics spawns a new python3 process on every /metrics scrape. If Prometheus scrapes at 15s intervals, that's ~5760 Python startups/day. This works but could add latency to the metrics endpoint under load. Consider documenting this trade-off or, as a follow-up, caching the Python process / using a longer-lived collector.
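One follow-up shape is to memoize the collector output for a short TTL so back-to-back scrapes reuse a single subprocess run. A minimal std-only sketch of that caching pattern (the type and method names are hypothetical; a closure stands in for the async python3 collector):

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Cache a collector's output for `ttl`, re-running it only when stale.
struct CachedMetrics {
    ttl: Duration,
    slot: Mutex<Option<(Instant, String)>>,
}

impl CachedMetrics {
    fn new(ttl: Duration) -> Self {
        Self { ttl, slot: Mutex::new(None) }
    }

    fn get(&self, collect: impl FnOnce() -> String) -> String {
        let mut slot = self.slot.lock().unwrap();
        if let Some((at, text)) = slot.as_ref() {
            if at.elapsed() < self.ttl {
                return text.clone(); // fresh enough: skip the collector
            }
        }
        let text = collect();
        *slot = Some((Instant::now(), text.clone()));
        text
    }
}

fn main() {
    let cache = CachedMetrics::new(Duration::from_secs(5));
    let a = cache.get(|| "metrics v1".to_string());
    // Second call within the TTL returns the cached value unchanged.
    let b = cache.get(|| "metrics v2".to_string());
    assert_eq!(a, b);
}
```

The trade-off is slightly stale metrics within the TTL window, which is usually acceptable for scrape intervals of several seconds.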
Comprehensive set of fixes for func call handling, EOS token merging, chatml token stripping, and gRPC metrics. The changes are well-structured and the logic is sound.
Summary: 0 🔴 Important · 5 🟡 Nit · 0 🟣 Pre-existing
Nits flagged:
- Inconsistent chatml token strip lists — 6 stripping sites use different token sets; non-streaming paths miss </think>, [EOS], [BOS]. Extract a shared constant.
- Messages non-streaming path missing output_is_constrained guard — the Chat path gates the configured tool parser on constrained output, the Messages path doesn't.
- Unused required parameter in schema_to_typescript — dead code.
- JSON truncator only handles {…} — array-rooted JSON schemas ([…]) won't be trimmed.
- Python subprocess per metrics scrape — works but spawns a process on every /metrics call; consider documenting the trade-off.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 46b284fe0e
    if self.configured_tool_parser.is_some() && tool_parser_available {
        (tool_calls, processed_text) = self
            .parse_tool_calls(
                &processed_text,
                &messages_request.model,
                utils::message_utils::get_history_tool_calls_count_messages(
                    &messages_request,
                ),
            )
            .await;
    } else if used_json_schema {
Restore JSON-schema fallback after configured tool parsing
When messages_request.tool_choice is Tool/Any (JSON-schema constrained output) and a configured tool parser is present, this branch always runs parse_tool_calls first and skips the JSON-schema parser due to the else if chain. In constrained mode the model often returns plain JSON without native tool-call markers, so parse_tool_calls returns no calls and the response loses tool invocations instead of building ToolUse blocks. This is a regression from the previous logic that prioritized parse_json_schema_response for constrained tool choices.
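A toy model of the ordering this comment asks for (Parsed, native_parse, and parse_output are illustrative stand-ins, not the router's actual API): in constrained mode, handle the plain-JSON output first and only then consult the configured native parser.

```rust
#[derive(Debug, PartialEq)]
enum Parsed {
    ToolCalls(Vec<String>),
    Text(String),
}

// Stand-in for the configured native parser: recognizes <tool>…</tool> markers.
fn native_parse(text: &str) -> Option<String> {
    let inner = text.strip_prefix("<tool>")?.strip_suffix("</tool>")?;
    Some(inner.to_string())
}

fn parse_output(text: &str, constrained: bool, has_configured_parser: bool) -> Parsed {
    // Constrained (JSON-schema) output is plain JSON arguments with no
    // native markers: wrap it as a tool call before trying the native parser.
    if constrained && text.trim_start().starts_with('{') {
        return Parsed::ToolCalls(vec![text.trim().to_string()]);
    }
    if has_configured_parser {
        if let Some(call) = native_parse(text) {
            return Parsed::ToolCalls(vec![call]);
        }
    }
    Parsed::Text(text.to_string())
}

fn main() {
    // Constrained JSON keeps its tool invocation even with a configured
    // parser present, instead of degrading to plain text.
    let out = parse_output(r#"{"city":"Paris"}"#, true, true);
    assert_eq!(out, Parsed::ToolCalls(vec![r#"{"city":"Paris"}"#.to_string()]));
}
```

An else-if chain that tries the configured parser first inverts this ordering and drops the constrained-mode tool call, which is the regression described above.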
    // Use add_special_tokens=true so the tokenizer recognizes special tokens
    // (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
    // output as single token IDs rather than splitting them into characters.
    let encoding = match tokenizer.encode(&processed_messages.text, true) {
There was a problem hiding this comment.
Avoid enabling generic special-token injection for chat prompts
This now tokenizes already-rendered chat-template text with add_special_tokens=true. For HuggingFaceTokenizer, encode(..., true) is forwarded directly to the HF tokenizer post-processor, which can prepend/append BOS/EOS on top of template-provided markers. That changes prompt token IDs and can break stop boundaries or generation behavior for non-tiktoken models. The special-token recognition fix is tiktoken-specific, but this call affects all tokenizer backends (and is duplicated in the Messages preparation path).
Code Review
This pull request improves support for Kimi K2.5 models by merging EOS token IDs from configuration files, implementing TypeScript-style tool declarations, and refining reasoning and tool call parsers. It also adds gRPC worker metrics collection and logic to strip leaked ChatML tokens. Review feedback identifies opportunities to reduce code duplication in token parsing and stripping, and suggests using more robust JSON deserialization to improve performance and reliability.
    pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
        let mut ids = std::collections::BTreeSet::new();

        // config.json — eos_token_id can be int or list
        if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
            if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
                match cfg.get("eos_token_id") {
                    Some(serde_json::Value::Number(n)) => {
                        if let Some(id) = n.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                    Some(serde_json::Value::Array(arr)) => {
                        for v in arr {
                            if let Some(id) = v.as_u64() {
                                ids.insert(id as TokenIdType);
                            }
                        }
                    }
                    _ => {}
                }
            }
        }

        // generation_config.json — eos_token_id can be int or list
        if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
            if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
                match cfg.get("eos_token_id") {
                    Some(serde_json::Value::Number(n)) => {
                        if let Some(id) = n.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                    Some(serde_json::Value::Array(arr)) => {
                        for v in arr {
                            if let Some(id) = v.as_u64() {
                                ids.insert(id as TokenIdType);
                            }
                        }
                    }
                    _ => {}
                }
            }
        }

        ids.into_iter().collect()
    }
The logic for parsing eos_token_id from config.json and generation_config.json is duplicated. To improve maintainability and reduce code duplication, extract this logic into a helper closure. Additionally, refactor the match statement to return the differing values and perform the insertion logic once to avoid redundancy.
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();
    let mut extract_eos_ids = |content: &str| {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(content) {
            let extracted: Vec<u64> = match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => n.as_u64().into_iter().collect(),
                Some(serde_json::Value::Array(arr)) => arr.iter().filter_map(|v| v.as_u64()).collect(),
                _ => vec![],
            };
            for id in extracted {
                ids.insert(id as TokenIdType);
            }
        }
    };
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        extract_eos_ids(&content);
    }
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        extract_eos_ids(&content);
    }
    ids.into_iter().collect()
}

References
- Extract duplicated logic into a shared helper function to improve maintainability and reduce redundancy.
- Refactor match statements to avoid duplication. When arms have common logic, use the match to return the differing value and perform the common logic once.
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
    ] {
This list of tokens to strip is duplicated in other parts of the codebase. To avoid duplication and improve maintainability, consider defining this array as a constant at the module level. Furthermore, when stripping these markers from data that might contain structured formats like JSON, ensure you use whole-string matching instead of character-set matching to prevent data corruption.
References
- Extract duplicated logic into a shared helper function or constant to improve maintainability.
- When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.
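The whole-string vs character-set distinction matters concretely in Rust: str::replace matches the full pattern, while trim_matches with a char set strips individual characters and can eat legitimate JSON delimiters. A small illustration:

```rust
fn main() {
    let payload = r#"{"a":1}[EOS]"#;

    // Whole-string matching: only the exact marker is removed.
    let safe = payload.replace("[EOS]", "");
    assert_eq!(safe, r#"{"a":1}"#);

    // Character-set matching: trims any of '[', 'E', 'O', 'S', ']' from the
    // ends of the string, so it also chews into the data's own brackets.
    let risky = "[1, 2][EOS]".trim_matches(&['[', 'E', 'O', 'S', ']'][..]);
    assert_eq!(risky, "1, 2"); // the array's own brackets are gone
}
```

That is exactly the corruption mode the review warns about when the stripped text is JSON.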
    if processed_text.starts_with('{') {
        let mut depth = 0i32;
        let mut in_string = false;
        let mut escape = false;
        let mut json_end = None;
        for (i, ch) in processed_text.char_indices() {
            if escape { escape = false; continue; }
            if ch == '\\' && in_string { escape = true; continue; }
            if ch == '"' { in_string = !in_string; continue; }
            if in_string { continue; }
            if ch == '{' { depth += 1; } else if ch == '}' {
                depth -= 1;
                if depth == 0 { json_end = Some(i + 1); break; }
            }
        }
        if let Some(end) = json_end {
            processed_text = processed_text[..end].to_string();
        }
    }
This manual JSON parsing logic to find the end of the first object can be brittle. For a more robust solution, consider using serde_json::Deserializer to stream-parse the first valid JSON object. Additionally, avoid reallocating memory with .to_string() within processing loops; use .truncate() to modify the buffer in-place.
if processed_text.starts_with('{') {
    // Scope the stream so its borrow of processed_text ends before truncate().
    let end = {
        let mut stream = serde_json::Deserializer::from_str(&processed_text)
            .into_iter::<serde_json::Value>();
        stream.next().map(|_| stream.byte_offset())
    };
    if let Some(end) = end {
        processed_text.truncate(end);
    }
}

References
- When processing streaming data, avoid inefficient string manipulations like buffer = buffer[..].to_string() as it reallocates memory; use performant methods like .drain() or .truncate() instead.
    for token in [
        "<|im_end|>", "<|im_start|>", "<|im_user|>",
        "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
        "</think>", "[EOS]", "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
This list of tokens to strip is duplicated across multiple stream processing functions. To improve maintainability and consistency, define this array as a constant at the module level. Ensure that any stripping logic uses whole-string matching rather than character-set matching to avoid corrupting JSON data.
References
- Extract duplicated logic into a shared helper function or constant to improve maintainability.
- When stripping specific marker strings from data that might contain structured formats like JSON, use whole-string matching instead of character-set matching to avoid data corruption.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
bindings/python/src/smg/serve.py (1)
632-671: ⚠️ Potential issue | 🟡 Minor: Always run Prometheus cleanup, even when no worker was appended.

_launch_workers() creates self._prometheus_dir before the first self.workers.append(...), but line 634 returns early when self.workers is empty. If the first launch fails, the temp dir is leaked and PROMETHEUS_MULTIPROC_DIR stays stale in-process.

Suggested fix
def _cleanup_workers(self) -> None: """SIGTERM all worker process groups, wait, then SIGKILL stragglers.""" - if not self.workers: - return - - # Send SIGTERM to each process group - for proc, port in self.workers: - try: - os.killpg(proc.pid, signal.SIGTERM) - except (ProcessLookupError, OSError): - pass - - # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit - deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT - for proc, port in self.workers: - remaining = max(0, deadline - time.monotonic()) - try: - proc.wait(timeout=remaining) - except subprocess.TimeoutExpired: - try: - os.killpg(proc.pid, signal.SIGKILL) - except (ProcessLookupError, OSError): - pass - - self._cleanup_prometheus_dir() + try: + if self.workers: + # Send SIGTERM to each process group + for proc, port in self.workers: + try: + os.killpg(proc.pid, signal.SIGTERM) + except (ProcessLookupError, OSError): + pass + + # Wait up to _WORKER_SHUTDOWN_TIMEOUT seconds for graceful exit + deadline = time.monotonic() + _WORKER_SHUTDOWN_TIMEOUT + for proc, port in self.workers: + remaining = max(0, deadline - time.monotonic()) + try: + proc.wait(timeout=remaining) + except subprocess.TimeoutExpired: + try: + os.killpg(proc.pid, signal.SIGKILL) + except (ProcessLookupError, OSError): + pass + finally: + self._cleanup_prometheus_dir() def _cleanup_prometheus_dir(self) -> None: """Remove the temporary prometheus multiprocess directory and its .db files.""" if self._prometheus_dir is None: return try: shutil.rmtree(self._prometheus_dir) logger.info("Cleaned up PROMETHEUS_MULTIPROC_DIR=%s", self._prometheus_dir) except OSError as e: logger.warning( "Failed to clean up PROMETHEUS_MULTIPROC_DIR=%s: %s", self._prometheus_dir, e, ) + os.environ.pop("PROMETHEUS_MULTIPROC_DIR", None) self._prometheus_dir = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bindings/python/src/smg/serve.py` around lines 632-671: The early return in _cleanup_workers when self.workers is empty skips calling _cleanup_prometheus_dir and can leak the prometheus temp dir; ensure prometheus cleanup always runs by removing or moving the early return and invoking self._cleanup_prometheus_dir() (e.g., call self._cleanup_prometheus_dir() before returning, or eliminate the return and only skip process-group shutdown when self.workers is empty) so that _cleanup_prometheus_dir() runs even if no workers were appended.

model_gateway/src/routers/grpc/harmony/builder.rs (1)
876-885: ⚠️ Potential issue | 🟠 Major: Prefer ChatMessage::Tool.name before falling back to tool_call_id.

This path still reconstructs the function name solely from the assistant tool-call map. If a request contains a tool message with name populated but no matching prior assistant tool call, Harmony will emit functions.{tool_call_id} instead of the actual tool name, which defeats the new field.

Suggested fix
- ChatMessage::Tool {
-     content,
-     tool_call_id,
-     ..
- } => {
+ ChatMessage::Tool {
+     content,
+     tool_call_id,
+     name,
+ } => {
      // Look up the function name from the tool_call_id
-     let function_name = tool_call_map
-         .get(tool_call_id)
-         .cloned()
-         .unwrap_or_else(|| tool_call_id.clone());
+     let function_name = name
+         .clone()
+         .or_else(|| tool_call_map.get(tool_call_id).cloned())
+         .unwrap_or_else(|| tool_call_id.clone());

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/harmony/builder.rs` around lines 876-885: Update the ChatMessage::Tool match arm to prefer the message's name field before falling back to the assistant tool-call map or the tool_call_id: destructure the tool message to include name (e.g. ChatMessage::Tool { content, tool_call_id, name, .. }) and compute function_name by using name.clone() when Some, otherwise lookup tool_call_map.get(tool_call_id).cloned() and finally fall back to tool_call_id.clone().

model_gateway/src/routers/grpc/regular/responses/conversions.rs (1)
86-117: ⚠️ Potential issue | 🟠 Major: Don't drop tool names during Responses → Chat conversion.

Both branches hardcode name: None, so the new ChatMessage::Tool.name field never reaches chat templates on this path. FunctionToolCall already has the tool name, and FunctionCallOutput can recover it with a small call_id -> name map while iterating items. As written, converted histories still lose the tool identity this PR is trying to preserve.

Also applies to: 137-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs` around lines 86-117: The conversion currently drops tool names by setting ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall use the existing local variable name to populate ChatMessage::Tool.name (e.g., name: Some(name.clone())) when pushing the Tool message, and for the FunctionCallOutput branch build a temporary call_id -> name map while iterating ResponseInputOutputItem items (recording id -> name from FunctionToolCall entries) so that when you encounter FunctionCallOutput you can look up the tool name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to None if not found); update both places where ChatMessage::Tool.name is currently hardcoded to None.

model_gateway/src/routers/grpc/harmony/stages/request_building.rs (1)
177-207: ⚠️ Potential issue | 🟡 Minor: Forward model EOS token IDs to TRT-LLM in the Harmony /responses path.

The chat branch passes &eos_ids to build_generate_request_from_chat(), where they populate stop_token_ids with merged EOS from model config. The responses branch calls build_generate_request_from_responses() without eos_ids, leaving stop_token_ids empty (line 451 in trtllm_service.rs).

The later Harmony injection at line 253 only adds assistant-action stop tokens (harmony_stops), not the model's EOS tokens. Since build_generate_request_from_responses() does not inject EOS internally, the /responses path will miss the merged EOS handling applied in chat.

Pass &eos_ids to build_generate_request_from_responses() or apply the same EOS-merge logic used in the chat builder (lines 291–294 in trtllm_service.rs) to the responses builder.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs` around lines 177-207: The /responses path is missing model EOS token IDs because the code builds eos_ids earlier but only passes them into trtllm_client.build_generate_request_from_chat; update the responses branch inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge logic used by build_generate_request_from_chat) so stop_token_ids are populated the same way for both RequestType::Chat and RequestType::Responses; locate the eos_ids variable and the trtllm_client.build_generate_request_from_responses call and add the eos_ids argument (or the equivalent merge) to match build_generate_request_from_chat.

model_gateway/src/routers/grpc/client.rs (1)
271-280: ⚠️ Potential issue | 🟠 Major: EOS stop IDs are only wired through the chat builder.

This new parameter fixes TensorRT-LLM chat requests, but the sibling Messages path still has no way to forward merged EOS IDs and ultimately keeps stop_token_ids empty in TrtllmServiceClient::build_generate_request_from_messages. Anthropic Messages requests will therefore keep the old over-generation behavior. Please thread the same slice through build_messages_request as well.

Also applies to: 317-325
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/routers/grpc/client.rs` around lines 271-280: The EOS stop IDs are only passed into build_chat_request, so the messages path still omits merged EOS IDs; update TrtllmServiceClient::build_messages_request (and any callers like build_generate_request_from_messages) to accept the eos_token_ids slice parameter (same type as in build_chat_request) and thread it through to where stop_token_ids are set so stop_token_ids is populated for Messages requests just like for Chat requests; adjust function signatures (build_messages_request, build_generate_request_from_messages) and internal calls to pass the eos_token_ids slice through unchanged.

crates/grpc_client/src/trtllm_service.rs (1)
291-308: ⚠️ Potential issue | 🟠 Major: Merge request stop-token IDs instead of replacing them.

stop_token_ids now comes only from tokenizer EOS IDs, so explicit ChatCompletionRequest.stop_token_ids are still ignored on the TRT-LLM chat path. That means backend-side stopping remains incomplete and extra tokens can be generated until router trimming kicks in.

Proposed fix

- let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();
+ let mut stop_token_ids = body.stop_token_ids.clone().unwrap_or_default();
+ stop_token_ids.extend_from_slice(eos_token_ids);
+ stop_token_ids.sort_unstable();
+ stop_token_ids.dedup();

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/grpc_client/src/trtllm_service.rs` around lines 291 - 308, The code currently sets stop_token_ids = eos_token_ids.to_vec() which ignores any stop_token_ids passed in the ChatCompletionRequest for the TRT-LLM path; change the construction of stop_token_ids (used in proto::GenerateRequest) to merge tokenizer EOS IDs with the request's explicit stop token IDs (e.g., body.stop_token_ids or ChatCompletionRequest.stop_token_ids), convert/normalize them to u32, remove duplicates (preserve a sensible order such as request-provided then EOS or vice-versa), and use that merged Vec<u32> in the grpc_request so backend-side stopping honors both sources.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 45-97: load_eos_token_ids currently builds an empty set when
neither config.json nor generation_config.json declare numeric eos_token_id,
causing callers (e.g., assignments to special_tokens.eos_token_ids in load code
and in crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS
IDs resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).
In `@model_gateway/src/core/worker_manager.rs`:
- Around line 89-116: collect_prometheus_multiproc_metrics currently invokes a
Python subprocess that imports prometheus_client but that package isn't declared
in the smg/python bindings dependencies and the subprocess has no timeout; add
"prometheus_client" to the core Python deps in bindings/python/pyproject.toml
(so the ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.
In `@model_gateway/src/routers/grpc/regular/processor.rs`:
- Around line 221-234: The non-streaming cleanup in processor.rs currently only
strips <|im_*|> tokens inside the block guarded by
self.configured_tool_parser/is_some but misses other leaked tokens (</think>,
[EOS], [BOS]) that the streaming path removes; create a single sanitizer
function (e.g., strip_leaked_tokens or sanitize_leaked_tokens) on the impl (or a
private helper) that accepts &str and returns a trimmed String, move the full
token list and replacement logic into it, and call this helper from the existing
processed_text cleanup block and from the Messages block referenced in the
review so both streaming and non-streaming paths reuse the same sanitizer (refer
to self.configured_tool_parser, self.configured_reasoning_parser,
processed_text, and the Messages handling code).
- Around line 710-720: The configured parser branch currently runs before the
constrained-output (used_json_schema) handling, causing configured parsers to
bypass the JSON-schema -> ToolUse fallback; update the logic in processor.rs
around configured_tool_parser and parse_tool_calls so it either (a) skips
invoking the configured parser when used_json_schema is true (matching
process_single_choice behavior), or (b) checks, after parse_tool_calls returns
no tool_calls, whether used_json_schema is true and then runs the JSON-schema
wrapper / creates a ToolUse block from the arguments-only JSON; ensure you
reference and adjust the conditionals around self.configured_tool_parser,
tool_parser_available, parse_tool_calls, used_json_schema, and the code path
that constructs ToolUse so the constrained-output fallback always runs when
appropriate.
In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs`:
- Around line 130-134: The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.
In `@model_gateway/src/routers/grpc/regular/streaming.rs`:
- Around line 438-450: The scrubber that removes ChatML/think/EOS tokens from a
chunked delta is only applied inside the per-chunk path; extract that logic into
a shared helper (e.g., fn scrub_delta(input: &str, configured_tool_parser:
Option<...>, configured_reasoning_parser: Option<...>) -> String) and call it
both where `delta` is currently sanitized and on the final
`stop_decoder.flush()` output before emitting the last SSE delta; reference the
existing symbols `configured_tool_parser`, `configured_reasoning_parser`,
`delta`, and `stop_decoder.flush()` so the new helper matches the current
conditional behavior and is reused in the flush path.
- Around line 1805-1807: The condition allows ToolChoice::Tool to fall into the
generic parser branch even when streaming_tool_parser (the JSON/constrained
parser) was selected, causing parameters with name: None to emit InputJsonDelta
before ContentBlockStart; change the conditional logic around ToolChoice::Tool
so it always follows the arguments-only branch (i.e., keep ToolChoice::Tool out
of the generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs`:
- Around line 547-562: The current loop in chat_utils.rs replaces control tokens
globally in processed_text (using clean.replace(...)), which can corrupt valid
JSON string fields; instead trim only leaked prefix/suffix markers from the
start and end of processed_text. Locate the block using the variables
processed_text and clean and replace the global replace() calls with logic that
repeatedly strip known tokens from the start (strip_prefix) and end
(strip_suffix) or use anchored regexes so only boundary occurrences like
prefix/suffix markers (e.g., "<|im_start|>", "<|im_end|>", "</think>", "[EOS]",
etc.) are removed; preserve any occurrences of those tokens inside the JSON
payload. Ensure the final clean is trimmed of whitespace as before.
- Around line 411-420: The generated TypeScript snippet inserted as
"tools_ts_str" via tools_to_typescript currently emits raw OpenAI tool names
(from request.tools) which may contain hyphens and produce invalid TypeScript
identifiers; update tools_to_typescript (and any generation logic that builds
tools_ts_str used when inserting into combined_template_kwargs) to either
generate quoted/string-literal type names (e.g., "'get-weather'" or
`"\"get-weather\""` in the TS output) or sanitize names into valid
BindingIdentifiers (e.g., replace/normalize hyphens to underscores) and
consistently use that sanitized form in all places it’s referenced; ensure the
change is applied across the related generator code paths (including the
tools_to_typescript function and where tools_ts_str is inserted) so templates
referencing tool names remain syntactically valid.
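One way to implement the identifier-sanitization option suggested for tools_to_typescript, sketched as a standalone function. The name `sanitize_ts_identifier` is hypothetical; the real generator lives in chat_utils.rs.

```rust
/// Map an OpenAI tool name onto a valid TypeScript identifier: anything
/// outside [A-Za-z0-9_$] becomes '_', and a leading digit (or empty name)
/// gets a '_' prefix. Hypothetical helper illustrating the suggestion.
fn sanitize_ts_identifier(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || c == '$' {
                c
            } else {
                '_'
            }
        })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}

fn main() {
    assert_eq!(sanitize_ts_identifier("get-weather"), "get_weather");
    assert_eq!(sanitize_ts_identifier("2fast"), "_2fast");
}
```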
In `@model_gateway/tests/spec/chat_message.rs`:
- Around line 66-70: The test's pattern match on ChatMessage::Tool was loosened
but doesn't assert the new tool.name field; update the test in
model_gateway/tests/spec/chat_message.rs to deserialize a payload that includes
"name", assert that ChatMessage::Tool contains tool.name equal to the provided
value, and also serialize it back (round-trip) to ensure the wire contract;
locate the match arm for ChatMessage::Tool and add assertions against the
tool.name field and a serialization check to confirm round-trip fidelity.
---
Outside diff comments:
In `@bindings/python/src/smg/serve.py`:
- Around line 632-671: The early return in _cleanup_workers when self.workers is
empty skips calling _cleanup_prometheus_dir and can leak the prometheus temp
dir; ensure prometheus cleanup always runs by removing or moving the early
return and invoking self._cleanup_prometheus_dir() (e.g., call
self._cleanup_prometheus_dir() before returning or eliminate the return and only
skip process-group shutdown when self.workers is empty) so that
_cleanup_prometheus_dir() runs even if no workers were appended.
In `@crates/grpc_client/src/trtllm_service.rs`:
- Around line 291-308: The code currently sets stop_token_ids =
eos_token_ids.to_vec() which ignores any stop_token_ids passed in the
ChatCompletionRequest for the TRT-LLM path; change the construction of
stop_token_ids (used in proto::GenerateRequest) to merge tokenizer EOS IDs with
the request's explicit stop token IDs (e.g., body.stop_token_ids or
ChatCompletionRequest.stop_token_ids), convert/normalize them to u32, remove
duplicates (preserve a sensible order such as request-provided then EOS or
vice-versa), and use that merged Vec<u32> in the grpc_request so backend-side
stopping honors both sources.
In `@model_gateway/src/routers/grpc/client.rs`:
- Around line 271-280: The EOS stop IDs are only passed into build_chat_request,
so the messages path still omits merged EOS IDs; update
TrtllmServiceClient::build_messages_request (and any callers like
build_generate_request_from_messages) to accept the eos_token_ids slice
parameter (same type as in build_chat_request) and thread it through to where
stop_token_ids are set so stop_token_ids is populated for Messages requests just
like for Chat requests; adjust function signatures (build_messages_request,
build_generate_request_from_messages) and internal calls to pass the
eos_token_ids slice through unchanged.
In `@model_gateway/src/routers/grpc/harmony/builder.rs`:
- Around line 876-885: Update the ChatMessage::Tool match arm to prefer the
message's name field before falling back to the assistant tool-call map or the
tool_call_id: destructure the tool message to include name (e.g.
ChatMessage::Tool { content, tool_call_id, name, .. }) and compute function_name
by using name.clone() when Some, otherwise lookup
tool_call_map.get(tool_call_id).cloned() and finally fall back to
tool_call_id.clone().
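The fallback chain described for ChatMessage::Tool can be written as a single expression. This sketch uses plain types in place of the real message struct:

```rust
use std::collections::HashMap;

/// Prefer the tool message's own `name`, then the assistant tool-call map,
/// then the tool_call_id itself. Illustrative stand-in for the harmony
/// builder's ChatMessage::Tool handling.
fn resolve_function_name(
    name: Option<&str>,
    tool_call_map: &HashMap<String, String>,
    tool_call_id: &str,
) -> String {
    name.map(str::to_owned)
        .or_else(|| tool_call_map.get(tool_call_id).cloned())
        .unwrap_or_else(|| tool_call_id.to_owned())
}

fn main() {
    let mut map = HashMap::new();
    map.insert("call_1".to_string(), "get_weather".to_string());
    assert_eq!(resolve_function_name(Some("lookup"), &map, "call_1"), "lookup");
    assert_eq!(resolve_function_name(None, &map, "call_1"), "get_weather");
    assert_eq!(resolve_function_name(None, &map, "call_2"), "call_2");
}
```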
In `@model_gateway/src/routers/grpc/harmony/stages/request_building.rs`:
- Around line 177-207: The /responses path is missing model EOS token IDs
because the code builds eos_ids earlier but only passes them into
trtllm_client.build_generate_request_from_chat; update the responses branch
inside HarmonyRequestBuildingStage::execute to pass the same &eos_ids into
trtllm_client.build_generate_request_from_responses (or replicate the EOS-merge
logic used by build_generate_request_from_chat) so stop_token_ids are populated
the same way for both RequestType::Chat and RequestType::Responses; locate the
eos_ids variable and the trtllm_client.build_generate_request_from_responses
call and add the eos_ids argument (or the equivalent merge) to match
build_generate_request_from_chat.
In `@model_gateway/src/routers/grpc/regular/responses/conversions.rs`:
- Around line 86-117: The conversion currently drops tool names by setting
ChatMessage::Tool.name to None; for ResponseInputOutputItem::FunctionToolCall
use the existing local variable name to populate ChatMessage::Tool.name (e.g.,
name: Some(name.clone())) when pushing the Tool message, and for the
FunctionCallOutput branch build a temporary call_id -> name map while iterating
ResponseInputOutputItem items (recording id -> name from FunctionToolCall
entries) so that when you encounter FunctionCallOutput you can look up the tool
name and set ChatMessage::Tool.name to Some(found_name.clone()) (falling back to
None if not found); update both places where ChatMessage::Tool.name is currently
hardcoded to None.
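For the chat_utils.rs boundary-trimming suggestion above (strip leaked markers only at the start and end of processed_text, never inside the JSON payload), a std-only sketch with an abbreviated token list:

```rust
/// Strip known leaked markers only from the boundaries of the text, leaving
/// occurrences inside the JSON payload intact. Sketch of the suggested
/// replacement for the global replace() loop; the token list is abbreviated.
fn trim_leaked_markers(text: &str) -> String {
    const MARKERS: [&str; 5] = ["<|im_start|>", "<|im_end|>", "</think>", "[EOS]", "[BOS]"];
    let mut s = text.trim();
    loop {
        let before = s;
        for m in MARKERS {
            s = s.strip_prefix(m).unwrap_or(s).trim_start();
            s = s.strip_suffix(m).unwrap_or(s).trim_end();
        }
        if s == before {
            break; // no marker removed in this pass
        }
    }
    s.to_string()
}

fn main() {
    // The inner "<|im_end|>" inside the JSON string survives.
    let cleaned = trim_leaked_markers("<|im_end|>{\"text\":\"<|im_end|>\"}[EOS]");
    assert_eq!(cleaned, "{\"text\":\"<|im_end|>\"}");
}
```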
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d25c587a-cd42-47e9-8f11-7f2ef4b5d3d1
📒 Files selected for processing (22)
- bindings/python/src/smg/serve.py
- crates/grpc_client/src/trtllm_service.rs
- crates/protocols/src/chat.rs
- crates/reasoning_parser/src/parsers/base.rs
- crates/tokenizer/src/huggingface.rs
- crates/tokenizer/src/mock.rs
- crates/tokenizer/src/tiktoken.rs
- crates/tokenizer/src/traits.rs
- crates/tool_parser/src/parsers/kimik2.rs
- model_gateway/src/core/worker_manager.rs
- model_gateway/src/observability/metrics_server.rs
- model_gateway/src/routers/grpc/client.rs
- model_gateway/src/routers/grpc/harmony/builder.rs
- model_gateway/src/routers/grpc/harmony/stages/request_building.rs
- model_gateway/src/routers/grpc/regular/processor.rs
- model_gateway/src/routers/grpc/regular/responses/conversions.rs
- model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs
- model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs
- model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs
- model_gateway/src/routers/grpc/regular/streaming.rs
- model_gateway/src/routers/grpc/utils/chat_utils.rs
- model_gateway/tests/spec/chat_message.rs
```rust
/// Load merged EOS token IDs from config.json and generation_config.json.
///
/// Models may define different EOS tokens in each file (e.g. Kimi-K2.5 has
/// `[EOS]` (163585) in config.json and `<|im_end|>` (163586) in
/// generation_config.json). We merge both into a deduplicated list so the
/// engine can stop at any of them — matching SGLang's behaviour.
pub fn load_eos_token_ids(dir: &Path) -> Vec<TokenIdType> {
    let mut ids = std::collections::BTreeSet::new();

    // config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    // generation_config.json — eos_token_id can be int or list
    if let Ok(content) = std::fs::read_to_string(dir.join("generation_config.json")) {
        if let Ok(cfg) = serde_json::from_str::<serde_json::Value>(&content) {
            match cfg.get("eos_token_id") {
                Some(serde_json::Value::Number(n)) => {
                    if let Some(id) = n.as_u64() {
                        ids.insert(id as TokenIdType);
                    }
                }
                Some(serde_json::Value::Array(arr)) => {
                    for v in arr {
                        if let Some(id) = v.as_u64() {
                            ids.insert(id as TokenIdType);
                        }
                    }
                }
                _ => {}
            }
        }
    }

    ids.into_iter().collect()
}
```
Preserve tokenizer-derived EOS IDs when the JSON files don't declare them.
load_eos_token_ids() returns an empty vector if neither file has a numeric eos_token_id. Both this file and crates/tokenizer/src/huggingface.rs then assign that result straight into special_tokens.eos_token_ids, which wipes out EOS information even when special_tokens.eos_token was already resolved from tokenizer_config.json or vocab matching. For those models, TRT-LLM will receive no backend stop-token IDs at all. Seed the set from the resolved EOS token ID and then merge file-based IDs, rather than replacing everything with the file-only list.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/tokenizer/src/tiktoken.rs` around lines 45 - 97, load_eos_token_ids
currently builds an empty set when neither config.json nor
generation_config.json declare numeric eos_token_id, causing callers (e.g.,
assignments to special_tokens.eos_token_ids in load code and in
crates/tokenizer/src/huggingface.rs) to wipe out tokenizer-derived EOS IDs
resolved from tokenizer_config.json or vocab matching. Fix by seeding the
BTreeSet ids at the start of load_eos_token_ids with any already-resolved EOS
token ID(s) passed in (or by reading tokenizer_config-derived EOS prior to
merging) and then merge numeric IDs from config.json and generation_config.json
into that set so file-based IDs augment rather than replace existing
special_tokens.eos_token_ids; update callers to supply the tokenizer-derived EOS
if necessary (refer to load_eos_token_ids, special_tokens.eos_token_ids, and the
tokenizer_config.json resolution path in huggingface.rs).
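The suggested fix reduces to this shape: seed with the tokenizer-derived EOS, then merge the file-derived IDs. The signature is illustrative; the real load_eos_token_ids reads the JSON files itself.

```rust
use std::collections::BTreeSet;

/// Seed the set with the EOS ID already resolved from tokenizer_config.json
/// or vocab matching, then merge IDs found in config.json and
/// generation_config.json, so an empty file-derived list can never wipe out
/// the tokenizer-derived EOS. Illustrative signature, not the real one.
fn merge_eos_token_ids(resolved_eos: Option<u32>, file_ids: &[u32]) -> Vec<u32> {
    let mut ids = BTreeSet::new();
    if let Some(id) = resolved_eos {
        ids.insert(id);
    }
    ids.extend(file_ids.iter().copied());
    ids.into_iter().collect()
}

fn main() {
    // No file-derived IDs: the tokenizer-derived EOS survives.
    assert_eq!(merge_eos_token_ids(Some(163585), &[]), vec![163585]);
    // Both sources merge and deduplicate (BTreeSet output is sorted).
    assert_eq!(merge_eos_token_ids(Some(163585), &[163586, 163585]), vec![163585, 163586]);
}
```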
```rust
async fn collect_prometheus_multiproc_metrics() -> Result<String, String> {
    let dir = std::env::var("PROMETHEUS_MULTIPROC_DIR").map_err(|_| {
        "PROMETHEUS_MULTIPROC_DIR not set; cannot collect metrics from gRPC workers".to_string()
    })?;

    let output = tokio::process::Command::new("python3")
        .args([
            "-c",
            "import sys\n\
             from prometheus_client import CollectorRegistry, generate_latest\n\
             from prometheus_client.multiprocess import MultiProcessCollector\n\
             registry = CollectorRegistry()\n\
             MultiProcessCollector(registry)\n\
             sys.stdout.buffer.write(generate_latest(registry))\n",
        ])
        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
        .output()
        .await
        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;

    if !output.status.success() {
        let stderr = String::from_utf8_lossy(&output.stderr);
        return Err(format!("python3 prometheus collector failed: {stderr}"));
    }

    String::from_utf8(output.stdout)
        .map_err(|e| format!("prometheus collector output is not valid UTF-8: {e}"))
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Call sites =="
rg -n -C2 'collect_prometheus_multiproc_metrics|PROMETHEUS_MULTIPROC_DIR|prometheus_client|python3' .
echo
echo "== Candidate runtime / packaging files =="
fd -HI 'Dockerfile|dockerfile|requirements.*|pyproject.toml|uv.lock|poetry.lock|Pipfile|setup.py|Cargo.toml|compose.*' . \
| while read -r f; do
echo
echo "--- $f"
rg -n -C2 'python3|prometheus_client|prometheus-client|PROMETHEUS_MULTIPROC_DIR' "$f" || true
done

Repository: lightseekorg/smg
Length of output: 29259
🏁 Script executed:
# Check the full Dockerfile to see base image and any pip installs
cat -n ./docker/Dockerfile
# Also check if there are any hidden requirements files or setup files referenced
fd -HI 'requirements|setup.py|pyproject.toml' ./docker/

Repository: lightseekorg/smg
Length of output: 3219
🏁 Script executed:
# Check the Python bindings setup/dependency files
cat -n ./bindings/python/pyproject.toml
echo "---"
cat -n ./bindings/python/setup.py

Repository: lightseekorg/smg
Length of output: 2793
🏁 Script executed:
# Search for any other references to prometheus_client installation or imports
rg -i 'prometheus.client|prometheus-client' \
./bindings/python/ ./docker/ \
--type py --type txt --type toml --type yaml
# Check if it's supposed to come from worker installations
rg -i 'prometheus.client' ./e2e_test/ ./scripts/ \
--type py --type sh --type yaml
# Check model_gateway Cargo.toml for any Python-related deps
rg -A5 -B5 'prometheus|python' ./model_gateway/Cargo.toml

Repository: lightseekorg/smg
Length of output: 602
🏁 Script executed:
# Check if prometheus_client might come from worker dependencies or is conditionally installed
rg -i 'prometheus.client|prometheus-client' . --type py --type toml --type txt \
| grep -v node_modules | grep -v '.git' | head -20
# Also check if there's any grpc worker setup that might provide it
rg -B10 -A10 'PROMETHEUS_MULTIPROC_DIR' ./bindings/python/src/smg/serve.py

Repository: lightseekorg/smg
Length of output: 1906
Add prometheus_client to gateway dependencies and wrap the gRPC metrics subprocess with a timeout.
The gateway attempts to call python3 -c "from prometheus_client import ..." but prometheus_client is not listed in the smg package dependencies (bindings/python/pyproject.toml:29-34). Every gRPC metrics scrape will fail with a ModuleNotFoundError. Additionally, the subprocess call has no timeout, creating a hang risk. Add prometheus_client as a core dependency and wrap the .output().await with tokio::time::timeout() before release.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/core/worker_manager.rs` around lines 89 - 116,
collect_prometheus_multiproc_metrics currently invokes a Python subprocess that
imports prometheus_client but that package isn't declared in the smg/python
bindings dependencies and the subprocess has no timeout; add "prometheus_client"
to the core Python deps in bindings/python/pyproject.toml (so the
ModuleNotFoundError is resolved) and wrap the
tokio::process::Command.output().await call in tokio::time::timeout with a
sensible Duration (e.g., a few seconds), mapping a timeout error into a clear
Err result from collect_prometheus_multiproc_metrics; update error messages to
reflect timeout vs subprocess failures and reference the
collect_prometheus_multiproc_metrics function when making the changes.
```rust
// Strip leaked chatml tokens only when a model-specific parser is configured
if self.configured_tool_parser.is_some() || self.configured_reasoning_parser.is_some() {
    for token in [
        "<|im_end|>",
        "<|im_start|>",
        "<|im_user|>",
        "<|im_assistant|>",
        "<|im_system|>",
        "<|im_middle|>",
    ] {
        processed_text = processed_text.replace(token, "");
    }
    processed_text = processed_text.trim().to_string();
}
```
Non-streaming cleanup still misses part of the leaked-token set.
The streaming path now strips </think>, [EOS], and [BOS], but this non-streaming sanitizer only removes <|im_*|>. That means the final chat response can still return the exact markers this PR is trying to scrub. Please centralize the sanitizer and reuse it here and in the Messages block below.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 221 - 234,
The non-streaming cleanup in processor.rs currently only strips <|im_*|> tokens
inside the block guarded by self.configured_tool_parser/is_some but misses other
leaked tokens (</think>, [EOS], [BOS]) that the streaming path removes; create a
single sanitizer function (e.g., strip_leaked_tokens or sanitize_leaked_tokens)
on the impl (or a private helper) that accepts &str and returns a trimmed
String, move the full token list and replacement logic into it, and call this
helper from the existing processed_text cleanup block and from the Messages
block referenced in the review so both streaming and non-streaming paths reuse
the same sanitizer (refer to self.configured_tool_parser,
self.configured_reasoning_parser, processed_text, and the Messages handling
code).
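As a concrete shape for the shared helper, with the full token set the streaming path already removes. The name and free-function placement are suggestions; the real code would hang this off the processor impl.

```rust
/// Single sanitizer for both streaming and non-streaming paths, covering the
/// full leaked-token set that the streaming scrubber already removes.
/// Suggested helper, not the current processor.rs API.
fn strip_leaked_tokens(text: &str) -> String {
    const TOKENS: [&str; 9] = [
        "<|im_end|>", "<|im_start|>", "<|im_user|>",
        "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
        "</think>", "[EOS]", "[BOS]",
    ];
    let mut out = text.to_string();
    for token in TOKENS {
        out = out.replace(token, "");
    }
    out.trim().to_string()
}

fn main() {
    assert_eq!(strip_leaked_tokens("hi</think> there[EOS]<|im_end|>"), "hi there");
}
```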
```rust
if self.configured_tool_parser.is_some() && tool_parser_available {
    (tool_calls, processed_text) = self
        .parse_tool_calls(
            &processed_text,
            &messages_request.model,
            utils::message_utils::get_history_tool_calls_count_messages(
                &messages_request,
            ),
        )
        .await;
} else if used_json_schema {
```
Configured parsers currently bypass the constrained-output fallback in Messages.
This if / else if ordering differs from process_single_choice. For messages::ToolChoice::Tool / Any, used_json_schema is already true, but a configured parser takes this branch first. If it returns no native tool calls, the JSON-schema wrapper path is skipped entirely and the arguments-only JSON is left as plain text instead of becoming a ToolUse block. Make this match the chat path by skipping the configured parser for constrained output, or by using an independent if tool_calls.is_none() && used_json_schema fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/processor.rs` around lines 710 - 720,
the configured parser branch currently runs before the constrained-output
(used_json_schema) handling, causing configured parsers to bypass the
JSON-schema -> ToolUse fallback; update the logic in processor.rs around
configured_tool_parser and parse_tool_calls so it either (a) skips invoking the
configured parser when used_json_schema is true (matching process_single_choice
behavior), or (b) checks, after parse_tool_calls returns no tool_calls, whether
used_json_schema is true and then runs the JSON-schema wrapper / creates a
ToolUse block from the arguments-only JSON; ensure you reference and adjust the
conditionals around self.configured_tool_parser, tool_parser_available,
parse_tool_calls, used_json_schema, and the code path that constructs ToolUse so
the constrained-output fallback always runs when appropriate.
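Option (b) reduces to the following decision order, shown as a pure function. The enum and names are illustrative, not the processor's actual types:

```rust
#[derive(Debug, PartialEq)]
enum ToolCallSource {
    ConfiguredParser,
    JsonSchemaFallback,
    PlainText,
}

/// Try the configured parser first, but when it yields no tool calls and the
/// output was constrained by a JSON schema, fall through to the schema
/// wrapper instead of leaving arguments-only JSON as plain text.
fn pick_tool_call_source(
    parser_configured: bool,
    parser_found_calls: bool,
    used_json_schema: bool,
) -> ToolCallSource {
    if parser_configured && parser_found_calls {
        ToolCallSource::ConfiguredParser
    } else if used_json_schema {
        ToolCallSource::JsonSchemaFallback
    } else {
        ToolCallSource::PlainText
    }
}

fn main() {
    // A configured parser that finds nothing no longer blocks the fallback.
    assert_eq!(
        pick_tool_call_source(true, false, true),
        ToolCallSource::JsonSchemaFallback
    );
    assert_eq!(
        pick_tool_call_source(true, true, false),
        ToolCallSource::ConfiguredParser
    );
}
```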
```rust
// Step 3: Tokenize the processed text.
// Use add_special_tokens=true so the tokenizer recognizes special tokens
// (e.g. <|im_system|>, <|im_end|>, <think>) embedded in the chat template
// output as single token IDs rather than splitting them into characters.
let encoding = match tokenizer.encode(&processed_messages.text, true) {
```
🧩 Analysis chain
🏁 Script executed:
# First, let's explore the file structure and understand the context
find . -type f -name "preparation.rs" | head -20

Repository: lightseekorg/smg
Length of output: 544
🏁 Script executed:
# Check the full content of the preparation.rs file to understand the context
wc -l model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

Repository: lightseekorg/smg
Length of output: 129
🏁 Script executed:
# Read the relevant section and surrounding code for context
sed -n '115,150p' model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs

Repository: lightseekorg/smg
Length of output: 1639
🏁 Script executed:
# Find the process_chat_messages function to understand what processed_messages.text contains
rg "fn process_chat_messages|process_chat_messages\(" -A 10 -B 2

Repository: lightseekorg/smg
Length of output: 6687
🏁 Script executed:
# Look for the Messages API path mentioned in the review to understand the parallel issue
rg "Messages API|messages.*special.*token" -i

Repository: lightseekorg/smg
Length of output: 8668
🏁 Script executed:
# Read the process_chat_messages function implementation
sed -n '1,100p' model_gateway/src/routers/grpc/utils/chat_utils.rs

Repository: lightseekorg/smg
Length of output: 3502
🏁 Script executed:
# Continue reading chat_utils.rs
wc -l model_gateway/src/routers/grpc/utils/chat_utils.rs
sed -n '1,200p' model_gateway/src/routers/grpc/utils/chat_utils.rs | tail -100

Repository: lightseekorg/smg
Length of output: 4102
🏁 Script executed:
# Check the Messages API preparation file to see how it handles tokenization
sed -n '1,200p' model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

Repository: lightseekorg/smg
Length of output: 7700
🏁 Script executed:
# Look for where add_special_tokens is used in the Messages API path
rg "add_special_tokens|encode.*true|encode.*false" -B 3 -A 3

Repository: lightseekorg/smg
Length of output: 50372
🏁 Script executed:
# Find and read the complete process_chat_messages function
sed -n '275,450p' model_gateway/src/routers/grpc/utils/chat_utils.rs

Repository: lightseekorg/smg
Length of output: 6702
🏁 Script executed:
# Look for template handling and escaping
rg "apply_chat_template|render|escape" -A 5 -B 2 | head -200

Repository: lightseekorg/smg
Length of output: 9963
🏁 Script executed:
# Check what the ProcessedMessages struct contains and how the text is built
rg "struct ProcessedMessages|ProcessedMessages {" -A 10

Repository: lightseekorg/smg
Length of output: 211
🏁 Script executed:
# Search for ProcessedMessages struct definition
rg "struct ProcessedMessages" -A 15

Repository: lightseekorg/smg
Length of output: 767
🏁 Script executed:
# Find where apply_chat_template is called - the key part
rg "apply_chat_template" -B 5 -A 10 | head -100

Repository: lightseekorg/smg
Length of output: 6959
🏁 Script executed:
# Check if there's any escaping of user content before the template
sed -n '400,550p' model_gateway/src/routers/grpc/utils/chat_utils.rs

Repository: lightseekorg/smg
Length of output: 5300
🏁 Script executed:
# Check what the comment in messages/preparation.rs says about skip_special_tokens
grep -B 5 -A 5 "skip_special_tokens" model_gateway/src/routers/grpc/regular/stages/messages/preparation.rs

Repository: lightseekorg/smg
Length of output: 42
🏁 Script executed:
# Check tokenizer.rs to understand apply_chat_template behavior
find . -name "*.rs" -path "*/crates/tokenizer/*" -type f | head -10

Repository: lightseekorg/smg
Length of output: 445
🏁 Script executed:
# Search for any documentation or comments about template escaping
rg "escape|sanitiz|safe" model_gateway/src/routers/grpc/utils/chat_utils.rs -i

Repository: lightseekorg/smg
Length of output: 99
🏁 Script executed:
# Look at what happens with add_special_tokens in the embedding stage for comparison
sed -n '1,150p' model_gateway/src/routers/grpc/regular/stages/embedding/preparation.rs

Repository: lightseekorg/smg
Length of output: 2908
🏁 Script executed:
# Look for how chat template handles special tokens - check the chat_template implementation
cat crates/tokenizer/src/chat_template.rs | head -200

Repository: lightseekorg/smg
Length of output: 7206
🏁 Script executed:
# Check for any escaping in the tokenizer's template application
rg "apply_chat_template" -B 5 -A 15 crates/tokenizer/src/ | head -150

Repository: lightseekorg/smg
Length of output: 8679
🏁 Script executed:
# Look for test cases showing template + user content interaction
find . -name "*test*.rs" -o -name "*tests.rs" | xargs rg "apply_chat_template|special.*token" | head -100

Repository: lightseekorg/smg
Length of output: 3706
🏁 Script executed:
# Check if minijinja auto-escapes content or if there's explicit escaping
rg "escape|autoescape" crates/tokenizer/src/chat_template.rs

Repository: lightseekorg/smg
Length of output: 159
🏁 Script executed:
# Verify the template rendering doesn't have HTML/special-char escaping
sed -n '600,900p' crates/tokenizer/src/chat_template.rs

Repository: lightseekorg/smg
Length of output: 11423
🏁 Script executed:
# Check if there are any tests showing user content with special tokens
rg "<\\|im_end\\|>" model_gateway/tests/ crates/tokenizer/tests/

Repository: lightseekorg/smg
Length of output: 13742
🏁 Script executed:
# Look at the complete implementation of render_chat_template
rg "fn render_chat_template" -A 50 crates/tokenizer/src/chat_template.rs | head -100

Repository: lightseekorg/smg
Length of output: 2350
User-supplied special-token markers in chat content can be reinterpreted as control tokens.
When the chat template is rendered via apply_chat_template(), user messages are embedded directly without escaping. Then tokenizing with add_special_tokens=true causes the tokenizer to recognize sequences like <|im_end|> in user content as atomic special tokens rather than plain text, potentially altering prompt structure. Escape these markers in user content before template rendering, or disable special-token recognition and only apply it to the template scaffolding itself.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs` around
lines 130 - 134, The tokenizer call uses add_special_tokens=true on
processed_messages.text after apply_chat_template(), which allows user-supplied
sequences like "<|im_end|>" to be recognized as control tokens; to fix, ensure
user message content is escaped before template rendering (e.g., run an escape
function on strings fed into apply_chat_template()), or alternatively switch the
tokenizer.encode call to add_special_tokens=false for user content and only
apply special-token-aware encoding to the known template scaffolding; update
references around apply_chat_template(), processed_messages.text, and
tokenizer.encode so special markers from user input cannot be interpreted as
control tokens.
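One possible escaping scheme for the injection issue above, assuming a zero-width-space break is acceptable for the target models. This is a sketch, not the gateway's actual mitigation:

```rust
/// Break the "<|" digraph in user-supplied content before it reaches
/// apply_chat_template(), so the tokenizer can no longer match control
/// tokens such as <|im_end|> inside user text. Illustrative approach only;
/// a real fix might instead encode user content with
/// add_special_tokens=false.
fn escape_user_content(text: &str) -> String {
    // U+200B (zero-width space) keeps the text visually identical while
    // making it byte-distinct from the special token.
    text.replace("<|", "<\u{200B}|")
}

fn main() {
    let escaped = escape_user_content("ignore this <|im_end|> marker");
    assert!(!escaped.contains("<|im_end|>"));
}
```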
```rust
// Strip leaked chatml/think tokens when a parser is configured
let mut delta = delta;
if self.configured_tool_parser.is_some()
    || self.configured_reasoning_parser.is_some()
{
    for token in [
        "<|im_end|>", "<|im_start|>", "<|im_user|>",
        "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
        "</think>", "[EOS]", "[BOS]",
    ] {
        delta = delta.replace(token, "");
    }
}
```
Sanitize the flushed tail as well.
This scrubber only runs in the per-chunk path. The final stop_decoder.flush() content at Lines 471-490 bypasses it, so trailing ChatML / think / EOS text can still leak in the last SSE delta. Please extract this into a shared helper and reuse it in the flush path too.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 438 - 450,
The scrubber that removes ChatML/think/EOS tokens from a chunked delta is only
applied inside the per-chunk path; extract that logic into a shared helper
(e.g., fn scrub_delta(input: &str, configured_tool_parser: Option<...>,
configured_reasoning_parser: Option<...>) -> String) and call it both where
`delta` is currently sanitized and on the final `stop_decoder.flush()` output
before emitting the last SSE delta; reference the existing symbols
`configured_tool_parser`, `configured_reasoning_parser`, `delta`, and
`stop_decoder.flush()` so the new helper matches the current conditional
behavior and is reused in the flush path.
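The extraction the reviewer asks for could be sketched as a free function shared by both paths. The name `scrub_special_tokens` and the single `parser_configured` flag are illustrative stand-ins for the gateway's two `Option`-typed parser fields, not its actual API:

```rust
// Hypothetical shared scrubber; `parser_configured` stands in for the
// `configured_tool_parser.is_some() || configured_reasoning_parser.is_some()`
// check in the real code.
const LEAKED_TOKENS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn scrub_special_tokens(input: &str, parser_configured: bool) -> String {
    if !parser_configured {
        return input.to_string();
    }
    let mut out = input.to_string();
    for token in LEAKED_TOKENS {
        out = out.replace(token, "");
    }
    out
}

fn main() {
    // The same helper can serve the per-chunk path and the final
    // stop_decoder.flush() output before the last SSE delta.
    assert_eq!(scrub_special_tokens("done<|im_end|>", true), "done");
    assert_eq!(scrub_special_tokens("done<|im_end|>", false), "done<|im_end|>");
}
```

Calling this once per chunk and once on the flushed tail keeps the two paths from drifting apart.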
```rust
if is_specific_function
    && !(self.configured_tool_parser.is_some() && tool_parser_available)
{
```
ToolChoice::Tool can emit invalid Messages events when a configured parser exists.
When this condition is false, control falls into the generic parser branch even though streaming_tool_parser was created as the JSON parser for constrained output. That parser can emit parameters with name: None, and the loop below will then send InputJsonDelta before any ContentBlockStart, which violates the Anthropic event sequence. Match the chat streaming path here and always keep ToolChoice::Tool on the arguments-only branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/regular/streaming.rs` around lines 1805 -
1807, The condition allows ToolChoice::Tool to fall into the generic parser
branch even when streaming_tool_parser (the JSON/constrained parser) was
selected, causing parameters with name: None to emit InputJsonDelta before
ContentBlockStart; change the conditional logic around ToolChoice::Tool so it
always follows the arguments-only branch (i.e., keep ToolChoice::Tool out of the
generic parser path) by ensuring the check that currently reads "if
is_specific_function && !(self.configured_tool_parser.is_some() &&
tool_parser_available)" instead preserves ToolChoice::Tool on the arguments-only
branch regardless of configured_tool_parser/tool_parser_available, mirroring the
chat streaming path and preventing use of streaming_tool_parser for
ToolChoice::Tool.
```rust
// Generate TypeScript-style tool declarations for models that use it
// (e.g., Kimi K2.5 chat template checks for tools_ts_str)
if let Some(ref tools) = request.tools {
    if !tools.is_empty() {
        let ts_str = tools_to_typescript(tools);
        if !ts_str.is_empty() {
            combined_template_kwargs.insert(
                "tools_ts_str".to_string(),
                Value::String(ts_str),
            );
```
🧩 Analysis chain
🌐 Web query:
Do TypeScript type aliases and object property declarations allow unquoted identifiers containing hyphens or spaces, and what characters are allowed in OpenAI function/tool names?
💡 Result:
TypeScript: type aliases and object/type property names

- Type alias names must be a `BindingIdentifier` (`type BindingIdentifier = ...`), so they cannot contain spaces or hyphens unquoted. (sources.debian.org)
  - Invalid: `type user-id = string`, `type "user id" = string`
- Object/type property names (in object literals, interfaces, and type literals) can be an `IdentifierName` (unquoted), a string literal (quoted), a numeric literal, or a computed property name. Because hyphens and spaces are not valid in an ECMAScript `IdentifierName`, such property keys must be quoted. (sources.debian.org)
  - Valid: `type T = { "user-id": string; "user id": string }`, `const o = { "user-id": 1, "user id": 2 }`
  - Invalid (unquoted): `type T = { user-id: string }` (parsed like `user - id`), `type T = { user id: string }`

OpenAI function/tool names: allowed characters

An OpenAI function tool name must be at most 64 characters and contain only `A-Z`, `a-z`, `0-9`, `_` (underscore), and `-` (dash). Spaces (and characters like `.`) are not allowed. (platform.openai.com)

Citations:
- https://sources.debian.org/src/node-typescript/4.1.3-1/doc/TypeScript%20Language%20Specification%20%28Change%20Markup%29%20-%20ARCHIVED.pdf
- https://platform.openai.com/docs/api-reference/chat/create-chat-completion?utm_source=openai
Quote or sanitize function names in tools_ts_str.
This generator emits raw OpenAI tool names directly into TypeScript type declarations (e.g., type {func.name} = ...). OpenAI tool names can include hyphens (e.g., get-weather), but TypeScript type aliases require valid BindingIdentifiers and reject unquoted hyphens as syntax errors. The generated tools_ts_str will be syntactically invalid for any tool with a hyphenated name, breaking templates that rely on it.
Also applies to: 943–1121
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 411 - 420,
The generated TypeScript snippet inserted as "tools_ts_str" via
tools_to_typescript currently emits raw OpenAI tool names (from request.tools)
which may contain hyphens and produce invalid TypeScript identifiers; update
tools_to_typescript (and any generation logic that builds tools_ts_str used when
inserting into combined_template_kwargs) to either generate
quoted/string-literal type names (e.g., "'get-weather'" or `"\"get-weather\""`
in the TS output) or sanitize names into valid BindingIdentifiers (e.g.,
replace/normalize hyphens to underscores) and consistently use that sanitized
form in all places it’s referenced; ensure the change is applied across the
related generator code paths (including the tools_to_typescript function and
where tools_ts_str is inserted) so templates referencing tool names remain
syntactically valid.
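The "sanitize into valid BindingIdentifiers" option could be sketched as below; `sanitize_ts_identifier` is a hypothetical helper, not the gateway's actual `tools_to_typescript` code:

```rust
// Hypothetical sanitizer: map every character that is not legal in a
// TypeScript identifier to an underscore, and guard the leading digit case.
fn sanitize_ts_identifier(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    // A TypeScript BindingIdentifier cannot start with a digit.
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}

fn main() {
    // "get-weather" is a legal OpenAI tool name but not a legal
    // TypeScript type alias name; the sanitized form is.
    assert_eq!(sanitize_ts_identifier("get-weather"), "get_weather");
    assert_eq!(sanitize_ts_identifier("3d_render"), "_3d_render");
}
```

Whichever form is chosen, the sanitized name must be used consistently everywhere `tools_ts_str` references the tool.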
```rust
// Strip chatml / special tokens that may trail the constrained JSON output
let mut clean = processed_text.to_string();
for token in [
    "<|im_end|>",
    "<|im_start|>",
    "<|im_user|>",
    "<|im_assistant|>",
    "<|im_system|>",
    "<|im_middle|>",
    "</think>",
    "[EOS]",
    "[BOS]",
] {
    clean = clean.replace(token, "");
}
let clean = clean.trim();
```
Don't replace() control tokens inside JSON payloads.
This removes every occurrence of markers like </think> and [EOS], not just leaked prefix/suffix tokens. If a valid JSON string field contains one of those literals, the payload is mutated before parsing and the tool arguments/content change silently. Please trim these markers only at the boundaries of processed_text.
Also applies to: 567-631
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/routers/grpc/utils/chat_utils.rs` around lines 547 - 562,
The current loop in chat_utils.rs replaces control tokens globally in
processed_text (using clean.replace(...)), which can corrupt valid JSON string
fields; instead trim only leaked prefix/suffix markers from the start and end of
processed_text. Locate the block using the variables processed_text and clean
and replace the global replace() calls with logic that repeatedly strip known
tokens from the start (strip_prefix) and end (strip_suffix) or use anchored
regexes so only boundary occurrences like prefix/suffix markers (e.g.,
"<|im_start|>", "<|im_end|>", "</think>", "[EOS]", etc.) are removed; preserve
any occurrences of those tokens inside the JSON payload. Ensure the final clean
is trimmed of whitespace as before.
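The boundary-aware trimming this prompt describes could be sketched with `strip_prefix`/`strip_suffix`; the helper name is assumed:

```rust
// Strip known markers only from the start and end of the text, so any
// occurrence inside a JSON string value survives intact.
const MARKERS: &[&str] = &[
    "<|im_end|>", "<|im_start|>", "<|im_user|>",
    "<|im_assistant|>", "<|im_system|>", "<|im_middle|>",
    "</think>", "[EOS]", "[BOS]",
];

fn trim_leaked_markers(text: &str) -> &str {
    let mut s = text.trim();
    loop {
        let before = s;
        for m in MARKERS {
            s = s.strip_prefix(m).unwrap_or(s).trim_start();
            s = s.strip_suffix(m).unwrap_or(s).trim_end();
        }
        if s == before {
            return s;
        }
    }
}

fn main() {
    // The "[EOS]" and "</think>" inside the JSON string are preserved;
    // only the leaked prefix/suffix tokens are removed.
    let cleaned = trim_leaked_markers("[EOS]{\"note\":\"end</think>marker\"}<|im_end|>");
    assert_eq!(cleaned, "{\"note\":\"end</think>marker\"}");
}
```

The loop re-runs until no marker is stripped, so stacked trailers like `</think>[EOS]<|im_end|>` are all removed regardless of order.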
```rust
ChatMessage::Tool {
    content,
    tool_call_id,
    ..
} => {
```
Add an assertion for the new tool.name field.
This only loosens the pattern match. It would be better to also deserialize a payload with "name" and assert it round-trips, otherwise the new wire contract stays untested.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/tests/spec/chat_message.rs` around lines 66 - 70, The test's
pattern match on ChatMessage::Tool was loosened but doesn't assert the new
tool.name field; update the test in model_gateway/tests/spec/chat_message.rs to
deserialize a payload that includes "name", assert that ChatMessage::Tool
contains tool.name equal to the provided value, and also serialize it back
(round-trip) to ensure the wire contract; locate the match arm for
ChatMessage::Tool and add assertions against the tool.name field and a
serialization check to confirm round-trip fidelity.
When the gRPC worker hasn't written any .db files yet (startup phase or no requests processed), generate_latest() returns empty bytes. Passing this empty string to parse_prometheus() triggers a parse error WARN log on every scrape interval. Add guards at both the collector function (return Err for empty output) and the call site (skip empty text silently) to prevent log spam. Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com> Made-with: Cursor
```rust
        metrics_text: text,
    });
}
Ok(_) => {
```
🟡 Nit: This Ok(_) arm is unreachable. collect_prometheus_multiproc_metrics() (line 116-118) already returns Err when text.trim().is_empty(), so the Ok variant can never contain an empty string. The guard on line 339 is redundant too — it will always be true.
Consider simplifying to just Ok(text) => { ... } to avoid confusing future readers into thinking empty-Ok is a real case.
Actionable comments posted: 1
♻️ Duplicate comments (1)
model_gateway/src/core/worker_manager.rs (1)
Lines 94-107: ⚠️ Potential issue | 🟠 Major

Add a timeout around the Python metrics collector subprocess.

`.output().await` can block this handler indefinitely if `python3` startup or the `prometheus_client` import stalls, so `/metrics` turns into a hung request instead of a degraded scrape. Please bound this with `tokio::time::timeout(...)` and verify that `prometheus_client` is actually packaged anywhere the gateway runs.

🛠️ Suggested fix

```diff
-    let output = tokio::process::Command::new("python3")
-        .args([
-            "-c",
-            "import sys\n\
-             from prometheus_client import CollectorRegistry, generate_latest\n\
-             from prometheus_client.multiprocess import MultiProcessCollector\n\
-             registry = CollectorRegistry()\n\
-             MultiProcessCollector(registry)\n\
-             sys.stdout.buffer.write(generate_latest(registry))\n",
-        ])
-        .env("PROMETHEUS_MULTIPROC_DIR", &dir)
-        .output()
-        .await
-        .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
+    let output = tokio::time::timeout(
+        Duration::from_secs(3),
+        tokio::process::Command::new("python3")
+            .args([
+                "-c",
+                "import sys\n\
+                 from prometheus_client import CollectorRegistry, generate_latest\n\
+                 from prometheus_client.multiprocess import MultiProcessCollector\n\
+                 registry = CollectorRegistry()\n\
+                 MultiProcessCollector(registry)\n\
+                 sys.stdout.buffer.write(generate_latest(registry))\n",
+            ])
+            .env("PROMETHEUS_MULTIPROC_DIR", &dir)
+            .output(),
+    )
+    .await
+    .map_err(|_| "collect_prometheus_multiproc_metrics timed out".to_string())?
+    .map_err(|e| format!("failed to run python3 prometheus collector: {e}"))?;
```

Verification script used:

```bash
#!/bin/bash
set -euo pipefail

echo "== collector helper =="
sed -n '88,120p' model_gateway/src/core/worker_manager.rs

echo
echo "== packaging/runtime declarations for prometheus_client =="
fd -HI 'pyproject.toml|setup.py|requirements*.txt|Dockerfile|dockerfile|Pipfile|uv.lock|poetry.lock' . \
  | while read -r f; do
      echo "--- $f"
      rg -n -C2 'prometheus_client|python3' "$f" || true
    done
```
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/core/worker_manager.rs` around lines 94 - 107, Wrap the tokio process call that creates the python3 prometheus collector (the Command building code that currently ends with .output().await and assigns to output) with tokio::time::timeout(...) to bound how long we wait for the subprocess to start and return; on timeout return a mapped error like "prometheus collector timed out" (same error type as the existing map_err) so the handler returns promptly, and ensure you still propagate other I/O errors from .output(); keep the existing env("PROMETHEUS_MULTIPROC_DIR", &dir) and variable name output unchanged so the rest of the function can use it unchanged.
```rust
if text.trim().is_empty() {
    return Err("no metrics available from gRPC workers yet".to_string());
}
```
Treat empty multiprocess output as a soft skip, not an error.
Line 116 converts the “no .db files yet” case into Err(...), but Line 345 only skips silently on Ok(_). That makes the skip branch unreachable and will warn on every scrape during warmup/idle periods. Return Ok(String::new()) here, or special-case this sentinel before warn!.
🧹 Suggested fix
```diff
-    if text.trim().is_empty() {
-        return Err("no metrics available from gRPC workers yet".to_string());
-    }
+    if text.trim().is_empty() {
+        return Ok(String::new());
+    }
```

Also applies to: 345-350
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@model_gateway/src/core/worker_manager.rs` around lines 116 - 118, The check
that treats empty multiprocess output as an error (the branch with
text.trim().is_empty()) should instead return a soft-skip sentinel so
warmup/idle periods don't produce warnings; replace the Err(...) with
Ok(String::new()) (or otherwise return an empty String) in the function
containing the text.trim().is_empty() check, and update the later handling
around the warn! branch (the logic at the Ok(_) / warn! region) to special-case
this empty-string sentinel before logging a warning so the skip is reachable and
no spurious warnings occur.
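A minimal sketch of the soft-skip behavior; the function shapes below are hypothetical stand-ins for `collect_prometheus_multiproc_metrics` and its caller:

```rust
// The empty-string Ok is the soft-skip sentinel for the warmup/idle case.
fn collect_metrics(raw_scrape: &str) -> Result<String, String> {
    let text = raw_scrape.trim();
    if text.is_empty() {
        // No .db files written yet: soft skip, not an error.
        return Ok(String::new());
    }
    Ok(text.to_string())
}

fn handle_scrape(raw_scrape: &str) -> Option<String> {
    match collect_metrics(raw_scrape) {
        Ok(text) if text.is_empty() => None, // skip silently, no warn!
        Ok(text) => Some(text),
        Err(e) => {
            eprintln!("metrics scrape failed: {e}");
            None
        }
    }
}

fn main() {
    assert_eq!(handle_scrape("   "), None);
    assert_eq!(handle_scrape("up 1"), Some("up 1".to_string()));
}
```

With the sentinel, the `Ok(_)` skip branch is actually reachable and warmup scrapes produce no log output at all.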
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98bd903b4c
```rust
// Pass merged EOS token IDs from config.json + generation_config.json.
// TRT-LLM's gRPC path does not reliably merge these internally,
// so we provide them explicitly via the standard stop_token_ids field.
let stop_token_ids: Vec<u32> = eos_token_ids.to_vec();
```
Respect ignore_eos when injecting EOS stop token IDs
This unconditionally copies merged EOS IDs into stop_token_ids, but stop_token_ids are hard stop conditions while ignore_eos is meant to continue generation past EOS. As written, requests with ignore_eos=true will still terminate when an EOS token appears, which breaks the documented ignore_eos behavior for TRT-LLM chat/harmony requests.
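The guarded injection could be sketched as follows; the helper name and parameters are assumed, not the real TRT-LLM request construction:

```rust
// EOS IDs become hard stop conditions only when the request did not
// set ignore_eos, preserving the documented ignore_eos semantics.
fn build_stop_token_ids(eos_token_ids: &[u32], ignore_eos: bool) -> Vec<u32> {
    if ignore_eos {
        // ignore_eos asks to generate past EOS, so do not inject them.
        Vec::new()
    } else {
        eos_token_ids.to_vec()
    }
}

fn main() {
    // Kimi-K2.5 merged EOS IDs from the PR description.
    let eos = [163585u32, 163586];
    assert_eq!(build_stop_token_ids(&eos, false), vec![163585, 163586]);
    assert!(build_stop_token_ids(&eos, true).is_empty());
}
```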
```rust
    "[EOS]",
    "[BOS]",
] {
    clean = clean.replace(token, "");
```
Avoid mutating JSON payloads when stripping special tokens
This removes token substrings with global replace before parsing constrained JSON, so valid argument values containing literals like [EOS], </think>, or <|im_end|> get silently altered (for example a tool argument string can lose content). That corrupts structured outputs/tool inputs rather than just trimming leaked trailer tokens, so cleanup should be boundary-aware instead of replacing everywhere in the JSON text.
Hi @ConnorLi96, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

```shell
git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease
```
Description

Problem

Function calling with Kimi-K2.5 (NVFP4) is severely broken on SMG — fc-dash scores 44/60 (SMG+SGLang) and 18/60 (SMG+TRT-LLM) vs 58/60 on pure SGLang direct. Multiple stages in the SMG inference pipeline have issues that compound to produce wrong or missing tool calls, empty structured outputs, and native token leaks.

Solution

Fix function calling across the full SMG inference pipeline, organized by stage:

1. Input Tokenization — Fix Rust tiktoken encoder to use `encode_with_special_tokens` for chat template output, ensuring special tokens like `<|im_system|>`, `<|im_end|>` are encoded as single token IDs instead of being split into character-level BPE tokens (`tiktoken.rs`, `preparation.rs`)
2. Chat Template Rendering — Add `name: Option<String>` to `ChatMessage::Tool` so the Jinja2 template resolves `message.get('name')` correctly (e.g. tool role renders as function name instead of literal `"tool"`). Keep tool call arguments as strings instead of parsing to `serde_json::Value` to preserve original formatting like `{"query": "weather"}` vs `{"query":"weather"}` (`chat.rs`, `chat_utils.rs`)
3. Reasoning Parser — Skip reasoning parsing when output is constrained (specific `tool_choice` or `response_format: json_object`/`json_schema`), preventing the parser from swallowing constrained JSON output as reasoning content. Add `<|tool_calls_section_begin|>` detection as a split point when `</think>` is absent (`processor.rs`, `streaming.rs`, `base.rs`)
4. Tool Call Parser — Support hyphens in KimiK2 tool call IDs (e.g. `functions.execute-tool:0`) and alternative delimiters `<|func_start|>`/`<|func_end|>`. Add fallback from configured parser to JSON schema parser for `tool_choice=specific_function` when native tokens are absent (`kimik2.rs`, `processor.rs`)
5. Stop Token / EOS Handling — Load merged EOS token IDs from both `config.json` (`[EOS]`=163585) and `generation_config.json` (`<|im_end|>`=163586), matching SGLang's `_get_hf_eos_token_id()` merge logic. Expose `eos_token_ids()` on the `Tokenizer` trait and pass them as `stop_token_ids` in TRT-LLM gRPC requests. This fixes TRT-LLM generating past turn boundaries (18/60 → 55/60) (`tiktoken.rs`, `traits.rs`, `trtllm_service.rs`)
6. Output Cleanup — Strip leaked chatml tokens and `[EOS]`/`[BOS]` text from both streaming and non-streaming output paths, conditioned on parser configuration (`processor.rs`, `streaming.rs`, `chat_utils.rs`)

Other (separate concern): Graceful metrics server bind failure — log error instead of panicking when port is in use (`metrics_server.rs`). This should be a separate PR.
metrics_server.rs). This should be a separate PR.Summary by CodeRabbit
New Features
Bug Fixes
Improvements