
feat(multimodal): add Kimi-K2.5 vision support for gRPC router#1044

Merged
CatherineSue merged 26 commits into lightseekorg:main from Kangyan-Zhou:feat/kimi-k25-vision-grpc
Apr 15, 2026

Conversation

@Kangyan-Zhou Kangyan-Zhou commented Apr 7, 2026

Summary

Add multimodal (image) support for moonshotai/Kimi-K2.5 in the gRPC PD router, matching the HTTP path's accuracy and improving TTFT at high concurrency.

  • KimiK25VisionSpec: model registry spec with <|media_pad|> placeholder, grid_thws field layout, media_placeholder_token_id from config
  • KimiK25Processor: standalone image preprocessor matching HF's navit_resize + zero-pad pipeline, producing 4D [N, 3, 14, 14] patches for MoonViT's Conv2d
  • PreProcessorConfig: parse nested media_proc_cfg from Kimi's non-standard preprocessor_config.json
  • Tiktoken encoding: use encode_with_special_tokens so chat template special tokens (e.g., <|media_pad|>) are recognized as single token IDs
  • HF Hub config download: download_model_configs_from_hf fetches config.json + preprocessor_config.json when not available locally
  • Performance: fused resize+pad+normalize, SIMD resize (fast_image_resize), spawn_blocking for preprocessing, strip mm_inputs from decode worker
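To make the 4D patch layout concrete, here is a dependency-free sketch of splitting a channel-major (3, H, W) canvas into 14x14 patches, matching the [N, 3, 14, 14] shape described above. The `extract_patches` helper and the flat Vec layout are illustrative assumptions, not the crate's actual code, which operates on ndarray tensors.

```rust
// Sketch: split a channel-major (3, H, W) f32 canvas into 14x14 patches,
// yielding N = (H/14) * (W/14) patches, each laid out as (3, 14, 14).
const PATCH: usize = 14;

fn extract_patches(canvas: &[f32], h: usize, w: usize) -> Vec<Vec<f32>> {
    assert_eq!(canvas.len(), 3 * h * w);
    assert!(h % PATCH == 0 && w % PATCH == 0, "canvas must be pre-padded");
    let (grid_h, grid_w) = (h / PATCH, w / PATCH);
    let mut patches = Vec::with_capacity(grid_h * grid_w);
    for gy in 0..grid_h {
        for gx in 0..grid_w {
            let mut patch = Vec::with_capacity(3 * PATCH * PATCH);
            for c in 0..3 {
                for py in 0..PATCH {
                    // Copy one 14-pixel row of one channel into the patch.
                    let row = gy * PATCH + py;
                    let base = c * h * w + row * w + gx * PATCH;
                    patch.extend_from_slice(&canvas[base..base + PATCH]);
                }
            }
            patches.push(patch);
        }
    }
    patches
}

fn main() {
    // A padded 28x28 canvas yields a 2x2 grid, i.e. 4 patches of 3*14*14 values.
    let canvas = vec![0.0f32; 3 * 28 * 28];
    let patches = extract_patches(&canvas, 28, 28);
    println!("{} patches of {} values each", patches.len(), patches[0].len());
}
```

Each resulting patch feeds MoonViT's Conv2d as one element of the N dimension.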

Validation

  • Accuracy: MMMU-Pro 1730 samples — gRPC 78.3% (KVV thinking mode)
  • TTFT (MMMU images, bench_serving):
| Concurrency | gRPC Median TTFT | HTTP Median TTFT | gRPC P99 TTFT | HTTP P99 TTFT |
|-------------|------------------|------------------|---------------|---------------|
| 1           | 185ms            | 212ms            | 617ms         | 566ms         |
| 10          | 488ms            | 524ms            | 795ms         | 797ms         |
| 50          | 1,759ms          | 2,135ms          | 2,360ms       | 2,825ms       |
| 100         | 4,262ms          | 5,105ms          | 4,855ms       | 6,333ms       |
  • Throughput (MMMU, concurrency 100): gRPC 3,741 tok/s vs HTTP 3,257 tok/s (+15%)

Test plan

  • cargo test -p llm-multimodal -- kimi (17 tests)
  • cargo test -p llm-tokenizer (103 tests including special token encoding)
  • pre-commit run --all-files clean
  • Smoke test: image correctly identified via gRPC
  • 50-sample MMMU-Pro accuracy check (80%+)
  • Full 1730-sample MMMU-Pro eval via KVV (78.3%)
  • TTFT benchmark at concurrency 1/10/30/50/100

Replaces #1026 (branch renamed for naming convention compliance, DCO sign-offs added)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added support for Kimi K2.5 vision models with full image preprocessing and multi-image handling (up to 10 images).
    • Added ability to strip multimodal pixel payloads from outgoing generate requests.
  • Improvements

    • Tokenizer now recognizes special-token strings in input as single tokens.
    • Image preprocessing runs off the main thread to reduce blocking and better error reporting.
    • Multimodal config parsing accepts nested Kimi-style fields and respects a media placeholder token ID.
  • Tests

    • Added unit tests covering Kimi preprocessing, prompt/token behavior, and config parsing.

@github-actions github-actions bot added tokenizer Tokenizer related changes grpc gRPC client and router changes multimodal Multimodal crate changes model-gateway Model gateway crate changes labels Apr 7, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be79785a5b


Comment thread crates/tokenizer/src/tiktoken.rs

for image in images {
let (w, h) = image.dimensions();
let cfg = self.compute_resize_config(w as usize, h as usize);


P2: Use preprocessor config for Kimi resize/token limits

preprocess() computes resize, patch counts, and num_img_tokens from self.compute_resize_config(...), which only reads struct fields initialized by KimiK25Processor::new(). Because this path never applies PreProcessorConfig limits (media_proc_cfg values like in_patch_limit / patch_limit_on_one_side), runtime behavior is locked to hardcoded defaults even though those fields are parsed. If a model config differs from defaults, placeholder expansion and patch layout will diverge from model expectations.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Kimi-K2.5 (MoonViT) model, including its specific vision specification and image processor. It updates the preprocessor configuration to handle nested JSON structures and optimizes the gateway by moving image preprocessing to a blocking thread pool and stripping multimodal data from decode requests. Review feedback suggests improving the efficiency of JSON parsing by utilizing existing deserialized data and moving image cloning operations into the blocking task to prevent blocking the async runtime.

Comment on lines 236 to 288
pub fn from_json(json: &str) -> Result<Self, serde_json::Error> {
serde_json::from_str(json)
let mut config: Self = serde_json::from_str(json)?;
// Extract values from nested media_proc_cfg (used by Kimi-K2.5 and similar models)
// when top-level fields are missing.
{
if let Ok(raw) = serde_json::from_str::<serde_json::Value>(json) {
if let Some(media_cfg) = raw.get("media_proc_cfg") {
if config.image_mean.is_none() {
config.image_mean = media_cfg
.get("image_mean")
.and_then(|v| serde_json::from_value(v.clone()).ok());
}
if config.image_std.is_none() {
config.image_std = media_cfg
.get("image_std")
.and_then(|v| serde_json::from_value(v.clone()).ok());
}
if config.patch_size.is_none() {
config.patch_size = media_cfg.get("patch_size").and_then(|v| {
v.as_u64().map(|ps| PatchSize {
height: Some(ps as u32),
width: Some(ps as u32),
})
});
}
if config.merge_size.is_none() {
config.merge_size = media_cfg
.get("merge_kernel_size")
.and_then(|v| v.as_u64())
.map(|v| v as usize);
}
// Also extract Kimi-specific limits into the extra map
// so processors can read them via get_extra()
for key in ["in_patch_limit", "patch_limit_on_one_side"] {
if !config.extra.contains_key(key) {
if let Some(v) = media_cfg.get(key) {
config.extra.insert(key.to_string(), v.clone());
}
}
}
}
}
}
Ok(config)
}

/// Parse from JSON value.
///
/// Handles nested `media_proc_cfg` the same way as `from_json`.
pub fn from_value(value: serde_json::Value) -> Result<Self, serde_json::Error> {
serde_json::from_value(value)
let json_str = serde_json::to_string(&value)?;
Self::from_json(&json_str)
}

Severity: medium

The current implementation of from_json and from_value is inefficient as it parses the JSON string multiple times and performs unnecessary stringification. Since PreProcessorConfig uses #[serde(flatten)] for the extra field, any non-standard fields like media_proc_cfg are already available in the extra map after the first deserialization. Additionally, avoid silently ignoring potential failures during field extraction; instead, log them as warnings to aid in debugging, as per repository guidelines.

    pub fn from_json(json: &str) -> Result<Self, serde_json::Error> {
        let mut config: Self = serde_json::from_str(json)?;
        if let Some(media_cfg) = config.extra.get("media_proc_cfg").cloned() {
            Self::apply_kimi_patches(&mut config, &media_cfg);
        }
        Ok(config)
    }

    pub fn from_value(value: serde_json::Value) -> Result<Self, serde_json::Error> {
        let mut config: Self = serde_json::from_value(value)?;
        if let Some(media_cfg) = config.extra.get("media_proc_cfg").cloned() {
            Self::apply_kimi_patches(&mut config, &media_cfg);
        }
        Ok(config)
    }

    fn apply_kimi_patches(config: &mut Self, media_cfg: &serde_json::Value) {
        if config.image_mean.is_none() {
            config.image_mean = media_cfg.get("image_mean").and_then(|v| {
                serde_json::from_value(v.clone())
                    .map_err(|e| log::warn!("Failed to parse image_mean from media_proc_cfg: {}", e))
                    .ok()
            });
        }
        if config.image_std.is_none() {
            config.image_std = media_cfg.get("image_std").and_then(|v| {
                serde_json::from_value(v.clone())
                    .map_err(|e| log::warn!("Failed to parse image_std from media_proc_cfg: {}", e))
                    .ok()
            });
        }
        if config.patch_size.is_none() {
            config.patch_size = media_cfg.get("patch_size").and_then(|v| {
                serde_json::from_value(v.clone())
                    .map_err(|e| log::warn!("Failed to parse patch_size from media_proc_cfg: {}", e))
                    .ok()
                    .flatten()
            });
        }
        if config.merge_size.is_none() {
            config.merge_size = media_cfg.get("merge_kernel_size").and_then(|v| v.as_u64()).map(|v| v as usize);
        }
        for key in ["in_patch_limit", "patch_limit_on_one_side"] {
            if !config.extra.contains_key(key) {
                if let Some(v) = media_cfg.get(key) {
                    config.extra.insert(key.to_string(), v.clone());
                }
            }
        }
    }
References
  1. Instead of silently ignoring potential failures (e.g., from serialization), log them as warnings to aid in debugging. In Rust, prefer using unwrap_or_else to log an error over unwrap_or_default which would fail silently.
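A minimal illustration of that guideline, with a hypothetical `parse_patch_size` helper; `eprintln!` stands in for `log::warn!` to keep the sketch dependency-free:

```rust
// Prefer unwrap_or_else (which can log the failure) over
// unwrap_or_default (which swallows it silently).
fn parse_patch_size(raw: &str) -> u64 {
    raw.parse::<u64>().unwrap_or_else(|e| {
        eprintln!("warn: failed to parse patch_size {raw:?}: {e}; using default 14");
        14
    })
}

fn main() {
    assert_eq!(parse_patch_size("28"), 28);
    assert_eq!(parse_patch_size("oops"), 14); // logs a warning, then falls back
}
```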

Comment thread model_gateway/src/routers/grpc/multimodal.rs

coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds Kimi K2.5 vision support: new KimiK25VisionSpec and KimiK25Processor, registry and image-processor registrations, nested media-config extraction, tokenizer special-token encoding, and gateway/proto changes to run preprocessing on a blocking thread and optionally clear multimodal inputs.

Changes

  • Vision Spec & Registry (crates/multimodal/src/registry/kimi_k25.rs, crates/multimodal/src/registry/mod.rs): Adds KimiK25VisionSpec registered as "kimi_k25". Matches by model_id/model_type, requires media_placeholder_token_id, limits images, declares field layouts (pixel_values flat on patches_per_image, grid_thws & patches_per_image batched, grid_thws CPU), and produces per-image PromptReplacements using preprocessed.num_img_tokens.
  • Image Processor Implementation & Registration (crates/multimodal/src/vision/processors/kimi_k25.rs, crates/multimodal/src/vision/processors/mod.rs, crates/multimodal/src/vision/image_processor.rs): Adds KimiK25Processor with fused resize→pad→normalize, patch extraction, token counting, and many unit tests. Exposes the processor in the processors mod and registers default name patterns ("kimi-k2", "kimi_k2").
  • Preprocessor Config Parsing (crates/multimodal/src/vision/preprocessor_config.rs): from_json/from_value now support nested media_proc_cfg extraction to populate missing top-level fields (image_mean, image_std, patch_size, merge_kernel_size) and copy Kimi-specific extras; unit test added.
  • Tokenizer (crates/tokenizer/src/tiktoken.rs): Encoder::encode now calls encode_with_special_tokens, so special-token substrings in raw input are recognized as single tokens; unit test added.
  • gRPC / Gateway Multimodal Flow (grpc_servicer/.../servicer.py, model_gateway/src/routers/grpc/multimodal.rs, model_gateway/src/routers/grpc/common/stages/request_execution.rs, model_gateway/src/routers/grpc/proto_wrapper.rs): Image preprocessing moved into tokio::task::spawn_blocking; decode path uses a cleared-mm-input clone via new ProtoGenerateRequest::clear_mm_inputs(); placeholder-token resolution logs failures and falls back to tokenizer; pixel serialization prefers as_slice() with as_slice_memory_order() fallback; small import/docstring/construct changes in servicer.
  • Tests / Small edits (various new unit tests across multimodal and tokenizer crates): Adds tests for spec matching, prompt replacement counts and token IDs, Kimi preprocessing behaviors (resize/pad/patches/tokens), tokenizer special-token handling, and small import/control-flow cleanups.

Sequence Diagram(s)

sequenceDiagram
    participant GW as gRPC Gateway
    participant Reg as Vision Registry
    participant Proc as KimiK25Processor
    participant Spec as KimiK25VisionSpec
    participant TK as Tokenizer

    GW->>Reg: lookup processor by model_id / model_type
    Reg-->>GW: KimiK25Processor

    GW->>+Proc: spawn_blocking preprocess(images, config)
    Proc->>Proc: read config (patch_size, merge_size, limits)
    Proc->>Proc: resize → pad → normalize (fused)
    Proc->>Proc: extract patches, compute grid_thws & num_img_tokens
    Proc-->>-GW: PreprocessedImages {pixel_values, num_img_tokens, metadata}

    GW->>Spec: match model_id / config.model_type
    Spec-->>GW: require media_placeholder_token_id / field layouts

    GW->>Spec: resolve media_placeholder_token_id(config)
    Spec-->>GW: token_id or MissingConfigField

    GW->>TK: encode placeholder token (fallback)
    TK-->>GW: token_id

    GW->>Spec: generate_prompt_replacements(preprocessed)
    Spec-->>GW: PromptReplacements per image

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

tests

Suggested reviewers

  • slin1237
  • key4ng

Poem

🐰 A rabbit hops across the patch,
Resizes, pads, then makes a match.
Tokens snug in placeholder's nest,
Preprocessing off the async crest.
Pixels prance — the model's guests! 🥕✨

🚥 Pre-merge checks: ✅ 3 of 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and specifically summarizes the main objective of the PR: adding Kimi-K2.5 vision support to the gRPC router, which is directly reflected in the file changes.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/multimodal/src/vision/preprocessor_config.rs`:
- Around line 253-259: The current logic that sets config.patch_size from
media_cfg only handles the integer form (v.as_u64) and ignores the object form
like {"height":14,"width":14}; update the extraction for config.patch_size (the
block that reads media_cfg.get("patch_size")) to also check for
v.as_object()/v.as_object().and_then(...) and map optional "height" and "width"
fields (as_u64 -> u32) into a PatchSize { height: Some(...), width: Some(...) }
fallbacking to the scalar behavior if the value is a single integer; preserve
existing None handling so defaults aren't silently used.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs`:
- Around line 96-125: compute_resize_config currently computes scale using
unaligned patch counts (s1) which can be exceeded after aligning to factor and
applying pad_width/pad_height; update compute_resize_config (and ResizeConfig
output) to ensure the final aligned token count <= self.in_patch_limit: after
computing new_w/new_h, factor, pad_width, pad_height and deriving
token_width/token_height/num_tokens, if num_tokens > self.in_patch_limit then
reduce new_w and/or new_h (for example decrement by factor steps or compute a
tightened scale from in_patch_limit divided by aligned patch grid) and recompute
pad_width/pad_height/token dimensions until num_tokens <= self.in_patch_limit;
ensure you reference and update symbols new_w, new_h, pad_width, pad_height,
token_width, token_height, num_tokens and return the adjusted values in
ResizeConfig.
- Around line 67-78: from_preprocessor_config() correctly parses patch_size,
merge_size, in_patch_limit and patch_limit_on_one_side from PreProcessorConfig
but those per-request overrides are never applied in the live path; update
preprocess() and calculate_num_tokens() to compute and use an effective set of
processor settings by merging the call's media_proc_cfg/PreProcessorConfig (or
the media_proc_cfg passed through ImageProcessorRegistry) with the instance
defaults (self.patch_size, self.merge_size, self.in_patch_limit,
self.patch_limit_on_one_side) and then use those local effective values for
layout and placeholder calculations instead of always reading self.* so
per-request non-default Kimi config is respected.
- Around line 184-190: Replace the panicking
Array3::from_shape_vec(...).expect(...) calls with error-aware code that maps
the shape/vec construction failure into a TransformError return; e.g., call
Array3::from_shape_vec((3, canvas_h, canvas_w), data) and propagate its Err via
Result::map_err into an appropriate TransformError value (with an "INVARIANT:
data has exactly 3*canvas_h*canvas_w elements by construction" comment next to
the mapping), updating the enclosing function signature to return Result<_,
TransformError> if needed; apply the same pattern to the other expect() usages
around lines 210-216 so preprocessing shape/layout failures produce
Err(TransformError) rather than panicking.

In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 439-450: The encode() implementation currently ignores the
add_special_tokens flag and always calls
self.tokenizer.encode_with_special_tokens(input); change it to respect the
caller intent by calling encode_with_special_tokens only when add_special_tokens
is true, and call the tokenizer method that preserves literal special-token
strings (e.g., encode or encode_without_special_tokens) when add_special_tokens
is false, returning Encoding::Tiktoken(...) as before; if Kimi needs placeholder
recognition independent of BOS/EOS, add an explicit flag/parameter instead of
repurposing add_special_tokens.

In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 434-453: The current code clones every image on the async runtime
before calling tokio::task::spawn_blocking (the images.iter().map(|f|
f.image.clone()).collect() line), which performs the heavy pixel copy off the
blocking pool too late; move the deep image cloning into the blocking closure so
the async task only moves lightweight handles into spawn_blocking. Specifically,
change the closure passed to tokio::task::spawn_blocking to capture/move the
original images (and pp_config, registry, model_id/model_type) and perform image
cloning and processor.preprocess(...) inside that closure (keeping the existing
error mapping), so the async runtime doesn't perform the O(total_pixels) copy
before spawn_blocking.
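The last point can be sketched with std::thread::spawn standing in for tokio::task::spawn_blocking, and an Arc<Vec<u8>> standing in for the shared image handle; all names here are illustrative, not the gateway's actual types:

```rust
use std::sync::Arc;
use std::thread;

// Move only a cheap Arc handle into the worker, and perform the
// O(total_pixels) deep copy on the worker thread rather than on the
// caller (the async runtime, in the real code).
fn preprocess_off_thread(image: Arc<Vec<u8>>) -> usize {
    let handle = thread::spawn(move || {
        let owned: Vec<u8> = (*image).clone(); // heavy copy happens off-caller
        owned.len() // placeholder for processor.preprocess(...)
    });
    handle.join().expect("worker panicked")
}

fn main() {
    let image = Arc::new(vec![0u8; 1024]);
    println!("processed {} bytes", preprocess_off_thread(Arc::clone(&image)));
}
```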


📥 Commits

Reviewing files that changed from the base of the PR and between c89dc31 and be79785.

📒 Files selected for processing (11)
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/multimodal/src/vision/image_processor.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processors/kimi_k25.rs
  • crates/multimodal/src/vision/processors/mod.rs
  • crates/tokenizer/src/tiktoken.rs
  • grpc_servicer/smg_grpc_servicer/sglang/servicer.py
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment on lines +253 to +259
if config.patch_size.is_none() {
config.patch_size = media_cfg.get("patch_size").and_then(|v| {
v.as_u64().map(|ps| PatchSize {
height: Some(ps as u32),
width: Some(ps as u32),
})
});

⚠️ Potential issue | 🟡 Minor

Nested media_proc_cfg.patch_size only supports integer form.

The new nested extraction ignores object-form patch size (e.g. {"height":14,"width":14}), even though this config type is supported elsewhere. That can silently fall back to defaults and break preprocessing dimensions.

🔧 Proposed fix
                     if config.patch_size.is_none() {
                         config.patch_size = media_cfg.get("patch_size").and_then(|v| {
-                            v.as_u64().map(|ps| PatchSize {
-                                height: Some(ps as u32),
-                                width: Some(ps as u32),
-                            })
+                            v.as_u64()
+                                .and_then(|ps| u32::try_from(ps).ok())
+                                .map(|ps| PatchSize {
+                                    height: Some(ps),
+                                    width: Some(ps),
+                                })
+                                .or_else(|| {
+                                    let h = v.get("height")?.as_u64().and_then(|x| u32::try_from(x).ok())?;
+                                    let w = v
+                                        .get("width")
+                                        .and_then(|x| x.as_u64())
+                                        .and_then(|x| u32::try_from(x).ok())
+                                        .unwrap_or(h);
+                                    Some(PatchSize {
+                                        height: Some(h),
+                                        width: Some(w),
+                                    })
+                                })
                         });
                     }

Comment on lines +67 to +78
pub fn from_preprocessor_config(config: &PreProcessorConfig) -> Self {
Self {
patch_size: config.get_patch_size(DEFAULT_PATCH_SIZE),
merge_size: config.merge_size.unwrap_or(DEFAULT_MERGE_SIZE),
in_patch_limit: config
.get_extra::<usize>("in_patch_limit")
.unwrap_or(DEFAULT_IN_PATCH_LIMIT),
patch_limit_on_one_side: config
.get_extra::<usize>("patch_limit_on_one_side")
.unwrap_or(DEFAULT_PATCH_LIMIT_ON_ONE_SIDE),
}
}

⚠️ Potential issue | 🟠 Major

media_proc_cfg overrides are parsed but never applied in the live path.

from_preprocessor_config() reads patch_size, merge_size, and the Kimi patch limits, but preprocess() / calculate_num_tokens() still drive all geometry off the registry instance’s fixed self.* fields. Since the router reuses a shared processor from ImageProcessorRegistry, any non-default Kimi config silently produces the wrong patch layout and placeholder counts.

💡 Apply the effective processor settings per call
 fn preprocess(
     &self,
     images: &[DynamicImage],
     config: &PreProcessorConfig,
 ) -> Result<PreprocessedImages, TransformError> {
+    let runtime = Self::from_preprocessor_config(config);
+
     if images.is_empty() {
         return Err(TransformError::EmptyBatch);
     }

@@
         for image in images {
             let (w, h) = image.dimensions();
-            let cfg = self.compute_resize_config(w as usize, h as usize);
+            let cfg = runtime.compute_resize_config(w as usize, h as usize);

@@
-            let grid_h = padded_h / self.patch_size;
-            let grid_w = padded_w / self.patch_size;
+            let grid_h = padded_h / runtime.patch_size;
+            let grid_w = padded_w / runtime.patch_size;

@@
-            let patches = Self::extract_patches(&tensor, self.patch_size);
+            let patches = Self::extract_patches(&tensor, runtime.patch_size);

@@
         let pixel_values = ndarray::Array4::from_shape_vec(
-            (total_patches, 3, self.patch_size, self.patch_size),
+            (total_patches, 3, runtime.patch_size, runtime.patch_size),
             all_patches,
         )
 fn calculate_num_tokens(&self, width: u32, height: u32, config: &PreProcessorConfig) -> usize {
-    self.compute_resize_config(width as usize, height as usize)
-        .num_tokens
+    Self::from_preprocessor_config(config)
+        .compute_resize_config(width as usize, height as usize)
+        .num_tokens
 }

Also applies to: 245-325


Comment on lines +96 to +125
fn compute_resize_config(&self, width: usize, height: usize) -> ResizeConfig {
let ps = self.patch_size;
let patches_w = (width / ps).max(1) as f64;
let patches_h = (height / ps).max(1) as f64;

let s1 = (self.in_patch_limit as f64 / (patches_w * patches_h)).sqrt();
let s2 = (self.patch_limit_on_one_side * ps) as f64 / width as f64;
let s3 = (self.patch_limit_on_one_side * ps) as f64 / height as f64;
let scale = f64::min(1.0, f64::min(s1, f64::min(s2, s3)));

let new_w = ((width as f64 * scale) as usize).max(1);
let new_h = ((height as f64 * scale) as usize).max(1);
let new_w = new_w.min(self.patch_limit_on_one_side * ps);
let new_h = new_h.min(self.patch_limit_on_one_side * ps);

let factor = self.factor();
let pad_width = (factor - new_w % factor) % factor;
let pad_height = (factor - new_h % factor) % factor;

let token_height = (new_h + pad_height) / factor;
let token_width = (new_w + pad_width) / factor;
let num_tokens = token_height * token_width;

ResizeConfig {
new_width: new_w,
new_height: new_h,
pad_width,
pad_height,
num_tokens,
}

⚠️ Potential issue | 🟠 Major

Alignment padding can exceed in_patch_limit.

s1 is computed from floor(width / patch_size) * floor(height / patch_size), but the real patch count is derived after both axes are rounded up to the 28-pixel alignment. Near the cap this overshoots: with the defaults, a 1792x1806 image scales to 1785x1798 and then pads to 1792x1820, i.e. 128x130 = 16,640 patches, which is above the configured 16,384 limit.

Please base the limit check on the aligned dimensions, or add a post-adjust step before returning ResizeConfig.
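The overshoot is easy to reproduce with the constants quoted in the comment (patch_size 14, 28-pixel alignment factor, in_patch_limit 16384; these are assumed from the review text, not read from the crate):

```rust
const PS: usize = 14;
const FACTOR: usize = 28;
const IN_PATCH_LIMIT: usize = 16384;

// Mirrors the reviewed scale/pad math: the scale is derived from the
// *unaligned* patch counts, then both axes are padded up to the 28-pixel
// alignment, which can push the final patch grid past the limit.
fn aligned_patch_grid(width: usize, height: usize) -> (usize, usize) {
    let patches_w = (width / PS).max(1) as f64;
    let patches_h = (height / PS).max(1) as f64;
    let scale = (IN_PATCH_LIMIT as f64 / (patches_w * patches_h)).sqrt().min(1.0);
    let new_w = ((width as f64 * scale) as usize).max(1);
    let new_h = ((height as f64 * scale) as usize).max(1);
    let padded_w = new_w + (FACTOR - new_w % FACTOR) % FACTOR;
    let padded_h = new_h + (FACTOR - new_h % FACTOR) % FACTOR;
    (padded_w / PS, padded_h / PS)
}

fn main() {
    // 1792x1806 scales to 1785x1798, pads to 1792x1820:
    // a 128 x 130 = 16,640-patch grid, above the 16,384 limit.
    let (gw, gh) = aligned_patch_grid(1792, 1806);
    println!("{gw} x {gh} = {} patches (limit {IN_PATCH_LIMIT})", gw * gh);
}
```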


Comment on lines +184 to +190
#[expect(
clippy::expect_used,
reason = "data has exactly 3*canvas_h*canvas_w elements by construction"
)]
Array3::from_shape_vec((3, canvas_h, canvas_w), data)
.expect("shape matches pre-allocated buffer")
}

🛠️ Refactor suggestion | 🟠 Major

Return TransformError here instead of panicking.

Both expect() calls turn preprocessing invariants into panics in production request handling. Threading Result through these helpers would keep unexpected shape/layout regressions recoverable, and you can still document the assumption with an INVARIANT: comment next to the error mapping. Based on learnings: multimodal vision processors in this repo should avoid expect/unreachable in production code and use INVARIANT: to document safe-code assumptions.

Also applies to: 210-216
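The shape of the pattern being asked for can be sketched without the crate's real types. `TransformError::Shape` and `to_canvas` below are illustrative stand-ins (the actual helper constructs an `ndarray::Array3`, whose `from_shape_vec` already returns a `Result` to map over):

```rust
#[derive(Debug, PartialEq)]
enum TransformError {
    Shape(String), // stand-in for the crate's real error type
}

// INVARIANT: callers construct `data` with exactly 3 * h * w elements;
// the check turns a violated invariant into an Err instead of a panic,
// keeping a shape regression recoverable in request handling.
fn to_canvas(data: Vec<f32>, h: usize, w: usize) -> Result<Vec<f32>, TransformError> {
    if data.len() != 3 * h * w {
        return Err(TransformError::Shape(format!(
            "expected {} elements for [3, {h}, {w}], got {}",
            3 * h * w,
            data.len()
        )));
    }
    Ok(data)
}

fn main() {
    assert!(to_canvas(vec![0.0; 12], 2, 2).is_ok());
    assert!(to_canvas(vec![0.0; 5], 2, 2).is_err());
}
```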

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 184 - 190,
Replace the panicking Array3::from_shape_vec(...).expect(...) calls with
error-aware code that maps the shape/vec construction failure into a
TransformError return; e.g., call Array3::from_shape_vec((3, canvas_h,
canvas_w), data) and propagate its Err via Result::map_err into an appropriate
TransformError value (with an "INVARIANT: data has exactly 3*canvas_h*canvas_w
elements by construction" comment next to the mapping), updating the enclosing
function signature to return Result<_, TransformError> if needed; apply the same
pattern to the other expect() usages around lines 210-216 so preprocessing
shape/layout failures produce Err(TransformError) rather than panicking.

Comment thread crates/tokenizer/src/tiktoken.rs
Comment thread model_gateway/src/routers/grpc/multimodal.rs
Member

@CatherineSue CatherineSue left a comment


Overall LGTM. Left some comments for future improvements.

Comment thread model_gateway/src/routers/grpc/multimodal.rs
Comment thread model_gateway/src/routers/grpc/multimodal.rs
@mergify
Contributor

mergify bot commented Apr 9, 2026

Hi @Kangyan-Zhou, this PR has merge conflicts that must be resolved before it can be merged. Please rebase your branch:

git fetch origin main
git rebase origin/main
# resolve any conflicts, then:
git push --force-with-lease

@mergify mergify bot added the needs-rebase PR has merge conflicts that need to be resolved label Apr 9, 2026
@Kangyan-Zhou Kangyan-Zhou force-pushed the feat/kimi-k25-vision-grpc branch from be79785 to 518fc42 Compare April 10, 2026 18:47
@mergify
Contributor

mergify bot commented Apr 10, 2026

Hi @Kangyan-Zhou, the DCO sign-off check has failed. All commits must include a Signed-off-by line.

To fix existing commits:

# Sign off the last N commits (replace N with the number of unsigned commits)
git rebase HEAD~N --signoff
git push --force-with-lease

To sign off future commits automatically:

  • Use git commit -s every time, or
  • VSCode: enable Git: Always Sign Off in Settings
  • PyCharm: enable Sign-off commit in the Commit tool window

@mergify mergify bot removed the needs-rebase PR has merge conflicts that need to be resolved label Apr 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (7)
crates/tokenizer/src/tiktoken.rs (1)

439-455: ⚠️ Potential issue | 🟠 Major

Respect add_special_tokens in encode().

Always calling encode_with_special_tokens() changes raw-text tokenization for callers that explicitly pass false, so this silently rewrites token counts and literal special-token handling. If the multimodal path needs placeholder recognition independently of BOS/EOS behavior, that needs a separate knob; the new test on Line 837 should also flip to true once this is fixed.

🔧 Proposed fix
-    fn encode(&self, input: &str, _add_special_tokens: bool) -> Result<Encoding> {
+    fn encode(&self, input: &str, add_special_tokens: bool) -> Result<Encoding> {
@@
-        let tokens = self.tokenizer.encode_with_special_tokens(input);
+        let tokens = if add_special_tokens {
+            self.tokenizer.encode_with_special_tokens(input)
+        } else {
+            self.tokenizer.encode_ordinary(input)
+        };
         Ok(Encoding::Tiktoken(tokens))
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/tokenizer/src/tiktoken.rs` around lines 439 - 455, Summary: encode()
currently always calls encode_with_special_tokens and ignores the
add_special_tokens flag, altering tokenization for callers that pass false. Fix:
in fn encode(&self, input: &str, _add_special_tokens: bool) -> Result<Encoding>,
honor the add_special_tokens parameter by calling
self.tokenizer.encode_with_special_tokens(input) when add_special_tokens is
true, and call self.tokenizer.encode_ordinary(input) when add_special_tokens is
false; return Encoding::Tiktoken(tokens) as before. Update any tests that
assumed the current behavior (the new test that expects encode to pass
add_special_tokens=false should be flipped to true once this is implemented).
Ensure function and tokenizer method names referenced are encode,
encode_with_special_tokens, encode_ordinary, and Encoding::Tiktoken.
crates/multimodal/src/vision/preprocessor_config.rs (1)

253-259: ⚠️ Potential issue | 🟡 Minor

Nested media_proc_cfg.patch_size only supports integer form.

The extraction logic handles only the integer form (v.as_u64()), but ignores object-form patch size like {"height":14,"width":14} which is supported elsewhere in this file via deserialize_patch_size. This could silently fall back to defaults.

🔧 Proposed fix to handle both forms
                     if config.patch_size.is_none() {
                         config.patch_size = media_cfg.get("patch_size").and_then(|v| {
-                            v.as_u64().map(|ps| PatchSize {
-                                height: Some(ps as u32),
-                                width: Some(ps as u32),
-                            })
+                            v.as_u64()
+                                .and_then(|ps| u32::try_from(ps).ok())
+                                .map(|ps| PatchSize {
+                                    height: Some(ps),
+                                    width: Some(ps),
+                                })
+                                .or_else(|| {
+                                    let h = v.get("height")?.as_u64().and_then(|x| u32::try_from(x).ok())?;
+                                    let w = v.get("width")?.as_u64().and_then(|x| u32::try_from(x).ok())?;
+                                    Some(PatchSize {
+                                        height: Some(h),
+                                        width: Some(w),
+                                    })
+                                })
                         });
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/preprocessor_config.rs` around lines 253 - 259,
The current logic that sets config.patch_size only handles integer form via
media_cfg.get("patch_size").and_then(|v| v.as_u64()...), so object-form values
like {"height":14,"width":14} are ignored; update the branch that fills
config.patch_size to first try the integer extraction (as_u64) and if that fails
attempt to deserialize the value into PatchSize using the existing
deserialize_patch_size helper (or serde_json::from_value) so both numeric and
object representations are supported for config.patch_size when reading
media_cfg.get("patch_size").
crates/multimodal/src/vision/processors/kimi_k25.rs (5)

96-126: ⚠️ Potential issue | 🟠 Major

Alignment padding can push token count beyond limits.

The scale s1 is computed using unaligned patch counts (based on patch_size=14), but the final num_tokens uses aligned dimensions (based on factor=28). The zero-padding rounds up each dimension by up to factor-1 pixels, potentially exceeding in_patch_limit near the boundary.

Consider adding a post-alignment check or iteratively reducing dimensions if the final token count exceeds the limit.
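The suggested post-alignment clamp can be sketched as a small loop that shrinks the larger axis by one alignment step until the padded grid fits. Names here (`PATCH_SIZE`, `FACTOR`, `IN_PATCH_LIMIT`, `clamp_to_limit`) are illustrative, not the crate's actual symbols:

```rust
const PATCH_SIZE: usize = 14;
const FACTOR: usize = 28;
const IN_PATCH_LIMIT: usize = 16_384;

/// Token count after both axes are rounded up to the 28-pixel alignment.
fn aligned_tokens(w: usize, h: usize) -> usize {
    let aw = w.div_ceil(FACTOR) * FACTOR;
    let ah = h.div_ceil(FACTOR) * FACTOR;
    (aw / PATCH_SIZE) * (ah / PATCH_SIZE)
}

/// Shrink the larger axis by one alignment step at a time until the
/// padded dimensions fit under the limit.
fn clamp_to_limit(mut w: usize, mut h: usize) -> (usize, usize) {
    while aligned_tokens(w, h) > IN_PATCH_LIMIT && w > FACTOR && h > FACTOR {
        if w >= h { w -= FACTOR } else { h -= FACTOR }
    }
    (w, h)
}

fn main() {
    // The boundary case from the review: 1785x1798 aligns to 16,640 tokens.
    let (w, h) = clamp_to_limit(1785, 1798);
    assert!(aligned_tokens(w, h) <= IN_PATCH_LIMIT);
}
```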

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 96 - 126,
compute_resize_config currently computes scale from unaligned patch counts
(using patch_size and in_patch_limit) but then pads to factor alignment
(factor()) which can increase num_tokens beyond in_patch_limit; update
compute_resize_config to perform a post-alignment check after computing
pad_width/pad_height/token_width/token_height/num_tokens and, if num_tokens >
self.in_patch_limit, iteratively reduce new_w and/or new_h (e.g., decrement by
factor or reduce scale) and recompute padding and token counts until num_tokens
<= self.in_patch_limit (ensure new_w/new_h remain >=1), then return
ResizeConfig; keep references to patch_size, in_patch_limit, factor(),
new_w/new_h, pad_width/pad_height, token_width/token_height, and num_tokens so
reviewers can find the change.

184-190: 🧹 Nitpick | 🔵 Trivial

Consider returning TransformError instead of panicking on shape mismatch.

While the #[expect(clippy::expect_used)] annotation documents the invariant, the repo guideline prefers returning recoverable errors over panics in production code. The shape is guaranteed by construction, so failure here indicates a bug, but a TransformError would allow cleaner error propagation.

💡 Alternative using map_err
-        #[expect(
-            clippy::expect_used,
-            reason = "data has exactly 3*canvas_h*canvas_w elements by construction"
-        )]
-        Array3::from_shape_vec((3, canvas_h, canvas_w), data)
-            .expect("shape matches pre-allocated buffer")
+        // INVARIANT: data has exactly 3*canvas_h*canvas_w elements by construction
+        Array3::from_shape_vec((3, canvas_h, canvas_w), data).map_err(|e| {
+            TransformError::ShapeError(format!(
+                "Failed to create canvas tensor [3, {canvas_h}, {canvas_w}]: {e}"
+            ))
+        })

Note: This would require changing the return type to Result<Array3<f32>, TransformError>.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 184 - 190,
Change the panic on shape mismatch to return a TransformError: update the
function that currently returns Array3::from_shape_vec((3, canvas_h, canvas_w),
data).expect(...) to return Result<Array3<f32>, TransformError>, remove the
#[expect(...)] annotation, and replace .expect("shape matches pre-allocated
buffer") with a map_err that converts the shape/Vec error into a TransformError
(include context like "invalid shape when constructing Array3 in kimi_k25").
Keep the original dimensions (3, canvas_h, canvas_w) and ensure call sites are
updated to handle the Result.

263-286: ⚠️ Potential issue | 🟠 Major

Inconsistent config usage: resize uses self.*, normalization uses passed config.

preprocess() reads image_mean/image_std from the passed config (lines 255-256), but compute_resize_config at line 265 uses self.* fields. If the processor instance is reused with varying configs, the resize dimensions will be stale while normalization will be fresh.

For consistency, either:

  1. Always use self.* fields and ignore the passed config entirely, or
  2. Create a runtime config from the passed PreProcessorConfig via from_preprocessor_config(config) and use its values throughout.
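Option 2 can be sketched as follows; all names here (`PreProcessorConfig` fields, `from_preprocessor_config`, `preprocess`) are simplified stand-ins for the real types:

```rust
// Stand-in config with one resize parameter and one normalization parameter.
#[derive(Clone, Copy)]
struct PreProcessorConfig {
    patch_size: usize,
    image_mean: f32,
}

struct KimiProcessor {
    patch_size: usize,
    image_mean: f32,
}

impl KimiProcessor {
    fn from_preprocessor_config(cfg: &PreProcessorConfig) -> Self {
        Self {
            patch_size: cfg.patch_size,
            image_mean: cfg.image_mean,
        }
    }

    fn preprocess(cfg: &PreProcessorConfig) -> (usize, f32) {
        // One source of truth: both resize and normalization read `rt`,
        // never a mix of stale `self.*` fields and the passed config.
        let rt = Self::from_preprocessor_config(cfg);
        (rt.patch_size, rt.image_mean)
    }
}

fn main() {
    let cfg = PreProcessorConfig { patch_size: 14, image_mean: 0.5 };
    assert_eq!(KimiProcessor::preprocess(&cfg), (14, 0.5));
}
```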
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 263 - 286,
preprocess() is mixing config sources: normalization uses the passed
PreProcessorConfig (image_mean/image_std) while compute_resize_config() reads
self.* fields, causing stale resize behavior when the processor is reused; to
fix, create a runtime config from the passed config (e.g. via
from_preprocessor_config(config)) at the start of preprocess() and use that
runtime config for compute_resize_config(), resize_pad_and_normalize(),
num_tokens, and any other parameters (instead of self.*), ensuring
compute_resize_config, resize_pad_and_normalize, extract_patches, and
grid/num_tokens calculations all use the same config values.

210-216: 🧹 Nitpick | 🔵 Trivial

Same expect() concern applies here.

The as_standard_layout() does guarantee contiguous memory, but per the same recommendation above, converting to a Result return would improve error handling consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 210 - 216,
Replace the panic-producing expect on flat.as_slice() with proper error
propagation: change the surrounding function to return Result, replace the
`.as_slice().expect("as_standard_layout ...")` with `.as_slice().ok_or_else(||
/* appropriate error e.g. anyhow::anyhow!("non-contiguous memory despite
as_standard_layout") or a crate-specific error variant */)?`, and propagate that
Result up (using `?`) so `data` is a `&[T]` without panicking; reference the
calls `as_standard_layout()`, `flat.as_slice()`, and the `data` binding when
making the change.

314-317: ⚠️ Potential issue | 🟡 Minor

calculate_num_tokens ignores the config parameter.

The method ignores the passed PreProcessorConfig and uses self.* fields instead. This could cause token count mismatches if the caller passes a config different from what was used to construct the processor instance.

For consistency with preprocess(), either:

  1. Document that config is intentionally unused (and consider removing the parameter if the trait allows), or
  2. Use Self::from_preprocessor_config(config).compute_resize_config(...) to ensure consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 314 - 317,
The calculate_num_tokens method currently ignores the passed PreProcessorConfig;
update it to construct a processor from the provided config and then call
compute_resize_config so token counts match preprocess(): replace the direct use
of self with a processor built via Self::from_preprocessor_config(config) and
call .compute_resize_config(width as usize, height as usize).num_tokens (or
alternatively document/remove the unused config if the trait permits), ensuring
you reference calculate_num_tokens, PreProcessorConfig, compute_resize_config,
Self::from_preprocessor_config and preprocess() when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/multimodal/src/vision/preprocessor_config.rs`:
- Around line 285-288: The from_value method currently serializes the
serde_json::Value to a string and delegates to from_json, which is inefficient;
extract the parsing logic that handles the nested media_proc_cfg into a shared
helper that accepts &serde_json::Value (e.g., parse_from_value or
from_value_inner) and performs deserialization directly from the Value for
fields used in PreprocessorConfig, then change from_value to call that helper
and have from_json reuse the same helper by parsing the string into a Value and
delegating to it; update or keep the same Result<Self, serde_json::Error> return
type and reference the existing function names from_value and from_json and the
media_proc_cfg handling to locate where to move the logic.
- Around line 519-548: In test_parse_kimi_nested_media_proc_cfg add assertions
that the two extra fields were captured in PreProcessorConfig::extra: check that
config.extra contains "in_patch_limit" with value 16384 and
"patch_limit_on_one_side" with value 512 (e.g., via asserts that lookup those
keys in config.extra and compare to the expected integers), so the test verifies
extraction of those fields alongside get_image_mean/get_image_std/get_patch_size
and merge_size.

---

Duplicate comments:
In `@crates/multimodal/src/vision/preprocessor_config.rs`:
- Around line 253-259: The current logic that sets config.patch_size only
handles integer form via media_cfg.get("patch_size").and_then(|v|
v.as_u64()...), so object-form values like {"height":14,"width":14} are ignored;
update the branch that fills config.patch_size to first try the integer
extraction (as_u64) and if that fails attempt to deserialize the value into
PatchSize using the existing deserialize_patch_size helper (or
serde_json::from_value) so both numeric and object representations are supported
for config.patch_size when reading media_cfg.get("patch_size").

In `@crates/multimodal/src/vision/processors/kimi_k25.rs`:
- Around line 96-126: compute_resize_config currently computes scale from
unaligned patch counts (using patch_size and in_patch_limit) but then pads to
factor alignment (factor()) which can increase num_tokens beyond in_patch_limit;
update compute_resize_config to perform a post-alignment check after computing
pad_width/pad_height/token_width/token_height/num_tokens and, if num_tokens >
self.in_patch_limit, iteratively reduce new_w and/or new_h (e.g., decrement by
factor or reduce scale) and recompute padding and token counts until num_tokens
<= self.in_patch_limit (ensure new_w/new_h remain >=1), then return
ResizeConfig; keep references to patch_size, in_patch_limit, factor(),
new_w/new_h, pad_width/pad_height, token_width/token_height, and num_tokens so
reviewers can find the change.
- Around line 184-190: Change the panic on shape mismatch to return a
TransformError: update the function that currently returns
Array3::from_shape_vec((3, canvas_h, canvas_w), data).expect(...) to return
Result<Array3<f32>, TransformError>, remove the #[expect(...)] annotation, and
replace .expect("shape matches pre-allocated buffer") with a map_err that
converts the shape/Vec error into a TransformError (include context like
"invalid shape when constructing Array3 in kimi_k25"). Keep the original
dimensions (3, canvas_h, canvas_w) and ensure call sites are updated to handle
the Result.
- Around line 263-286: preprocess() is mixing config sources: normalization uses
the passed PreProcessorConfig (image_mean/image_std) while
compute_resize_config() reads self.* fields, causing stale resize behavior when
the processor is reused; to fix, create a runtime config from the passed config
(e.g. via from_preprocessor_config(config)) at the start of preprocess() and use
that runtime config for compute_resize_config(), resize_pad_and_normalize(),
num_tokens, and any other parameters (instead of self.*), ensuring
compute_resize_config, resize_pad_and_normalize, extract_patches, and
grid/num_tokens calculations all use the same config values.
- Around line 210-216: Replace the panic-producing expect on flat.as_slice()
with proper error propagation: change the surrounding function to return Result,
replace the `.as_slice().expect("as_standard_layout ...")` with
`.as_slice().ok_or_else(|| /* appropriate error e.g.
anyhow::anyhow!("non-contiguous memory despite as_standard_layout") or a
crate-specific error variant */)?`, and propagate that Result up (using `?`) so
`data` is a `&[T]` without panicking; reference the calls
`as_standard_layout()`, `flat.as_slice()`, and the `data` binding when making
the change.
- Around line 314-317: The calculate_num_tokens method currently ignores the
passed PreProcessorConfig; update it to construct a processor from the provided
config and then call compute_resize_config so token counts match preprocess():
replace the direct use of self with a processor built via
Self::from_preprocessor_config(config) and call .compute_resize_config(width as
usize, height as usize).num_tokens (or alternatively document/remove the unused
config if the trait permits), ensuring you reference calculate_num_tokens,
PreProcessorConfig, compute_resize_config, Self::from_preprocessor_config and
preprocess() when making the change.

In `@crates/tokenizer/src/tiktoken.rs`:
- Around line 439-455: Summary: encode() currently always calls
encode_with_special_tokens and ignores the add_special_tokens flag, altering
tokenization for callers that pass false. Fix: in fn encode(&self, input: &str,
_add_special_tokens: bool) -> Result<Encoding>, honor the add_special_tokens
parameter by calling self.tokenizer.encode_with_special_tokens(input) when
add_special_tokens is true, and call self.tokenizer.encode_ordinary(input) when
add_special_tokens is false; return Encoding::Tiktoken(tokens) as before. Update
any tests that assumed the current behavior (the new test that expects encode to
pass add_special_tokens=false should be flipped to true once this is
implemented). Ensure function and tokenizer method names referenced are encode,
encode_with_special_tokens, encode_ordinary, and Encoding::Tiktoken.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c3e68e81-085a-4618-bd14-548408c01844

📥 Commits

Reviewing files that changed from the base of the PR and between be79785 and 518fc42.

📒 Files selected for processing (11)
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/multimodal/src/vision/image_processor.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processors/kimi_k25.rs
  • crates/multimodal/src/vision/processors/mod.rs
  • crates/tokenizer/src/tiktoken.rs
  • grpc_servicer/smg_grpc_servicer/sglang/servicer.py
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment thread crates/multimodal/src/vision/preprocessor_config.rs
Comment thread crates/multimodal/src/vision/preprocessor_config.rs
@Kangyan-Zhou Kangyan-Zhou force-pushed the feat/kimi-k25-vision-grpc branch from 518fc42 to c42b225 Compare April 10, 2026 19:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
crates/multimodal/src/vision/preprocessor_config.rs (2)

267-280: ⚠️ Potential issue | 🟠 Major

Handle nested patch_size object form and avoid unchecked narrowing casts.

media_proc_cfg.patch_size currently only accepts scalar form, and as u32 / as usize can silently truncate oversized values. That can produce invalid preprocessing dimensions.

🔧 Proposed fix
         if config.patch_size.is_none() {
             config.patch_size = media_cfg.get("patch_size").and_then(|v| {
-                v.as_u64().map(|ps| PatchSize {
-                    height: Some(ps as u32),
-                    width: Some(ps as u32),
-                })
+                v.as_u64()
+                    .and_then(|ps| u32::try_from(ps).ok())
+                    .map(|ps| PatchSize {
+                        height: Some(ps),
+                        width: Some(ps),
+                    })
+                    .or_else(|| {
+                        let h = v
+                            .get("height")?
+                            .as_u64()
+                            .and_then(|x| u32::try_from(x).ok())?;
+                        let w = v
+                            .get("width")
+                            .and_then(|x| x.as_u64())
+                            .and_then(|x| u32::try_from(x).ok())
+                            .unwrap_or(h);
+                        Some(PatchSize {
+                            height: Some(h),
+                            width: Some(w),
+                        })
+                    })
             });
         }
         if config.merge_size.is_none() {
             config.merge_size = media_cfg
                 .get("merge_kernel_size")
                 .and_then(|v| v.as_u64())
-                .map(|v| v as usize);
+                .and_then(|v| usize::try_from(v).ok());
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/preprocessor_config.rs` around lines 267 - 280,
The patch_size and merge_size population logic must accept either a scalar or an
object form and must avoid unchecked narrowing casts; update the block that sets
config.patch_size and config.merge_size (referencing config.patch_size,
PatchSize, media_cfg, and merge_kernel_size) to: 1) if
media_cfg.get("patch_size") is an object, read "height" and "width" fields
individually (falling back to a scalar if present), 2) when converting numeric
values use checked conversion (e.g., u64.try_into().ok()) or explicit bounds
checks and only assign PatchSize when conversions succeed, and 3) for merge_size
convert the u64 to usize using a safe try_into().ok() and skip/return an error
if conversion would overflow rather than using as u32/as usize.

521-550: ⚠️ Potential issue | 🟡 Minor

Expand the Kimi nested-config test to assert extracted limits.

The test includes in_patch_limit and patch_limit_on_one_side in JSON but does not verify they were captured into extra.

🧪 Proposed test assertions
         let std = config.get_image_std();
         assert!((std[0] - 0.5).abs() < 1e-6);
+        assert!((std[1] - 0.5).abs() < 1e-6);
+        assert!((std[2] - 0.5).abs() < 1e-6);

         assert_eq!(config.get_patch_size(0), 14);
         assert_eq!(config.merge_size, Some(2));
+        let in_patch_limit: Option<usize> = config.get_extra("in_patch_limit");
+        assert_eq!(in_patch_limit, Some(16384));
+        let patch_limit_on_one_side: Option<usize> = config.get_extra("patch_limit_on_one_side");
+        assert_eq!(patch_limit_on_one_side, Some(512));
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/preprocessor_config.rs` around lines 521 - 550,
Update the test_parse_kimi_nested_media_proc_cfg test to also assert that the
nested media_proc_cfg limits are preserved: after creating config via
PreProcessorConfig::from_json, check that config.extra (or the field that stores
parsed extra settings) contains "in_patch_limit" == 16384 and
"patch_limit_on_one_side" == 512; add assertions that extract these keys from
config.extra and compare them to the expected integer values so the test
verifies the limits were captured.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 704-708: Change the fast-path pixel serialization to only call
preprocessed.pixel_values.as_slice() and remove the fallback to
as_slice_memory_order(); specifically, update the logic that builds pixel_bytes
(the let pixel_bytes: Vec<u8> block) so it uses as_slice() alone and falls back
to the existing .iter() path for non-C-contiguous arrays, ensuring
preprocessed.pixel_values.as_slice_memory_order() is no longer invoked and the
.iter()-based branch handles all non-contiguous cases.

---

Duplicate comments:
In `@crates/multimodal/src/vision/preprocessor_config.rs`:
- Around line 267-280: The patch_size and merge_size population logic must
accept either a scalar or an object form and must avoid unchecked narrowing
casts; update the block that sets config.patch_size and config.merge_size
(referencing config.patch_size, PatchSize, media_cfg, and merge_kernel_size) to:
1) if media_cfg.get("patch_size") is an object, read "height" and "width" fields
individually (falling back to a scalar if present), 2) when converting numeric
values use checked conversion (e.g., u64.try_into().ok()) or explicit bounds
checks and only assign PatchSize when conversions succeed, and 3) for merge_size
convert the u64 to usize using a safe try_into().ok() and skip/return an error
if conversion would overflow rather than using as u32/as usize.
- Around line 521-550: Update the test_parse_kimi_nested_media_proc_cfg test to
also assert that the nested media_proc_cfg limits are preserved: after creating
config via PreProcessorConfig::from_json, check that config.extra (or the field
that stores parsed extra settings) contains "in_patch_limit" == 16384 and
"patch_limit_on_one_side" == 512; add assertions that extract these keys from
config.extra and compare them to the expected integer values so the test
verifies the limits were captured.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dc4b2606-6e4c-4593-9486-b7b1d7af43a9

📥 Commits

Reviewing files that changed from the base of the PR and between 518fc42 and c42b225.

📒 Files selected for processing (3)
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/tokenizer/src/tiktoken.rs
  • model_gateway/src/routers/grpc/multimodal.rs

Comment on lines +704 to +708
let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
.pixel_values
.as_slice()
.or_else(|| preprocessed.pixel_values.as_slice_memory_order())
{

@coderabbitai coderabbitai bot Apr 10, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify whether producers of pixel_values can generate non-standard layouts
# that would make as_slice_memory_order() ordering-sensitive.
rg -n --type=rust -C2 'pixel_values|permuted_axes|reversed_axes|as_standard_layout|into_shape|from_shape_vec|to_shape' crates/multimodal/src

# Re-check serializer logic in this file.
rg -n --type=rust -C4 'fn serialize_pixel_values|as_slice_memory_order|as_slice\(' model_gateway/src/routers/grpc/multimodal.rs

Repository: lightseekorg/smg

Length of output: 46674


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the full serialize_pixel_values function to see current fallback logic
head -n 730 model_gateway/src/routers/grpc/multimodal.rs | tail -n +700

# Verify if any processor explicitly creates non-C-contiguous or permuted arrays
rg -A5 'from_shape_vec|into_shape|into_dyn|Array.*::zeros' crates/multimodal/src/vision/processors/ | head -50

Repository: lightseekorg/smg

Length of output: 5523


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for any code that might produce non-C-contiguous pixel_values
# (e.g., transpose, permute, slice operations without calling as_standard_layout)
rg -n 'permute|transpose|reversed_axes|to_owned|into_dyn' crates/multimodal/src/vision/processors/ | grep -v test | head -20

Repository: lightseekorg/smg

Length of output: 2203


Remove as_slice_memory_order() fallback from pixel serialization.

The fast path should only use as_slice() (which guarantees C-contiguous/row-major layout). The existing comment at lines 722–724 explicitly warns that as_slice_memory_order() can serialize Fortran-contiguous arrays in wrong dimension order, corrupting pixel tensor layout. This fallback is unnecessary since the .iter() path already handles non-C-contiguous cases correctly.

Proposed fix
     let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
         .pixel_values
-        .as_slice()
-        .or_else(|| preprocessed.pixel_values.as_slice_memory_order())
+        .as_slice()
     {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 704-708: change
the fast-path pixel serialization to only call
preprocessed.pixel_values.as_slice() and remove the fallback to
as_slice_memory_order(); specifically, update the logic that builds pixel_bytes
(the let pixel_bytes: Vec<u8> block) so it uses as_slice() alone and falls back
to the existing .iter() path for non-C-contiguous arrays, ensuring
preprocessed.pixel_values.as_slice_memory_order() is no longer invoked and the
.iter()-based branch handles all non-contiguous cases.
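For readers outside the thread, the layout hazard this finding describes can be sketched with plain arrays (illustrative only, not the router's actual code): a memory-order dump of a non-C-contiguous tensor ships storage-order bytes, while a shape-and-strides walk (what the `.iter()` fallback effectively does) preserves logical row-major order.

```rust
// Minimal sketch of the layout hazard, without the ndarray crate.
// A 2x3 tensor whose storage is column-major (Fortran order), as a
// transposed view would be.
fn logical_order(data: &[i32], shape: (usize, usize), strides: (usize, usize)) -> Vec<i32> {
    // Walk elements in row-major (logical) order via shape + strides.
    let mut out = Vec::with_capacity(shape.0 * shape.1);
    for r in 0..shape.0 {
        for c in 0..shape.1 {
            out.push(data[r * strides.0 + c * strides.1]);
        }
    }
    out
}

fn main() {
    // Logical tensor [[1, 2, 3], [4, 5, 6]] stored column-major:
    // strides are (1, 2) instead of the C-contiguous (3, 1).
    let storage = [1, 4, 2, 5, 3, 6];
    let logical = logical_order(&storage, (2, 3), (1, 2));
    assert_eq!(logical, vec![1, 2, 3, 4, 5, 6]); // correct wire bytes
    // A memory-order dump ships the raw storage instead, so a receiver
    // reconstructing with shape [2, 3] would see permuted pixel values:
    assert_ne!(storage.to_vec(), logical);
}
```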

Member

This one is right, but it's really a nit right now: we don't have many non-C-contiguous arrays. We can address this in a follow-up PR.


@CatherineSue, understood — the concern is valid but non-blocking for this PR since C-contiguous arrays are the common case. Would you like me to open a follow-up GitHub issue to track the removal of the as_slice_memory_order() fallback from serialize_pixel_values?


🧠 Learnings used
Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 588
File: model_gateway/src/routers/grpc/multimodal.rs:453-514
Timestamp: 2026-03-03T18:03:45.713Z
Learning: In repo lightseekorg/smg, backend assembly functions in model_gateway/src/routers/grpc/multimodal.rs (e.g., assemble_sglang, assemble_vllm, assemble_trtllm) are tested via E2E tests rather than unit tests, as unit tests for these functions are not considered worthwhile.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 570
File: multimodal/src/registry.rs:133-143
Timestamp: 2026-03-01T06:00:45.296Z
Learning: In repo lightseekorg/smg, the image_sizes field in PreprocessedImages will be standardized to use `(height, width)` tuple ordering across all processors in multimodal/src/vision/processors/. Currently llama4, phi4, and pixtral use `(height, width)` while llava and phi3 use `(width, height)`. The image_sizes_hw helper in multimodal/src/registry.rs correctly interprets tuples as (h, w) for the majority convention.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/routers/grpc/client.rs:206-211
Timestamp: 2026-02-21T02:31:17.841Z
Learning: Repo: lightseekorg/smg
File: model_gateway/src/routers/grpc/client.rs
Context: GrpcClient::embed match arm
Learning: The catch-all panic in GrpcClient::embed for mismatched client/request types or unsupported embedding backends is intentional to catch invariant violations and unsupported configurations at development time. Converting this path to a returned error is out of scope for PR `#489`.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 987
File: model_gateway/src/routers/common/sse.rs:414-427
Timestamp: 2026-04-08T17:21:24.867Z
Learning: In repo lightseekorg/smg, `model_gateway/src/routers/common/sse.rs`: The `parse_frame` function does NOT need to strip a leading UTF-8 BOM (U+FEFF). All upstream SSE sources in this gateway are HTTP JSON API streams (OpenAI, Anthropic, internal gRPC/HTTP workers); RFC 8259 forbids BOMs in JSON, and HTTP API response bodies from these providers do not emit BOMs. Do not flag the absence of BOM normalization in `parse_frame` or `parse_block` as a defect.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/src/vision/processors/phi4_vision.rs:486-490
Timestamp: 2026-02-21T02:39:23.635Z
Learning: Repo: lightseekorg/smg — For multimodal/src/vision/processors/phi4_vision.rs (Phi4VisionProcessor::preprocess), prefer returning a recoverable error via ok_or(TransformError::EmptyBatch)? over using expect/unreachable, even when a prior non-empty check exists. This follows the repo’s “avoid panics in production code” guideline for PR `#489` and similar lint-only efforts.

Learnt from: vschandramourya
Repo: lightseekorg/smg PR: 915
File: model_gateway/src/routers/grpc/client.rs:387-423
Timestamp: 2026-03-26T17:06:14.307Z
Learning: In repo lightseekorg/smg, in `model_gateway/src/routers/grpc/client.rs` and the corresponding backend builders (`crates/grpc_client/src/sglang_scheduler.rs`, `vllm_engine.rs`, `trtllm_service.rs`): The per-backend divergence in handling `CompletionRequest.max_tokens == None` is intentional. SGLang and vLLM pass `None` through to their proto builders, while TRT-LLM falls back to `16`. This matches the pre-existing per-backend pattern used in the chat/messages request builders. Do not flag this divergence as a bug or request normalization at the `build_completion_request` dispatcher layer in `client.rs`.

Learnt from: pallasathena92
Repo: lightseekorg/smg PR: 687
File: model_gateway/src/routers/openai/realtime/webrtc_bridge.rs:571-590
Timestamp: 2026-03-10T23:05:51.755Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/openai/realtime/webrtc_bridge.rs: The `pkt.payload.clone()` and `pkt.header.ext_vals.clone()` calls in `forward_rtp` are forced by str0m's `write_rtp` API, which takes owned values. Zero-copy RTP forwarding is not achievable without upstream API changes to str0m itself; do not flag these clones as an avoidable allocation in PR `#687` or similar PRs using str0m 0.16.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1006
File: crates/tool_parser/src/parsers/deepseek31.rs:162-181
Timestamp: 2026-04-01T04:14:46.469Z
Learning: In repo lightseekorg/smg, `crates/tool_parser/src/parsers/deepseek31.rs` (and the analogous V3 parser `crates/tool_parser/src/parsers/deepseek.rs` lines 186-197): The `parse_incremental` method does NOT split off a plain-text prefix before the first tool marker within the same chunk. This is intentional because the gRPC streaming path delivers tokens individually, so normal text content and tool-call markers always arrive in separate chunks — a prefix and a tool marker will never coexist in the same chunk. Do not flag the absence of a within-chunk prefix-split as a bug; the `test_deepseek31_streaming_text_before_tools` test covers the realistic multi-chunk case.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 791
File: model_gateway/src/routers/grpc/harmony/stages/preparation.rs:118-124
Timestamp: 2026-03-17T20:14:15.295Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/harmony/stages/preparation.rs (prepare_chat): The condition `if constraint.is_some() && body_ref.response_format.is_some()` that clears `response_format` only fires when `response_format` was the source of the structural-tag constraint. Protocol-level validation and the previous pipeline stage ensure that a request carrying both a tool constraint and a `response_format` is rejected before reaching this stage. Therefore the comment "If response_format was converted to a structural tag, clear it..." accurately describes the runtime behavior.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 662
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:42-47
Timestamp: 2026-03-06T21:56:31.502Z
Learning: In repo lightseekorg/smg, `_tensor_from_proto` in `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` only deserializes tensors produced by the internal Rust router (never raw client input). A `RuntimeError` from `torch.frombuffer()` or `Tensor.reshape()` indicates a server-side serialization bug; it should surface as gRPC `INTERNAL`, not `INVALID_ARGUMENT`. Do not suggest wrapping these calls in try/except to convert to `ValueError`.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/benches/wasm_middleware_latency.rs:88-91
Timestamp: 2026-02-21T02:37:02.009Z
Learning: Repo: lightseekorg/smg — For clippy-only/enforcement PRs (e.g., PR `#489`), even micro-optimizations (like replacing an async closure with std::future::ready in benches such as model_gateway/benches/wasm_middleware_latency.rs) should be deferred to a follow-up PR rather than included inline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 1023
File: crates/protocols/src/chat.rs:41-50
Timestamp: 2026-04-02T02:01:23.983Z
Learning: In repo lightseekorg/smg, `crates/protocols/src/chat.rs`: The `Assistant` variant of `ChatMessage` intentionally always serializes `content` (even as `null` when `None`) to satisfy Jinja chat templates that check `message['content'] is none`. This applies to ALL assistant messages regardless of whether `tool_calls` or `reasoning_content` is present — `content: null` is considered accurate and harmless for reasoning-only messages (no tool_calls, no text) as well. Do not flag the absence of a conditional `skip_serializing_if` guard that restricts null-content emission to only tool-call-bearing messages.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 958
File: crates/multimodal/src/registry/llava.rs:61-70
Timestamp: 2026-03-27T23:55:38.295Z
Learning: In repo lightseekorg/smg, LLaVA 1.5 (LlavaSpec in crates/multimodal/src/registry/llava.rs) always produces the same number of image tokens for every image in a batch, because LlavaProcessor always resizes inputs to 336×336 before ViT processing (yielding (336/14)^2 = 576 tokens per image). The first-count fan-out pattern in LlavaSpec::prompt_replacements — reading preprocessed.num_img_tokens.first() and applying that count to all images via vec![replacement; len] — is intentional and correct. Do not flag this as a bug or request per-image iteration (as LlavaNextSpec does); the debug_assert! is a sufficient guard for the uniformity assumption.

Learnt from: TingtingZhou7
Repo: lightseekorg/smg PR: 1057
File: crates/mcp/src/transform/transformer.rs:854-879
Timestamp: 2026-04-08T04:33:48.435Z
Learning: In repo lightseekorg/smg, `to_image_generation_call` in `crates/mcp/src/transform/transformer.rs` intentionally always sets `ImageGenerationCallStatus::Completed` regardless of whether the payload contains an error, aligning with the uniform behavior of the web-search (`WebSearchCallStatus::Completed`), code interpreter (`CodeInterpreterCallStatus::Completed`), and file search (`FileSearchCallStatus::Completed`) transformers. The `test_image_generation_transform_error_payload` test therefore asserts `Completed` even for error payloads. Per-status differentiation for image-generation errors is deferred to a follow-up PR. Do not flag this as a bug or request `Failed` status in this or similar incremental PRs.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 570
File: model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs:87-97
Timestamp: 2026-03-01T05:57:56.940Z
Learning: In model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs, when multimodal content is detected, an empty tokenizer_source must be rejected with a bad_request (multimodal_config_missing) before calling process_multimodal(). Without a valid tokenizer source, get_or_load_config() fails with a confusing file-not-found error downstream. The early guard provides a clear error message. This differs from request_building.rs, where empty tokenizer_source is a valid fallback for non-multimodal config loading.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 497
File: model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs:108-113
Timestamp: 2026-02-21T23:56:04.191Z
Learning: In repo lightseekorg/smg, file model_gateway/src/routers/grpc/regular/stages/chat/request_building.rs: When fetching tokenizer_source via ctx.components.tokenizer_registry.get_by_name(model_id).map(|e| e.source).unwrap_or_default(), an empty string is an intentional valid fallback when the model isn't in the registry. This allows config loading to proceed with the default path. Comments documenting this fallback are not necessary to keep noise down.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 990
File: crates/tokenizer/benches/stop_decoder.rs:23-24
Timestamp: 2026-04-01T08:02:32.590Z
Learning: In `lightseekorg/smg` (`crates/tokenizer/benches/stop_decoder.rs`), benchmarks intentionally use `MockTokenizer` (not a real HuggingFace tokenizer) to keep them hermetic and avoid network/filesystem dependencies for model downloads. The HuggingFace native `step_decode_stream` fast-path is covered by the integration test `test_sequence_operations` in `crates/tokenizer/tests/tokenizer_integration.rs`, which uses a real HF tokenizer. A fixture-backed HF benchmark is a known follow-up item but is not required in this PR. Do not flag the absence of HF tokenizer usage in the bench file as a gap.

Learnt from: TingtingZhou7
Repo: lightseekorg/smg PR: 1057
File: model_gateway/src/routers/openai/mcp/tool_loop.rs:856-885
Timestamp: 2026-04-08T00:08:05.944Z
Learning: In repo lightseekorg/smg, `sanitize_builtin_tool_arguments` in `model_gateway/src/routers/openai/mcp/tool_loop.rs` intentionally drops image-generation options (size, quality, background, output_format, compression) when handling `ResponseFormat::ImageGenerationCall`, keeping only `model` (hardcoded `IMAGE_MODEL`) and `revised_prompt`. This is a deliberate scoped decision for the initial image-generation tool integration; per-option overrides/defaults are planned for a follow-up PR. Do not flag the truncation as a bug or request preservation of extra fields.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: grpc_client/proto/sglang_scheduler.proto:140-172
Timestamp: 2026-02-21T11:57:36.560Z
Learning: In repo lightseekorg/smg, the multimodal proto fields in `grpc_client/proto/sglang_scheduler.proto` (including `MultimodalInputs.pixel_values`, `model_specific_tensors`, `im_token_id`, and `mm_placeholders`) were introduced in PR `#495` as the first real usage; prior internal-only development fields like `processed_features` were never shipped to production, so wire-compatibility concerns do not apply to these field number assignments.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: model_gateway/src/routers/grpc/multimodal.rs:153-166
Timestamp: 2026-02-21T18:39:37.571Z
Learning: In repo lightseekorg/smg, the gRPC multimodal pipeline in model_gateway/src/routers/grpc/multimodal.rs intentionally skips ContentPart::VideoUrl silently (without warning) because video support is not yet implemented; adding warnings would be noisy for legitimate requests containing video parts that aren't processed yet. Proper video handling will be added when the feature is implemented.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 1034
File: model_gateway/src/routers/grpc/multimodal.rs:602-610
Timestamp: 2026-04-04T00:29:52.690Z
Learning: In repo lightseekorg/smg, the gRPC regular pipeline runs stages in order: Preparation (tokenization + multimodal media fetch + preprocessing) → Worker Selection → Request Building. Because backend identity (e.g., GrpcClient::Mlx) is only resolved after worker selection, it is architecturally impossible to short-circuit MLX multimodal rejection before process_multimodal() / process_multimodal_messages() runs in the Preparation stage. The earliest feasible rejection point for backend-specific multimodal unsupported errors is at assemble_multimodal_data in request_building (after preprocessing but before generation). Moving rejection earlier would require a larger architectural change (e.g., moving worker selection before preparation). Do not flag assemble_multimodal_data's GrpcClient::Mlx error arm as "too late" in the pipeline.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 638
File: grpc_servicer/smg_grpc_servicer/vllm/servicer.py:371-406
Timestamp: 2026-03-05T04:48:49.033Z
Learning: In repo lightseekorg/smg, `grpc_servicer/smg_grpc_servicer/vllm/servicer.py` (`_build_preprocessed_mm_inputs`): multimodal metadata fields (`flat_keys`, `batched_keys`, `mm_placeholders`, `model_specific_tensors`) are produced exclusively by the internal Rust router and are treated as trusted data. Defensive input-validation guards (e.g., checking flat_keys targets exist in hf_dict, validating PlaceholderRange bounds) are intentionally omitted to avoid hot-path overhead. Do not flag missing validation on these fields in future reviews.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 495
File: model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs:121-131
Timestamp: 2026-02-21T18:45:58.696Z
Learning: In repo lightseekorg/smg, all multimodal processing failures in model_gateway/src/routers/grpc/regular/stages/chat/preparation.rs currently return 400 Bad Request for simplicity, because the underlying `anyhow::Error` from the multimodal crate doesn't distinguish error types (client vs upstream failures). Error categorization (e.g., mapping upstream fetch failures to 502 Bad Gateway) is deferred to a follow-up when more failure modes need differentiation.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: multimodal/tests/vision_golden_tests.rs:576-578
Timestamp: 2026-02-21T02:39:51.670Z
Learning: Repo lightseekorg/smg — For PR `#489` (clippy/lint-only), do not replace unwrap() with expect(...) in test files when a file-/crate-level #![expect(clippy::unwrap_used)] is present (e.g., multimodal/tests/vision_golden_tests.rs). Such stylistic swaps are out of scope.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 447
File: model_gateway/src/routers/grpc/client.rs:312-328
Timestamp: 2026-02-17T20:30:27.647Z
Learning: Actionable guideline: In model_gateway gRPC metadata discovery (specifically in model_gateway/src/routers/grpc/...), verify how keys are handled for different proto sources. SGLang uses short-form keys (tp_size, dp_size, pp_size) via pick_prost_fields() without normalization, while vLLM/TRT-LLM use long-form keys (tensor_parallel_size, pipeline_parallel_size) that pass through flat_labels() and are normalized by normalize_grpc_keys() in discover_metadata.rs after model_info.to_labels() and device/server_info.to_labels(). Ensure reviewers check that the code paths correctly reflect these normalization rules and that tests cover both code paths.

Learnt from: XinyueZhang369
Repo: lightseekorg/smg PR: 399
File: protocols/src/interactions.rs:505-509
Timestamp: 2026-02-19T03:08:50.192Z
Learning: In code reviews for Rust projects using the validator crate (v0.20.0), ensure that custom validation functions for numeric primitive types (e.g., f32, i32, u32, i16, etc.) accept the value by value, not by reference. Example: fn validate(value: f32) { ... }. The validator derive macro has a hardcoded list of numeric types that are passed by value, while all other types are passed by reference. Apply this guideline whenever validating numeric fields to align with the derive macro behavior.

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: model_gateway/src/core/token_bucket.rs:58-63
Timestamp: 2026-02-21T02:30:51.443Z
Learning: For lint-only/Clippy enforcement PRs in this repository, avoid introducing behavioral changes (e.g., new input validation or logic changes). Treat such PRs as non-functional changes and plan a separate follow-up issue/PR for hardening or behavior changes. This applies broadly to Rust files across the repo; during review, focus on lint/style corrections and clearly note any intentional exceptions. 

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: protocols/src/responses.rs:928-931
Timestamp: 2026-02-21T02:36:00.882Z
Learning: In Rust code across the repository, use the marker INVARIANT: to document assumptions in safe code. Reserve SAFETY: for explaining why unsafe blocks are sound. This improves clarity of invariants and safety reasoning. Example reference: protocols/src/responses.rs near validate_tool_choice_with_tools().

Learnt from: slin1237
Repo: lightseekorg/smg PR: 489
File: mesh/src/sync.rs:83-83
Timestamp: 2026-02-21T02:37:01.416Z
Learning: General Rust formatting rule: format! with implicit captures only supports simple identifiers, not full expressions like {state.model_id}. For cases where you want to interpolate a field or expression, bind the value first and interpolate the binding, e.g., let model_id = &state.model_id; and then use format!("policy:{}", model_id). In the specific file mesh/src/sync.rs, prefer format!("policy:{}", state.model_id) or bind to a local variable if you need named interpolation, to keep clarity and avoid unintended captures.

Learnt from: zhaowenzi
Repo: lightseekorg/smg PR: 807
File: model_gateway/src/middleware.rs:61-81
Timestamp: 2026-03-18T21:32:00.041Z
Learning: In Rust code using the http crate, HeaderMap::get() is effectively case-insensitive because HeaderName normalizes keys to lowercase on insertion and lookup. Do not require or perform explicit .to_lowercase() before HeaderMap::get() calls. Mark as not a concern for case-sensitivity in lookups; only consider normalization when inserting or comparing via HeaderName, not in lookups.

Learnt from: key4ng
Repo: lightseekorg/smg PR: 867
File: tui/src/app.rs:798-813
Timestamp: 2026-03-22T20:13:55.778Z
Learning: In this repo (lightseekorg/smg), treat the workspace `Cargo.toml`’s `package.rust-version` (MSRV) as the source of truth (e.g., `rust-version = "1.85"`). When reviewing Rust changes, do not flag usage of Rust language/library features that were stabilized on or before the MSRV (e.g., `Option::is_none_or`, stabilized in 1.82, is compatible with an MSRV of 1.85). Always verify the MSRV from the workspace `Cargo.toml` rather than relying on issue templates.

Learnt from: CatherineSue
Repo: lightseekorg/smg PR: 937
File: model_gateway/src/core/worker.rs:0-0
Timestamp: 2026-03-27T03:20:19.917Z
Learning: When calling `worker.record_outcome(status_code: u16)` (the unified Circuit Breaker outcome recording API), it’s valid to pass *synthetic* HTTP status codes for transport/connection errors where no real HTTP response was received. For example, callers may pass `502` (send error), `504` (timeout), or other appropriate `502/503/504`-style synthetic codes to preserve CB feedback. Do not flag these calls as incorrect usage of `record_outcome`. Health checks should still handle reachability separately.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c42b2255ec


Comment on lines +704 to +707
let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
.pixel_values
.as_slice()
.or_else(|| preprocessed.pixel_values.as_slice_memory_order())

P2: Keep logical tensor order for non-standard pixel buffers

The new as_slice_memory_order() fallback can serialize contiguous-but-nonstandard ndarray layouts (e.g., Fortran-order / transposed views) in memory order instead of logical row-major order, while still sending the original shape. In that case the receiver reconstructs the tensor with permuted values, producing incorrect pixel data for vision models. The previous fallback (iter()) preserved logical order; this optimization should only apply to standard-layout slices.


Member

@CatherineSue CatherineSue left a comment

@Kangyan-Zhou the PR LGTM. But there seems to be a merge conflict with the main branch. Can you resolve the conflicts? After a fresh CI pass, I will merge this one.

@CatherineSue
Member

Nvm, realized that I can merge it myself.

@CatherineSue
Member

FYI, you can try installing the pre-commit hook under smg. It will run cargo fmt and clippy on changed files, plus commit-message validation. And CC will strip the co-authored-by trailer automatically.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
model_gateway/src/routers/grpc/multimodal.rs (1)

711-715: ⚠️ Potential issue | 🟠 Major

Remove as_slice_memory_order() from this fast path.

This still contradicts the warning on Lines 729-731: as_slice_memory_order() can emit storage-order bytes for non-C-contiguous tensors while you still send the logical shape. The .iter() branch already handles non-C layouts correctly, and Kimi K2.5 already materializes standard-layout tensors, so this fallback only reopens the corruption path.

🛠️ Proposed fix
-    let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
-        .pixel_values
-        .as_slice()
-        .or_else(|| preprocessed.pixel_values.as_slice_memory_order())
-    {
+    let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed.pixel_values.as_slice() {
🏁 Script executed:

#!/bin/bash
set -euo pipefail
sed -n '709,737p' model_gateway/src/routers/grpc/multimodal.rs
sed -n '208,216p' crates/multimodal/src/vision/processors/kimi_k25.rs
sed -n '288,301p' crates/multimodal/src/vision/processors/kimi_k25.rs

Expected: the serializer still uses as_slice_memory_order(), while the Kimi K2.5 processor already guarantees standard-layout output, so .iter() should remain the only non-contiguous fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 711-715: the
fast-path for building pixel_bytes incorrectly falls back to
preprocessed.pixel_values.as_slice_memory_order(), which can yield storage-order
bytes for non-C-contiguous tensors; remove the as_slice_memory_order() call so
the fast path only uses preprocessed.pixel_values.as_slice(), and rely on the
existing .iter() branch to handle non-contiguous layouts (given Kimi K2.5
already outputs standard-layout tensors). Update the code around the pixel_bytes
construction (the let pixel_bytes: Vec<u8> = ... block referencing
preprocessed.pixel_values) to drop .as_slice_memory_order() and keep the .iter()
fallback unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/routers/grpc/common/stages/request_execution.rs`:
- Around lines 239-242: in execute_sequential_pd(), mirror the same multimodal
stripping done in execute_dual_dispatch(): create a mutable copy of the incoming
proto_request (e.g., let mut decode_request = proto_request;) and call
decode_request.clear_mm_inputs() before forwarding to the decode worker so the
vLLM PD path does not send full multimodal tensors; ensure the code uses this
stripped decode_request where execute_dual_dispatch uses it.

In `@model_gateway/src/routers/grpc/proto_wrapper.rs`:
- Around line 414-419: The match in clear_mm_inputs() is missing the
ProtoGenerateRequest::Mlx variant; add a Self::Mlx arm to make the match
exhaustive — if the Mlx inner request has an mm_inputs field set it to None
(e.g. Self::Mlx(req) => req.mm_inputs = None), otherwise add a no-op arm
(Self::Mlx(_) => {}) so the function compiles and covers the Mlx variant.

---

Duplicate comments:
In `@model_gateway/src/routers/grpc/multimodal.rs`:
- Around line 711-715: The fast-path for building pixel_bytes incorrectly falls
back to preprocessed.pixel_values.as_slice_memory_order(), which can yield
storage-order bytes for non-C-contiguous tensors; remove the
as_slice_memory_order() call so the fast path only uses
preprocessed.pixel_values.as_slice(), and rely on the existing .iter() branch to
handle non-contiguous layouts (given Kimi K2.5 already outputs standard-layout
tensors). Update the code around the pixel_bytes construction (the let
pixel_bytes: Vec<u8> = ... block referencing preprocessed.pixel_values) to drop
.as_slice_memory_order() and keep the .iter() fallback unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5d8a888c-5dd1-46c4-a999-a5bb5b81b196

📥 Commits

Reviewing files that changed from the base of the PR and between c42b225 and 5c6a8b7.

📒 Files selected for processing (3)
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs

Comment thread model_gateway/src/routers/grpc/common/stages/request_execution.rs
Comment thread model_gateway/src/routers/grpc/proto_wrapper.rs
Add ModelProcessorSpec and ImagePreProcessor for moonshotai/Kimi-K2.5
so the gRPC PD router can handle multimodal (image) requests.

- KimiK25VisionSpec: matches "kimi" + "k2" model IDs, uses
  <|media_pad|> placeholder (media_placeholder_token_id from config),
  NaViT-style field layouts identical to Qwen-VL family
- KimiK25Processor: wraps QwenVLProcessorBase with Kimi-specific
  defaults (patch_size=14, merge_size=2, normalization=[0.5,0.5,0.5],
  max_pixels=3,211,264 from in_patch_limit=16384)
- Fix get_zmq_socket import path for sglang main compat

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
…locally

When the tokenizer source is a HuggingFace model ID (e.g.,
"moonshotai/Kimi-K2.5") rather than a local directory, the gRPC router
cannot read config.json and preprocessor_config.json from disk. This
causes multimodal requests to fail with "Failed to read config.json".

Make get_or_load_config async and fall back to downloading the two
config files from HF Hub via the new download_files_from_hf helper
when the local path doesn't exist.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Address review findings:
- Log errors from HF Hub downloads instead of silently swallowing them
- Add explicit error when local model directory exists but config.json
  is missing (prevents misleading fallback to HF Hub)
- Upgrade fallback log from debug to warn for production visibility

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
TiktokenTokenizer::encode() was using encode_ordinary() which ignores
special tokens in the input text. This caused chat-template-rendered
special tokens like <|media_pad|> to be split into BPE sub-tokens
instead of being recognized as single token IDs.

Switch to encode_with_special_tokens() unconditionally, matching
HuggingFace tokenizer behavior where added special tokens are always
recognized in input text. This fixes Kimi-K2.5 multimodal where the
chat template inserts <|media_pad|> (ID 163605) but the tokenizer
was producing sub-tokens that expand_tokens couldn't find.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
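The encode_ordinary() vs. encode_with_special_tokens() distinction in this commit can be sketched in miniature (illustrative Rust, not the actual tiktoken API; `bpe_stub` stands in for real BPE and emits one id per byte): special tokens must be matched as whole strings before ordinary BPE runs, or they dissolve into sub-tokens.

```rust
use std::collections::HashMap;

// Stand-in for ordinary BPE: one id per byte (illustrative only).
fn bpe_stub(text: &str) -> Vec<u32> {
    text.bytes().map(|b| b as u32).collect()
}

// Sketch of special-token-aware encoding: scan for special-token
// strings first, BPE-encode only the text between them.
fn encode_with_special_tokens(text: &str, specials: &HashMap<&str, u32>) -> Vec<u32> {
    let mut ids = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Find the earliest special-token match in the remaining text.
        let hit = specials
            .iter()
            .filter_map(|(&tok, &id)| rest.find(tok).map(|pos| (pos, tok, id)))
            .min_by_key(|(pos, _, _)| *pos);
        match hit {
            Some((pos, tok, id)) => {
                ids.extend(bpe_stub(&rest[..pos])); // ordinary BPE on the prefix
                ids.push(id);                       // special token as a single id
                rest = &rest[pos + tok.len()..];
            }
            None => {
                ids.extend(bpe_stub(rest));
                break;
            }
        }
    }
    ids
}

fn main() {
    let specials = HashMap::from([("<|media_pad|>", 163605u32)]);
    let ids = encode_with_special_tokens("a<|media_pad|>b", &specials);
    // 'a' = 97, the placeholder as one id, 'b' = 98.
    assert_eq!(ids, vec![97, 163605, 98]);
}
```

An ordinary-only encoder would instead run `bpe_stub` over the full string, splitting `<|media_pad|>` into pieces that expand_tokens cannot match.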
Kimi-K2.5 engine accesses `item.grid_thws` (plural) on
MultimodalDataItem, but the gateway was sending `image_grid_thw`
(Qwen-VL convention). Rename the key in the processor output and
update field_layouts/keep_on_cpu_keys to match.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Remove QwenVLProcessorBase dependency. Kimi-K2.5's MoonViT expects
pixel_values as [N, C, patch_size, patch_size] (4D), not flattened
[N, C*T*patch_size*patch_size] (2D) like Qwen-VL. The model's
PatchEmbed3d applies Conv2d on each patch directly.

Implement smart_resize and extract_patches independently, producing
[total_patches, 3*14*14] = [N, 588] patches that the engine
reconstructs as [N, 3, 14, 14] for Conv2d input.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
The engine's PatchEmbed3d Conv2d expects 4D input [N, C, H, W] but the
gateway was serializing pixel_values as 2D [N, C*patch_size*patch_size].
Store as ndarray::Array4 so the proto shape field is [N, 3, 14, 14],
which the engine reconstructs correctly for Conv2d.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
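The patch extraction described above can be sketched in pure Python (dimensions and layout assumed from the commit messages; this is an illustration, not the crate's code). From a padded C-major image, each 14x14 grid cell becomes one flattened 3*14*14 = 588-element patch in channel-major order, so each row can later be reshaped to [3, 14, 14] for Conv2d:

```python
PATCH = 14  # patch size assumed from the PR description

def extract_patches(chw, channels, height, width):
    """chw: flat list in C-major [C][H][W] order; returns 588-float patches."""
    patches = []
    for gy in range(height // PATCH):
        for gx in range(width // PATCH):
            patch = []
            for c in range(channels):
                for py in range(PATCH):
                    row_start = (c * height * width
                                 + (gy * PATCH + py) * width
                                 + gx * PATCH)
                    # Copy 14 contiguous elements per row (mirrors the
                    # extend_from_slice optimization in the Rust code).
                    patch.extend(chw[row_start:row_start + PATCH])
            patches.append(patch)
    return patches

h, w = 28, 28  # one 2x2 patch grid
img = [float(i) for i in range(3 * h * w)]
patches = extract_patches(img, 3, h, w)
assert len(patches) == 4 and all(len(p) == 588 for p in patches)
```

The row-based slice copy is the same idea as the Rust `extend_from_slice` change: per-row memcpy instead of per-element indexing.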
The previous smart_resize (from Qwen-VL) resized images directly to
factor-aligned dimensions, stretching the content. The HF Kimi
preprocessor instead:
1. Computes scale capped at 1.0 (never upscales)
2. Resizes with BICUBIC interpolation
3. Zero-pads to factor-aligned dimensions

This mismatch caused degraded image quality — the model was trained
with zero-padded images, not stretched ones. Rewrite to match the
HF navit_resize_image + resize_image pipeline exactly.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
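The resize policy above can be sketched as follows. This is a simplified illustration, not the crate's implementation: the constants (14-px patches, 2x2 merge giving a factor of 28, and a 16384-patch budget) are assumed from the PR and review discussion, and the real pipeline also applies BICUBIC interpolation, which is omitted here.

```python
import math

PATCH_SIZE = 14
MERGE_SIZE = 2
FACTOR = PATCH_SIZE * MERGE_SIZE              # dims padded to multiples of 28
MAX_PIXELS = 16384 * PATCH_SIZE * PATCH_SIZE  # patch budget * patch area

def navit_resize_dims(width, height):
    # 1. Scale capped at 1.0: shrink oversized images, never upscale.
    scale = min(1.0, math.sqrt(MAX_PIXELS / (width * height)))
    new_w, new_h = int(width * scale), int(height * scale)
    # 2. (Resize with BICUBIC happens here in the real pipeline.)
    # 3. Zero-pad (not stretch) up to factor-aligned dimensions.
    padded_w = math.ceil(new_w / FACTOR) * FACTOR
    padded_h = math.ceil(new_h / FACTOR) * FACTOR
    return (new_w, new_h), (padded_w, padded_h)

(_, _), (pw, ph) = navit_resize_dims(640, 480)
assert (pw, ph) == (644, 504)  # padded, not stretched, to 28-aligned dims
```

The key contrast with the old Qwen-VL-style smart_resize is step 3: content keeps its aspect ratio and the alignment gap is filled with zeros, matching how the model was trained.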
download_files_from_hf was silently failing in production (likely
hf-hub crate issue). Switch to download_tokenizer_from_hf which
already works for tokenizer loading and returns the HF cache
directory containing config.json and preprocessor_config.json.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
download_tokenizer_from_hf only downloads tokenizer files (filtered by
is_tokenizer_file), not config.json or preprocessor_config.json. Add a
dedicated download_model_configs_from_hf that fetches these two files
on the first multimodal request.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Add detailed logging at key points:
- Image dimensions, color type, raw bytes size after fetch
- Pixel values shape, token counts, first/last pixels, min/max
- Serialized pixel_values bytes and shape
- Token expansion details (search_token_id, placeholders, offsets)

Also use download_model_configs_from_hf and remove dead code.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
KimiK25Processor::preprocess() was reading mean/std from
PreProcessorConfig which falls back to CLIP values when the config
can't be parsed (Kimi's preprocessor_config.json nests values under
media_proc_cfg). This caused images to be normalized with CLIP
mean=[0.48,0.46,0.41] std=[0.27,0.26,0.28] instead of Kimi's
mean=[0.5,0.5,0.5] std=[0.5,0.5,0.5], producing wrong pixel values
that made the model misinterpret images entirely.

Use self.default_mean()/default_std() which are hardcoded to the
correct Kimi values.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
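A small worked example shows why the mean/std mismatch was so damaging. With Kimi's mean = std = 0.5, a pixel value v in [0, 1] maps linearly to 2v - 1, i.e. the range [-1, 1]; CLIP statistics shift and scale the same pixel quite differently (only the first-channel CLIP values from the commit message are used here, for illustration):

```python
KIMI_MEAN, KIMI_STD = 0.5, 0.5
CLIP_MEAN, CLIP_STD = 0.48, 0.27  # first-channel CLIP values, per the commit

def normalize(v, mean, std):
    return (v - mean) / std

assert normalize(0.0, KIMI_MEAN, KIMI_STD) == -1.0
assert normalize(1.0, KIMI_MEAN, KIMI_STD) == 1.0
# Same white pixel under CLIP stats lands at ~1.93 instead of 1.0:
assert abs(normalize(1.0, CLIP_MEAN, CLIP_STD) - 1.0) > 0.5
```

Every input pixel was systematically off by this much, which is consistent with the model misinterpreting images entirely rather than degrading gracefully.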
Instead of hardcoding normalization values, parse the nested
media_proc_cfg structure in Kimi's preprocessor_config.json to extract
image_mean, image_std, patch_size, and merge_kernel_size. This ensures
the correct values are used regardless of how the config is structured.

The previous fix hardcoded [0.5,0.5,0.5] in the processor, which
worked but would break if values changed. Now from_json() checks for
media_proc_cfg when top-level fields are missing.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
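The fallback logic can be sketched like this (key names follow the PR; the function and surrounding structure are illustrative, not the crate's actual API): prefer top-level fields, fall back to the nested `media_proc_cfg` object, then to defaults.

```python
import json

def parse_preprocessor_config(raw):
    cfg = json.loads(raw)
    nested = cfg.get("media_proc_cfg", {})

    def pick(key, default):
        # Top-level field wins; otherwise check the nested Kimi-style object.
        return cfg.get(key, nested.get(key, default))

    return {
        "image_mean": pick("image_mean", [0.5, 0.5, 0.5]),
        "image_std": pick("image_std", [0.5, 0.5, 0.5]),
        "patch_size": pick("patch_size", 14),
        "merge_kernel_size": pick("merge_kernel_size", [2, 2]),
    }

# Kimi-style config: everything nested under media_proc_cfg.
raw = '{"media_proc_cfg": {"image_mean": [0.5, 0.5, 0.5], "patch_size": 14}}'
cfg = parse_preprocessor_config(raw)
assert cfg["image_mean"] == [0.5, 0.5, 0.5]
assert cfg["patch_size"] == 14
```

This keeps the processor robust to either config layout instead of baking the values into code.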
Revert info-level diagnostic logging back to debug level now that the
normalization root cause has been identified and fixed.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
- Make from_value consistent with from_json by delegating to it,
  ensuring nested media_proc_cfg extraction applies to both paths
- Add test for encode_with_special_tokens verifying special token
  strings in input produce single token IDs

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
…tion

Two optimizations for the Kimi-K2.5 image preprocessing pipeline:

1. Fuse resize + pad + normalize into a single pass using
   deinterleave_rgb_to_planes with precomputed scale/bias. Eliminates
   2 intermediate Array3 allocations and 2 extra passes over pixel data.

2. Replace per-element scalar indexing in extract_patches with
   row-based extend_from_slice (14-element memcpy per row), enabling
   compiler auto-vectorization.

Also take upstream multimodal.rs which has resolve_model_config_dir
and updated image_processor_registry.find() API.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Log timing breakdown: image fetch, config load, preprocessing,
token expansion, and assembly/serialization. This helps identify
which step dominates TTFT for multimodal gRPC requests.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
…inputs

Two optimizations to reduce gRPC multimodal TTFT:

1. Move image preprocessing (resize + pad + normalize + patchify) to
   tokio::task::spawn_blocking so CPU-intensive work doesn't block
   the async runtime. Under 200 concurrent requests, this prevents
   serialized preprocessing from inflating tail latencies.

2. Strip mm_inputs from decode worker requests in PD dual dispatch.
   The decode worker only needs the KV cache from prefill — sending
   ~40MB of pixel tensors to it was pure waste.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
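The first optimization above can be sketched in asyncio terms (a simplified analogy, not the router's code): `loop.run_in_executor` plays the role of `tokio::task::spawn_blocking`, moving CPU-bound preprocessing off the event loop so concurrent requests are not serialized behind it. The `preprocess` function is a stand-in for the real resize + pad + normalize + patchify work.

```python
import asyncio

def preprocess(pixels):
    # CPU-bound stand-in for resize + pad + normalize + patchify.
    return [p / 255.0 for p in pixels]

async def handle_request(pixels):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) ~ tokio::task::spawn_blocking:
    # the work runs on a thread pool, the event loop keeps serving.
    return await loop.run_in_executor(None, preprocess, pixels)

out = asyncio.run(handle_request([0, 255]))
assert out == [0.0, 1.0]
```

Without the offload, each request's preprocessing would block the single async executor, which is exactly the tail-latency inflation observed at 200 concurrent requests.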
Replace image::resize_exact(CatmullRom) with transforms::resize()
which uses fast_image_resize (AVX2/SSE4 SIMD). This is a drop-in
replacement that gives 3-5x faster BICUBIC resize — the dominant
CPU cost in the preprocessing pipeline.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Remove info-level timing logs (fetch_ms, config_ms, preprocess_ms,
expand_ms, serialize_ms) now that performance analysis is complete.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
- Remove dead download_model_configs_from_hf (replaced by upstream
  resolve_model_config_dir)
- Extract in_patch_limit/patch_limit_on_one_side from media_proc_cfg
  into config.extra, read in from_preprocessor_config
- Always check media_proc_cfg for all fields, not just when
  image_mean/std are missing (fixes partial config overlap)
- Log warning when placeholder_token_id fails instead of silent .ok()
- Add config_model_type fallback to KimiK25VisionSpec::matches
- Add tests: 1x1 image, empty batch, from_preprocessor_config limits
- Improve tiktoken encode comment explaining why
  encode_with_special_tokens is used

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Make Kimi-K2.5 code comments self-contained instead of comparing
against Qwen-VL internals.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
…arse_mm_inputs

sglang v0.5.10 mm_utils.has_shm_features() accesses req.mm_inputs.mm_items
via attribute, which fails when mm_inputs is a plain dict. Return a proper
MultimodalInputs dataclass to fix the AttributeError crash on VLM requests
in gRPC mode.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
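A minimal sketch of the failure and the fix (the class and field names follow the commit message; the exact shape of sglang's type is assumed): attribute access on a plain dict raises AttributeError, while a dataclass provides the interface `has_shm_features()` expects.

```python
from dataclasses import dataclass, field

@dataclass
class MultimodalInputs:
    mm_items: list = field(default_factory=list)

# What sglang's mm_utils.has_shm_features() effectively does:
plain = {"mm_items": []}
try:
    plain.mm_items  # dict has no such attribute -> crash on VLM requests
    crashed = False
except AttributeError:
    crashed = True
assert crashed

fixed = MultimodalInputs(mm_items=[])
assert fixed.mm_items == []  # attribute access now works
```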
- Move DynamicImage cloning inside spawn_blocking so the expensive
  deep-copy runs on the blocking thread pool instead of the tokio
  async runtime.
- Document why tiktoken's encode() intentionally ignores the
  add_special_tokens parameter (semantic mismatch with HF backend).

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
- Refactor preprocessor_config: extract nested media_proc_cfg logic
  into shared apply_nested_media_cfg() to avoid round-trip
  serialization in from_value().
- Add TODO for configurable blocking thread pool size.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
Add Mlx arm to the ProtoGenerateRequest::clear_mm_inputs match statement.
The Mlx proto (like TRT-LLM) has no mm_inputs field, so the arm is a no-op.
This fixes a non-exhaustive pattern error introduced when the Mlx variant
was added to ProtoGenerateRequest in lightseekorg#1034.

Signed-off-by: Kangyan Zhou <zky314343421@gmail.com>
@Kangyan-Zhou Kangyan-Zhou force-pushed the feat/kimi-k25-vision-grpc branch from 5c6a8b7 to 722264b on April 14, 2026 at 23:55

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 722264beca


Comment on lines +32 to +33
id.contains("kimi") && id.contains("k2")
|| metadata

P2: Narrow Kimi spec match to vision model identifiers

The match condition accepts any model ID containing both "kimi" and "k2", which also captures non-vision Kimi-K2 variants (for example instruct/text checkpoints). In multimodal requests this routes those models through the Kimi-K2.5 vision path and then fails later (e.g., missing media_placeholder_token_id) instead of cleanly reporting unsupported multimodal capability. Restricting the predicate to explicit vision identifiers (or model_type == "kimi_k25") avoids these false positives and incorrect routing.



@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (7)
model_gateway/src/routers/grpc/common/stages/request_execution.rs (1)

239-242: ⚠️ Potential issue | 🟠 Major

Apply the same mm_inputs stripping in sequential PD decode path.

This only strips decode payloads in execute_dual_dispatch. The execute_sequential_pd() decode request path (Line 376 onward) still forwards full multimodal tensors to the decode worker.

🛠️ Proposed fix
-        let mut decode_request = proto_request;
+        let mut decode_request = proto_request;
+        decode_request.clear_mm_inputs();
         if let Some((remote_host, remote_port)) = kv_transfer_params {
             debug!(
                 remote_host = %remote_host,
                 remote_port = remote_port,
                 "vLLM PD: injecting kv_transfer_params into decode request"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/common/stages/request_execution.rs` around
lines 239 - 242, The sequential PD decode path in execute_sequential_pd is still
forwarding full multimodal tensors; mirror the fix from execute_dual_dispatch by
creating a mutable copy of the decode proto (e.g., let mut decode_request =
proto_request.clone() or similar) and call clear_mm_inputs() on it before
sending to the decode worker so only the KV cache is forwarded; update the
execute_sequential_pd flow to use this stripped decode_request when invoking the
decode worker, referencing execute_sequential_pd, execute_dual_dispatch,
decode_request, and clear_mm_inputs to locate where to apply the change.
crates/multimodal/src/vision/preprocessor_config.rs (2)

267-273: ⚠️ Potential issue | 🟠 Major

Support object-form media_proc_cfg.patch_size during nested extraction.

The nested path currently handles only numeric patch_size. If media_proc_cfg.patch_size is object-shaped ({"height":...,"width":...}), it is ignored and defaults are used.

🛠️ Proposed fix
         if config.patch_size.is_none() {
             config.patch_size = media_cfg.get("patch_size").and_then(|v| {
-                v.as_u64().map(|ps| PatchSize {
-                    height: Some(ps as u32),
-                    width: Some(ps as u32),
-                })
+                v.as_u64()
+                    .and_then(|ps| u32::try_from(ps).ok())
+                    .map(|ps| PatchSize {
+                        height: Some(ps),
+                        width: Some(ps),
+                    })
+                    .or_else(|| {
+                        let obj = v.as_object()?;
+                        let h = obj
+                            .get("height")?
+                            .as_u64()
+                            .and_then(|x| u32::try_from(x).ok())?;
+                        let w = obj
+                            .get("width")
+                            .and_then(|x| x.as_u64())
+                            .and_then(|x| u32::try_from(x).ok())
+                            .unwrap_or(h);
+                        Some(PatchSize {
+                            height: Some(h),
+                            width: Some(w),
+                        })
+                    })
             });
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/preprocessor_config.rs` around lines 267 - 273,
The current extraction only handles numeric `media_cfg.get("patch_size")` and
ignores object-form values; update the conditional that sets `config.patch_size`
(the block using `media_cfg.get("patch_size").and_then(...)`) to also accept
object-shaped inputs by checking for an object/value map and reading `"height"`
and `"width"` (e.g., via `v.get("height").and_then(|h| h.as_u64())` and
similarly for width) and then constructing `PatchSize { height: Some(h as u32),
width: Some(w as u32) }` when present; keep the existing numeric branch that
fills both dimensions from a single u64 so both forms are supported.

521-550: ⚠️ Potential issue | 🟡 Minor

Extend the nested-config test to assert extracted limits and object-form patch size.

This test currently validates scalar nested values but misses in_patch_limit / patch_limit_on_one_side assertions and the object-form nested patch_size branch.

🧪 Suggested test additions
         assert_eq!(config.get_patch_size(0), 14);
         assert_eq!(config.merge_size, Some(2));
+        let in_patch_limit: Option<usize> = config.get_extra("in_patch_limit");
+        let patch_limit_on_one_side: Option<usize> = config.get_extra("patch_limit_on_one_side");
+        assert_eq!(in_patch_limit, Some(16384));
+        assert_eq!(patch_limit_on_one_side, Some(512));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/preprocessor_config.rs` around lines 521 - 550,
The test test_parse_kimi_nested_media_proc_cfg should be expanded to assert that
numeric limits from the nested media_proc_cfg are parsed: check
PreProcessorConfig::from_json result for the in_patch_limit (use the getter or
field that exposes it, e.g., config.get_in_patch_limit() or
config.in_patch_limit) and patch_limit_on_one_side (e.g.,
config.get_patch_limit_on_one_side() or config.patch_limit_on_one_side) equal
16384 and 512 respectively, and add an assertion for the object-form patch_size
branch by supplying or asserting the behavior of config.get_patch_size for an
index that triggers the object-form handling (verify it returns 14 as expected);
keep using existing helpers get_image_mean, get_image_std, get_patch_size and
the merge_size field to locate where to add these assertions.
model_gateway/src/routers/grpc/multimodal.rs (1)

711-715: ⚠️ Potential issue | 🟠 Major

Remove as_slice_memory_order() from the pixel fast path.

Line 714 still reintroduces the layout-corruption case called out in the comment below: a Fortran-contiguous tensor can be serialized in memory order instead of logical row-major order. The .iter() branch already handles every non-C-contiguous layout safely.

💡 Suggested fix
-    let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
-        .pixel_values
-        .as_slice()
-        .or_else(|| preprocessed.pixel_values.as_slice_memory_order())
-    {
+    let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed.pixel_values.as_slice() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/routers/grpc/multimodal.rs` around lines 711 - 715, The
fast-path for building pixel_bytes is currently using
preprocessed.pixel_values.as_slice().or_else(||
preprocessed.pixel_values.as_slice_memory_order()), which can reintroduce
Fortran-contiguous layout corruption; remove the as_slice_memory_order() call so
the fast path only uses as_slice() and let the existing .iter() branch handle
non-C-contiguous tensors safely. Update the condition that constructs
pixel_bytes to only attempt preprocessed.pixel_values.as_slice() and fall back
to the iterator branch when that returns None, ensuring pixel_bytes is produced
from the iterator for any non-C-contiguous layout.
crates/multimodal/src/vision/processors/kimi_k25.rs (3)

184-190: ⚠️ Potential issue | 🟠 Major

Return TransformError here instead of panicking.

Lines 188-189 and 214-216 still turn request-time shape/layout regressions into panics. Thread Result through resize_pad_and_normalize() and extract_patches() so the caller can surface a normal preprocessing error instead.

Based on learnings, multimodal vision processors in this repo should avoid expect/unreachable in production code and document safe-code assumptions with INVARIANT: comments instead.

Also applies to: 210-216

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 184 - 190,
The code currently panics via Array3::from_shape_vec(...).expect(...) and other
expect/unreachable sites in resize_pad_and_normalize() and extract_patches();
change those functions to return Result<..., TransformError> (thread Result up
to the caller), replace .expect(...)/.unwrap()/unreachable!() with map_err(|e|
TransformError::Preprocessing { source: e.into(), context: "array shape/layout
mismatch" }) or an appropriate TransformError variant, and propagate errors
using ?; also add an INVARIANT: comment near the array construction explaining
the assumption about data length so callers know why the check exists. Ensure
both resize_pad_and_normalize() and extract_patches() signatures and all call
sites are updated to handle the Result return.

245-317: ⚠️ Potential issue | 🟠 Major

Use the effective Kimi config inside the live preprocessing path.

from_preprocessor_config() parses patch_size, merge_size, in_patch_limit, and patch_limit_on_one_side, but preprocess() and calculate_num_tokens() still drive geometry from the shared registry instance. Any non-default media_proc_cfg will silently emit the wrong patch grid, placeholder counts, and pixel_values shape.

💡 Apply the merged config per call
     fn preprocess(
         &self,
         images: &[DynamicImage],
         config: &PreProcessorConfig,
     ) -> Result<PreprocessedImages, TransformError> {
+        let runtime = Self::from_preprocessor_config(config);
+
         if images.is_empty() {
             return Err(TransformError::EmptyBatch);
         }
@@
-            let cfg = self.compute_resize_config(w as usize, h as usize);
+            let cfg = runtime.compute_resize_config(w as usize, h as usize);
@@
-            let grid_h = padded_h / self.patch_size;
-            let grid_w = padded_w / self.patch_size;
+            let grid_h = padded_h / runtime.patch_size;
+            let grid_w = padded_w / runtime.patch_size;
@@
-            let patches = Self::extract_patches(&tensor, self.patch_size);
+            let patches = Self::extract_patches(&tensor, runtime.patch_size);
@@
         let pixel_values = ndarray::Array4::from_shape_vec(
-            (total_patches, 3, self.patch_size, self.patch_size),
+            (total_patches, 3, runtime.patch_size, runtime.patch_size),
             all_patches,
         )
@@
-    fn calculate_num_tokens(&self, width: u32, height: u32, _config: &PreProcessorConfig) -> usize {
-        self.compute_resize_config(width as usize, height as usize)
+    fn calculate_num_tokens(&self, width: u32, height: u32, config: &PreProcessorConfig) -> usize {
+        Self::from_preprocessor_config(config)
+            .compute_resize_config(width as usize, height as usize)
             .num_tokens
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 245 - 317,
preprocess() and calculate_num_tokens() are using the shared registry geometry
via compute_resize_config instead of the per-call Kimi settings parsed by
from_preprocessor_config(), which causes wrong grids and shapes for non-default
media_proc_cfg; fix by calling the per-call parser (e.g. let kimi_cfg =
Self::from_preprocessor_config(config) or self.from_preprocessor_config(config))
at the start of preprocess() and calculate_num_tokens() and then use the parsed
kimi_cfg (either pass it into compute_resize_config or use
kimi_cfg.patch_size/num_tokens) when computing resize config, extracting
patches, and returning num_tokens so geometry honors the effective Kimi config
for each call.

96-125: ⚠️ Potential issue | 🟠 Major

Cap against the aligned patch grid, not the pre-pad estimate.

in_patch_limit is checked from floor(width / patch_size) * floor(height / patch_size), but the real patch count is determined after both axes are rounded up to the 28-pixel factor. Near the cap that still overshoots; with the defaults, 1792x1806 ends up padded to 1792x1820, i.e. 16_640 pre-merge patches, above the configured 16_384.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/multimodal/src/vision/processors/kimi_k25.rs` around lines 96 - 125,
The current compute_resize_config computes a scale based on pre-pad patch
counts, but the real token count is determined after aligning
new_width/new_height to factor and padding; this can still exceed in_patch_limit
(e.g., 1792x1806 -> padded tokens > limit). Fix: after computing new_w/new_h,
pad_width, pad_height and token_width/token_height, check if num_tokens >
self.in_patch_limit and if so reduce the target size (e.g., shrink new_w/new_h
proportionally or decrement token_width/token_height) and recompute pads/tokens
in a small loop until token_width * token_height <= self.in_patch_limit; ensure
you still respect the per-side cap self.patch_limit_on_one_side * ps and use
self.factor() for alignment; make the change inside compute_resize_config so the
returned ResizeConfig guarantees num_tokens <= in_patch_limit.


📥 Commits

Reviewing files that changed from the base of the PR and between 5c6a8b7 and 722264b.

📒 Files selected for processing (11)
  • crates/multimodal/src/registry/kimi_k25.rs
  • crates/multimodal/src/registry/mod.rs
  • crates/multimodal/src/vision/image_processor.rs
  • crates/multimodal/src/vision/preprocessor_config.rs
  • crates/multimodal/src/vision/processors/kimi_k25.rs
  • crates/multimodal/src/vision/processors/mod.rs
  • crates/tokenizer/src/tiktoken.rs
  • grpc_servicer/smg_grpc_servicer/sglang/servicer.py
  • model_gateway/src/routers/grpc/common/stages/request_execution.rs
  • model_gateway/src/routers/grpc/multimodal.rs
  • model_gateway/src/routers/grpc/proto_wrapper.rs


@CatherineSue CatherineSue left a comment


Approved. I'm just leaving a few nitpicks as reminders for me or others to address later.

Comment on lines +704 to +708
let pixel_bytes: Vec<u8> = if let Some(pixel_slice) = preprocessed
.pixel_values
.as_slice()
.or_else(|| preprocessed.pixel_values.as_slice_memory_order())
{

This one is right, but it's really a nit right now: we don't have many non-C-contiguous tensors yet. We can address this in a follow-up PR.

{
// Zero-copy reinterpret: &[f32] → &[u8] on little-endian (x86).
// This replaces the per-element flat_map(to_le_bytes) which was the
// #1 CPU hotspot (13% of SMG CPU in profiling).

I need to remove this kind of comment.

@CatherineSue CatherineSue merged commit 15e2017 into lightseekorg:main Apr 15, 2026
74 of 75 checks passed
TingtingZhou7 pushed a commit to TingtingZhou7/smg that referenced this pull request Apr 18, 2026

Labels

grpc: gRPC client and router changes
model-gateway: Model gateway crate changes
multimodal: Multimodal crate changes
tokenizer: Tokenizer related changes
