feat(ci): add Devstral 2 nightly benchmark #1022

nishanthp wants to merge 3 commits into lightseekorg:main
Conversation
Signed-off-by: Nishanth Prakash <nishanth.prakash@gmail.com>
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request adds the 'mistralai/Devstral-2-123B-Instruct-2512' model to the nightly performance benchmarks and updates the infrastructure model specifications. The reviewer noted an inconsistency between the SGLang and vLLM configurations regarding tool choice flags, suggesting that these should be aligned to ensure a fair benchmark comparison.
| "worker_args": ["--tool-call-parser", "mistral"], | ||
| "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"], |
There's an inconsistency in the configured arguments for SGLang (worker_args) and vLLM (vllm_args). The vllm_args enable auto tool choice, while worker_args do not. For a fair benchmark comparison, both runtimes should be configured with the same features. Please either add the corresponding flag to worker_args if SGLang supports it, or remove --enable-auto-tool-choice from vllm_args if it does not.
let's remove `--enable-auto-tool-choice`
@slin1237 @CatherineSue Should I just remove --enable-auto-tool-choice from vllm_args?
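For illustration only, the alignment discussed above could look like the following sketch. This is not the actual repo code; it just shows the two arg lists after dropping the extra vLLM flag, keeping the `worker_args`/`vllm_args` key names from the diff as-is:

```python
# Hypothetical sketch: align the runtime args so SGLang and vLLM benchmark
# the same feature set, per the review thread. Key names follow the diff
# under discussion; "--enable-auto-tool-choice" is dropped from vllm_args.
spec = {
    "worker_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral"],
}

# Parity check: both engines now receive the same flags.
assert set(spec["worker_args"]) == set(spec["vllm_args"])
print("runtime args aligned")
```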
```yaml
- { id: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, slug: meta-llama-Llama-4-Maverick-17B-128E-Instruct-FP8, test_class: TestNightlyLlama4MaverickSingle }
- id: mistralai/Devstral-2-123B-Instruct-2512
  slug: mistralai-Devstral-2-123B-Instruct-2512
  test_class: TestNightlyDevstral123bSingle
```
nit: let's collapse to one line for parity
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fccc6dfe1
| "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"), | ||
| "tp": 8, | ||
| "features": ["chat", "streaming", "function_calling"], | ||
| "worker_args": ["--tool-call-parser", "mistral"], |
Use recognized SGLang spec key for Devstral args
worker_args is not consumed by the worker launcher, so this new Devstral setting is currently a no-op. In e2e_test/infra/worker.py, SGLang startup only reads spec.get("sglang_args", []) (and vLLM reads vllm_args), which means the --tool-call-parser mistral flag is never applied for the sglang nightly job and those benchmark runs execute with unintended runtime settings.
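To make the no-op concrete, here is a minimal sketch of how a launcher like the one described might consume spec keys. This is illustrative code, not the actual `e2e_test/infra/worker.py`; the only assumption taken from the review is that SGLang startup reads `spec.get("sglang_args", [])`:

```python
# Illustrative sketch of a spec-driven launcher. An unrecognized key such
# as "worker_args" is silently ignored, which is the no-op the review flags.

def build_sglang_command(spec: dict) -> list[str]:
    """Build a hypothetical SGLang launch command from a model spec dict."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", spec["model"],
        "--tp", str(spec.get("tp", 1)),
    ]
    # Only "sglang_args" is read here; "worker_args" never reaches the engine.
    cmd.extend(spec.get("sglang_args", []))
    return cmd

spec = {
    "model": "mistralai/Devstral-2-123B-Instruct-2512",
    "tp": 8,
    "worker_args": ["--tool-call-parser", "mistral"],  # never consumed
}
print(build_sglang_command(spec))  # note: the mistral parser flag is absent
```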
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 83-91: The model spec for
"mistralai/Devstral-2-123B-Instruct-2512" mistakenly uses the unused key
worker_args instead of sglang_args, so the tool-call-parser config never reaches
the SGLang engine; change the key name from "worker_args" to "sglang_args" in
that dict so the value ["--tool-call-parser", "mistral"] is consumed (the worker
reads tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args,
startup_timeout in e2e_test/infra/worker.py), ensuring function_calling
tool-parser is applied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7f45b55c-ebe7-42ca-a245-490ee3262de5
📒 Files selected for processing (1)
e2e_test/infra/model_specs.py
```python
# Devstral 2 - nightly benchmarks
"mistralai/Devstral-2-123B-Instruct-2512": {
    "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
    "tp": 8,
    "features": ["chat", "streaming", "function_calling"],
    "worker_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
    "startup_timeout": 1200,
},
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify which spec keys are accessed in worker.py
echo "=== Keys accessed from spec dict in worker.py ==="
rg -n 'spec\.get\(' e2e_test/infra/worker.py
echo ""
echo "=== Any reference to worker_args? ==="
rg -n 'worker_args' e2e_test/
```

Repository: lightseekorg/smg
Length of output: 624
worker_args is never read; it must be renamed to sglang_args for the tool-call parser to be configured.
The spec keys accessed in e2e_test/infra/worker.py are: tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args, and startup_timeout. The worker_args key on line 88 is never consumed and has no effect. Since this model spec declares "function_calling" in features and includes tool-call-parser configuration, rename worker_args to sglang_args so the setting is actually applied to the SGLang engine.
🔧 Proposed fix

```diff
 # Devstral 2 - nightly benchmarks
 "mistralai/Devstral-2-123B-Instruct-2512": {
     "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
     "tp": 8,
     "features": ["chat", "streaming", "function_calling"],
-    "worker_args": ["--tool-call-parser", "mistral"],
+    "sglang_args": ["--tool-call-parser", "mistral"],
     "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
     "startup_timeout": 1200,
 },
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Devstral 2 - nightly benchmarks
"mistralai/Devstral-2-123B-Instruct-2512": {
    "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
    "tp": 8,
    "features": ["chat", "streaming", "function_calling"],
    "sglang_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
    "startup_timeout": 1200,
},
```
Description

Problem

The nightly benchmark workflow does not currently include coverage for `mistralai/Devstral-2-123B-Instruct-2512`, so regressions for that model are not captured in nightly benchmarking.

Solution

Add `mistralai/Devstral-2-123B-Instruct-2512` to the H200 single-worker nightly benchmark matrix and register its benchmark/test configuration in the e2e model specs.

Changes

- Added `mistralai/Devstral-2-123B-Instruct-2512` to the nightly benchmark workflow.
- Registered its model spec with `tp: 8`, tool-calling runtime args, and an extended startup timeout.

Test Plan

- `python3 -m py_compile e2e_test/benchmarks/test_nightly_perf.py e2e_test/infra/model_specs.py`
- `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/nightly-benchmark.yml"); puts "yaml ok"'`
- Verified `TestNightlyDevstral123bSingle` is registered in `MODEL_SPECS` with `tp=8`.
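As a follow-up idea beyond this PR's test plan, a small guard test could have caught the misspelled spec key the reviewers found. The sketch below is hypothetical: `KNOWN_KEYS` mirrors the keys the review says `worker.py` reads (plus `model`), and the `MODEL_SPECS` entry is a reduced stand-in for the real one:

```python
# Hypothetical guard test: reject unknown keys in MODEL_SPECS entries so a
# typo like "worker_args" fails CI instead of silently becoming a no-op.

KNOWN_KEYS = {
    "model", "tp", "features", "sglang_args", "vllm_args",
    "trtllm_extra_config", "trtllm_args", "startup_timeout",
}

# Reduced stand-in for the real spec dict, using the corrected key name.
MODEL_SPECS = {
    "mistralai/Devstral-2-123B-Instruct-2512": {
        "model": "mistralai/Devstral-2-123B-Instruct-2512",
        "tp": 8,
        "features": ["chat", "streaming", "function_calling"],
        "sglang_args": ["--tool-call-parser", "mistral"],
        "vllm_args": ["--tool-call-parser", "mistral"],
        "startup_timeout": 1200,
    },
}

def test_no_unknown_spec_keys():
    for name, spec in MODEL_SPECS.items():
        unknown = set(spec) - KNOWN_KEYS
        assert not unknown, f"{name} has unrecognized keys: {unknown}"

test_no_unknown_spec_keys()
print("spec keys ok")
```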