feat(ci): add Devstral 2 nightly benchmark#1022

Open
nishanthp wants to merge 3 commits into lightseekorg:main from nishanthp:add-devstral2-nightly-benchmark

Conversation

@nishanthp (Contributor) commented Apr 2, 2026

Description

Problem

The nightly benchmark workflow does not currently include coverage for mistralai/Devstral-2-123B-Instruct-2512, so regressions for that model are not captured in nightly benchmarking.

Solution

Add mistralai/Devstral-2-123B-Instruct-2512 to the H200 single-worker nightly benchmark matrix and register its benchmark/test configuration in the e2e model specs.

Changes

  • Added mistralai/Devstral-2-123B-Instruct-2512 to the nightly benchmark workflow.
  • Added a nightly benchmark test entry for Devstral.
  • Added a model spec for Devstral with tp: 8, tool-calling runtime args, and an extended startup timeout.

Test Plan

  • Run python3 -m py_compile e2e_test/benchmarks/test_nightly_perf.py e2e_test/infra/model_specs.py
  • Parse the workflow locally to confirm valid YAML:
    ruby -e 'require "yaml"; YAML.load_file(".github/workflows/nightly-benchmark.yml"); puts "yaml ok"'
  • Trigger the nightly benchmark workflow with the Devstral model selected.
  • Confirm the workflow schedules TestNightlyDevstral123bSingle.
  • Verify the model resolves correctly from MODEL_SPECS with tp=8.
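The last check in the plan can be sketched as a small standalone script. This is a minimal sketch, not the repo's actual test code: `MODEL_SPECS` here is a stand-in for the dict in e2e_test/infra/model_specs.py, and `check_spec` is a hypothetical helper.

```python
# Minimal sketch of the "resolves from MODEL_SPECS with tp=8" check.
# MODEL_SPECS is a stand-in for e2e_test/infra/model_specs.py; the real
# module resolves model paths and carries more keys per spec.
MODEL_SPECS = {
    "mistralai/Devstral-2-123B-Instruct-2512": {
        "tp": 8,
        "features": ["chat", "streaming", "function_calling"],
        "startup_timeout": 1200,
    },
}

def check_spec(model_id: str) -> dict:
    """Return the spec for model_id, failing loudly if it is missing."""
    spec = MODEL_SPECS.get(model_id)
    if spec is None:
        raise KeyError(f"no spec registered for {model_id}")
    return spec

spec = check_spec("mistralai/Devstral-2-123B-Instruct-2512")
assert spec["tp"] == 8
assert "function_calling" in spec["features"]
```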

Summary by CodeRabbit

  • Chores
    • Extended nightly benchmark and test infrastructure to include the mistralai/Devstral-2-123B-Instruct-2512 model with single-worker testing across HTTP and gRPC backends.
  • New Features
    • Declared support for chat, streaming, and function-calling capabilities for the added model.

Signed-off-by: Nishanth Prakash <nishanth.prakash@gmail.com>
@github-actions github-actions bot added ci CI/CD configuration changes tests Test changes labels Apr 2, 2026
@coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Adds the mistralai/Devstral-2-123B-Instruct-2512 model to nightly benchmarks by registering it in the CI workflow matrix, the nightly test model list, and the MODEL_SPECS with single-worker H200 configuration and tool-calling/runtime args.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI workflow: .github/workflows/nightly-benchmark.yml | Added a mistralai/Devstral-2-123B-Instruct-2512 entry to the single-worker-h200 job matrix with slug and test_class (TestNightlyDevstral123bSingle). |
| Nightly test list: e2e_test/benchmarks/test_nightly_perf.py | Appended mistralai/Devstral-2-123B-Instruct-2512 to _NIGHTLY_MODELS with class fragment Devstral123b for http and grpc, worker_count=1 (single-worker only). |
| Model specifications: e2e_test/infra/model_specs.py | Added MODEL_SPECS["mistralai/Devstral-2-123B-Instruct-2512"] with resolved model path, tp: 8, features: ["chat", "streaming", "function_calling"], worker_args: ["--tool-call-parser", "mistral"], vllm_args: ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"], and startup_timeout: 1200. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • CatherineSue
  • key4ng
  • slin1237
  • XinyueZhang369

Poem

🐰 I stitched a model into the night,
A Devstral ready, bold and bright.
Single-worker hums on H200 steam,
Tool-calls ready, chasing dreams.
Benchmarks hop — a coder's delight! 🥕✨

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'feat(ci): add Devstral 2 nightly benchmark' clearly and concisely summarizes the main change, which aligns with the primary objective and the code modifications across the three files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files; docstring coverage check skipped. |


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds the 'mistralai/Devstral-2-123B-Instruct-2512' model to the nightly performance benchmarks and updates the infrastructure model specifications. The reviewer noted an inconsistency between the SGLang and vLLM configurations regarding tool choice flags, suggesting that these should be aligned to ensure a fair benchmark comparison.

Comment on lines +85 to +86:

```python
"worker_args": ["--tool-call-parser", "mistral"],
"vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
```

Severity: medium

There's an inconsistency between the configured arguments for SGLang (worker_args) and vLLM (vllm_args): vllm_args enables auto tool choice, while worker_args does not. For a fair benchmark comparison, both runtimes should be configured with the same features. Either add the corresponding flag to worker_args if SGLang supports it, or remove --enable-auto-tool-choice from vllm_args if it does not.
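The reviewer's parity concern can be expressed mechanically. This is a hypothetical check, not repo code: `parity_gaps` and the `FEATURE_FLAGS` set are illustrative, and only the two arg lists are copied from the PR's model spec.

```python
# Hypothetical parity check for the reviewer's point: flags that gate
# features (like auto tool choice) should appear in both runtimes' arg
# lists or in neither. The arg lists are copied from the PR's model spec.
sglang_args = ["--tool-call-parser", "mistral"]
vllm_args = ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"]

# Flags we treat as feature-gating for benchmark parity (assumed list).
FEATURE_FLAGS = {"--enable-auto-tool-choice"}

def parity_gaps(a, b):
    """Return feature flags present in exactly one of the two arg lists."""
    return (set(a) ^ set(b)) & FEATURE_FLAGS

print(parity_gaps(sglang_args, vllm_args))  # {'--enable-auto-tool-choice'}
```

A non-empty result flags exactly the mismatch discussed below: one runtime would benchmark with auto tool choice enabled and the other without it.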

Contributor commented:

Let's remove --enable-auto-tool-choice.

Contributor (Author) commented:
@slin1237 @CatherineSue Should I just remove --enable-auto-tool-choice from vllm_args?

```yaml
- { id: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, slug: meta-llama-Llama-4-Maverick-17B-128E-Instruct-FP8, test_class: TestNightlyLlama4MaverickSingle }
- id: mistralai/Devstral-2-123B-Instruct-2512
  slug: mistralai-Devstral-2-123B-Instruct-2512
  test_class: TestNightlyDevstral123bSingle
```

Contributor commented:

nit: let's collapse to one line for parity
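The one-line form the nit asks for would mirror the Llama entry above. This is a sketch of the collapsed entry; YAML flow mappings parse to the same structure as the block form, so the change is purely cosmetic.

```yaml
- { id: mistralai/Devstral-2-123B-Instruct-2512, slug: mistralai-Devstral-2-123B-Instruct-2512, test_class: TestNightlyDevstral123bSingle }
```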


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8fccc6dfe1


"model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
"tp": 8,
"features": ["chat", "streaming", "function_calling"],
"worker_args": ["--tool-call-parser", "mistral"],
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 | Use the recognized SGLang spec key for the Devstral args

worker_args is not consumed by the worker launcher, so this new Devstral setting is currently a no-op. In e2e_test/infra/worker.py, SGLang startup only reads spec.get("sglang_args", []) (and vLLM reads vllm_args), which means the --tool-call-parser mistral flag is never applied for the sglang nightly job and those benchmark runs execute with unintended runtime settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 83-91: The model spec for
"mistralai/Devstral-2-123B-Instruct-2512" mistakenly uses the unused key
worker_args instead of sglang_args, so the tool-call-parser config never reaches
the SGLang engine; change the key name from "worker_args" to "sglang_args" in
that dict so the value ["--tool-call-parser", "mistral"] is consumed (the worker
reads tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args,
startup_timeout in e2e_test/infra/worker.py), ensuring function_calling
tool-parser is applied.


📥 Commits

Reviewing files that changed from the base of the PR and between 3732dc7 and 8fccc6d.

📒 Files selected for processing (1)
  • e2e_test/infra/model_specs.py

Comment on lines +83 to +91:

```python
# Devstral 2 - nightly benchmarks
"mistralai/Devstral-2-123B-Instruct-2512": {
    "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
    "tp": 8,
    "features": ["chat", "streaming", "function_calling"],
    "worker_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
    "startup_timeout": 1200,
},
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify which spec keys are accessed in worker.py

echo "=== Keys accessed from spec dict in worker.py ==="
rg -n 'spec\.get\(' e2e_test/infra/worker.py

echo ""
echo "=== Any reference to worker_args? ==="
rg -n 'worker_args' e2e_test/
```

Repository: lightseekorg/smg


worker_args is never read; must be sglang_args for tool-call-parser to be configured.

The spec keys accessed in e2e_test/infra/worker.py are: tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args, and startup_timeout. The worker_args key on line 88 is never consumed and has no effect. Since this model spec declares "function_calling" in features and includes tool-call-parser configuration, rename worker_args to sglang_args so the setting is actually applied to the SGLang engine.
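The failure mode here is worth making concrete: `dict.get()` on a key the launcher never reads returns the default silently, so a typo'd spec key becomes a no-op with no error. The sketch below illustrates this and adds a hypothetical `validate_spec` guard; the key names mirror those listed in the review, but the guard itself is not part of the repo.

```python
# Sketch of the failure mode: dict.get() on a misspelled/unused key
# silently returns the default, so the flag never reaches the engine.
# KNOWN_KEYS mirrors the keys the review says worker.py reads;
# validate_spec is a hypothetical guard, not repo code.
KNOWN_KEYS = {
    "model", "tp", "features", "sglang_args", "vllm_args",
    "trtllm_extra_config", "trtllm_args", "startup_timeout",
}

spec = {
    "tp": 8,
    "worker_args": ["--tool-call-parser", "mistral"],  # typo'd key: ignored
}

# This is what the launcher effectively does -- no error, just an empty list:
assert spec.get("sglang_args", []) == []

def validate_spec(spec):
    """Raise on keys the launcher will never read."""
    unknown = set(spec) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unrecognized spec keys (silently ignored): {sorted(unknown)}")

try:
    validate_spec(spec)
except ValueError as e:
    print(e)  # unrecognized spec keys (silently ignored): ['worker_args']
```

A guard like this, run once at spec registration time, would have turned the silent no-op into a hard failure in CI.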

🔧 Proposed fix

```diff
     # Devstral 2 - nightly benchmarks
     "mistralai/Devstral-2-123B-Instruct-2512": {
         "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
         "tp": 8,
         "features": ["chat", "streaming", "function_calling"],
-        "worker_args": ["--tool-call-parser", "mistral"],
+        "sglang_args": ["--tool-call-parser", "mistral"],
         "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
         "startup_timeout": 1200,
     },
```
