feat(ci): add Devstral 2 nightly benchmark #1022

nishanthp wants to merge 3 commits into lightseekorg:main
Conversation
Signed-off-by: Nishanth Prakash <nishanth.prakash@gmail.com>
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request adds the 'mistralai/Devstral-2-123B-Instruct-2512' model to the nightly performance benchmarks and updates the infrastructure model specifications. The reviewer noted an inconsistency between the SGLang and vLLM configurations regarding tool choice flags, suggesting that these should be aligned to ensure a fair benchmark comparison.
| "worker_args": ["--tool-call-parser", "mistral"], | ||
| "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"], |
There's an inconsistency in the configured arguments for SGLang (worker_args) and vLLM (vllm_args). The vllm_args enable auto tool choice, while worker_args do not. For a fair benchmark comparison, both runtimes should be configured with the same features. Please either add the corresponding flag to worker_args if SGLang supports it, or remove --enable-auto-tool-choice from vllm_args if it does not.
let's remove `--enable-auto-tool-choice`
@slin1237 @CatherineSue Should I just remove --enable-auto-tool-choice from vllm_args?
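For illustration only, the alignment discussed above could look like the following sketch. This is not the actual repo code; it just shows the two arg lists after dropping the extra vLLM flag, keeping the `worker_args`/`vllm_args` key names from the diff as-is:

```python
# Hypothetical sketch: align the runtime args so SGLang and vLLM benchmark
# the same feature set, per the review thread. Key names follow the diff
# under discussion; "--enable-auto-tool-choice" is dropped from vllm_args.
spec = {
    "worker_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral"],
}

# Parity check: both engines now receive the same flags.
assert set(spec["worker_args"]) == set(spec["vllm_args"])
print("runtime args aligned")
```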
```yaml
- { id: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8, slug: meta-llama-Llama-4-Maverick-17B-128E-Instruct-FP8, test_class: TestNightlyLlama4MaverickSingle }
- id: mistralai/Devstral-2-123B-Instruct-2512
  slug: mistralai-Devstral-2-123B-Instruct-2512
  test_class: TestNightlyDevstral123bSingle
```
nit: let's collapse to one line for parity
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8fccc6dfe1
| "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"), | ||
| "tp": 8, | ||
| "features": ["chat", "streaming", "function_calling"], | ||
| "worker_args": ["--tool-call-parser", "mistral"], |
Use recognized SGLang spec key for Devstral args
worker_args is not consumed by the worker launcher, so this new Devstral setting is currently a no-op. In e2e_test/infra/worker.py, SGLang startup only reads spec.get("sglang_args", []) (and vLLM reads vllm_args), which means the --tool-call-parser mistral flag is never applied for the sglang nightly job and those benchmark runs execute with unintended runtime settings.
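To make the no-op concrete, here is a minimal sketch of how a launcher like the one described might consume spec keys. This is illustrative code, not the actual `e2e_test/infra/worker.py`; the only assumption taken from the review is that SGLang startup reads `spec.get("sglang_args", [])`:

```python
# Illustrative sketch of a spec-driven launcher. An unrecognized key such
# as "worker_args" is silently ignored, which is the no-op the review flags.

def build_sglang_command(spec: dict) -> list[str]:
    """Build a hypothetical SGLang launch command from a model spec dict."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", spec["model"],
        "--tp", str(spec.get("tp", 1)),
    ]
    # Only "sglang_args" is read here; "worker_args" never reaches the engine.
    cmd.extend(spec.get("sglang_args", []))
    return cmd

spec = {
    "model": "mistralai/Devstral-2-123B-Instruct-2512",
    "tp": 8,
    "worker_args": ["--tool-call-parser", "mistral"],  # never consumed
}
print(build_sglang_command(spec))  # note: the mistral parser flag is absent
```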
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e_test/infra/model_specs.py`:
- Around line 83-91: The model spec for
"mistralai/Devstral-2-123B-Instruct-2512" mistakenly uses the unused key
worker_args instead of sglang_args, so the tool-call-parser config never reaches
the SGLang engine; change the key name from "worker_args" to "sglang_args" in
that dict so the value ["--tool-call-parser", "mistral"] is consumed (the worker
reads tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args,
startup_timeout in e2e_test/infra/worker.py), ensuring function_calling
tool-parser is applied.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7f45b55c-ebe7-42ca-a245-490ee3262de5
📒 Files selected for processing (1)
e2e_test/infra/model_specs.py
```python
# Devstral 2 - nightly benchmarks
"mistralai/Devstral-2-123B-Instruct-2512": {
    "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
    "tp": 8,
    "features": ["chat", "streaming", "function_calling"],
    "worker_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
    "startup_timeout": 1200,
},
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify which spec keys are accessed in worker.py
echo "=== Keys accessed from spec dict in worker.py ==="
rg -n 'spec\.get\(' e2e_test/infra/worker.py
echo ""
echo "=== Any reference to worker_args? ==="
rg -n 'worker_args' e2e_test/
```

Repository: lightseekorg/smg
Length of output: 624
worker_args is never read; it must be renamed to sglang_args for the tool-call parser to be configured.
The spec keys accessed in e2e_test/infra/worker.py are: tp, features, sglang_args, vllm_args, trtllm_extra_config, trtllm_args, and startup_timeout. The worker_args key on line 88 is never consumed and has no effect. Since this model spec declares "function_calling" in features and includes tool-call-parser configuration, rename worker_args to sglang_args so the setting is actually applied to the SGLang engine.
🔧 Proposed fix

```diff
 # Devstral 2 - nightly benchmarks
 "mistralai/Devstral-2-123B-Instruct-2512": {
     "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
     "tp": 8,
     "features": ["chat", "streaming", "function_calling"],
-    "worker_args": ["--tool-call-parser", "mistral"],
+    "sglang_args": ["--tool-call-parser", "mistral"],
     "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
     "startup_timeout": 1200,
 },
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Devstral 2 - nightly benchmarks
"mistralai/Devstral-2-123B-Instruct-2512": {
    "model": _resolve_model_path("mistralai/Devstral-2-123B-Instruct-2512"),
    "tp": 8,
    "features": ["chat", "streaming", "function_calling"],
    "sglang_args": ["--tool-call-parser", "mistral"],
    "vllm_args": ["--tool-call-parser", "mistral", "--enable-auto-tool-choice"],
    "startup_timeout": 1200,
},
```
Description

Problem

The nightly benchmark workflow does not currently include coverage for `mistralai/Devstral-2-123B-Instruct-2512`, so regressions for that model are not captured in nightly benchmarking.

Solution

Add `mistralai/Devstral-2-123B-Instruct-2512` to the H200 single-worker nightly benchmark matrix and register its benchmark/test configuration in the e2e model specs.

Changes

- Added `mistralai/Devstral-2-123B-Instruct-2512` to the nightly benchmark workflow.
- Registered its model spec with `tp: 8`, tool-calling runtime args, and an extended startup timeout.

Test Plan

- `python3 -m py_compile e2e_test/benchmarks/test_nightly_perf.py e2e_test/infra/model_specs.py`
- `ruby -e 'require "yaml"; YAML.load_file(".github/workflows/nightly-benchmark.yml"); puts "yaml ok"'`
- Verified `TestNightlyDevstral123bSingle` is registered in `MODEL_SPECS` with `tp=8`.
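As a follow-up idea beyond this PR's test plan, a small guard test could have caught the misspelled spec key the reviewers found. The sketch below is hypothetical: `KNOWN_KEYS` mirrors the keys the review says `worker.py` reads (plus `model`), and the `MODEL_SPECS` entry is a reduced stand-in for the real one:

```python
# Hypothetical guard test: reject unknown keys in MODEL_SPECS entries so a
# typo like "worker_args" fails CI instead of silently becoming a no-op.

KNOWN_KEYS = {
    "model", "tp", "features", "sglang_args", "vllm_args",
    "trtllm_extra_config", "trtllm_args", "startup_timeout",
}

# Reduced stand-in for the real spec dict, using the corrected key name.
MODEL_SPECS = {
    "mistralai/Devstral-2-123B-Instruct-2512": {
        "model": "mistralai/Devstral-2-123B-Instruct-2512",
        "tp": 8,
        "features": ["chat", "streaming", "function_calling"],
        "sglang_args": ["--tool-call-parser", "mistral"],
        "vllm_args": ["--tool-call-parser", "mistral"],
        "startup_timeout": 1200,
    },
}

def test_no_unknown_spec_keys():
    for name, spec in MODEL_SPECS.items():
        unknown = set(spec) - KNOWN_KEYS
        assert not unknown, f"{name} has unrecognized keys: {unknown}"

test_no_unknown_spec_keys()
print("spec keys ok")
```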