AIDynamo: enable multiple AIPerf runs during a single test run#907
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughAdds multi-phase AIPerf support and optional DCGM exporter orchestration: new models and validators, TOML test updates, dynamic generation of a phase-aware aiperf.sh, Slurm-side DCGM launcher/cleanup, environment wiring, and tests that validate generated scripts and runtime flags. ChangesPhased Execution and DCGM Exporter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
d563eb9 to
d112a2a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/ref_data/ai-dynamo.sbatch (1)
64-128:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winDon't let DCGM cleanup mask the workload failure.
The main
srunstatus is discarded here. Ifai_dynamo.shfails, Line 128 can still make the wrapper exit0, which would falsely report a green job.🛠️ Suggested fix
+trap 'stop_dcgm_exporter' EXIT + srun \ --export=ALL \ --mpi=pmix \ -N2 \ @@ --genai_perf-warmup-request-count "10" \ --aiperf-name "aiperf" \ --aiperf-script /cloudai_run_results/aiperf.sh - -# Stop DCGM exporter when test finishes -stop_dcgm_exporter +workload_status=$? +exit "${workload_status}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/ref_data/ai-dynamo.sbatch` around lines 64 - 128, The srun wrapper currently discards ai_dynamo.sh's exit status because stop_dcgm_exporter runs unconditionally and the script then exits 0; capture and propagate srun's exit code: immediately after the srun invocation that launches /cloudai_install/ai_dynamo.sh, save its exit status (e.g., SRUN_EXIT=$?), then call stop_dcgm_exporter, and finally exit with that saved status (exit $SRUN_EXIT) so failures from ai_dynamo.sh are not masked; references: the srun command launching /cloudai_install/ai_dynamo.sh and the stop_dcgm_exporter invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 204-207: _render_aiperf_script currently only uses setup_cmd from
phases[0], dropping per-phase setup_cmds; update it to call
_resolve_aiperf_phase for each phase (use the phases list and
_resolve_aiperf_phase(phase) to get per-phase.setup_cmd and the phase-specific
commands) and emit/insert each resolved setup_cmd immediately before that
phase's commands (handle the single_phase branch consistently). Alternatively,
if you prefer to restrict to a single setup, add explicit validation in
_render_aiperf_script that raises or logs an error when more than one phase has
a non-empty setup_cmd, referencing AIPerfPhase and _resolve_aiperf_phase so
reviewers can find the changed logic.
In `@tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py`:
- Around line 295-300: The two tests calling AIPerf.model_validate and
AIPerfPhase.model_validate should constrain the raised ValueError by adding a
match argument to pytest.raises so they assert the error is about "extra-args"
being non-scalar; update both with pytest.raises(ValueError, match="extra-args")
(or adjust the match string to the exact validator message emitted, e.g.,
containing "extra-args must be" or similar) to ensure the failure is the
intended validation error.
---
Outside diff comments:
In `@tests/ref_data/ai-dynamo.sbatch`:
- Around line 64-128: The srun wrapper currently discards ai_dynamo.sh's exit
status because stop_dcgm_exporter runs unconditionally and the script then exits
0; capture and propagate srun's exit code: immediately after the srun invocation
that launches /cloudai_install/ai_dynamo.sh, save its exit status (e.g.,
SRUN_EXIT=$?), then call stop_dcgm_exporter, and finally exit with that saved
status (exit $SRUN_EXIT) so failures from ai_dynamo.sh are not masked;
references: the srun command launching /cloudai_install/ai_dynamo.sh and the
stop_dcgm_exporter invocation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 5e2349dd-82e3-44ad-b1e5-5be1bb236390
📒 Files selected for processing (13)
conf/experimental/ai_dynamo/test/sglang.tomlconf/experimental/ai_dynamo/test/vllm.tomlconf/experimental/ai_dynamo/test_scenario/vllm_lmcache.tomlconf/experimental/ai_dynamo/test_scenario/vllm_slurm.tomlsrc/cloudai/workloads/ai_dynamo/__init__.pysrc/cloudai/workloads/ai_dynamo/ai_dynamo.pysrc/cloudai/workloads/ai_dynamo/ai_dynamo.shsrc/cloudai/workloads/ai_dynamo/aiperf.shsrc/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.pytests/ref_data/ai-dynamo-aiperf.shtests/ref_data/ai-dynamo.sbatchtests/test_acceptance.pytests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py
Summary
This PR extends the AIDynamo Slurm workload with first-class AIPerf multi-run support. A test can now run several named AIPerf phases against the same live Dynamo deployment, without restarting frontend/router, prefill, or decode workers between phases.
It also adds optional DCGM exporter support for AIPerf server metrics, generated AIPerf scripts per test run, and updates the vLLM/SGLang example configs to exercise two AIPerf rounds.
What’s New
Adds
cmd_args.aiperf_phasesfor named AIPerf runs.Keeps
cmd_args.aiperfas the base/default AIPerf config.Each phase inherits from
cmd_args.aiperfand overrides only the needed fields.Generates
/cloudai_run_results/aiperf.shper test run instead of relying on a static runtime script.Preserves legacy single-run artifact layout:
aiperf_artifacts/aiperf.logaiperf_report.csvFor multiple phases:
logs/reportsare separatedaiperf_report.csvis copied from the last phaseAdds
server-metrics = "auto"support for generated AIPerf commands.Adds optional DCGM exporter launch through Slurm/Pyxis/Enroot using CloudAI DockerImage installables.
Adds early failure if DCGM is enabled but metrics endpoints are unreachable.
Updates AIDynamo acceptance coverage to validate generated
aiperf.sh.What’s No Longer Supported / Changed
src/cloudai/workloads/ai_dynamo/aiperf.shis now only a placeholder. Production Slurm runs should use the generated/cloudai_run_results/aiperf.sh.Config Example
DCGM Example
Example Configs Updated
Test Plan
Additional Notes