AIDynamo: enable multiple AIPerf runs during a single test run#907

Merged
podkidyshev merged 17 commits into main from ipod/dynamo-aiperf-runs
Jun 1, 2026

@podkidyshev podkidyshev commented May 29, 2026

Summary

  • Enable multiple AIPerf runs during a single test run
  • Enable support for DCGM

This PR extends the AIDynamo Slurm workload with first-class AIPerf multi-run support. A test can now run several named AIPerf phases against the same live Dynamo deployment, without restarting frontend/router, prefill, or decode workers between phases.

It also adds optional DCGM exporter support for AIPerf server metrics, generates an AIPerf script per test run, and updates the vLLM/SGLang example configs to exercise two AIPerf rounds.
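The control flow of a multi-phase run can be sketched roughly as follows. This is an illustrative Python model only, not the generated shell script: `run_phases`, `run_phase`, and `is_healthy` are hypothetical names, and the semantics assumed for `health-check-between-phases` and `continue-on-phase-failure` are inferred from the option names alone.

```python
from typing import Callable


def run_phases(
    phases: list[str],
    run_phase: Callable[[str], int],
    is_healthy: Callable[[], bool],
    health_check_between_phases: bool = True,
    continue_on_phase_failure: bool = False,
) -> dict[str, int]:
    """Run phases in order against one live deployment; return exit codes by phase."""
    results: dict[str, int] = {}
    for index, name in enumerate(phases):
        # Assumption: the health probe runs only between phases, not before the first.
        if index > 0 and health_check_between_phases and not is_healthy():
            raise RuntimeError(f"deployment unhealthy before phase {name!r}")
        results[name] = run_phase(name)
        # Assumption: a non-zero exit code stops the run unless continuation is enabled.
        if results[name] != 0 and not continue_on_phase_failure:
            break
    return results
```

The deployment is never restarted inside the loop, which is the point of the feature: every phase sees the same frontend, prefill, and decode workers.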

What’s New

  • Adds cmd_args.aiperf_phases for named AIPerf runs.

  • Keeps cmd_args.aiperf as the base/default AIPerf config.

  • Each phase inherits from cmd_args.aiperf and overrides only the needed fields.

  • Generates /cloudai_run_results/aiperf.sh per test run instead of relying on a static runtime script.

  • Preserves legacy single-run artifact layout:

    • aiperf_artifacts/
    • aiperf.log
    • aiperf_report.csv
  • For multiple phases:

    • phase artifacts go under phase-specific directories
    • phase logs/reports are separated
    • final root aiperf_report.csv is copied from the last phase
  • Adds server-metrics = "auto" support for generated AIPerf commands.

  • Adds optional DCGM exporter launch through Slurm/Pyxis/Enroot using CloudAI DockerImage installables.

  • Adds early failure if DCGM is enabled but metrics endpoints are unreachable.

  • Updates AIDynamo acceptance coverage to validate generated aiperf.sh.
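The artifact layout rules above can be illustrated with a small sketch. The legacy single-run names come from this description. The per-phase directory scheme (`<root>/<phase>/...`) and the helper names `phase_paths` and `publish_final_report` are assumptions made for illustration; the actual generated script may name phase directories differently.

```python
from pathlib import Path
import shutil


def phase_paths(root: Path, phase: str, single_phase: bool) -> dict[str, Path]:
    """Return artifact, log, and report paths for one AIPerf phase."""
    if single_phase:
        # Legacy layout from the PR description: everything sits at the root.
        base = root
    else:
        # Hypothetical multi-phase layout: one subdirectory per phase name.
        base = root / phase
    return {
        "artifacts": base / "aiperf_artifacts",
        "log": base / "aiperf.log",
        "report": base / "aiperf_report.csv",
    }


def publish_final_report(root: Path, phases: list[str]) -> Path:
    """Copy the last phase's report to the root aiperf_report.csv."""
    last = phase_paths(root, phases[-1], single_phase=len(phases) == 1)
    final = root / "aiperf_report.csv"
    if last["report"] != final:
        shutil.copyfile(last["report"], final)
    return final
```

Copying the last phase's report to the root keeps downstream consumers that only read `aiperf_report.csv` working without changes.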

What’s No Longer Supported / Changed

  • The checked-in src/cloudai/workloads/ai_dynamo/aiperf.sh is now only a placeholder. Production Slurm runs should use the generated /cloudai_run_results/aiperf.sh.
  • AIPerf CLI assembly is no longer done by ai_dynamo.sh; it is generated by the Slurm command generator.
  • AIPerf args values are scalar-only. For complex raw CLI syntax, use extra-args as a string.
  • DCGM is not launched with raw docker run; it uses the configured Slurm container runtime path via CloudAI installables.
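A minimal sketch of the scalar-only rule, assuming a hypothetical `render_aiperf_args` helper rather than the real generator code. How the actual generator renders booleans is not stated here; the sketch assumes `true` becomes a bare flag and `false` omits it.

```python
import shlex

SCALAR_TYPES = (str, int, float, bool)


def render_aiperf_args(args: dict, extra_args: str = "") -> list[str]:
    """Render a flat args mapping into CLI tokens, rejecting non-scalar values."""
    tokens: list[str] = []
    for key, value in args.items():
        if not isinstance(value, SCALAR_TYPES):
            raise ValueError(f"{key} must be a scalar, got {type(value).__name__}")
        if isinstance(value, bool):
            # Assumption: True emits a bare flag, False omits the flag entirely.
            if value:
                tokens.append(f"--{key}")
        else:
            tokens.extend([f"--{key}", str(value)])
    # Complex raw CLI syntax travels as one string and is split shell-style.
    tokens.extend(shlex.split(extra_args))
    return tokens
```

Lists and tables are rejected up front, so anything that does not map to a single `--key value` pair has to go through the `extra-args` string.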

Config Example

dse_excluded_args = ["cmd_args.aiperf_phases"]

[cmd_args]
workloads = "aiperf.sh"

  [cmd_args.aiperf]
  health-check-between-phases = true
  continue-on-phase-failure = false

    [cmd_args.aiperf.args]
    concurrency = 2
    request-count = 50
    endpoint-type = "chat"
    streaming = true
    server-metrics = "auto"

  [[cmd_args.aiperf_phases]]
  name = "round_1"

    [cmd_args.aiperf_phases.args]
    concurrency = 2

  [[cmd_args.aiperf_phases]]
  name = "round_2"

    [cmd_args.aiperf_phases.args]
    concurrency = 4
    streaming = false
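How the two phases above resolve can be sketched in Python. The dictionaries mirror the config example. The shallow "phase overrides base" merge and the `resolve_phases` name are assumptions for illustration, not the actual model code; the uniqueness check reflects the phase-name validator mentioned in the review walkthrough.

```python
def resolve_phases(base_args: dict, phases: list[dict]) -> dict[str, dict]:
    """Overlay each phase's args on the base args; phase values win."""
    names = [phase["name"] for phase in phases]
    if len(names) != len(set(names)):
        raise ValueError("aiperf_phases names must be unique")
    return {
        phase["name"]: {**base_args, **phase.get("args", {})}
        for phase in phases
    }


# Values copied from [cmd_args.aiperf.args] in the example above.
base = {
    "concurrency": 2,
    "request-count": 50,
    "endpoint-type": "chat",
    "streaming": True,
    "server-metrics": "auto",
}
# Values copied from the two [[cmd_args.aiperf_phases]] entries.
phases = [
    {"name": "round_1", "args": {"concurrency": 2}},
    {"name": "round_2", "args": {"concurrency": 4, "streaming": False}},
]
resolved = resolve_phases(base, phases)
```

Under these assumptions, `round_1` ends up identical to the base config, while `round_2` keeps `request-count = 50`, `endpoint-type = "chat"`, and `server-metrics = "auto"` but runs with `concurrency = 4` and `streaming = false`.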

DCGM Example

[Tests.cmd_args.dynamo.dcgm_exporter]
enabled = true
docker-image-url = "nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-distroless"
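The early-failure behaviour for unreachable metrics endpoints might look like the sketch below. It is a Python illustration, not the Slurm-side shell logic from this PR: `require_metrics_endpoints` and `http_probe` are hypothetical names, and port 9400 is assumed as the exporter's listening port.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def require_metrics_endpoints(
    nodes: Iterable[str],
    port: int = 9400,  # assumed dcgm-exporter port; adjust to the deployment
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 5,
    delay_s: float = 1.0,
) -> list[str]:
    """Return all /metrics URLs once reachable, or raise if any never responds."""
    urls = [f"http://{node}:{port}/metrics" for node in nodes]
    pending = set(urls)
    for attempt in range(attempts):
        # Re-probe only the endpoints that have not answered yet.
        pending = {url for url in pending if not probe(url)}
        if not pending:
            return urls
        if attempt < attempts - 1:
            time.sleep(delay_s)
    raise RuntimeError(f"DCGM metrics endpoints unreachable: {sorted(pending)}")
```

Failing before the first AIPerf phase starts avoids a full benchmark run that silently lacks server metrics.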

Example Configs Updated

  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test/sglang.toml
  • conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
  • conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this May 29, 2026
@coderabbitai

coderabbitai Bot commented May 29, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5f3254ec-ad76-417d-87be-0e6e4de2723b

📥 Commits

Reviewing files that changed from the base of the PR and between 2fc6469 and 2113080.

📒 Files selected for processing (2)
  • src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
  • tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

📝 Walkthrough

Adds multi-phase AIPerf support and optional DCGM exporter orchestration: new models and validators, TOML test updates, dynamic generation of a phase-aware aiperf.sh, Slurm-side DCGM launcher/cleanup, environment wiring, and tests that validate generated scripts and runtime flags.

Changes

Phased Execution and DCGM Exporter

| Layer / File(s) | Summary |
|---|---|
| Core data models and exports: src/cloudai/workloads/ai_dynamo/ai_dynamo.py, src/cloudai/workloads/ai_dynamo/__init__.py | Introduces DCGMExporter, adds dcgm_exporter to AIDynamoArgs, extends AIPerf with artifact/health/continuation fields and scalar extra_args validation, adds AIPerfPhase, aiperf_phases on command args, uniqueness validator, and re-exports new types. |
| Test configuration with phased execution and exporter: conf/experimental/ai_dynamo/test/sglang.toml, conf/experimental/ai_dynamo/test/vllm.toml, conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml, conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml | Updates TOML files to exclude cmd_args.aiperf_phases from DSE, define aiperf_phases arrays (two rounds) with per-round concurrency/request-count and add DCGM exporter config where applicable. |
| Shell helpers, placeholder and reference scripts: src/cloudai/workloads/ai_dynamo/ai_dynamo.sh, src/cloudai/workloads/ai_dynamo/aiperf.sh, tests/ref_data/ai-dynamo-aiperf.sh | Adds _resolve_aiperf_server_metrics_urls() and env exports in ai_dynamo.sh, replaces the shipped aiperf.sh with a failing placeholder, and adds a reference ai-dynamo-aiperf.sh that runs two sequential AIPerf rounds with inter-phase probes and report consolidation. |
| Slurm command generation with DCGM orchestration and script rendering: src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py | Generates an executable aiperf.sh when requested, renders scalar CLI args and per-phase artifact/report paths, builds DCGM srun prefix and launcher/cleanup blocks, injects health checks and failure handling, and updates CLI wiring to mount/use generated script. |
| Sbatch DCGM lifecycle and phase script wiring: tests/ref_data/ai-dynamo.sbatch | Starts DCGM exporter via background srun, waits for Slurm step and per-node /metrics readiness, provides stop_dcgm_exporter() cleanup, runs ai_dynamo with --workloads aiperf.sh and DCGM flags, and ensures teardown after completion. |
| Acceptance test configuration and script verification: tests/test_acceptance.py | Updates test payload to enable dcgm_exporter, use workloads="aiperf.sh", configures AIPerf with aiperf_phases, and asserts generated aiperf.sh matches reference. |
| Comprehensive command generation and phase validation tests: tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py | Expands unit tests to validate generated aiperf.sh content, per-phase resolution, server-metrics handling, extra-args scalar enforcement, DCGM launcher/cleanup behavior, installable image handling, and phase name uniqueness. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia
  • amaslenn

"🐰 In rounds they run, two phases clear and bright,
Metrics gathered, exporters hum through night,
Scripts assembled, health probes guard the gate,
Reports converge, each phase records its state,
A hopping rabbit cheers this phased profiling flight."

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: enabling multiple AIPerf runs during a single test run, which aligns with the primary objective of the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering new features, examples, and what's changed, all of which match the implemented modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@podkidyshev podkidyshev added the enhancement New feature or request label May 29, 2026
@podkidyshev podkidyshev force-pushed the ipod/dynamo-aiperf-runs branch from d563eb9 to d112a2a Compare May 29, 2026 20:58
Base automatically changed from ipod/dynamo-multiple-perf-2 to main June 1, 2026 09:26
@podkidyshev podkidyshev marked this pull request as ready for review June 1, 2026 17:30

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/ref_data/ai-dynamo.sbatch (1)

64-128: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Don't let DCGM cleanup mask the workload failure.

The main srun status is discarded here. If ai_dynamo.sh fails, Line 128 can still make the wrapper exit 0, which would falsely report a green job.

🛠️ Suggested fix
+trap 'stop_dcgm_exporter' EXIT
+
 srun \
   --export=ALL \
   --mpi=pmix \
   -N2 \
@@
   --genai_perf-warmup-request-count "10" \
   --aiperf-name "aiperf" \
   --aiperf-script /cloudai_run_results/aiperf.sh
-
-# Stop DCGM exporter when test finishes
-stop_dcgm_exporter
+workload_status=$?
+exit "${workload_status}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ref_data/ai-dynamo.sbatch` around lines 64 - 128, The srun wrapper
currently discards ai_dynamo.sh's exit status because stop_dcgm_exporter runs
unconditionally and the script then exits 0; capture and propagate srun's exit
code: immediately after the srun invocation that launches
/cloudai_install/ai_dynamo.sh, save its exit status (e.g., SRUN_EXIT=$?), then
call stop_dcgm_exporter, and finally exit with that saved status (exit
$SRUN_EXIT) so failures from ai_dynamo.sh are not masked; references: the srun
command launching /cloudai_install/ai_dynamo.sh and the stop_dcgm_exporter
invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 204-207: _render_aiperf_script currently only uses setup_cmd from
phases[0], dropping per-phase setup_cmds; update it to call
_resolve_aiperf_phase for each phase (use the phases list and
_resolve_aiperf_phase(phase) to get per-phase.setup_cmd and the phase-specific
commands) and emit/insert each resolved setup_cmd immediately before that
phase's commands (handle the single_phase branch consistently). Alternatively,
if you prefer to restrict to a single setup, add explicit validation in
_render_aiperf_script that raises or logs an error when more than one phase has
a non-empty setup_cmd, referencing AIPerfPhase and _resolve_aiperf_phase so
reviewers can find the changed logic.

In `@tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py`:
- Around line 295-300: The two tests calling AIPerf.model_validate and
AIPerfPhase.model_validate should constrain the raised ValueError by adding a
match argument to pytest.raises so they assert the error is about "extra-args"
being non-scalar; update both with pytest.raises(ValueError, match="extra-args")
(or adjust the match string to the exact validator message emitted, e.g.,
containing "extra-args must be" or similar) to ensure the failure is the
intended validation error.

---

Outside diff comments:
In `@tests/ref_data/ai-dynamo.sbatch`:
- Around line 64-128: The srun wrapper currently discards ai_dynamo.sh's exit
status because stop_dcgm_exporter runs unconditionally and the script then exits
0; capture and propagate srun's exit code: immediately after the srun invocation
that launches /cloudai_install/ai_dynamo.sh, save its exit status (e.g.,
SRUN_EXIT=$?), then call stop_dcgm_exporter, and finally exit with that saved
status (exit $SRUN_EXIT) so failures from ai_dynamo.sh are not masked;
references: the srun command launching /cloudai_install/ai_dynamo.sh and the
stop_dcgm_exporter invocation.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5e2349dd-82e3-44ad-b1e5-5be1bb236390

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2667b and 2fc6469.

📒 Files selected for processing (13)
  • conf/experimental/ai_dynamo/test/sglang.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml
  • conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
  • src/cloudai/workloads/ai_dynamo/__init__.py
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.py
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
  • src/cloudai/workloads/ai_dynamo/aiperf.sh
  • src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
  • tests/ref_data/ai-dynamo-aiperf.sh
  • tests/ref_data/ai-dynamo.sbatch
  • tests/test_acceptance.py
  • tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

Comment thread src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py Outdated
Comment thread tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py
@podkidyshev podkidyshev merged commit 1c6e284 into main Jun 1, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/dynamo-aiperf-runs branch June 1, 2026 20:53