AIDynamo: enable multiple AIPerf runs during a single test run by podkidyshev · Pull Request #907 · NVIDIA/cloudai

podkidyshev · 2026-05-29T19:04:22Z

Summary

Enable multiple AIPerf runs during a single test run
Enable support for DCGM

This PR extends the AIDynamo Slurm workload with first-class AIPerf multi-run support. A test can now run several named AIPerf phases against the same live Dynamo deployment, without restarting frontend/router, prefill, or decode workers between phases.

It also adds optional DCGM exporter support for AIPerf server metrics, generates an AIPerf script per test run, and updates the vLLM/SGLang example configs to exercise two AIPerf rounds.

What’s New

Adds cmd_args.aiperf_phases for named AIPerf runs.
Keeps cmd_args.aiperf as the base/default AIPerf config.
Each phase inherits from cmd_args.aiperf and overrides only the needed fields.
Generates /cloudai_run_results/aiperf.sh per test run instead of relying on a static runtime script.
Preserves legacy single-run artifact layout:
- aiperf_artifacts/
- aiperf.log
- aiperf_report.csv
For multiple phases:
- phase artifacts go under phase-specific directories
- phase logs/reports are separated
- final root aiperf_report.csv is copied from the last phase
Adds server-metrics = "auto" support for generated AIPerf commands.
Adds optional DCGM exporter launch through Slurm/Pyxis/Enroot using CloudAI DockerImage installables.
Adds early failure if DCGM is enabled but metrics endpoints are unreachable.
Updates AIDynamo acceptance coverage to validate generated aiperf.sh.
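
The inheritance rule above (each phase starts from `cmd_args.aiperf` and overrides only the fields it names) can be illustrated with a minimal sketch. The helper `resolve_phase_args` and the plain-dict representation are assumptions made for illustration; the PR's own resolution logic lives in the Slurm command generator and may differ.

```python
from typing import Any, Dict, List


def resolve_phase_args(base_args: Dict[str, Any], phase_args: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict: the base AIPerf args with the phase's overrides on top.

    Hypothetical helper, not the PR's actual function.
    """
    resolved = dict(base_args)   # copy so the shared base config is never mutated
    resolved.update(phase_args)  # phase values win on key collisions
    return resolved


# Values taken from the config example in this PR description.
base = {
    "concurrency": 2,
    "request-count": 50,
    "endpoint-type": "chat",
    "streaming": True,
    "server-metrics": "auto",
}
phases: List[Dict[str, Any]] = [
    {"name": "round_1", "args": {"concurrency": 2}},
    {"name": "round_2", "args": {"concurrency": 4, "streaming": False}},
]

resolved = {p["name"]: resolve_phase_args(base, p["args"]) for p in phases}
for name, args in resolved.items():
    print(name, args)
```

Under this sketch, `round_2` keeps `request-count = 50` and `endpoint-type = "chat"` from the base config while changing only `concurrency` and `streaming`.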

What’s No Longer Supported / Changed

The checked-in src/cloudai/workloads/ai_dynamo/aiperf.sh is now only a placeholder. Production Slurm runs should use the generated /cloudai_run_results/aiperf.sh.
AIPerf CLI assembly is no longer done by ai_dynamo.sh; it is generated by the Slurm command generator.
AIPerf args values are scalar-only. For complex raw CLI syntax, use extra-args as a string.
DCGM is not launched with raw docker run; it uses the configured Slurm container runtime path via CloudAI installables.
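
To make the scalar-only rule concrete, here is a rough sketch of how a generator might turn an args table into CLI flags and reject non-scalar values. The function name `render_aiperf_args`, the `--key value` layout, and the lowercase rendering of booleans are all assumptions; the PR text does not show the generator's actual output format.

```python
import shlex
from typing import Any, Dict


def render_aiperf_args(args: Dict[str, Any], extra_args: str = "") -> str:
    """Render scalar args as `--key value` pairs (hypothetical format)."""
    parts = []
    for key, value in args.items():
        if isinstance(value, bool):
            # bool is checked first because bool is a subclass of int in Python.
            rendered = "true" if value else "false"  # assumed boolean spelling
        elif isinstance(value, (str, int, float)):
            rendered = str(value)
        else:
            # Lists, dicts, and other containers are rejected outright.
            raise ValueError(f"{key} must be a scalar, got {type(value).__name__}")
        parts.append(f"--{key} {shlex.quote(rendered)}")
    if extra_args:
        # Raw CLI syntax is passed through untouched as a single string.
        parts.append(extra_args)
    return " ".join(parts)


print(render_aiperf_args({"concurrency": 4, "endpoint-type": "chat"}))
print(render_aiperf_args({"concurrency": 1}, extra_args="--header 'X-Test: 1'"))
```

Keeping `extra-args` as an opaque string means the generator never has to guess how a list or nested table should map to CLI syntax.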

Config Example

dse_excluded_args = ["cmd_args.aiperf_phases"]

[cmd_args]
workloads = "aiperf.sh"

  [cmd_args.aiperf]
  health-check-between-phases = true
  continue-on-phase-failure = false

    [cmd_args.aiperf.args]
    concurrency = 2
    request-count = 50
    endpoint-type = "chat"
    streaming = true
    server-metrics = "auto"

  [[cmd_args.aiperf_phases]]
  name = "round_1"

    [cmd_args.aiperf_phases.args]
    concurrency = 2

  [[cmd_args.aiperf_phases]]
  name = "round_2"

    [cmd_args.aiperf_phases.args]
    concurrency = 4
    streaming = false

DCGM Example

[Tests.cmd_args.dynamo.dcgm_exporter]
enabled = true
docker-image-url = "nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-distroless"
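
The early-failure behavior for an unreachable exporter can be pictured as a bounded readiness poll. The sketch below is an assumption-heavy illustration: the function names, the timeouts, and the use of port 9400 (the DCGM exporter's usual default) are not taken from the PR, whose actual check runs in the generated sbatch script.

```python
import time
import urllib.request
from typing import Callable, Iterable


def probe_metrics(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        return False


def wait_for_metrics(
    urls: Iterable[str],
    deadline_s: float = 60.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = probe_metrics,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until every URL is reachable, or raise once the deadline passes."""
    pending = list(urls)
    waited = 0.0
    while True:
        pending = [url for url in pending if not probe(url)]
        if not pending:
            return
        if waited >= deadline_s:
            raise RuntimeError(f"DCGM metrics endpoints unreachable: {pending}")
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Hypothetical node names; 9400 is assumed as the exporter port.
    wait_for_metrics([f"http://{node}:9400/metrics" for node in ("node1", "node2")])
```

Failing before the first AIPerf phase starts avoids a full benchmark run that would silently lack server metrics.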

Example Configs Updated

conf/experimental/ai_dynamo/test/vllm.toml
conf/experimental/ai_dynamo/test/sglang.toml
conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml

Test Plan

Automated CI
Manual runs

Additional Notes

coderabbitai · 2026-05-29T19:04:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5f3254ec-ad76-417d-87be-0e6e4de2723b

📥 Commits

Reviewing files that changed from the base of the PR and between 2fc6469 and 2113080.

📒 Files selected for processing (2)

src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

📝 Walkthrough

Adds multi-phase AIPerf support and optional DCGM exporter orchestration: new models and validators, TOML test updates, dynamic generation of a phase-aware aiperf.sh, Slurm-side DCGM launcher/cleanup, environment wiring, and tests that validate generated scripts and runtime flags.

Changes

Phased Execution and DCGM Exporter

| Layer / File(s) | Summary |
| --- | --- |
| Core data models and exports `src/cloudai/workloads/ai_dynamo/ai_dynamo.py`, `src/cloudai/workloads/ai_dynamo/__init__.py` | Introduces `DCGMExporter`, adds `dcgm_exporter` to `AIDynamoArgs`, extends `AIPerf` with artifact/health/continuation fields and scalar `extra_args` validation, adds `AIPerfPhase`, `aiperf_phases` on command args, uniqueness validator, and re-exports new types. |
| Test configuration with phased execution and exporter `conf/experimental/ai_dynamo/test/sglang.toml`, `conf/experimental/ai_dynamo/test/vllm.toml`, `conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml`, `conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml` | Updates TOML files to exclude `cmd_args.aiperf_phases` from DSE, define `aiperf_phases` arrays (two rounds) with per-round concurrency/request-count and add DCGM exporter config where applicable. |
| Shell helpers, placeholder and reference scripts `src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`, `src/cloudai/workloads/ai_dynamo/aiperf.sh`, `tests/ref_data/ai-dynamo-aiperf.sh` | Adds `_resolve_aiperf_server_metrics_urls()` and env exports in ai_dynamo.sh, replaces the shipped `aiperf.sh` with a failing placeholder, and adds a reference `ai-dynamo-aiperf.sh` that runs two sequential AIPerf rounds with inter-phase probes and report consolidation. |
| Slurm command generation with DCGM orchestration and script rendering `src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py` | Generates an executable `aiperf.sh` when requested, renders scalar CLI args and per-phase artifact/report paths, builds DCGM `srun` prefix and launcher/cleanup blocks, injects health checks and failure handling, and updates CLI wiring to mount/use generated script. |
| Sbatch DCGM lifecycle and phase script wiring `tests/ref_data/ai-dynamo.sbatch` | Starts DCGM exporter via background `srun`, waits for Slurm step and per-node `/metrics` readiness, provides `stop_dcgm_exporter()` cleanup, runs ai_dynamo with `--workloads aiperf.sh` and DCGM flags, and ensures teardown after completion. |
| Acceptance test configuration and script verification `tests/test_acceptance.py` | Updates test payload to enable `dcgm_exporter`, use `workloads="aiperf.sh"`, configures `AIPerf` with `aiperf_phases`, and asserts generated `aiperf.sh` matches reference. |
| Comprehensive command generation and phase validation tests `tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py` | Expands unit tests to validate generated `aiperf.sh` content, per-phase resolution, server-metrics handling, `extra-args` scalar enforcement, DCGM launcher/cleanup behavior, installable image handling, and phase name uniqueness. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

srivatsankrishnan
jeffnvidia
amaslenn

"🐰 In rounds they run, two phases clear and bright,
Metrics gathered, exporters hum through night,
Scripts assembled, health probes guard the gate,
Reports converge, each phase records its state,
A hopping rabbit cheers this phased profiling flight."

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: enabling multiple AIPerf runs during a single test run, which aligns with the primary objective of the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering new features, examples, and what's changed, all of which match the implemented modifications. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/ref_data/ai-dynamo.sbatch (1)

64-128: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Don't let DCGM cleanup mask the workload failure.

The main srun status is discarded here. If ai_dynamo.sh fails, Line 128 can still make the wrapper exit 0, which would falsely report a green job.

🛠️ Suggested fix

+trap 'stop_dcgm_exporter' EXIT
+
 srun \
   --export=ALL \
   --mpi=pmix \
   -N2 \
@@
   --genai_perf-warmup-request-count "10" \
   --aiperf-name "aiperf" \
   --aiperf-script /cloudai_run_results/aiperf.sh
-
-# Stop DCGM exporter when test finishes
-stop_dcgm_exporter
+workload_status=$?
+exit "${workload_status}"
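
The status-preserving pattern in this suggestion (always run cleanup, but report the workload's exit code) is not specific to shell. A minimal Python rendering follows, using hypothetical names; the sbatch wrapper itself remains a shell script, so this is only an illustration of the idea.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_with_cleanup(cmd: Sequence[str], cleanup: Callable[[], None]) -> int:
    """Run cmd, always call cleanup, and return cmd's exit status."""
    try:
        status = subprocess.run(cmd, check=False).returncode
    finally:
        try:
            cleanup()  # plays the role of stop_dcgm_exporter
        except Exception as exc:
            # A failing cleanup is logged but must not replace the workload status.
            print(f"cleanup failed: {exc}", file=sys.stderr)
    return status


if __name__ == "__main__":
    failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
    code = run_with_cleanup(failing, lambda: print("exporter stopped"))
    print("workload exit status:", code)
```

The key point matches the review comment: the value returned to the scheduler comes from the workload, never from the teardown step.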

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/ref_data/ai-dynamo.sbatch` around lines 64 - 128, The srun wrapper
currently discards ai_dynamo.sh's exit status because stop_dcgm_exporter runs
unconditionally and the script then exits 0; capture and propagate srun's exit
code: immediately after the srun invocation that launches
/cloudai_install/ai_dynamo.sh, save its exit status (e.g., SRUN_EXIT=$?), then
call stop_dcgm_exporter, and finally exit with that saved status (exit
$SRUN_EXIT) so failures from ai_dynamo.sh are not masked; references: the srun
command launching /cloudai_install/ai_dynamo.sh and the stop_dcgm_exporter
invocation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`:
- Around line 204-207: _render_aiperf_script currently only uses setup_cmd from
phases[0], dropping per-phase setup_cmds; update it to call
_resolve_aiperf_phase for each phase (use the phases list and
_resolve_aiperf_phase(phase) to get per-phase.setup_cmd and the phase-specific
commands) and emit/insert each resolved setup_cmd immediately before that
phase's commands (handle the single_phase branch consistently). Alternatively,
if you prefer to restrict to a single setup, add explicit validation in
_render_aiperf_script that raises or logs an error when more than one phase has
a non-empty setup_cmd, referencing AIPerfPhase and _resolve_aiperf_phase so
reviewers can find the changed logic.
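
A rough sketch of the first option (emit each phase's setup command immediately before that phase's commands) might look like the following. The `Phase` dataclass and `render_phase_lines` are hypothetical stand-ins for the resolved `AIPerfPhase` data and for `_render_aiperf_script`, whose real signatures are not shown in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    """Hypothetical stand-in for a resolved AIPerf phase."""
    name: str
    command: str
    setup_cmd: str = ""


def render_phase_lines(phases: List[Phase]) -> List[str]:
    """Emit every phase's own setup_cmd directly before that phase's command."""
    lines: List[str] = []
    for phase in phases:
        lines.append(f"# phase: {phase.name}")
        if phase.setup_cmd:
            # Rendered per phase, rather than only taking phases[0].setup_cmd.
            lines.append(phase.setup_cmd)
        lines.append(phase.command)
    return lines


script = render_phase_lines([
    Phase("round_1", "aiperf --concurrency 2"),
    Phase("round_2", "aiperf --concurrency 4", setup_cmd="echo warm-up"),
])
print("\n".join(script))
```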

In `@tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py`:
- Around line 295-300: The two tests calling AIPerf.model_validate and
AIPerfPhase.model_validate should constrain the raised ValueError by adding a
match argument to pytest.raises so they assert the error is about "extra-args"
being non-scalar; update both with pytest.raises(ValueError, match="extra-args")
(or adjust the match string to the exact validator message emitted, e.g.,
containing "extra-args must be" or similar) to ensure the failure is the
intended validation error.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5e2349dd-82e3-44ad-b1e5-5be1bb236390

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2667b and 2fc6469.

📒 Files selected for processing (13)

conf/experimental/ai_dynamo/test/sglang.toml
conf/experimental/ai_dynamo/test/vllm.toml
conf/experimental/ai_dynamo/test_scenario/vllm_lmcache.toml
conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
src/cloudai/workloads/ai_dynamo/__init__.py
src/cloudai/workloads/ai_dynamo/ai_dynamo.py
src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
src/cloudai/workloads/ai_dynamo/aiperf.sh
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
tests/ref_data/ai-dynamo-aiperf.sh
tests/ref_data/ai-dynamo.sbatch
tests/test_acceptance.py
tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

podkidyshev self-assigned this May 29, 2026

podkidyshev added the enhancement New feature or request label May 29, 2026

multiple aiperf runs

d112a2a

podkidyshev force-pushed the ipod/dynamo-aiperf-runs branch from d563eb9 to d112a2a Compare May 29, 2026 20:58

fix vllm config

c3877ca

Base automatically changed from ipod/dynamo-multiple-perf-2 to main June 1, 2026 09:26

podkidyshev added 13 commits June 1, 2026 12:08

fix filenames for different aiperf iterations

c4512dc

fix copy crash

2ab10fa

refactor and more tests

bbd5cf2

cleaner phases merge

dda28ef

simplify aiperf handling

ea08df7

add remaning fork functionality

10c1a00

fix dcgm endpoint url

08d8e0a

switch dcgm to use enroot to run the image

16e9639

remove state

baa04d9

fail early if dcgm fails

b749cb3

cleanup hardcoded env vars escaping

f39797a

replace kill with scancel

9042f60

reformat aiperf script cmd generation

a1f225d

podkidyshev marked this pull request as ready for review June 1, 2026 17:30

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners June 1, 2026 17:30

Merge branch 'main' into ipod/dynamo-aiperf-runs

2fc6469

coderabbitai Bot reviewed Jun 1, 2026

Comment thread src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py Outdated

Comment thread tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

honor per-phase AIPerf setup commands

2113080

amaslenn approved these changes Jun 1, 2026

podkidyshev merged commit 1c6e284 into main Jun 1, 2026
5 checks passed

podkidyshev deleted the ipod/dynamo-aiperf-runs branch June 1, 2026 20:53

coderabbitai Bot mentioned this pull request Jun 2, 2026

AIDynamo: Optional restart of DynamoRouter between AIPerf re-runs #908

Merged

This was referenced Jun 2, 2026

AIDynamo: shared node disagg inference #909

Merged

add DSE support for AI Dynamo + LMCache aiperf workload #914

Merged

podkidyshev merged 17 commits into main from ipod/dynamo-aiperf-runs

2 participants
