AIDynamo: add semantic degradation evaluation support by podkidyshev · Pull Request #903 · NVIDIA/cloudai

podkidyshev · 2026-05-27T14:53:09Z

Summary

Enable the AIDynamo workload to support semantic degradation scripts, as standalone vllm/sglang already do
- Also let standalone vllm/sglang run custom scripts
Fix AIDynamo runs on clusters where localhost is problematic

Test Plan

Automated CI
Manual runs

Additional Notes

coderabbitai · 2026-05-27T14:53:17Z

📝 Walkthrough

Adds AIPerf accuracy benchmarking: new AIPerfAccuracy config, accuracy.sh runner, CSV parsing and metric extraction, integration into ai_dynamo.sh and SLURM arg generation, aiperf.sh enhancements, test/TOML updates, and documentation for accuracy-mode workflows.

Changes

AIPerf Accuracy Benchmark Support

Layer / File(s)	Summary
AIPerfAccuracy model and accuracy parsing `src/cloudai/workloads/ai_dynamo/ai_dynamo.py`	`AIPerfAccuracy` config and constants for accuracy artifact locations; `AIPerf.setup_cmd` field; `parse_aiperf_accuracy()` and helpers to find/normalize accuracy from CSV artifacts.
Accuracy script and bash orchestration `src/cloudai/workloads/ai_dynamo/accuracy.sh`, `src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`	New `accuracy.sh` to run accuracy benchmarks with templated CLI and artifact copying; `ai_dynamo.sh` extended to accept aiperf_accuracy args, resolve side-channel host via `_current_node_ip()`, add `mark_failed()`, and short-circuit on workload failures.
AIPerf script enhancements `src/cloudai/workloads/ai_dynamo/aiperf.sh`	Refactors to support `--artifact-dir-name`, `--cmd`, `--setup-cmd`, parsing of profile args after `--`, extra-args array handling, and robust CSV artifact discovery/collection into result-dir.
Command generation and reporting integration `src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py`, `src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`, `src/cloudai/workloads/ai_dynamo/__init__.py`	SLURM generator now uses `shlex.quote` for complex TOML string values and emits `--aiperf_accuracy-` nested args when configured; report generation supports `accuracy` metric by parsing accuracy CSV; `AIPerfAccuracy` exported.
Test configs and documentation `conf/experimental/ai_dynamo/test/sglang.toml`, `conf/experimental/ai_dynamo/test/vllm.toml`, `conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml`, `doc/workloads/ai_dynamo.rst`, `doc/workloads/sglang.rst`, `doc/workloads/vllm.rst`	Config TOMLs switched to `aiperf.sh`, added/updated `cmd_args.aiperf_accuracy` sections, adjusted worker-initialized regexes and env vars; sglang_slurm adds single/multinode tests; docs add "Semantic Degradation With AIPerf Accuracy" and update semantic-eval CLI templating examples.
Test coverage for accuracy functionality `tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py`, `tests/workloads/ai_dynamo/test_report_gen_strategy.py`, `tests/workloads/sglang/test_command_gen_strategy_slurm.py`, `tests/workloads/vllm/test_command_gen_strategy_slurm.py`	Adds tests for aiperf_accuracy arg splitting and quoting, fixtures producing `accuracy_results.csv`, unit tests for `parse_aiperf_accuracy`, and updated semantic-eval command expectations for sglang/vllm changes.
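
The command-generation row above mentions quoting complex string values with `shlex.quote` and emitting nested `--aiperf_accuracy-` arguments. The following is a minimal sketch of that idea, not the repository's actual implementation: the helper name `nested_args`, the dict layout, and the sample keys and values are all hypothetical.

```python
import shlex


def nested_args(prefix: str, options: dict) -> list:
    """Flatten (possibly nested) options into `--<prefix>-<key> <value>` tokens.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    args = []
    for key, value in options.items():
        if isinstance(value, dict):
            # Nested tables become dash-joined flag names.
            args.extend(nested_args(f"{prefix}-{key}", value))
        else:
            # shlex.quote leaves simple tokens untouched and single-quotes
            # anything containing spaces or shell metacharacters.
            args.append(f"--{prefix}-{key} {shlex.quote(str(value))}")
    return args


# Sample input is invented; it is not taken from the PR's TOML files.
cmd_args = nested_args(
    "aiperf_accuracy",
    {"cmd": "my-tool run --verbose", "args": {"tasks": "demo"}},
)
print(cmd_args)
```

Quoting each value before it is joined into the command line keeps a multi-word string as a single shell argument, which is presumably why the generator applies it to complex TOML strings.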

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/cloudai#898: Builds directly on this PR's AIPerf groundwork by extending the existing aiperf.sh/AIPerf command wiring and adding new aiperf_accuracy workload and metric parsing capabilities.
NVIDIA/cloudai#890: Overlaps on semantic-eval and accuracy/metric wiring in vLLM/SGLang command generation and reporting.

Suggested reviewers

srivatsankrishnan
jeffnvidia

Poem

A rabbit hops through accuracy's maze,
It templates CLIs and hunts CSV haze,
Side-channels hum and scripts take flight,
Metrics bloom from artifacts bright,
Hooray for benchmarks—happy hops tonight! 🐇✨

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'AIDynamo: add semantic degradation evaluation support' directly and specifically describes the main objective of enabling AIDynamo to support semantic degradation evaluation, which aligns with the primary changes across the modified files.
Description check	✅ Passed	The PR description outlines the main changes: enabling AIDynamo semantic degradation support, making vllm/sglang support custom scripts, and fixing localhost issues on clusters, all of which are clearly reflected in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 536-556: _parse_accuracy_value has an ambiguous edge case for
inputs equal to 1 or "1.0": the current logic treats them as already-normalized
(1.0) rather than 100%, which is a deliberate but undocumented choice; update
the function by adding an explicit check and short comment documenting the
assumption (e.g., if numeric value == 1.0 or parsed float == 1.0 return 1.0), so
the behavior is explicit and readable in _parse_accuracy_value and future
reviewers/tests can rely on that contract.

In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py`:
- Around line 298-317: Add a test that covers raw numeric Accuracy cells (no %
sign) by creating an artifacts dir, writing an accuracy_results.csv whose
OVERALL Accuracy is a plain numeric like "0.35", and asserting
parse_aiperf_accuracy(tmp_path) == 0.35; e.g., add a new test function
(test_parse_aiperf_accuracy_numeric_value) alongside the existing tests that
uses (artifact_dir / "accuracy_results.csv").write_text(...) and calls
parse_aiperf_accuracy to verify numeric parsing works.
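
The two findings above concern how an accuracy cell is normalized and how a plain numeric value is read from the CSV. A minimal sketch of one possible contract follows. It is an assumption-laden illustration rather than the code under review: the function names, the `Task` and `Accuracy` column names, and the rule that values above 1.0 are percentages are all hypothetical.

```python
import csv
import io
from typing import Optional


def parse_accuracy_value(raw: str) -> float:
    """Normalize an accuracy cell to a fraction (assumed contract)."""
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    value = float(text)
    # Assumption: values up to and including 1.0 are already fractions,
    # so "1" and "1.0" mean 100% accuracy, not 1%.
    if value <= 1.0:
        return value
    # Assumption: larger values are percentages written without a % sign.
    return value / 100.0


def overall_accuracy(csv_text: str) -> Optional[float]:
    """Return the accuracy of the OVERALL row, or None if it is missing."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "Task" and "Accuracy" are hypothetical column names.
        if row.get("Task", "").strip().upper() == "OVERALL":
            return parse_accuracy_value(row["Accuracy"])
    return None


sample = "Task,Accuracy\nmath,0.5\nOVERALL,0.35\n"
print(overall_accuracy(sample))
```

Spelling out the `value <= 1.0` branch with a comment makes the edge case explicit, which is the kind of documentation the first finding asks for.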

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0e313ca1-5654-4eac-8bf8-2ae0787ced38

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc9cb4 and 1c99d60.

📒 Files selected for processing (13)

conf/experimental/ai_dynamo/test/sglang.toml
conf/experimental/ai_dynamo/test/vllm.toml
conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml
doc/workloads/ai_dynamo.rst
src/cloudai/workloads/ai_dynamo/__init__.py
src/cloudai/workloads/ai_dynamo/accuracy.sh
src/cloudai/workloads/ai_dynamo/ai_dynamo.py
src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
src/cloudai/workloads/ai_dynamo/aiperf.sh
src/cloudai/workloads/ai_dynamo/report_generation_strategy.py
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py
tests/workloads/ai_dynamo/test_report_gen_strategy.py

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/workloads/sglang/test_command_gen_strategy_slurm.py (1)

169-174: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Tighten the custom semantic-eval assertion to lock command shape.

Line 170-Line 174 only validate first/last elements, so extra unintended segments could slip through undetected. Assert the full list instead.

Proposed test assertion update

-    assert command is not None
-    assert command[0] == "python3 /custom/semantic_eval.py"
-    assert command[-1] == (
-        f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl "
-        "--seen ${NODE}:8000"
-    )
+    assert command == [
+        "python3 /custom/semantic_eval.py",
+        f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl "
+        "--seen ${NODE}:8000",
+    ]

Based on learnings: In this repository, prefer expressing behavioral documentation through tests rather than docstrings.
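
To illustrate why the first/last checks are weaker than a full-list comparison, here is a small self-contained sketch. The output path `/out` and the injected `--unexpected-flag` segment are made up for the example; only the first and last elements mirror the strings in the test above.

```python
expected = [
    "python3 /custom/semantic_eval.py",
    "--num-questions 200 --data-path /out/gsm8k.jsonl --seen ${NODE}:8000",
]

# Hypothetical regression: an unintended segment appears in the middle.
regressed = [expected[0], "--unexpected-flag", expected[-1]]


def loose_check(command: list) -> bool:
    """Mirror the original assertions: only first and last elements."""
    return command[0] == expected[0] and command[-1] == expected[-1]


def strict_check(command: list) -> bool:
    """Mirror the proposed assertion: the whole list must match."""
    return command == expected


print(loose_check(regressed), strict_check(regressed))
```

The loose check accepts the regressed command while the strict check rejects it, which is the gap the nitpick describes.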

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/workloads/sglang/test_command_gen_strategy_slurm.py` around lines 169 -
174, The test currently only checks command[0] and command[-1], allowing
extra/incorrect segments; replace those partial assertions with a single strict
equality assertion that the entire command list equals the expected list
constructed from sglang_cmd_gen_strategy.test_run.output_path.absolute() and the
known segments (e.g., the first element "python3 /custom/semantic_eval.py" and
the full final flags string using f"--num-questions 200 --data-path
{sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl --seen
${NODE}:8000"); locate the assertions around the variable command in
test_command_gen_strategy_slurm.py and change them to assert command == [
...expected elements... ] so the test locks the exact command shape.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 87d33517-aab9-40c8-8d28-ba8f5a4c74f4

📥 Commits

Reviewing files that changed from the base of the PR and between 1c99d60 and 1d09c7d.

📒 Files selected for processing (11)

conf/experimental/sglang/test/sglang.toml
conf/experimental/vllm/test/vllm.toml
doc/workloads/sglang.rst
doc/workloads/vllm.rst
src/cloudai/workloads/common/llm_serving.py
src/cloudai/workloads/sglang/sglang.py
src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
src/cloudai/workloads/vllm/vllm.py
tests/workloads/sglang/test_command_gen_strategy_slurm.py
tests/workloads/vllm/test_command_gen_strategy_slurm.py

podkidyshev added 8 commits May 26, 2026 19:04

implement semantic degradataion for aidynamo using aiperf

2729197

update conf fix nixl connector

72c8ef2

accuracy fixes

b023c34

add aiperf setup for accuracy test

0e20d74

hard bump aiperf

b6c7982

enable hf online

de1110e

remove first token conf

632e8f5

disable qwen thinking

b8217ec

podkidyshev self-assigned this May 27, 2026

podkidyshev added the feature label May 27, 2026

podkidyshev added 7 commits May 27, 2026 08:39

run both perf and accuracy tests

3c05fb5

refactor

92d4c89

udpate sglang config

6ecf52e

trying to fix missing aiperf for sgalng

e07a44f

fixing sglang

a3e7092

allowing custom scripts

333b272

remove redundant test

1c99d60

podkidyshev marked this pull request as ready for review May 27, 2026 18:59

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners May 27, 2026 18:59

coderabbitai Bot reviewed May 27, 2026

Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.py

Comment thread tests/workloads/ai_dynamo/test_report_gen_strategy.py

support custom scripts for vllm and sglang

1d09c7d

coderabbitai Bot reviewed May 27, 2026

amaslenn approved these changes May 27, 2026

podkidyshev merged commit 0738c4b into main May 27, 2026
5 checks passed

podkidyshev deleted the ipod/aidynamo-semantic-2 branch May 27, 2026 21:14

podkidyshev merged 16 commits into main from ipod/aidynamo-semantic-2
