AIDynamo: add semantic degradation evaluation support #903

Merged
podkidyshev merged 16 commits into main from ipod/aidynamo-semantic-2
May 27, 2026

Conversation

@podkidyshev

@podkidyshev podkidyshev commented May 27, 2026

Contributor

Summary

  • Enable the AIDynamo workload to support semantic degradation scripts, just like standalone vLLM/SGLang
    • Also make standalone vLLM/SGLang support custom scripts
  • Fix AIDynamo runs on clusters with problematic localhost

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Contributor

Walkthrough

Adds AIPerf accuracy benchmarking: new AIPerfAccuracy config, accuracy.sh runner, CSV parsing and metric extraction, integration into ai_dynamo.sh and SLURM arg generation, aiperf.sh enhancements, test/TOML updates, and documentation for accuracy-mode workflows.

Changes

AIPerf Accuracy Benchmark Support

Layer: AIPerfAccuracy model and accuracy parsing
File(s): src/cloudai/workloads/ai_dynamo/ai_dynamo.py
Summary: AIPerfAccuracy config and constants for accuracy artifact locations; AIPerf.setup_cmd field; parse_aiperf_accuracy() and helpers to find/normalize accuracy from CSV artifacts.

Layer: Accuracy script and bash orchestration
File(s): src/cloudai/workloads/ai_dynamo/accuracy.sh, src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
Summary: New accuracy.sh to run accuracy benchmarks with templated CLI and artifact copying; ai_dynamo.sh extended to accept aiperf_accuracy args, resolve side-channel host via _current_node_ip(), add mark_failed(), and short-circuit on workload failures.

Layer: AIPerf script enhancements
File(s): src/cloudai/workloads/ai_dynamo/aiperf.sh
Summary: Refactors to support --artifact-dir-name, --cmd, --setup-cmd, parsing of profile args after --, extra-args array handling, and robust CSV artifact discovery/collection into result-dir.

Layer: Command generation and reporting integration
File(s): src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py, src/cloudai/workloads/ai_dynamo/report_generation_strategy.py, src/cloudai/workloads/ai_dynamo/__init__.py
Summary: SLURM generator now uses shlex.quote for complex TOML string values and emits --aiperf_accuracy- nested args when configured; report generation supports accuracy metric by parsing accuracy CSV; AIPerfAccuracy exported.

Layer: Test configs and documentation
File(s): conf/experimental/ai_dynamo/test/sglang.toml, conf/experimental/ai_dynamo/test/vllm.toml, conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml, doc/workloads/ai_dynamo.rst, doc/workloads/sglang.rst, doc/workloads/vllm.rst
Summary: Config TOMLs switched to aiperf.sh, added/updated cmd_args.aiperf_accuracy sections, adjusted worker-initialized regexes and env vars; sglang_slurm adds single/multinode tests; docs add "Semantic Degradation With AIPerf Accuracy" and update semantic-eval CLI templating examples.

Layer: Test coverage for accuracy functionality
File(s): tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py, tests/workloads/ai_dynamo/test_report_gen_strategy.py, tests/workloads/sglang/test_command_gen_strategy_slurm.py, tests/workloads/vllm/test_command_gen_strategy_slurm.py
Summary: Adds tests for aiperf_accuracy arg splitting and quoting, fixtures producing accuracy_results.csv, unit tests for parse_aiperf_accuracy, and updated semantic-eval command expectations for sglang/vllm changes.
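
The quoting and nested-argument behavior described for the SLURM generator can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `flatten_args`, the dash-joined flag layout, and the sample keys and values are assumptions made for illustration. Only `shlex.quote` and the `aiperf_accuracy` prefix come from the summary above.

```python
import shlex


def flatten_args(prefix: str, values: dict) -> list[str]:
    """Turn a nested mapping into '--<prefix>-<key> <value>' strings.

    Hypothetical layout for illustration; the real generator may differ.
    """
    args: list[str] = []
    for key, value in values.items():
        name = f"{prefix}-{key}"
        if isinstance(value, dict):
            # Nested tables become longer dash-joined flag names.
            args.extend(flatten_args(name, value))
        else:
            # shlex.quote leaves simple tokens untouched and single-quotes
            # anything containing spaces or shell metacharacters.
            args.append(f"--{name} {shlex.quote(str(value))}")
    return args


# Hypothetical config values, not taken from the PR's TOML files.
config = {"cmd": "python3 run.py --name 'a b'", "extra": {"tasks": "demo"}}
args = flatten_args("aiperf_accuracy", config)
```

Splitting each emitted string with `shlex.split` should recover the original value as a single token, which is the property that matters when the arguments pass through a shell.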

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/cloudai#898: Builds directly on this PR's AIPerf groundwork by extending the existing aiperf.sh/AIPerf command wiring and adding new aiperf_accuracy workload and metric parsing capabilities.
  • NVIDIA/cloudai#890: Overlaps on semantic-eval and accuracy/metric wiring in vLLM/SGLang command generation and reporting.

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

Poem

A rabbit hops through accuracy's maze,
It templates CLIs and hunts CSV haze,
Side-channels hum and scripts take flight,
Metrics bloom from artifacts bright,
Hooray for benchmarks—happy hops tonight! 🐇✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title 'AIDynamo: add semantic degradation evaluation support' directly and specifically describes the main objective of enabling AIDynamo to support semantic degradation evaluation, which aligns with the primary changes across the modified files.
Description check | ✅ Passed | The PR description outlines the main changes: enabling AIDynamo semantic degradation support, making vllm/sglang support custom scripts, and fixing localhost issues on clusters, all of which are clearly reflected in the changeset.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@podkidyshev podkidyshev marked this pull request as ready for review May 27, 2026 18:59

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 536-556: _parse_accuracy_value has an ambiguous edge case for
inputs equal to 1 or "1.0": the current logic treats them as already-normalized
(1.0) rather than 100%, which is a deliberate but undocumented choice; update
the function by adding an explicit check and short comment documenting the
assumption (e.g., if numeric value == 1.0 or parsed float == 1.0 return 1.0), so
the behavior is explicit and readable in _parse_accuracy_value and future
reviewers/tests can rely on that contract.

In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py`:
- Around line 298-317: Add a test that covers raw numeric Accuracy cells (no %
sign) by creating an artifacts dir, writing an accuracy_results.csv whose
OVERALL Accuracy is a plain numeric like "0.35", and asserting
parse_aiperf_accuracy(tmp_path) == 0.35; e.g., add a new test function
(test_parse_aiperf_accuracy_numeric_value) alongside the existing tests that
uses (artifact_dir / "accuracy_results.csv").write_text(...) and calls
parse_aiperf_accuracy to verify numeric parsing works.
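
The two findings above concern how an accuracy cell is normalized and how the OVERALL row is read from the CSV. A minimal sketch of that idea follows. It is not the PR's `_parse_accuracy_value` or `parse_aiperf_accuracy`: the function names, the `Task` column name, and the rule that values above 1 are percentages are assumptions. Only the `OVERALL` row, the `Accuracy` column, and the treatment of 1.0 as already normalized come from the review text.

```python
import csv
import io
from typing import Optional


def normalize_accuracy(raw: str) -> float:
    """Map '35%', '35', or '0.35' to a fraction in [0, 1] (assumed convention)."""
    text = raw.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    value = float(text)
    # Assumption: values above 1 are percentages. Exactly 1.0 is read as
    # already normalized (100% accuracy), not as 1%.
    return value / 100.0 if value > 1.0 else value


def overall_accuracy(csv_text: str) -> Optional[float]:
    """Return the normalized Accuracy of the OVERALL row, or None if absent."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # 'Task' is a hypothetical column name for the row label.
        if row.get("Task") == "OVERALL":
            return normalize_accuracy(row["Accuracy"])
    return None


# Hypothetical CSV content with a raw numeric OVERALL cell (no % sign).
sample = "Task,Accuracy\nmath,40%\nOVERALL,0.35\n"
result = overall_accuracy(sample)
```

Writing the 1.0 case as an explicit comment or branch, as the review suggests, makes the contract visible to tests instead of leaving it implied by the comparison operator.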
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0e313ca1-5654-4eac-8bf8-2ae0787ced38

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc9cb4 and 1c99d60.

📒 Files selected for processing (13)
  • conf/experimental/ai_dynamo/test/sglang.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test_scenario/sglang_slurm.toml
  • doc/workloads/ai_dynamo.rst
  • src/cloudai/workloads/ai_dynamo/__init__.py
  • src/cloudai/workloads/ai_dynamo/accuracy.sh
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.py
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
  • src/cloudai/workloads/ai_dynamo/aiperf.sh
  • src/cloudai/workloads/ai_dynamo/report_generation_strategy.py
  • src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
  • tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py
  • tests/workloads/ai_dynamo/test_report_gen_strategy.py

Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.py
Comment thread tests/workloads/ai_dynamo/test_report_gen_strategy.py

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/workloads/sglang/test_command_gen_strategy_slurm.py (1)

169-174: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Tighten the custom semantic-eval assertion to lock command shape.

Lines 170-174 validate only the first and last elements, so extra unintended segments could slip through undetected. Assert the full list instead.

Proposed test assertion update
-    assert command is not None
-    assert command[0] == "python3 /custom/semantic_eval.py"
-    assert command[-1] == (
-        f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl "
-        "--seen ${NODE}:8000"
-    )
+    assert command == [
+        "python3 /custom/semantic_eval.py",
+        f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl "
+        "--seen ${NODE}:8000",
+    ]
Based on learnings: In this repository, prefer expressing behavioral documentation through tests rather than docstrings.
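
Why the partial assertions are weaker can be shown with a small standalone sketch. The command lists here are made up for illustration (including the `--unexpected-flag` segment and the shortened flag string) and are not output from the repository's command generator.

```python
# Hypothetical command containing an unintended middle segment.
command = [
    "python3 /custom/semantic_eval.py",
    "--unexpected-flag",
    "--num-questions 200 --seen ${NODE}:8000",
]
expected = [
    "python3 /custom/semantic_eval.py",
    "--num-questions 200 --seen ${NODE}:8000",
]

# Checking only the first and last elements passes despite the extra segment.
partial_ok = command[0] == expected[0] and command[-1] == expected[-1]

# Comparing the whole list catches it.
full_ok = command == expected
```

Here `partial_ok` is true while `full_ok` is false, which is the gap the full-list assertion closes.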
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/workloads/sglang/test_command_gen_strategy_slurm.py` around lines 169 -
174, The test currently only checks command[0] and command[-1], allowing
extra/incorrect segments; replace those partial assertions with a single strict
equality assertion that the entire command list equals the expected list
constructed from sglang_cmd_gen_strategy.test_run.output_path.absolute() and the
known segments (e.g., the first element "python3 /custom/semantic_eval.py" and
the full final flags string using f"--num-questions 200 --data-path
{sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl --seen
${NODE}:8000"); locate the assertions around the variable command in
test_command_gen_strategy_slurm.py and change them to assert command == [
...expected elements... ] so the test locks the exact command shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 87d33517-aab9-40c8-8d28-ba8f5a4c74f4

📥 Commits

Reviewing files that changed from the base of the PR and between 1c99d60 and 1d09c7d.

📒 Files selected for processing (11)
  • conf/experimental/sglang/test/sglang.toml
  • conf/experimental/vllm/test/vllm.toml
  • doc/workloads/sglang.rst
  • doc/workloads/vllm.rst
  • src/cloudai/workloads/common/llm_serving.py
  • src/cloudai/workloads/sglang/sglang.py
  • src/cloudai/workloads/sglang/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/slurm_command_gen_strategy.py
  • src/cloudai/workloads/vllm/vllm.py
  • tests/workloads/sglang/test_command_gen_strategy_slurm.py
  • tests/workloads/vllm/test_command_gen_strategy_slurm.py

@podkidyshev podkidyshev merged commit 0738c4b into main May 27, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/aidynamo-semantic-2 branch May 27, 2026 21:14

2 participants