AIDynamo: add semantic degradation evaluation support#903
Conversation
📝 WalkthroughWalkthroughAdds AIPerf accuracy benchmarking: new AIPerfAccuracy config, accuracy.sh runner, CSV parsing and metric extraction, integration into ai_dynamo.sh and SLURM arg generation, aiperf.sh enhancements, test/TOML updates, and documentation for accuracy-mode workflows. ChangesAIPerf Accuracy Benchmark Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Around line 536-556: _parse_accuracy_value has an ambiguous edge case for
inputs equal to 1 or "1.0": the current logic treats them as already-normalized
(1.0) rather than 100%, which is a deliberate but undocumented choice; update
the function by adding an explicit check and short comment documenting the
assumption (e.g., if numeric value == 1.0 or parsed float == 1.0 return 1.0), so
the behavior is explicit and readable in _parse_accuracy_value and future
reviewers/tests can rely on that contract.
In `@tests/workloads/ai_dynamo/test_report_gen_strategy.py`:
- Around line 298-317: Add a test that covers raw numeric Accuracy cells (no %
sign) by creating an artifacts dir, writing an accuracy_results.csv whose
OVERALL Accuracy is a plain numeric like "0.35", and asserting
parse_aiperf_accuracy(tmp_path) == 0.35; e.g., add a new test function
(test_parse_aiperf_accuracy_numeric_value) alongside the existing tests that
uses (artifact_dir / "accuracy_results.csv").write_text(...) and calls
parse_aiperf_accuracy to verify numeric parsing works.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0e313ca1-5654-4eac-8bf8-2ae0787ced38
📒 Files selected for processing (13)
conf/experimental/ai_dynamo/test/sglang.tomlconf/experimental/ai_dynamo/test/vllm.tomlconf/experimental/ai_dynamo/test_scenario/sglang_slurm.tomldoc/workloads/ai_dynamo.rstsrc/cloudai/workloads/ai_dynamo/__init__.pysrc/cloudai/workloads/ai_dynamo/accuracy.shsrc/cloudai/workloads/ai_dynamo/ai_dynamo.pysrc/cloudai/workloads/ai_dynamo/ai_dynamo.shsrc/cloudai/workloads/ai_dynamo/aiperf.shsrc/cloudai/workloads/ai_dynamo/report_generation_strategy.pysrc/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.pytests/workloads/ai_dynamo/test_command_gen_strategy_slurm.pytests/workloads/ai_dynamo/test_report_gen_strategy.py
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/workloads/sglang/test_command_gen_strategy_slurm.py (1)
169-174: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winTighten the custom semantic-eval assertion to lock command shape.
Line 170-Line 174 only validate first/last elements, so extra unintended segments could slip through undetected. Assert the full list instead.
Based on learnings: In this repository, prefer expressing behavioral documentation through tests rather than docstrings.Proposed test assertion update
- assert command is not None - assert command[0] == "python3 /custom/semantic_eval.py" - assert command[-1] == ( - f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl " - "--seen ${NODE}:8000" - ) + assert command == [ + "python3 /custom/semantic_eval.py", + f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl " + "--seen ${NODE}:8000", + ]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/workloads/sglang/test_command_gen_strategy_slurm.py` around lines 169 - 174, The test currently only checks command[0] and command[-1], allowing extra/incorrect segments; replace those partial assertions with a single strict equality assertion that the entire command list equals the expected list constructed from sglang_cmd_gen_strategy.test_run.output_path.absolute() and the known segments (e.g., the first element "python3 /custom/semantic_eval.py" and the full final flags string using f"--num-questions 200 --data-path {sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl --seen ${NODE}:8000"); locate the assertions around the variable command in test_command_gen_strategy_slurm.py and change them to assert command == [ ...expected elements... ] so the test locks the exact command shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/workloads/sglang/test_command_gen_strategy_slurm.py`:
- Around line 169-174: The test currently only checks command[0] and
command[-1], allowing extra/incorrect segments; replace those partial assertions
with a single strict equality assertion that the entire command list equals the
expected list constructed from
sglang_cmd_gen_strategy.test_run.output_path.absolute() and the known segments
(e.g., the first element "python3 /custom/semantic_eval.py" and the full final
flags string using f"--num-questions 200 --data-path
{sglang_cmd_gen_strategy.test_run.output_path.absolute()}/gsm8k.jsonl --seen
${NODE}:8000"); locate the assertions around the variable command in
test_command_gen_strategy_slurm.py and change them to assert command == [
...expected elements... ] so the test locks the exact command shape.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 87d33517-aab9-40c8-8d28-ba8f5a4c74f4
📒 Files selected for processing (11)
conf/experimental/sglang/test/sglang.tomlconf/experimental/vllm/test/vllm.tomldoc/workloads/sglang.rstdoc/workloads/vllm.rstsrc/cloudai/workloads/common/llm_serving.pysrc/cloudai/workloads/sglang/sglang.pysrc/cloudai/workloads/sglang/slurm_command_gen_strategy.pysrc/cloudai/workloads/vllm/slurm_command_gen_strategy.pysrc/cloudai/workloads/vllm/vllm.pytests/workloads/sglang/test_command_gen_strategy_slurm.pytests/workloads/vllm/test_command_gen_strategy_slurm.py
Summary
localhostTest Plan
Additional Notes