add DSE support for AI Dynamo + LMCache aiperf workload#914
Conversation
|
Complex PR? Review this PR in Change Stack to move by importance, not file order. Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds an AI Dynamo design space exploration reward function that computes log-scaled optimization scores from TTFT and throughput metrics, registers it globally, and hardens the report generation and grid search components to handle flexible metrics and configurable step limits with graceful CSV parsing. ChangesAI Dynamo DSE Enhancements
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml`:
- Line 32: Update the comment to match the actual formula used by
ai_dynamo_log_scale_reward: replace "Effective reward: log(throughput+1) -
0.8*log(TTFT+1)" with a line that reflects throughput reward minus
0.7*log(TTFT+1) minus 0.1*log(ITL+1), referencing the ai_dynamo_log_scale_reward
implementation which computes reward = throughput_reward - 0.7 * ttft_penalty -
0.1 * itl_penalty.
In `@src/cloudai/configurator/reward_functions.py`:
- Around line 66-68: The non-positive metrics sentinel is inconsistent:
ai_dynamo_weighted_normalized_reward returns -1.0 while
ai_dynamo_log_scale_reward returns -1e-3; change ai_dynamo_log_scale_reward so
it returns the same terminal sentinel (-1.0) when ttft <= 0 or itl <= 0 or
throughput <= 0 to match ai_dynamo_weighted_normalized_reward and ensure the DSE
runtime treats both failures identically.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Line 42: The read_csv call currently uses on_bad_lines="skip" which drops
malformed rows silently; replace it with an on_bad_lines callable that logs each
skipped line (e.g., using logging.getLogger(__name__) or an existing module
logger) and optionally increments a counter so you can emit a single warning
summarizing how many rows were skipped; update the lazy.pd.read_csv invocation
(the call that assigns df from lazy.pd.read_csv(csv_file, ...)) to pass this
callable instead of the string so skipped row content/line numbers are recorded
for observability.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 36404434-d2f0-4151-89e5-87b47f8307f5
📒 Files selected for processing (9)
conf/experimental/ai_dynamo/system/daria_b300.tomlconf/experimental/ai_dynamo/system/lyris_gb200.tomlconf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.tomlconf/experimental/ai_dynamo/test/vllm.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.tomlsrc/cloudai/configurator/grid_search.pysrc/cloudai/configurator/reward_functions.pysrc/cloudai/workloads/ai_dynamo/report_generation_strategy.py
380528a to
05494e0
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/cloudai/configurator/reward_functions.py (1)
126-127:⚠️ Potential issue | 🟠 Major | ⚡ Quick winInconsistent invalid-observation sentinel across reward functions.
ai_dynamo_weighted_normalized_rewardreturns-1.0for non-positive metrics (line 67), whileai_dynamo_log_scale_rewardreturns-1e-3for the same condition. This inconsistency could cause different failure-handling behavior if the DSE runtime treats-1.0as a terminal failure signal but-1e-3does not trigger the same handling.🔧 Proposed fix to standardize the sentinel value
if ttft <= 0 or itl <= 0 or throughput <= 0: - return -1e-3 + return -1.0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/configurator/reward_functions.py` around lines 126 - 127, The two reward functions use different sentinels for invalid observations; make them consistent by choosing a single sentinel (e.g., REWARD_INVALID = -1.0) and use it in both ai_dynamo_weighted_normalized_reward and ai_dynamo_log_scale_reward instead of returning -1e-3 in ai_dynamo_log_scale_reward; add the shared constant near the top of the module and replace the literal returns in both functions with that constant to ensure uniform failure signaling.src/cloudai/workloads/ai_dynamo/report_generation_strategy.py (1)
42-42: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoffConsider logging skipped CSV rows for observability.
Using
on_bad_lines="skip"silently discards malformed CSV rows, which could mask data quality issues. Adding a warning log when rows are skipped would help diagnose unexpected metric values or missing data.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py` at line 42, The current call df = lazy.pd.read_csv(csv_file, on_bad_lines="skip") silently drops malformed CSV rows; change this to log skipped rows by passing an on_bad_lines callable (or use on_bad_lines="warn") that records which lines were skipped to the module logger (e.g., logger or self.logger) and include CSV filename/context; ensure the module imports and uses the appropriate logger and that the callable logs the bad line content or line number so data-quality issues are observable when using report_generation_strategy.py and the df = lazy.pd.read_csv(...) call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/cloudai/configurator/reward_functions.py`:
- Around line 126-127: The two reward functions use different sentinels for
invalid observations; make them consistent by choosing a single sentinel (e.g.,
REWARD_INVALID = -1.0) and use it in both ai_dynamo_weighted_normalized_reward
and ai_dynamo_log_scale_reward instead of returning -1e-3 in
ai_dynamo_log_scale_reward; add the shared constant near the top of the module
and replace the literal returns in both functions with that constant to ensure
uniform failure signaling.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Line 42: The current call df = lazy.pd.read_csv(csv_file, on_bad_lines="skip")
silently drops malformed CSV rows; change this to log skipped rows by passing an
on_bad_lines callable (or use on_bad_lines="warn") that records which lines were
skipped to the module logger (e.g., logger or self.logger) and include CSV
filename/context; ensure the module imports and uses the appropriate logger and
that the callable logs the bad line content or line number so data-quality
issues are observable when using report_generation_strategy.py and the df =
lazy.pd.read_csv(...) call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 7cf331c0-9e0f-4bfd-b310-1c0b06671002
📒 Files selected for processing (7)
conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.tomlconf/experimental/ai_dynamo/test/vllm.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.tomlsrc/cloudai/configurator/grid_search.pysrc/cloudai/configurator/reward_functions.pysrc/cloudai/workloads/ai_dynamo/report_generation_strategy.py
05494e0 to
8f5472d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/configurator/grid_search.py`:
- Line 62: The code treats self.env.max_steps falsily so agent_steps=0 is
treated as unset; update the assignment for self.max_steps to check for None
explicitly (e.g., use "if self.env.max_steps is not None") so that an explicit 0
is respected, keeping the fallback to len(self.action_combinations) only when
max_steps is actually None; change the expression around self.max_steps,
self.action_combinations, and env.max_steps accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 86c063a0-8763-43ab-abd9-d1244fa8098e
📒 Files selected for processing (7)
conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.tomlconf/experimental/ai_dynamo/test/vllm.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.tomlsrc/cloudai/configurator/grid_search.pysrc/cloudai/configurator/reward_functions.pysrc/cloudai/workloads/ai_dynamo/report_generation_strategy.py
8f5472d to
6c14d21
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml (1)
1-128:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPipeline failure: TOML formatting required.
The CI pipeline reports that this file is not properly formatted according to
taplo-fmtrequirements. Runtaplo fmton this file to apply the required formatting changes before merging.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml` around lines 1 - 128, The TOML file (containing name = "dse_vllm_lmcache_cmx" and sections like cmd_args, cmd_args.lmcache, cmd_args.lmcache.extra_config.nixl_backend_params) is not formatted to taplo-fmt standards; fix it by running taplo fmt on this file (or apply equivalent taplo formatting) and commit the formatted file so arrays, tables, indentation, and inline values conform to taplo's output before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml`:
- Line 27: The docker_image_url entry currently points to a user-specific Lustre
path; update the docker_image_url in dse_vllm_lmcache_daria.toml to use a
registry-based image (e.g., nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1) for
portability, or if the local image is required for Daria-only testing, replace
the value with the local path but add a comment next to docker_image_url
explaining why the local image is necessary and include brief instructions on
how to obtain or build that image for other users.
In `@src/cloudai/configurator/grid_search.py`:
- Line 62: The assignment to self.max_steps exceeds the line-length limit;
refactor the expression that sets self.max_steps (which references
self.action_combinations and self.env.max_steps) into a shorter, wrapped form
(e.g., assign the right-hand value to a small intermediate variable or use
parentheses and line breaks) so the line is under 120 chars, then run ruff
format to reformat the file.
- Line 62: The GridSearchAgent.configure() call reads self.env.max_steps and the
tests use a MagicMock without that attribute which raises AttributeError; update
the test fixture in tests/test_agents.py to set env.max_steps = None (or an
appropriate int) on the MagicMock(spec=CloudAIGymEnv). In the
GridSearchAgent.configure implementation (the line computing self.max_steps)
split the long expression into a readable, wrapped form and explicitly handle
zero versus None (e.g., check if self.env.max_steps is None vs 0) so that
agent_steps=0 is treated as a valid value; reference the method
GridSearchAgent.configure and the assignment to self.max_steps when making these
changes.
In `@src/cloudai/configurator/reward_functions.py`:
- Around line 137-144: The new reward function currently returns -1e-3 for
invalid observations (checks: len(observation) < 2 and ttft <= 0 or throughput
<= 0); change these sentinel returns to -1.0 to match
ai_dynamo_weighted_normalized_reward so failure handling is consistent across
reward functions. Locate the invalid-observation branches in the new function in
reward_functions.py (the len(observation) < 2 early return and the
ttft/throughput <= 0 check) and replace -1e-3 with -1.0.
In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 28-32: Update the __contains__ signature on class _AnyMetricSet to
use a more specific type (e.g., Any, Hashable, or a MetricType alias) instead of
plain object, or if you intentionally want duck-typing/compatibility keep object
but add an explanatory comment above __contains__ stating why object is used;
ensure the change references the _AnyMetricSet class and its __contains__ method
so readers understand the intent.
---
Outside diff comments:
In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml`:
- Around line 1-128: The TOML file (containing name = "dse_vllm_lmcache_cmx" and
sections like cmd_args, cmd_args.lmcache,
cmd_args.lmcache.extra_config.nixl_backend_params) is not formatted to taplo-fmt
standards; fix it by running taplo fmt on this file (or apply equivalent taplo
formatting) and commit the formatted file so arrays, tables, indentation, and
inline values conform to taplo's output before merging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 03b53ea2-9dff-4928-a7f1-8c293590c86c
📒 Files selected for processing (7)
conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.tomlconf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.tomlsrc/cloudai/configurator/grid_search.pysrc/cloudai/configurator/reward_functions.pysrc/cloudai/registration.pysrc/cloudai/workloads/ai_dynamo/report_generation_strategy.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/registration.py`:
- Line 338: The current registration only registers "ai_dynamo_log_scale_cmx"
which will break existing configs that still reference the legacy reward key;
fix by adding an alias registration for the legacy key using the same function,
e.g. add another call like Registry().add_reward_function('<legacy_reward_key>',
ai_dynamo_log_scale_cmx_reward) (choose the actual legacy key used in existing
scenarios/configs), so both Registry.get_reward_function lookups succeed; update
the registration near the existing
Registry().add_reward_function("ai_dynamo_log_scale_cmx",
ai_dynamo_log_scale_cmx_reward) line.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c24007e0-faa3-47ad-871e-d8ff5692fc53
📒 Files selected for processing (2)
src/cloudai/configurator/reward_functions.pysrc/cloudai/registration.py
|
@saivishal1999 please adjust PR description. I see it mentions things I don't see in this PR. The main issue though: it's a public repo so please avoid mentions of private NV resources :D like clusters names and so on anyway, the main point: this new reward function seems tied to a specific production DSE objective rather than to generic CloudAI behavior. Shouldn't we move it into the private repo? This public repo currently only ships grid_search, and the production agents/configs that need this objective live in the private one, should this reward be defined/registered there instead? If we do keep something in public CloudAI, I’d suggest making it technology-neutral, e.g. ai_dynamo_ttft_throughput_log_scale |
87b1505 to
29803cc
Compare
Accept any aiperf metric dynamically via _AnyMetricSet and skip malformed CSV lines with on_bad_lines='skip' to handle the multi-section aiperf report format.
29803cc to
aaddf33
Compare
This makes sense to me. @saivishal1999 I think we should keep these reward functions private. |
Fix AIDynamoReportGenerationStrategy to correctly parse aiperf CSV output when running with non-standard metric names or multi-section CSV files.
Changes:
needing an explicit allowlist
Why: aiperf reports metrics under varying names depending on workload configuration, and its CSV output can contain non-data header/separator lines
that cause pandas to raise on strict parsing. These two changes make the report generation robust to both issues