Skip to content

add DSE support for AI Dynamo + LMCache aiperf workload#914

Merged
podkidyshev merged 1 commit into
NVIDIA:mainfrom
saivishal1999:spothula/dse-aidynamo-aiperf
Jun 10, 2026
Merged

add DSE support for AI Dynamo + LMCache aiperf workload#914
podkidyshev merged 1 commit into
NVIDIA:mainfrom
saivishal1999:spothula/dse-aidynamo-aiperf

Conversation

@saivishal1999

@saivishal1999 saivishal1999 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Fix AIDynamoReportGenerationStrategy to correctly parse aiperf CSV output when running with non-standard metric names or multi-section CSV files.

Changes:

  • Add _AnyMetricSet (a list subclass where contains always returns True) so the strategy accepts any metric name reported by aiperf without
    needing an explicit allowlist
  • Pass on_bad_lines="skip" to pd.read_csv to gracefully handle malformed or multi-section lines in the aiperf output CSV

Why: aiperf reports metrics under varying names depending on workload configuration, and its CSV output can contain non-data header/separator lines
that cause pandas to raise on strict parsing. These two changes make the report generation robust to both issues

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Complex PR? Review this PR in Change Stack to move by importance, not file order.

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds an AI Dynamo design space exploration reward function that computes log-scaled optimization scores from TTFT and throughput metrics, registers it globally, and hardens the report generation and grid search components to handle flexible metrics and configurable step limits with graceful CSV parsing.

Changes

AI Dynamo DSE Enhancements

Layer / File(s) Summary
Log-scaled CMX reward function definition and registration
src/cloudai/configurator/reward_functions.py, src/cloudai/registration.py
New ai_dynamo_log_scale_cmx_reward() validates 2-element observations of [ttft_ms, throughput_tokens_per_sec] and returns log(throughput + 1) - 0.8 * log(ttft + 1), with fallback penalty of -1e-3 for invalid inputs. Function is imported and registered under key "ai_dynamo_log_scale_cmx".
Report generation and grid search robustness improvements
src/cloudai/workloads/ai_dynamo/report_generation_strategy.py, src/cloudai/configurator/grid_search.py
_AnyMetricSet class allows AIDynamoReportGenerationStrategy.metrics to accept any metric string. CSV metric extraction now skips malformed rows via on_bad_lines="skip". GridSearchAgent.configure() conditionally caps self.max_steps by env.max_steps when truthy, falling back to full action_combinations count otherwise.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

enhancement

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

🐰 A reward function hops in with log-scaled cheer,
Throughput soars while TTFT's penalized here!
Metrics flexible, CSV rows skip with grace,
Grid search steps capped—robustness wins the race! 🚀

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main objective of the pull request—adding DSE support for the AI Dynamo plus LMCache aiperf workload.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing fixes to grid_search.py, reward_functions.py, report_generation_strategy.py, and the addition of new configurations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml`:
- Line 32: Update the comment to match the actual formula used by
ai_dynamo_log_scale_reward: replace "Effective reward: log(throughput+1) -
0.8*log(TTFT+1)" with a line that reflects throughput reward minus
0.7*log(TTFT+1) minus 0.1*log(ITL+1), referencing the ai_dynamo_log_scale_reward
implementation which computes reward = throughput_reward - 0.7 * ttft_penalty -
0.1 * itl_penalty.

In `@src/cloudai/configurator/reward_functions.py`:
- Around line 66-68: The non-positive metrics sentinel is inconsistent:
ai_dynamo_weighted_normalized_reward returns -1.0 while
ai_dynamo_log_scale_reward returns -1e-3; change ai_dynamo_log_scale_reward so
it returns the same terminal sentinel (-1.0) when ttft <= 0 or itl <= 0 or
throughput <= 0 to match ai_dynamo_weighted_normalized_reward and ensure the DSE
runtime treats both failures identically.

In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Line 42: The read_csv call currently uses on_bad_lines="skip" which drops
malformed rows silently; replace it with an on_bad_lines callable that logs each
skipped line (e.g., using logging.getLogger(__name__) or an existing module
logger) and optionally increments a counter so you can emit a single warning
summarizing how many rows were skipped; update the lazy.pd.read_csv invocation
(the call that assigns df from lazy.pd.read_csv(csv_file, ...)) to pass this
callable instead of the string so skipped row content/line numbers are recorded
for observability.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 36404434-d2f0-4151-89e5-87b47f8307f5

📥 Commits

Reviewing files that changed from the base of the PR and between d999b13 and 380528a.

📒 Files selected for processing (9)
  • conf/experimental/ai_dynamo/system/daria_b300.toml
  • conf/experimental/ai_dynamo/system/lyris_gb200.toml
  • conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.toml
  • src/cloudai/configurator/grid_search.py
  • src/cloudai/configurator/reward_functions.py
  • src/cloudai/workloads/ai_dynamo/report_generation_strategy.py

Comment thread conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml Outdated
Comment thread src/cloudai/configurator/reward_functions.py Outdated
Comment thread src/cloudai/workloads/ai_dynamo/report_generation_strategy.py
@saivishal1999 saivishal1999 force-pushed the spothula/dse-aidynamo-aiperf branch from 380528a to 05494e0 Compare June 5, 2026 17:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
src/cloudai/configurator/reward_functions.py (1)

126-127: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Inconsistent invalid-observation sentinel across reward functions.

ai_dynamo_weighted_normalized_reward returns -1.0 for non-positive metrics (line 67), while ai_dynamo_log_scale_reward returns -1e-3 for the same condition. This inconsistency could cause different failure-handling behavior if the DSE runtime treats -1.0 as a terminal failure signal but -1e-3 does not trigger the same handling.

🔧 Proposed fix to standardize the sentinel value
     if ttft <= 0 or itl <= 0 or throughput <= 0:
-        return -1e-3
+        return -1.0
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/configurator/reward_functions.py` around lines 126 - 127, The two
reward functions use different sentinels for invalid observations; make them
consistent by choosing a single sentinel (e.g., REWARD_INVALID = -1.0) and use
it in both ai_dynamo_weighted_normalized_reward and ai_dynamo_log_scale_reward
instead of returning -1e-3 in ai_dynamo_log_scale_reward; add the shared
constant near the top of the module and replace the literal returns in both
functions with that constant to ensure uniform failure signaling.
src/cloudai/workloads/ai_dynamo/report_generation_strategy.py (1)

42-42: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Consider logging skipped CSV rows for observability.

Using on_bad_lines="skip" silently discards malformed CSV rows, which could mask data quality issues. Adding a warning log when rows are skipped would help diagnose unexpected metric values or missing data.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py` at line 42,
The current call df = lazy.pd.read_csv(csv_file, on_bad_lines="skip") silently
drops malformed CSV rows; change this to log skipped rows by passing an
on_bad_lines callable (or use on_bad_lines="warn") that records which lines were
skipped to the module logger (e.g., logger or self.logger) and include CSV
filename/context; ensure the module imports and uses the appropriate logger and
that the callable logs the bad line content or line number so data-quality
issues are observable when using report_generation_strategy.py and the df =
lazy.pd.read_csv(...) call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/cloudai/configurator/reward_functions.py`:
- Around line 126-127: The two reward functions use different sentinels for
invalid observations; make them consistent by choosing a single sentinel (e.g.,
REWARD_INVALID = -1.0) and use it in both ai_dynamo_weighted_normalized_reward
and ai_dynamo_log_scale_reward instead of returning -1e-3 in
ai_dynamo_log_scale_reward; add the shared constant near the top of the module
and replace the literal returns in both functions with that constant to ensure
uniform failure signaling.

In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Line 42: The current call df = lazy.pd.read_csv(csv_file, on_bad_lines="skip")
silently drops malformed CSV rows; change this to log skipped rows by passing an
on_bad_lines callable (or use on_bad_lines="warn") that records which lines were
skipped to the module logger (e.g., logger or self.logger) and include CSV
filename/context; ensure the module imports and uses the appropriate logger and
that the callable logs the bad line content or line number so data-quality
issues are observable when using report_generation_strategy.py and the df =
lazy.pd.read_csv(...) call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7cf331c0-9e0f-4bfd-b310-1c0b06671002

📥 Commits

Reviewing files that changed from the base of the PR and between 380528a and 05494e0.

📒 Files selected for processing (7)
  • conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.toml
  • src/cloudai/configurator/grid_search.py
  • src/cloudai/configurator/reward_functions.py
  • src/cloudai/workloads/ai_dynamo/report_generation_strategy.py

@saivishal1999 saivishal1999 force-pushed the spothula/dse-aidynamo-aiperf branch from 05494e0 to 8f5472d Compare June 5, 2026 17:40

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/configurator/grid_search.py`:
- Line 62: The code treats self.env.max_steps falsily so agent_steps=0 is
treated as unset; update the assignment for self.max_steps to check for None
explicitly (e.g., use "if self.env.max_steps is not None") so that an explicit 0
is respected, keeping the fallback to len(self.action_combinations) only when
max_steps is actually None; change the expression around self.max_steps,
self.action_combinations, and env.max_steps accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 86c063a0-8763-43ab-abd9-d1244fa8098e

📥 Commits

Reviewing files that changed from the base of the PR and between 05494e0 and 8f5472d.

📒 Files selected for processing (7)
  • conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.toml
  • src/cloudai/configurator/grid_search.py
  • src/cloudai/configurator/reward_functions.py
  • src/cloudai/workloads/ai_dynamo/report_generation_strategy.py

@saivishal1999 saivishal1999 force-pushed the spothula/dse-aidynamo-aiperf branch from 8f5472d to 6c14d21 Compare June 5, 2026 20:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml (1)

1-128: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pipeline failure: TOML formatting required.

The CI pipeline reports that this file is not properly formatted according to taplo-fmt requirements. Run taplo fmt on this file to apply the required formatting changes before merging.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml` around lines 1 -
128, The TOML file (containing name = "dse_vllm_lmcache_cmx" and sections like
cmd_args, cmd_args.lmcache, cmd_args.lmcache.extra_config.nixl_backend_params)
is not formatted to taplo-fmt standards; fix it by running taplo fmt on this
file (or apply equivalent taplo formatting) and commit the formatted file so
arrays, tables, indentation, and inline values conform to taplo's output before
merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml`:
- Line 27: The docker_image_url entry currently points to a user-specific Lustre
path; update the docker_image_url in dse_vllm_lmcache_daria.toml to use a
registry-based image (e.g., nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1) for
portability, or if the local image is required for Daria-only testing, replace
the value with the local path but add a comment next to docker_image_url
explaining why the local image is necessary and include brief instructions on
how to obtain or build that image for other users.

In `@src/cloudai/configurator/grid_search.py`:
- Line 62: The assignment to self.max_steps exceeds the line-length limit;
refactor the expression that sets self.max_steps (which references
self.action_combinations and self.env.max_steps) into a shorter, wrapped form
(e.g., assign the right-hand value to a small intermediate variable or use
parentheses and line breaks) so the line is under 120 chars, then run ruff
format to reformat the file.
- Line 62: The GridSearchAgent.configure() call reads self.env.max_steps and the
tests use a MagicMock without that attribute which raises AttributeError; update
the test fixture in tests/test_agents.py to set env.max_steps = None (or an
appropriate int) on the MagicMock(spec=CloudAIGymEnv). In the
GridSearchAgent.configure implementation (the line computing self.max_steps)
split the long expression into a readable, wrapped form and explicitly handle
zero versus None (e.g., check if self.env.max_steps is None vs 0) so that
agent_steps=0 is treated as a valid value; reference the method
GridSearchAgent.configure and the assignment to self.max_steps when making these
changes.

In `@src/cloudai/configurator/reward_functions.py`:
- Around line 137-144: The new reward function currently returns -1e-3 for
invalid observations (checks: len(observation) < 2 and ttft <= 0 or throughput
<= 0); change these sentinel returns to -1.0 to match
ai_dynamo_weighted_normalized_reward so failure handling is consistent across
reward functions. Locate the invalid-observation branches in the new function in
reward_functions.py (the len(observation) < 2 early return and the
ttft/throughput <= 0 check) and replace -1e-3 with -1.0.

In `@src/cloudai/workloads/ai_dynamo/report_generation_strategy.py`:
- Around line 28-32: Update the __contains__ signature on class _AnyMetricSet to
use a more specific type (e.g., Any, Hashable, or a MetricType alias) instead of
plain object, or if you intentionally want duck-typing/compatibility keep object
but add an explanatory comment above __contains__ stating why object is used;
ensure the change references the _AnyMetricSet class and its __contains__ method
so readers understand the intent.

---

Outside diff comments:
In `@conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml`:
- Around line 1-128: The TOML file (containing name = "dse_vllm_lmcache_cmx" and
sections like cmd_args, cmd_args.lmcache,
cmd_args.lmcache.extra_config.nixl_backend_params) is not formatted to taplo-fmt
standards; fix it by running taplo fmt on this file (or apply equivalent taplo
formatting) and commit the formatted file so arrays, tables, indentation, and
inline values conform to taplo's output before merging.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 03b53ea2-9dff-4928-a7f1-8c293590c86c

📥 Commits

Reviewing files that changed from the base of the PR and between 8f5472d and 6c14d21.

📒 Files selected for processing (7)
  • conf/experimental/ai_dynamo/test/dse_vllm_lmcache_cmx.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml
  • conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_lyris.toml
  • src/cloudai/configurator/grid_search.py
  • src/cloudai/configurator/reward_functions.py
  • src/cloudai/registration.py
  • src/cloudai/workloads/ai_dynamo/report_generation_strategy.py

Comment thread conf/experimental/ai_dynamo/test_scenario/dse_vllm_lmcache_daria.toml Outdated
Comment thread src/cloudai/configurator/grid_search.py Outdated
Comment thread src/cloudai/configurator/reward_functions.py Outdated
Comment thread src/cloudai/workloads/ai_dynamo/report_generation_strategy.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/registration.py`:
- Line 338: The current registration only registers "ai_dynamo_log_scale_cmx"
which will break existing configs that still reference the legacy reward key;
fix by adding an alias registration for the legacy key using the same function,
e.g. add another call like Registry().add_reward_function('<legacy_reward_key>',
ai_dynamo_log_scale_cmx_reward) (choose the actual legacy key used in existing
scenarios/configs), so both Registry.get_reward_function lookups succeed; update
the registration near the existing
Registry().add_reward_function("ai_dynamo_log_scale_cmx",
ai_dynamo_log_scale_cmx_reward) line.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c24007e0-faa3-47ad-871e-d8ff5692fc53

📥 Commits

Reviewing files that changed from the base of the PR and between 6c14d21 and dcabb7d.

📒 Files selected for processing (2)
  • src/cloudai/configurator/reward_functions.py
  • src/cloudai/registration.py

Comment thread src/cloudai/registration.py Outdated
@podkidyshev

podkidyshev commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@saivishal1999 please adjust PR description. I see it mentions things I don't see in this PR. The main issue though: it's a public repo so please avoid mentions of private NV resources :D like clusters names and so on

anyway, the main point: this new reward function seems tied to a specific production DSE objective rather than to generic CloudAI behavior. Shouldn't we move it into the private repo?

This public repo currently only ships grid_search, and the production agents/configs that need this objective live in the private one, should this reward be defined/registered there instead? If we do keep something in public CloudAI, I’d suggest making it technology-neutral, e.g. ai_dynamo_ttft_throughput_log_scale

fyi @srivatsankrishnan

Accept any aiperf metric dynamically via _AnyMetricSet and skip
malformed CSV lines with on_bad_lines='skip' to handle the multi-section
aiperf report format.
@saivishal1999 saivishal1999 force-pushed the spothula/dse-aidynamo-aiperf branch from 29803cc to aaddf33 Compare June 9, 2026 19:32
@podkidyshev podkidyshev merged commit 7a52c83 into NVIDIA:main Jun 10, 2026
4 checks passed
@srivatsankrishnan

Copy link
Copy Markdown
Contributor

@saivishal1999 please adjust PR description. I see it mentions things I don't see in this PR. The main issue though: it's a public repo so please avoid mentions of private NV resources :D like clusters names and so on

anyway, the main point: this new reward function seems tied to a specific production DSE objective rather than to generic CloudAI behavior. Shouldn't we move it into the private repo?

This public repo currently only ships grid_search, and the production agents/configs that need this objective live in the private one, should this reward be defined/registered there instead? If we do keep something in public CloudAI, I’d suggest making it technology-neutral, e.g. ai_dynamo_ttft_throughput_log_scale

fyi @srivatsankrishnan

This makes sense to me. @saivishal1999 I think we should keep these reward functions private.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants