[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter by rutayan-nv · Pull Request #893 · NVIDIA/cloudai

rutayan-nv · 2026-05-15T20:57:21Z

Issue

Dispatch: handle_dse_job branched on agent identity (flag + Protocol + TypeGuard) to support custom training loops.
Trial counter: TestRun.step had multiple writers; RLlib's frequent reset() collapsed every trial onto step=1, overwriting trajectory.csv/env.csv rows.
Failure handling: the DSE loop didn't distinguish recoverable failures from bugs.

Fix

Dispatch: BaseAgent.run() is the polymorphic entry point and handle_dse_job collapses to err |= agent.run() (deletes the flag/Protocol/TypeGuard/helper, −89 lines).
Trial counter: TestRun.increment_step() is the sole mutator, called only by CloudAIGymEnv.step().
Failure handling: hybrid contract — a non-zero rc accumulates and continues; an exception hard-fails (reports still generated, error written to <scenario_root>/dse_failure.txt, then re-raised).
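The dispatch change can be pictured with a minimal sketch. Everything below is a simplified stand-in rather than the actual cloudai code: `CountingEnv`, `FixedActionAgent`, `SelfDrivenAgent`, the `max_steps` argument, and the `(observation, reward)` return shape of `env.step()` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class CountingEnv:
    """Hypothetical environment: counts how many times step() is called."""

    def __init__(self) -> None:
        self.calls = 0

    def step(self, action: int) -> tuple[int, float]:
        self.calls += 1
        return self.calls, float(action)  # (observation, reward), assumed shape


class BaseAgent(ABC):
    """Simplified stand-in for cloudai's BaseAgent (not the real class)."""

    def __init__(self, env: CountingEnv, max_steps: int) -> None:
        self.env = env
        self.max_steps = max_steps

    @abstractmethod
    def select_action(self) -> int: ...

    def update_policy(self, feedback: dict) -> None:
        """No-op by default; learning agents override this."""

    def run(self) -> int:
        """Default step loop. Returns a process-style exit code."""
        for _ in range(self.max_steps):
            action = self.select_action()
            observation, reward = self.env.step(action)
            self.update_policy({"observation": observation, "reward": reward})
        return 0


class FixedActionAgent(BaseAgent):
    """Inherits the default step loop unchanged."""

    def select_action(self) -> int:
        return 1


class SelfDrivenAgent(BaseAgent):
    """Owns its training loop, so it overrides run() instead of being special-cased."""

    def select_action(self) -> int:
        return 0

    def run(self) -> int:
        # A real agent would hand control to its own framework here.
        return 1  # pretend the framework reported a recoverable failure


def handle_dse_job(agents: list[BaseAgent]) -> int:
    err = 0
    for agent in agents:
        err |= agent.run()  # no branching on agent identity
    return err
```

With this shape the handler never needs to know which agents drive their own loop; adding another such agent means adding another `run()` override and nothing else.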

Testing

pytest tests: 1607 passed, 4 skipped; ruff + pyright + vulture clean.
Downstream cloudaix RL workloads (PPO + DQN) reinstalled on this branch — full suite green; contract tests confirm tr.step monotonicity through the adapter and env.csv.

Stack: main ← #893 ← #900 ← #901 (merge bottom-up).

coderabbitai · 2026-05-15T20:57:32Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Walkthrough

Agent execution was centralized into BaseAgent.run and handlers now call agent.run(); CloudAIGymEnv.step advances TestRun.step before applying params; TestRun gained increment_step(); tests and a stub agent were added to validate run() delegation and step-indexed artifacts.

Changes

Agent-run integration and step semantics

Layer / File(s)	Summary
TestRun.increment_step and env step ordering `src/cloudai/_core/test_scenario.py`, `src/cloudai/configurator/cloudai_gym.py`, `tests/test_cloudaigym.py`, `tests/test_test_scenario.py`	Adds `TestRun.increment_step()` and makes `CloudAIGymEnv.step()` call it before `apply_params_set`; tests updated to assert post-advance trial indices in output paths and cached trajectory rows.
BaseAgent.run implementation `src/cloudai/configurator/base_agent.py`	Adds `logging` import and a concrete `BaseAgent.run()` loop that calls `select_action()`, `env.step(action)`, `update_policy(...)`, logs observations/rewards, and returns `0` on completion.
DSE handler change and handler tests `src/cloudai/cli/handlers.py`, `tests/test_handlers.py`	`handle_dse_job` now calls `agent.run()` and ORs its integer return into the aggregated error instead of performing the per-step select_action/env.step/update_policy loop; tests add a `CustomRunStubAgent`, fixture, and assertions that `run()` is invoked and non-zero return values propagate.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately captures the main design change: replacing a flag-based dispatch pattern with polymorphism on BaseAgent.run() and centralizing the trial counter in TestRun.
Description check	✅ Passed	The PR description clearly explains the issue, fix, and testing approach related to supporting custom training loops and fixing the trial counter and failure handling.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440

📥 Commits

Reviewing files that changed from the base of the PR and between 4bdc465 and 3b62ddc.

📒 Files selected for processing (2)

src/cloudai/cli/handlers.py
tests/test_handlers.py

podkidyshev · 2026-05-19T14:44:00Z

    return installables, installer


+@runtime_checkable


Let's move this code into base_agent.py. handlers.py is already too long.

As for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure, so in this case I'd place all the relevant tests you added into tests/configurator/test_base_agent.py

(not related to tests against handle_dse_job)

rutayan-nv (Jun 2, 2026): Makes sense, will do.

rutayan-nv (Jun 12, 2026): Done in the BaseAgent.run() polymorphism refactor (latest push). The custom-loop dispatch, the CustomTrainingLoopAgent Protocol, and the TypeGuard are all gone from handlers.py; handle_dse_job now collapses to err |= agent.run(), and agents that drive their own training loop just override run(). The dispatcher/helper unit tests were replaced with polymorphic tests asserting that handle_dse_job delegates to agent.run() and propagates the return code (tests/test_handlers.py).

podkidyshev · 2026-05-19T14:49:57Z

        agent = agent_class(env, agent_config)

+        if _has_custom_training_loop(agent):
+            err |= _run_custom_training_loop(agent, agent_type)


Shouldn't we exit (immediately return err) if err is greater than zero? The existing code above doesn't really handle err well, but maybe it's time to start doing so :D

rutayan-nv (Jun 2, 2026): Good point. Let me check that and change it!

rutayan-nv (Jun 12, 2026): I looked into this and decided to keep accumulate-and-continue rather than early-return. A DSE scenario can contain multiple test_runs (multiple agents) that run consecutively, and a failure in one shouldn't prevent the remaining independent runs from executing; we still want their results and reports. The err |= agent.run() accumulation preserves the failure signal: the handler returns non-zero if any run failed, so CI still fails on error. Early-returning would mask the other runs' outcomes. Happy to switch to fail-fast for the whole scenario if you'd prefer those semantics.
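The trade-off discussed in this thread can be made concrete with a small sketch. Both helper functions below are hypothetical and only model the control flow; each "run" is reduced to a callable that returns an exit code.

```python
from typing import Callable

Run = Callable[[], int]


def accumulate_and_continue(runs: list[Run]) -> tuple[int, int]:
    """Execute every run; return (combined exit code, number of runs executed)."""
    err = 0
    executed = 0
    for run in runs:
        err |= run()  # bitwise OR: stays non-zero once any run has failed
        executed += 1
    return err, executed


def fail_fast(runs: list[Run]) -> tuple[int, int]:
    """Stop at the first failing run; later runs never execute."""
    err = 0
    executed = 0
    for run in runs:
        executed += 1
        err = run()
        if err:
            break
    return err, executed


# Three independent runs; only the middle one fails.
runs = [lambda: 0, lambda: 1, lambda: 0]
```

Under these assumptions `accumulate_and_continue(runs)` executes all three runs and still reports failure, while `fail_fast(runs)` stops after the second. Because `|=` is a bitwise OR, distinct codes combine (1 and 2 become 3), so only the zero versus non-zero distinction of the aggregate is meaningful.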

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Line 160: handle_dse_job currently calls agent.run() directly which can throw
and abort the whole DSE batch; wrap the call to agent.run() in a try/except that
catches any exception, sets/updates the existing err variable to a non-zero
return code (e.g., err |= 1 or err = 1) and continues processing remaining runs
instead of re-raising; ensure the catch logs the exception (including agent
identity) for debugging and references the agent.run() call and err variable so
the change is applied in the handle_dse_job function.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 843e393f-7ffd-4c66-ba6b-cd08d22ab804

📥 Commits

Reviewing files that changed from the base of the PR and between 9552e5a and a1a268a.

📒 Files selected for processing (7)

src/cloudai/_core/test_scenario.py
src/cloudai/cli/handlers.py
src/cloudai/configurator/base_agent.py
src/cloudai/configurator/cloudai_gym.py
tests/test_cloudaigym.py
tests/test_handlers.py
tests/test_test_scenario.py

rutayan-nv · 2026-05-23T00:57:51Z

Downstream cloudaix PR that consumes this contract: https://github.com/Mellanox/cloudaix/pull/589

It ports GymnasiumAdapter to cloudaix (where its only consumer — the RL agent stack — lives), introduces a rllib_run dispatch helper for shared RLlib orchestration code, and refactors PPO/DQN onto the new BaseAgent.run() polymorphism. The cloudaix PR also pins the user-visible env.csv monotonicity contract — that's the test that would have caught the original cluster-run bug at unit-test time.

Together the two PRs are the architectural fix for the trial-counter collapse: cloudai owns the trial-counter contract (TestRun.increment_step() + CloudAIGymEnv.step() as sole mutator), cloudaix owns the RL-specific glue (adapter + dispatch helper + agent overrides).

- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop.
- New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths.
- CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.

Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.

If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
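The Python behaviour behind this fix can be reproduced in a few lines. The two helpers and `LeakyAgent` below are illustrative only; they mirror the shape described above rather than the removed `_run_custom_training_loop` itself.

```python
import logging


def run_unsafe(agent) -> int:
    try:
        agent.train()
        return 0
    except Exception:
        logging.exception("train() failed")
        return 1
    finally:
        agent.shutdown()  # an exception here discards the pending return value


def run_safe(agent) -> int:
    rc = 0
    try:
        agent.train()
    except Exception:
        logging.exception("train() failed")
        rc = 1
    finally:
        shutdown = getattr(agent, "shutdown", None)
        if callable(shutdown):
            try:
                shutdown()
            except Exception:
                logging.exception("shutdown() failed")
                rc = 1  # cleanup failure still counts as a failed run
    return rc  # returned after the finally block, so it is always honoured


class LeakyAgent:
    """Hypothetical agent: training succeeds, cleanup fails."""

    def train(self) -> None:
        pass

    def shutdown(self) -> None:
        raise RuntimeError("cluster handle already closed")
```

With `LeakyAgent`, `run_unsafe` propagates the `RuntimeError` even though `return 0` had already been reached, whereas `run_safe` logs the failure and returns 1, keeping the `(agent) -> int` contract intact for the caller's `err |= rc`.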

…mutator

Previously, ``test_run.step`` had no clear owner: the dispatcher set it from outside, the adapter rewound it on ``reset()``, and other callers wrote to it ad hoc. In RLlib custom-loop runs this collapsed every trial onto ``step=1``, overwriting ``trajectory.csv`` and ``env.csv`` rows.

Centralize the invariant: ``TestRun.increment_step()`` is the single named mutator, and ``CloudAIGymEnv.step()`` is its only caller. One ``env.step()`` call advances the trial counter by exactly one, independent of any episode or dispatcher concept above the gym env.

Contract tests in ``TestIncrementStep`` cover the API; ``test_cloudaigym`` asserts ``step`` is advanced *before* ``output_path`` and trajectory rows are computed, so cached and live trials both record the post-increment value.
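A toy model of the single-mutator invariant, under stated assumptions: both classes are stripped-down stand-ins for the real `TestRun` and `CloudAIGymEnv`, and the `rows` dict is a hypothetical substitute for the rows written to `env.csv`.

```python
class TestRun:
    """Simplified stand-in for cloudai's TestRun; only the counter is modelled."""

    def __init__(self) -> None:
        self.step = 0

    def increment_step(self) -> None:
        self.step += 1


class CloudAIGymEnv:
    """Simplified stand-in; step() is the only caller of increment_step()."""

    def __init__(self, test_run: TestRun) -> None:
        self.test_run = test_run
        self.rows: dict[int, int] = {}  # step -> action, standing in for env.csv

    def reset(self) -> None:
        # Starts a new episode. Deliberately leaves test_run.step alone.
        return None

    def step(self, action: int) -> int:
        self.test_run.increment_step()  # advance first ...
        self.rows[self.test_run.step] = action  # ... then write under the new index
        return self.test_run.step


env = CloudAIGymEnv(TestRun())
for action in (10, 20, 30):
    env.reset()  # frequent resets, as an RLlib-style driver might issue
    env.step(action)
```

Because `reset()` never touches the counter, the three trials land on distinct keys instead of overwriting one another. If `reset()` rewound `step` to zero, every write would target key 1, which is the collapse described above.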

…orphism

Earlier commits in this PR introduced ``HAS_CUSTOM_TRAINING_LOOP`` + a ``CustomTrainingLoopAgent`` Protocol + a TypeGuard helper + an ``if/else`` in ``handle_dse_job`` to switch between the cloudai step loop and an agent-owned ``train()`` loop. That is a type-tagged conditional dispatching on agent identity, the textbook signal to replace conditional with polymorphism (Fowler).

Add a default ``BaseAgent.run() -> int`` that holds the step-loop body (``select_action`` / ``env.step`` / ``update_policy`` per trial). Agents that drive their own training (RLlib, etc.) override ``run()`` to delegate to whatever loop they own and return a process-style exit code.

``handle_dse_job`` collapses to ``err |= agent.run()``: one line, no branching, no Protocol vocabulary. The handler no longer knows that "custom training loops" exist as a category; that's an agent implementation detail.

Net: -89 lines on cloudai. Surface area shrinks (no Protocol, no TypeGuard, no flag). ``test_handlers`` replaces the 5 helper unit tests + 2 dispatcher integration tests with 2 polymorphic tests asserting ``handle_dse_job`` delegates to ``agent.run()`` and propagates its return code.

rutayan-nv · 2026-06-12T21:16:02Z

@podkidyshev rebased onto latest main and addressed your review:

"move dispatch out of handlers.py" — done via the BaseAgent.run() polymorphism refactor: handle_dse_job now collapses to err |= agent.run(); the Protocol/TypeGuard/helpers are gone from handlers.py. Dispatcher unit tests replaced with polymorphic tests in tests/test_handlers.py.
"early-return on err > 0" — replied inline: keeping accumulate-and-continue so independent test_runs in a scenario still execute; the OR'd err preserves a non-zero exit if any run fails. Glad to switch to fail-fast if you prefer.

Also note a contract fix surfaced by the rebase: main had evolved the dispatch loop to thread observations (env.reset() → select_action(observation=...)), so I ported that threading into BaseAgent.run() rather than regressing observation-conditioned agents. Full suite green (1637 passed). Re-review would be appreciated when you have a moment.
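The observation threading mentioned here might look roughly like the sketch below. `EchoEnv`, `IncrementAgent`, the free function `run`, and the `(observation, reward)` return shape are assumptions for illustration; the source only states that `env.reset()` feeds `select_action(observation=...)`, so feeding each step's observation into the next call is also an assumption.

```python
class EchoEnv:
    """Hypothetical env whose observation is simply the last action taken."""

    def reset(self) -> int:
        return 0  # initial observation

    def step(self, action: int) -> tuple[int, float]:
        return action, 1.0  # (next observation, reward), assumed shape


class IncrementAgent:
    """Hypothetical observation-conditioned agent."""

    def __init__(self) -> None:
        self.actions: list[int] = []

    def select_action(self, observation: int) -> int:
        action = observation + 1
        self.actions.append(action)
        return action

    def update_policy(self, feedback: dict) -> None:
        pass


def run(agent: IncrementAgent, env: EchoEnv, max_steps: int) -> int:
    observation = env.reset()  # seed the loop with the initial observation
    for _ in range(max_steps):
        action = agent.select_action(observation=observation)
        observation, reward = env.step(action)  # thread the new observation forward
        agent.update_policy({"observation": observation, "reward": reward})
    return 0
```

An agent that ignores observations is unaffected by this threading, while one that conditions on them (as `IncrementAgent` does) would silently degrade if the loop stopped passing `observation`.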

handle_dse_job now wraps `err |= agent.run()` so an exception from one TestRun is logged with the agent identity, converted to a non-zero exit code, and the loop continues with the remaining independent TestRuns. Previously an uncaught exception unwound past the accumulator and the exit() call, aborting the rest of the scenario and skipping report generation.

This honors the dependency-free "all cases run consecutively" contract enforced at the top of the same function.

Also aligns CustomRunStubAgent.select_action with the BaseAgent signature (observation kwarg) after the polymorphism rebase.

Addresses CodeRabbit review on PR NVIDIA#893.

handle_dse_job uses a hybrid failure model for agent.run(), now made explicit and pinned:

- Recoverable failures return a non-zero rc, accumulated via `err |= agent.run()`, and the sweep continues to the next TestRun. Workload-level failures already follow this: CloudAIGymEnv.step maps a failed metric to rewards.metric_failure rather than raising, and rllib_run catches training errors and returns rc=1.
- Unexpected failures (framework/agent bugs) raise and hard-fail the job so the bug surfaces instead of being masked as a penalizing reward / non-zero rc. The exception is captured so reports still generate, the aborting error is documented in <scenario_root>/dse_failure.txt, and then it is re-raised.

BaseAgent.run() docstring documents the return-vs-raise contract.

Tests pin each case: a non-zero rc accumulates and continues; a raising run() propagates and aborts remaining runs; and on a hard-fail reports are still generated with the failure documented before re-raising.

Also aligns CustomRunStubAgent.select_action with the BaseAgent signature (observation kwarg) after the polymorphism rebase.
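One way to picture the hybrid model is the sketch below. It is not the actual handler: `generate_reports`, the report filename, and the text written to `dse_failure.txt` are placeholders, and agents are reduced to objects with a `run()` method.

```python
from pathlib import Path
from typing import Optional


def generate_reports(scenario_root: Path) -> None:
    """Placeholder for report generation."""
    (scenario_root / "report.txt").write_text("report\n")


def handle_dse_job(agents: list, scenario_root: Path) -> int:
    err = 0
    failure: Optional[Exception] = None
    for agent in agents:
        try:
            err |= agent.run()  # recoverable: accumulate and continue
        except Exception as exc:
            failure = exc  # unexpected: abort remaining runs, but still report
            break

    generate_reports(scenario_root)  # runs on both the normal and hard-fail paths

    if failure is not None:
        # Document the aborting error (file format is an assumption), then re-raise.
        note = f"{type(failure).__name__}: {failure}\n"
        (scenario_root / "dse_failure.txt").write_text(note)
        raise failure
    return err
```

The split keeps the two signals distinct: a non-zero return says "this workload failed, keep sweeping", while an exception says "something is broken, stop and surface it", without losing the reports collected so far.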

rutayan-nv · 2026-06-16T13:54:17Z

Closed by a branch rename (custom-training-loop-dispatch → agent-run-polymorphism). The work continues unchanged in #933, with the history rebuilt into 2 clean commits (the intermediate add-then-remove custom-loop-dispatch churn was squashed away). The net diff is identical.

rutayan-nv requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners May 15, 2026 20:57

coderabbitai Bot reviewed May 15, 2026

Comment thread src/cloudai/cli/handlers.py Outdated

This was referenced May 15, 2026

Gym enhancements #863

Draft

More Gym enhancements #884

Draft

rutayan-nv changed the title ~~feat(cli): support agents with custom training loops in handle_dse_job~~ [CLI] Support agents with custom training loops in handle_dse_job May 15, 2026

rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 3ffe893 to 9552e5a Compare May 18, 2026 16:33

podkidyshev requested changes May 19, 2026

coderabbitai Bot reviewed May 23, 2026

Comment thread src/cloudai/cli/handlers.py Outdated

rutayan-nv changed the title ~~[CLI] Support agents with custom training loops in handle_dse_job~~ [CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter May 23, 2026

rutayan-nv mentioned this pull request May 23, 2026

[Configurator] Add GymnasiumAdapter for CloudAI envs #894

Closed

This was referenced May 26, 2026

[TDD-red] CloudAIGymEnv cache key must include env_params (drives domain-randomization correctness fix) #900

Open

fix(configurator): make env_params first-class to fix the trajectory cache key #901

Open

rutayan-nv added 5 commits June 12, 2026 16:17

rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from a1a268a to d0417b1 Compare June 12, 2026 21:14

rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 11f8625 to 1e5a96d Compare June 15, 2026 22:49

rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 1e5a96d to d53a88d Compare June 15, 2026 23:20

rutayan-nv mentioned this pull request Jun 16, 2026

feat(core): add ContinuousSpace action-space primitive #927

Open

rutayan-nv closed this Jun 16, 2026

rutayan-nv deleted the rpatro/custom-training-loop-dispatch branch June 16, 2026 13:49

rutayan-nv mentioned this pull request Jun 16, 2026

[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter #933

Open
