[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter #893

Closed
rutayan-nv wants to merge 6 commits into NVIDIA:main from rutayan-nv:rpatro/custom-training-loop-dispatch

@rutayan-nv rutayan-nv commented May 15, 2026

Contributor

Issue

  • Dispatch: handle_dse_job branched on agent identity (flag + Protocol + TypeGuard) to support custom training loops.
  • Trial counter: TestRun.step had multiple writers; RLlib's frequent reset() collapsed every trial onto step=1, overwriting trajectory.csv/env.csv rows.
  • Failure handling: the DSE loop didn't distinguish recoverable failures from bugs.

Fix

  • Dispatch: BaseAgent.run() is the polymorphic entry point and handle_dse_job collapses to err |= agent.run() (deletes the flag/Protocol/TypeGuard/helper, −89 lines).
  • Trial counter: TestRun.increment_step() is the sole mutator, called only by CloudAIGymEnv.step().
  • Failure handling: hybrid contract — a non-zero rc accumulates and continues; an exception hard-fails (reports still generated, error written to <scenario_root>/dse_failure.txt, then re-raised).

Testing

  • pytest tests: 1607 passed, 4 skipped; ruff + pyright + vulture clean.
  • Downstream cloudaix RL workloads (PPO + DQN) reinstalled on this branch — full suite green; contract tests confirm tr.step monotonicity through the adapter and env.csv.

Stack (in order): main, #893, #900, #901 (merge bottom-up).

coderabbitai Bot commented May 15, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Agent execution was centralized into BaseAgent.run and handlers now call agent.run(); CloudAIGymEnv.step advances TestRun.step before applying params; TestRun gained increment_step(); tests and a stub agent were added to validate run() delegation and step-indexed artifacts.

Changes

Agent-run integration and step semantics

Layer / File(s) / Summary

  • TestRun.increment_step and env step ordering
    Files: src/cloudai/_core/test_scenario.py, src/cloudai/configurator/cloudai_gym.py, tests/test_cloudaigym.py, tests/test_test_scenario.py
    Summary: Adds TestRun.increment_step() and makes CloudAIGymEnv.step() call it before apply_params_set; tests updated to assert post-advance trial indices in output paths and cached trajectory rows.
  • BaseAgent.run implementation
    Files: src/cloudai/configurator/base_agent.py
    Summary: Adds logging import and a concrete BaseAgent.run() loop that calls select_action(), env.step(action), update_policy(...), logs observations/rewards, and returns 0 on completion.
  • DSE handler change and handler tests
    Files: src/cloudai/cli/handlers.py, tests/test_handlers.py
    Summary: handle_dse_job now calls agent.run() and ORs its integer return into the aggregated error instead of performing the per-step select_action/env.step/update_policy loop; tests add a CustomRunStubAgent, fixture, and assertions that run() is invoked and non-zero return values propagate.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop through runs and count the beats,

I step before I write the feats,
One run to rule the agent's day,
Return a code, then hop away,

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name / Status / Explanation

  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Title check: ✅ Passed. The title accurately captures the main design change: replacing a flag-based dispatch pattern with polymorphism on BaseAgent.run() and centralizing the trial counter in TestRun.
  • Description check: ✅ Passed. The PR description clearly explains the issue, fix, and testing approach related to supporting custom training loops and fixing the trial counter and failure handling.


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440

📥 Commits

Reviewing files that changed from the base of the PR and between 4bdc465 and 3b62ddc.

📒 Files selected for processing (2)
  • src/cloudai/cli/handlers.py
  • tests/test_handlers.py

Comment thread src/cloudai/cli/handlers.py Outdated
This was referenced May 15, 2026
@rutayan-nv rutayan-nv changed the title from "feat(cli): support agents with custom training loops in handle_dse_job" to "[CLI] Support agents with custom training loops in handle_dse_job" May 15, 2026
@rutayan-nv rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 3ffe893 to 9552e5a May 18, 2026 16:33
Comment thread src/cloudai/cli/handlers.py Outdated
return installables, installer


@runtime_checkable

Contributor

let's move this code into base_agent.py. handlers.py is already too long

as for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure. so in this case, I'd place all the relevant tests you added into tests/configurator/test_base_agent.py

(not related to tests against handle_dse_job)

Contributor Author

makes sense, will do.

Contributor Author

Done in the BaseAgent.run() polymorphism refactor (latest push). The custom-loop dispatch, the CustomTrainingLoopAgent Protocol, and the TypeGuard are all gone from handlers.py; handle_dse_job now collapses to err |= agent.run(), and agents that drive their own training loop just override run(). The dispatcher/helper unit tests were replaced with polymorphic tests asserting handle_dse_job delegates to agent.run() and propagates the return code (tests/test_handlers.py).

Comment thread src/cloudai/cli/handlers.py Outdated
agent = agent_class(env, agent_config)

if _has_custom_training_loop(agent):
    err |= _run_custom_training_loop(agent, agent_type)

Contributor

shouldn't we exit (immediate return err) if err is greater than zero? The existing code above doesn't really treat the err well but maybe it's the time to start doing so :D

Contributor Author

good point. Let me check that and change it!

Contributor Author

I looked into this and decided to keep accumulate-and-continue rather than early-return. A DSE scenario can contain multiple test_runs (multiple agents) that run consecutively, and a failure in one shouldn't prevent the remaining independent runs from executing — we still want their results and reports. The err |= agent.run() accumulation preserves the failure signal: the handler returns non-zero if any run failed, so CI still fails on error. Early-returning would mask the other runs' outcomes. Happy to switch to fail-fast for the whole scenario if you'd prefer that semantics.
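
For illustration, the accumulate-and-continue behaviour described in this reply can be sketched as below. This is a minimal sketch and not the cloudai handler: `run_all` and `make_run` are hypothetical names standing in for the loop over test_runs and for `agent.run()`.

```python
from typing import Callable, Iterable, List


def run_all(runs: Iterable[Callable[[], int]]) -> int:
    """Hypothetical helper: execute every run and OR the return codes together."""
    err = 0
    for run in runs:
        # A non-zero rc is remembered, but the remaining runs still execute.
        err |= run()
    return err


executed: List[str] = []


def make_run(name: str, rc: int) -> Callable[[], int]:
    """Build a fake run that records its name and returns a fixed rc."""
    def _run() -> int:
        executed.append(name)
        return rc
    return _run


# The second run fails, yet the third still executes and the result is non-zero.
exit_code = run_all([make_run("a", 0), make_run("b", 1), make_run("c", 0)])
```

An early return after run "b" would leave "c" unexecuted, which is the masking effect the reply argues against.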

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Line 160: handle_dse_job currently calls agent.run() directly which can throw
and abort the whole DSE batch; wrap the call to agent.run() in a try/except that
catches any exception, sets/updates the existing err variable to a non-zero
return code (e.g., err |= 1 or err = 1) and continues processing remaining runs
instead of re-raising; ensure the catch logs the exception (including agent
identity) for debugging and references the agent.run() call and err variable so
the change is applied in the handle_dse_job function.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 843e393f-7ffd-4c66-ba6b-cd08d22ab804

📥 Commits

Reviewing files that changed from the base of the PR and between 9552e5a and a1a268a.

📒 Files selected for processing (7)
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/cli/handlers.py
  • src/cloudai/configurator/base_agent.py
  • src/cloudai/configurator/cloudai_gym.py
  • tests/test_cloudaigym.py
  • tests/test_handlers.py
  • tests/test_test_scenario.py

Comment thread src/cloudai/cli/handlers.py Outdated
@rutayan-nv rutayan-nv changed the title from "[CLI] Support agents with custom training loops in handle_dse_job" to "[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter" May 23, 2026
@rutayan-nv

Contributor Author

Downstream cloudaix PR that consumes this contract: https://github.com/Mellanox/cloudaix/pull/589

It ports GymnasiumAdapter to cloudaix (where its only consumer — the RL agent stack — lives), introduces a rllib_run dispatch helper for shared RLlib orchestration code, and refactors PPO/DQN onto the new BaseAgent.run() polymorphism. The cloudaix PR also pins the user-visible env.csv monotonicity contract — that's the test that would have caught the original cluster-run bug at unit-test time.

Together the two PRs are the architectural fix for the trial-counter collapse: cloudai owns the trial-counter contract (TestRun.increment_step() + CloudAIGymEnv.step() as sole mutator), cloudaix owns the RL-specific glue (adapter + dispatch helper + agent overrides).

- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop;
  handle_dse_job calls agent.train() and skips the per-step env.step loop.
- New _run_custom_training_loop helper logs exceptions, returns a process-style
  exit code, and always invokes agent.shutdown() (when defined) in a finally
  block so resources are released on both success and failure paths.
- CustomTrainingLoopAgent Protocol documents the opt-in contract for type
  checkers and IDEs.

Pyright rejected calling _run_custom_training_loop(agent, ...) because the
plain bool predicate did not narrow agent's static type from BaseAgent to
CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from
_has_custom_training_loop so the truthy branch in handle_dse_job sees the
opted-in shape and the helper can call agent.train() directly.
If agent.shutdown() raised from the finally block, Python suppressed the
earlier return 0/1 from agent.train() and propagated the exception, breaking
the outer test-run loop in handle_dse_job (skipped remaining scenarios,
failed to accumulate err |= rc). Wrap shutdown() in its own try/except,
log via logging.exception, set rc = 1, and return rc after finally so the
helper always honours the (int) -> int contract.

Adds tests for shutdown-only failure and combined train+shutdown failure.
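
The Python behaviour behind this fix can be reproduced in isolation. The sketch below uses a hypothetical `FlakyAgent`; `run_unguarded` and `run_guarded` are illustrative names, not the helper from this PR.

```python
import logging


class FlakyAgent:
    """Hypothetical agent whose training succeeds but whose shutdown() raises."""

    def train(self) -> None:
        pass

    def shutdown(self) -> None:
        raise RuntimeError("shutdown failed")


def run_unguarded(agent) -> int:
    try:
        agent.train()
        return 0
    finally:
        # An exception raised here discards the pending `return 0`.
        agent.shutdown()


def run_guarded(agent) -> int:
    rc = 0
    try:
        agent.train()
    except Exception:
        logging.exception("train() failed")
        rc = 1
    finally:
        shutdown = getattr(agent, "shutdown", None)
        if callable(shutdown):
            try:
                shutdown()
            except Exception:
                # Log and convert to a return code instead of propagating.
                logging.exception("shutdown() failed")
                rc = 1
    return rc
```

`run_unguarded(FlakyAgent())` raises `RuntimeError` even though `train()` succeeded, while `run_guarded(FlakyAgent())` returns 1 and lets the caller keep accumulating return codes.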
…mutator

Previously, ``test_run.step`` had no clear owner: the dispatcher set it from
outside, the adapter rewound it on ``reset()``, and other callers wrote to
it ad hoc. In RLlib custom-loop runs this collapsed every trial onto
``step=1``, overwriting ``trajectory.csv`` and ``env.csv`` rows.

Centralize the invariant: ``TestRun.increment_step()`` is the single named
mutator, and ``CloudAIGymEnv.step()`` is its only caller. One ``env.step()``
call advances the trial counter by exactly one — independent of any episode
or dispatcher concept above the gym env.

Contract tests in ``TestIncrementStep`` cover the API; ``test_cloudaigym``
asserts ``step`` is advanced *before* ``output_path`` and trajectory rows
are computed, so cached and live trials both record the post-increment value.
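
As a sketch of the single-mutator invariant, assuming heavily simplified stand-ins (`TestRunSketch` and `GymEnvSketch` are hypothetical; the real TestRun and CloudAIGymEnv carry far more state and different signatures):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TestRunSketch:
    """Stand-in for TestRun: only the trial counter is modelled."""
    step: int = 0

    def increment_step(self) -> None:
        # The single named mutator for the trial counter.
        self.step += 1


class GymEnvSketch:
    """Stand-in for the gym env: step() is the only caller of increment_step()."""

    def __init__(self, test_run: TestRunSketch) -> None:
        self.test_run = test_run
        self.rows: List[Tuple[int, str]] = []

    def reset(self) -> None:
        # Deliberately leaves test_run.step alone, however often it is called.
        return None

    def step(self, action: str) -> int:
        # Advance first, so everything recorded below sees the new trial index.
        self.test_run.increment_step()
        self.rows.append((self.test_run.step, action))
        return self.test_run.step


env = GymEnvSketch(TestRunSketch())
for action in ("a", "b", "c"):
    env.reset()  # frequent resets, as an RLlib-style driver might issue
    env.step(action)
trial_indices = [trial for trial, _ in env.rows]
```

Because `reset()` never rewinds the counter, each recorded row gets a distinct, increasing trial index.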
…orphism

Earlier commits in this PR introduced ``HAS_CUSTOM_TRAINING_LOOP`` + a
``CustomTrainingLoopAgent`` Protocol + a TypeGuard helper + an ``if/else``
in ``handle_dse_job`` to switch between the cloudai step loop and an
agent-owned ``train()`` loop. That is a type-tagged conditional dispatching
on agent identity — the textbook signal to replace conditional with
polymorphism (Fowler).

Add a default ``BaseAgent.run() -> int`` that holds the step-loop body
(``select_action`` / ``env.step`` / ``update_policy`` per trial). Agents
that drive their own training (RLlib, etc.) override ``run()`` to delegate
to whatever loop they own and return a process-style exit code.

``handle_dse_job`` collapses to ``err |= agent.run()`` — one line, no
branching, no Protocol vocabulary. The handler no longer knows that
"custom training loops" exist as a category; that's an agent implementation
detail.

Net: -89 lines on cloudai. Surface area shrinks (no Protocol, no TypeGuard,
no flag). ``test_handlers`` replaces the 5 helper unit tests + 2 dispatcher
integration tests with 2 polymorphic tests asserting ``handle_dse_job``
delegates to ``agent.run()`` and propagates its return code.
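
A minimal sketch of the dispatch shape described here. The classes are simplified stand-ins, `OwnLoopAgent` and `handle_job` are hypothetical names, and the default loop body is reduced to a counter so the example stays self-contained.

```python
from typing import Iterable


class BaseAgent:
    """Simplified stand-in for cloudai's BaseAgent."""

    def __init__(self, max_steps: int = 3) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def run(self) -> int:
        # Default behaviour: the per-trial step loop (body elided to a counter).
        for _ in range(self.max_steps):
            self.steps_taken += 1
        return 0


class OwnLoopAgent(BaseAgent):
    """Hypothetical agent that drives its own training loop."""

    def run(self) -> int:
        # Delegate to whatever loop the agent owns; here it simply reports failure.
        return 1


def handle_job(agents: Iterable[BaseAgent]) -> int:
    err = 0
    for agent in agents:
        # No branching on agent identity: each agent decides how it runs.
        err |= agent.run()
    return err
```

The handler never inspects a flag or Protocol; adding a new kind of agent means overriding `run()` and nothing else.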
@rutayan-nv rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from a1a268a to d0417b1 June 12, 2026 21:14
@rutayan-nv

Contributor Author

@podkidyshev rebased onto latest main and addressed your review:

  • "move dispatch out of handlers.py" — done via the BaseAgent.run() polymorphism refactor: handle_dse_job now collapses to err |= agent.run(); the Protocol/TypeGuard/helpers are gone from handlers.py. Dispatcher unit tests replaced with polymorphic tests in tests/test_handlers.py.
  • "early-return on err > 0" — replied inline: keeping accumulate-and-continue so independent test_runs in a scenario still execute; the OR'd err preserves a non-zero exit if any run fails. Glad to switch to fail-fast if you prefer.

Also note a contract fix surfaced by the rebase: main had evolved the dispatch loop to thread observations (env.reset() → select_action(observation=...)), so I ported that threading into BaseAgent.run() rather than regressing observation-conditioned agents. Full suite green (1637 passed). Re-review would be appreciated when you have a moment.
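
The observation threading mentioned above might look roughly like this. It is a sketch under assumptions: the toy `CountingEnv`, the `EchoAgent` class, the bare-observation return from `reset()`, the 4-tuple returned by `step()`, and the `update_policy` payload are all illustrative and may differ from the real CloudAIGymEnv and BaseAgent.

```python
from typing import Any, Dict, List, Tuple


class CountingEnv:
    """Toy env: the observation is the number of steps taken so far."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.t += 1
        return self.t, float(action), self.t >= self.max_steps, {}


class EchoAgent:
    """Hypothetical observation-conditioned agent."""

    def __init__(self, env: CountingEnv) -> None:
        self.env = env
        self.seen: List[int] = []

    def select_action(self, observation: int) -> int:
        self.seen.append(observation)
        return observation + 1

    def update_policy(self, feedback: Dict[str, Any]) -> None:
        pass

    def run(self) -> int:
        # The observation from reset() seeds the first action; each step()
        # result feeds the next select_action call.
        observation = self.env.reset()
        done = False
        while not done:
            action = self.select_action(observation=observation)
            observation, reward, done, _info = self.env.step(action)
            self.update_policy({"action": action, "reward": reward})
        return 0
```

Dropping the threading would leave `select_action` without the latest observation, which is the regression the rebase avoided.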

rutayan-nv added a commit to rutayan-nv/cloudai that referenced this pull request Jun 15, 2026
handle_dse_job now wraps `err |= agent.run()` so an exception from one
TestRun is logged with the agent identity, converted to a non-zero exit
code, and the loop continues with the remaining independent TestRuns.
Previously an uncaught exception unwound past the accumulator and the
exit() call, aborting the rest of the scenario and skipping report
generation. This honors the dependency-free "all cases run consecutively"
contract enforced at the top of the same function.

Also aligns CustomRunStubAgent.select_action with the BaseAgent
signature (observation kwarg) after the polymorphism rebase.

Addresses CodeRabbit review on PR NVIDIA#893.
@rutayan-nv rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 11f8625 to 1e5a96d June 15, 2026 22:49
handle_dse_job uses a hybrid failure model for agent.run(), now made
explicit and pinned:

- Recoverable failures return a non-zero rc, accumulated via
  `err |= agent.run()`, and the sweep continues to the next TestRun.
  Workload-level failures already follow this: CloudAIGymEnv.step maps a
  failed metric to rewards.metric_failure rather than raising, and
  rllib_run catches training errors and returns rc=1.
- Unexpected failures (framework/agent bugs) raise and hard-fail the job
  so the bug surfaces instead of being masked as a penalizing reward /
  non-zero rc. The exception is captured so reports still generate, the
  aborting error is documented in <scenario_root>/dse_failure.txt, and
  then it is re-raised.

BaseAgent.run() docstring documents the return-vs-raise contract. Tests
pin every part: a non-zero rc accumulates and continues; a raising run()
propagates and aborts remaining runs; and on a hard-fail reports are
still generated with the failure documented before re-raising.
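
One way to picture the hybrid contract is the sketch below. It is not the handle_dse_job implementation: `run_sweep`, `StubAgent`, `BuggyAgent`, the `generate_reports` callback, and the text written to dse_failure.txt are hypothetical; only the file name and the ordering (reports, failure note, re-raise) come from the description above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def run_sweep(agents: Iterable, scenario_root: Path, generate_reports: Callable[[], None]) -> int:
    """Hypothetical handler: rc failures accumulate, exceptions hard-fail."""
    err = 0
    failure: Optional[Exception] = None
    try:
        for agent in agents:
            err |= agent.run()  # recoverable failure: remember rc, keep going
    except Exception as exc:
        failure = exc  # unexpected failure: stop the sweep, remember the error
    generate_reports()  # reports are produced on both paths
    if failure is not None:
        note = f"{type(failure).__name__}: {failure}\n"
        (scenario_root / "dse_failure.txt").write_text(note)
        raise failure
    return err


class StubAgent:
    def __init__(self, rc: int, log: List[int]) -> None:
        self.rc, self.log = rc, log

    def run(self) -> int:
        self.log.append(self.rc)
        return self.rc


class BuggyAgent:
    def run(self) -> int:
        raise ValueError("agent bug")


ran: List[int] = []
reports: List[str] = []
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    soft_rc = run_sweep([StubAgent(1, ran), StubAgent(0, ran)], root, lambda: reports.append("report"))
    try:
        run_sweep([BuggyAgent(), StubAgent(0, ran)], root, lambda: reports.append("report"))
        hard_failed = False
    except ValueError:
        hard_failed = True
    failure_note = (root / "dse_failure.txt").read_text()
```

In the first sweep both agents run and the result is non-zero; in the second the exception stops the sweep before the remaining agent runs, yet the report callback still fires and the failure note is written before the error propagates.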

Also aligns CustomRunStubAgent.select_action with the BaseAgent signature
(observation kwarg) after the polymorphism rebase.
@rutayan-nv rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 1e5a96d to d53a88d June 15, 2026 23:20
@rutayan-nv rutayan-nv closed this Jun 16, 2026
@rutayan-nv rutayan-nv deleted the rpatro/custom-training-loop-dispatch branch June 16, 2026 13:49
@rutayan-nv

Contributor Author

Closed by a branch rename (custom-training-loop-dispatch → agent-run-polymorphism). The work continues unchanged in #933, with the history rebuilt into 2 clean commits (the intermediate add-then-remove custom-loop-dispatch churn was squashed away). The net diff is identical.

2 participants