[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#893
[CLI/Core] Polymorphic agent dispatch + TestRun owns trial counter#893rutayan-nv wants to merge 6 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAgent execution was centralized into BaseAgent.run and handlers now call agent.run(); CloudAIGymEnv.step advances TestRun.step before applying params; TestRun gained increment_step(); tests and a stub agent were added to validate run() delegation and step-indexed artifacts. ChangesAgent-run integration and step semantics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440
📒 Files selected for processing (2)
src/cloudai/cli/handlers.pytests/test_handlers.py
3ffe893 to
9552e5a
Compare
| return installables, installer | ||
|
|
||
|
|
||
| @runtime_checkable |
There was a problem hiding this comment.
let's move this code into base_agent.py. handlers.py is already too long
as for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure. so in this case, I'd place all the relevant tests you added into tests/configurator/test_base_agent.py
(not related to tests against handle_dse_job)
There was a problem hiding this comment.
makes sense, will do.
There was a problem hiding this comment.
Done in the BaseAgent.run() polymorphism refactor (latest push). The custom-loop dispatch, the CustomTrainingLoopAgent Protocol, and the TypeGuard are all gone from handlers.py — handle_dse_job now collapses to err |= agent.run(), and agents that drive their own training loop just override run(). The dispatcher/helper unit tests were replaced with polymorphic tests asserting handle_dse_job delegates to agent.run() and propagates the return code (tests/test_handlers.py).
| agent = agent_class(env, agent_config) | ||
|
|
||
| if _has_custom_training_loop(agent): | ||
| err |= _run_custom_training_loop(agent, agent_type) |
There was a problem hiding this comment.
shouldn't we exit (immediate return err) if err is greater than zero? The existing code above doesn't really treat the err well but maybe it's the time to start doing so :D
There was a problem hiding this comment.
good point. Let me check that and change it!
There was a problem hiding this comment.
I looked into this and decided to keep accumulate-and-continue rather than early-return. A DSE scenario can contain multiple test_runs (multiple agents) that run consecutively, and a failure in one shouldn't prevent the remaining independent runs from executing — we still want their results and reports. The err |= agent.run() accumulation preserves the failure signal: the handler returns non-zero if any run failed, so CI still fails on error. Early-returning would mask the other runs' outcomes. Happy to switch to fail-fast for the whole scenario if you'd prefer that semantics.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Line 160: handle_dse_job currently calls agent.run() directly which can throw
and abort the whole DSE batch; wrap the call to agent.run() in a try/except that
catches any exception, sets/updates the existing err variable to a non-zero
return code (e.g., err |= 1 or err = 1) and continues processing remaining runs
instead of re-raising; ensure the catch logs the exception (including agent
identity) for debugging and references the agent.run() call and err variable so
the change is applied in the handle_dse_job function.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 843e393f-7ffd-4c66-ba6b-cd08d22ab804
📒 Files selected for processing (7)
src/cloudai/_core/test_scenario.pysrc/cloudai/cli/handlers.pysrc/cloudai/configurator/base_agent.pysrc/cloudai/configurator/cloudai_gym.pytests/test_cloudaigym.pytests/test_handlers.pytests/test_test_scenario.py
|
Downstream cloudaix PR that consumes this contract: https://github.com/Mellanox/cloudaix/pull/589 It ports Together the two PRs are the architectural fix for the trial-counter collapse: cloudai owns the trial-counter contract ( |
- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop. - New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths. - CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.
If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
…mutator Previously, ``test_run.step`` had no clear owner: the dispatcher set it from outside, the adapter rewound it on ``reset()``, and other callers wrote to it ad hoc. In RLlib custom-loop runs this collapsed every trial onto ``step=1``, overwriting ``trajectory.csv`` and ``env.csv`` rows. Centralize the invariant: ``TestRun.increment_step()`` is the single named mutator, and ``CloudAIGymEnv.step()`` is its only caller. One ``env.step()`` call advances the trial counter by exactly one — independent of any episode or dispatcher concept above the gym env. Contract tests in ``TestIncrementStep`` cover the API; ``test_cloudaigym`` asserts ``step`` is advanced *before* ``output_path`` and trajectory rows are computed, so cached and live trials both record the post-increment value.
…orphism Earlier commits in this PR introduced ``HAS_CUSTOM_TRAINING_LOOP`` + a ``CustomTrainingLoopAgent`` Protocol + a TypeGuard helper + an ``if/else`` in ``handle_dse_job`` to switch between the cloudai step loop and an agent-owned ``train()`` loop. That is a type-tagged conditional dispatching on agent identity — the textbook signal to replace conditional with polymorphism (Fowler). Add a default ``BaseAgent.run() -> int`` that holds the step-loop body (``select_action`` / ``env.step`` / ``update_policy`` per trial). Agents that drive their own training (RLlib, etc.) override ``run()`` to delegate to whatever loop they own and return a process-style exit code. ``handle_dse_job`` collapses to ``err |= agent.run()`` — one line, no branching, no Protocol vocabulary. The handler no longer knows that "custom training loops" exist as a category; that's an agent implementation detail. Net: -89 lines on cloudai. Surface area shrinks (no Protocol, no TypeGuard, no flag). ``test_handlers`` replaces the 5 helper unit tests + 2 dispatcher integration tests with 2 polymorphic tests asserting ``handle_dse_job`` delegates to ``agent.run()`` and propagates its return code.
a1a268a to
d0417b1
Compare
|
@podkidyshev rebased onto latest
Also note a contract fix surfaced by the rebase: |
handle_dse_job now wraps `err |= agent.run()` so an exception from one TestRun is logged with the agent identity, converted to a non-zero exit code, and the loop continues with the remaining independent TestRuns. Previously an uncaught exception unwound past the accumulator and the exit() call, aborting the rest of the scenario and skipping report generation. This honors the dependency-free "all cases run consecutively" contract enforced at the top of the same function. Also aligns CustomRunStubAgent.select_action with the BaseAgent signature (observation kwarg) after the polymorphism rebase. Addresses CodeRabbit review on PR NVIDIA#893.
11f8625 to
1e5a96d
Compare
handle_dse_job uses a hybrid failure model for agent.run(), now made explicit and pinned: - Recoverable failures return a non-zero rc, accumulated via `err |= agent.run()`, and the sweep continues to the next TestRun. Workload-level failures already follow this: CloudAIGymEnv.step maps a failed metric to rewards.metric_failure rather than raising, and rllib_run catches training errors and returns rc=1. - Unexpected failures (framework/agent bugs) raise and hard-fail the job so the bug surfaces instead of being masked as a penalizing reward / non-zero rc. The exception is captured so reports still generate, the aborting error is documented in <scenario_root>/dse_failure.txt, and then it is re-raised. BaseAgent.run() docstring documents the return-vs-raise contract. Tests pin all halves: a non-zero rc accumulates and continues; a raising run() propagates and aborts remaining runs; and on a hard-fail reports are still generated with the failure documented before re-raising. Also aligns CustomRunStubAgent.select_action with the BaseAgent signature (observation kwarg) after the polymorphism rebase.
1e5a96d to
d53a88d
Compare
|
Closed by a branch rename ( |
Issue
handle_dse_jobbranched on agent identity (flag + Protocol + TypeGuard) to support custom training loops.TestRun.stephad multiple writers; RLlib's frequentreset()collapsed every trial ontostep=1, overwritingtrajectory.csv/env.csvrows.Fix
BaseAgent.run()is the polymorphic entry point andhandle_dse_jobcollapses toerr |= agent.run()(deletes the flag/Protocol/TypeGuard/helper, −89 lines).TestRun.increment_step()is the sole mutator, called only byCloudAIGymEnv.step().<scenario_root>/dse_failure.txt, then re-raised).Testing
pytest tests: 1607 passed, 4 skipped; ruff + pyright + vulture clean.tr.stepmonotonicity through the adapter andenv.csv.Stack:
main← #893 ← #900 ← #901 (merge bottom-up).