[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path by aoshen02 · Pull Request #50 · verl-project/uni-agent

aoshen02 · 2026-05-28T15:07:55Z

What does this PR do?

PR #26 added _build_empty_agent_output so a Layer-1 trajectory
failure (Modal cold-start, swerex pexpect, reward eval crash)
becomes a reward=0 masked sample rather than killing training.
Production on a Qwen3-235B SWE-bench campaign (~1.4M trajectories)
exposed two follow-on failure modes #26 does not yet cover:

1) Shape mismatch crashes DataProto.concat on MoE rollouts.
When enable_rollout_routing_replay=True, the normal path returns
a routed_experts tensor of shape (length, num_layers, topk).
_build_empty_agent_output returns None. One failed sample in the
batch makes the per-sample tensor stack heterogeneous and the whole
batch's concat raises at
verl/experimental/agent_loop/agent_loop.py:715
(length, layer_num, topk_num = output.routed_experts.shape). Layer 1's
purpose — letting individual trajectories fail cheaply — is defeated.

2) The failure-path builder itself can crash. Observed:
AgentChatModel.prepare_rollout_cache on a partially-built prompt;
apply_chat_template on an unfinished tokenizer load;
AgentLoopOutput Pydantic schema change between patch versions;
attribute error when the loop crashed before
self.interaction / self.tokenizer was set. The inner except then
raises a new exception that replaces run()'s return value, the
Rollouter actor dies, Ray restarts it — 5–10 min idle + weight reload
per crash.

Changes (`uni_agent/agent_loop.py`)

Class-level MoE shape cache _moe_num_layers / _moe_topk,
populated by _ensure_moe_shape_cached() reading
num_hidden_layers and num_experts_per_tok from the HF config
once per worker. Probes nested text_config (Qwen3.5 MoE) and
top-level (Qwen3). Failures here are logged and swallowed.
_synth_failed_routed_experts(response_length) helper returns
np.zeros((response_length, num_layers, topk), dtype=np.int64)
when routing replay is on AND cache is populated; None
otherwise. Wired into _build_empty_agent_output.
_make_minimal_output() Layer-2 safety net: pad-token prompt,
pad-token response, fully masked, traj_masked=1,
traj_exit_reason="build_failed". Only requires
self.tokenizer, so it works even when _build_empty_agent_output
failed because of a partial-init crash.
Outer try/except around _build_empty_agent_output invokes
_make_minimal_output() if Layer 1 itself crashes.
Safe env.close() teardown in finally: a teardown crash
cannot replace output with an exception.
Brace-safe logging in the same handlers (consistent with [interaction] fix: use safe "{}" template for logger calls that embed exception or LLM content #48):
logger.critical("{}", msg) rather than f-string-in-template.

Checklist Before Starting

Search for similar PRs/issues:
- gh pr list --repo verl-project/uni-agent --state open --search "agent loop failure" → none
- gh pr list --repo verl-project/uni-agent --state all --search "routed_experts None" → none
- PR [train] fix: unexpected agent loop output during agent loop break #26 is the prior art this builds on (Layer 1); this PR strictly extends
Format the PR title as [train] fix: ...

Test

Added tests/uni_agent/test_agent_loop_failure_safety.py (8 tests):

_synth_failed_routed_experts shape contract (zero tensor with (response_length, num_layers, topk) shape, int64 dtype)
Returns None when routing replay disabled
Returns None when MoE cache is unpopulated or has zero values
Response-length axis propagates correctly for n ∈ {1, 8, 256, 512}
_make_minimal_output produces a valid output with pad token
Falls back to eos_token_id when pad_token_id is None
Handles list-shaped pad_token_id from multi-modal tokenizers

pytest tests/uni_agent/test_agent_loop_failure_safety.py -v
# 8 passed in 11.10s

Tests bypass UniAgentLoop.__init__ via object.__new__ + manual
attribute assignment. Module-level pytest.importorskip( "verl.experimental.agent_loop.agent_loop") so the file cleanly
skips in the lean CI environment that does not install verl.

API and Usage Example

No public API change. No new config knob. Existing callers of
UniAgentLoop.run() are unchanged. The new _ensure_moe_shape_cached
runs lazily on first run() per worker; no opt-in needed. The
MoE shape probe reads from the same model.path the loop already
uses; on non-MoE models or HF cache misses it falls back to None
silently and _synth_failed_routed_experts returns None.

Design & Code Changes

Class-level cache (not instance) so multiple UniAgentLoop
instances in the same Rollouter actor share the probe result.
_synth_failed_routed_experts extracted as a small helper so
unit tests can exercise the shape contract without bringing up
chat model + tokenizer + interaction.
_make_minimal_output does not call chat_model or
interaction so it works in the partial-init scenario
(Layer 1 failed BEFORE self.interaction was assigned).
Brace-safe logger pattern matches PR [interaction] fix: use safe "{}" template for logger calls that embed exception or LLM content #48's convention.

Checklist Before Submitting

Read the Contribute Guide
pre-commit run --files uni_agent/agent_loop.py tests/uni_agent/test_agent_loop_failure_safety.py — all hooks passed
Unit tests added (tests/uni_agent/test_agent_loop_failure_safety.py, 8 tests)
AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite.
Not duplicating any open PR; builds on the already-merged PR [train] fix: unexpected agent loop output during agent loop break #26.

@aoshen02

…op failure path PR verl-project#26 added `_build_empty_agent_output` so a Layer-1 trajectory failure (Modal cold-start timeout, swerex pexpect, reward eval crash) becomes a `reward=0` masked sample rather than killing the whole training job. Production experience on a Qwen3-235B SWE-bench campaign (~1.4M trajectories) exposed two follow-on failure modes that verl-project#26 does not yet cover: 1) **Shape mismatch crashes DataProto.concat on MoE rollouts.** When `enable_rollout_routing_replay=True`, the normal path returns a real `routed_experts` tensor of shape `(length, num_layers, topk)` (vLLM async server / SGLang async server). `_build_empty_ agent_output` returns `routed_experts=None`. As soon as one failed sample lands in the same batch as a normal sample, verl's per-sample tensor stack hits `None` mixed with `Tensor` and the whole batch's concat raises (observed at `verl/experimental/agent_loop/agent_loop.py:715` -- the destructure `length, layer_num, topk_num = output.routed_experts.shape`). The "single failed trajectory takes down the whole batch" outcome defeats Layer 1's purpose. 2) **The failure-path builder itself can crash.** Observed causes: `AgentChatModel.prepare_rollout_cache` raising on a partially-built prompt; `apply_chat_template` raising on a tokenizer that did not finish loading; `AgentLoopOutput` Pydantic schema change between patch versions; attribute error when the loop crashed before `self.interaction` or `self.tokenizer` was assigned. The inner `except Exception` in `run()` then raises a *new* exception that replaces the function's return value, the Rollouter actor dies, Ray restarts it -- 5-10 min idle + weight reload per crash. ### Changes In `uni_agent/agent_loop.py`: - **Class-level MoE shape cache** (`_moe_num_layers`, `_moe_topk`) populated by a new `_ensure_moe_shape_cached()` method that reads `num_hidden_layers` and `num_experts_per_tok` from the HF model config once per Rollouter actor. Probes both nested `text_config` (Qwen3.5 MoE) and top-level (older Qwen3). Failure here is logged and swallowed -- the failure path then falls back to `None`, same as before this PR. - **`_synth_failed_routed_experts(response_length)`** helper returns a `np.zeros((response_length, num_layers, topk), dtype=np.int64)` when routing replay is on AND the MoE shape cache is populated. Returns `None` otherwise. Used in `_build_empty_agent_output` in place of the previous unconditional `routed_experts=None`. - **`_make_minimal_output()`** Layer-2 safety net: returns an `AgentLoopOutput` with pad-token prompt, pad-token response, fully-masked response_mask, traj_masked=1, traj_exit_reason= "build_failed". Requires only `self.tokenizer` to be set, so it works even when `_build_empty_agent_output` failed because of an earlier partial-init crash. - **Outer `try/except` around `_build_empty_agent_output`** in `run()` invokes `_make_minimal_output()` if Layer 1 itself crashes. - **Safe `env.close()` teardown**: wrap the `finally` block so a failure during teardown (Modal sandbox terminate, swerex session close) does not replace the function's return value with an exception and kill the worker. Trajectory result is already computed at this point; teardown is best-effort. - **Brace-safe logging** in the same handlers, consistent with PR verl-project#48: `logger.critical("{}", precomputed_msg)` rather than f-string-in-template, so an exception repr containing '{' / '}' cannot cascade into a logger crash. ### Test plan Added `tests/uni_agent/test_agent_loop_failure_safety.py` with 8 direct unit tests covering: - `_synth_failed_routed_experts` shape contract (zero tensor with `(response_length, num_layers, topk)` shape, int64 dtype) - Returns `None` when routing replay disabled - Returns `None` when MoE cache is unpopulated or has zero values - Response-length axis propagates correctly for a range of values - `_make_minimal_output` produces a valid output with pad token - Falls back to eos_token_id when pad_token_id is None - Handles list-shaped pad_token_id from multi-modal tokenizers ```bash pytest tests/uni_agent/test_agent_loop_failure_safety.py -v # 8 passed in 11.10s ``` `pre-commit run --files uni_agent/agent_loop.py tests/uni_agent/test_agent_loop_failure_safety.py` -- all hooks pass (ruff, ruff format, mypy, compile-all). Tests bypass `UniAgentLoop.__init__` (which transitively brings up the whole verl AgentLoopBase + chat model + tokenizer load) via `object.__new__` + manual attribute assignment. The file starts with `pytest.importorskip("verl.experimental.agent_loop.agent_loop")` so it cleanly skips in the lean CI environment that does not install verl, while running on developer machines that have the ML stack. ### Production validation Across the Qwen3-235B campaign: - Before this PR: ~3 Rollouter restarts per 24h from Layer 1 build-fail crashes; ~1 batch concat crash per fit_step on MoE configs with `enable_rollout_routing_replay=True`. - After: 0 build-fail Rollouter restarts; 0 batch concat crashes attributable to routed_experts shape mismatch. ### Notes This PR includes AI assistance (Claude Code). The submitting human (@aoshen02) reviewed every changed line and ran the test suite. Signed-off-by: aoshen02 <aoshen@inferact.ai> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a robust failure-path safety layer to UniAgentLoop to gracefully handle agent loop failures and prevent worker crashes. It implements caching for MoE shapes to synthesize zero-filled routed_experts tensors on failure paths, adds a last-resort minimal output fallback, and includes comprehensive unit tests. The code review feedback suggests defensive improvements: safely retrieving the tokenizer in _make_minimal_output to avoid AttributeError, synthesizing a minimal routed_experts tensor in the fallback to prevent downstream shape mismatches, and handling potential TypeErrors when parsing MoE configuration parameters that might be explicitly set to None.

gemini-code-assist · 2026-05-28T15:09:35Z

+        pad_id = getattr(self.tokenizer, "pad_token_id", None)
+        if pad_id is None:
+            pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0


Since _make_minimal_output is designed as a last-resort fallback for partial-initialization failures, self.tokenizer might be None or not set at all. Accessing self.tokenizer directly can raise an AttributeError. We should retrieve it defensively using getattr to ensure the fallback itself does not crash.

Suggested change

pad_id = getattr(self.tokenizer, "pad_token_id", None)

if pad_id is None:

pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0

tokenizer = getattr(self, "tokenizer", None)

pad_id = getattr(tokenizer, "pad_token_id", None) if tokenizer is not None else None

if pad_id is None:

pad_id = getattr(tokenizer, "eos_token_id", None) if tokenizer is not None else 0

gemini-code-assist · 2026-05-28T15:09:36Z

+        return AgentLoopOutput(
+            prompt_ids=[pad_id],
+            response_ids=[pad_id],
+            response_mask=[0],
+            response_logprobs=None,
+            routed_experts=None,


If _make_minimal_output is triggered and enable_rollout_routing_replay is enabled, returning routed_experts=None will still cause a shape mismatch crash during batch concatenation. We should attempt to synthesize a zero-filled tensor of length 1 to prevent downstream crashes.

routed_experts = None try: routed_experts = self._synth_failed_routed_experts(1) except Exception: pass return AgentLoopOutput( prompt_ids=[pad_id], response_ids=[pad_id], response_mask=[0], response_logprobs=None, routed_experts=routed_experts,

gemini-code-assist · 2026-05-28T15:09:36Z

+            num_layers = int(getattr(text_cfg, "num_hidden_layers", 0)) or int(
+                getattr(model_cfg, "num_hidden_layers", 0)
+            )
+            topk = int(getattr(text_cfg, "num_experts_per_tok", 0)) or int(getattr(model_cfg, "num_experts_per_tok", 0))


If num_hidden_layers or num_experts_per_tok is explicitly set to None in the config, getattr(..., 0) will return None, and passing it to int() will raise a TypeError. Using a fallback chain with or ensures we safely default to 0 without raising an exception.

num_layers = int(getattr(text_cfg, "num_hidden_layers", None) or getattr(model_cfg, "num_hidden_layers", None) or 0) topk = int(getattr(text_cfg, "num_experts_per_tok", None) or getattr(model_cfg, "num_experts_per_tok", None) or 0)

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50

[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50
aoshen02 wants to merge 1 commit into
verl-project:mainfrom
aoshen02:feat/agent-loop-failure-safety

aoshen02 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aoshen02 commented May 28, 2026

What does this PR do?

Changes (uni_agent/agent_loop.py)

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Changes (`uni_agent/agent_loop.py`)