[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50
[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50aoshen02 wants to merge 1 commit into
Conversation
…op failure path PR verl-project#26 added `_build_empty_agent_output` so a Layer-1 trajectory failure (Modal cold-start timeout, swerex pexpect, reward eval crash) becomes a `reward=0` masked sample rather than killing the whole training job. Production experience on a Qwen3-235B SWE-bench campaign (~1.4M trajectories) exposed two follow-on failure modes that verl-project#26 does not yet cover: 1) **Shape mismatch crashes DataProto.concat on MoE rollouts.** When `enable_rollout_routing_replay=True`, the normal path returns a real `routed_experts` tensor of shape `(length, num_layers, topk)` (vLLM async server / SGLang async server). `_build_empty_ agent_output` returns `routed_experts=None`. As soon as one failed sample lands in the same batch as a normal sample, verl's per-sample tensor stack hits `None` mixed with `Tensor` and the whole batch's concat raises (observed at `verl/experimental/agent_loop/agent_loop.py:715` -- the destructure `length, layer_num, topk_num = output.routed_experts.shape`). The "single failed trajectory takes down the whole batch" outcome defeats Layer 1's purpose. 2) **The failure-path builder itself can crash.** Observed causes: `AgentChatModel.prepare_rollout_cache` raising on a partially-built prompt; `apply_chat_template` raising on a tokenizer that did not finish loading; `AgentLoopOutput` Pydantic schema change between patch versions; attribute error when the loop crashed before `self.interaction` or `self.tokenizer` was assigned. The inner `except Exception` in `run()` then raises a *new* exception that replaces the function's return value, the Rollouter actor dies, Ray restarts it -- 5-10 min idle + weight reload per crash. ### Changes In `uni_agent/agent_loop.py`: - **Class-level MoE shape cache** (`_moe_num_layers`, `_moe_topk`) populated by a new `_ensure_moe_shape_cached()` method that reads `num_hidden_layers` and `num_experts_per_tok` from the HF model config once per Rollouter actor. Probes both nested `text_config` (Qwen3.5 MoE) and top-level (older Qwen3). Failure here is logged and swallowed -- the failure path then falls back to `None`, same as before this PR. - **`_synth_failed_routed_experts(response_length)`** helper returns a `np.zeros((response_length, num_layers, topk), dtype=np.int64)` when routing replay is on AND the MoE shape cache is populated. Returns `None` otherwise. Used in `_build_empty_agent_output` in place of the previous unconditional `routed_experts=None`. - **`_make_minimal_output()`** Layer-2 safety net: returns an `AgentLoopOutput` with pad-token prompt, pad-token response, fully-masked response_mask, traj_masked=1, traj_exit_reason= "build_failed". Requires only `self.tokenizer` to be set, so it works even when `_build_empty_agent_output` failed because of an earlier partial-init crash. - **Outer `try/except` around `_build_empty_agent_output`** in `run()` invokes `_make_minimal_output()` if Layer 1 itself crashes. - **Safe `env.close()` teardown**: wrap the `finally` block so a failure during teardown (Modal sandbox terminate, swerex session close) does not replace the function's return value with an exception and kill the worker. Trajectory result is already computed at this point; teardown is best-effort. - **Brace-safe logging** in the same handlers, consistent with PR verl-project#48: `logger.critical("{}", precomputed_msg)` rather than f-string-in-template, so an exception repr containing '{' / '}' cannot cascade into a logger crash. ### Test plan Added `tests/uni_agent/test_agent_loop_failure_safety.py` with 8 direct unit tests covering: - `_synth_failed_routed_experts` shape contract (zero tensor with `(response_length, num_layers, topk)` shape, int64 dtype) - Returns `None` when routing replay disabled - Returns `None` when MoE cache is unpopulated or has zero values - Response-length axis propagates correctly for a range of values - `_make_minimal_output` produces a valid output with pad token - Falls back to eos_token_id when pad_token_id is None - Handles list-shaped pad_token_id from multi-modal tokenizers ```bash pytest tests/uni_agent/test_agent_loop_failure_safety.py -v # 8 passed in 11.10s ``` `pre-commit run --files uni_agent/agent_loop.py tests/uni_agent/test_agent_loop_failure_safety.py` -- all hooks pass (ruff, ruff format, mypy, compile-all). Tests bypass `UniAgentLoop.__init__` (which transitively brings up the whole verl AgentLoopBase + chat model + tokenizer load) via `object.__new__` + manual attribute assignment. The file starts with `pytest.importorskip("verl.experimental.agent_loop.agent_loop")` so it cleanly skips in the lean CI environment that does not install verl, while running on developer machines that have the ML stack. ### Production validation Across the Qwen3-235B campaign: - Before this PR: ~3 Rollouter restarts per 24h from Layer 1 build-fail crashes; ~1 batch concat crash per fit_step on MoE configs with `enable_rollout_routing_replay=True`. - After: 0 build-fail Rollouter restarts; 0 batch concat crashes attributable to routed_experts shape mismatch. ### Notes This PR includes AI assistance (Claude Code). The submitting human (@aoshen02) reviewed every changed line and ran the test suite. Signed-off-by: aoshen02 <aoshen@inferact.ai> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a robust failure-path safety layer to UniAgentLoop to gracefully handle agent loop failures and prevent worker crashes. It implements caching for MoE shapes to synthesize zero-filled routed_experts tensors on failure paths, adds a last-resort minimal output fallback, and includes comprehensive unit tests. The code review feedback suggests defensive improvements: safely retrieving the tokenizer in _make_minimal_output to avoid AttributeError, synthesizing a minimal routed_experts tensor in the fallback to prevent downstream shape mismatches, and handling potential TypeErrors when parsing MoE configuration parameters that might be explicitly set to None.
| pad_id = getattr(self.tokenizer, "pad_token_id", None) | ||
| if pad_id is None: | ||
| pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0 |
There was a problem hiding this comment.
Since _make_minimal_output is designed as a last-resort fallback for partial-initialization failures, self.tokenizer might be None or not set at all. Accessing self.tokenizer directly can raise an AttributeError. We should retrieve it defensively using getattr to ensure the fallback itself does not crash.
| pad_id = getattr(self.tokenizer, "pad_token_id", None) | |
| if pad_id is None: | |
| pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0 | |
| tokenizer = getattr(self, "tokenizer", None) | |
| pad_id = getattr(tokenizer, "pad_token_id", None) if tokenizer is not None else None | |
| if pad_id is None: | |
| pad_id = getattr(tokenizer, "eos_token_id", None) if tokenizer is not None else 0 |
| return AgentLoopOutput( | ||
| prompt_ids=[pad_id], | ||
| response_ids=[pad_id], | ||
| response_mask=[0], | ||
| response_logprobs=None, | ||
| routed_experts=None, |
There was a problem hiding this comment.
If _make_minimal_output is triggered and enable_rollout_routing_replay is enabled, returning routed_experts=None will still cause a shape mismatch crash during batch concatenation. We should attempt to synthesize a zero-filled tensor of length 1 to prevent downstream crashes.
routed_experts = None
try:
routed_experts = self._synth_failed_routed_experts(1)
except Exception:
pass
return AgentLoopOutput(
prompt_ids=[pad_id],
response_ids=[pad_id],
response_mask=[0],
response_logprobs=None,
routed_experts=routed_experts,| num_layers = int(getattr(text_cfg, "num_hidden_layers", 0)) or int( | ||
| getattr(model_cfg, "num_hidden_layers", 0) | ||
| ) | ||
| topk = int(getattr(text_cfg, "num_experts_per_tok", 0)) or int(getattr(model_cfg, "num_experts_per_tok", 0)) |
There was a problem hiding this comment.
If num_hidden_layers or num_experts_per_tok is explicitly set to None in the config, getattr(..., 0) will return None, and passing it to int() will raise a TypeError. Using a fallback chain with or ensures we safely default to 0 without raising an exception.
num_layers = int(getattr(text_cfg, "num_hidden_layers", None) or getattr(model_cfg, "num_hidden_layers", None) or 0)
topk = int(getattr(text_cfg, "num_experts_per_tok", None) or getattr(model_cfg, "num_experts_per_tok", None) or 0)
What does this PR do?
PR #26 added
_build_empty_agent_outputso a Layer-1 trajectoryfailure (Modal cold-start, swerex pexpect, reward eval crash)
becomes a
reward=0masked sample rather than killing training.Production on a Qwen3-235B SWE-bench campaign (~1.4M trajectories)
exposed two follow-on failure modes #26 does not yet cover:
1) Shape mismatch crashes
DataProto.concaton MoE rollouts.When
enable_rollout_routing_replay=True, the normal path returnsa
routed_expertstensor of shape(length, num_layers, topk)._build_empty_agent_outputreturnsNone. One failed sample in thebatch makes the per-sample tensor stack heterogeneous and the whole
batch's concat raises at
verl/experimental/agent_loop/agent_loop.py:715(
length, layer_num, topk_num = output.routed_experts.shape). Layer 1'spurpose — letting individual trajectories fail cheaply — is defeated.
2) The failure-path builder itself can crash. Observed:
AgentChatModel.prepare_rollout_cacheon a partially-built prompt;apply_chat_templateon an unfinished tokenizer load;AgentLoopOutputPydantic schema change between patch versions;attribute error when the loop crashed before
self.interaction/self.tokenizerwas set. The innerexceptthenraises a new exception that replaces
run()'s return value, theRollouter actor dies, Ray restarts it — 5–10 min idle + weight reload
per crash.
Changes (
uni_agent/agent_loop.py)_moe_num_layers/_moe_topk,populated by
_ensure_moe_shape_cached()readingnum_hidden_layersandnum_experts_per_tokfrom the HF configonce per worker. Probes nested
text_config(Qwen3.5 MoE) andtop-level (Qwen3). Failures here are logged and swallowed.
_synth_failed_routed_experts(response_length)helper returnsnp.zeros((response_length, num_layers, topk), dtype=np.int64)when routing replay is on AND cache is populated;
Noneotherwise. Wired into
_build_empty_agent_output._make_minimal_output()Layer-2 safety net: pad-token prompt,pad-token response, fully masked,
traj_masked=1,traj_exit_reason="build_failed". Only requiresself.tokenizer, so it works even when_build_empty_agent_outputfailed because of a partial-init crash.
try/exceptaround_build_empty_agent_outputinvokes_make_minimal_output()if Layer 1 itself crashes.env.close()teardown infinally: a teardown crashcannot replace
outputwith an exception.logger.critical("{}", msg)rather than f-string-in-template.Checklist Before Starting
gh pr list --repo verl-project/uni-agent --state open --search "agent loop failure"→ nonegh pr list --repo verl-project/uni-agent --state all --search "routed_experts None"→ none[train] fix: ...Test
Added
tests/uni_agent/test_agent_loop_failure_safety.py(8 tests):_synth_failed_routed_expertsshape contract (zero tensor with(response_length, num_layers, topk)shape, int64 dtype)Nonewhen routing replay disabledNonewhen MoE cache is unpopulated or has zero valuesn ∈ {1, 8, 256, 512}_make_minimal_outputproduces a valid output with pad tokeneos_token_idwhenpad_token_idisNonepad_token_idfrom multi-modal tokenizerspytest tests/uni_agent/test_agent_loop_failure_safety.py -v # 8 passed in 11.10sTests bypass
UniAgentLoop.__init__viaobject.__new__+ manualattribute assignment. Module-level
pytest.importorskip( "verl.experimental.agent_loop.agent_loop")so the file cleanlyskips in the lean CI environment that does not install verl.
API and Usage Example
No public API change. No new config knob. Existing callers of
UniAgentLoop.run()are unchanged. The new_ensure_moe_shape_cachedruns lazily on first
run()per worker; no opt-in needed. TheMoE shape probe reads from the same
model.paththe loop alreadyuses; on non-MoE models or HF cache misses it falls back to
Nonesilently and
_synth_failed_routed_expertsreturnsNone.Design & Code Changes
UniAgentLoopinstances in the same Rollouter actor share the probe result.
_synth_failed_routed_expertsextracted as a small helper sounit tests can exercise the shape contract without bringing up
chat model + tokenizer + interaction.
_make_minimal_outputdoes not callchat_modelorinteractionso it works in the partial-init scenario(Layer 1 failed BEFORE
self.interactionwas assigned).Checklist Before Submitting
pre-commit run --files uni_agent/agent_loop.py tests/uni_agent/test_agent_loop_failure_safety.py— all hooks passedtests/uni_agent/test_agent_loop_failure_safety.py, 8 tests)