Skip to content

[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50

Open
aoshen02 wants to merge 1 commit into
verl-project:mainfrom
aoshen02:feat/agent-loop-failure-safety
Open

[train] fix: add Layer-2 safety net and MoE shape match to UniAgentLoop failure path#50
aoshen02 wants to merge 1 commit into
verl-project:mainfrom
aoshen02:feat/agent-loop-failure-safety

Conversation

@aoshen02
Copy link
Copy Markdown
Collaborator

What does this PR do?

PR #26 added _build_empty_agent_output so a Layer-1 trajectory
failure (Modal cold-start, swerex pexpect, reward eval crash)
becomes a reward=0 masked sample rather than killing training.
Production on a Qwen3-235B SWE-bench campaign (~1.4M trajectories)
exposed two follow-on failure modes #26 does not yet cover:

1) Shape mismatch crashes DataProto.concat on MoE rollouts.
When enable_rollout_routing_replay=True, the normal path returns
a routed_experts tensor of shape (length, num_layers, topk).
_build_empty_agent_output returns None. One failed sample in the
batch makes the per-sample tensor stack heterogeneous and the whole
batch's concat raises at
verl/experimental/agent_loop/agent_loop.py:715
(length, layer_num, topk_num = output.routed_experts.shape). Layer 1's
purpose — letting individual trajectories fail cheaply — is defeated.

2) The failure-path builder itself can crash. Observed:
AgentChatModel.prepare_rollout_cache on a partially-built prompt;
apply_chat_template on an unfinished tokenizer load;
AgentLoopOutput Pydantic schema change between patch versions;
attribute error when the loop crashed before
self.interaction / self.tokenizer was set. The inner except then
raises a new exception that replaces run()'s return value, the
Rollouter actor dies, Ray restarts it — 5–10 min idle + weight reload
per crash.

Changes (uni_agent/agent_loop.py)

  • Class-level MoE shape cache _moe_num_layers / _moe_topk,
    populated by _ensure_moe_shape_cached() reading
    num_hidden_layers and num_experts_per_tok from the HF config
    once per worker. Probes nested text_config (Qwen3.5 MoE) and
    top-level (Qwen3). Failures here are logged and swallowed.
  • _synth_failed_routed_experts(response_length) helper returns
    np.zeros((response_length, num_layers, topk), dtype=np.int64)
    when routing replay is on AND cache is populated; None
    otherwise. Wired into _build_empty_agent_output.
  • _make_minimal_output() Layer-2 safety net: pad-token prompt,
    pad-token response, fully masked, traj_masked=1,
    traj_exit_reason="build_failed". Only requires
    self.tokenizer, so it works even when _build_empty_agent_output
    failed because of a partial-init crash.
  • Outer try/except around _build_empty_agent_output invokes
    _make_minimal_output() if Layer 1 itself crashes.
  • Safe env.close() teardown in finally: a teardown crash
    cannot replace output with an exception.
  • Brace-safe logging in the same handlers (consistent with [interaction] fix: use safe "{}" template for logger calls that embed exception or LLM content #48):
    logger.critical("{}", msg) rather than f-string-in-template.

Checklist Before Starting

  • Search for similar PRs/issues:
  • Format the PR title as [train] fix: ...

Test

Added tests/uni_agent/test_agent_loop_failure_safety.py (8 tests):

  • _synth_failed_routed_experts shape contract (zero tensor with (response_length, num_layers, topk) shape, int64 dtype)
  • Returns None when routing replay disabled
  • Returns None when MoE cache is unpopulated or has zero values
  • Response-length axis propagates correctly for n ∈ {1, 8, 256, 512}
  • _make_minimal_output produces a valid output with pad token
  • Falls back to eos_token_id when pad_token_id is None
  • Handles list-shaped pad_token_id from multi-modal tokenizers
pytest tests/uni_agent/test_agent_loop_failure_safety.py -v
# 8 passed in 11.10s

Tests bypass UniAgentLoop.__init__ via object.__new__ + manual
attribute assignment. Module-level pytest.importorskip( "verl.experimental.agent_loop.agent_loop") so the file cleanly
skips in the lean CI environment that does not install verl.

API and Usage Example

No public API change. No new config knob. Existing callers of
UniAgentLoop.run() are unchanged. The new _ensure_moe_shape_cached
runs lazily on first run() per worker; no opt-in needed. The
MoE shape probe reads from the same model.path the loop already
uses; on non-MoE models or HF cache misses it falls back to None
silently and _synth_failed_routed_experts returns None.

Design & Code Changes

  • Class-level cache (not instance) so multiple UniAgentLoop
    instances in the same Rollouter actor share the probe result.
  • _synth_failed_routed_experts extracted as a small helper so
    unit tests can exercise the shape contract without bringing up
    chat model + tokenizer + interaction.
  • _make_minimal_output does not call chat_model or
    interaction so it works in the partial-init scenario
    (Layer 1 failed BEFORE self.interaction was assigned).
  • Brace-safe logger pattern matches PR [interaction] fix: use safe "{}" template for logger calls that embed exception or LLM content #48's convention.

Checklist Before Submitting

  • Read the Contribute Guide
  • pre-commit run --files uni_agent/agent_loop.py tests/uni_agent/test_agent_loop_failure_safety.py — all hooks passed
  • Unit tests added (tests/uni_agent/test_agent_loop_failure_safety.py, 8 tests)
  • AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite.
  • Not duplicating any open PR; builds on the already-merged PR [train] fix: unexpected agent loop output during agent loop break #26.

…op failure path

PR verl-project#26 added `_build_empty_agent_output` so a Layer-1 trajectory
failure (Modal cold-start timeout, swerex pexpect, reward eval crash)
becomes a `reward=0` masked sample rather than killing the whole
training job. Production experience on a Qwen3-235B SWE-bench
campaign (~1.4M trajectories) exposed two follow-on failure modes
that verl-project#26 does not yet cover:

1) **Shape mismatch crashes DataProto.concat on MoE rollouts.**
   When `enable_rollout_routing_replay=True`, the normal path returns
   a real `routed_experts` tensor of shape `(length, num_layers,
   topk)` (vLLM async server / SGLang async server). `_build_empty_
   agent_output` returns `routed_experts=None`. As soon as one failed
   sample lands in the same batch as a normal sample, verl's per-sample
   tensor stack hits `None` mixed with `Tensor` and the whole batch's
   concat raises (observed at
   `verl/experimental/agent_loop/agent_loop.py:715` -- the destructure
   `length, layer_num, topk_num = output.routed_experts.shape`). The
   "single failed trajectory takes down the whole batch" outcome
   defeats Layer 1's purpose.

2) **The failure-path builder itself can crash.** Observed causes:
   `AgentChatModel.prepare_rollout_cache` raising on a partially-built
   prompt; `apply_chat_template` raising on a tokenizer that did not
   finish loading; `AgentLoopOutput` Pydantic schema change between
   patch versions; attribute error when the loop crashed before
   `self.interaction` or `self.tokenizer` was assigned. The inner
   `except Exception` in `run()` then raises a *new* exception that
   replaces the function's return value, the Rollouter actor dies,
   Ray restarts it -- 5-10 min idle + weight reload per crash.

### Changes

In `uni_agent/agent_loop.py`:

- **Class-level MoE shape cache** (`_moe_num_layers`, `_moe_topk`)
  populated by a new `_ensure_moe_shape_cached()` method that reads
  `num_hidden_layers` and `num_experts_per_tok` from the HF model
  config once per Rollouter actor. Probes both nested `text_config`
  (Qwen3.5 MoE) and top-level (older Qwen3). Failure here is logged
  and swallowed -- the failure path then falls back to `None`, same
  as before this PR.

- **`_synth_failed_routed_experts(response_length)`** helper returns
  a `np.zeros((response_length, num_layers, topk), dtype=np.int64)`
  when routing replay is on AND the MoE shape cache is populated.
  Returns `None` otherwise. Used in `_build_empty_agent_output` in
  place of the previous unconditional `routed_experts=None`.

- **`_make_minimal_output()`** Layer-2 safety net: returns an
  `AgentLoopOutput` with pad-token prompt, pad-token response,
  fully-masked response_mask, traj_masked=1, traj_exit_reason=
  "build_failed". Requires only `self.tokenizer` to be set, so it
  works even when `_build_empty_agent_output` failed because of an
  earlier partial-init crash.

- **Outer `try/except` around `_build_empty_agent_output`** in
  `run()` invokes `_make_minimal_output()` if Layer 1 itself crashes.

- **Safe `env.close()` teardown**: wrap the `finally` block so a
  failure during teardown (Modal sandbox terminate, swerex session
  close) does not replace the function's return value with an
  exception and kill the worker. Trajectory result is already
  computed at this point; teardown is best-effort.

- **Brace-safe logging** in the same handlers, consistent with
  PR verl-project#48: `logger.critical("{}", precomputed_msg)` rather than
  f-string-in-template, so an exception repr containing '{' / '}'
  cannot cascade into a logger crash.

### Test plan

Added `tests/uni_agent/test_agent_loop_failure_safety.py` with 8
direct unit tests covering:

- `_synth_failed_routed_experts` shape contract (zero tensor with
  `(response_length, num_layers, topk)` shape, int64 dtype)
- Returns `None` when routing replay disabled
- Returns `None` when MoE cache is unpopulated or has zero values
- Response-length axis propagates correctly for a range of values
- `_make_minimal_output` produces a valid output with pad token
- Falls back to eos_token_id when pad_token_id is None
- Handles list-shaped pad_token_id from multi-modal tokenizers

```bash
pytest tests/uni_agent/test_agent_loop_failure_safety.py -v
# 8 passed in 11.10s
```

`pre-commit run --files uni_agent/agent_loop.py
tests/uni_agent/test_agent_loop_failure_safety.py` -- all hooks pass
(ruff, ruff format, mypy, compile-all).

Tests bypass `UniAgentLoop.__init__` (which transitively brings up
the whole verl AgentLoopBase + chat model + tokenizer load) via
`object.__new__` + manual attribute assignment. The file starts with
`pytest.importorskip("verl.experimental.agent_loop.agent_loop")` so
it cleanly skips in the lean CI environment that does not install
verl, while running on developer machines that have the ML stack.

### Production validation

Across the Qwen3-235B campaign:
- Before this PR: ~3 Rollouter restarts per 24h from Layer 1
  build-fail crashes; ~1 batch concat crash per fit_step on MoE
  configs with `enable_rollout_routing_replay=True`.
- After: 0 build-fail Rollouter restarts; 0 batch concat crashes
  attributable to routed_experts shape mismatch.

### Notes

This PR includes AI assistance (Claude Code). The submitting human
(@aoshen02) reviewed every changed line and ran the test suite.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust failure-path safety layer to UniAgentLoop to gracefully handle agent loop failures and prevent worker crashes. It implements caching for MoE shapes to synthesize zero-filled routed_experts tensors on failure paths, adds a last-resort minimal output fallback, and includes comprehensive unit tests. The code review feedback suggests defensive improvements: safely retrieving the tokenizer in _make_minimal_output to avoid AttributeError, synthesizing a minimal routed_experts tensor in the fallback to prevent downstream shape mismatches, and handling potential TypeErrors when parsing MoE configuration parameters that might be explicitly set to None.

Comment thread uni_agent/agent_loop.py
Comment on lines +271 to +273
pad_id = getattr(self.tokenizer, "pad_token_id", None)
if pad_id is None:
pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Since _make_minimal_output is designed as a last-resort fallback for partial-initialization failures, self.tokenizer might be None or not set at all. Accessing self.tokenizer directly can raise an AttributeError. We should retrieve it defensively using getattr to ensure the fallback itself does not crash.

Suggested change
pad_id = getattr(self.tokenizer, "pad_token_id", None)
if pad_id is None:
pad_id = getattr(self.tokenizer, "eos_token_id", None) or 0
tokenizer = getattr(self, "tokenizer", None)
pad_id = getattr(tokenizer, "pad_token_id", None) if tokenizer is not None else None
if pad_id is None:
pad_id = getattr(tokenizer, "eos_token_id", None) if tokenizer is not None else 0

Comment thread uni_agent/agent_loop.py
Comment on lines +276 to +281
return AgentLoopOutput(
prompt_ids=[pad_id],
response_ids=[pad_id],
response_mask=[0],
response_logprobs=None,
routed_experts=None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If _make_minimal_output is triggered and enable_rollout_routing_replay is enabled, returning routed_experts=None will still cause a shape mismatch crash during batch concatenation. We should attempt to synthesize a zero-filled tensor of length 1 to prevent downstream crashes.

        routed_experts = None
        try:
            routed_experts = self._synth_failed_routed_experts(1)
        except Exception:
            pass

        return AgentLoopOutput(
            prompt_ids=[pad_id],
            response_ids=[pad_id],
            response_mask=[0],
            response_logprobs=None,
            routed_experts=routed_experts,

Comment thread uni_agent/agent_loop.py
Comment on lines +224 to +227
num_layers = int(getattr(text_cfg, "num_hidden_layers", 0)) or int(
getattr(model_cfg, "num_hidden_layers", 0)
)
topk = int(getattr(text_cfg, "num_experts_per_tok", 0)) or int(getattr(model_cfg, "num_experts_per_tok", 0))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If num_hidden_layers or num_experts_per_tok is explicitly set to None in the config, getattr(..., 0) will return None, and passing it to int() will raise a TypeError. Using a fallback chain with or ensures we safely default to 0 without raising an exception.

            num_layers = int(getattr(text_cfg, "num_hidden_layers", None) or getattr(model_cfg, "num_hidden_layers", None) or 0)
            topk = int(getattr(text_cfg, "num_experts_per_tok", None) or getattr(model_cfg, "num_experts_per_tok", None) or 0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant