[rollout, vllm] fix: avoid SIGSEGV on ROCm TP=1 by conditionally omitting distributed_executor_backend by HeShiLie · Pull Request #6459 · verl-project/verl

HeShiLie · 2026-05-25T03:31:13Z

Summary

On ROCm (AMD GPU) with tensor_parallel_size=1, verl's hardcoded distributed_executor_backend="mp" forces vLLM to use MultiprocExecutor, creating a two-level multiprocessing spawn chain inside a Ray actor:

Ray actor (CUDA/HIP already initialized)
  └→ AsyncLLM
       └→ EngineCore_DP0 (1st spawn)
            └→ MultiprocExecutor
                 └→ WorkerProc (2nd spawn) → SIGSEGV (exitcode=-11)

The inner WorkerProc consistently dies with SIGSEGV before entering worker_main. This appears to be a ROCm/HIP runtime issue triggered by nested multiprocessing spawn after HIP initialization.
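The `exitcode=-11` in the chain above follows the Python `multiprocessing` convention: a negative exit code `-N` means the child was terminated by signal `N`. The helper below is a minimal sketch for decoding such codes when triaging worker crashes. The name `describe_exitcode` is hypothetical and is not part of verl or vLLM.

```python
import signal
from typing import Optional


def describe_exitcode(exitcode: Optional[int]) -> str:
    """Translate a multiprocessing.Process.exitcode into a readable label.

    Hypothetical helper for illustration only.
    """
    if exitcode is None:
        # multiprocessing reports None while the child has not terminated.
        return "still running"
    if exitcode < 0:
        # A negative exit code -N means the child was killed by signal N.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by signal {-exitcode}"
    return "exited cleanly" if exitcode == 0 else f"exited with status {exitcode}"
```

On Linux, `describe_exitcode(-11)` yields `"killed by SIGSEGV"`, matching the crash reported for the inner WorkerProc.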

Fix: Conditionally omit distributed_executor_backend on ROCm + TP=1, letting vLLM auto-select UniProcExecutor (which keeps the worker in-process, no nested spawn). For TP>1, "mp" is still set as MultiprocExecutor is required.

CUDA/NVIDIA behavior: unchanged (condition is current_platform.is_rocm())
TP>1 behavior: unchanged (condition requires tensor_model_parallel_size == 1)
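The gating condition described above can be sketched as a pure function. This is an assumption about the shape of the check, not the PR's actual code: the body of `_should_omit_distributed_executor_backend()` is not shown in this conversation, and here the platform flag is passed in as a plain boolean (standing in for `current_platform.is_rocm()`) so the sketch runs without vLLM installed.

```python
def should_omit_distributed_executor_backend(is_rocm: bool, tensor_parallel_size: int) -> bool:
    """Return True when the backend should be left unset.

    Hypothetical stand-in for the PR's helper: only the ROCm + TP=1
    combination omits the setting; every other case keeps the "mp" default.
    """
    return is_rocm and tensor_parallel_size == 1


# The three cases called out in the summary:
rocm_tp1 = should_omit_distributed_executor_backend(True, 1)   # omit, vLLM picks its default
rocm_tp2 = should_omit_distributed_executor_backend(True, 2)   # keep "mp"
cuda_tp1 = should_omit_distributed_executor_backend(False, 1)  # keep "mp"
```

Keeping the check as a small predicate makes the CUDA/NVIDIA and TP>1 paths easy to verify in isolation.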

Investigation

Systematically eliminated all other hypotheses (import order, Ray worker pool, vLLM extensions, ZMQ IPC, etc.) across 10+ test scripts and 50+ runs, then isolated the crash to AsyncLLM + backend="mp" specifically:

| Variant | `distributed_executor_backend` | Result |
| --- | --- | --- |
| Bare `LLM()` | N/A | PASS |
| `AsyncLLM` + `"mp"` | forced `"mp"` | CRASH (SIGSEGV) |
| `AsyncLLM` + default | not set (→ `UniProcExecutor`) | PASS |

Verified that UniProcExecutor supports the full verl worker extension path (vLLMColocateWorkerExtension, collective_rpc, reset_mm_cache).

Environment

| Component | Version |
| --- | --- |
| GPU | AMD MI308X |
| ROCm | 7.0.2 |
| PyTorch | 2.8.0+rocm7.0.2 |
| vLLM | 0.11.0rc2 (V1 engine) |
| verl | main branch |
| Ray | 2.44.1 |

Scope and Limitations

TP>1 on ROCm: Not tested. The nested spawn issue likely persists for TP>1 since MultiprocExecutor is still needed — may require a vLLM upstream fix.
Related: PR [fully_async, rollout, trainer, tool, cfg] fix: ROCm async training compatibility for AMD MI300X #6062 addresses ROCm async training on a different stack (ROCm 7.2 + vLLM 0.18.1rc1).

Test plan

Verified fix with AsyncLLM + UniProcExecutor in Ray actor (4 test variants, all PASS)
Verified vLLMColocateWorkerExtension injection works with UniProcExecutor
Verified collective_rpc calls (reset_mm_cache, monkey_patch_model) succeed
Full TAGPO training script reaches wandb init successfully (no more SIGSEGV)
CUDA/NVIDIA regression test (change is gated by is_rocm(), no behavioral change expected)

AI assistance was used (Claude) for drafting this PR description. All code changes were written, reviewed, and tested by a human.

🤖 Generated with Claude Code

…ributed_executor_backend

On ROCm with TP=1, the hardcoded distributed_executor_backend="mp" forces vLLM to use MultiprocExecutor, creating a two-level multiprocessing spawn chain (EngineCore -> MultiprocExecutor -> WorkerProc) inside a Ray actor where CUDA/HIP is already initialized. This causes the inner WorkerProc to SIGSEGV (exitcode=-11) before entering worker_main.

By omitting distributed_executor_backend when on ROCm with TP=1, vLLM auto-selects UniProcExecutor which keeps the worker in-process, avoiding the nested spawn. For TP>1, "mp" is still set as MultiprocExecutor is required for inter-process communication.

CUDA/NVIDIA and TP>1 behavior are unchanged.

Tested on: ROCm 7.0.2 + PyTorch 2.8.0 + vLLM 0.11.0rc2 + verl main

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: HeShiLie <gaozhe.gao@alibaba-inc.com>

CLAassistant · 2026-05-25T03:31:21Z

All committers have signed the CLA.

gemini-code-assist

Code Review

This pull request updates vllm_async_server.py to handle a specific SIGSEGV issue on ROCm platforms with a tensor parallel size of 1 by conditionally omitting the distributed_executor_backend setting. Feedback highlights a regression where the new logic overwrites user-specified backend configurations in engine_kwargs. A code suggestion was provided to ensure that the default 'mp' backend is only applied if not already defined by the user.

gemini-code-assist · 2026-05-25T03:33:32Z

+        if not self._should_omit_distributed_executor_backend():
+            args["distributed_executor_backend"] = "mp"
+        else:
+            logger.info(
+                "ROCm + TP=1 detected: omitting distributed_executor_backend "
+                "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn."
+            )


The current implementation of the conditional logic for distributed_executor_backend causes a regression in precedence. Previously, users could override the default "mp" backend by providing distributed_executor_backend in engine_kwargs (via **engine_kwargs at line 273). Now, the logic at lines 280-281 will overwrite any value provided in engine_kwargs with "mp" for all cases except ROCm with TP=1.

Additionally, the log message at line 283 might be misleading if the user explicitly provided a backend in engine_kwargs on a ROCm+TP=1 system, as the code wouldn't actually be "omitting" it (it would just be respecting the user's choice).

To fix this, we should only set the default "mp" backend if it hasn't already been specified in args (which now includes engine_kwargs).

Suggested change

-        if not self._should_omit_distributed_executor_backend():
-            args["distributed_executor_backend"] = "mp"
-        else:
-            logger.info(
-                "ROCm + TP=1 detected: omitting distributed_executor_backend "
-                "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn."
-            )
+        if "distributed_executor_backend" not in args:
+            if not self._should_omit_distributed_executor_backend():
+                args["distributed_executor_backend"] = "mp"
+            else:
+                logger.info(
+                    "ROCm + TP=1 detected: omitting distributed_executor_backend "
+                    "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn."
+                )

Good catch, thank you Mr. gemini! Fixed in 42f2bff — now the logic checks "distributed_executor_backend" not in args first, so user-specified values from engine_kwargs are preserved.
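The precedence described in this reply can be illustrated with a small standalone sketch. It is not the code from `vllm_async_server.py`: the function name `build_engine_args` and its parameters are hypothetical, and the platform check is reduced to a boolean flag so the example runs on its own.

```python
from typing import Any, Dict, Optional


def build_engine_args(
    base_args: Dict[str, Any],
    engine_kwargs: Optional[Dict[str, Any]],
    omit_default_backend: bool,
) -> Dict[str, Any]:
    """Merge user overrides first, then apply the "mp" default only if unset.

    Hypothetical illustration of the precedence rule; names are assumptions.
    """
    args = dict(base_args)
    # User-supplied engine_kwargs override the built-in defaults.
    args.update(engine_kwargs or {})
    # Apply the default only when the user did not choose a backend
    # and the platform check does not ask for it to be omitted.
    if "distributed_executor_backend" not in args and not omit_default_backend:
        args["distributed_executor_backend"] = "mp"
    return args


# A user-specified backend survives regardless of the platform flag.
user_choice = build_engine_args({}, {"distributed_executor_backend": "ray"}, False)
# Without a user choice, the default is applied unless it is omitted.
defaulted = build_engine_args({}, None, False)
omitted = build_engine_args({}, None, True)
```

Checking for the key before applying the default is what restores the earlier behavior, where `engine_kwargs` could override `"mp"`.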

If the user already provides distributed_executor_backend via engine_kwargs, do not overwrite it with the default "mp". This preserves the existing precedence where engine_kwargs can override built-in defaults.

Addresses review feedback from gemini-code-assist.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: HeShiLie <gaozhe.gao@alibaba-inc.com>

HeShiLie requested review from ArronHZG, PeterSH6, chenhaiq and wuxibin89 as code owners May 25, 2026 03:31

gemini-code-assist Bot reviewed May 25, 2026

HeShiLie changed the title ~~[ROCm] Fix AsyncLLM WorkerProc SIGSEGV when distributed_executor_backend="mp" is forced with TP=1~~ [rollout, vllm] fix: avoid SIGSEGV on ROCm TP=1 by conditionally omitting distributed_executor_backend May 25, 2026

HeShiLie wants to merge 2 commits into verl-project:main from HeShiLie:fix/rocm-asyncllm-sigsegv
