[rollout, vllm] fix: avoid SIGSEGV on ROCm TP=1 by conditionally omitting distributed_executor_backend#6459
Conversation
…ributed_executor_backend On ROCm with TP=1, the hardcoded distributed_executor_backend="mp" forces vLLM to use MultiprocExecutor, creating a two-level multiprocessing spawn chain (EngineCore -> MultiprocExecutor -> WorkerProc) inside a Ray actor where CUDA/HIP is already initialized. This causes the inner WorkerProc to SIGSEGV (exitcode=-11) before entering worker_main. By omitting distributed_executor_backend when on ROCm with TP=1, vLLM auto-selects UniProcExecutor which keeps the worker in-process, avoiding the nested spawn. For TP>1, "mp" is still set as MultiprocExecutor is required for inter-process communication. CUDA/NVIDIA and TP>1 behavior are unchanged. Tested on: ROCm 7.0.2 + PyTorch 2.8.0 + vLLM 0.11.0rc2 + verl main Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: HeShiLie <gaozhe.gao@alibaba-inc.com>
There was a problem hiding this comment.
Code Review
This pull request updates vllm_async_server.py to handle a specific SIGSEGV issue on ROCm platforms with a tensor parallel size of 1 by conditionally omitting the distributed_executor_backend setting. Feedback highlights a regression where the new logic overwrites user-specified backend configurations in engine_kwargs. A code suggestion was provided to ensure that the default 'mp' backend is only applied if not already defined by the user.
| if not self._should_omit_distributed_executor_backend(): | ||
| args["distributed_executor_backend"] = "mp" | ||
| else: | ||
| logger.info( | ||
| "ROCm + TP=1 detected: omitting distributed_executor_backend " | ||
| "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn." | ||
| ) |
There was a problem hiding this comment.
The current implementation of the conditional logic for distributed_executor_backend causes a regression in precedence. Previously, users could override the default "mp" backend by providing distributed_executor_backend in engine_kwargs (via **engine_kwargs at line 273). Now, the logic at lines 280-281 will overwrite any value provided in engine_kwargs with "mp" for all cases except ROCm with TP=1.
Additionally, the log message at line 283 might be misleading if the user explicitly provided a backend in engine_kwargs on a ROCm+TP=1 system, as the code wouldn't actually be "omitting" it (it would just be respecting the user's choice).
To fix this, we should only set the default "mp" backend if it hasn't already been specified in args (which now includes engine_kwargs).
| if not self._should_omit_distributed_executor_backend(): | |
| args["distributed_executor_backend"] = "mp" | |
| else: | |
| logger.info( | |
| "ROCm + TP=1 detected: omitting distributed_executor_backend " | |
| "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn." | |
| ) | |
| if "distributed_executor_backend" not in args: | |
| if not self._should_omit_distributed_executor_backend(): | |
| args["distributed_executor_backend"] = "mp" | |
| else: | |
| logger.info( | |
| "ROCm + TP=1 detected: omitting distributed_executor_backend " | |
| "so vLLM uses UniProcExecutor and avoids nested WorkerProc spawn." | |
| ) |
There was a problem hiding this comment.
Good catch, thank you Mr. gemini! Fixed in 42f2bff — now the logic checks "distributed_executor_backend" not in args first, so user-specified values from engine_kwargs are preserved.
If the user already provides distributed_executor_backend via engine_kwargs, do not overwrite it with the default "mp". This preserves the existing precedence where engine_kwargs can override built-in defaults. Addresses review feedback from gemini-code-assist. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: HeShiLie <gaozhe.gao@alibaba-inc.com>
Summary
On ROCm (AMD GPU) with
tensor_parallel_size=1, verl's hardcodeddistributed_executor_backend="mp"forces vLLM to useMultiprocExecutor, creating a two-level multiprocessing spawn chain inside a Ray actor:The inner
WorkerProcconsistently dies with SIGSEGV before enteringworker_main. This appears to be a ROCm/HIP runtime issue triggered by nested multiprocessing spawn after HIP initialization.Fix: Conditionally omit
distributed_executor_backendon ROCm + TP=1, letting vLLM auto-selectUniProcExecutor(which keeps the worker in-process, no nested spawn). For TP>1,"mp"is still set asMultiprocExecutoris required.current_platform.is_rocm())tensor_model_parallel_size == 1)Investigation
Systematically eliminated all other hypotheses (import order, Ray worker pool, vLLM extensions, ZMQ IPC, etc.) across 10+ test scripts and 50+ runs, then isolated the crash to
AsyncLLM+backend="mp"specifically:distributed_executor_backendLLM()AsyncLLM+"mp""mp"AsyncLLM+ defaultUniProcExecutor)Verified that
UniProcExecutorsupports the full verl worker extension path (vLLMColocateWorkerExtension,collective_rpc,reset_mm_cache).Environment
Scope and Limitations
MultiprocExecutoris still needed — may require a vLLM upstream fix.Test plan
vLLMColocateWorkerExtensioninjection works with UniProcExecutorcollective_rpccalls (reset_mm_cache,monkey_patch_model) succeedis_rocm(), no behavioral change expected)AI assistance was used (Claude) for drafting this PR description. All code changes were written, reviewed, and tested by a human.
🤖 Generated with Claude Code