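Hi author, I'm running into a startup problem when trying webshop. I set up the environment by following the installation steps.
Environment: Ubuntu 22.04, 2× H800, 250 GB RAM. It looks like a Ray version issue, so could you share your environment for comparison?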
(verl-webshop) qzluo@h800:~/workspace/ACD$ bash examples/sdar_trainer/run_webshop_3b.sh
- ENGINE=vllm
- export CUDA_VISIBLE_DEVICES=0,1
- CUDA_VISIBLE_DEVICES=0,1
- num_cpus_per_env_worker=0.1
- sdar_coef=0.01
- gate_beta=5.0
- skill_all=false
- train_data_size=16
- val_data_size=128
- group_size=8
- experiment_name=sdar_qwen2.5_3b_coef0.01_beta5.0_skillallfalse
- export WANDB_API_KEY=[REDACTED]
- python3 -m verl.trainer.main_sdar ... actor_rollout_ref.rollout.name=vllm ... trainer.total_epochs=150
+ray_init.include_dashboard=False trainer.val_before_train=True
ray init kwargs: {
'num_cpus': None,
'include_dashboard': False,
'runtime_env': {
'env_vars': {
'TOKENIZERS_PARALLELISM': 'true',
'NCCL_DEBUG': 'WARN',
'VLLM_LOGGING_LEVEL': 'WARN',
'VLLM_ALLOW_RUNTIME_LORA_UPDATING': 'true',
'VLLM_ALLREDUCE_USE_SYMM_MEM': '0',
'CUDA_DEVICE_MAX_CONNECTIONS': '1',
'NCCL_CUMEM_ENABLE': '0'
},
'working_dir': None
}
}
2026-06-04 15:16:11,685 INFO worker.py:2013 -- Started a local Ray instance.
/data_b/qzluo/miniconda3/envs/verl-webshop/lib/python3.10/site-packages/ray/_private/worker.py:2052:
FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or
num_gpus=None (default).
To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(raylet) The node with node id: 80f742564b2eca1f572800c312fa118af2ae1410c734e4c6e75014b4 and address: 172.23.148.69 and node
name: 172.23.148.69 has been marked dead because the detector has missed too many heartbeats from it.
This can happen when a
(1) raylet crashes unexpectedly (OOM, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) [2026-06-04 15:16:12,972 E 573463 573532] agent_manager.cc:87:
The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of grpcio doesn't follow Ray's requirement.
(raylet) - The agent failed to start because of unexpected error or port conflict.
(raylet) - The agent is killed by the OS (e.g., out of memory).
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: SDARTaskRunner
actor_id: c460a4644a8a2bd4e675d84a01000000
namespace: dde19708-8892-4d2d-9c5b-54cf9749b426
The actor is dead because its owner has died.
Owner worker exit type: SYSTEM_ERROR
Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
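Since the raylet log names a grpcio/Ray version mismatch as one possible cause, comparing installed versions seems like the quickest check. Below is a minimal sketch for collecting them. The package list (`ray`, `grpcio`, `vllm`, `torch`) is only my guess at what is relevant, and `collect_versions` is a hypothetical helper name, not something from the repo.

```python
from importlib import metadata

# Packages assumed to be relevant to the dashboard_agent failure.
# This list is a guess; adjust it to whatever the project actually pins.
PACKAGES = ["ray", "grpcio", "vllm", "torch"]


def collect_versions(packages=PACKAGES):
    """Return a mapping of package name to installed version string."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages instead of raising, so the report is complete.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}=={version}")
```

Running this inside the `verl-webshop` conda environment and in a known-good environment would give two lists that can be compared line by line.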