Running create_service_qwen2.5_math_hf.sh fails to start the RM service #97

@LGAG

Description

System Info

Linux
python 3.10
A6000*3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the codebase (such as scripts/, ...)
  • My own task or dataset (give details below)

Reproduction

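When I run the create_service_qwen2.5_math_hf.sh script, the RM model service appears for a moment and then disappears immediately. My script is shown below (I have already modified the LM and RM model settings):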
set -e

HOST_ADDR=0.0.0.0
CONTROLER_PORT=28777
WORKER_BASE_PORT=30010

echo PYTHON_EXECUTABLE=$(which python3)
PYTHON_EXECUTABLE=$(which python3)

MODEL_BASE=/data1/luozeyu/hf_models
CUDA_DEVICE_BASE=0
POLICY_MODEL_NAME=Qwen2.5-Math-1.5B-Instruct
VALUE_MODEL_NAME=checkpoint-117
MODEL_PATH=$MODEL_BASE/$POLICY_MODEL_NAME
VALUE_MODEL_PATH=$MODEL_BASE/$VALUE_MODEL_NAME

LOGDIR=logs_fastchat

tmux start-server
tmux new-session -s FastChat -n controller -d
tmux send-keys "export LOGDIR=${LOGDIR}" Enter
tmux send-keys "$PYTHON_EXECUTABLE -m fastchat.serve.controller --port ${CONTROLER_PORT} --host $HOST_ADDR" Enter

NUM_LM_WORKER=1
NUM_RM_WORKER=1

echo "Wait 5 seconds ..."
sleep 5

echo "Starting workers"
for i in $(seq 0 $((NUM_LM_WORKER-1)))
do
    WORKER_PORT=$((WORKER_BASE_PORT+i))
    tmux new-window -n policy_worker_$i
    tmux send-keys "export LOGDIR=${LOGDIR}" Enter
    tmux send-keys "CUDA_VISIBLE_DEVICES=$((i+CUDA_DEVICE_BASE)) $PYTHON_EXECUTABLE -m reason.llm_service.workers.model_worker --model-path $MODEL_PATH --controller-address http://$HOST_ADDR:$CONTROLER_PORT --host $HOST_ADDR --port $WORKER_PORT --worker-address http://$HOST_ADDR:$WORKER_PORT" Enter
done

# start value service

for i in $(seq 0 $((NUM_RM_WORKER-1)))
do
    WORKER_PORT=$((i+WORKER_BASE_PORT+NUM_LM_WORKER))
    tmux new-window -n value_worker
    tmux send-keys "export LOGDIR=${LOGDIR}" Enter
    tmux send-keys "CUDA_VISIBLE_DEVICES=$((i+NUM_RM_WORKER+CUDA_DEVICE_BASE-NUM_RM_WORKER+NUM_LM_WORKER)) $PYTHON_EXECUTABLE -m reason.llm_service.workers.reward_model_worker --model-path $VALUE_MODEL_PATH --controller-address http://$HOST_ADDR:$CONTROLER_PORT --host $HOST_ADDR --port $WORKER_PORT --worker-address http://$HOST_ADDR:$WORKER_PORT" Enter
done
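
As a worked example of the arithmetic in the value-worker loop above, the sketch below reuses the script's own variables and values to print which port and GPU index the reward model worker ends up with. The names `rm_port` and `rm_gpu` are introduced here for illustration only and do not appear in the original script.

```shell
#!/usr/bin/env bash
# Same values as in the script above.
WORKER_BASE_PORT=30010
CUDA_DEVICE_BASE=0
NUM_LM_WORKER=1
NUM_RM_WORKER=1

for i in $(seq 0 $((NUM_RM_WORKER-1)))
do
    # Same expressions as the value-worker loop; rm_port and rm_gpu are illustrative names.
    rm_port=$((i+WORKER_BASE_PORT+NUM_LM_WORKER))
    rm_gpu=$((i+NUM_LM_WORKER+CUDA_DEVICE_BASE))
    echo "value worker $i: port $rm_port, CUDA_VISIBLE_DEVICES=$rm_gpu"
done
```

With these settings the reward model worker is placed on port 30011 and GPU index 1, so one thing that may be worth checking (for example with `nvidia-smi`) is whether that GPU is free when the worker starts. This is only one possibility, not a confirmed cause.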

Right after running the script, I can see that both worker services have started:
luozeyu 1269640 514008 0 14:31 ? 00:00:00 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python /data/luozeyu/.vscode-server/extensions/ms-python.autopep8-2025.2.0/bundled/tool/lsp_server.py
luozeyu 3200562 3200447 42 20:58 pts/0 00:00:03 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m fastchat.serve.controller --port 28777 --host 0.0.0.0
luozeyu 3201088 3200953 99 20:58 pts/10 00:00:10 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.model_worker --model-path /data1/luozeyu/hf_models/Qwen2.5-Math-1.5B-Instruct --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010
luozeyu 3201089 3200962 99 20:58 pts/18 00:00:07 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /data1/luozeyu/hf_models/checkpoint-117 --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
luozeyu 3201403 2870305 0 20:58 pts/3 00:00:00 grep --color=auto open_reasoner
But then the RM service disappears almost immediately:
luozeyu 1269640 514008 0 14:31 ? 00:00:00 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python /data/luozeyu/.vscode-server/extensions/ms-python.autopep8-2025.2.0/bundled/tool/lsp_server.py
luozeyu 3200562 3200447 31 20:58 pts/0 00:00:03 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m fastchat.serve.controller --port 28777 --host 0.0.0.0
luozeyu 3201088 3200953 99 20:58 pts/10 00:00:14 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.model_worker --model-path /data1/luozeyu/hf_models/Qwen2.5-Math-1.5B-Instruct --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010
luozeyu 3201691 2870305 0 20:58 pts/3 00:00:00 grep --color=auto open_reasoner
What are the possible causes of this problem, and how should I fix it? Any guidance would be much appreciated.
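
One way to narrow this down is to run the reward model worker in the foreground and keep its output, so that whatever error makes it exit is not lost. The sketch below is a minimal helper for that; `run_and_log` and the log file name are hypothetical names introduced here, not part of the repository.

```shell
#!/usr/bin/env bash
# Run a command in the foreground, write stdout and stderr to a log file,
# and report the exit status. run_and_log is a hypothetical helper name.
run_and_log() {
    local logfile=$1
    shift
    local status=0
    "$@" >"$logfile" 2>&1 || status=$?
    echo "exit status: $status (log: $logfile)"
    return "$status"
}

# Example invocation with the values from the script above (left commented out):
# CUDA_VISIBLE_DEVICES=1 run_and_log rm_worker.log python3 -m reason.llm_service.workers.reward_model_worker \
#     --model-path /data1/luozeyu/hf_models/checkpoint-117 \
#     --controller-address http://0.0.0.0:28777 \
#     --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
```

Because the script starts the worker with `tmux send-keys`, the shell in that tmux window should still be open after the Python process exits, so the error output can probably also be read with `tmux attach -t FastChat` or `tmux capture-pane -p -t FastChat:value_worker`.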

Expected behavior

I expect normal output. My guess is that because this service failed to start properly, running scripts such as cot_greedy.sh fails with the error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
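
To test that guess, the controller can be asked which workers are currently registered. The sketch below assumes the FastChat controller exposes a `POST /list_models` endpoint that returns a JSON object listing model names, and that the reward model worker registers under a name derived from its model path (here `checkpoint-117`). Both points are assumptions to verify against the installed code. The helper names `list_models` and `model_registered` are hypothetical.

```shell
#!/usr/bin/env bash
# Ask the controller for its registered models.
# Assumption: the controller serves POST /list_models, as in FastChat.
list_models() {
    local host=$1 port=$2
    curl -s -X POST "http://$host:$port/list_models"
}

# Return 0 if the quoted model name appears in the given JSON response text.
# Uses a fixed-string match so dots in the model name are not treated as regex.
model_registered() {
    local response=$1 name=$2
    printf '%s' "$response" | grep -qF "\"$name\""
}

# Example usage with the values from the script above (left commented out):
# response=$(list_models 0.0.0.0 28777)
# model_registered "$response" "checkpoint-117" || echo "reward model worker is not registered"
```

If the reward model is missing from the response, requests addressed to it would have nothing to answer them, which would be consistent with the client receiving a body it cannot parse as JSON.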

Labels: bug (Something isn't working)