System Info
Linux
python 3.10
A6000*3
Who can help?
No response
Information
Tasks
Reproduction
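When I run the create_service_qwen2.5_math_hf.sh script, the RM model's service appears briefly and then disappears immediately. My script is shown below (I have already modified --LM and --RM):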
set -e
HOST_ADDR=0.0.0.0
CONTROLER_PORT=28777
WORKER_BASE_PORT=30010
echo PYTHON_EXECUTABLE=$(which python3)
PYTHON_EXECUTABLE=$(which python3)
MODEL_BASE=/data1/luozeyu/hf_models
CUDA_DEVICE_BASE=0
POLICY_MODEL_NAME=Qwen2.5-Math-1.5B-Instruct
VALUE_MODEL_NAME=checkpoint-117
MODEL_PATH=$MODEL_BASE/$POLICY_MODEL_NAME
VALUE_MODEL_PATH=$MODEL_BASE/$VALUE_MODEL_NAME
LOGDIR=logs_fastchat
tmux start-server
tmux new-session -s FastChat -n controller -d
tmux send-keys "export LOGDIR=${LOGDIR}" Enter
tmux send-keys "$PYTHON_EXECUTABLE -m fastchat.serve.controller --port ${CONTROLER_PORT} --host $HOST_ADDR" Enter
NUM_LM_WORKER=1
NUM_RM_WORKER=1
echo "Wait 5 seconds ..."
sleep 5
echo "Starting workers"
for i in $(seq 0 $((NUM_LM_WORKER-1)))
do
WORKER_PORT=$((WORKER_BASE_PORT+i))
tmux new-window -n policy_worker_$i
tmux send-keys "export LOGDIR=${LOGDIR}" Enter
tmux send-keys "CUDA_VISIBLE_DEVICES=$((i+CUDA_DEVICE_BASE)) $PYTHON_EXECUTABLE -m reason.llm_service.workers.model_worker --model-path $MODEL_PATH --controller-address http://$HOST_ADDR:$CONTROLER_PORT --host $HOST_ADDR --port $WORKER_PORT --worker-address http://$HOST_ADDR:$WORKER_PORT" Enter
done
# start value service
for i in $(seq 0 $((NUM_RM_WORKER-1)))
do
WORKER_PORT=$((i+WORKER_BASE_PORT+NUM_LM_WORKER))
tmux new-window -n value_worker
tmux send-keys "export LOGDIR=${LOGDIR}" Enter
tmux send-keys "CUDA_VISIBLE_DEVICES=$((i+NUM_LM_WORKER+CUDA_DEVICE_BASE)) $PYTHON_EXECUTABLE -m reason.llm_service.workers.reward_model_worker --model-path $VALUE_MODEL_PATH --controller-address http://$HOST_ADDR:$CONTROLER_PORT --host $HOST_ADDR --port $WORKER_PORT --worker-address http://$HOST_ADDR:$WORKER_PORT" Enter
done
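To make the index arithmetic in the two loops easier to check, here is a small sketch that prints which GPU and port each worker would get. It only repeats the expressions from the script with the values posted above; the function name `list_assignments` is made up for illustration and is not part of the repository.

```shell
#!/usr/bin/env bash
# Values copied from the script above.
WORKER_BASE_PORT=30010
CUDA_DEVICE_BASE=0
NUM_LM_WORKER=1
NUM_RM_WORKER=1

# Print "<role> <index> <gpu> <port>" for every worker the two loops would start.
# list_assignments is a hypothetical helper, not something the repo provides.
list_assignments() {
  local i
  for i in $(seq 0 $((NUM_LM_WORKER-1))); do
    echo "policy $i $((i+CUDA_DEVICE_BASE)) $((WORKER_BASE_PORT+i))"
  done
  for i in $(seq 0 $((NUM_RM_WORKER-1))); do
    echo "value $i $((i+NUM_LM_WORKER+CUDA_DEVICE_BASE)) $((i+WORKER_BASE_PORT+NUM_LM_WORKER))"
  done
}

list_assignments
```

With these values the reward worker is placed on GPU 1 and port 30011, which matches the port in the process listing, so the assignment itself looks consistent.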
Right after the script finishes, I can see that both worker services have started:
luozeyu 1269640 514008 0 14:31 ? 00:00:00 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python /data/luozeyu/.vscode-server/extensions/ms-python.autopep8-2025.2.0/bundled/tool/lsp_server.py
luozeyu 3200562 3200447 42 20:58 pts/0 00:00:03 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m fastchat.serve.controller --port 28777 --host 0.0.0.0
luozeyu 3201088 3200953 99 20:58 pts/10 00:00:10 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.model_worker --model-path /data1/luozeyu/hf_models/Qwen2.5-Math-1.5B-Instruct --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010
luozeyu 3201089 3200962 99 20:58 pts/18 00:00:07 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /data1/luozeyu/hf_models/checkpoint-117 --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
luozeyu 3201403 2870305 0 20:58 pts/3 00:00:00 grep --color=auto open_reasoner
But immediately afterwards the RM service disappears:
luozeyu 1269640 514008 0 14:31 ? 00:00:00 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python /data/luozeyu/.vscode-server/extensions/ms-python.autopep8-2025.2.0/bundled/tool/lsp_server.py
luozeyu 3200562 3200447 31 20:58 pts/0 00:00:03 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m fastchat.serve.controller --port 28777 --host 0.0.0.0
luozeyu 3201088 3200953 99 20:58 pts/10 00:00:14 /data1/luozeyu/anaconda3/envs/open_reasoner/bin/python3 -m reason.llm_service.workers.model_worker --model-path /data1/luozeyu/hf_models/Qwen2.5-Math-1.5B-Instruct --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010
luozeyu 3201691 2870305 0 20:58 pts/3 00:00:00 grep --color=auto open_reasoner
What are the possible causes of this problem, and how should I fix it? Any guidance would be much appreciated.
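Since the worker is started inside a tmux window, whatever it printed before exiting should still be in that window. A minimal sketch for dumping it, assuming the session and window names from the script (`FastChat` and `value_worker`); `dump_pane` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Print the last lines of a tmux window so a crashed worker's output can be read.
# dump_pane is a hypothetical helper; the target format is "session:window".
dump_pane() {
  local target="$1" lines="${2:-200}"
  if ! command -v tmux >/dev/null 2>&1; then
    echo "tmux not found" >&2
    return 2
  fi
  # Strip everything from the first ':' to get the session name.
  if ! tmux has-session -t "${target%%:*}" 2>/dev/null; then
    echo "no tmux session named ${target%%:*}" >&2
    return 1
  fi
  # -p writes to stdout, -S sets how far back in the scrollback to start.
  tmux capture-pane -p -t "$target" -S "-$lines"
}
```

For example, `dump_pane FastChat:value_worker 200` would show the last 200 lines of the reward worker's window, which is where a Python traceback would appear if the process crashed while loading the checkpoint.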
Expected behavior
I expect normal output. My guess is that because this service failed to start properly, running scripts such as cot_greedy.sh all fail with the error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
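That JSONDecodeError is what Python's `json` module raises when the text it is asked to parse is empty or not JSON, so it would be consistent with a client receiving no usable reply. One way to check the guess is to see which of the three ports from the script still accept connections. This is only a sketch: `port_open` is a hypothetical helper, and it relies on bash's `/dev/tcp` redirection, so it needs bash rather than plain sh.

```shell
#!/usr/bin/env bash
# Return 0 if something accepts TCP connections on host:port, non-zero otherwise.
# port_open is a hypothetical helper built on bash's /dev/tcp redirection.
port_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Ports taken from the script: controller, policy worker, reward worker.
for port in 28777 30010 30011; do
  if port_open 127.0.0.1 "$port"; then
    echo "port $port: listening"
  else
    echo "port $port: closed"
  fi
done
```

If 28777 and 30010 are listening but 30011 is closed, that would support the guess that the missing reward worker is behind the error.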