- checkpoints/ # Model checkpoints storage
- data/ # Training data storage
- evaluate_log/ # Evaluation logs storage
- evaluate_results/ # Evaluation results storage
- models/ # Base models (e.g., Qwen2.5-7B)
- scripts/ # Scripts for converting models to HuggingFace format
- training_log/ # Training logs storage
- TRL_sft/ # Supervised Fine-Tuning (SFT) code
- verl/ # Reinforcement Learning (RL) training code
- vllm_serve/ # vLLM-based inference service deployment code
conda create -n xxx python=3.12.9
conda activate xxxInstall vLLM and dependencies
pip install vllm==0.6.3
pip install swanlab
pip install tensordict
pip install omegaconf
pip install torchdata
pip install accelerateInstall flash-attention
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whlFollow Agentboard installation instructions
Download data
wget https://www.modelscope.cn/datasets/hkust-nlp/agentboard/resolve/master/data.tar.gzInstall CMake
conda install cmakeInstall dependencies
CMAKE_POLICY_VERSION_MINIMUM=3.5 pip install downward-faster_replan.zip
pip install TextWorld-handcoded_expert_integration.zip
pip install nltkInstall Java
conda install -c conda-forge openjdk=11Alfworld
pip install alfworld
pip install textworld==1.6.2Sciworld
pip install scienceworldNote: Modify a core library to ensure multi-environment concurrency stability
BabyAI
pip install minigridNote: Pay attention to minigrid version
# if "BabyAI-GoToRedBallGrey-v0" not in gymnasium.envs.registry.keys():
# minigrid.register_minigrid_envs()Jericho
pip install jerichoPDDL
pip install imageio
pip install matplotlib
pip install scikit-imageTrain Exp-Thinker
-
SFT Cold Start: Run in
./TRL_sftdirectorybash run_qwen.sh
-
RL Training:
- Deploy a fixed model (e.g., llama3-8B or qwen2.5-7B) as Actor service
- Replace the service API in
actor_rollout_ref.rollout.fixed_actor_apiparameter intrain_thinker_qwen_subtask.sh - Run in the current directory
bash train_thinker_qwen_subtask.sh
Key Parameters (
train_thinker_qwen_subtask.sh):Parameter Description data.train_filesTraining files data.val_filesValidation files data.train_batch_sizeTraining batch size data.val_batch_sizeValidation batch size data.max_prompt_lengthPrompt length for training data data.max_response_lengthResponse length for training data algorithm.adv_reviseAdvantage adjustment for training stability algorithm.adv_format_reward_coefAdvantage adjustment for training stability algorithm.adv_format_reward_minAdvantage adjustment for training stability algorithm.adv_other_reward_coefAdvantage adjustment for training stability algorithm.adv_other_reward_minAdvantage adjustment for training stability actor_rollout_ref.model.pathReference model path actor_rollout_ref.model.chat_templateChat template for data concatenation after rollout trainer.default_local_dirModel checkpoint save path trainer.save_freqModel checkpoint save frequency trainer.test_freqModel test frequency (-1 = no test during training) trainer.total_epochsNumber of training epochs (default: 3) actor_rollout_ref.rollout.max_turnsMaximum interaction turns between Actor and environment actor_rollout_ref.rollout.only_one_deepthink_nodeWhether to keep only one deep thinking node (default: True) actor_rollout_ref.rollout.environment.actor_lengthMaximum Actor single output length actor_rollout_ref.rollout.environment.thinker_lengthMaximum Thinker single output length actor_rollout_ref.rollout.train_actor_or_thinkerOptimization target: actor or thinker (choose one) actor_rollout_ref.rollout.fixed_actor_apiFixed Actor service API (only valid when training Thinker)
- Direct RL Training: Run in the current directory
bash train_actor_qwen.sh
-
SFT Cold Start: Run in
./TRL_sftdirectorybash run_llama.sh
-
RL Training: Run in the current directory
bash train_actor_llama.sh
- Navigate to
./vllm_servedirectory - Modify
rollout_modelparameters inrun_vllm_server.py(GPU count, max inference length, etc.) - Modify model path, service port, and GPU IDs in
run_serve.sh - Run
bash run_serve.sh
-
Navigate to the current directory, modify parameters in
evaluate_methods_with_api.py:model_name: Target model identifierACTOR_URL: Actor service APITHINKER_URL: Thinker service API
-
Adjust
test_settingsin thetest_different_settingsfunction as needed:Setting Description ReActReAct framework testing with Actor only (baseline) ReAct_TTExploreMain method using Thinker to guide Actor behavior ReAct_ReflexionReflection upon task failure (max 5 times), Test-Time Scaling baseline ReAct_Best_of_NExecute task 5 times and take best result, Test-Time Scaling baseline ReAct_TTExplore_Best_of_NOrthogonal combination of main method with Best-of-N -
Run evaluation
bash test_evaluate.sh