Code for the Test-Time Exploration Project

Directory Structure

- checkpoints/          # Model checkpoints storage
- data/                 # Training data storage
- evaluate_log/         # Evaluation logs storage
- evaluate_results/     # Evaluation results storage
- models/               # Base models (e.g., Qwen2.5-7B)
- scripts/              # Scripts for converting models to HuggingFace format
- training_log/         # Training logs storage
- TRL_sft/              # Supervised Fine-Tuning (SFT) code
- verl/                 # Reinforcement Learning (RL) training code
- vllm_serve/           # vLLM-based inference service deployment code

Environment Setup

Python Environment

conda create -n xxx python=3.12.9
conda activate xxx

Training Environment (verl)

Install vLLM and dependencies

pip install vllm==0.6.3
pip install swanlab
pip install tensordict
pip install omegaconf
pip install torchdata
pip install accelerate

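Install flash-attention from a prebuilt wheel. The command installs a local file, so the wheel must already be in the working directory; this one targets CUDA 12, PyTorch 2.4, and CPython 3.12 on Linux x86_64.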

pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
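
The wheel filename encodes the build it was compiled for, so a mismatch with the local environment can be spotted before installing. The following is a minimal sketch using only the standard library; the helper name `parse_flash_attn_wheel` is hypothetical, and the pattern is inferred from this single filename, so other wheels may not match it.

```python
import re
import sys

WHEEL = "flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"

def parse_flash_attn_wheel(filename):
    """Split a flash-attn wheel filename into its version and build tags.

    Assumes the naming pattern of the filename above (an assumption,
    not an official specification of flash-attn release names).
    """
    pattern = (
        r"^(?P<name>[^-]+)-(?P<version>[^+]+)\+cu(?P<cuda>\d+)"
        r"torch(?P<torch>[\d.]+)cxx11abi(?P<cxx11abi>TRUE|FALSE)"
        r"-(?P<python>cp\d+)-(?P<abi>cp\d+)-(?P<platform>.+)\.whl$"
    )
    match = re.match(pattern, filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()

def interpreter_tag():
    """Return the CPython tag of the running interpreter, e.g. 'cp312'."""
    return f"cp{sys.version_info.major}{sys.version_info.minor}"

tags = parse_flash_attn_wheel(WHEEL)
print(tags)
if tags["python"] != interpreter_tag():
    print(f"warning: wheel targets {tags['python']}, interpreter is {interpreter_tag()}")
```

The check covers only the Python tag; the CUDA and PyTorch tags still need to be compared by hand against the installed versions.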

Task Environment

Follow the AgentBoard installation instructions:

Download data

wget https://www.modelscope.cn/datasets/hkust-nlp/agentboard/resolve/master/data.tar.gz

Install CMake

conda install cmake

Install dependencies

CMAKE_POLICY_VERSION_MINIMUM=3.5 pip install downward-faster_replan.zip
pip install TextWorld-handcoded_expert_integration.zip 
pip install nltk

Install Java

conda install -c conda-forge openjdk=11

ALFWorld

pip install alfworld
pip install textworld==1.6.2

ScienceWorld

pip install scienceworld

Note: A core library must be modified to keep concurrent multi-environment runs stable.

BabyAI

pip install minigrid

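Note: Pay attention to the minigrid version. If the BabyAI environments are not registered in gymnasium, they can be registered explicitly with a guard like this: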

# if "BabyAI-GoToRedBallGrey-v0" not in gymnasium.envs.registry.keys():
#     minigrid.register_minigrid_envs()
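
The commented lines above describe a "register only if missing" guard. As a minimal sketch, the same logic can be written as a small helper that takes the registry and the registration function as arguments, which lets it run without minigrid installed. The helper name `ensure_env_registered` and the stand-in registry are hypothetical.

```python
def ensure_env_registered(env_id, registry, register_fn):
    """Call register_fn() only when env_id is missing from registry.

    Returns True if registration was triggered, False if env_id was
    already present. Raises if env_id is still missing afterwards.
    """
    if env_id in registry:
        return False
    register_fn()
    if env_id not in registry:
        raise RuntimeError(f"{env_id} is still not registered after register_fn()")
    return True

# Stand-in registry and registration function, for illustration only.
fake_registry = {}

def fake_register():
    fake_registry["BabyAI-GoToRedBallGrey-v0"] = object()

print(ensure_env_registered("BabyAI-GoToRedBallGrey-v0", fake_registry, fake_register))  # True
print(ensure_env_registered("BabyAI-GoToRedBallGrey-v0", fake_registry, fake_register))  # False
```

With the real libraries the call would presumably be `ensure_env_registered("BabyAI-GoToRedBallGrey-v0", gymnasium.envs.registry, minigrid.register_minigrid_envs)`, assuming the installed minigrid version exposes that function as the snippet above implies.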

Jericho

pip install jericho

PDDL

pip install imageio
pip install matplotlib
pip install scikit-image

Training

Train Exp-Thinker

SFT Cold Start: Run in the ./TRL_sft directory
```
bash run_qwen.sh
```

RL Training:

- Deploy a fixed model (e.g., LLaMA3-8B or Qwen2.5-7B) as the Actor service.
- Set the `actor_rollout_ref.rollout.fixed_actor_api` parameter in train_thinker_qwen_subtask.sh to that service's API address.
- Run the following from the repository root:

bash train_thinker_qwen_subtask.sh

Key Parameters (train_thinker_qwen_subtask.sh):

| Parameter | Description |
| --- | --- |
| `data.train_files` | Training files |
| `data.val_files` | Validation files |
| `data.train_batch_size` | Training batch size |
| `data.val_batch_size` | Validation batch size |
| `data.max_prompt_length` | Prompt length for training data |
| `data.max_response_length` | Response length for training data |
| `algorithm.adv_revise` | Advantage adjustment for training stability |
| `algorithm.adv_format_reward_coef` | Advantage adjustment for training stability |
| `algorithm.adv_format_reward_min` | Advantage adjustment for training stability |
| `algorithm.adv_other_reward_coef` | Advantage adjustment for training stability |
| `algorithm.adv_other_reward_min` | Advantage adjustment for training stability |
| `actor_rollout_ref.model.path` | Reference model path |
| `actor_rollout_ref.model.chat_template` | Chat template for data concatenation after rollout |
| `trainer.default_local_dir` | Model checkpoint save path |
| `trainer.save_freq` | Model checkpoint save frequency |
| `trainer.test_freq` | Model test frequency (-1 = no test during training) |
| `trainer.total_epochs` | Number of training epochs (default: 3) |
| `actor_rollout_ref.rollout.max_turns` | Maximum interaction turns between Actor and environment |
| `actor_rollout_ref.rollout.only_one_deepthink_node` | Whether to keep only one deep thinking node (default: True) |
| `actor_rollout_ref.rollout.environment.actor_length` | Maximum Actor single output length |
| `actor_rollout_ref.rollout.environment.thinker_length` | Maximum Thinker single output length |
| `actor_rollout_ref.rollout.train_actor_or_thinker` | Optimization target: actor or thinker (choose one) |
| `actor_rollout_ref.rollout.fixed_actor_api` | Fixed Actor service API (only valid when training Thinker) |
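
The dotted parameter names look like Hydra-style `key=value` overrides; whether train_thinker_qwen_subtask.sh passes them in exactly this form is an assumption. The sketch below renders a flat dictionary into such arguments. The helper `to_overrides` is hypothetical, and the batch size and API address are placeholders rather than recommended values.

```python
def to_overrides(params):
    """Render a flat dict of dotted parameter names as key=value arguments."""
    args = []
    for key, value in params.items():
        if isinstance(value, bool):
            # Spell booleans the way the defaults above are written (True/False).
            value = "True" if value else "False"
        args.append(f"{key}={value}")
    return args

example = {
    "data.train_batch_size": 8,                                 # placeholder value
    "trainer.total_epochs": 3,                                  # default listed above
    "trainer.test_freq": -1,                                    # no test during training
    "actor_rollout_ref.rollout.only_one_deepthink_node": True,  # default listed above
    "actor_rollout_ref.rollout.train_actor_or_thinker": "thinker",
    "actor_rollout_ref.rollout.fixed_actor_api": "http://localhost:8000",  # placeholder address
}

for arg in to_overrides(example):
    print(arg)
```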

Train Qwen2.5-Actor

Direct RL Training: Run in the current directory
```
bash train_actor_qwen.sh
```

Train LLaMA3-Actor

SFT Cold Start: Run in the ./TRL_sft directory
```
bash run_llama.sh
```
RL Training: Run in the current directory
```
bash train_actor_llama.sh
```

Inference

Deploy Model Inference Service

- Navigate to the ./vllm_serve directory.
- Modify the rollout_model parameters in run_vllm_server.py (GPU count, maximum inference length, etc.).
- Modify the model path, service port, and GPU IDs in run_serve.sh.
- Start the service:
```
bash run_serve.sh
```
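
Before pointing training or evaluation at the service, it can help to confirm that its port accepts connections. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, and it checks TCP reachability only, not whether the model has finished loading.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out; retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Example call (placeholder host and port; use the values set in run_serve.sh):
# wait_for_port("127.0.0.1", 8000, timeout=120.0)
```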

Model Evaluation

From the repository root, modify the following parameters in evaluate_methods_with_api.py:
- model_name: Target model identifier
- ACTOR_URL: Actor service API
- THINKER_URL: Thinker service API

Adjust test_settings in the test_different_settings function as needed:

| Setting | Description |
| --- | --- |
| `ReAct` | ReAct framework testing with Actor only (baseline) |
| `ReAct_TTExplore` | Main method using Thinker to guide Actor behavior |
| `ReAct_Reflexion` | Reflection upon task failure (max 5 times), Test-Time Scaling baseline |
| `ReAct_Best_of_N` | Execute task 5 times and take best result, Test-Time Scaling baseline |
| `ReAct_TTExplore_Best_of_N` | Orthogonal combination of main method with Best-of-N |
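
The control flow of the two Test-Time Scaling baselines can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluate_methods_with_api.py: `run_once`, `run_episode`, and `reflect` are hypothetical callables, and the budget of 5 is read as "at most 5 reflections after the first attempt", which the table does not spell out.

```python
def best_of_n(run_once, n=5):
    """Run the task n times independently and keep the highest score."""
    return max(run_once() for _ in range(n))

def reflexion(run_episode, reflect, max_reflections=5):
    """Retry after each failure, feeding accumulated reflections into the next attempt.

    run_episode(hints) -> (score, success); reflect(hints) -> str.
    Returns the final score and the number of reflections used.
    """
    hints = []
    score, success = run_episode(hints)
    while not success and len(hints) < max_reflections:
        hints.append(reflect(hints))
        score, success = run_episode(hints)
    return score, len(hints)

def toy_episode(hints):
    # Toy stand-in for an environment rollout: succeeds once two reflections exist.
    success = len(hints) >= 2
    return (1.0 if success else 0.2 * len(hints)), success

attempts = iter([0.2, 0.8, 0.5, 0.1, 0.6])
print(best_of_n(lambda: next(attempts)))                # 0.8
print(reflexion(toy_episode, lambda hints: "try again"))  # (1.0, 2)
```

Best-of-N always spends its full budget, while the reflection loop stops at the first success, so the two baselines can differ in cost even with the same limit of 5.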

Run evaluation
```
bash test_evaluate.sh
```
