RUCBM/TTExplore

Code for the Test-Time Exploration (TTExplore) project

Directory Structure

- checkpoints/          # Model checkpoints storage
- data/                 # Training data storage
- evaluate_log/         # Evaluation logs storage
- evaluate_results/     # Evaluation results storage
- models/               # Base models (e.g., Qwen2.5-7B)
- scripts/              # Scripts for converting models to HuggingFace format
- training_log/         # Training logs storage
- TRL_sft/              # Supervised Fine-Tuning (SFT) code
- verl/                 # Reinforcement Learning (RL) training code
- vllm_serve/           # vLLM-based inference service deployment code

Environment Setup

Python Environment

conda create -n xxx python=3.12.9
conda activate xxx

Training Environment (verl)

Install vLLM and dependencies

pip install vllm==0.6.3
pip install swanlab
pip install tensordict
pip install omegaconf
pip install torchdata
pip install accelerate

Install flash-attention

pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
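
The wheel filename above encodes the build it targets (CUDA 12, PyTorch 2.4, CPython 3.12, Linux x86_64), so it is worth checking that the active environment matches before installing. The following is a minimal sketch, not part of the repository: the helper name `parse_flash_attn_wheel` is hypothetical, and the regular expression only assumes the naming pattern of this particular file.

```python
import re
import sys

WHEEL = "flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"

def parse_flash_attn_wheel(filename):
    """Split a flash-attn wheel filename into its build tags (hypothetical helper)."""
    pattern = (
        r"flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
        r"cxx11abi(?P<cxx11abi>TRUE|FALSE)-(?P<python>cp\d+)-cp\d+-(?P<platform>.+)\.whl"
    )
    match = re.fullmatch(pattern, filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()

tags = parse_flash_attn_wheel(WHEEL)
# Tag of the interpreter running this script, e.g. "cp312" for Python 3.12.
running = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(tags)
print("interpreter matches wheel:", running == tags["python"])
```

If the last line prints `False`, the conda environment is not on the Python version the wheel was built for.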

Task Environment

Follow the AgentBoard installation instructions

Download data

wget https://www.modelscope.cn/datasets/hkust-nlp/agentboard/resolve/master/data.tar.gz

Install CMake

conda install cmake

Install dependencies

CMAKE_POLICY_VERSION_MINIMUM=3.5 pip install downward-faster_replan.zip
pip install TextWorld-handcoded_expert_integration.zip
pip install nltk

Install Java

conda install -c conda-forge openjdk=11

ALFWorld

pip install alfworld
pip install textworld==1.6.2

ScienceWorld

pip install scienceworld

Note: A core library needs to be modified to keep concurrent multi-environment runs stable

BabyAI

pip install minigrid

Note: Pay attention to the minigrid version

# if "BabyAI-GoToRedBallGrey-v0" not in gymnasium.envs.registry.keys():
#     minigrid.register_minigrid_envs()

Jericho

pip install jericho

PDDL

pip install imageio
pip install matplotlib
pip install scikit-image

Training

  1. SFT Cold Start: Run in the ./TRL_sft directory

    bash run_qwen.sh
  2. RL Training:

    • Deploy a fixed model (e.g., LLaMA3-8B or Qwen2.5-7B) as the Actor service
    • Set the actor_rollout_ref.rollout.fixed_actor_api parameter in train_thinker_qwen_subtask.sh to that service's API
    • Run in the current directory
    bash train_thinker_qwen_subtask.sh

    Key Parameters (train_thinker_qwen_subtask.sh):

    Parameter                                             Description
    data.train_files                                      Training files
    data.val_files                                        Validation files
    data.train_batch_size                                 Training batch size
    data.val_batch_size                                   Validation batch size
    data.max_prompt_length                                Prompt length for training data
    data.max_response_length                              Response length for training data
    algorithm.adv_revise                                  Advantage adjustment for training stability
    algorithm.adv_format_reward_coef                      Advantage adjustment for training stability
    algorithm.adv_format_reward_min                       Advantage adjustment for training stability
    algorithm.adv_other_reward_coef                       Advantage adjustment for training stability
    algorithm.adv_other_reward_min                        Advantage adjustment for training stability
    actor_rollout_ref.model.path                          Reference model path
    actor_rollout_ref.model.chat_template                 Chat template for data concatenation after rollout
    trainer.default_local_dir                             Model checkpoint save path
    trainer.save_freq                                     Model checkpoint save frequency
    trainer.test_freq                                     Model test frequency (-1 = no test during training)
    trainer.total_epochs                                  Number of training epochs (default: 3)
    actor_rollout_ref.rollout.max_turns                   Maximum interaction turns between Actor and environment
    actor_rollout_ref.rollout.only_one_deepthink_node     Whether to keep only one deep thinking node (default: True)
    actor_rollout_ref.rollout.environment.actor_length    Maximum Actor single output length
    actor_rollout_ref.rollout.environment.thinker_length  Maximum Thinker single output length
    actor_rollout_ref.rollout.train_actor_or_thinker      Optimization target: actor or thinker (choose one)
    actor_rollout_ref.rollout.fixed_actor_api             Fixed Actor service API (only valid when training Thinker)
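
The parameter names above are dotted keys, which suggests they are passed to the trainer as `key=value` overrides, as is usual for verl-style launch scripts. The sketch below only illustrates that shape and is an assumption about how train_thinker_qwen_subtask.sh is organized: the helper `to_overrides` and the service URL are hypothetical, and the values are placeholders except for the two defaults stated in the table.

```python
def to_overrides(params):
    """Render a flat dict of dotted keys as key=value command-line overrides."""
    # str() of a Python bool yields True/False, the notation used in the table.
    return [f"{key}={value}" for key, value in params.items()]

# Placeholder values for illustration only. The table documents
# trainer.total_epochs=3 and only_one_deepthink_node=True as defaults;
# everything else here is an assumption.
example = {
    "trainer.total_epochs": 3,
    "trainer.test_freq": -1,  # -1 = no test during training
    "actor_rollout_ref.rollout.train_actor_or_thinker": "thinker",
    "actor_rollout_ref.rollout.only_one_deepthink_node": True,
    "actor_rollout_ref.rollout.fixed_actor_api": "http://localhost:8000",  # hypothetical URL
}

for line in to_overrides(example):
    print(line)
```

Because fixed_actor_api is only valid when training the Thinker, it is paired here with train_actor_or_thinker set to thinker.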

Train Qwen2.5-Actor

  1. Direct RL Training: Run in the current directory
    bash train_actor_qwen.sh

Train LLaMA3-Actor

  1. SFT Cold Start: Run in the ./TRL_sft directory

    bash run_llama.sh
  2. RL Training: Run in the current directory

    bash train_actor_llama.sh

Inference

Deploy Model Inference Service

  1. Navigate to the ./vllm_serve directory
  2. Modify rollout_model parameters in run_vllm_server.py (GPU count, max inference length, etc.)
  3. Modify model path, service port, and GPU IDs in run_serve.sh
  4. Run
    bash run_serve.sh
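
Both RL training with a fixed Actor and evaluation depend on the service being up, so it can help to wait for the port before starting them. This is a minimal sketch using only the Python standard library. It checks that the TCP port accepts connections, not that the model has finished loading; the helper name `wait_for_port`, the host, and the port are assumptions, so use the values configured in run_serve.sh.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

if __name__ == "__main__":
    # Hypothetical host and port, shown for illustration only.
    print(wait_for_port("127.0.0.1", 8000, timeout=3.0))
```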

Model Evaluation

  1. In the current directory, modify the following parameters in evaluate_methods_with_api.py:

    • model_name: Target model identifier
    • ACTOR_URL: Actor service API
    • THINKER_URL: Thinker service API
  2. Adjust test_settings in the test_different_settings function as needed:

    Setting                    Description
    ReAct                      ReAct framework with the Actor only (baseline)
    ReAct_TTExplore            Main method: the Thinker guides the Actor's behavior
    ReAct_Reflexion            Reflects after each task failure (up to 5 times); test-time scaling baseline
    ReAct_Best_of_N            Runs each task 5 times and keeps the best result; test-time scaling baseline
    ReAct_TTExplore_Best_of_N  Main method combined with Best-of-N
  3. Run evaluation

    bash test_evaluate.sh
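
The two test-time scaling baselines in the settings table can be sketched as follows. This is an illustration of the general techniques, not the code in evaluate_methods_with_api.py: `run_episode` and `reflect` are hypothetical callables, scores are assumed to lie in [0, 1] with 1.0 meaning success, and "max 5 times" is read here as up to five reflections after the first attempt.

```python
def best_of_n(run_episode, n=5):
    """Run the task n times and return the highest score."""
    return max(run_episode(attempt) for attempt in range(n))

def with_reflexion(run_episode, reflect, max_reflections=5, success=1.0):
    """Retry after each failure, passing a reflection into the next attempt."""
    reflection = None
    score = run_episode(reflection)
    for _ in range(max_reflections):
        if score >= success:
            break
        # Summarize what went wrong, then try again with that hint.
        reflection = reflect(score)
        score = run_episode(reflection)
    return score

# Stand-in episode scores for illustration; a real run would query the
# Actor (and, for TTExplore, the Thinker) services.
scores = [0.2, 0.9, 0.4, 0.1, 0.6]
print(best_of_n(lambda attempt: scores[attempt], n=5))
```

Under this reading, ReAct_TTExplore_Best_of_N would amount to passing a Thinker-guided episode function to `best_of_n`.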

About

Test-Time Deep Thinking to Explore Implicit Rules
