Why is the reward greater than 1?
@chenllliang, could you please help with this? Please check my training bash script; I am also attaching my WandB plots here.
Here is my training bash script:

```bash
export DEBUG_MODE="true"
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node=4 \
    R1-V/src/r1-v/src/open_r1/grpo.py \
    --output_dir clever \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 True \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 1 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k-orginal \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 4 \
    --temperature 1.0
```
Here are my training plots (WandB reward curves attached as images):
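For context on why the logged reward could exceed 1: one common setup (an assumption here, not confirmed from this script alone) is that the GRPO trainer logs the *sum* of several reward functions, e.g. an accuracy reward plus a format reward, each capped at 1.0. The sketch below uses hypothetical helper functions (`accuracy_reward`, `format_reward`) purely to illustrate how a per-completion reward of up to 2.0 can arise:

```python
# Hypothetical sketch: if the trainer sums multiple reward components,
# each in [0, 1], the total logged reward can exceed 1.

def accuracy_reward(completion: str, answer: str) -> float:
    # 1.0 if the ground-truth answer appears in the completion, else 0.0
    # (hypothetical stand-in for an answer-extraction/matching reward)
    return 1.0 if answer in completion else 0.0

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows a <think>...</think><answer>...</answer>
    # template, else 0.0 (hypothetical stand-in for a format-check reward)
    return 1.0 if "<think>" in completion and "<answer>" in completion else 0.0

completion = "<think>3 cubes plus 2 spheres</think><answer>5</answer>"
total = accuracy_reward(completion, "5") + format_reward(completion)
print(total)  # 2.0 -- greater than 1 even though each component is capped at 1
```

If your reward function is set up this way, a mean reward between 1 and 2 simply means the model is earning both components on most samples.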