Reproduce results with leonardPKU/clevr_cogen_a_train Qwen2-VL-2B-Instruct #190

@GilalNauman

Why is the reward greater than 1?
@chenllliang, could you please help with this? Please check my training bash script; I am also attaching my W&B (wandb) plots.

Here is my training bash script:

export DEBUG_MODE="true"
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node=4 \
    R1-V/src/r1-v/src/open_r1/grpo.py \
    --output_dir clever \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 True \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 1 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k-orginal \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 4 \
    --temperature 1.0

Here are my training plots:

[Three images: W&B training plots]
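On the question of why the reward exceeds 1, one possibility worth checking is that the logged reward is a sum over more than one reward function. The sketch below is not the code from grpo.py; it assumes two hypothetical reward functions (an accuracy check and a format check), each returning 0 or 1, whose values are added per completion. Under that assumption the total can reach 2 even though each component is capped at 1. The function names, tag layout, and sample completion are all illustrative assumptions.

```python
import re


def accuracy_reward(completion: str, solution: str) -> float:
    """Hypothetical check: 1.0 if the text inside <answer>...</answer> equals the solution."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == solution.strip():
        return 1.0
    return 0.0


def format_reward(completion: str) -> float:
    """Hypothetical check: 1.0 if the completion is exactly <think>...</think><answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0


def total_reward(completion: str, solution: str) -> float:
    """Assumption: the trainer adds the individual rewards for each completion."""
    return accuracy_reward(completion, solution) + format_reward(completion)


# Illustrative completion, not taken from the dataset or the logs above.
good = "<think>Three cubes and one sphere.</think> <answer>4</answer>"
print(total_reward(good, "4"))  # 2.0
```

If this matches the setup, plotting the per-function rewards separately (rather than only the total) would show whether each one stays within [0, 1].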
