Why is the reward greater than 1?
@chenllliang, could you please help with this? Please check my training bash script; I am also attaching my WandB plots here.
Here is my training bash script:

```bash
export DEBUG_MODE="true"
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node=4 \
    R1-V/src/r1-v/src/open_r1/grpo.py \
    --output_dir clever \
    --model_name_or_path Qwen/Qwen2-VL-2B-Instruct \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --logging_steps 1 \
    --bf16 True \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 1 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k-orginal \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 4 \
    --temperature 1.0
```
Here are my training plots (WandB reward curves attached as images):
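For context on why the logged reward could exceed 1: one common setup (an assumption here, not confirmed from this script alone) is that the GRPO trainer logs the *sum* of several reward functions, e.g. an accuracy reward plus a format reward, each capped at 1.0. The sketch below uses hypothetical helper functions (`accuracy_reward`, `format_reward`) purely to illustrate how a per-completion reward of up to 2.0 can arise:

```python
# Hypothetical sketch: if the trainer sums multiple reward components,
# each in [0, 1], the total logged reward can exceed 1.

def accuracy_reward(completion: str, answer: str) -> float:
    # 1.0 if the ground-truth answer appears in the completion, else 0.0
    # (hypothetical stand-in for an answer-extraction/matching reward)
    return 1.0 if answer in completion else 0.0

def format_reward(completion: str) -> float:
    # 1.0 if the completion follows a <think>...</think><answer>...</answer>
    # template, else 0.0 (hypothetical stand-in for a format-check reward)
    return 1.0 if "<think>" in completion and "<answer>" in completion else 0.0

completion = "<think>3 cubes plus 2 spheres</think><answer>5</answer>"
total = accuracy_reward(completion, "5") + format_reward(completion)
print(total)  # 2.0 -- greater than 1 even though each component is capped at 1
```

If your reward function is set up this way, a mean reward between 1 and 2 simply means the model is earning both components on most samples.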