"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": true
}
Gradient accumulation steps mismatch: GradientAccumulationPlugin has 1, DeepSpeed config has 4. Using DeepSpeed's value.
wandb: Currently logged in as: zhaoyuan to https://api.wandb.ai. Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.21.1
wandb: Run data is saved locally in /home/ubuntun/disk7T/zy/code/Other/Seg-R1/wandb/run-20250828_211617-6iu5tknw
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run Seg-R1
wandb: ⭐️ View project at https://wandb.ai/zhaoyuan/huggingface
wandb: 🚀 View run at https://wandb.ai/zhaoyuan/huggingface/runs/6iu5tknw
0%| | 0/25 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/grpo.py", line 507, in <module>
main(script_args, training_args, model_args)
File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/grpo.py", line 500, in main
trainer.train()
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
return inner_training_loop(
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 3692, in training_step
inputs = self._prepare_inputs(inputs)
File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/trainer/vllm_grpo_trainer_modified.py", line 443, in _prepare_inputs
prompt_inputs = self.processing_class(
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2877, in __call__
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2965, in _call_one
return self.batch_encode_plus(
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3167, in batch_encode_plus
return self._batch_encode_plus(
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images'
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/grpo.py", line 507, in <module>
[rank0]: main(script_args, training_args, model_args)
[rank0]: File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/grpo.py", line 500, in main
[rank0]: trainer.train()
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/trainer.py", line 3692, in training_step
[rank0]: inputs = self._prepare_inputs(inputs)
[rank0]: File "/home/ubuntun/disk7T/zy/code/Other/Seg-R1/seg-r1/src/open_r1/trainer/vllm_grpo_trainer_modified.py", line 443, in _prepare_inputs
[rank0]: prompt_inputs = self.processing_class(
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2877, in __call__
[rank0]: encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2965, in _call_one
[rank0]: return self.batch_encode_plus(
[rank0]: File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3167, in batch_encode_plus
[rank0]: return self._batch_encode_plus(
[rank0]: TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images'
wandb:
wandb: 🚀 View run Seg-R1 at: https://wandb.ai/zhaoyuan/huggingface/runs/6iu5tknw
wandb: Find logs at: ../../../../ubuntun/disk7T/zy/code/Other/Seg-R1/wandb/run-20250828_211617-6iu5tknw/logs
[rank0]:[W828 21:16:28.937748560 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
E0828 21:16:31.519000 1441441 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1441530) of binary: /home/ubuntun/anaconda3/envs/SegR1/bin/python3.10
Traceback (most recent call last):
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntun/anaconda3/envs/SegR1/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
seg-r1/src/open_r1/grpo.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-08-28_21:16:31
host : ubuntun-TG659V2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1441530)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Process finished with exit code 1