
Unable to train using 1 GPU #234

@amira-essawy

Description


I am trying to run `train.py` on a single GPU with this command:

```
python tools/train.py configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py --gpus 1 --cfg-options fold=1 percent=10
```

The training starts and runs until iteration 4000, then stops with the error below. I am facing this problem on both the COCO dataset and my custom dataset.

```
2023-02-09 13:00:26,292 - mmdet.ssod - INFO - Saving checkpoint at 4000 iterations
2023-02-09 13:00:36,802 - mmdet.ssod - INFO - Exp name: cv3.py
2023-02-09 13:00:36,803 - mmdet.ssod - INFO - Iter [4000/1080000] lr: 1.000e-02, eta: 9598 days, 18:19:02, time: 15.415, data_time: 0.941, memory: 6573, ema_momentum: 0.9990, sup_loss_rpn_cls: 0.0315, sup_loss_rpn_bbox: 0.0125, sup_loss_cls: 0.0654, sup_acc: 97.9980, sup_loss_bbox: 0.0812, loss: 0.1906
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 186, in main
    train_detector(
  File "/root/workspace/amiras/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 38, in after_train_iter
    self._do_evaluate(runner)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 52, in _do_evaluate
    dist.broadcast(module.running_var, 0)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1399, in broadcast
    default_pg = _get_default_group()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
```

Is the code not compatible with a single GPU?
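For context, the traceback shows `_do_evaluate` in `ssod/utils/hooks/submodules_evaluation.py` calling `dist.broadcast` unconditionally, and `torch.distributed.broadcast` raises exactly this `RuntimeError` when `init_process_group` was never called, which is the case for a plain single-GPU `python tools/train.py` run. A minimal sketch of a guard that would make the broadcast a no-op in non-distributed runs (the helper name `broadcast_if_distributed` is my own, not from the repo):

```python
import torch.distributed as dist

def broadcast_if_distributed(tensor, src=0):
    """Broadcast `tensor` from rank `src` only when a default process
    group exists; otherwise do nothing.

    In a non-distributed run, torch.distributed.init_process_group was
    never called, so calling dist.broadcast directly raises
    "Default process group has not been initialized".
    """
    if dist.is_available() and dist.is_initialized():
        dist.broadcast(tensor, src)
```

Replacing the bare `dist.broadcast(module.running_var, 0)` call in the hook with such a guard is one option; another is to launch the same config through the distributed launcher with a world size of 1, so that a process group is actually initialized before the hook runs.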
