
Unable to train using 1 GPU #234

@amira-essawy

Description


I am trying to run `train.py` on a single GPU with this command:

```
python tools/train.py configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py --gpus 1 --cfg-options fold=1 percent=10
```

The training starts and runs until iteration 4000, then stops with the error below. I am facing this problem on both the COCO dataset and my custom dataset.

```
2023-02-09 13:00:26,292 - mmdet.ssod - INFO - Saving checkpoint at 4000 iterations
2023-02-09 13:00:36,802 - mmdet.ssod - INFO - Exp name: cv3.py
2023-02-09 13:00:36,803 - mmdet.ssod - INFO - Iter [4000/1080000] lr: 1.000e-02, eta: 9598 days, 18:19:02, time: 15.415, data_time: 0.941, memory: 6573, ema_momentum: 0.9990, sup_loss_rpn_cls: 0.0315, sup_loss_rpn_bbox: 0.0125, sup_loss_cls: 0.0654, sup_acc: 97.9980, sup_loss_bbox: 0.0812, loss: 0.1906
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 186, in main
    train_detector(
  File "/root/workspace/amiras/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 38, in after_train_iter
    self._do_evaluate(runner)
  File "/root/workspace/amiras/SoftTeacher/ssod/utils/hooks/submodules_evaluation.py", line 52, in _do_evaluate
    dist.broadcast(module.running_var, 0)
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1399, in broadcast
    default_pg = _get_default_group()
  File "/opt/conda/envs/openmmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
```

Is the code not compatible with a single GPU?
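For context, the traceback shows `_do_evaluate` in `ssod/utils/hooks/submodules_evaluation.py` calling `dist.broadcast` unconditionally, and `torch.distributed.broadcast` raises exactly this `RuntimeError` when `init_process_group` was never called, which is the case for a plain single-GPU `python tools/train.py` run. A minimal sketch of a guard that would make the broadcast a no-op in non-distributed runs (the helper name `broadcast_if_distributed` is my own, not from the repo):

```python
import torch.distributed as dist

def broadcast_if_distributed(tensor, src=0):
    """Broadcast `tensor` from rank `src` only when a default process
    group exists; otherwise do nothing.

    In a non-distributed run, torch.distributed.init_process_group was
    never called, so calling dist.broadcast directly raises
    "Default process group has not been initialized".
    """
    if dist.is_available() and dist.is_initialized():
        dist.broadcast(tensor, src)
```

Replacing the bare `dist.broadcast(module.running_var, 0)` call in the hook with such a guard is one option; another is to launch the same config through the distributed launcher with a world size of 1, so that a process group is actually initialized before the hook runs.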
