Multi-gpu, error with lr_scheduling, involving iter and steps

Hi. Thanks for your help.

I'm hitting this error when training is halfway through (step 75000 of 150000). I suspect it is related to the number of training devices, since I'm training on 2 GPUs. With 4 GPUs, it stops at 25% progress instead. Any ideas on how to fix it?

```
[Train Step] 75000/150000: train/disc_loss: 0.89815, train/logits_real: -0.02255, train/logits_fake: -0.23431, lr: 0.00000, :  50%|▌| 75000/15                
[rank0]: Traceback (most recent call last):                                                                                                                   
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 342, in <module>                                                            
[rank0]:     main(args)                                                                                                                                       
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 100, in main                                                                
[rank0]:     trainer.train()                                                                                                                                  
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 288, in train                                                               
[rank0]:     self.lr_scheduler_ae.step()                                                                                                                      
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/accelerate/scheduler.py", line 82, in step                   
[rank0]:     self.scheduler.step(*args, **kwargs)                                                                                                             
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 241, in step              
[rank0]:     values = self.get_lr()                                                                                                                           
[rank0]:              ^^^^^^^^^^^^^                                            
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 386, in get_lr
[rank0]:     return [                  
[rank0]:            ^                  
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 387, in <listcomp>
[rank0]:     base_lr * lmbda(self.last_epoch)                                  
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^                                  
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 165, in <lambda>                                                            
[rank0]:     lr_lambda = lambda iter: max((1 - iter / train_num_steps) ** 0.95, min_lr/train_lr)                                                              
[rank0]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                              
[rank0]: TypeError: '>' not supported between instances of 'float' and 'complex'
```
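A likely explanation, offered as a sketch rather than a confirmed diagnosis: in Python 3, raising a negative float to a fractional power returns a `complex`, so `(1 - iter / train_num_steps) ** 0.95` stops being a real number as soon as `iter` exceeds `train_num_steps`, and `max()` then cannot compare it with a float. To my knowledge, a scheduler wrapped by Accelerate advances once per process on each `step()` call by default (when `split_batches` is not set), which would match the failure at 50% with 2 GPUs and 25% with 4. The example below reproduces the `complex` result and shows one possible guard, clamping the progress ratio. The values of `train_num_steps`, `train_lr`, and `min_lr` are placeholders, not taken from the actual config.

```python
# Placeholder values (assumptions); the real ones come from the training config.
train_num_steps = 150_000
train_lr = 1e-4
min_lr = 1e-6

# Once the scheduler's counter passes train_num_steps, the base is negative,
# and a negative float raised to a fractional power is complex in Python 3.
overshoot = (1 - (train_num_steps + 1) / train_num_steps) ** 0.95
print(type(overshoot))


def lr_lambda(step: int) -> float:
    """Polynomial decay with the progress ratio clamped to [0, 1]."""
    progress = min(step / train_num_steps, 1.0)
    # The base (1 - progress) is never negative now, so the result stays real.
    return max((1.0 - progress) ** 0.95, min_lr / train_lr)


# Stays at the floor even when the counter runs past train_num_steps.
print(lr_lambda(2 * train_num_steps))
```

Clamping only prevents the crash. If the scheduler really is advancing once per process, the learning rate would still reach its floor after `train_num_steps / num_gpus` training steps, so the decay horizon may also need scaling by the number of processes, or the scheduler could be left out of `accelerator.prepare()` and stepped manually.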
