Multi-gpu, error with lr_scheduling, involving iter and steps #14

@miquel-espinosa

Hi. Thanks for your help.

I hit the error below when training reaches the halfway point (step 75000 of 150000). I suspect it is related to the number of training devices: with 2 GPUs it fails at 50% progress, and with 4 GPUs it fails at 25%. Any ideas on how to fix it?

[Train Step] 75000/150000: train/disc_loss: 0.89815, train/logits_real: -0.02255, train/logits_fake: -0.23431, lr: 0.00000, :  50%|▌| 75000/15                
[rank0]: Traceback (most recent call last):                                                                                                                   
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 342, in <module>                                                            
[rank0]:     main(args)                                                                                                                                       
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 100, in main                                                                
[rank0]:     trainer.train()                                                                                                                                  
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 288, in train                                                               
[rank0]:     self.lr_scheduler_ae.step()                                                                                                                      
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/accelerate/scheduler.py", line 82, in step                   
[rank0]:     self.scheduler.step(*args, **kwargs)                                                                                                             
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 241, in step              
[rank0]:     values = self.get_lr()                                                                                                                           
[rank0]:              ^^^^^^^^^^^^^                                            
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 386, in get_lr
[rank0]:     return [                  
[rank0]:            ^                  
[rank0]:   File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 387, in <listcomp>
[rank0]:     base_lr * lmbda(self.last_epoch)                                  
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^                                  
[rank0]:   File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 165, in <lambda>                                                            
[rank0]:     lr_lambda = lambda iter: max((1 - iter / train_num_steps) ** 0.95, min_lr/train_lr)                                                              
[rank0]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                              
[rank0]: TypeError: '>' not supported between instances of 'float' and 'complex'
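
One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: in Python, raising a negative float to a non-integer power returns a complex number, so `(1 - iter / train_num_steps) ** 0.95` turns complex as soon as `iter` exceeds `train_num_steps`, and `max()` then cannot compare it with a float. The frame in `accelerate/scheduler.py` shows that the scheduler is wrapped by Accelerate, which by default may advance the underlying scheduler once per process on each call. That would match the failure at 1/2 progress with 2 GPUs and at 1/4 with 4 GPUs. The snippet below reproduces the error in isolation and shows one way to guard the lambda by clamping progress. The values of `train_lr` and `min_lr` and the name `clamped_lambda` are assumptions for illustration and are not taken from `train_vae.py`.

```python
# Standalone reproduction; no torch or accelerate needed.
train_num_steps = 150000  # matches the progress bar in the log
train_lr = 1e-4           # assumed value, not from the issue
min_lr = 1e-6             # assumed value, not from the issue


def original_lambda(it):
    # Same expression as line 165 of train_vae.py in the traceback.
    return max((1 - it / train_num_steps) ** 0.95, min_lr / train_lr)


def clamped_lambda(it):
    # Hypothetical fix: cap progress at 1.0 so the base never goes negative.
    progress = min(it / train_num_steps, 1.0)
    return max((1 - progress) ** 0.95, min_lr / train_lr)


# One step past the end of the schedule: the base is negative, the power is
# complex, and max() raises the TypeError seen in the traceback.
try:
    original_lambda(train_num_steps + 1)
    raised = False
except TypeError:
    raised = True

print("original raises TypeError:", raised)
print("clamped value past the end:", clamped_lambda(train_num_steps + 1))
```

Clamping only prevents the crash; the learning rate would still decay faster than intended if the scheduler really is stepped several times per training step. If that is the case, another option to consider is sizing the schedule for `train_num_steps` times the number of processes, or checking Accelerate's scheduler-stepping options for your installed version.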
