Hi. Thanks for your help.
I'm hitting this error when training reaches the halfway point. I suspect it is related to the number of training devices: with 2 GPUs it fails at 50% progress, and with 4 GPUs it stops at 25% progress. Any ideas on how to fix it?
[Train Step] 75000/150000: train/disc_loss: 0.89815, train/logits_real: -0.02255, train/logits_fake: -0.23431, lr: 0.00000, : 50%|▌| 75000/15
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 342, in <module>
[rank0]: main(args)
[rank0]: File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 100, in main
[rank0]: trainer.train()
[rank0]: File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 288, in train
[rank0]: self.lr_scheduler_ae.step()
[rank0]: File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/accelerate/scheduler.py", line 82, in step
[rank0]: self.scheduler.step(*args, **kwargs)
[rank0]: File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 241, in step
[rank0]: values = self.get_lr()
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 386, in get_lr
[rank0]: return [
[rank0]: ^
[rank0]: File "/localdisk/home/s2254242/miniconda3/envs/triffuser/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 387, in <listcomp>
[rank0]: base_lr * lmbda(self.last_epoch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/s2254242/projects/triffuser/ADM-Public/train_vae.py", line 165, in <lambda>
[rank0]: lr_lambda = lambda iter: max((1 - iter / train_num_steps) ** 0.95, min_lr/train_lr)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: '>' not supported between instances of 'float' and 'complex'
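One possible explanation, not confirmed: the `iter` passed to `lr_lambda` ends up larger than `train_num_steps`, for example if the wrapped scheduler advances its counter once per process on every `step()` call. In that case `1 - iter / train_num_steps` goes negative, and in Python a negative float raised to a fractional power is a complex number, which `max` cannot compare with a float. Below is a minimal sketch that reproduces the same `TypeError` without any GPUs, plus a clamped variant. The values of `train_lr` and `min_lr` are placeholders I made up, and `lr_lambda_clamped` is a hypothetical name.

```python
# Standalone reproduction; no torch or accelerate needed.
train_num_steps = 150_000
train_lr = 1e-4  # placeholder value, not taken from the real config
min_lr = 1e-6    # placeholder value, not taken from the real config


def lr_lambda(it):
    # Same expression as line 165 of train_vae.py in the traceback.
    return max((1 - it / train_num_steps) ** 0.95, min_lr / train_lr)


def lr_lambda_clamped(it):
    # Hypothetical fix: cap progress at 1.0 so the base never goes negative.
    progress = min(it / train_num_steps, 1.0)
    return max((1.0 - progress) ** 0.95, min_lr / train_lr)


if __name__ == "__main__":
    print(lr_lambda(train_num_steps))  # still fine: base is exactly 0.0
    try:
        lr_lambda(train_num_steps + 1)  # base is negative here
    except TypeError as err:
        print("TypeError:", err)
    print(lr_lambda_clamped(train_num_steps + 1))  # falls back to the floor
```

If this is the cause, the clamp only hides the symptom: the learning rate would still decay 2x (or 4x) faster than intended, so the step count given to the schedule would probably also need to account for the number of processes.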