Hello!
I am trying to train on the train-clean-100 subset of LibriTTS.
I have resampled all of them to 22 kHz (example) and am reusing the filelists provided in the original repository. I was able to train successfully for about 15,000 iterations on a T4 GPU (g4dn.xlarge instance on AWS).
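For reference, the resampling was along these lines (a minimal sketch using librosa and soundfile; the paths are placeholders):

```python
import librosa
import soundfile as sf

def resample_wav(in_path, out_path, target_sr=22050):
    # librosa resamples to target_sr while loading
    audio, _ = librosa.load(in_path, sr=target_sr)
    sf.write(out_path, audio, target_sr)

# applied to every wav in train-clean-100 (placeholder paths)
resample_wav("path/to/input.wav", "path/to/output_22khz.wav")
```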
When I tried turning on mixed-precision training by setting fp16_run=True on the same instance, it runs for a few iterations and then hits gradient overflows. It keeps halving the loss scale down to ~1e-100 (at which point I stopped it). The loss is NaN rather than inf, which, according to an Apex GitHub issue, I should not be observing.
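In case it helps narrow this down, here is the kind of check I can add right after the backward pass to see which parameters first get non-finite gradients (a minimal sketch; `report_nonfinite_grads` is a hypothetical helper, and I am assuming the usual Apex `amp.scale_loss` pattern):

```python
import torch

def report_nonfinite_grads(model):
    # Print every parameter whose gradient contains NaN or inf
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite grad in: {name}")

# usage, once per iteration:
# with amp.scale_loss(loss, optimizer) as scaled_loss:
#     scaled_loss.backward()
# report_nonfinite_grads(model)
```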
I am wondering if anyone has an idea why this might be happening.

I am also wondering how many iterations the uploaded LibriTTS and LJS models were trained for - is that information the team could share?
Thank you in advance!