Gradient overflow with Mixed Precision Training  #63

@MinHyung-Kang

Description

Hello!
I am trying to train on the train-clean-100 subset of LibriTTS.
I have resampled all of the files to 22 kHz (example) and am reusing the filelists provided in the original repository. I was able to run training successfully for about 15,000 iterations on a T4 GPU (a g4dn.xlarge instance on AWS).
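For reference, the resampling step can be sketched as plain linear interpolation. This is a toy, stdlib-only illustration with placeholder names; in practice a tool such as sox or librosa would do this (LibriTTS ships at 24 kHz):

```python
def resample(samples, src_sr, dst_sr):
    """Linearly interpolate a list of samples from src_sr to dst_sr.

    Toy illustration only; real resamplers use proper low-pass
    filtering to avoid aliasing.
    """
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr          # fractional index into source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```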

When I turned on mixed precision training by setting fp16_run=True on the same instance, it runs for a few iterations and then hits gradient overflows. It keeps halving the loss scale down to ~1e-100 (at which point I stopped it). The loss is NaN rather than inf, which, according to an Apex GitHub issue, I should not be observing.
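For context, the overflow handling described above is roughly what Apex-style dynamic loss scaling does. This is a simplified sketch with illustrative names, not Apex's actual API; it also shows why the distinction between inf and NaN matters here:

```python
import math

class DynamicLossScaler:
    """Simplified sketch of dynamic loss scaling (illustrative names)."""

    def __init__(self, init_scale=2.0 ** 15, backoff_factor=2.0):
        self.scale = init_scale
        self.backoff_factor = backoff_factor

    def update(self, grad_norm):
        # An inf gradient norm signals an fp16 overflow: halve the
        # scale and skip the optimizer step. A NaN, by contrast,
        # usually means the forward pass itself produced NaNs (bad
        # input, log(0), ...), so no amount of down-scaling fixes it
        # and the scale keeps shrinking forever, matching the symptom
        # in this issue.
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            self.scale /= self.backoff_factor
            return False   # skip this step
        return True        # safe to apply the step
```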

Wondering if anyone has an idea why this might be happening.

I am also wondering how many iterations the uploaded LibriTTS and LJS models were trained for. Is that information the team could share?

Thank you in advance!
