Gradient overflow with Mixed Precision Training  #63

@MinHyung-Kang

Description

Hello!
I am trying to train on the train-clean-100 subset of LibriTTS.
I have resampled all of the files to 22 kHz (example) and am reusing the filelists provided in the original repository. I was able to run training successfully for about 15,000 iterations on a T4 GPU (a g4dn.xlarge instance on AWS).
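For reference, the resampling step can be sketched as plain linear interpolation. This is a toy, stdlib-only illustration with placeholder names; in practice a tool such as sox or librosa would do this (LibriTTS ships at 24 kHz):

```python
def resample(samples, src_sr, dst_sr):
    """Linearly interpolate a list of samples from src_sr to dst_sr.

    Toy illustration only; real resamplers use proper low-pass
    filtering to avoid aliasing.
    """
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr          # fractional index into source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```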

When I turned on mixed precision training by setting fp16_run=True on the same instance, it runs for a few iterations and then hits gradient overflows. It keeps halving the loss scale down to ~1e-100 (at which point I stopped it). The loss is NaN rather than inf, which, according to an Apex GitHub issue, I should not be observing.
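For context, the overflow handling described above is roughly what Apex-style dynamic loss scaling does. This is a simplified sketch with illustrative names, not Apex's actual API; it also shows why the distinction between inf and NaN matters here:

```python
import math

class DynamicLossScaler:
    """Simplified sketch of dynamic loss scaling (illustrative names)."""

    def __init__(self, init_scale=2.0 ** 15, backoff_factor=2.0):
        self.scale = init_scale
        self.backoff_factor = backoff_factor

    def update(self, grad_norm):
        # An inf gradient norm signals an fp16 overflow: halve the
        # scale and skip the optimizer step. A NaN, by contrast,
        # usually means the forward pass itself produced NaNs (bad
        # input, log(0), ...), so no amount of down-scaling fixes it
        # and the scale keeps shrinking forever, matching the symptom
        # in this issue.
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            self.scale /= self.backoff_factor
            return False   # skip this step
        return True        # safe to apply the step
```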

Wondering if anyone has an idea why this might be happening.

I am also wondering how many iterations the uploaded LibriTTS and LJS models were trained for. Is that information the team could share?

Thank you in advance!
