Add StyleTTS2 to EV #686

@roedoejet

Description & Motivation

We have done preliminary experiments with StyleTTS2 and would like to add it as a supported e2e model, because it seems to perform well with relatively small amounts of data and can also handle noisy recordings.

Pitch

This is a very large task that needs to include:

  • Turn the code into a library
  • Fix imports and style
  • Load HF models for the pre-trained BERT encoder
  • Integrate the auxiliary ASR module as well (or maybe replace it with Tim's method)
  • Refactor the model code to Lightning
  • Refactor the train and validation loops to Lightning
  • Refactor the optimizers to Lightning
  • Refactor the dataloader to Lightning
  • Store the default text and integrate a type of mapping method for new datasets
  • Parameterize everything according to our config strategy
  • Hook up with everyvoice synthesize (see "[StyleTTS2] feat: add demo and inference integration basics", #804)
  • Hook up with everyvoice demo (see "[StyleTTS2] feat: add demo and inference integration basics", #804)
  • Hook up with everyvoice inspect checkpoint (see "[StyleTTS2] feat: add demo and inference integration basics", #804)
  • The decoder config matches EveryVoice's 22.05 kHz default, but not StyleTTS2's 24 kHz default. I need to think more about what we want to do here.
  • I'm handling the out-of-distribution (OOD) data required by StyleTTS2 kind of sloppily, but improving this will require some thought
  • everyvoice preprocess is currently just using the fs2 preprocess. I kind of think this function should be refactored into the main ev repo
  • "save-top-k" checkpointing isn't working; it just saves a checkpoint every epoch (see "[StyleTTS2] Fix training checkpointing", #805)
  • only one audio sample seems to be getting logged to tensorboard (should be two or more) (see "[StyleTTS2] Fix training checkpointing", #805)
  • fix redundancies in the config (i.e. values that are defined in two places, both the ev config and the styletts2 config)
  • The text config is confusing in the StyleTTS2 config because we don't really use the punctuation that gets automatically defined there. We need to consider the symbol mapping requirements for styletts2 more carefully here.
  • everyvoice new-project is now very slow when exporting the yaml; speed it up (or at least show a spinner) @joanise
  • when doing everyvoice train text-to-wav config/everyvoice-text-to-wav.yaml on a config file created by EV<=0.4, four extra_forbidden validation errors are produced. These should be easy to catch so that we can output a message explaining the version mismatch and how to fix it.
  • update the readme in styletts2 to reflect our fork's information
  • we use a source-installed version of monotonic align, which will block us from publishing on PyPI @joanise
  • look for a way to remove the full path (which might embed the username in the model) from five fields in everyvoice-text-to-wav.yaml (partially fixed by "[StyleTTS2] load pretrained checkpoint from HuggingFace", #807)
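
For the extra_forbidden item in the list above, a minimal sketch of the check could look like the following. It assumes the errors arrive as a list of dicts shaped like the output of pydantic v2's `ValidationError.errors()` (with "type" and "loc" keys). The helper name `version_mismatch_hint` and the message wording are hypothetical, and the actual instructions on how to fix the config still need to be decided.

```python
from typing import Any, Dict, Iterable, List, Optional


def version_mismatch_hint(errors: Iterable[Dict[str, Any]]) -> Optional[str]:
    """Return a hint when any error has type "extra_forbidden", else None.

    Hypothetical helper: `errors` is assumed to look like the list returned
    by pydantic v2's ValidationError.errors().
    """
    # Collect the dotted location of every field the schema rejected as extra.
    extra_fields: List[str] = [
        ".".join(str(part) for part in err.get("loc", ()))
        for err in errors
        if err.get("type") == "extra_forbidden"
    ]
    if not extra_fields:
        return None
    # Placeholder wording; the real message should also explain the fix.
    return (
        "This config contains fields the current schema does not accept: "
        + ", ".join(extra_fields)
        + ". It may have been created by an older EveryVoice version (EV<=0.4)."
    )
```

The caller would presumably wrap config loading in a `try`/`except` for the validation error and print the hint (when it is not `None`) before re-raising or exiting, so that other kinds of validation errors are still reported as usual.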

Alternatives

Of the current (2025) SOTA models (VALL-E, VITS2, YourTTS), I think StyleTTS2 is the best and has the most reliable codebase, given that it was released by the authors (unlike VALL-E or VITS2, which are great, but are reproductions). F5 is good but seems to hallucinate more. There are other repos, but their codebases seem slightly less mature.

Additional context

No response

Metadata

Labels

enhancement (New feature or request)
