Add StyleTTS2 to EV

### Description & Motivation

We have run preliminary experiments with StyleTTS2 and would like to add it as a supported end-to-end (e2e) model. StyleTTS2 seems to perform well with relatively small amounts of data and also seems able to handle noisy recordings.

### Pitch

This is a very large task that needs to include:

- [x] Turn the code into a library
- [x] Fix imports and style
- [x] Load HF models for the pre-trained BERT encoder
- [x] Integrate the auxiliary ASR module as well (or maybe replace it with Tim's method)
- [x] Refactor the model code to Lightning
- [x] Refactor the train and validation loops to Lightning
- [x] Refactor the optimizers to Lightning
- [x] Refactor the dataloader to Lightning
- [x] Store the default text and integrate a type of mapping method for new datasets
- [x] Parameterize everything according to our config strategy
- [ ] Hook up with `everyvoice synthesize` https://github.com/EveryVoiceTTS/EveryVoice/pull/804
- [ ] Hook up with `everyvoice demo` https://github.com/EveryVoiceTTS/EveryVoice/pull/804
- [ ] Hook up with `everyvoice inspect checkpoint` https://github.com/EveryVoiceTTS/EveryVoice/pull/804
- [ ] The decoder config matches EveryVoice's 22.05 kHz default, but not StyleTTS2's 24 kHz default. I need to think more about what we want to do here.
- [ ] I'm handling the OOD data required by StyleTTS2 kind of sloppily, but improving this will require some thought.
- [ ] `everyvoice preprocess` is currently just using the fs2 preprocess. I kind of think this function should be refactored into the main EV repo.
- [ ] The "save-top-k" checkpointing isn't working; it's just saving a checkpoint every epoch https://github.com/EveryVoiceTTS/EveryVoice/pull/805
- [ ] Only one audio sample seems to be getting logged to TensorBoard (should be two or more) https://github.com/EveryVoiceTTS/EveryVoice/pull/805
- [ ] Fix redundancies in the config (i.e. values that are defined in two places, both the EV config and the StyleTTS2 config).
- [ ] The text config is confusing in the StyleTTS2 config because we don't really use the punctuation that gets automatically defined there. We need to consider the symbol mapping requirements for StyleTTS2 more carefully here.
- [ ] `everyvoice new-project` is now very slow when exporting the YAML; speed it up or at least show a spinner @joanise
- [ ] When running `everyvoice train text-to-wav config/everyvoice-text-to-wav.yaml` on a config file created by EV<=0.4, four `extra_forbidden` validation errors are produced. This should be easy to catch so that we output a message explaining the version mismatch and how to fix it.
- [ ] Update the README in StyleTTS2 to reflect our fork's information.
- [ ] It uses a source-installed version of monotonic align, which will block us from publishing on PyPI @joanise
- [ ] Look for a way to remove the full path (which might embed the username in the model) from five fields in `everyvoice-text-to-wav.yaml`; partially fixed by https://github.com/EveryVoiceTTS/EveryVoice/pull/807
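
For the config-version item in the list above, here is a rough sketch of how the `extra_forbidden` errors could be turned into a friendlier message. It assumes the errors arrive in the shape returned by pydantic v2's `ValidationError.errors()` (a list of dicts with `type` and `loc` keys). The function name `explain_extra_forbidden` and the message wording are hypothetical and not part of EveryVoice; the actual fix instructions would still need to be written.

```python
from typing import Any, Optional, Sequence


def explain_extra_forbidden(errors: Sequence[dict[str, Any]]) -> Optional[str]:
    """Return a version-mismatch hint if any error has type "extra_forbidden".

    `errors` is assumed to look like pydantic v2's `ValidationError.errors()`:
    a sequence of dicts with at least "type" and "loc" keys.
    Returns None when no such error is present.
    """
    # Collect the dotted location of every field the schema rejected as extra.
    extra_fields = [
        ".".join(str(part) for part in err.get("loc", ()))
        for err in errors
        if err.get("type") == "extra_forbidden"
    ]
    if not extra_fields:
        return None
    # Hypothetical wording; the real message should also say how to fix it.
    return (
        "The config contains fields that this version of EveryVoice does not "
        "accept: " + ", ".join(extra_fields) + ". "
        "It may have been created with an older EveryVoice version."
    )


if __name__ == "__main__":
    # Hand-written example errors; the field names are made up for illustration.
    example_errors = [
        {"type": "extra_forbidden", "loc": ("model", "some_old_field")},
        {"type": "missing", "loc": ("training",)},
    ]
    print(explain_extra_forbidden(example_errors))
```

In practice this would presumably be called with `exc.errors()` inside an `except pydantic.ValidationError as exc:` block around config loading, falling back to the normal error output when it returns `None`.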

### Alternatives

Of the current (2025) SOTA models (VALL-E, VITS2, YourTTS), I think StyleTTS2 is the best and has the most reliable codebase, given that it was released by the authors (unlike VALL-E or VITS2, which are great but are reproductions). F5 is good but seems to hallucinate more; there are more repos, but the codebases seem slightly less mature.

### Additional context

_No response_
