Description & Motivation
We have run preliminary experiments with StyleTTS2 and would like to add it as a supported end-to-end (e2e) model. This is because StyleTTS2 seems to perform well with relatively small amounts of data and can also handle noisy recordings.
Pitch
This is a very large task that needs to include:
- everyvoice synthesize ([StyleTTS2] feat: add demo and inference integration basics #804)
- everyvoice demo ([StyleTTS2] feat: add demo and inference integration basics #804)
- everyvoice inspect checkpoint ([StyleTTS2] feat: add demo and inference integration basics #804)
Alternatives
Of the current (2025) SOTA models (VALL-E, VITS2, YourTTS), I think StyleTTS2 is the best and has the most reliable code base, given that it was released by the authors (unlike VALL-E or VITS2, which are great but are reproductions). F5 is good but seems to hallucinate more. There are more repos, but the code bases seem slightly less mature.
Additional context
No response