Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12
Description:
When the input text contains repeated identical words (tested with "...oh" and "oh oh oh"), the model enters a degenerate state and produces noise or garbled audio instead of speech. Once triggered, subsequent synthesis requests in the same session also produce degraded output until the service is restarted.
Steps to reproduce:
- Start VoXtream server with a reference WAV
- Send several normal synthesis requests (these work fine)
- Send text containing "oh oh oh" or "...oh"
- Observe: output is noise/garbled audio, not speech
- Subsequent requests also produce degraded audio
Expected behavior:
Model should synthesize the repeated words as speech, even if the prosody is unusual.
Workaround:
Pre-filter the input text to collapse any run of 3 or more consecutive identical words down to 2 before sending it to the model.
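A minimal sketch of such a pre-filter, in case it helps anyone else hitting this. It is not part of VoXtream: the function name `collapse_repeats` is made up for illustration, and it assumes "repeated" means the same word appearing consecutively, separated only by whitespace and compared case-insensitively. It covers only the "oh oh oh" case, not the "...oh" case.

```python
import re

# A word followed by two or more whitespace-separated repeats of itself,
# i.e. a run of three or more identical words (assumption: case-insensitive).
_REPEAT_RUN = re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE)


def collapse_repeats(text: str) -> str:
    """Collapse runs of 3+ identical consecutive words down to 2.

    Hypothetical helper for illustration; call it on the text before
    passing it to the synthesis request.
    """
    # Replace the whole run with two copies of the first word in the run.
    return _REPEAT_RUN.sub(r"\1 \1", text)


if __name__ == "__main__":
    print(collapse_repeats("oh oh oh, that hurts"))
```

Runs of exactly two words are left alone, since those did not trigger the problem in my testing.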