Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12
Description:
When synthesis input exceeds approximately 500 characters, the model begins repeating the last sentence of the output multiple times and audio quality degrades progressively. The documentation suggests a 1000-character limit, but in practice the model becomes unreliable well before that.
Steps to reproduce:
- Send a synthesis request with 500-800 characters of text
- Observe: the model repeats the final sentence multiple times. This is intermittent: it does not happen on every request, but it shows up inconsistently on longer inputs.
- Audio quality degrades toward the end of the output (pacing issues, garbled speech)
Expected behavior:
Model should synthesize the full text without looping or repeating sentences.
Workaround:
Split the input text into chunks of ~250 characters before sending them to the model. At ~250 characters we saw an improvement, but pacing issues still occur even at these shorter lengths.
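For reference, a minimal sketch of the kind of chunking described in the workaround, assuming that splitting on sentence boundaries is acceptable. The names `split_into_chunks` and `_split_words` are hypothetical helpers written for this report, not part of VoXtream, and the synthesis call itself is left out because it depends on how the model is invoked.

```python
import re


def _split_words(sentence: str, max_chars: int) -> list[str]:
    """Break one over-long sentence into pieces at word boundaries."""
    pieces: list[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        # A single word longer than max_chars is kept whole (the piece will exceed the limit).
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_into_chunks(text: str, max_chars: int = 250) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for the workaround above; 250 is the value from this report.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be sent as a separate synthesis request and the resulting audio concatenated. Keeping sentences whole is a design choice meant to avoid cutting mid-phrase, which may otherwise add its own pacing artifacts at chunk boundaries.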