Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12
Description:
The model applies inconsistent pacing and emphasis during synthesis. Some words receive dramatic emphasis that doesn't match the sentence context. The speaking rate also varies unpredictably, sometimes speeding up mid-sentence and sometimes slowing down. I have not confirmed the cause; it may be that the base model's training data influences the output cadence.
Steps to reproduce:
Send several synthesis requests with conversational text. Observe that emphasis placement and speaking speed vary unpredictably between and within utterances.
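To put rough numbers on the rate variation, the sketch below computes words per second for each utterance and the spread across utterances. It assumes the synthesized audio has already been saved as PCM WAV files; it does not call VoXtream itself. The helper names (`wav_duration`, `speaking_rate`, `rate_spread`) are made up for illustration. Words per second is only a coarse proxy and will not capture emphasis or mid-sentence tempo changes.

```python
import statistics
import wave


def wav_duration(path: str) -> float:
    """Duration in seconds of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def speaking_rate(text: str, duration_s: float) -> float:
    """Words per second for one utterance, splitting on whitespace."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(text.split()) / duration_s


def rate_spread(samples):
    """Return (mean rate, coefficient of variation) over (text, duration_s) pairs."""
    rates = [speaking_rate(text, dur) for text, dur in samples]
    mean = statistics.mean(rates)
    # Population standard deviation relative to the mean; 0.0 means a constant rate.
    return mean, statistics.pstdev(rates) / mean
```

Running the same sentence through synthesis several times and passing the (text, duration) pairs to `rate_spread` gives a single figure to compare between runs or model versions; a noticeably higher coefficient of variation would back up the "varies between utterances" observation.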
Expected behavior:
Pacing and emphasis should be relatively consistent and contextually appropriate.