Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12
Description:
The model occasionally substitutes entirely unrelated words in the output. For example, the word "small" in the input text was spoken as "fish" in the audio output. These are not mispronunciations — they are completely different words with no phonetic similarity.
Steps to reproduce:
Intermittent — no reliable reproduction steps identified yet. Occurs during normal conversational synthesis with a voice-cloned reference WAV.
Expected behavior:
The spoken output should match the input text.
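Since the issue is intermittent, one way to catch occurrences is to transcribe each generated clip with any speech recognizer and diff the transcript against the input text. The sketch below covers only the comparison step, using Python's standard `difflib`. The function names `normalize` and `word_substitutions` are hypothetical helpers and not part of VoXtream, and the transcript is assumed to come from whatever ASR tool is available.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_substitutions(input_text: str, transcript: str) -> list[tuple[str, str]]:
    """Return (expected, heard) pairs for word spans that were replaced.

    `transcript` is assumed to be an ASR transcription of the synthesized
    audio; producing it is outside the scope of this sketch.
    """
    expected = normalize(input_text)
    heard = normalize(transcript)
    matcher = difflib.SequenceMatcher(a=expected, b=heard, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a span present in both sequences but with different words.
        if tag == "replace":
            pairs.append((" ".join(expected[i1:i2]), " ".join(heard[j1:j2])))
    return pairs


if __name__ == "__main__":
    print(word_substitutions("The boat was small.", "The boat was fish."))
```

Logging the input text, reference WAV, and any reported pairs for each run could help narrow down a reliable reproduction. ASR errors will also show up as substitutions, so flagged clips still need a manual listen.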