Repeated phonemes cause degenerate output (noise/looping)

**Environment:** VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12

**Description:**
When the input text contains repeated identical words (tested with "...oh" and "oh oh oh"), the model enters a degenerate state and produces noise or garbled audio instead of speech. Once triggered, subsequent synthesis requests in the same session also produce degraded output until the service is restarted.

**Steps to reproduce:**

1. Start VoXtream server with a reference WAV
2. Send several normal synthesis requests (these work fine)
3. Send text containing "oh oh oh" or "...oh"
4. Observe: output is noise/garbled audio, not speech
5. Subsequent requests also produce degraded audio

**Expected behavior:**
The model should synthesize the repeated words as speech, even if the prosody is unusual.

**Workaround:**
Pre-filter input text to collapse 3+ repeated identical words down to 2 before sending to the model.
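
A minimal sketch of that pre-filter, in case it helps others hitting the same problem. This is not part of VoXtream; `collapse_repeats` is a hypothetical helper name, and it assumes the repeated words are separated only by whitespace and should be matched case-insensitively. It does not touch the "...oh" input, which contains no run of three words.

```python
import re

# A word, one whitespace-separated repeat of it, then one or more further repeats.
# Only runs of 3+ identical words match; the first two words are kept as written.
_REPEAT_RUN = re.compile(r"\b(\w+)(\s+\1\b)(?:\s+\1\b)+", re.IGNORECASE)


def collapse_repeats(text: str) -> str:
    """Collapse runs of 3 or more identical words down to 2 (hypothetical helper)."""
    return _REPEAT_RUN.sub(r"\1\2", text)


if __name__ == "__main__":
    for sample in ["oh oh oh", "well oh oh oh oh dear", "oh oh", "...oh"]:
        print(repr(sample), "->", repr(collapse_repeats(sample)))
```

Repeats separated by punctuation (for example "oh, oh, oh") are left unchanged by this pattern, so the separator class would need widening if such input also triggers the problem.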

Repeated phonemes cause degenerate output (noise/looping) #12

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Repeated phonemes cause degenerate output (noise/looping) #12

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions