Repeated phonemes cause degenerate output (noise/looping) #12

@DragonbornElric

Environment: VoXtream v0.2.0, herimor/voxtream2 model, RTX 3060 12GB, CUDA 12.8, Ubuntu Linux, Python 3.12

Description:
When the input text contains repeated identical words (tested with "...oh" and "oh oh oh"), the model enters a degenerate state and produces noise or garbled audio instead of speech. Once triggered, subsequent synthesis requests in the same session also produce degraded output until the service is restarted.

Steps to reproduce:

  1. Start VoXtream server with a reference WAV
  2. Send several normal synthesis requests (these work fine)
  3. Send text containing "oh oh oh" or "...oh"
  4. Observe: output is noise/garbled audio, not speech
  5. Subsequent requests also produce degraded audio

Expected behavior:
The model should synthesize the repeated words as speech, even if the prosody is unusual.

Workaround:
Pre-filter input text to collapse 3+ repeated identical words down to 2 before sending to the model.
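
For reference, here is a minimal sketch of that pre-filter. It is only an illustration of the workaround, not anything shipped with VoXtream: `collapse_repeats` is a hypothetical helper name, and the choices to treat a "word" as a run of `\w` characters, to compare case-insensitively, and to allow punctuation or whitespace between repeats are assumptions.

```python
import re

# Hypothetical pre-filter; not part of VoXtream.
# Matches a word, one repeat of it, and then one or more further repeats.
# Repeats may be separated by whitespace and/or punctuation (assumption).
_REPEAT_RE = re.compile(r"\b(\w+)(\W+\1\b)(?:\W+\1\b)+", flags=re.IGNORECASE)


def collapse_repeats(text: str) -> str:
    """Collapse runs of 3+ identical words down to 2."""
    # Keep the first word and its first repeat; drop the remaining repeats.
    return _REPEAT_RE.sub(r"\1\2", text)


if __name__ == "__main__":
    print(collapse_repeats("oh oh oh, that hurts"))
```

The filter would run on the text just before it is sent to the server. Inputs with fewer than three consecutive repeats, including "...oh", pass through unchanged, so this sketch only covers the "oh oh oh" case.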
